Advanced Part 23: Deep Dive into Search Technology: Hands-On Index-Time Search Suggestions with the ngram Tokenization Mechanism

1. How ngram and index-time search suggestions work

The idea is to build ngram tokens at index time, and then simply look them up at search time.

What is an ngram?

 

Take the word quick. Its ngrams at the 5 possible lengths are:

ngram length=1: q   u   i   c   k
ngram length=2: qu  ui  ic  ck
ngram length=3: qui  uic  ick
ngram length=4: quic  uick
ngram length=5: quick
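
To see a couple of these rows for yourself, the _analyze API accepts an inline ngram token filter. A minimal sketch (min_gram 2 and max_gram 3 are just example values, chosen to reproduce the 2- and 3-character rows above):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "ngram", "min_gram": 2, "max_gram": 3 }
  ],
  "text": "quick"
}

The response should contain the grams qu, ui, ic, ck, qui, uic, ick (the exact emission order can differ between Elasticsearch versions).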

 

What is an edge ngram?

 

For quick, the ngrams are anchored at the first letter, so only the prefixes are kept:

 

q

qu

qui

quic

quick

 

With edge ngrams, every word is further split into its prefix tokens, and those prefix tokens are what implement the prefix-based search suggestion feature.
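
The same kind of quick check works for edge ngrams; a minimal sketch with an inline edge_ngram filter (min_gram 1 and max_gram 5 chosen to cover the whole word quick):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "edge_ngram", "min_gram": 1, "max_gram": 5 }
  ],
  "text": "quick"
}

The response should contain exactly the five prefixes q, qu, qui, quic, quick.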

 

doc1   hello world
doc2   hello we

With edge ngrams on the title words, the inverted index looks roughly like this:

term      doc1    doc2
h          *       *
he         *       *
hel        *       *
hell       *       *
hello      *       *
w          *       *
wo         *
wor        *
worl       *
world      *
we                 *
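
If you want to verify a postings table like this against a live index, the _termvectors API returns the terms actually indexed for a document's field. A minimal sketch, assuming the my_index / my_type setup built in section 2 below, with the "hello world" document indexed as id 1:

GET /my_index/my_type/1/_termvectors?fields=title

The response lists every indexed term of the title field (h, he, hel, ..., world) along with positions and offsets; the term vectors are computed on the fly here since the mapping does not store them explicitly.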

 

min_gram and max_gram control the shortest and longest grams that get generated. For example, with

min gram = 1
max gram = 3

the word hello only produces:

h
he
hel

(and world likewise only produces w, wo, wor).

When you search for "hello w", each query term is looked up directly: hello hits doc1 and doc2 and the lookup stops there; w is looked up the same way, with no further scanning.

hello w

hello --> term hello, found in doc1 and doc2
w     --> term w, found in doc1 (and doc2)

For doc1, both hello and w are present and their positions line up, so doc1 is returned: hello world.

In other words, at search time you no longer take a prefix and scan the whole inverted index for terms starting with it. You simply take the prefix, look it up as a term in the inverted index, and if it matches you are done. Prefix suggestion becomes an ordinary match, i.e. plain full-text search.
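
For contrast, the purely query-time alternative is something like match_phrase_prefix, which has to expand the trailing prefix against the term dictionary on every search instead of relying on pre-built edge ngram terms. A sketch for comparison only (the title field is the one set up in section 2 below):

GET /my_index/my_type/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": "hello w"
    }
  }
}

This needs no special index-time setup, but the prefix expansion cost is paid on every query, which is exactly the work the edge ngram approach shifts to index time.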

 

2. Trying out ngram

First delete the old index:

DELETE my_index

Result:

{
  "acknowledged": true
}

Create my_index again:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type":     "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type":      "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}

Result:

{
  "acknowledged": true,
  "shards_acknowledged": true
}

 

Check the analyzer:

GET /my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "quick brown"
}

Result:

{
  "tokens": [
    { "token": "q",     "start_offset": 0, "end_offset": 5,  "type": "word", "position": 0 },
    { "token": "qu",    "start_offset": 0, "end_offset": 5,  "type": "word", "position": 0 },
    { "token": "qui",   "start_offset": 0, "end_offset": 5,  "type": "word", "position": 0 },
    { "token": "quic",  "start_offset": 0, "end_offset": 5,  "type": "word", "position": 0 },
    { "token": "quick", "start_offset": 0, "end_offset": 5,  "type": "word", "position": 0 },
    { "token": "b",     "start_offset": 6, "end_offset": 11, "type": "word", "position": 1 },
    { "token": "br",    "start_offset": 6, "end_offset": 11, "type": "word", "position": 1 },
    { "token": "bro",   "start_offset": 6, "end_offset": 11, "type": "word", "position": 1 },
    { "token": "brow",  "start_offset": 6, "end_offset": 11, "type": "word", "position": 1 },
    { "token": "brown", "start_offset": 6, "end_offset": 11, "type": "word", "position": 1 }
  ]
}

 

Create the mapping:

PUT /my_index/_mapping/my_type
{
  "properties": {
    "title": {
      "type":            "string",
      "analyzer":        "autocomplete",
      "search_analyzer": "standard"
    }
  }
}

Note that search_analyzer is set to the plain standard analyzer: a query such as "hello w" should simply be split into hello + w at search time. There is no need to run ngram/edge_ngram analysis on the query string itself; that would only slow searches down. (On Elasticsearch 5.x and later the field type would be text rather than the older string type used here.)

Result:

{
  "acknowledged": true
}

For example, when a document with the title "hello world" is indexed, the autocomplete analyzer produces the terms:

h
he
hel
hell
hello

w
wo
wor
worl
world

Now suppose you search for:

hello w

If the query string were also run through the autocomplete analyzer, it would be expanded into:

h
he
hel
hell
hello

w

which is unnecessary and wasteful. With the standard search analyzer, the query is simply split into its terms:

hello w --> hello --> w
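
You can check both sides of this with the _analyze API; a small sketch, assuming the index and mapping above. Passing field makes _analyze use the field's index-time analyzer:

GET /my_index/_analyze
{
  "field": "title",
  "text": "hello world"
}

And the search-time view of the query string:

GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "hello w"
}

The first call should return the edge ngram terms listed above; the second should return just hello and w.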

Add some test data:

PUT /my_index/my_type/1
{
  "title": "hello world"
}

PUT /my_index/my_type/2
{
  "title": "hello we"
}

PUT /my_index/my_type/3
{
  "title": "hello win"
}

PUT /my_index/my_type/4
{
  "title": "hello dog"
}

 

 

 

Tests

Test 1: match_phrase

GET /my_index/my_type/_search
{
  "query": {
    "match_phrase": {
      "title": "hello w"
    }
  }
}

Result:

{
  "took": 20,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 3,
    "max_score": 1.1983768,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1.1983768,
        "_source": { "title": "hello we" }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.8271048,
        "_source": { "title": "hello world" }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 0.797104,
        "_source": { "title": "hello win" }
      }
    ]
  }
}

 

Test 2: match

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "title": "hello w"
    }
  }
}

Result:

{
  "took": 7,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 4,
    "max_score": 1.1983768,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1.1983768,
        "_source": { "title": "hello we" }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.8271048,
        "_source": { "title": "hello world" }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 0.797104,
        "_source": { "title": "hello win" }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "4",
        "_score": 0.2495691,
        "_source": { "title": "hello dog" }
      }
    ]
  }
}

 

With match, documents that only contain hello (such as "hello dog") come back as well; it is plain full-text search, and they simply score lower.

match_phrase is the better fit here: it requires every term to be present and their positions to be exactly one apart, which is what we expect from a prefix suggestion.
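
If you want every term to be required but do not need strict position adjacency, one middle-ground option (not part of the original lesson, just a common variant) is match with the and operator:

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "title": {
        "query":    "hello w",
        "operator": "and"
      }
    }
  }
}

Against the test data above this would drop "hello dog" (no term starting with w) but keep "hello world", "hello we", and "hello win", without checking term positions.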

 

 

 

 

 


Reposted from blog.csdn.net/qq_35524586/article/details/88431728