Advanced Part 23: Deep Dive into Search Technology: Hands-On Index-Time Search Suggestions with the ngram Tokenization Mechanism

1. How ngram and index-time search suggestions work

The idea is to build ngram tokens at index time, and then simply look them up at search time.

What is an ngram?

 

Take the word quick. Its ngrams at the 5 possible lengths are:

ngram length=1: q   u   i   c   k
ngram length=2: qu  ui  ic  ck
ngram length=3: qui  uic  ick
ngram length=4: quic  uick
ngram length=5: quick
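
To see a couple of these rows for yourself, the _analyze API accepts an inline ngram token filter. A minimal sketch (min_gram 2 and max_gram 3 are just example values, chosen to reproduce the 2- and 3-character rows above):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "ngram", "min_gram": 2, "max_gram": 3 }
  ],
  "text": "quick"
}

The response should contain the grams qu, ui, ic, ck, qui, uic, ick (the exact emission order can differ between Elasticsearch versions).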

 

What is an edge ngram?

 

For quick, the ngrams are anchored at the first letter, so only the prefixes are kept:

 

q

qu

qui

quic

quick

 

With edge ngrams, every word is further split into its prefix tokens, and those prefix tokens are what implement the prefix-based search suggestion feature.
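
The same kind of quick check works for edge ngrams; a minimal sketch with an inline edge_ngram filter (min_gram 1 and max_gram 5 chosen to cover the whole word quick):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "edge_ngram", "min_gram": 1, "max_gram": 5 }
  ],
  "text": "quick"
}

The response should contain exactly the five prefixes q, qu, qui, quic, quick.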

 

doc1   hello world
doc2   hello we

With edge ngrams on the title words, the inverted index looks roughly like this:

term      doc1    doc2
h          *       *
he         *       *
hel        *       *
hell       *       *
hello      *       *
w          *       *
wo         *
wor        *
worl       *
world      *
we                 *
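
If you want to verify a postings table like this against a live index, the _termvectors API returns the terms actually indexed for a document's field. A minimal sketch, assuming the my_index / my_type setup built in section 2 below, with the "hello world" document indexed as id 1:

GET /my_index/my_type/1/_termvectors?fields=title

The response lists every indexed term of the title field (h, he, hel, ..., world) along with positions and offsets; the term vectors are computed on the fly here since the mapping does not store them explicitly.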

 

min_gram and max_gram control the shortest and longest grams that get generated. For example, with

min gram = 1
max gram = 3

the word hello only produces:

h
he
hel

(and world likewise only produces w, wo, wor).

When you search for "hello w", each query term is looked up directly: hello hits doc1 and doc2 and the lookup stops there; w is looked up the same way, with no further scanning.

hello w

hello --> term hello, found in doc1 and doc2
w     --> term w, found in doc1 (and doc2)

For doc1, both hello and w are present and their positions line up, so doc1 is returned: hello world.

In other words, at search time you no longer take a prefix and scan the whole inverted index for terms starting with it. You simply take the prefix, look it up as a term in the inverted index, and if it matches you are done. Prefix suggestion becomes an ordinary match, i.e. plain full-text search.
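
For contrast, the purely query-time alternative is something like match_phrase_prefix, which has to expand the trailing prefix against the term dictionary on every search instead of relying on pre-built edge ngram terms. A sketch for comparison only (the title field is the one set up in section 2 below):

GET /my_index/my_type/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": "hello w"
    }
  }
}

This needs no special index-time setup, but the prefix expansion cost is paid on every query, which is exactly the work the edge ngram approach shifts to index time.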

 

2. Trying out ngram

First delete the old index:

DELETE my_index

Result:

{
  "acknowledged": true
}

Create my_index again:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type":     "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type":      "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}

Result:

{
  "acknowledged": true,
  "shards_acknowledged": true
}

 

Check the analyzer:

GET /my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "quick brown"
}

Result:

{
  "tokens": [
    { "token": "q",     "start_offset": 0, "end_offset": 5,  "type": "word", "position": 0 },
    { "token": "qu",    "start_offset": 0, "end_offset": 5,  "type": "word", "position": 0 },
    { "token": "qui",   "start_offset": 0, "end_offset": 5,  "type": "word", "position": 0 },
    { "token": "quic",  "start_offset": 0, "end_offset": 5,  "type": "word", "position": 0 },
    { "token": "quick", "start_offset": 0, "end_offset": 5,  "type": "word", "position": 0 },
    { "token": "b",     "start_offset": 6, "end_offset": 11, "type": "word", "position": 1 },
    { "token": "br",    "start_offset": 6, "end_offset": 11, "type": "word", "position": 1 },
    { "token": "bro",   "start_offset": 6, "end_offset": 11, "type": "word", "position": 1 },
    { "token": "brow",  "start_offset": 6, "end_offset": 11, "type": "word", "position": 1 },
    { "token": "brown", "start_offset": 6, "end_offset": 11, "type": "word", "position": 1 }
  ]
}

 

Create the mapping:

PUT /my_index/_mapping/my_type
{
  "properties": {
    "title": {
      "type":            "string",
      "analyzer":        "autocomplete",
      "search_analyzer": "standard"
    }
  }
}

Note that search_analyzer is set to the plain standard analyzer: a query such as "hello w" should simply be split into hello + w at search time. There is no need to run ngram/edge_ngram analysis on the query string itself; that would only slow searches down. (On Elasticsearch 5.x and later the field type would be text rather than the older string type used here.)

Result:

{
  "acknowledged": true
}

For example, when a document with the title "hello world" is indexed, the autocomplete analyzer produces the terms:

h
he
hel
hell
hello

w
wo
wor
worl
world

Now suppose you search for:

hello w

If the query string were also run through the autocomplete analyzer, it would be expanded into:

h
he
hel
hell
hello

w

which is unnecessary and wasteful. With the standard search analyzer, the query is simply split into its terms:

hello w --> hello --> w
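
You can check both sides of this with the _analyze API; a small sketch, assuming the index and mapping above. Passing field makes _analyze use the field's index-time analyzer:

GET /my_index/_analyze
{
  "field": "title",
  "text": "hello world"
}

And the search-time view of the query string:

GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "hello w"
}

The first call should return the edge ngram terms listed above; the second should return just hello and w.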

Add some test data:

PUT /my_index/my_type/1
{
  "title": "hello world"
}

PUT /my_index/my_type/2
{
  "title": "hello we"
}

PUT /my_index/my_type/3
{
  "title": "hello win"
}

PUT /my_index/my_type/4
{
  "title": "hello dog"
}

 

 

 

Tests

Test 1: match_phrase

GET /my_index/my_type/_search
{
  "query": {
    "match_phrase": {
      "title": "hello w"
    }
  }
}

Result:

{
  "took": 20,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 3,
    "max_score": 1.1983768,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1.1983768,
        "_source": { "title": "hello we" }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.8271048,
        "_source": { "title": "hello world" }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 0.797104,
        "_source": { "title": "hello win" }
      }
    ]
  }
}

 

Test 2: match

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "title": "hello w"
    }
  }
}

Result:

{
  "took": 7,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 4,
    "max_score": 1.1983768,
    "hits": [
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "2",
        "_score": 1.1983768,
        "_source": { "title": "hello we" }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "1",
        "_score": 0.8271048,
        "_source": { "title": "hello world" }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "3",
        "_score": 0.797104,
        "_source": { "title": "hello win" }
      },
      {
        "_index": "my_index",
        "_type": "my_type",
        "_id": "4",
        "_score": 0.2495691,
        "_source": { "title": "hello dog" }
      }
    ]
  }
}

 

With match, documents that only contain hello (such as "hello dog") come back as well; it is plain full-text search, and they simply score lower.

match_phrase is the better fit here: it requires every term to be present and their positions to be exactly one apart, which is what we expect from a prefix suggestion.
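
If you want every term to be required but do not need strict position adjacency, one middle-ground option (not part of the original lesson, just a common variant) is match with the and operator:

GET /my_index/my_type/_search
{
  "query": {
    "match": {
      "title": {
        "query":    "hello w",
        "operator": "and"
      }
    }
  }
}

Against the test data above this would drop "hello dog" (no term starting with w) but keep "hello world", "hello we", and "hello win", without checking term positions.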

 

 

 

 

 


Reposted from blog.csdn.net/qq_35524586/article/details/88431728