Elasticsearch built-in analyzers and custom analyzers

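I. Built-in analyzers:
(1) standard analyzer
The standard analyzer is the default analyzer for full-text fields.
It is made up of the following components:
standard tokenizer, which splits the input text on word boundaries.
standard token filter, which was designed to tidy up the tokens emitted by the tokenizer (but currently does nothing; it was removed in Elasticsearch 7.0).
lowercase token filter, which converts all tokens to lowercase.
stop token filter, which removes stopwords, that is, common words that add little to search relevance, such as a, the, and, is. In the standard analyzer this filter is disabled by default.
(2) keyword analyzer, which emits the entire input as a single token.
(3) whitespace analyzer, which splits the input on whitespace only.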
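
To make the difference between the three analyzers concrete, here is a minimal Python sketch. It is not Elasticsearch code: the function names are hypothetical, and the regular expression \w+ is only a rough stand-in for the word segmentation the real standard tokenizer performs.

```python
import re

def standard_analyze(text, stopwords=()):
    """Approximate the standard analyzer: tokenize, lowercase, optional stop filter."""
    tokens = re.findall(r"\w+", text)          # rough stand-in for the standard tokenizer
    tokens = [t.lower() for t in tokens]       # lowercase token filter
    return [t for t in tokens if t not in stopwords]  # stop filter (empty by default)

def keyword_analyze(text):
    """Approximate the keyword analyzer: the whole input is one token."""
    return [text]

def whitespace_analyze(text):
    """Approximate the whitespace analyzer: split on whitespace, change nothing else."""
    return text.split()

sample = "The QUICK brown-fox"
print(standard_analyze(sample))                     # ['the', 'quick', 'brown', 'fox']
print(standard_analyze(sample, stopwords={"the"}))  # ['quick', 'brown', 'fox']
print(keyword_analyze(sample))                      # ['The QUICK brown-fox']
print(whitespace_analyze(sample))                   # ['The', 'QUICK', 'brown-fox']
```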

1. Built-in character filters:
(1) The html_strip character filter removes all HTML tags and converts HTML entities into the corresponding Unicode characters, for example &Aacute; into Á.
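
The effect of html_strip can be sketched in a few lines of Python. This is a simplified illustration, not the filter's actual implementation: the helper name is hypothetical, and the real filter handles more cases (for instance, it inserts line breaks for some block-level tags).

```python
import html
import re

def html_strip(text):
    """Hypothetical helper: drop anything that looks like a tag, then decode entities."""
    without_tags = re.sub(r"<[^>]+>", "", text)
    return html.unescape(without_tags)

print(html_strip("<p>I&apos;m so <b>happy</b>, &Aacute;lvaro!</p>"))
# I'm so happy, Álvaro!
```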

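2. Built-in tokenizers:
(1) [keyword tokenizer] outputs exactly the string it receives, without any tokenization.
(2) [whitespace tokenizer] splits text on whitespace only.
(3) [pattern tokenizer] splits text wherever a regular expression matches.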
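
The three tokenizers can be imitated in Python to see how their output differs on the same input. The function names below are hypothetical, and the default separator pattern \W+ is an assumption made for this sketch.

```python
import re

def keyword_tokenizer(text):
    """Emit the input unchanged as a single token."""
    return [text]

def whitespace_tokenizer(text):
    """Split on runs of whitespace only; punctuation stays attached."""
    return text.split()

def pattern_tokenizer(text, pattern=r"\W+"):
    """Treat every regex match as a separator and drop empty pieces."""
    return [piece for piece in re.split(pattern, text) if piece]

text = "red,green; blue sky"
print(keyword_tokenizer(text))       # ['red,green; blue sky']
print(whitespace_tokenizer(text))    # ['red,green;', 'blue', 'sky']
print(pattern_tokenizer(text))       # ['red', 'green', 'blue', 'sky']
print(pattern_tokenizer(text, ","))  # ['red', 'green; blue sky']
```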

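3. Built-in token filters:
(1) [lowercase token filter] converts tokens to lowercase.
(2) [stop token filter] removes stopwords.
(3) [stemmer token filter] reduces words to their root form.
(4) [asciifolding token filter] removes diacritics, for example turning très into tres.
(5) [ngram] and [edge_ngram] produce tokens suited to partial matching or autocomplete.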
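
A rough Python sketch of what some of these filters do to a token list follows. The function names are hypothetical and the logic is deliberately simplified; stemming is left out because it needs a language-specific algorithm.

```python
import unicodedata

def lowercase(tokens):
    return [t.lower() for t in tokens]

def stop(tokens, stopwords):
    return [t for t in tokens if t not in stopwords]

def ascii_fold(tokens):
    """Strip diacritics by decomposing characters and dropping combining marks."""
    folded = []
    for t in tokens:
        decomposed = unicodedata.normalize("NFKD", t)
        folded.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return folded

def edge_ngram(tokens, min_gram=1, max_gram=3):
    """Emit the prefixes of each token, from min_gram to max_gram characters."""
    grams = []
    for t in tokens:
        for n in range(min_gram, min(max_gram, len(t)) + 1):
            grams.append(t[:n])
    return grams

print(ascii_fold(["très", "café"]))                     # ['tres', 'cafe']
print(stop(lowercase(["The", "Fox"]), {"the"}))         # ['fox']
print(edge_ngram(["quick"], min_gram=1, max_gram=3))    # ['q', 'qu', 'qui']
```

Prefix tokens such as q, qu, qui are what make edge_ngram useful for autocomplete: a partially typed word can match an indexed prefix exactly.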

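II. Creating a custom analyzer
(Character filters (char_filter), tokenizers (tokenizer), and token filters (filter) can be configured under the analysis field):
An analyzer combines three components that run in order: character filters, a tokenizer, and token filters.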
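
The ordering matters, and a small Python sketch can show why. The analyze function and its components below are hypothetical stand-ins, not Elasticsearch APIs; the point is only that character filters see the raw string before the tokenizer discards punctuation.

```python
import re

def analyze(text, char_filters, tokenizer, token_filters):
    """Run the three stages in order: char filters, then tokenizer, then token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

def amp_to_and(text):
    return text.replace("&", " and ")

def word_tokenizer(text):
    return re.findall(r"\w+", text)  # rough stand-in for a word tokenizer

def lowercase(tokens):
    return [t.lower() for t in tokens]

# With the character filter, "&" survives as the word "and".
print(analyze("Salt & Pepper", [amp_to_and], word_tokenizer, [lowercase]))
# ['salt', 'and', 'pepper']

# Without it, the tokenizer drops "&" before any token filter could see it.
print(analyze("Salt & Pepper", [], word_tokenizer, [lowercase]))
# ['salt', 'pepper']
```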

--Syntax for creating a custom analyzer
PUT /testindex
{
    "settings": {
        "analysis": {
            "char_filter": { ... custom character filters ... },
            "tokenizer":   { ...    custom tokenizers     ... },
            "filter":      { ...   custom token filters   ... },
            "analyzer":    { ...    custom analyzers      ... }
        }
    }
}

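--demo1: create a new analyzer named es_std that uses the predefined Spanish stopword list:
(Note: the es_std analyzer is not global; it exists only in the testindex index where we define it.)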

PUT /testindex
{
    "settings": {
        "analysis": {
            "analyzer": {
                "es_std": {
                    "type":      "standard",
                    "stopwords": "_spanish_"
                }
            }
        }
    }
}

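--demo2: create a custom analyzer
(Note: if testindex already exists from demo1, delete it first, because these settings are supplied when the index is created.)
It does the following:
Strips all HTML tags with the html_strip character filter.
Replaces & with and, using a custom mapping character filter.
Splits the text into words with the standard tokenizer.
Lowercases the terms with the lowercase token filter.
Removes a custom list of stopwords with the stop token filter.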


PUT /testindex
{
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type":       "mapping",
                    "mappings": [ "&=> and "]
            }},
            "filter": {
                "my_stopwords": {
                    "type":       "stop",
                    "stopwords": [ "the", "a" ]
            }},
            "analyzer": {
                "my_analyzer": {
                    "type":         "custom",
                    "char_filter":  [ "html_strip", "&_to_and" ],
                    "tokenizer":    "standard",
                    "filter":       [ "lowercase", "my_stopwords" ]
            }}
}}}
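
In the request above, my_analyzer refers to &_to_and and my_stopwords by name, so those names must be defined in the same analysis section (or be built in). Elasticsearch performs its own validation when the index is created; the Python sketch below only illustrates the idea beforehand. The helper unresolved_references and the BUILT_IN table are hypothetical, and the table lists only the built-in names mentioned in this article, not the full set.

```python
# Non-exhaustive: only the built-in component names used in this article.
BUILT_IN = {
    "char_filter": {"html_strip"},
    "tokenizer": {"standard", "keyword", "whitespace", "pattern"},
    "filter": {"lowercase", "stop", "stemmer", "asciifolding", "ngram", "edge_ngram"},
}

def unresolved_references(analysis):
    """Return (analyzer, kind, name) for each name that is neither built in nor defined."""
    problems = []
    for analyzer_name, spec in analysis.get("analyzer", {}).items():
        if spec.get("type") != "custom":
            continue  # only custom analyzers assemble named components
        for kind in ("char_filter", "tokenizer", "filter"):
            names = spec.get(kind, [])
            if isinstance(names, str):
                names = [names]  # "tokenizer" holds a single name
            defined = set(analysis.get(kind, {})) | BUILT_IN[kind]
            problems += [(analyzer_name, kind, n) for n in names if n not in defined]
    return problems

# The "analysis" section of demo2, written as a Python dict.
analysis = {
    "char_filter": {"&_to_and": {"type": "mapping", "mappings": ["&=> and "]}},
    "filter": {"my_stopwords": {"type": "stop", "stopwords": ["the", "a"]}},
    "analyzer": {
        "my_analyzer": {
            "type": "custom",
            "char_filter": ["html_strip", "&_to_and"],
            "tokenizer": "standard",
            "filter": ["lowercase", "my_stopwords"],
        }
    },
}

print(unresolved_references(analysis))  # []
```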

III. Testing the new analyzer:


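--demo1: name the analyzer in the request body (use my_analyzer instead of standard to test the custom analyzer)
GET testindex/_analyze
{
  "analyzer": "standard",
  "text": "The quick & brown fox."
}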

--demo2: analyze the text with the analyzer mapped to the name field
GET testindex/_analyze
{
  "field": "name",
  "text": "The quick & Brown Foxes."
}

--demo3: analyze the text with the analyzer of the name.english sub-field
GET testindex/_analyze
{
  "field": "name.english",
  "text": "The quick & Brown Foxes."
}

IV. Configuring an analyzer for a specific field

--demo1: configure an analyzer for the message field
PUT /testindex/_mapping/testtable
{
    "properties": {
        "message": {
            "type":      "text",
            "analyzer":  "my_analyzer"
        }
    }
}

--demo2: define a name field with an english sub-field that uses the english analyzer
PUT /testindex
{
  "mappings": {
    "testtable": {
      "properties": {
        "name": { 
          "type": "text",
          "fields": {
            "english": { 
              "type":     "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}

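V. Applying the analyzer to an index

When creating the mapping for the target index, set the analyzer on each field to be analyzed so that it uses the analyzer we built. For example: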

PUT /testindex/_mapping/testtable
{
  "testtable": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
An analyzer can also be specified at query time. For example:

POST /testindex/testtable/_search
{
  "query": {
    "match": {
      "name": {
        "query": "it's brown",
        "analyzer": "standard"
      }
    }
  }
}
Alternatively, specify the index-time and search-time analyzers separately in the mapping. For example:

PUT /testindex/_mapping/testtable
{
  "testtable": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
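
Why the index-time and search-time analyzers both matter can be sketched with a toy inverted index in Python. Everything here is a hypothetical simplification: the two analyzer functions only approximate my_analyzer and standard (for example, \w+ splits "it's" where the real standard tokenizer would not), and match mimics a match query that hits any document sharing at least one term.

```python
import re

def standard_like(text):
    """Rough stand-in for the standard analyzer: word tokens, lowercased."""
    return [t.lower() for t in re.findall(r"\w+", text)]

def custom_like(text, stopwords=("the", "a")):
    """Rough stand-in for my_analyzer: map & to and, then drop custom stopwords."""
    text = text.replace("&", " and ")
    return [t for t in standard_like(text) if t not in stopwords]

def build_index(docs, index_analyzer):
    """Map each term produced at index time to the ids of documents containing it."""
    inverted = {}
    for doc_id, text in docs.items():
        for term in index_analyzer(text):
            inverted.setdefault(term, set()).add(doc_id)
    return inverted

def match(inverted, query, search_analyzer):
    """Return ids of documents that contain any term produced at search time."""
    hits = set()
    for term in search_analyzer(query):
        hits |= inverted.get(term, set())
    return hits

docs = {1: "The quick & brown fox", 2: "A lazy dog"}
inverted = build_index(docs, custom_like)

print(sorted(inverted))                              # ['and', 'brown', 'dog', 'fox', 'lazy', 'quick']
print(match(inverted, "it's brown", standard_like))  # {1}
print(match(inverted, "the", standard_like))         # set()
```

The last query finds nothing because the index-time analyzer removed the before it was stored; a search only succeeds when both analyzers produce the same terms.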

Then index some documents and check the result with a simple match query. If something looks wrong, inspect the query with the Validate API. For example:

POST /testindex/testtable/_validate/query?explain
{
  "query": {
    "match": {
      "name": "it's brown"
    }
  }
}
