ES Study Notes 9.3: Request Body Search (Explain, Version, Index Boost, Minimum Score, Named Queries, Inner Hits)

1. Enabling score explanation (Explain)

Setting "explain": true in a search makes every hit in the response carry an extra _explanation field that describes how its score was computed, for example:

curl -X GET "localhost:9200/_search" -H 'Content-Type: application/json' -d'
{
    "explain": true,
    "query" : {
        "term" : { "content" : "中国" }
    }
}
'

The response is:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 1.111892,
        "hits": [
            {
                "_shard": "[index][2]",
                "_node": "bKeGC-Q-SXuyyGlcarDrMg",
                "_index": "index",
                "_type": "fulltext",
                "_id": "4",
                "_score": 1.111892,
                "_source": {
                    "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
                },
                // the score explanation returned for this hit
                "_explanation": {
                    "value": 1.111892,
                    "description": "weight(content:中国 in 0) [PerFieldSimilarity], result of:",
                    "details": [
                        {
                            "value": 1.111892,
                            "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                            "details": [
                                {
                                    "value": 0.98082924,
                                    "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                                    "details": [
                                        {
                                            "value": 1,
                                            "description": "docFreq",
                                            "details": []
                                        },
                                        {
                                            "value": 3,
                                            "description": "docCount",
                                            "details": []
                                        }
                                    ]
                                },
                                {
                                    "value": 1.1336244,
                                    "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                                    "details": [
                                        {
                                            "value": 1,
                                            "description": "termFreq=1.0",
                                            "details": []
                                        },
                                        {
                                            "value": 1.2,
                                            "description": "parameter k1",
                                            "details": []
                                        },
                                        {
                                            "value": 0.75,
                                            "description": "parameter b",
                                            "details": []
                                        },
                                        {
                                            "value": 19.666666,
                                            "description": "avgFieldLength",
                                            "details": []
                                        },
                                        {
                                            "value": 14,
                                            "description": "fieldLength",
                                            "details": []
                                        }
                                    ]
                                }
                            ]
                        }
                    ]
                }
            },
            // ...
        ]
    }
}
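
The two formulas quoted in the description fields can be checked by hand. Here is a minimal Python sketch that plugs the values from the response above into them; the function names bm25_idf and bm25_tf_norm are my own, not part of any Elasticsearch API.

```python
import math

def bm25_idf(doc_freq, doc_count):
    # idf = log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_tf_norm(freq, k1, b, field_length, avg_field_length):
    # tfNorm = (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * field_length / avg_field_length))

# Values copied from the _explanation details above
idf = bm25_idf(doc_freq=1, doc_count=3)
tf_norm = bm25_tf_norm(freq=1.0, k1=1.2, b=0.75,
                       field_length=14, avg_field_length=19.666666)
score = idf * tf_norm  # the explanation says "product of"

print(round(idf, 5), round(tf_norm, 5), round(score, 5))
```

The three numbers should agree with the idf, tfNorm and _score values in the response to about five decimal places; small differences in the last digits are expected because Elasticsearch reports single-precision floats.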

2. Returning document versions (Version)

Setting "version": true in a search makes each hit also return the version number of the document, as follows:

curl -X GET "localhost:9200/_search" -H 'Content-Type: application/json' -d'
{
	"version": true,
	"query": {
		"term": {"content": "中国"}
	}
}
'

Each hit in the response then carries its version number in the _version field:

{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 1.111892,
        "hits": [
            {
                "_index": "index",
                "_type": "fulltext",
                "_id": "4",
                // the document version is returned as well
                "_version": 1,
                "_score": 1.111892,
                "_source": {
                    "content": "中国驻洛杉矶领事馆遭亚裔男子枪击 嫌犯已自首"
                }
            },
            // ...
        ]
    }
}

3. Index Boost

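Index boost lets you configure a different boost level for each index when searching across several indices. This is handy when hits from one index matter more than hits from another (for example, a social graph where every user has their own index); in effect it raises the search weight of an index. A small example: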

curl -X GET "localhost:9200/_search" -H 'Content-Type: application/json' -d'
{
	"indices_boost": {
		"index": 0.4,
		"index2": 1.5
	},
	"query": {
		"term": {"content": "原始数据"}
	}
}
'

The object (curly-brace) form of indices_boost shown above was deprecated in 5.2.0; the array form is recommended instead:

"indices_boost": [
    {"index": 0.4},
    {"index2": 1.5}
]

The response is:

{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 10,
        "successful": 10,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 2,
        "max_score": 0.5753642,
        "hits": [
            {
                "_index": "index2",
                "_type": "fulltext",
                "_id": "1",
                "_score": 0.5753642,
                "_source": {
                    "content": "如果有原始数据，建议重新创建索引"
                }
            },
            {
                "_index": "index",
                "_type": "fulltext",
                "_id": "11",
                "_score": 0.28626728,
                "_source": {
                    "content": "如果有原始数据，建议重新创建索引"
                }
            }
        ]
    }
}

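Note: in my own tests, a larger boost does not guarantee that documents from that index rank higher. Different boosts only nudge the ranking to a degree; only a large gap between the boosts, like the one above, makes it likely that the boosted index comes first. The index names used for boosting (such as index and index2) can also be index aliases or wildcard expressions. If an index is matched by more than one entry, the first matching entry determines its boost. An example: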

curl -X GET "localhost:9200/_search" -H 'Content-Type: application/json' -d'
{
    "indices_boost" : [
        { "alias1" : 1.4 },
        { "index*" : 1.3 }
    ]
}
'

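The request above uses the alias alias1 and the wildcard index* to boost indices. A wildcard may match several indices, and the same index may be matched by both entries; in that case the entry defined first takes effect. Here alias1 is defined before index*, so an index matched by both gets a boost of 1.4 rather than 1.3.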
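
The "first match wins" rule, as I read it from the official documentation, can be modelled in a few lines of Python. This is only a sketch of the rule, not how Elasticsearch implements it: the function resolve_boosts, the alias table and the index names index1 and index2 are all made up for illustration, and fnmatch only approximates Elasticsearch wildcard matching.

```python
from fnmatch import fnmatchcase

def resolve_boosts(indices_boost, concrete_indices, aliases):
    """Return {index: boost}; for each index the first matching entry wins."""
    resolved = {}
    for entry in indices_boost:  # entries are checked in request order
        for pattern, boost in entry.items():
            for index in concrete_indices:
                is_match = (index in aliases.get(pattern, ())
                            or fnmatchcase(index, pattern))
                if is_match and index not in resolved:
                    resolved[index] = boost
    return resolved

# Hypothetical cluster: alias1 points at index1; index2 has no alias
aliases = {"alias1": ["index1"]}
boosts = resolve_boosts(
    [{"alias1": 1.4}, {"index*": 1.3}],
    concrete_indices=["index1", "index2"],
    aliases=aliases,
)
print(boosts)  # {'index1': 1.4, 'index2': 1.3}
```

index1 is matched by both entries and keeps the 1.4 from alias1, while index2 is matched only by the wildcard and gets 1.3.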

4. Excluding documents below a minimum score

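A search can set a minimum score through the min_score field (I think of it as the "passing score"); any document whose score is lower than this value is dropped from the search results. An example: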

curl -X GET "localhost:9200/_search" -H 'Content-Type: application/json' -d'
{
    // the minimum (passing) score
    "min_score": 0.2876821,
    "query" : {
        "term": {"content": "原始"}
    }
}
'
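
The effect of min_score can be pictured as a simple filter over scored hits. The sketch below reuses the two score values from the index boost response earlier in these notes purely as sample numbers (they came from a different query), and it assumes a hit scoring exactly the threshold is kept; apply_min_score is my own helper, not an Elasticsearch API.

```python
def apply_min_score(hits, min_score):
    """Keep only hits whose _score is at least min_score."""
    return [hit for hit in hits if hit["_score"] >= min_score]

# Sample scores borrowed from the index boost response above
hits = [
    {"_index": "index2", "_id": "1", "_score": 0.5753642},
    {"_index": "index", "_id": "11", "_score": 0.28626728},
]
kept = apply_min_score(hits, min_score=0.2876821)
print([hit["_id"] for hit in kept])  # ['1']
```

The second hit scores just under 0.2876821, so only the document with _id 1 survives.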

5. Named Queries [not yet fully understood]

Every filter and query accepts a _name parameter in its top-level definition. The example from the official documentation:

curl -X GET "localhost:9200/_search" -H 'Content-Type: application/json' -d'
{
    "query": {
        "bool" : {
            "should" : [
                {"match" : { "name.first" : {"query" : "shay", "_name" : "first"} }},
                {"match" : { "name.last" : {"query" : "banon", "_name" : "last"} }}
            ],
            "filter" : {
                "terms" : {
                    "name.last" : ["banon", "kimchy"],
                    "_name" : "test"
                }
            }
        }
    }
}
'

I don't fully understand this part yet, so I am setting it aside for now.
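
For what it is worth, the official documentation says that each hit in the response lists the names of the queries it matched in a matched_queries array. Below is a minimal Python sketch of reading that array. The hit is hand-written for illustration and is not real Elasticsearch output, and names_matched is a hypothetical helper.

```python
def names_matched(hit):
    """Return the set of query names reported for a hit (empty if none)."""
    return set(hit.get("matched_queries", []))

# Hand-written hit for illustration only, not copied from a real response
sample_hit = {
    "_id": "1",
    "_source": {"name": {"first": "shay", "last": "banon"}},
    "matched_queries": ["first", "last", "test"],
}

matched = names_matched(sample_hit)
print("first" in matched, "last" in matched, "test" in matched)
```

With names like first, last and test attached to the individual clauses, this makes it possible to tell which should clauses and which filter a given document actually satisfied.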

6. Inner Hits

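The parent-join and nested features can return documents whose matches lie in a different scope. In the parent/child case, parent documents are returned based on matches in child documents, or child documents are returned based on matches in parent documents; in the nested case, documents are returned based on matches in nested inner objects. In both cases, the actual matches in the other scope that caused a document to be returned are hidden. It is often very useful to know which nested inner objects (in the nested case) or which child/parent documents (in the parent/child case) caused a document to be returned, and the inner hits feature exists for exactly this. For each search hit, it returns the additional nested hits that caused that hit to match in a different scope. inner_hits can be defined on nested, has_child and has_parent queries and filters, and the definition looks roughly like this: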

"<query>" : {
    "inner_hits" : {
        <inner_hits_options>
    }
}

If inner_hits is defined on a query that supports it, each search hit contains an inner_hits JSON object with the following structure:

"hits": [
     {
        "_index": ...,
        "_type": ...,
        "_id": ...,
        "inner_hits": {
           "<inner_hits_name>": {
              "hits": {
                 "total": ...,
                 "hits": [
                    {
                       "_type": ...,
                       "_id": ...,
                       ...
                    },
                    ...
                 ]
              }
           }
        },
        ...
     },
     ...
]

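Inner hits support the following options:

from: the offset of the first hit to fetch for each inner_hits in the returned regular search hits;
size: the maximum number of hits to return per inner_hits; by default the top three matching hits are returned;
sort: how the hits within each inner_hits should be sorted; by default they are sorted by score;
name: the name used for this particular inner hits definition in the response, useful when several inner hits are defined in a single search request. The default depends on the query in which the inner hits are defined: for has_child queries and filters it is the child type, for has_parent queries and filters it is the parent type, and for nested queries and filters it is the nested path.

In addition, inner hits support the following per-document features: highlighting, explain, source filtering, script fields, and showing versions.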
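
To make the options concrete, here is a sketch of a request body that sets all four of them, built as a Python dict and printed as JSON. It assumes a nested field called comments with a numeric number subfield, like the one in the nested example in these notes; the name matching_comments and the option values are arbitrary choices of mine.

```python
import json

search_body = {
    "query": {
        "nested": {
            "path": "comments",
            "query": {"match": {"comments.number": 2}},
            "inner_hits": {
                "name": "matching_comments",  # key used under inner_hits in the response
                "from": 0,                    # skip no inner hits
                "size": 5,                    # return up to 5 instead of the default 3
                "sort": [{"comments.number": {"order": "desc"}}],
            },
        }
    }
}

# The printed JSON can be pasted into the -d argument of a curl request
print(json.dumps(search_body, indent=2))
```

With a custom name, the response groups these inner hits under inner_hits.matching_comments instead of the default key, which for a nested query is the nested path.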

Nested inner hits

Nested inner_hits can include the matching nested inner objects as inner hits of a search hit. Here is a complete walkthrough:

// create the index with a nested mapping
curl -X PUT "localhost:9200/test" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "_doc": {
      "properties": {
        "comments": {
          "type": "nested"
        }
      }
    }
  }
}
'
// index a document
curl -X PUT "localhost:9200/test/_doc/1?refresh" -H 'Content-Type: application/json' -d'
{
  "title": "Test title",
  "comments": [
    {
      "author": "kimchy",
      "number": 1
    },
    {
      "author": "nik9000",
      "number": 2
    }
  ]
}
'

// search
curl -X POST "localhost:9200/test/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "nested": {
      "path": "comments",
      "query": {
        "match": {"comments.number" : 2}
      },
      // the inner hits definition inside the nested query; no other options need to be set
      "inner_hits": {}
    }
  }
}
'

// search response
{
    "took": 44,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 1,
        "hits": [
            {
                "_index": "test",
                "_type": "_doc",
                "_id": "1",
                "_score": 1,
                "_source": {
                    "title": "Test title",
                    "comments": [
                        {
                            "author": "kimchy",
                            "number": 1
                        },
                        {
                            "author": "nik9000",
                            "number": 2
                        }
                    ]
                },
                "inner_hits": {
                    // the name of the inner hits definition in the search request; it can be customized with the name option
                    "comments": {
                        "hits": {
                            "total": 1,
                            "max_score": 1,
                            "hits": [
                                {
                                    "_index": "test",
                                    "_type": "_doc",
                                    "_id": "1",
                                    "_nested": {
                                        "field": "comments",
                                        "offset": 1
                                    },
                                    "_score": 1,
                                    "_source": {
                                        "author": "nik9000",
                                        "number": 2
                                    }
                                }
                            ]
                        }
                    }
                }
            }
        ]
    }
}

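In the example above, the _nested metadata is crucial because it identifies which nested inner object the inner hit came from: it names the object array field the nested hit belongs to (here, comments) and gives the offset of the object relative to its position in _source. Because of sorting and scoring, the position of a hit within inner_hits is usually different from the position at which the nested inner object was defined. By default, the hits inside inner_hits also return _source; source filtering can be used to return only part of the source or to disable it. If stored fields are defined at the nested level, they can also be returned through the fields feature. An important default is that the _source returned for hits inside inner_hits is relative to the _nested metadata, so in the example above each nested hit returns only the comment itself, not the entire source of the top-level document that contains the comment.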

Nested inner hits and _source

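Nested documents have no _source field of their own, because the entire document source is stored with the root document under its _source field. To include the source of just the nested document, the source of the root document is parsed and only the part relevant to the nested document is included as the source of the inner hit. Doing this for every matching nested document adds to the time the whole search request takes, especially when size and the inner hits' size are set higher than their defaults. To avoid this relatively expensive source extraction for nested inner hits, you can disable the source and rely solely on doc value fields. For example: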

// create the index with a nested mapping
curl -X PUT "localhost:9200/test" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "_doc": {
      "properties": {
        "comments": {
          "type": "nested"
        }
      }
    }
  }
}
'
// index a document
curl -X PUT "localhost:9200/test/_doc/1?refresh" -H 'Content-Type: application/json' -d'
{
  "title": "Test title",
  "comments": [
    {
      "author": "kimchy",
      "text": "comment text"
    },
    {
      "author": "nik9000",
      "text": "words words words"
    }
  ]
}
'
// search
curl -X POST "localhost:9200/test/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "nested": {
      "path": "comments",
      "query": {
        "match": {"comments.text" : "words"}
      },
      "inner_hits": {
        "_source" : false,
        "docvalue_fields" : [
          {
            "field": "comments.text.keyword",
            "format": "use_field_mapping"
          }
        ]
      }
    }
  }
}
'

Hierarchical levels of nested object fields and inner hits

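If a mapping has several levels of hierarchical nested object fields, each level can be reached through a dot-notated path (that is, xx.xxx.xxx). For example, if a comments nested field contains a votes nested field and the votes should be returned directly with the root hits, the path can be defined as follows: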

curl -X PUT "localhost:9200/test" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "_doc": {
      "properties": {
        "comments": {
          "type": "nested",
          "properties": {
            "votes": {
              "type": "nested"
            }
          }
        }
      }
    }
  }
}
'
curl -X PUT "localhost:9200/test/_doc/1?refresh" -H 'Content-Type: application/json' -d'
{
  "title": "Test title",
  "comments": [
    {
      "author": "kimchy",
      "text": "comment text",
      "votes": []
    },
    {
      "author": "nik9000",
      "text": "words words words",
      "votes": [
        {"value": 1 , "voter": "kimchy"},
        {"value": -1, "voter": "other"}
      ]
    }
  ]
}
'
curl -X POST "localhost:9200/test/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "nested": {
      "path": "comments.votes",
      "query": {
        "match": {
          "comments.votes.voter": "kimchy"
        }
      },
      "inner_hits": {}
    }
  }
}
'

// search response
{
  ...,
  "hits": {
    "total": 1,
    "max_score": 0.6931472,
    "hits": [
      {
        "_index": "test",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.6931472,
        "_source": ...,
        "inner_hits": {
          // this kind of indirect referencing is only supported for nested inner hits
          "comments.votes": {
            "hits": {
              "total": 1,
              "max_score": 0.6931472,
              "hits": [
                {
                  "_index": "test",
                  "_type": "_doc",
                  "_id": "1",
                  "_nested": {
                    "field": "comments",
                    "offset": 1,
                    "_nested": {
                      "field": "votes",
                      "offset": 0
                    }
                  },
                  "_score": 0.6931472,
                  "_source": {
                    "value": 1,
                    "voter": "kimchy"
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}
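
The chained _nested metadata in the response above can be followed back into the root document's _source. Here is a minimal Python sketch of that lookup, using the document indexed above; resolve_nested is a helper name I made up, not something Elasticsearch provides.

```python
def resolve_nested(root_source, nested):
    """Follow a (possibly chained) _nested entry down into the root _source."""
    current = root_source
    while nested is not None:
        # each level names an array field and an offset into that array
        current = current[nested["field"]][nested["offset"]]
        nested = nested.get("_nested")
    return current

# The document indexed above
root_source = {
    "title": "Test title",
    "comments": [
        {"author": "kimchy", "text": "comment text", "votes": []},
        {"author": "nik9000", "text": "words words words", "votes": [
            {"value": 1, "voter": "kimchy"},
            {"value": -1, "voter": "other"},
        ]},
    ],
}
# The _nested metadata of the inner hit in the response above
nested_meta = {"field": "comments", "offset": 1,
               "_nested": {"field": "votes", "offset": 0}}

print(resolve_nested(root_source, nested_meta))  # {'value': 1, 'voter': 'kimchy'}
```

The object found this way is the same as the _source of the inner hit, which is what "relative to the _nested metadata" means in practice.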

Parent/child inner hits

Parent/child inner_hits can be used to include the parent or the child documents. An example:

curl -X PUT "localhost:9200/test" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_join_field": {
          "type": "join",
          "relations": {
            "my_parent": "my_child"
          }
        }
      }
    }
  }
}
'
curl -X PUT "localhost:9200/test/_doc/1?refresh" -H 'Content-Type: application/json' -d'
{
  "number": 1,
  "my_join_field": "my_parent"
}
'
curl -X PUT "localhost:9200/test/_doc/2?routing=1&refresh" -H 'Content-Type: application/json' -d'
{
  "number": 1,
  "my_join_field": {
    "name": "my_child",
    "parent": "1"
  }
}
'
curl -X POST "localhost:9200/test/_search" -H 'Content-Type: application/json' -d'
{
  "query": {
    "has_child": {
      "type": "my_child",
      "query": {
        "match": {
          "number": 1
        }
      },
      // the inner hits definition, same as in the nested example
      "inner_hits": {}
    }
  }
}
'

// search response
{
    ...,
    "hits": {
        "total": 1,
        "max_score": 1.0,
        "hits": [
            {
                "_index": "test",
                "_type": "_doc",
                "_id": "1",
                "_score": 1.0,
                "_source": {
                    "number": 1,
                    "my_join_field": "my_parent"
                },
                "inner_hits": {
                    "my_child": {
                        "hits": {
                            "total": 1,
                            "max_score": 1.0,
                            "hits": [
                                {
                                    "_index": "test",
                                    "_type": "_doc",
                                    "_id": "2",
                                    "_score": 1.0,
                                    "_routing": "1",
                                    "_source": {
                                        "number": 1,
                                        "my_join_field": {
                                            "name": "my_child",
                                            "parent": "1"
                                        }
                                    }
                                }
                            ]
                        }
                    }
                }
            }
        ]
    }
}
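
To see why parent 1 comes back with child 2 as its inner hit, the has_child logic can be imitated in memory. The Python sketch below works on the two documents indexed above and only illustrates the idea; it is not how Elasticsearch executes the query, and the has_child function here is my own toy version.

```python
# The two documents indexed above, keyed by _id
docs = {
    "1": {"number": 1, "my_join_field": "my_parent"},
    "2": {"number": 1, "my_join_field": {"name": "my_child", "parent": "1"}},
}

def has_child(docs, child_type, predicate):
    """Return {parent_id: [ids of matching children of child_type]}."""
    inner_hits = {}
    for doc_id, source in docs.items():
        join = source["my_join_field"]
        # child documents store their join field as an object with name and parent
        if isinstance(join, dict) and join["name"] == child_type and predicate(source):
            inner_hits.setdefault(join["parent"], []).append(doc_id)
    return inner_hits

result = has_child(docs, "my_child", lambda source: source["number"] == 1)
print(result)  # {'1': ['2']}
```

The keys correspond to the top-level hits (the parents) and the lists to what appears under inner_hits.my_child in the response.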