Bucket aggregations

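Unlike metrics aggregations, bucket aggregations do not compute metrics over fields; they create buckets of documents. Each bucket is associated with a criterion (which depends on the aggregation type) that determines whether a document in the current context "falls" into it. In other words, buckets effectively define sets of documents. Besides the buckets themselves, a bucket aggregation also computes and returns the number of documents that fell into each bucket.
Unlike metrics aggregations, bucket aggregations can hold sub-aggregations. These sub-aggregations are computed for each bucket created by their "parent" bucket aggregation.
There are different bucket aggregators, each with its own "bucketing" strategy. Some define a single bucket, some define a fixed number of buckets, and others create buckets dynamically during the aggregation.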

Time-related aggregations

Date histogram aggregation
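This multi-bucket aggregation can only be used with date or date range values. Elasticsearch represents dates internally as long values, and here the interval can be specified with date/time expressions. Time-based data needs this special support because time-based intervals are not always a fixed length.
Calendar-aware intervals understand that daylight saving time changes the length of specific days, that months have different numbers of days, and that leap seconds can be tacked onto a particular year.
Fixed intervals, by contrast, are always multiples of SI units and do not change with the calendar context. The available calendar_interval values are: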

minute
hour
day
week
month
quarter
year
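
To make the difference concrete, here is a small Python sketch using only the standard library (it is not Elasticsearch code, and the helper names are made up for illustration). It compares calendar months, whose length varies, with a repeated fixed 30-day step, which is roughly what a fixed interval such as fixed_interval: "30d" amounts to.

```python
import calendar
from datetime import datetime, timedelta, timezone

def month_lengths(year):
    """Number of days in each calendar month of the given year."""
    return [calendar.monthrange(year, month)[1] for month in range(1, 13)]

def fixed_edges(start, step_days, count):
    """Bucket boundaries produced by repeating one fixed-length step."""
    return [start + i * timedelta(days=step_days) for i in range(count)]

# Calendar months: the bucket length depends on the month.
print(month_lengths(2015)[:3])  # January, February, March

# Fixed 30-day buckets: the length is constant, the boundaries drift.
start = datetime(2015, 1, 1, tzinfo=timezone.utc)
for edge in fixed_edges(start, 30, 4):
    print(edge.date())
```

The fixed boundaries slide away from the first day of each month, while calendar months keep their boundaries but change length.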

Data:

PUT my_index/_doc/1?refresh
{
  "date": "2015-10-01T00:30:00Z"
}

PUT my_index/_doc/2?refresh
{
  "date": "2015-10-01T01:30:00Z"
}

PUT my_index/_doc/3?refresh
{
  "date": "2015-10-02T11:05:00Z"
}

Aggregation:

GET my_index/_search?size=0
{
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field":     "date",
        "calendar_interval":  "day"
      }
    }
  }
}

Counting documents per day gives:

"aggregations" : {
    "by_day" : {
      "buckets" : [
        {
          "key_as_string" : "2015-10-01T00:00:00.000Z",
          "key" : 1443657600000,
          "doc_count" : 2
        },
        {
          "key_as_string" : "2015-10-02T00:00:00.000Z",
          "key" : 1443744000000,
          "doc_count" : 1
        }
      ]
    }
  }

October 1 has 2 documents and October 2 has 1.

Shifting the day boundaries by one hour with time_zone:

GET my_index/_search?size=0
{
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field":     "date",
        "calendar_interval":  "day",
        "time_zone": "-01:00"
      }
    }
  }
}

Result:

"aggregations" : {
    "by_day" : {
      "buckets" : [
        {
          "key_as_string" : "2015-09-30T00:00:00.000-01:00",
          "key" : 1443574800000,
          "doc_count" : 1
        },
        {
          "key_as_string" : "2015-10-01T00:00:00.000-01:00",
          "key" : 1443661200000,
          "doc_count" : 1
        },
        {
          "key_as_string" : "2015-10-02T00:00:00.000-01:00",
          "key" : 1443747600000,
          "doc_count" : 1
        }
      ]
    }
  }

Document 1 has the timestamp 2015-10-01T00:30:00Z. In the -01:00 time zone that is 23:30 on the previous day, so September 30 now has 1 document.
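
The effect of time_zone can be reproduced locally. The following Python sketch is only an approximation of the idea (convert each timestamp to the target zone, then truncate to midnight in that zone); the function name and the fixed-offset handling are assumptions made for illustration, not how Elasticsearch is implemented.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def day_buckets(timestamps, utc_offset_hours=0):
    """Count ISO-8601 UTC timestamps per calendar day of a fixed-offset zone.

    Keys are epoch milliseconds of local midnight, like "key" in the response.
    """
    zone = timezone(timedelta(hours=utc_offset_hours))
    counts = Counter()
    for ts in timestamps:
        local = datetime.fromisoformat(ts.replace("Z", "+00:00")).astimezone(zone)
        midnight = local.replace(hour=0, minute=0, second=0, microsecond=0)
        counts[int(midnight.timestamp() * 1000)] += 1
    return dict(sorted(counts.items()))

docs = ["2015-10-01T00:30:00Z", "2015-10-01T01:30:00Z", "2015-10-02T11:05:00Z"]
print(day_buckets(docs))                       # UTC day boundaries
print(day_buckets(docs, utc_offset_hours=-1))  # boundaries in -01:00
```

With the three sample documents, the keys and counts match the two responses shown above.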

Shifting the buckets by 6 hours with offset:

GET my_index/_search?size=0
{
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field":     "date",
        "calendar_interval":  "day",
        "offset":    "+6h"
      }
    }
  }
}

Result:

"aggregations" : {
    "by_day" : {
      "buckets" : [
        {
          "key_as_string" : "2015-09-30T06:00:00.000Z",
          "key" : 1443592800000,
          "doc_count" : 2
        },
        {
          "key_as_string" : "2015-10-01T06:00:00.000Z",
          "key" : 1443679200000,
          "doc_count" : 0
        },
        {
          "key_as_string" : "2015-10-02T06:00:00.000Z",
          "key" : 1443765600000,
          "doc_count" : 1
        }
      ]
    }
  }
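
A rough way to check these numbers is to apply the offset by hand: subtract it, round down to a whole day, and add it back. The Python sketch below does that on epoch milliseconds. The helper names are hypothetical, and filling the gap with a zero count simply mirrors the empty bucket in the response above.

```python
from datetime import datetime

DAY_MS = 24 * 60 * 60 * 1000

def to_millis(iso_utc):
    """Epoch milliseconds of an ISO-8601 UTC timestamp ending in 'Z'."""
    parsed = datetime.fromisoformat(iso_utc.replace("Z", "+00:00"))
    return int(parsed.timestamp() * 1000)

def day_buckets_with_offset(timestamps, offset_hours):
    """Daily buckets whose boundaries are shifted by offset_hours."""
    offset_ms = offset_hours * 60 * 60 * 1000
    counts = {}
    for ts in timestamps:
        key = (to_millis(ts) - offset_ms) // DAY_MS * DAY_MS + offset_ms
        counts[key] = counts.get(key, 0) + 1
    # Emit empty buckets between the first and last key with a zero count.
    first, last = min(counts), max(counts)
    return {key: counts.get(key, 0) for key in range(first, last + 1, DAY_MS)}

docs = ["2015-10-01T00:30:00Z", "2015-10-01T01:30:00Z", "2015-10-02T11:05:00Z"]
print(day_buckets_with_offset(docs, 6))
```

Both documents from the early hours of October 1 fall before 06:00, so they land in the bucket that starts at 2015-09-30T06:00:00Z.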

Date Range Aggregation

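A range aggregation dedicated to date values. The main difference from the normal range aggregation is that the from and to values can be written as date math expressions, and that a date format can be specified for the from and to fields returned in the response. Note that this aggregation includes the from value and excludes the to value of each range.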

POST /sales/_search?size=0
{
   "aggs": {
       "range": {
           "date_range": {
               "field": "date",
               "missing": "1976/11/30",
               "ranges": [
                  {
                    "key": "Older",
                    "to": "2016/02/01"
                  }, 
                  {
                    "key": "Newer",
                    "from": "2016/02/01",
                    "to" : "now/d"
                  }
              ]
          }
      }
   }
}

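Documents dated before 2016/02/01 (exclusive) go to "Older".
Documents dated from 2016/02/01 (inclusive) up to now/d, the start of the current day (exclusive), go to "Newer".
Documents without a date value are treated as if they were dated 1976/11/30 because of the missing parameter, so they land in "Older".
More usage: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-daterange-aggregation.html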
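
The inclusive/exclusive boundaries and the missing value are easy to get wrong, so here is a minimal Python sketch of the two ranges above. It works on plain dates, takes the value of now/d as a parameter, and uses a made-up function name; it only illustrates the semantics described in the text.

```python
from datetime import date

def date_range_keys(doc_date, today, missing=date(1976, 11, 30)):
    """Return the keys of the ranges that a document's date falls into."""
    value = missing if doc_date is None else doc_date
    cutoff = date(2016, 2, 1)
    keys = []
    if value < cutoff:           # "to" is exclusive
        keys.append("Older")
    if cutoff <= value < today:  # "from" is inclusive, "to": "now/d" is exclusive
        keys.append("Newer")
    return keys

today = date(2020, 1, 15)  # stand-in for now/d, chosen arbitrarily
print(date_range_keys(date(2016, 1, 31), today))
print(date_range_keys(date(2016, 2, 1), today))
print(date_range_keys(None, today))  # no date field: uses the missing value
print(date_range_keys(today, today))  # excluded from "Newer"
```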

Filter Aggregation

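As the name suggests, this aggregates after filtering: it narrows the current aggregation context down to a specific set of documents.
For example, to compute the average price of hats only: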

POST /sales/_search?size=0
{
    "aggs" : {
        "t_shirts" : {
            "filter" : { "term": { "type": "hat" } },
            "aggs" : {
                "avg_price" : { "avg" : { "field" : "price" } }
            }
        }
    }
}

Result:

"aggregations" : {
    "t_shirts" : {
      "doc_count" : 3,
      "avg_price" : {
        "value" : 92.5
      }
    }
  }

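The 3 hats have an average price of 92.5. (The bucket is still named t_shirts in the request, but its filter selects hats.)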

Filters Aggregation

Defines a multi-bucket aggregation in which each bucket is associated with a filter. Each bucket collects all documents that match its filter.

PUT /logs/_bulk?refresh
{ "index" : { "_id" : 1 } }
{ "body" : "warning: page could not be rendered" }
{ "index" : { "_id" : 2 } }
{ "body" : "warning:authentication error" }
{ "index" : { "_id" : 3 } }
{ "body" : "warning: connection timed out" }
{ "index" : { "_id" : 4 } }
{ "body" : "error: database disconnectioned " }

Aggregation:

GET /logs/_search
{
  "size": 0,
  "aggs" : {
    "messages" : {
      "filters" : {
        "filters" : {
          "errors" :   { "match" : { "body" : "error"   }},
          "warnings" : { "match" : { "body" : "warning" }}
        }
      }
    }
  }
}

Result:

"aggregations" : {
    "messages" : {
      "buckets" : {
        "errors" : {
          "doc_count" : 2
        },
        "warnings" : {
          "doc_count" : 2
        }
      }
    }
  }

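Why does errors have 2 documents, and why does warnings have 2 rather than 3? Think about it; as a hint, look closely at how the body of document 2 is written.
Add one more document: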

PUT logs/_doc/5?refresh
{
  "body": "info: user Bob logged out"
}

Aggregate again, adding a bucket for everything else with other_bucket_key:

GET logs/_search
{
  "size": 0,
  "aggs" : {
    "messages" : {
      "filters" : {
        "other_bucket_key": "other_messages",
        "filters" : {
          "errors" :   { "match" : { "body" : "error"   }},
          "warnings" : { "match" : { "body" : "warning" }}
        }
      }
    }
  }
}

Result:

"aggregations" : {
    "messages" : {
      "buckets" : {
        "errors" : {
          "doc_count" : 2
        },
        "warnings" : {
          "doc_count" : 2
        },
        "other_messages" : {
          "doc_count" : 1
        }
      }
    }
  }
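
One way to see where the counts 2, 2 and 1 come from is to simulate the filters on the five sample bodies. The Python sketch below is a local approximation, not Elasticsearch code. Its tokenizer is a deliberately crude stand-in that rests on one assumption: the field uses the default analyzer, and that analyzer does not split a word at a colon sitting directly between two letters (so "warning:authentication" stays a single token). All function names are hypothetical.

```python
import re

def tokenize(text):
    """Crude analyzer stand-in (assumption): lowercase, split on non-letters,
    but keep letters joined by a colon with no space as one token."""
    return re.findall(r"[a-z]+(?::[a-z]+)*", text.lower())

def filters_agg(bodies, filters, other_bucket_key=None):
    """Count documents per named single-term filter, plus an optional
    bucket for documents that match none of the filters."""
    counts = {name: 0 for name in filters}
    if other_bucket_key is not None:
        counts[other_bucket_key] = 0
    for body in bodies:
        tokens = set(tokenize(body))
        matched = [name for name, term in filters.items() if term in tokens]
        for name in matched:
            counts[name] += 1
        if not matched and other_bucket_key is not None:
            counts[other_bucket_key] += 1
    return counts

bodies = [
    "warning: page could not be rendered",
    "warning:authentication error",
    "warning: connection timed out",
    "error: database disconnectioned ",
    "info: user Bob logged out",
]
filters = {"errors": "error", "warnings": "warning"}
print(filters_agg(bodies, filters, other_bucket_key="other_messages"))
```

Under that assumption, document 2 matches only the errors filter, which would explain why warnings stops at 2.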

Global Aggregation

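Defines a single bucket of all the documents within the search execution context. This context is defined by the indices and document types you are searching, but it is not affected by the search query itself.
In other words, a single request can aggregate over the whole index and over the query hits at the same time.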

POST /sales/_search?size=0
{
    "query" : {
        "match" : { "type" : "hat" }
    },
    "aggs" : {
        "all_products" : {
            "global" : {}, 
            "aggs" : { 
                "avg_price" : { "avg" : { "field" : "price" } }
            }
        },
        "hats": { "avg" : { "field" : "price" } }
    }
}

As the response shows, the query matches 3 documents while all_products covers all 7:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "all_products" : {
      "doc_count" : 7,
      "avg_price" : {
        "value" : 92.25
      }
    },
    "hats" : {
      "value" : 92.5
    }
  }
}
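
The difference between the two scopes can be sketched in a few lines of Python. The documents below are invented purely for illustration (the contents of the sales index are not shown here), so the numbers differ from the response above; only the scoping behavior is the point.

```python
def avg(values):
    """Arithmetic mean, or None for an empty list."""
    return sum(values) / len(values) if values else None

def search_with_global(docs, wanted_type):
    """Mimic a query plus two aggregations: one global, one query-scoped."""
    hits = [d for d in docs if d["type"] == wanted_type]
    return {
        "hits_total": len(hits),
        # "global" ignores the query and sees every document.
        "all_products": {
            "doc_count": len(docs),
            "avg_price": avg([d["price"] for d in docs]),
        },
        # A normal top-level aggregation only sees the query hits.
        "query_scoped_avg": avg([d["price"] for d in hits]),
    }

# Hypothetical sample documents, not the real sales index.
docs = [
    {"type": "hat", "price": 200},
    {"type": "hat", "price": 100},
    {"type": "bag", "price": 150},
    {"type": "t-shirt", "price": 50},
]
print(search_with_global(docs, "hat"))
```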

Histogram Aggregation

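A multi-bucket aggregation that can be applied to numeric values or numeric range values extracted from the documents. It dynamically builds fixed-size buckets (the size is also called the interval) over the values. For example, if the documents have a field holding a price (numeric), this aggregation can be configured to dynamically build buckets with an interval of 5 (for prices, that could mean $5). When the aggregation runs, the price field of every document is evaluated and rounded down to its closest bucket. For example, with a price of 32 and a bucket size of 5, the rounding yields 30, so the document "falls" into the bucket associated with key 30. More formally, this is the rounding function that is used: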

bucket_key = Math.floor((value - offset) / interval) * interval + offset
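
The same formula written as a small Python function, as a sketch for experimenting with different intervals and offsets (the function is a direct transcription of the formula above, not Elasticsearch code):

```python
import math

def bucket_key(value, interval, offset=0):
    """Round value down to the start of its bucket."""
    return math.floor((value - offset) / interval) * interval + offset

print(bucket_key(32, 5))            # the example from the text
print(bucket_key(-3, 5))            # floor rounds toward negative infinity
print(bucket_key(31, 5, offset=2))  # buckets start at ..., 22, 27, 32, ...
```

Because floor is used rather than truncation, negative values also round down, so -3 with an interval of 5 belongs to the bucket with key -5.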

Counting documents in price intervals of 10:

POST /sales/_search?size=0
{
    "aggs" : {
        "prices" : {
            "histogram" : {
                "field" : "price",
                "interval" : 10
            }
        }
    }
}

There are 2 documents from 80 up to (but not including) 90, 1 from 90 up to 100, and 1 from 100 up to 110:

"aggregations" : {
    "prices" : {
      "buckets" : [
        {
          "key" : 80.0,
          "doc_count" : 2
        },
        {
          "key" : 90.0,
          "doc_count" : 1
        },
        {
          "key" : 100.0,
          "doc_count" : 1
        }
      ]
    }
  }

Range Aggregation

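Lets the user define a set of ranges, each of which represents a bucket. During aggregation, the value extracted from each document is checked against each bucket range, and the matching documents are placed in that bucket.
Note that this aggregation includes the from value and excludes the to value of each range.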

Aggregation:

GET /sales/_search
{
    "aggs" : {
        "price_ranges" : {
            "range" : {
                "field" : "price",
                "ranges" : [
                    { "to" : 80.0 },
                    { "from" : 80.0, "to" : 90.0 },
                    { "from" : 90.0 }
                ]
            }
        }
    }
}

There are 0 documents below 80, 2 from 80 up to (but not including) 90, and 2 at 90 or above:

"aggregations" : {
    "price_ranges" : {
      "buckets" : [
        {
          "key" : "*-80.0",
          "to" : 80.0,
          "doc_count" : 0
        },
        {
          "key" : "80.0-90.0",
          "from" : 80.0,
          "to" : 90.0,
          "doc_count" : 2
        },
        {
          "key" : "90.0-*",
          "from" : 90.0,
          "doc_count" : 2
        }
      ]
    }
  }

With a key label for each range:

GET /_search
{
    "aggs" : {
        "price_ranges" : {
            "range" : {
                "field" : "price",
                "keyed" : true,
                "ranges" : [
                    { "key" : "cheap", "to" : 100 },
                    { "key" : "average", "from" : 100, "to" : 200 },
                    { "key" : "expensive", "from" : 200 }
                ]
            }
        }
    }
}

Range buckets with a stats sub-aggregation:

GET /_search
{
    "aggs" : {
        "price_ranges" : {
            "range" : {
                "field" : "price",
                "ranges" : [
                    { "to" : 100 },
                    { "from" : 100, "to" : 200 },
                    { "from" : 200 }
                ]
            },
            "aggs" : {
                "price_stats" : {
                    "stats" : { "field" : "price" }
                }
            }
        }
    }
}

Stats result:

{
  ...
  "aggregations": {
    "price_ranges": {
      "buckets": [
        {
          "key": "*-100.0",
          "to": 100.0,
          "doc_count": 2,
          "price_stats": {
            "count": 2,
            "min": 10.0,
            "max": 50.0,
            "avg": 30.0,
            "sum": 60.0
          }
        },
        {
          "key": "100.0-200.0",
          "from": 100.0,
          "to": 200.0,
          "doc_count": 2,
          "price_stats": {
            "count": 2,
            "min": 150.0,
            "max": 175.0,
            "avg": 162.5,
            "sum": 325.0
          }
        },
        {
          "key": "200.0-*",
          "from": 200.0,
          "doc_count": 3,
          "price_stats": {
            "count": 3,
            "min": 200.0,
            "max": 200.0,
            "avg": 200.0,
            "sum": 600.0
          }
        }
      ]
    }
  }
}
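
To show how a sub-aggregation is computed once per parent bucket, here is a minimal Python sketch of the range plus stats combination. The price list is inferred from the statistics in the response above (the underlying documents are not shown), and the function names are made up for this example.

```python
def range_key(lower, upper):
    """Build a key such as '*-100.0' or '100.0-200.0'."""
    def part(bound):
        return "*" if bound is None else str(float(bound))
    return f"{part(lower)}-{part(upper)}"

def range_stats(values, ranges):
    """Bucket values into [from, to) ranges and compute stats per bucket."""
    result = {}
    for lower, upper in ranges:
        inside = [v for v in values
                  if (lower is None or v >= lower)   # from is inclusive
                  and (upper is None or v < upper)]  # to is exclusive
        stats = {"count": len(inside)}
        if inside:
            stats.update(min=min(inside), max=max(inside),
                         avg=sum(inside) / len(inside), sum=sum(inside))
        result[range_key(lower, upper)] = stats
    return result

# Prices inferred from the response above, not taken from an actual index.
prices = [10.0, 50.0, 150.0, 175.0, 200.0, 200.0, 200.0]
ranges = [(None, 100), (100, 200), (200, None)]
for key, stats in range_stats(prices, ranges).items():
    print(key, stats)
```

Note that the three documents priced at exactly 200 land in the last bucket, because 200 is excluded from the range that ends at 200.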

Built-in sorting

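These sort modes are intrinsic to the bucket: they operate on data that the bucket itself generates, such as doc_count. They share the same syntax but differ slightly depending on the bucket being used.
Let's run a terms aggregation sorted by ascending doc_count: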

GET /cars/_search
{
    "size" : 0,
    "aggs" : {
        "colors" : {
            "terms" : {
              "field" : "color",
              "order": {
                "_count" : "asc" 
              }
            }
        }
    }
}

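Before this aggregation can run, fielddata has to be enabled on the text field; otherwise the request fails with an error. The official explanation is that text fields are analyzed: a value such as New York would be split into two buckets, one for new and one for york, which is clearly not what an aggregation should return. The alternative is to change the field to the keyword type, which is not analyzed and can be aggregated on directly.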

PUT /cars/_mapping
{
  "properties": {
    "color": {
      "type": "text",
      "fielddata": true
    }
  }
}

POST /cars/_bulk
{ "index": {}}
{ "price" : 10000, "color" : "red", "make" : "honda", "sold" : "2014-10-28" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 30000, "color" : "green", "make" : "ford", "sold" : "2014-05-18" }
{ "index": {}}
{ "price" : 15000, "color" : "blue", "make" : "toyota", "sold" : "2014-07-02" }
{ "index": {}}
{ "price" : 12000, "color" : "green", "make" : "toyota", "sold" : "2014-08-19" }
{ "index": {}}
{ "price" : 20000, "color" : "red", "make" : "honda", "sold" : "2014-11-05" }
{ "index": {}}
{ "price" : 80000, "color" : "red", "make" : "bmw", "sold" : "2014-01-01" }
{ "index": {}}
{ "price" : 25000, "color" : "blue", "make" : "ford", "sold" : "2014-02-12" }

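With the _count keyword, buckets are sorted by ascending doc_count.

The order object added to the aggregation lets us sort by one of several values:

_count
- Sort by document count. Valid for terms, histogram, and date_histogram.
_term
- Sort alphabetically by the string value of the term. Works only with terms. Elasticsearch 6.0 and later deprecate it in favor of _key.
_key
- Sort by the value of each bucket's key (conceptually similar to _term). Used with histogram and date_histogram, and in Elasticsearch 6.0 and later with terms as well.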
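
The ordering options can be tried out locally on the colors from the bulk request above. This Python sketch is only an illustration with a made-up function name; in particular, breaking ties on equal counts by ascending key is an assumption of the sketch.

```python
from collections import Counter

# The "color" values of the eight cars indexed above.
colors = ["red", "red", "green", "blue", "green", "red", "red", "blue"]

def terms_buckets(values, order_by="_count", direction="asc"):
    """Return (key, doc_count) pairs ordered like a terms aggregation."""
    counts = Counter(values)
    if order_by == "_count":
        sign = 1 if direction == "asc" else -1
        # Ties on doc_count are broken by key, ascending (assumption).
        return sorted(counts.items(), key=lambda kv: (sign * kv[1], kv[0]))
    if order_by == "_key":
        return sorted(counts.items(), key=lambda kv: kv[0],
                      reverse=(direction == "desc"))
    raise ValueError(f"unsupported order: {order_by}")

print(terms_buckets(colors, "_count", "asc"))
print(terms_buckets(colors, "_count", "desc"))
print(terms_buckets(colors, "_key", "desc"))
```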

Sorting by a metric

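Sometimes we want to sort by the value a metric computes. In our car-sales analytics dashboard, we might want a bar chart of sales by car color, with the bars ordered by ascending average price.
We can add a metric and then reference it from the order parameter: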

GET /cars/_search
{
    "size" : 0,
    "aggs" : {
        "colors" : {
            "terms" : {
              "field" : "color",
              "order": {
                "avg_price" : "asc" 
              }
            },
            "aggs": {
                "avg_price": {
                    "avg": {"field": "price"} 
                }
            }
        }
    }
}

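The avg_price metric computes the average price for each color bucket.
The buckets are sorted by that average in ascending order.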

Result:

"aggregations" : {
    "colors" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "blue",
          "doc_count" : 2,
          "avg_price" : {
            "value" : 20000.0
          }
        },
        {
          "key" : "green",
          "doc_count" : 2,
          "avg_price" : {
            "value" : 21000.0
          }
        },
        {
          "key" : "red",
          "doc_count" : 4,
          "avg_price" : {
            "value" : 32500.0
          }
        }
      ]
    }
  }
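
These averages can be checked by hand from the eight cars indexed earlier. The following Python sketch groups the prices by color and sorts the buckets by the computed average; it is a local recomputation with a hypothetical function name, not a client call.

```python
from collections import defaultdict

# (price, color) of the eight cars from the bulk request above.
cars = [
    (10000, "red"), (20000, "red"), (30000, "green"), (15000, "blue"),
    (12000, "green"), (20000, "red"), (80000, "red"), (25000, "blue"),
]

def colors_by_avg_price(cars):
    """Group prices by color, then order the buckets by average price."""
    prices = defaultdict(list)
    for price, color in cars:
        prices[color].append(price)
    buckets = [
        {"key": color, "doc_count": len(p), "avg_price": sum(p) / len(p)}
        for color, p in prices.items()
    ]
    return sorted(buckets, key=lambda bucket: bucket["avg_price"])

for bucket in colors_by_avg_price(cars):
    print(bucket)
```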

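Any metric can be used for sorting this way, simply by referencing its name. Some metrics, however, emit multiple values. The extended_stats metric is a good example: it outputs several values, so sorting on one of them requires a dot path to the value of interest, such as stats.variance for an extended_stats metric named stats.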

Adjacency Matrix Aggregation

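Returns a bucket aggregation in the form of an adjacency matrix. The request provides a collection of named filter expressions, similar to the filters aggregation request. Each bucket in the response represents a non-empty cell in the matrix of intersecting filters.
Given filters named A, B, and C, the response would return buckets with the following names: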

      A     B     C
A     A     A&B   A&C
B           B     B&C
C                 C

PUT /emails/_bulk?refresh
{ "index" : { "_id" : 1 } }
{ "accounts" : ["hillary", "sidney"]}
{ "index" : { "_id" : 2 } }
{ "accounts" : ["hillary", "donald"]}
{ "index" : { "_id" : 3 } }
{ "accounts" : ["vladimir", "donald"]}
{ "index" : { "_id" : 4 } }
{ "accounts" : ["vladimir", "sidney"]}

Result of an adjacency_matrix aggregation named interactions over three filters, grpA, grpB, and grpC:

"aggregations" : {
    "interactions" : {
      "buckets" : [
        {
          "key" : "grpA",
          "doc_count" : 3
        },
        {
          "key" : "grpA&grpB",
          "doc_count" : 1
        },
        {
          "key" : "grpA&grpC",
          "doc_count" : 1
        },
        {
          "key" : "grpB",
          "doc_count" : 2
        },
        {
          "key" : "grpB&grpC",
          "doc_count" : 1
        },
        {
          "key" : "grpC",
          "doc_count" : 2
        }
      ]
    }
  }
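
Since the filter definitions behind grpA, grpB and grpC are not shown above, the Python sketch below assumes three terms-style filters on accounts. Those definitions are an assumption chosen because they reproduce the counts in the response; the function is a local illustration of how the matrix cells are counted, not Elasticsearch code.

```python
from itertools import combinations

def adjacency_matrix(docs, filters):
    """Count documents per filter and per pair of filters (non-empty only)."""
    matches = {
        name: {i for i, accounts in enumerate(docs) if wanted & set(accounts)}
        for name, wanted in filters.items()
    }
    buckets = {}
    names = sorted(filters)
    for name in names:
        if matches[name]:
            buckets[name] = len(matches[name])
    for first, second in combinations(names, 2):
        both = matches[first] & matches[second]
        if both:
            buckets[f"{first}&{second}"] = len(both)
    return dict(sorted(buckets.items()))

# The "accounts" field of the four emails indexed above.
emails = [
    ["hillary", "sidney"],
    ["hillary", "donald"],
    ["vladimir", "donald"],
    ["vladimir", "sidney"],
]
# Assumed filter definitions (not shown in the text above).
filters = {
    "grpA": {"hillary", "sidney"},
    "grpB": {"donald", "mitt"},
    "grpC": {"vladimir", "nigel"},
}
print(adjacency_matrix(emails, filters))
```

A document counts toward a pair such as grpA&grpB only if it matches both filters, which is why each intersection here holds a single email.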

ElasticSearch Basics 6: Bucket aggregations