Scenario:
Group records by the dataset field, deduplicate on event_no, and count the deduplicated results per group.
The roughly equivalent deduplicated query in es-sql:
SELECT dataset, COUNT(DISTINCT event_no) AS count FROM json_archives_qc/info GROUP BY dataset
Why do I say "roughly equivalent"? Because in terms of accuracy and performance it still differs a lot from ordinary SQL!
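To make the semantics of that SQL concrete, here is a plain-Java sketch that computes the same thing on an in-memory list (the Doc record and the sample data are made up for illustration; the field names come from the query above):

```java
import java.util.*;
import java.util.stream.*;

public class DistinctCountDemo {
    // Minimal stand-in for one document in json_archives_qc/info
    record Doc(String dataset, String eventNo) {}

    // Group by dataset and count DISTINCT event_no per group,
    // i.e. what the es-sql above asks Elasticsearch to do.
    static Map<String, Long> distinctCounts(List<Doc> docs) {
        return docs.stream().collect(
            Collectors.groupingBy(Doc::dataset,
                Collectors.mapping(Doc::eventNo,
                    Collectors.collectingAndThen(Collectors.toSet(), s -> (long) s.size()))));
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
            new Doc("A", "e1"), new Doc("A", "e1"),  // duplicate event_no collapses
            new Doc("A", "e2"),
            new Doc("B", "e1"));
        System.out.println(distinctCounts(docs));
    }
}
```

Unlike this exact in-memory version, the Elasticsearch counterpart below trades exactness for speed, which is the whole point of the caveats that follow.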
Use the cardinality aggregation (results are exact only up to 40000 distinct values).
DSL:
{
  "from": 0,
  "size": 0,
  "fields": "dataset",
  "aggregations": {
    "dataset": {
      "terms": {
        "field": "dataset",
        "size": 200
      },
      "aggregations": {
        "count": {
          "cardinality": {
            "field": "event_no",
            "precision_threshold": 40000
          }
        }
      }
    }
  }
}
Full code:
SearchRequestBuilder builder = transportClient.prepareSearch("json_archives_qc");
builder.setTypes("info");
builder.setSearchType(SearchType.DFS_QUERY_THEN_FETCH);

// terms bucket per dataset, with a cardinality sub-aggregation on event_no
AggregationBuilder terms = AggregationBuilders.terms("dataset").field("dataset").size(200);
CardinalityBuilder childTerms = AggregationBuilders.cardinality("count").field("event_no").precisionThreshold(40000);
terms.subAggregation(childTerms);
builder.addAggregation(terms);
builder.setSize(0);  // we only need the aggregations, not the hits
builder.setFrom(0);

SearchResponse response = builder.get();
List<Map<String, Object>> resultList = new ArrayList<>();
long num1 = 0;  // running total across all buckets
StringTerms longTerms = response.getAggregations().get("dataset");
for (Terms.Bucket item : longTerms.getBuckets()) {
    // the cardinality sub-aggregation comes back as an InternalCardinality
    InternalCardinality extendedStats = item.getAggregations().get("count");
    Map<String, Object> temp = new HashMap<>();
    temp.put("dataset", item.getKeyAsString());
    temp.put("count", extendedStats.getValue());
    resultList.add(temp);
    num1 += extendedStats.getValue();
}
The key line: InternalCardinality extendedStats = item.getAggregations().get("count"); // retrieves the deduplicated count. The result has to be received as an InternalCardinality; earlier I read the result of item.getAggregations().get("count") without that type and it never worked.
Pros: fast; counts over hundreds of millions of records complete within 1 second.
Cons: the result is only guaranteed exact up to 40000 distinct values; above 40000 the count can be off by roughly 5%, so it is not suitable for scenarios that need exact deduplication.
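To make the ~5% bound above concrete, a tiny sketch of what that error looks like in absolute terms (the numbers here are hypothetical, purely for the arithmetic):

```java
public class CardinalityErrorDemo {
    // Relative error of an approximate count versus the true count
    static double relativeError(long estimated, long actual) {
        return Math.abs(estimated - actual) / (double) actual;
    }

    public static void main(String[] args) {
        // Hypothetical example: 1,000,000 true distinct event_no values,
        // and a cardinality estimate that comes back 47,000 too high.
        long actual = 1_000_000L;
        long estimated = 1_047_000L;
        System.out.printf("relative error: %.1f%%%n", relativeError(estimated, actual) * 100);
    }
}
```

So at million-record scale, a count within the documented bound can still be tens of thousands of records away from the truth.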
Scenarios that require exact counts:
Haven't looked into this yet; I'll follow up in a later post. If you have questions or ideas, feel free to leave a comment and we can discuss.