Redis HyperLogLog
Redis added the HyperLogLog structure in version 2.8.9. HyperLogLog is an algorithm for cardinality estimation (counting distinct elements). Its advantage is that the space needed to compute the cardinality stays fixed, and small, no matter how many elements are added or how large they are.
In Redis, each HyperLogLog key costs only 12 KB of memory to count the cardinality of close to 2^64 distinct elements. This stands in sharp contrast to a set, whose memory use grows with the number of elements it must store.
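The 12 KB figure follows directly from Redis's dense HyperLogLog encoding: 2^14 = 16384 registers of 6 bits each. A quick arithmetic check (the class name here is just for illustration):

```java
public class RedisHllSize {
    // Redis's dense HyperLogLog encoding: 2^14 registers, 6 bits per register.
    public static int registerBytes() {
        int registers = 1 << 14;    // 16384 registers
        int bitsPerRegister = 6;    // each stores a max "rank" value in [0, 63]
        return registers * bitsPerRegister / 8;
    }

    public static void main(String[] args) {
        System.out.println(registerBytes() / 1024 + " KB"); // prints "12 KB"
    }
}
```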
However, because HyperLogLog only uses the input elements to compute the cardinality and does not store the elements themselves, it cannot return the individual input elements the way a set can.
HyperLogLog
A HyperLogLog counter can estimate cardinalities up to Nmax using only log log(Nmax) + O(1) bits. Like linear counting, HyperLogLog lets the designer specify the desired accuracy; in HyperLogLog's case this is done by choosing the desired relative standard deviation and the maximum cardinality expected to be counted. Most such counters work by consuming an input stream M and applying a hash function h, producing as an observable the bit string S = h(M) of {0,1}^∞. Splitting the hashed input into m substreams and maintaining an observable for each one is equivalent to running m separate HyperLogLog counters (each substream is its own small HyperLogLog); averaging these additional observables yields an estimator whose precision improves as m grows, while still requiring only a few operations per input element. The result is a counter that can count one billion distinct elements to within 2% using only 1.5 KB of memory. Compared with the roughly 120 megabytes a HashSet would need for the same task, the efficiency of this algorithm is obvious.
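The scheme described above can be sketched in a few dozen lines. This is a minimal illustrative implementation, not stream-lib's actual code: the class name `MiniHLL` and the FNV-1a hash (with a murmur-style finalizer) are assumptions made for the example. Each element is hashed; the top log2m bits pick a register, and the register keeps the maximum position of the first 1-bit seen in the remaining bits. The estimate is a bias-corrected harmonic mean over all registers, with a linear-counting correction for small cardinalities.

```java
import java.nio.charset.StandardCharsets;

public class MiniHLL {
    private final int log2m;        // number of index bits
    private final byte[] registers; // m = 2^log2m registers, each holds a max rank

    public MiniHLL(int log2m) {
        this.log2m = log2m;
        this.registers = new byte[1 << log2m];
    }

    public void offer(String value) {
        long h = hash64(value);
        int idx = (int) (h >>> (64 - log2m));           // top log2m bits pick the register
        long rest = h << log2m;                         // remaining bits form the observable
        int rank = Long.numberOfLeadingZeros(rest) + 1; // position of the first 1-bit
        if (rank > registers[idx]) {
            registers[idx] = (byte) rank;
        }
    }

    public long cardinality() {
        int m = registers.length;
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);
            if (r == 0) zeros++;
        }
        double alpha = 0.7213 / (1.0 + 1.079 / m);      // bias-correction constant
        double estimate = alpha * m * m / sum;          // normalized harmonic mean
        if (estimate <= 2.5 * m && zeros > 0) {
            estimate = m * Math.log((double) m / zeros); // small-range (linear counting) correction
        }
        return Math.round(estimate);
    }

    // 64-bit FNV-1a plus an avalanche finalizer; any well-mixed 64-bit hash would do.
    private static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xffL);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33; h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33; h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    public static void main(String[] args) {
        MiniHLL hll = new MiniHLL(14);                  // m = 16384 registers
        for (int i = 0; i < 100000; i++) {
            hll.offer("item-" + i);
        }
        System.out.println("estimate = " + hll.cardinality());
    }
}
```

With log2m = 14 the sketch uses 16384 one-byte registers and should land within a few percent of the true count; the stream-lib `HyperLogLog` used in the test cases below packs registers more tightly and adds further bias corrections.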
Error rate
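The error rates observed in the test cases below track the standard HyperLogLog bound: the relative standard error is roughly 1.04/√m, where m = 2^log2m is the number of registers. A small sketch (the class name `HllError` is illustrative):

```java
public class HllError {
    // Expected relative standard error for a HyperLogLog with 2^log2m registers.
    public static double stdError(int log2m) {
        return 1.04 / Math.sqrt(1 << log2m);
    }

    public static void main(String[] args) {
        for (int log2m : new int[] {12, 15, 18, 21, 25}) {
            System.out.printf("log2m=%d -> expected error %.5f%%%n",
                    log2m, stdError(log2m) * 100);
        }
    }
}
```

For example, log2m = 15 gives 1.04/√32768 ≈ 0.58%, close to the 0.57762% measured in the memory test below; each increase of log2m by 2 halves the expected error while quadrupling memory.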
Merge test case
import java.io.IOException;
import java.text.DecimalFormat;
import java.text.NumberFormat;

import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
import com.clearspring.analytics.stream.cardinality.HyperLogLog;

public class HyperLogLogMerge {
    private static NumberFormat nf = new DecimalFormat("####.#####%");

    public static void main(String[] args) throws IOException, CardinalityMergeException {
        int record = 5000000;
        HyperLogLog aHyperLogLog = exec(18, record, "xx");
        HyperLogLog bHyperLogLog = exec(18, record, "yy");
        HyperLogLog cHyperLogLog = exec(18, 1200, "yy");
        // perform the merge
        HyperLogLog tHyperLogLog1 = (HyperLogLog) aHyperLogLog.merge(bHyperLogLog);
        // merge in duplicate data (c's keys are a subset of b's)
        HyperLogLog tHyperLogLog2 = (HyperLogLog) tHyperLogLog1.merge(cHyperLogLog);
        System.out.printf("merge total1 = %d \n", tHyperLogLog1.cardinality());
        System.out.printf("merge total2 = %d \n", tHyperLogLog2.cardinality());
    }

    public static HyperLogLog exec(int log2m, int record, String keySuffix) throws IOException {
        HyperLogLog hyperLogLog1 = new HyperLogLog(log2m);
        for (int i = 0; i < record; i++) {
            hyperLogLog1.offer(i + "haperloglog" + keySuffix);
        }
        double errorRate = (double) (record - hyperLogLog1.cardinality()) / (double) record;
        System.out.printf("log2m=%d ,object.size=%f kb ,cardinality/record=%d/%d ,errorRate=%s \n",
                log2m, ((double) hyperLogLog1.getBytes().length) / 1024d,
                hyperLogLog1.cardinality(), record, nf.format(errorRate));
        return hyperLogLog1;
    }
}

Output:

log2m=18 ,object.size=170.675781 kb ,cardinality/record=4987682/5000000 ,errorRate=0.24636%
log2m=18 ,object.size=170.675781 kb ,cardinality/record=4989420/5000000 ,errorRate=0.2116%
log2m=18 ,object.size=170.675781 kb ,cardinality/record=1201/1200 ,errorRate=-0.083333%
merge total1 = 10004502
merge total2 = 10004502
Memory usage test case
import java.io.IOException;
import java.text.DecimalFormat;
import java.text.NumberFormat;

import com.clearspring.analytics.stream.cardinality.HyperLogLog;

public class HyperLogLogzMemory {
    private static NumberFormat nf = new DecimalFormat("####.#####%");

    public static void main(String[] args) throws IOException {
        int record = 5000000;
        exec(12, record);
        exec(15, record);
        exec(18, record);
        exec(21, record);
        exec(25, record);
    }

    public static void exec(int log2m, int record) throws IOException {
        HyperLogLog hyperLogLog1 = new HyperLogLog(log2m);
        for (int i = 0; i < record; i++) {
            hyperLogLog1.offer(i + "haperloglog");
        }
        double errorRate = (double) (record - hyperLogLog1.cardinality()) / (double) record;
        System.out.printf("log2m=%d ,object.size=%f kb ,cardinality/record=%d/%d ,errorRate=%s \n",
                log2m, ((double) hyperLogLog1.getBytes().length) / 1024d,
                hyperLogLog1.cardinality(), record, nf.format(errorRate));
    }
}

Output:

log2m=12 ,object.size=2.675781 kb ,cardinality/record=4901798/5000000 ,errorRate=1.96404%
log2m=15 ,object.size=21.343750 kb ,cardinality/record=4971119/5000000 ,errorRate=0.57762%
log2m=18 ,object.size=170.675781 kb ,cardinality/record=4991409/5000000 ,errorRate=0.17182%
log2m=21 ,object.size=1365.343750 kb ,cardinality/record=4992590/5000000 ,errorRate=0.1482%
log2m=25 ,object.size=21845.343750 kb ,cardinality/record=5000356/5000000 ,errorRate=-0.00712%
Adding the HyperLogLog dependency
<dependency>
    <groupId>com.clearspring.analytics</groupId>
    <artifactId>stream</artifactId>
    <version>2.7.0</version>
</dependency>
References
http://www.redis.net.cn/tutorial/3513.html
https://github.com/addthis/stream-lib
http://blog.csdn.net/hguisu/article/details/8433731
http://highscalability.com/blog/2012/4/5/big-data-counting-how-to-count-a-billion-distinct-objects-us.html