BloomFilter
A simple code example
First, let's look at how Guava's BloomFilter is used.
Add the dependency
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>27.0.1-jre</version>
</dependency>
A simple implementation
package com.example.demo;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

import java.nio.charset.StandardCharsets;

@SpringBootApplication
public class DemoApplication {
    public static void main(String[] args) {
        // Expect 10000 insertions with a false positive probability of 0.0001
        BloomFilter<CharSequence> bloomFilter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                10000, 0.0001);
        // Insert the strings "0" to "4999"
        for (int i = 0; i < 5000; i++) {
            bloomFilter.put("" + i);
        }
        System.out.println("Finished writing data");
        // Query "0" to "9999"; only the first half was inserted
        for (int i = 0; i < 10000; i++) {
            if (bloomFilter.mightContain("" + i)) {
                System.out.println(i + " might exist");
            } else {
                System.out.println(i + " does not exist");
            }
        }
        SpringApplication.run(DemoApplication.class, args);
    }
}
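Under the hood, Guava stores the bits in an array of long values.
Source code analysis
Guava's Bloom filter involves two classes: BloomFilter and BloomFilterStrategies.
Without further ado, let's go straight to the source.
BloomFilter has four fields: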
/** The bit set of the BloomFilter (not necessarily power of 2!) */
private final LockFreeBitArray bits;
/** Number of hashes per element */
private final int numHashFunctions;
/** The funnel to translate Ts to bytes */
private final Funnel<? super T> funnel;
/** The strategy we employ to map an element T to {@code numHashFunctions} bit indexes. */
private final Strategy strategy;
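- Funnel is an interface defined in Guava that works together with PrimitiveSink. It breaks an object of any type down into Java primitive values (char, byte, int, ...). By default the sink is backed by java.nio.ByteBuffer, and everything ends up as a byte array.
- Strategy is an interface nested inside the BloomFilter class. It has three methods: put (add an element), mightContain (test whether an element may be present), and ordinal. It is implemented by BloomFilterStrategies, which is an enum.
- numHashFunctions is the number of hash functions.
- LockFreeBitArray wraps the operations on the bit array, such as setting a bit to 1 and reporting the size in bits. (It is defined inside BloomFilterStrategies.)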
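To make the "bits stored in long values" idea concrete, here is a minimal sketch of a bit array backed by long[]. It is a simplified, single-threaded illustration only: SimpleBitArray is a hypothetical name, and Guava's LockFreeBitArray is additionally lock-free and thread-safe, which this sketch does not attempt.

```java
// Simplified, single-threaded illustration of a bit array stored in long words.
// SimpleBitArray is a hypothetical name; it is not Guava's LockFreeBitArray.
class SimpleBitArray {
    private final long[] data; // each long holds 64 bits
    private long bitCount;     // how many bits are currently 1

    SimpleBitArray(long bits) {
        if (bits <= 0) {
            throw new IllegalArgumentException("bits must be positive");
        }
        // Round up to a whole number of 64-bit words.
        this.data = new long[(int) ((bits + 63) / 64)];
    }

    /** Sets the bit at index; returns true if the bit changed from 0 to 1. */
    boolean set(long index) {
        int word = (int) (index >>> 6); // index / 64 selects the word
        long mask = 1L << index;        // a long shift uses only the low 6 bits of index
        if ((data[word] & mask) != 0) {
            return false;
        }
        data[word] |= mask;
        bitCount++;
        return true;
    }

    /** Returns true if the bit at index is 1. */
    boolean get(long index) {
        return (data[(int) (index >>> 6)] & (1L << index)) != 0;
    }

    /** Capacity in bits: always a multiple of 64, not necessarily a power of 2. */
    long bitSize() {
        return (long) data.length * Long.SIZE;
    }

    long bitCount() {
        return bitCount;
    }
}
```

Because the capacity is rounded up to whole words, asking for 100 bits gives a bitSize of 128 in this sketch.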
create
static <T> BloomFilter<T> create(
Funnel<? super T> funnel, long expectedInsertions, double fpp, Strategy strategy) {
checkNotNull(funnel);
checkArgument(
expectedInsertions >= 0, "Expected insertions (%s) must be >= 0", expectedInsertions);
checkArgument(fpp > 0.0, "False positive probability (%s) must be > 0.0", fpp);
checkArgument(fpp < 1.0, "False positive probability (%s) must be < 1.0", fpp);
checkNotNull(strategy);
if (expectedInsertions == 0) {
expectedInsertions = 1;
}
/*
* TODO(user): Put a warning in the javadoc about tiny fpp values, since the resulting size
* is proportional to -log(p), but there is not much of a point after all, e.g.
* optimalM(1000, 0.0000000000000001) = 76680 which is less than 10kb. Who cares!
*/
long numBits = optimalNumOfBits(expectedInsertions, fpp);
int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);
try {
return new BloomFilter<T>(new LockFreeBitArray(numBits), numHashFunctions, funnel, strategy);
} catch (IllegalArgumentException e) {
throw new IllegalArgumentException("Could not create BloomFilter of " + numBits + " bits", e);
}
}
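expectedInsertions is the expected number of elements.
fpp is the acceptable false positive probability.
Computing the number of bits m
This is a plain formula: m = -n * ln(p) / (ln 2)^2.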
static long optimalNumOfBits(long n, double p) {
if (p == 0) {
p = Double.MIN_VALUE;
}
return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
}
Number of hash functions
Again just a formula: k = (m / n) * ln 2, rounded and at least 1.
static int optimalNumOfHashFunctions(long n, long m) {
// (m / n) * log(2), but avoid truncation due to division!
return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
}
For how these formulas are derived, see
https://editor.csdn.net/md/?articleId=105003199
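As a worked example, the two formulas can be evaluated for the parameters used in the demo at the top (10000 expected insertions, fpp 0.0001). The sketch below copies the two method bodies shown above into a standalone class; BloomSizing is a hypothetical name used only for illustration.

```java
// Standalone copy of the two sizing formulas, for experimentation.
// BloomSizing is a hypothetical class name.
class BloomSizing {
    // m = -n * ln(p) / (ln 2)^2
    static long optimalNumOfBits(long n, double p) {
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // k = max(1, round(m / n * ln 2))
    static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 10_000;
        double p = 0.0001;
        long m = optimalNumOfBits(n, p);
        int k = optimalNumOfHashFunctions(n, m);
        // prints "m = 191701 bits, k = 13"
        System.out.println("m = " + m + " bits, k = " + k);
    }
}
```

So the demo filter needs roughly 191701 bits (about 23 KB) and 13 hash functions. The same method reproduces the value quoted in the TODO comment of create: optimalNumOfBits(1000, 1e-16) is 76680.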
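Storing an element (put)
Guava stores elements through BloomFilterStrategies. It is an enum with two constants, one using a 32-bit hash mapping and the other a 64-bit hash mapping.
The 32-bit mapping uses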
long hash64=Hashing.murmur3_128().hashObject(object, funnel).asLong();
to obtain a hash value of type long, then takes its low 32 bits and its high 32 bits as two separate hash values.
long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
int hash1 = (int) hash64;
int hash2 = (int) (hash64 >>> 32);
The 64-bit mapping uses
byte[] bytes = Hashing.murmur3_128().hashObject(object,funnel).getBytesInternal();
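to obtain a byte array of length 16, i.e. 128 bits in total. hash1 is built from the first 8 bytes (64 bits) and hash2 from the last 8 bytes (64 bits).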
byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
long hash1 = lowerEight(bytes);
long hash2 = upperEight(bytes);
private /* static */ long lowerEight(byte[] bytes) {
return Longs.fromBytes(
bytes[7], bytes[6], bytes[5], bytes[4], bytes[3], bytes[2], bytes[1], bytes[0]);
}
private /* static */ long upperEight(byte[] bytes) {
return Longs.fromBytes(
bytes[15], bytes[14], bytes[13], bytes[12], bytes[11], bytes[10], bytes[9], bytes[8]);
}
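Why take the two hash values from the high and low bits?
If a single hash computation already yields two hash values, there is no need to run the hash twice, which saves time.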
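The splitting step can be tried out with the JDK alone. The sketch below is an illustration under assumptions: HashSplitting, low32 and high32 are hypothetical names, the input values are made up rather than real murmur3_128 output, and ByteBuffer in little-endian order is used as a stand-in that reads the bytes in the same order as the Longs.fromBytes calls above (bytes[7] and bytes[15] are the most significant bytes).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrates how one hash result is split into two hash values.
// HashSplitting, low32 and high32 are hypothetical names.
class HashSplitting {
    // 32-bit strategy: low half of the 64-bit hash
    static int low32(long hash64) {
        return (int) hash64;
    }

    // 32-bit strategy: high half of the 64-bit hash
    static int high32(long hash64) {
        return (int) (hash64 >>> 32);
    }

    // 64-bit strategy: bytes 0..7 read as a little-endian long
    static long lowerEight(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong(0);
    }

    // 64-bit strategy: bytes 8..15 read as a little-endian long
    static long upperEight(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong(8);
    }

    public static void main(String[] args) {
        long hash64 = 0x1122334455667788L; // made-up value, not a real murmur3 hash
        System.out.println(Integer.toHexString(low32(hash64)));  // prints 55667788
        System.out.println(Integer.toHexString(high32(hash64))); // prints 11223344

        byte[] bytes = new byte[16];       // made-up 128-bit "hash": bytes 0x00..0x0f
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) i;
        }
        System.out.println(Long.toHexString(lowerEight(bytes)));
        System.out.println(Long.toHexString(upperEight(bytes)));
    }
}
```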
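Deriving multiple hash values and storing them
The 32-bit strategy derives the hash values as hash1 + i*hash2. When the result is negative, all of its bits are flipped with ~. This is not the same as taking the absolute value, but it always yields a non-negative number.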
for (int i = 1; i <= numHashFunctions; i++) {
int combinedHash = hash1 + (i * hash2);
// Flip all the bits if it's negative (guaranteed positive number)
if (combinedHash < 0) {
combinedHash = ~combinedHash;
}
bitsChanged |= bits.set(combinedHash % bitSize);
}
The 64-bit strategy derives the hash values by repeatedly adding hash2 (combinedHash += hash2).
long combinedHash = hash1;
for (int i = 0; i < numHashFunctions; i++) {
// Make the combined hash positive and indexable
bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
combinedHash += hash2;
}
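The 32-bit and 64-bit mappings derive the additional hash values in the same way; only the code is written differently. The small difference is that in the 32-bit strategy the first hash is hash1+hash2, the second is hash1+2*hash2, and so on, while in the 64-bit strategy the first is hash1, the second is hash1+hash2, and the third is hash1+2*hash2.
It feels as if two different people wrote them, ha. The Javadoc on MURMUR128_MITZ_64 gives the actual reason: it avoids the multiplication in the loop by doing a simpler += hash2, and it makes the index positive by AND'ing with Long.MAX_VALUE instead of flipping the bits.
The AND with Long.MAX_VALUE (0111...1) guards against the sum overflowing into a negative value: it clears the sign bit, so the index is always non-negative.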
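The two loops can be compared side by side with small made-up hash values. This is a sketch for illustration: IndexDerivation, indexes32 and indexes64 are hypothetical names, and the methods only collect the bit indexes instead of setting bits. The main method also shows one reason flipping the bits is safer than Math.abs: Math.abs(Integer.MIN_VALUE) is still negative.

```java
// Collects the bit indexes each strategy would touch, using the loops shown above.
// IndexDerivation, indexes32 and indexes64 are hypothetical names.
class IndexDerivation {
    // 32-bit strategy: hash1 + i*hash2 for i = 1..k, bits flipped if negative
    static long[] indexes32(int hash1, int hash2, int k, long bitSize) {
        long[] out = new long[k];
        for (int i = 1; i <= k; i++) {
            int combinedHash = hash1 + (i * hash2);
            if (combinedHash < 0) {
                combinedHash = ~combinedHash; // always non-negative afterwards
            }
            out[i - 1] = combinedHash % bitSize;
        }
        return out;
    }

    // 64-bit strategy: hash1 + i*hash2 for i = 0..k-1, sign bit cleared
    static long[] indexes64(long hash1, long hash2, int k, long bitSize) {
        long[] out = new long[k];
        long combinedHash = hash1;
        for (int i = 0; i < k; i++) {
            out[i] = (combinedHash & Long.MAX_VALUE) % bitSize;
            combinedHash += hash2;
        }
        return out;
    }

    public static void main(String[] args) {
        // With hash1 = 10, hash2 = 3: the 32-bit loop starts at hash1 + hash2,
        // the 64-bit loop starts at hash1.
        System.out.println(java.util.Arrays.toString(indexes32(10, 3, 3, 100))); // [13, 16, 19]
        System.out.println(java.util.Arrays.toString(indexes64(10, 3, 3, 100))); // [10, 13, 16]

        System.out.println(Math.abs(Integer.MIN_VALUE)); // -2147483648, still negative
        System.out.println(~Integer.MIN_VALUE);          // 2147483647
    }
}
```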
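Why derive the hash values by addition?
Division doesn't work: what happens once the value reaches 0?
Subtraction isn't a good fit either: the result quickly goes negative and cannot be used directly as a bit position.
Multiplication: it overflows easily. Fixing up the sign after an overflow seems harmless, but it feels hard to control, the positions jump too far apart, and constantly repeating the sign fix-up wastes performance.
The hash1 + i*hash2 scheme itself is the double-hashing technique from Kirsch and Mitzenmacher (hence MITZ in the strategy names), which needs only two base hash values to simulate k hash functions.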
Checking whether a key exists (mightContain)
32-bit
@Override
public <T> boolean mightContain(
T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
long bitSize = bits.bitSize();
long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
int hash1 = (int) hash64;
int hash2 = (int) (hash64 >>> 32);
for (int i = 1; i <= numHashFunctions; i++) {
int combinedHash = hash1 + (i * hash2);
// Flip all the bits if it's negative (guaranteed positive number)
if (combinedHash < 0) {
combinedHash = ~combinedHash;
}
if (!bits.get(combinedHash % bitSize)) {
return false;
}
}
return true;
}
64-bit
@Override
public <T> boolean mightContain(
T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
long bitSize = bits.bitSize();
byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
long hash1 = lowerEight(bytes);
long hash2 = upperEight(bytes);
long combinedHash = hash1;
for (int i = 0; i < numHashFunctions; i++) {
// Make the combined hash positive and indexable
if (!bits.get((combinedHash & Long.MAX_VALUE) % bitSize)) {
return false;
}
combinedHash += hash2;
}
return true;
}
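Putting the pieces together, the put and mightContain logic of the 64-bit strategy can be sketched without Guava. This is an illustrative sketch, not Guava's implementation: MiniBloomFilter, fnv1a64 and mix are hypothetical names, FNV-1a followed by a splitmix64-style mixing step stands in for murmur3_128, and java.util.BitSet stands in for LockFreeBitArray (so the size is limited to an int).

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Minimal Bloom filter for strings, mirroring the 64-bit strategy's loops.
// Assumptions: FNV-1a + a mixing step replace murmur3_128; BitSet replaces LockFreeBitArray.
class MiniBloomFilter {
    private final BitSet bits;
    private final int bitSize;
    private final int numHashFunctions;

    MiniBloomFilter(int expectedInsertions, double fpp) {
        int n = Math.max(1, expectedInsertions);
        double ln2 = Math.log(2);
        // Same sizing formulas as optimalNumOfBits / optimalNumOfHashFunctions
        this.bitSize = Math.max(1, (int) (-n * Math.log(fpp) / (ln2 * ln2)));
        this.numHashFunctions = Math.max(1, (int) Math.round((double) bitSize / n * ln2));
        this.bits = new BitSet(bitSize);
    }

    // Stand-in base hash: 64-bit FNV-1a over the UTF-8 bytes (NOT murmur3_128)
    private static long fnv1a64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Stand-in second hash: splitmix64-style mixing of the first hash
    private static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    /** Sets k bits for the element; returns true if any bit changed. */
    boolean put(String element) {
        long hash1 = fnv1a64(element);
        long hash2 = mix(hash1);
        boolean bitsChanged = false;
        long combinedHash = hash1;
        for (int i = 0; i < numHashFunctions; i++) {
            int index = (int) ((combinedHash & Long.MAX_VALUE) % bitSize);
            bitsChanged |= !bits.get(index);
            bits.set(index);
            combinedHash += hash2;
        }
        return bitsChanged;
    }

    /** Returns false if the element was definitely never added, true if it may have been. */
    boolean mightContain(String element) {
        long hash1 = fnv1a64(element);
        long hash2 = mix(hash1);
        long combinedHash = hash1;
        for (int i = 0; i < numHashFunctions; i++) {
            int index = (int) ((combinedHash & Long.MAX_VALUE) % bitSize);
            if (!bits.get(index)) {
                return false;
            }
            combinedHash += hash2;
        }
        return true;
    }
}
```

Whatever hash function is plugged in, every element that was put is reported as possibly present, because mightContain checks exactly the bits that put set. The quality of the hash only affects how often elements that were never added are reported as present.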
Full code of the two enum constants
MURMUR128_MITZ_32() {
@Override
public <T> boolean put(
T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
long bitSize = bits.bitSize();
long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
int hash1 = (int) hash64;
int hash2 = (int) (hash64 >>> 32);
boolean bitsChanged = false;
for (int i = 1; i <= numHashFunctions; i++) {
int combinedHash = hash1 + (i * hash2);
// Flip all the bits if it's negative (guaranteed positive number)
if (combinedHash < 0) {
combinedHash = ~combinedHash;
}
bitsChanged |= bits.set(combinedHash % bitSize);
}
return bitsChanged;
}
@Override
public <T> boolean mightContain(
T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
long bitSize = bits.bitSize();
long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
int hash1 = (int) hash64;
int hash2 = (int) (hash64 >>> 32);
for (int i = 1; i <= numHashFunctions; i++) {
int combinedHash = hash1 + (i * hash2);
// Flip all the bits if it's negative (guaranteed positive number)
if (combinedHash < 0) {
combinedHash = ~combinedHash;
}
if (!bits.get(combinedHash % bitSize)) {
return false;
}
}
return true;
}
},
/**
* This strategy uses all 128 bits of {@link Hashing#murmur3_128} when hashing. It looks different
* than the implementation in MURMUR128_MITZ_32 because we're avoiding the multiplication in the
* loop and doing a (much simpler) += hash2. We're also changing the index to a positive number by
* AND'ing with Long.MAX_VALUE instead of flipping the bits.
*/
MURMUR128_MITZ_64() {
@Override
public <T> boolean put(
T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
long bitSize = bits.bitSize();
byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
long hash1 = lowerEight(bytes);
long hash2 = upperEight(bytes);
boolean bitsChanged = false;
long combinedHash = hash1;
for (int i = 0; i < numHashFunctions; i++) {
// Make the combined hash positive and indexable
bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
combinedHash += hash2;
}
return bitsChanged;
}
@Override
public <T> boolean mightContain(
T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
long bitSize = bits.bitSize();
byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
long hash1 = lowerEight(bytes);
long hash2 = upperEight(bytes);
long combinedHash = hash1;
for (int i = 0; i < numHashFunctions; i++) {
// Make the combined hash positive and indexable
if (!bits.get((combinedHash & Long.MAX_VALUE) % bitSize)) {
return false;
}
combinedHash += hash2;
}
return true;
}
private /* static */ long lowerEight(byte[] bytes) {
return Longs.fromBytes(
bytes[7], bytes[6], bytes[5], bytes[4], bytes[3], bytes[2], bytes[1], bytes[0]);
}
private /* static */ long upperEight(byte[] bytes) {
return Longs.fromBytes(
bytes[15], bytes[14], bytes[13], bytes[12], bytes[11], bytes[10], bytes[9], bytes[8]);
}
};
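Summary
- The BloomFilter class takes the input, uses the formulas to estimate the parameters (number of bits and number of hash functions), and delegates put and mightContain to a Strategy instance.
- BloomFilterStrategies is an enum with two constants that implement the Strategy interface, MURMUR128_MITZ_32 and MURMUR128_MITZ_64. It also wraps an array of long values (in LockFreeBitArray) as the Bloom filter's underlying bit array, whose get and set methods carry out the core bit operations.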
HashCode
HashCode has three implementing classes, IntHashCode, LongHashCode, and BytesHashCode. Their hash values are an int (32 bits), a long (64 bits), and a byte[] respectively.