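BloomFilter

Simple code example
Source code analysis

create
Computing the number of bits m
Number of hash functions
Inserting an element (put)

Why split one hash into high and low halves to get two hash values
Deriving multiple hash values and storing them

Why derive multiple hash values by addition
Checking whether a key exists (mightContain)
Full code of the two enum constants

Summary

HashCode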

Simple code example

First, let's look at how Guava's BloomFilter is used.
Add the dependency

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>27.0.1-jre</version>
</dependency>

A simple example

package com.example.demo;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

import java.nio.charset.StandardCharsets;

@SpringBootApplication
public class DemoApplication {

    public static void main(String[] args) {
        // Expect about 10,000 insertions with a target false positive probability of 0.0001.
        BloomFilter<CharSequence> bloomFilter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                10000, 0.0001);

        // Insert the strings "0" through "4999".
        for (int i = 0; i < 5000; i++) {
            bloomFilter.put("" + i);
        }
        System.out.println("Finished writing data");

        // Query "0" through "9999": the first half must be reported as present;
        // the second half is reported as absent except for rare false positives.
        for (int i = 0; i < 10000; i++) {
            if (bloomFilter.mightContain("" + i)) {
                System.out.println(i + " might exist");
            } else {
                System.out.println(i + " does not exist");
            }
        }

        SpringApplication.run(DemoApplication.class, args);
    }

}

Under the hood, Guava stores the bits in an array of long values.

Source code analysis

Guava's Bloom filter involves two classes: BloomFilter and BloomFilterStrategies.

Without further ado, here is the source.
BloomFilter has four fields:

  /** The bit set of the BloomFilter (not necessarily power of 2!) */
  private final LockFreeBitArray bits;

  /** Number of hashes per element */
  private final int numHashFunctions;

  /** The funnel to translate Ts to bytes */
  private final Funnel<? super T> funnel;

  /** The strategy we employ to map an element T to {@code numHashFunctions} bit indexes. */
  private final Strategy strategy;

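funnel: Funnel is an interface defined in Guava and is used together with PrimitiveSink. Its job is to break an object of any type down into Java primitive values (char, byte, int, ...). By default this is implemented with java.nio.ByteBuffer, and everything ultimately ends up as a byte array.
strategy: Strategy is an interface defined inside the BloomFilter class with three methods: put (insert an element), mightContain (test whether an element may be present), and ordinal. It is implemented by BloomFilterStrategies, which is an enum.
numHashFunctions: the number of hash functions.
bits: LockFreeBitArray wraps the operations on the bit array, such as setting a bit to 1 and reporting the size in bits. (It is defined in BloomFilterStrategies.)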
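
To make the role of the bit array concrete, here is a minimal sketch of a bit array stored in long words. It is not Guava's LockFreeBitArray: the class name SimpleBitArray is hypothetical, and the sketch is deliberately not thread-safe. It only illustrates how a bit index maps to a word and a mask.

```java
/** Simplified, non-thread-safe bit array backed by 64-bit words (illustrative sketch only). */
final class SimpleBitArray {
    private final long[] words;
    private final long bitSize;

    SimpleBitArray(long bits) {
        if (bits <= 0) {
            throw new IllegalArgumentException("bits must be positive");
        }
        // Round up to a whole number of 64-bit words.
        int numWords = (int) ((bits + 63) / 64);
        this.words = new long[numWords];
        this.bitSize = (long) numWords * 64;
    }

    /** Sets the bit at index and returns true only if the bit was previously 0. */
    boolean set(long index) {
        int word = (int) (index >>> 6);   // index / 64 selects the word
        long mask = 1L << index;          // a long shift uses only the low 6 bits of index
        if ((words[word] & mask) != 0) {
            return false;
        }
        words[word] |= mask;
        return true;
    }

    /** Returns true if the bit at index is 1. */
    boolean get(long index) {
        return (words[(int) (index >>> 6)] & (1L << index)) != 0;
    }

    /** Number of addressable bits, always a multiple of 64 in this sketch. */
    long bitSize() {
        return bitSize;
    }
}
```

Because the size is rounded up to whole words, asking for 100 bits yields 128 addressable bits in this sketch.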

create

static <T> BloomFilter<T> create(
      Funnel<? super T> funnel, long expectedInsertions, double fpp, Strategy strategy) {
    checkNotNull(funnel);
    checkArgument(
        expectedInsertions >= 0, "Expected insertions (%s) must be >= 0", expectedInsertions);
    checkArgument(fpp > 0.0, "False positive probability (%s) must be > 0.0", fpp);
    checkArgument(fpp < 1.0, "False positive probability (%s) must be < 1.0", fpp);
    checkNotNull(strategy);

    if (expectedInsertions == 0) {
      expectedInsertions = 1;
    }
    /*
     * TODO(user): Put a warning in the javadoc about tiny fpp values, since the resulting size
     * is proportional to -log(p), but there is not much of a point after all, e.g.
     * optimalM(1000, 0.0000000000000001) = 76680 which is less than 10kb. Who cares!
     */
    long numBits = optimalNumOfBits(expectedInsertions, fpp);
    int numHashFunctions = optimalNumOfHashFunctions(expectedInsertions, numBits);
    try {
      return new BloomFilter<T>(new LockFreeBitArray(numBits), numHashFunctions, funnel, strategy);
    } catch (IllegalArgumentException e) {
      throw new IllegalArgumentException("Could not create BloomFilter of " + numBits + " bits", e);
    }
  }

expectedInsertions: the expected number of elements to be inserted
fpp: the acceptable false positive probability

Computing the number of bits m

This is a single mathematical formula:

  static long optimalNumOfBits(long n, double p) {
    if (p == 0) {
      p = Double.MIN_VALUE;
    }
    return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }
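
Written out as a formula, the return statement above computes the following, where n stands for expectedInsertions and p for fpp. The numeric plug-in uses the parameters of the demo earlier in this post and is rounded, so treat it as an approximation.

```latex
m = -\frac{n \ln p}{(\ln 2)^2}
\qquad\text{e.g.}\qquad
n = 10^{4},\; p = 10^{-4}
\;\Rightarrow\;
m \approx \frac{9.21 \times 10^{4}}{0.4805} \approx 1.92 \times 10^{5}\ \text{bits}
```

That is roughly 24 kB of memory for ten thousand expected elements.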

Number of hash functions

Again, just a mathematical formula:

  static int optimalNumOfHashFunctions(long n, long m) {
    // (m / n) * log(2), but avoid truncation due to division!
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
  }

For how these formulas are derived, see:
https://editor.csdn.net/md/?articleId=105003199
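
To get a feel for the numbers, the two formulas can be evaluated outside Guava. The sketch below copies the arithmetic from the two snippets above into a hypothetical class named BloomSizing and applies it to the demo's parameters (10,000 expected insertions, fpp 0.0001).

```java
/** Standalone copy of the two sizing formulas, for experimentation only. */
final class BloomSizing {

    /** m = -n * ln(p) / (ln 2)^2, truncated to a long. */
    static long optimalNumOfBits(long n, double p) {
        if (p == 0) {
            p = Double.MIN_VALUE;
        }
        return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    /** k = round(m / n * ln 2), but never fewer than one hash function. */
    static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 10_000;
        double p = 0.0001;
        long m = optimalNumOfBits(n, p);
        int k = optimalNumOfHashFunctions(n, m);
        System.out.println("bits m = " + m + ", hash functions k = " + k);
    }
}
```

For these inputs m comes out at a little over 190,000 bits and k at 13, so each put or mightContain call touches 13 bit positions.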

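Inserting an element (put)

Guava implements element insertion through BloomFilterStrategies. It is an enum with two constants, one for a 32-bit hash mapping and one for a 64-bit hash mapping.

The 32-bit mapping calls
long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
to obtain a long hash value, then takes its low 32 bits and its high 32 bits as two separate hash values.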

      long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
      int hash1 = (int) hash64;
      int hash2 = (int) (hash64 >>> 32);
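
The split itself is plain integer arithmetic and can be tried without Guava. In the sketch below the class name SplitHash and the sample value are made up for illustration; the value is not a real Murmur3 output.

```java
/** Splits a 64-bit value into its low and high 32-bit halves. */
final class SplitHash {

    /** Low 32 bits: a narrowing cast simply drops the upper half. */
    static int low32(long hash64) {
        return (int) hash64;
    }

    /** High 32 bits: unsigned shift right by 32, then narrow. */
    static int high32(long hash64) {
        return (int) (hash64 >>> 32);
    }

    public static void main(String[] args) {
        long hash64 = 0x123456789ABCDEF0L; // arbitrary sample value
        // prints "hash1 = 0x9ABCDEF0, hash2 = 0x12345678"
        System.out.printf("hash1 = 0x%08X, hash2 = 0x%08X%n", low32(hash64), high32(hash64));
    }
}
```

Note that hash1 is negative here because its top bit is set, which is exactly the case the later sign handling has to deal with.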

The 64-bit mapping calls
byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
to obtain a 16-byte array, 128 bits in total. hash1 is built from the first 8 bytes (64 bits) and hash2 from the last 8 bytes (64 bits).

byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
long hash1 = lowerEight(bytes);
long hash2 = upperEight(bytes);
    
private /* static */ long lowerEight(byte[] bytes) {
      return Longs.fromBytes(
          bytes[7], bytes[6], bytes[5], bytes[4], bytes[3], bytes[2], bytes[1], bytes[0]);
}

private /* static */ long upperEight(byte[] bytes) {
      return Longs.fromBytes(
          bytes[15], bytes[14], bytes[13], bytes[12], bytes[11], bytes[10], bytes[9], bytes[8]);
}
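
Longs.fromBytes takes its arguments from most significant to least significant byte, so passing bytes[7] first means bytes[0] becomes the lowest byte, i.e. a little-endian read. The following JDK-only sketch (class name EightBytes is hypothetical) reproduces that interpretation with java.nio.ByteBuffer so it can be checked on a known input.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Little-endian reads of the two 8-byte halves of a 16-byte hash. */
final class EightBytes {

    /** Interprets bytes[0..7] as a little-endian long. */
    static long lowerEight(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong(0);
    }

    /** Interprets bytes[8..15] as a little-endian long. */
    static long upperEight(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong(8);
    }

    public static void main(String[] args) {
        byte[] bytes = new byte[16];
        for (int i = 0; i < 16; i++) {
            bytes[i] = (byte) (i + 1); // 0x01, 0x02, ..., 0x10
        }
        // prints "0807060504030201 100f0e0d0c0b0a09"
        System.out.println(Long.toHexString(lowerEight(bytes)) + " "
                + Long.toHexString(upperEight(bytes)));
    }
}
```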

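Why split one hash into high and low halves to get two hash values

If a single hash computation already yields two hash values, there is no need to run two separate computations. This saves time.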

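Deriving multiple hash values and storing them

The 32-bit strategy derives its hash values as hash1 + i * hash2. When the result is negative, all of its bits are flipped (~combinedHash, which is close to but not the same as the absolute value) so that the index is non-negative.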

      for (int i = 1; i <= numHashFunctions; i++) {
        int combinedHash = hash1 + (i * hash2);
        // Flip all the bits if it's negative (guaranteed positive number)
        if (combinedHash < 0) {
          combinedHash = ~combinedHash;
        }
        bitsChanged |= bits.set(combinedHash % bitSize);
      }

The 64-bit strategy starts from hash1 and derives each further hash value by adding hash2 (combinedHash += hash2):

      long combinedHash = hash1;
      for (int i = 0; i < numHashFunctions; i++) {
        // Make the combined hash positive and indexable
        bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
        combinedHash += hash2;
      }

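The 32-bit and 64-bit mappings derive the additional hash values in the same way; only the code is written differently. The one small difference is the starting point: in the 32-bit version the first hash is hash1 + hash2, the second is hash1 + 2*hash2, and so on, while in the 64-bit version the first is hash1, the second is hash1 + hash2, and the third is hash1 + 2*hash2.

To me it honestly looks as if the two were written by different people, haha.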
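
The off-by-one between the two loops is easy to see with small numbers. The sketch below copies both loops into a hypothetical class named IndexSequences and returns the bit indexes instead of setting bits; the inputs hash1 = 5, hash2 = 7 and bitSize = 100 are arbitrary.

```java
import java.util.Arrays;

/** Reproduces the index sequences of the two strategies for given hash1/hash2. */
final class IndexSequences {

    /** 32-bit style: hash1 + i*hash2 for i = 1..k, negative values bit-flipped. */
    static long[] indexes32(int hash1, int hash2, int k, long bitSize) {
        long[] out = new long[k];
        for (int i = 1; i <= k; i++) {
            int combinedHash = hash1 + (i * hash2);
            if (combinedHash < 0) {
                combinedHash = ~combinedHash;
            }
            out[i - 1] = combinedHash % bitSize;
        }
        return out;
    }

    /** 64-bit style: hash1, hash1+hash2, hash1+2*hash2, ... with the sign bit masked off. */
    static long[] indexes64(long hash1, long hash2, int k, long bitSize) {
        long[] out = new long[k];
        long combinedHash = hash1;
        for (int i = 0; i < k; i++) {
            out[i] = (combinedHash & Long.MAX_VALUE) % bitSize;
            combinedHash += hash2;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(indexes32(5, 7, 3, 100))); // prints [12, 19, 26]
        System.out.println(Arrays.toString(indexes64(5, 7, 3, 100))); // prints [5, 12, 19]
    }
}
```

With the same inputs the 64-bit sequence is the 32-bit sequence shifted by one step: it starts at hash1 itself rather than at hash1 + hash2.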

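The bitwise AND with Long.MAX_VALUE (binary 0111...1) guards against the case where the hash value overflows and becomes negative: the AND clears the sign bit, so the index is always non-negative.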
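
The two sign fixes can be compared directly on edge cases. The sketch below (class name SignHandling is hypothetical) also shows why Math.abs would not be a safe substitute: it returns a negative number for Integer.MIN_VALUE.

```java
/** Compares the two ways the strategies force a combined hash to be non-negative. */
final class SignHandling {

    /** 32-bit strategy: flip every bit of a negative value. */
    static int flip(int combinedHash) {
        return combinedHash < 0 ? ~combinedHash : combinedHash;
    }

    /** 64-bit strategy: clear the sign bit. */
    static long mask(long combinedHash) {
        return combinedHash & Long.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(flip(-1));                    // prints 0
        System.out.println(flip(Integer.MIN_VALUE));     // prints 2147483647
        System.out.println(Math.abs(Integer.MIN_VALUE)); // prints -2147483648
        System.out.println(mask(-1L));                   // prints 9223372036854775807
        System.out.println(mask(Long.MIN_VALUE));        // prints 0
    }
}
```

Both approaches always produce a non-negative value, but they map a given negative input to different results, so the two strategies are not interchangeable on the same bit array.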

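Why derive multiple hash values by addition

Division does not work: what happens when you end up dividing by 0?
Subtraction is awkward too: once the value goes negative, it no longer identifies a bit position.
Multiplication overflows easily. Fixing the sign after an overflow seems harmless, but the result feels hard to control, the positions jump too far apart, and the sign fix would have to run again and again, which wastes performance.
The additive form hash1 + i * hash2 is also the double-hashing scheme described by Kirsch and Mitzenmacher, which is presumably where the MITZ in the strategy names comes from.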

Checking whether a key exists (mightContain)

32-bit

@Override
    public <T> boolean mightContain(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
      int hash1 = (int) hash64;
      int hash2 = (int) (hash64 >>> 32);

      for (int i = 1; i <= numHashFunctions; i++) {
        int combinedHash = hash1 + (i * hash2);
        // Flip all the bits if it's negative (guaranteed positive number)
        if (combinedHash < 0) {
          combinedHash = ~combinedHash;
        }
        if (!bits.get(combinedHash % bitSize)) {
          return false;
        }
      }
      return true;
    }

64-bit

@Override
    public <T> boolean mightContain(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
      long hash1 = lowerEight(bytes);
      long hash2 = upperEight(bytes);

      long combinedHash = hash1;
      for (int i = 0; i < numHashFunctions; i++) {
        // Make the combined hash positive and indexable
        if (!bits.get((combinedHash & Long.MAX_VALUE) % bitSize)) {
          return false;
        }
        combinedHash += hash2;
      }
      return true;
    }
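
Putting the pieces together, the following is a minimal, JDK-only sketch of a Bloom filter that follows the 64-bit strategy's put and mightContain loops. It is not Guava's implementation: the class name MiniBloomFilter is hypothetical, it only accepts String keys, it is not thread-safe, and MD5 stands in for murmur3_128 purely because it also yields 16 bytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Minimal Bloom filter for String keys, modelled on the 64-bit strategy (sketch only). */
final class MiniBloomFilter {
    private final long[] words;
    private final long bitSize;
    private final int numHashFunctions;

    MiniBloomFilter(long expectedInsertions, double fpp) {
        if (expectedInsertions <= 0 || fpp <= 0.0 || fpp >= 1.0) {
            throw new IllegalArgumentException("need expectedInsertions > 0 and 0 < fpp < 1");
        }
        long m = (long) (-expectedInsertions * Math.log(fpp) / (Math.log(2) * Math.log(2)));
        this.words = new long[(int) ((m + 63) / 64)];   // round up to whole words
        this.bitSize = (long) words.length * 64;
        this.numHashFunctions =
                Math.max(1, (int) Math.round((double) m / expectedInsertions * Math.log(2)));
    }

    /** Stand-in for murmur3_128 (assumption of this sketch): MD5 also produces 16 bytes. */
    private static ByteBuffer hash128(String key) {
        try {
            byte[] digest =
                    MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).order(ByteOrder.LITTLE_ENDIAN);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is required on every Java platform
        }
    }

    /** Sets the k bits for key; returns true if at least one bit changed. */
    boolean put(String key) {
        ByteBuffer hash = hash128(key);
        long combinedHash = hash.getLong(0); // hash1
        long hash2 = hash.getLong(8);
        boolean bitsChanged = false;
        for (int i = 0; i < numHashFunctions; i++) {
            long index = (combinedHash & Long.MAX_VALUE) % bitSize;
            int word = (int) (index >>> 6);
            long mask = 1L << index;
            if ((words[word] & mask) == 0) {
                words[word] |= mask;
                bitsChanged = true;
            }
            combinedHash += hash2;
        }
        return bitsChanged;
    }

    /** Returns false as soon as one of the k bits is 0; true means "possibly present". */
    boolean mightContain(String key) {
        ByteBuffer hash = hash128(key);
        long combinedHash = hash.getLong(0);
        long hash2 = hash.getLong(8);
        for (int i = 0; i < numHashFunctions; i++) {
            long index = (combinedHash & Long.MAX_VALUE) % bitSize;
            if ((words[(int) (index >>> 6)] & (1L << index)) == 0) {
                return false;
            }
            combinedHash += hash2;
        }
        return true;
    }

    public static void main(String[] args) {
        MiniBloomFilter filter = new MiniBloomFilter(10_000, 0.0001);
        for (int i = 0; i < 5_000; i++) {
            filter.put("" + i);
        }
        int falsePositives = 0;
        for (int i = 5_000; i < 10_000; i++) {
            if (filter.mightContain("" + i)) {
                falsePositives++;
            }
        }
        System.out.println("false positives among 5000 absent keys: " + falsePositives);
    }
}
```

The early return in mightContain is what makes false negatives impossible: every inserted key has all k of its bits set, so only a key whose bits were all set by other keys can produce a wrong "possibly present" answer.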

Full code of the two enum constants

MURMUR128_MITZ_32() {
    @Override
    public <T> boolean put(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
      int hash1 = (int) hash64;
      int hash2 = (int) (hash64 >>> 32);

      boolean bitsChanged = false;
      for (int i = 1; i <= numHashFunctions; i++) {
        int combinedHash = hash1 + (i * hash2);
        // Flip all the bits if it's negative (guaranteed positive number)
        if (combinedHash < 0) {
          combinedHash = ~combinedHash;
        }
        bitsChanged |= bits.set(combinedHash % bitSize);
      }
      return bitsChanged;
    }

    @Override
    public <T> boolean mightContain(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
      int hash1 = (int) hash64;
      int hash2 = (int) (hash64 >>> 32);

      for (int i = 1; i <= numHashFunctions; i++) {
        int combinedHash = hash1 + (i * hash2);
        // Flip all the bits if it's negative (guaranteed positive number)
        if (combinedHash < 0) {
          combinedHash = ~combinedHash;
        }
        if (!bits.get(combinedHash % bitSize)) {
          return false;
        }
      }
      return true;
    }
  },
  /**
   * This strategy uses all 128 bits of {@link Hashing#murmur3_128} when hashing. It looks different
   * than the implementation in MURMUR128_MITZ_32 because we're avoiding the multiplication in the
   * loop and doing a (much simpler) += hash2. We're also changing the index to a positive number by
   * AND'ing with Long.MAX_VALUE instead of flipping the bits.
   */
  MURMUR128_MITZ_64() {
    @Override
    public <T> boolean put(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
      long hash1 = lowerEight(bytes);
      long hash2 = upperEight(bytes);

      boolean bitsChanged = false;
      long combinedHash = hash1;
      for (int i = 0; i < numHashFunctions; i++) {
        // Make the combined hash positive and indexable
        bitsChanged |= bits.set((combinedHash & Long.MAX_VALUE) % bitSize);
        combinedHash += hash2;
      }
      return bitsChanged;
    }

    @Override
    public <T> boolean mightContain(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      byte[] bytes = Hashing.murmur3_128().hashObject(object, funnel).getBytesInternal();
      long hash1 = lowerEight(bytes);
      long hash2 = upperEight(bytes);

      long combinedHash = hash1;
      for (int i = 0; i < numHashFunctions; i++) {
        // Make the combined hash positive and indexable
        if (!bits.get((combinedHash & Long.MAX_VALUE) % bitSize)) {
          return false;
        }
        combinedHash += hash2;
      }
      return true;
    }

    private /* static */ long lowerEight(byte[] bytes) {
      return Longs.fromBytes(
          bytes[7], bytes[6], bytes[5], bytes[4], bytes[3], bytes[2], bytes[1], bytes[0]);
    }

    private /* static */ long upperEight(byte[] bytes) {
      return Longs.fromBytes(
          bytes[15], bytes[14], bytes[13], bytes[12], bytes[11], bytes[10], bytes[9], bytes[8]);
    }
  };

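Summary

The BloomFilter class accepts the input, uses the formulas to estimate the parameters, and finally initializes an instance of the Strategy interface.
BloomFilterStrategies is an enum with two constants that implement Strategy, MURMUR128_MITZ_32 and MURMUR128_MITZ_64. It also wraps an array of long values that serves as the Bloom filter's underlying bit array; the core bit operations happen in that array's get and set methods.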

HashCode

HashCode has three implementation classes, IntHashCode, LongHashCode and BytesHashCode. Their hash values are stored as an int (32 bits), a long (64 bits) and a byte[] respectively.

N_a_n
