A Brief Look at the HBase MemStoreLAB Code, Part 1

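This article is based on HBase 0.98.x. If the source code differs from your copy, check which version you are reading.
First, look at the maybeCloneWithAllocator method of MemStore.

MemStore#maybeCloneWithAllocator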
private KeyValue maybeCloneWithAllocator(KeyValue kv) {
    if (allocator == null) { // MSLAB is not enabled, so return the original object.
      return kv;
    }

    int len = kv.getLength(); // Length of the serialized KeyValue.
    Allocation alloc = allocator.allocateBytes(len); // Try to allocate from the MSLAB.
    if (alloc == null) {
      // The allocation was too large, allocator decided
      // not to do anything with it.
      return kv; // Allocation was declined, so fall back to the original object.
    }
    assert alloc.getData() != null;
    // Allocation succeeded: copy the original bytes into the MSLAB chunk.
    System.arraycopy(kv.getBuffer(), kv.getOffset(), alloc.getData(), alloc.getOffset(), len);
    // Create a new KeyValue backed by the chunk's byte array at the allocated offset.
    KeyValue newKv = new KeyValue(alloc.getData(), alloc.getOffset(), len);
    newKv.setMvccVersion(kv.getMvccVersion());
    return newKv;
  }

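The allocator is a MemStoreLAB object. MemStoreLAB hands out contiguous memory by bump-the-pointer allocation: it obtains memory one Chunk at a time, 2 MB by default (configurable through hbase.hregion.memstore.mslab.chunksize), and serves each small allocation from within the current Chunk.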
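To make "bump-the-pointer" concrete, here is a minimal single-threaded sketch. BumpPointerSketch and its members are hypothetical names chosen for illustration; this is not the HBase class, and it deliberately leaves out the CAS logic that the real code needs for concurrent callers.

```java
// Hypothetical, single-threaded illustration of bump-the-pointer allocation.
// Not the HBase implementation: no CAS, no chunk pool, no retirement.
final class BumpPointerSketch {
    private final byte[] data;      // one contiguous chunk
    private int nextFreeOffset = 0; // the "pointer" that gets bumped

    BumpPointerSketch(int chunkSize) {
        this.data = new byte[chunkSize];
    }

    /** Returns the start offset of the reserved region, or -1 if it does not fit. */
    int alloc(int size) {
        if (size < 0 || size > data.length - nextFreeOffset) {
            return -1;
        }
        int offset = nextFreeOffset;
        nextFreeOffset += size; // bump the pointer past the reserved region
        return offset;
    }

    public static void main(String[] args) {
        BumpPointerSketch chunk = new BumpPointerSketch(2 * 1024 * 1024);
        System.out.println(chunk.alloc(100));             // 0
        System.out.println(chunk.alloc(50));              // 100
        System.out.println(chunk.alloc(4 * 1024 * 1024)); // -1, larger than the chunk
    }
}
```

Allocation is just an addition and a bounds check, and regions are never freed individually, which is why the whole chunk is reclaimed at once.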
Now look at its allocateBytes method.

  public Allocation allocateBytes(int size) {
    Preconditions.checkArgument(size >= 0, "negative size");

    // Callers should satisfy large allocations directly from JVM since they
    // don't cause fragmentation as badly.
    if (size > maxAlloc) { // Set by hbase.hregion.memstore.mslab.max.allocation; defaults to 256 KB.
      return null;
    }
    }

    while (true) {
      Chunk c = getOrMakeChunk(); // Lock-free retry loop, similar in spirit to optimistic locking.

      // Try to allocate from this chunk
      int allocOffset = c.alloc(size);
      if (allocOffset != -1) {
        // We succeeded - this is the common case - small alloc
        // from a big buffer
        return new Allocation(c.data, allocOffset);
      }

      // not enough space!
      // try to retire this chunk
      tryRetireChunk(c);
    }
  }
   /**
     * Get the current chunk, or, if there is no current chunk,
     * allocate a new one from the JVM.
     */
    private Chunk getOrMakeChunk() {
        while (true) {
            // Try to get the chunk
            Chunk c = curChunk.get();
            if (c != null) {
                return c;
            }

            // No current chunk, so we want to allocate one. We race
            // against other allocators to CAS in an uninitialized chunk
            // (which is cheap to allocate)
            // If a chunk pool is configured, try to take a chunk from it first;
            // otherwise create a new one. The pool's structure and purpose are
            // left for later.
            c = (chunkPool != null) ? chunkPool.getChunk() : new Chunk(chunkSize);
            if (curChunk.compareAndSet(null, c)) {
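                // Try to install our chunk as the current active chunk. On success,
                // initialize it (init() is where the backing memory is actually
                // allocated) and add it to the queue of chunks this LAB holds.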
                // we won race - now we need to actually do the expensive
                // allocation step
                c.init();
                this.chunkQueue.add(c);
                return c;
            } else if (chunkPool != null) { // The CAS failed, so another thread won the race; return the chunk to the pool for reuse.
                chunkPool.putbackChunk(c);
            }
            // someone else won race - that's fine, we'll try to grab theirs
            // in the next iteration of the loop.
        }
    }

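This code is straightforward: it uses CAS to try to allocate a region of memory. Allocation within a chunk is handled by the chunk's alloc method: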

public int alloc(int size) {
            while (true) {
                int oldOffset = nextFreeOffset.get();
                if (oldOffset == UNINITIALIZED) {
                    // The chunk doesn't have its data allocated yet.
                    // Since we found this in curChunk, we know that whoever
                    // CAS-ed it there is allocating it right now. So spin-loop
                    // shouldn't spin long!
                    Thread.yield();
                    continue;
                }
                if (oldOffset == OOM) {
                    // doh we ran out of ram. return -1 to chuck this away.
                    return -1;
                }

                if (oldOffset + size > data.length) {
                    return -1; // alloc doesn't fit
                }

                // Try to atomically claim this chunk
                if (nextFreeOffset.compareAndSet(oldOffset, oldOffset + size)) {
                    // we got the alloc
                    allocCount.incrementAndGet();
                    return oldOffset;
                }
                // we raced and lost alloc, try again
            }
        }

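As the getOrMakeChunk code above shows, curChunk is first pointed at a new chunk, and only then is the chunk initialized.
So if two threads obtain this chunk during that window, one of them may get an object that is not yet fully initialized, that is, init() has not yet run or has not finished.
alloc therefore handles this case explicitly: it calls Thread.yield() and retries instead of using the chunk.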
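The publish-then-initialize window can be reproduced in a small sketch. LazyInitChunkSketch, demo(), and the 50 ms delay are hypothetical and exist only for illustration; the sentinel and the yield-and-retry loop mirror the structure of the alloc method shown above, under the assumption that init() publishes offset 0 only after the backing array exists.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of a chunk that is visible to other threads before init() runs.
final class LazyInitChunkSketch {
    static final int UNINITIALIZED = -1;

    private final AtomicInteger nextFreeOffset = new AtomicInteger(UNINITIALIZED);
    private final int size;
    private volatile byte[] data; // allocated lazily by init()

    LazyInitChunkSketch(int size) {
        this.size = size;
    }

    /** The expensive step: allocate the backing array, then publish offset 0. */
    void init() {
        data = new byte[size];
        nextFreeOffset.set(0);
    }

    int alloc(int len) {
        while (true) {
            int old = nextFreeOffset.get();
            if (old == UNINITIALIZED) {
                Thread.yield(); // another thread is still running init(); retry shortly
                continue;
            }
            if (len > data.length - old) {
                return -1; // does not fit
            }
            if (nextFreeOffset.compareAndSet(old, old + len)) {
                return old; // we claimed [old, old + len)
            }
            // lost the race for this offset; try again
        }
    }

    /** Starts an allocating thread before init() has run and returns what it got. */
    static int demo() {
        LazyInitChunkSketch chunk = new LazyInitChunkSketch(1024);
        int[] result = new int[1];
        Thread allocator = new Thread(() -> result[0] = chunk.alloc(16));
        allocator.start();
        try {
            Thread.sleep(50); // keep the chunk uninitialized for a while
            chunk.init();
            allocator.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 0: the allocation waited for init(), then succeeded
    }
}
```

The allocating thread never sees a null array, because it only touches data after it has read an offset other than UNINITIALIZED.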

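A digression: since this is already a CAS loop, why not simply continue the loop instead of yielding? Lock-free algorithms in Java generally resemble optimistic locking and are essentially CAS plus self-spinning (that is, looping; "spinning" is just the fancier term, and I will use it from here on).
The benefit is that the thread stays on the CPU: spinning burns some CPU time slices but saves the cost of a thread context switch.
In the usual case, the benefit far outweighs the cost. But note the words "in the usual case". Under extreme contention, a few iterations are not enough to succeed, and that causes two problems.
First, with too many spins, the resources consumed while spinning can exceed the cost of a context switch. Second, a spin is in principle an infinite loop. Unlike a blocked thread, it consumes CPU time slices, and everyone has seen an infinite loop drive CPU utilization to 100%.
This means a spinning thread does not voluntarily release the CPU, which can starve other threads.
Based on this analysis, under high or extreme contention a lock-free scheme built on spinning can be less efficient than using a lock.
Sleeping is not the answer either. Thread.sleep comes in two forms. One takes milliseconds, which needs little discussion: 1 ms is enough time for a CPU to perform a huge number of operations.
The other adds nanoseconds. Unfortunately, nanosecond precision depends on the operating system implementation and even on the hardware clock. As a simple example, System.nanoTime on macOS returns values that are always multiples of 1000.
That is why yield is used here, leaving it to the scheduler to decide when the thread runs again. It is the simplest approach, and a sensible one.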
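Clock granularity is easy to check on your own machine. The probe below is a rough sketch (NanoTimeGranularityProbe and smallestStep are hypothetical names): it records the smallest positive difference between consecutive System.nanoTime() readings. The result depends on the OS, JVM, and hardware, so no particular value should be expected.

```java
// Hypothetical probe: estimates the granularity of System.nanoTime() on this machine.
final class NanoTimeGranularityProbe {
    /** Smallest positive step seen between consecutive readings, in nanoseconds. */
    static long smallestStep(int samples) {
        long smallest = Long.MAX_VALUE;
        long prev = System.nanoTime();
        int seen = 0;
        while (seen < samples) {
            long now = System.nanoTime();
            long delta = now - prev;
            if (delta > 0) { // ignore readings where the clock has not advanced
                smallest = Math.min(smallest, delta);
                prev = now;
                seen++;
            }
        }
        return smallest;
    }

    public static void main(String[] args) {
        System.out.println("smallest observed step (ns): " + smallestStep(100_000));
    }
}
```

If the smallest step is much larger than 1 ns, a nanosecond-level sleep or timed wait cannot be honored at that resolution on that platform.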

So far we have seen how a chunk is obtained and how memory is then allocated within it. Next, the code that runs when allocation within a chunk fails:

private void tryRetireChunk(Chunk c) {
        curChunk.compareAndSet(c, null);
        // If the CAS succeeds, that means that we won the race
        // to retire the chunk. We could use this opportunity to
        // update metrics on external fragmentation.
        //
        // If the CAS fails, that means that someone else already
        // retired the chunk for us.
    }

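The code is very simple: it sets the current allocatable chunk, curChunk, to null.
As a result, until the whole chunk's memory is released, the remaining space in that chunk can no longer be used, even if it could still hold other, smaller cells. This wastes some memory.
So when designing a table schema, keep two things in mind. First, individual cells should not be too large, and cell sizes within the same CF should not vary too widely; above all, never exceed 256 KB (configurable). Second, set hbase.hregion.memstore.mslab.chunksize, ideally, to the smallest value that is at least an integer multiple of the average cell size, though whether this helps depends on whether you followed the first point.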
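A little arithmetic shows how the unused tail varies with cell size. The sketch below uses a simplified model that is an assumption, not HBase behavior in detail: every cell has the same serialized size, and the chunk is retired by the first allocation that no longer fits. ChunkTailWaste and wastedBytes are hypothetical names.

```java
// Hypothetical helper: bytes left unused at the end of a retired chunk,
// assuming every cell has the same serialized size.
final class ChunkTailWaste {
    static long wastedBytes(long chunkSize, long cellSize) {
        if (cellSize <= 0 || cellSize > chunkSize) {
            throw new IllegalArgumentException("cellSize must be in (0, chunkSize]");
        }
        // Cells are packed back to back, so the tail is simply the remainder.
        return chunkSize % cellSize;
    }

    public static void main(String[] args) {
        long chunk = 2L * 1024 * 1024; // default chunk size, 2 MB
        System.out.println(wastedBytes(chunk, 1024));       // 0: 1 KB cells fill the chunk exactly
        System.out.println(wastedBytes(chunk, 100 * 1024)); // 49152: 20 cells fit, 48 KB is lost
        System.out.println(wastedBytes(chunk, 150 * 1024)); // 100352: 13 cells fit, 98 KB is lost
    }
}
```

With mixed cell sizes the tail can be larger still, because a single large cell that does not fit retires the chunk even though smaller cells would have.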
