解决org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode

问题描述

每次一运行MapReduce作业向HBase里面写数据,主节点的HMaster和HRegioServer进程就会挂掉。查看HBase日志发现,

WARN  [PEWorker-1] coordination.SplitLogManagerCoordination: Failed to check remaining tasks
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/splitWAL
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1532)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:278)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenNoWatch(ZKUtil.java:500)
    at org.apache.hadoop.hbase.coordination.ZKSplitLogManagerCoordination.remainingTasksInCoordination(ZKSplitLogManagerCoordination.java:125)
    at org.apache.hadoop.hbase.master.SplitLogManager.waitForSplittingCompletion(SplitLogManager.java:326)
    at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:258)
    at org.apache.hadoop.hbase.master.MasterWalManager.splitLog(MasterWalManager.java:293)
    at org.apache.hadoop.hbase.master.MasterWalManager.splitMetaLog(MasterWalManager.java:232)
    at org.apache.hadoop.hbase.master.MasterWalManager.splitMetaLog(MasterWalManager.java:224)
    at org.apache.hadoop.hbase.master.procedure.RecoverMetaProcedure.executeFromState(RecoverMetaProcedure.java:129)
    at org.apache.hadoop.hbase.master.procedure.RecoverMetaProcedure.executeFromState(RecoverMetaProcedure.java:52)
    at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:184)
    at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:850)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1472)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1240)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1760)
2018-08-18 19:46:36,735 WARN  [PEWorker-1] coordination.SplitLogManagerCoordination: Failed to check remaining tasks
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/splitWAL
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1532)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:278)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenNoWatch(ZKUtil.java:500)
    at org.apache.hadoop.hbase.coordination.ZKSplitLogManagerCoordination.remainingTasksInCoordination(ZKSplitLogManagerCoordination.java:125)
    at org.apache.hadoop.hbase.master.SplitLogManager.waitForSplittingCompletion(SplitLogManager.java:326)
    at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:258)
    at org.apache.hadoop.hbase.master.MasterWalManager.splitLog(MasterWalManager.java:293)
    at org.apache.hadoop.hbase.master.MasterWalManager.splitMetaLog(MasterWalManager.java:232)
    at org.apache.hadoop.hbase.master.MasterWalManager.splitMetaLog(MasterWalManager.java:224)
    at org.apache.hadoop.hbase.master.procedure.RecoverMetaProcedure.executeFromState(RecoverMetaProcedure.java:129)
    at org.apache.hadoop.hbase.master.procedure.RecoverMetaProcedure.executeFromState(RecoverMetaProcedure.java:52)
    at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:184)
    at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:850)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1472)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1240)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1760)
2018-08-18 19:46:36,860 WARN  [PEWorker-1] coordination.SplitLogManagerCoordination: Failed to check remaining tasks
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/splitWAL
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1532)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:278)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenNoWatch(ZKUtil.java:500)
    at org.apache.hadoop.hbase.coordination.ZKSplitLogManagerCoordination.remainingTasksInCoordination(ZKSplitLogManagerCoordination.java:125)
    at org.apache.hadoop.hbase.master.SplitLogManager.waitForSplittingCompletion(SplitLogManager.java:326)
    at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:258)
    at org.apache.hadoop.hbase.master.MasterWalManager.splitLog(MasterWalManager.java:293)
    at org.apache.hadoop.hbase.master.MasterWalManager.splitMetaLog(MasterWalManager.java:232)
    at org.apache.hadoop.hbase.master.MasterWalManager.splitMetaLog(MasterWalManager.java:224)
    at org.apache.hadoop.hbase.master.procedure.RecoverMetaProcedure.executeFromState(RecoverMetaProcedure.java:129)
    at org.apache.hadoop.hbase.master.procedure.RecoverMetaProcedure.executeFromState(RecoverMetaProcedure.java:52)
    at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:184)
    at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:850)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1472)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1240)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1760)
2018-08-18 19:46:36,961 WARN  [PEWorker-1] coordination.SplitLogManagerCoordination: Failed to check remaining tasks
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/splitWAL
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1532)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:278)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenNoWatch(ZKUtil.java:500)
    at org.apache.hadoop.hbase.coordination.ZKSplitLogManagerCoordination.remainingTasksInCoordination(ZKSplitLogManagerCoordination.java:125)
    at org.apache.hadoop.hbase.master.SplitLogManager.waitForSplittingCompletion(SplitLogManager.java:326)
    at org.apache.hadoop.hbase.master.SplitLogManager.splitLogDistributed(SplitLogManager.java:258)
    at org.apache.hadoop.hbase.master.MasterWalManager.splitLog(MasterWalManager.java:293)
    at org.apache.hadoop.hbase.master.MasterWalManager.splitMetaLog(MasterWalManager.java:232)
    at org.apache.hadoop.hbase.master.MasterWalManager.splitMetaLog(MasterWalManager.java:224)
    at org.apache.hadoop.hbase.master.procedure.RecoverMetaProcedure.executeFromState(RecoverMetaProcedure.java:129)
    at org.apache.hadoop.hbase.master.procedure.RecoverMetaProcedure.executeFromState(RecoverMetaProcedure.java:52)
    at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:184)
    at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:850)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1472)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1240)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$800(ProcedureExecutor.java:75)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1760)

原因

        HBase进程默认触发GC的时机是当年老代内存达到90%的时候,这个百分比由 -XX:CMSInitiatingOccupancyFraction=N 这个参数来设置。concurrent mode failed发生在这样一个场景:当年老代内存达到90%的时候,CMS开始进行并发垃圾收集,于此同时,新生代还在迅速不断地晋升对象到年老代。当年老代CMS还未完成并发标记时,年老 代满了,悲剧就发生了。CMS因为没内存可用不得不暂停mark,并触发一次全jvm的stop the world(挂起所有线程),然后采用单线程拷贝方式清理所有垃圾对象,也就是full gc。而我们的bulk的最开始的操作就是各种删表,建表频繁的操作,就会使用掉大量master的年轻代的内存,就会发生上面发生的场景,发生full gc。

解决方案

把CMSInitiatingOccupancyFraction的值设置为70,这样年老代占到约70%时就开始执行CMS,这样就不会出现(或很少出现)Full GC了。

具体步骤

使用vim $HBASE_HOME/conf/hbase-env.sh打开文件,找到export HBASE_OPTS,在其位置上方(避免下文取不到该变量的值)添加export HBASE_LOG_DIR=${HBASE_HOME}/logs,然后设置HBASE_OPTS为export HBASE_OPTS="$HBASE_OPTS -verbose:gc -Xloggc:$HBASE_LOG_DIR/hbase.gc.log -XX:ErrorFile=$HBASE_LOG_DIR/hs_err_pid.lo    g -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX    :CMSInitiatingOccupancyFraction=70"

猜你喜欢

转载自blog.csdn.net/perfer258/article/details/81812531