JStorm throws com.lmax.disruptor.InsufficientCapacityException

The following exception appeared in the logs:

 [WARN 2018-08-29 10:12:07 TaskHeartbeatTrigger:118 run pool-6-thread-2] Failed to publish timer event to {topo_name}:47_taskHeartbeat
shade.storm.com.lmax.disruptor.InsufficientCapacityException

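This is really just com.lmax.disruptor.InsufficientCapacityException under a shaded package name. Disruptor is the in-memory queue that Storm uses, and the exception reports that the queue has no free capacity. The purpose of these timer events is to check whether the task is still alive. In this case the task had not taken anything from that queue for a long time, so the queue filled up and the publish failed.
Checking the spouts showed a large PendingNum on every task. It had reached the PendingNum upper limit, which also explains why new events could not be added.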

Task    Emitted Acked   SendTps RecvTps Process(us) Deser(us)   Ser(us) Exe(us) DeserQueue(%)   SerQueue(%) ExeQueue(%) CtrlQueue(%)    PendingNum  EmptyCpuRatio   BatchInterTime  TupleLifeCycle(us)
23  6   0   0.11    0   0   0   0   0   0.00    0.00    0.00    0.00    101 1   0   0
24  6   1   0.11    0.01    0   0   0   0   0.00    0.00    0.00    0.00    100 1   0   0
25  5   1   0.11    0.01    0   0   0   0   0.00    0.00    0.00    0.00    101 1   0   0
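
The capacity failure described above can be reproduced in miniature. The sketch below is only an illustration under stated assumptions: it uses java.util.concurrent.ArrayBlockingQueue as a stand-in for the Disruptor queue (it is not the Disruptor API), and the class and method names are hypothetical. It shows that when nothing drains a bounded queue, a non-blocking publish starts failing as soon as the queue is full, which is the situation the heartbeat timer ran into.

```java
import java.util.concurrent.ArrayBlockingQueue;

// Hypothetical demo class; ArrayBlockingQueue stands in for the Disruptor queue.
class BoundedQueueDemo {

    /**
     * Publishes `attempts` events into a queue of the given capacity that
     * no consumer ever drains, and returns how many publishes were rejected.
     */
    static int publishWithoutConsumer(int capacity, int attempts) {
        ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        int rejected = 0;
        for (int i = 0; i < attempts; i++) {
            // offer() never blocks: it returns false when the queue is full,
            // much like a non-blocking publish that reports insufficient capacity.
            if (!queue.offer("timer-event-" + i)) {
                rejected++;
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        // Capacity 4, 10 attempts, no consumer: the first 4 fit, the rest fail.
        System.out.println(publishWithoutConsumer(4, 10)); // prints 6
    }
}
```

The point of the sketch is that the rejected publish is a symptom: the real question is why the consumer side stopped taking events.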

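The natural suspicion was that the downstream bolt was processing too slowly, so I checked the business logs. To my surprise, the bolt was processing events at a rate of exactly one every 60 seconds. With no better option, I fell back on the most primitive method: commenting out the original code line by line.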
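In the end the problem turned out to be in the logging. That code used Kafka (version 0.8.2.1) to collect logs, and what we were in the middle of doing was upgrading Kafka to 1.0.1. An incompatible API had tripped us up once again. The 60 s is the timeout configured for Kafka.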

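To sum up: we architects do not write business code, but we often end up responsible for architectural changes to it. When that happens, be sure to find out exactly which external components the code depends on. Also, from a code-architecture point of view, non-core functionality should come with alerting and an automatic stripping mechanism rather than directly blocking the business thread.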
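
One possible shape for such a mechanism is sketched below. It is a minimal illustration, not JStorm or Kafka API: the class GuardedSideTask, its parameters, and the alert message are all hypothetical. The idea is to run the non-core call (for example, shipping a log line) on a separate thread, wait for it only up to a short timeout, and after several consecutive failures raise an alert and stop calling it altogether.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: bounds the time a non-core task may cost the caller,
// and strips the task out after repeated failures.
class GuardedSideTask {
    private final ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "side-task");
        t.setDaemon(true); // never keep the JVM alive for a non-core task
        return t;
    });
    private final long timeoutMillis;
    private final int maxFailures;
    private int consecutiveFailures = 0;
    private boolean disabled = false;

    GuardedSideTask(long timeoutMillis, int maxFailures) {
        this.timeoutMillis = timeoutMillis;
        this.maxFailures = maxFailures;
    }

    synchronized boolean isDisabled() {
        return disabled;
    }

    /** Runs the task with a bounded wait; returns true only if it completed in time. */
    synchronized boolean run(Runnable task) {
        if (disabled) {
            return false; // already stripped: skip without touching the business path
        }
        Future<?> future = pool.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            consecutiveFailures = 0;
            return true;
        } catch (TimeoutException | ExecutionException e) {
            future.cancel(true); // interrupt the stuck call if it is still running
            if (++consecutiveFailures >= maxFailures) {
                disabled = true;
                System.err.println("ALERT: side task disabled after "
                        + consecutiveFailures + " consecutive failures");
            }
            return false;
        } catch (InterruptedException e) {
            future.cancel(true);
            Thread.currentThread().interrupt(); // preserve the caller's interrupt
            return false;
        }
    }

    public static void main(String[] args) {
        GuardedSideTask guard = new GuardedSideTask(100, 2);
        boolean ok = guard.run(() -> System.out.println("log line shipped"));
        System.out.println("completed in time: " + ok);
    }
}
```

With this shape the business thread waits at most timeoutMillis per call instead of a full 60 seconds, and stops waiting entirely once the side task has been disabled. A fire-and-forget hand-off to a bounded queue would avoid even that short wait, at the cost of not knowing whether the call succeeded.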
