JVM: analysis of a production memory crash caused by PhantomReference and apache-commons-pool

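A while ago, colleagues in a neighboring department had trouble with their Thailand-site project: two days after going live, or after a restart, the system would always shut down for no apparent reason. I find this kind of problem interesting and have handled a few of them before, so here is a write-up of how I dealt with it: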

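The left side shows the state before the fix and the right side shows the state after it. Before analyzing the problem, a word about the tool: I used MAT (Memory Analyzer Tool), which is the one I prefer. Import the heap dump snapshot: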

Choosing the Leak Suspects Report view is usually enough; see the view below:

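The view above shows that com.mysql.jdbc.NonRegisteringDriver accounts for 85.98% of the memory, mainly because the ConcurrentHashMap it holds takes up the vast majority of the heap. Next, switch views. I usually use the Histogram and Dominator_Tree views; paste the class name above into them and take a look: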

These two views make it clear that there are far too many ConnectionPhantomReference objects. The driver code shows where they come from:

public class NonRegisteringDriver implements java.sql.Driver {
	private static final String ALLOWED_QUOTES = "\"'";

	private static final String REPLICATION_URL_PREFIX = "jdbc:mysql:replication://";

	private static final String URL_PREFIX = "jdbc:mysql://";

	private static final String MXJ_URL_PREFIX = "jdbc:mysql:mxj://";

	public static final String LOADBALANCE_URL_PREFIX = "jdbc:mysql:loadbalance://";

	protected static final ConcurrentHashMap<ConnectionPhantomReference, ConnectionPhantomReference> connectionPhantomRefs = new ConcurrentHashMap<ConnectionPhantomReference, ConnectionPhantomReference>();

	// Called for every newly created physical connection.
	protected static void trackConnection(Connection newConn) {
		ConnectionPhantomReference phantomRef = new ConnectionPhantomReference((ConnectionImpl) newConn, refQueue);
		connectionPhantomRefs.put(phantomRef, phantomRef);
	}

	// ... rest of the class omitted
}

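Anyone familiar with apache commons pool will recognize that the connection pool in use here is built on commons-pool. Every time a connection is created, this method puts a phantom reference to the Connection object into the map. The phantom reference is a safety net: if a connection is dropped by the caller without its resources having been released, the resources it holds are released once the connection is garbage-collected: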

public void run() {
		threadRef = this;
		while (running) {
			try {
				Reference<? extends ConnectionImpl> ref = NonRegisteringDriver.refQueue.remove(100);
				if (ref != null) {
					try {
						((ConnectionPhantomReference) ref).cleanup();
					} finally {
						NonRegisteringDriver.connectionPhantomRefs.remove(ref);
					}
				}

			} catch (Exception ex) {
				// no where to really log this if we're static
			}
		}
	}
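The tracking-and-cleanup cycle above can be sketched in a few lines of plain Java. This is only an illustration, not the driver's code: FakeConnection, TrackedRef, track and drain are hypothetical names, and the manual enqueue() call stands in for what the garbage collector does once a connection becomes unreachable. The point is that every tracked reference stays in the map until it has been enqueued and drained.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

final class PhantomTrackingSketch {
    /** Hypothetical stand-in for a JDBC connection. */
    static final class FakeConnection { }

    /** Hypothetical counterpart of ConnectionPhantomReference. */
    static final class TrackedRef extends PhantomReference<FakeConnection> {
        TrackedRef(FakeConnection conn, ReferenceQueue<FakeConnection> queue) {
            super(conn, queue);
        }

        void cleanup() {
            // A real driver would release network resources here.
        }
    }

    static final ReferenceQueue<FakeConnection> QUEUE = new ReferenceQueue<>();
    static final ConcurrentHashMap<TrackedRef, TrackedRef> REFS = new ConcurrentHashMap<>();

    /** Registers a phantom reference for a new connection, as trackConnection does. */
    static TrackedRef track(FakeConnection conn) {
        TrackedRef ref = new TrackedRef(conn, QUEUE);
        REFS.put(ref, ref);
        return ref;
    }

    /** Cleans up every reference currently in the queue; returns how many were processed. */
    static int drain() {
        int processed = 0;
        Reference<? extends FakeConnection> ref;
        while ((ref = QUEUE.poll()) != null) {
            try {
                ((TrackedRef) ref).cleanup();
            } finally {
                REFS.remove(ref);
            }
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        List<FakeConnection> alive = new ArrayList<>();
        List<TrackedRef> refs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            FakeConnection conn = new FakeConnection();
            alive.add(conn);
            refs.add(track(conn));
        }
        System.out.println("tracked: " + REFS.size());

        // Stand-in for the GC: enqueue each reference by hand.
        for (TrackedRef ref : refs) {
            ref.enqueue();
        }
        System.out.println("drained: " + drain() + ", still tracked: " + REFS.size());
    }
}
```

In the real driver the enqueue step only happens after a garbage collection has found the connection unreachable, so a pool that keeps creating connections can fill the map much faster than it is emptied.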

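When a full GC happens, these references are placed on refQueue and the resources held by the connections are eventually released. Releasing those resources is extremely slow, though, so it is no surprise that the memory build-up brought the Docker container down.

Still, a database connection pool exists precisely to pool resources and reuse them, so how could such an obvious mistake occur? It basically should not be possible. But everything has a cause, so I had to dig further to find what was really behind this. A web search turned up only superficial write-ups that described the symptoms without explaining the root cause, so I had to work it out myself. My first guess was that connections sat unused for a long time, exceeded the wait time (waittime), were reclaimed, and were then created again, so that connections were being reclaimed and recreated over and over. That theory only holds if minpool is 0, but minpool and maxpool looked fine: both were set to 5.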

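I was stuck at this point for a while because nothing seemed to add up, so the only option left was to read the source, specifically the apache-commons-pool code that evicts connections. This project uses commons-pool 1.x rather than 2.x, which turns out to matter a great deal. Here is the code: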

private class Evictor extends TimerTask {
        /**
         * Run pool maintenance.  Evict objects qualifying for eviction and then
         * invoke {@link GenericObjectPool#ensureMinIdle()}.
         */
        public void run() {
            try {
                evict();
            } catch(Exception e) {
                // ignored
            } catch(OutOfMemoryError oome) {
                // Log problem but give evictor thread a chance to continue in
                // case error is recoverable
                oome.printStackTrace(System.err);
            }
            try {
                ensureMinIdle();
            } catch(Exception e) {
                // ignored
            }
        }
    }

public void evict() throws Exception {
        assertOpen();
        synchronized (this) {
            if(_pool.isEmpty()) {
                return;
            }
            if (null == _evictionCursor) {
                _evictionCursor = (_pool.cursor(_lifo ? _pool.size() : 0));
            }
        }

        for (int i=0,m=getNumTests();i<m;i++) {
            final ObjectTimestampPair pair;
            synchronized (this) {
                if ((_lifo && !_evictionCursor.hasPrevious()) ||
                        !_lifo && !_evictionCursor.hasNext()) {
                    _evictionCursor.close();
                    _evictionCursor = _pool.cursor(_lifo ? _pool.size() : 0);
                }

                pair = _lifo ?
                        (ObjectTimestampPair) _evictionCursor.previous() :
                        (ObjectTimestampPair) _evictionCursor.next();

                _evictionCursor.remove();
                _numInternalProcessing++;
            }

            boolean removeObject = false;
            final long idleTimeMilis = System.currentTimeMillis() - pair.tstamp;
            if ((getMinEvictableIdleTimeMillis() > 0) &&
                    (idleTimeMilis > getMinEvictableIdleTimeMillis())) {
                removeObject = true;
            } else if ((getSoftMinEvictableIdleTimeMillis() > 0) &&
                    (idleTimeMilis > getSoftMinEvictableIdleTimeMillis()) &&
                    ((getNumIdle() + 1)> getMinIdle())) { // +1 accounts for object we are processing
                removeObject = true;
            }
            if(getTestWhileIdle() && !removeObject) {
                boolean active = false;
                try {
                    _factory.activateObject(pair.value);
                    active = true;
                } catch(Exception e) {
                    removeObject=true;
                }
                if(active) {
                    if(!_factory.validateObject(pair.value)) {
                        removeObject=true;
                    } else {
                        try {
                            _factory.passivateObject(pair.value);
                        } catch(Exception e) {
                            removeObject=true;
                        }
                    }
                }
            }

            if (removeObject) {
                try {
                    _factory.destroyObject(pair.value);
                } catch(Exception e) {
                    // ignored
                }
            }
            synchronized (this) {
                if(!removeObject) {
                    _evictionCursor.add(pair);
                    if (_lifo) {
                        // Skip over the element we just added back
                        _evictionCursor.previous();
                    }
                }
                _numInternalProcessing--;
            }
        }
    }

private void ensureMinIdle() throws Exception {
        // this method isn't synchronized so the
        // calculateDeficit is done at the beginning
        // as a loop limit and a second time inside the loop
        // to stop when another thread already returned the
        // needed objects
        int objectDeficit = calculateDeficit(false);
        for ( int j = 0 ; j < objectDeficit && calculateDeficit(true) > 0 ; j++ ) {
            try {
                addObject();
            } finally {
                synchronized (this) {
                    _numInternalProcessing--;
                    allocate();
                }
            }
        }
    }

The problem lies in the code above, and attentive readers may already have spotted it:

if ((getMinEvictableIdleTimeMillis() > 0) &&
                    (idleTimeMilis > getMinEvictableIdleTimeMillis())) {
                removeObject = true;
            }

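This code checks nothing else: as soon as it sees that a connection has been idle for too long, it sets removeObject to true and destroys the connection, and then ensureMinIdle creates a new one. Every one of those new connections goes through trackConnection and adds another ConnectionPhantomReference to the map. That is the problem. The pooling concept here is flawed, or perhaps the developers simply had a different philosophy: the code should first check whether the number of idle connections exceeds minpool, and only then check the idle time. The fix was to raise the idle-time threshold used for eviction; how large it should be depends on the business workload. 2.x does not have this problem, so I won't paste its code. The result of the change is the chart at the top of this post.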
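To make the churn concrete, here is a small simulation of the evict-then-refill cycle. It is a simplified sketch under stated assumptions, not commons-pool code: shouldEvict only mirrors the idle-time part of the removeObject decision quoted above, every run examines all idle connections instead of getNumTests() of them, connections are never borrowed, and the numbers (5 idle connections, one evictor run per minute, a 30-minute threshold, two days of runs) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class EvictionChurnSketch {
    /**
     * Idle-time part of the eviction decision. numIdle counts the connection
     * being examined, which matches the "+1" in the original condition.
     */
    static boolean shouldEvict(long idleMillis, long minEvictable,
                               long softMinEvictable, int numIdle, int minIdle) {
        if (minEvictable > 0 && idleMillis > minEvictable) {
            return true; // hard threshold: minIdle is not consulted at all
        }
        return softMinEvictable > 0 && idleMillis > softMinEvictable && numIdle > minIdle;
    }

    /** Returns how many connections get created over the given number of evictor runs. */
    static int simulate(int runs, long intervalMillis, long minEvictable,
                        long softMinEvictable, int minIdle) {
        List<Long> idleSince = new ArrayList<>();
        long now = 0;
        int created = 0;
        while (idleSince.size() < minIdle) { // initial fill of the pool
            idleSince.add(now);
            created++;
        }
        for (int run = 0; run < runs; run++) {
            now += intervalMillis;
            int numIdle = idleSince.size();
            Iterator<Long> it = idleSince.iterator();
            while (it.hasNext()) { // evict()
                long idleMillis = now - it.next();
                if (shouldEvict(idleMillis, minEvictable, softMinEvictable, numIdle, minIdle)) {
                    it.remove();
                    numIdle--;
                }
            }
            while (idleSince.size() < minIdle) { // ensureMinIdle()
                idleSince.add(now);
                created++; // each creation would also register a phantom reference
            }
        }
        return created;
    }

    public static void main(String[] args) {
        int runs = 2 * 24 * 60;        // two days of one-minute evictor runs
        long interval = 60_000L;       // one minute
        long threshold = 30 * 60_000L; // 30 minutes
        System.out.println("hard threshold only: " + simulate(runs, interval, threshold, -1, 5));
        System.out.println("soft threshold only: " + simulate(runs, interval, -1, threshold, 5));
    }
}
```

With the hard threshold, the whole pool is destroyed and rebuilt every time the idle limit is passed, even though the pool never holds more than the minimum number of idle connections. With only the soft threshold, the same idle connections are kept because the pool is not above its minimum.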

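After the change, the curve is fairly flat. There was actually one more contributing factor: we have a Redis layer in front, which absorbs most of the reads, and the write volume to MySQL is small too, so the connections sit idle for long stretches. The domestic site had the same issue a long time ago, but that team now runs 500 machines for the domestic site with high QPS, so it is basically not a problem there. That is how I handled it. The real process was of course not this smooth, and I have left out some of the steps. That's all.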
