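A Dubbo Pitfall: The Graceful Restart Problem

1. Background

We recently introduced Dubbo services into our production environment. Every release restart triggered timeout alerts. Oddly, restarting either the client or the server had an impact, and the alerts became more pronounced as traffic grew.

The alert looks roughly like this: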

cause: org.apache.dubbo.remoting.TimeoutException: Waiting server-side response timeout by scan timer. start time: 2021-09-09 11:59:56.822, end time: 2021-09-09 11:59:58.828, client elapsed: 0 ms, server elapsed: 2006 ms, timeout: 2000 ms, request: Request [id=307463, version=2.0.2, twoway=true, event=false, broken=false, data=null], channel: /XXXXXX:52149 -> /XXXXXX:20880] with root cause]

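What could be the cause?

No graceful shutdown?
Too many requests at the moment of restart, with no warmup?
Dubbo started successfully before SpringBoot had finished starting, with no delayed export?
An unreasonable parameter setting?

All of these were possible. After nearly two weeks of reading the Dubbo source code and verifying each hypothesis, I finally found the full set of answers, and I have written up the pitfalls here.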

2. Context

Versions

Component	Version
Dubbo	2.7.7
Netty	4.0.36.Final
Zookeeper	3.4.9

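Basic setup

Read requests are idempotent, so we retry them by default. Write requests are not retried by default.

The default timeout is 2000 ms.

All services run in Docker containers, and Dubbo clients far outnumber service providers, at a ratio of roughly 10:1.

Note: this article focuses on the techniques and principles related to service restarts. It does not cover Dubbo basics, Netty basics, or the differences between versions.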

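3. Key techniques for graceful restart

Dubbo provides solutions for the problems listed above. Let's go through them one by one.

Dubbo's graceful shutdown mechanism

Dubbo implements graceful shutdown through the JDK ShutdownHook. The mechanism consists of five main steps:

(1) On receiving the process exit signal from kill PID, the Spring container fires its container-destroy event.

(2) The provider unregisters its service metadata (deletes the ZK node).

(3) Consumers pull the latest list of service providers.

(4) The provider sends a readonly event message to notify consumers that the service is unavailable.

(5) The server waits for tasks that are already running to finish and rejects new tasks.

Graceful shutdown

Core code: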

    @Override
    public void close(final int timeout) {
        startClose();
        if (timeout > 0) {
            final long max = (long) timeout;
            final long start = System.currentTimeMillis();
            if (getUrl().getParameter(Constants.CHANNEL_SEND_READONLYEVENT_KEY, true)) {
                // send a readonly event message to notify consumers that the service is unavailable
                sendChannelReadOnlyEvent();
            }
            while (HeaderExchangeServer.this.isRunning()
                    && System.currentTimeMillis() - start < max) {
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    logger.warn(e.getMessage(), e);
                }
            }
        }
        doClose();
        server.close(timeout);
    }
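
The bounded wait in the listing above can be illustrated outside of Dubbo. The following is a minimal sketch, not Dubbo's implementation: it assumes a plain ExecutorService standing in for the server's worker pool, and the class and method names (GracefulShutdownSketch, drain) are hypothetical. It stops accepting new tasks, waits up to a timeout for running ones, and only then forces termination.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: drain a worker pool within a bounded wait on JVM shutdown. */
class GracefulShutdownSketch {

    /**
     * Stops accepting new tasks, waits up to timeoutMillis for running ones,
     * then forces termination. Returns true if everything finished in time.
     */
    static boolean drain(ExecutorService pool, long timeoutMillis) {
        pool.shutdown(); // reject new tasks, let queued and running tasks finish
        try {
            if (pool.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return true;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
        pool.shutdownNow(); // timed out or interrupted: interrupt whatever is still running
        return false;
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        pool.submit(() -> sleepQuietly(200)); // simulated in-flight request

        // The hook runs on a normal exit or SIGTERM (kill PID), but not on SIGKILL (kill -9 PID).
        Runtime.getRuntime().addShutdownHook(new Thread(
                () -> System.out.println("drained=" + drain(pool, 10_000)),
                "graceful-shutdown"));

        System.exit(0); // stands in for the exit signal in this demo
    }
}
```

The 10 000 ms passed to drain mirrors the 10 s default wait mentioned in this section. A process killed with kill -9 never runs the hook, so the container stop procedure has to send a regular termination signal first.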

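Related configuration

dubbo:
  application:
    shutwait: 10000 # graceful shutdown wait time in milliseconds; defaults to 10 s

Dubbo's warmup mechanism

The default weight of a Dubbo service is 100. Dubbo actually provides a pseudo-warmup mechanism: it computes a weight from how long the provider has been running, and the load-balancing strategy then ramps traffic up from small to large. Let's look at the Dubbo source to see how warmup is implemented. The code lives in AbstractLoadBalance#getWeight.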

    /**
     * Get the weight of the invoker's invocation which takes warmup time into account
     * if the uptime is within the warmup time, the weight will be reduce proportionally
     *
     * @param invoker    the invoker
     * @param invocation the invocation of this invoker
     * @return weight
     */
    int getWeight(Invoker<?> invoker, Invocation invocation) {
        int weight;
        URL url = invoker.getUrl();
        // Multiple registry scenario, load balance among multiple registries.
        if (REGISTRY_SERVICE_REFERENCE_PATH.equals(url.getServiceInterface())) {
            weight = url.getParameter(REGISTRY_KEY + "." + WEIGHT_KEY, DEFAULT_WEIGHT);
        } else {
            weight = url.getMethodParameter(invocation.getMethodName(), WEIGHT_KEY, DEFAULT_WEIGHT);
            if (weight > 0) {
                // read the provider's start time (timestamp)
                long timestamp = invoker.getUrl().getParameter(TIMESTAMP_KEY, 0L);
                if (timestamp > 0L) {
                    // subtract the provider's start time from the current time to get its running time `uptime`
                    long uptime = System.currentTimeMillis() - timestamp;
                    if (uptime < 0) {
                        return 1;
                    }
                    // read the warmup period, 10 minutes by default
                    int warmup = invoker.getUrl().getParameter(WARMUP_KEY, DEFAULT_WARMUP);
                    // if the running time is shorter than warmup, recalculate the weight
                    if (uptime > 0 && uptime < warmup) {
                        // compute the warmup weight dynamically from the running time
                        weight = calculateWarmupWeight((int)uptime, warmup, weight);
                    }
                }
            }
        }
        return Math.max(weight, 0);
    }

Next, the weight calculation itself:

    /**
     * Calculate the weight according to the uptime proportion of warmup time
     * the new weight will be within 1(inclusive) to weight(inclusive)
     *
     * @param uptime the uptime in milliseconds
     * @param warmup the warmup time in milliseconds
     * @param weight the weight of an invoker
     * @return weight which takes warmup into account
     */
    static int calculateWarmupWeight(int uptime, int warmup, int weight) {
        int ww = (int) ( uptime / ((float) warmup / weight));
        return ww < 1 ? 1 : (Math.min(ww, weight));
    }

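The calculation is simple: the longer the service has been running, the higher its weight, until uptime reaches warmup and the full weight is restored.

With the defaults (service weight 100, warmup time 10 minutes):

If the provider has been running for 1 minute, the resulting weight is 10.

If the provider has been running for 5 minutes, the resulting weight is 50.

If the provider has been running for 11 minutes, which exceeds the default 10-minute warmup threshold, no recalculation happens and the configured weight is returned directly.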
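
The three cases above can be checked with a small standalone program. This is a worked example, not Dubbo code: calculateWarmupWeight copies the formula quoted earlier, while effectiveWeight and the class name are hypothetical and model only the uptime/warmup branch of getWeight.

```java
/** Worked example of the warmup arithmetic with the defaults stated in the text. */
class WarmupWeightExample {

    static final int DEFAULT_WEIGHT = 100;
    static final int DEFAULT_WARMUP_MS = 10 * 60 * 1000;

    // Same formula as the calculateWarmupWeight listing quoted in the article.
    static int calculateWarmupWeight(int uptime, int warmup, int weight) {
        int ww = (int) (uptime / ((float) warmup / weight));
        return ww < 1 ? 1 : Math.min(ww, weight);
    }

    // Simplified stand-in for getWeight: only the uptime/warmup branch is modeled.
    static int effectiveWeight(long uptimeMs, int warmupMs, int weight) {
        if (uptimeMs > 0 && uptimeMs < warmupMs) {
            return calculateWarmupWeight((int) uptimeMs, warmupMs, weight);
        }
        return weight;
    }

    public static void main(String[] args) {
        long[] minutes = {1, 5, 11};
        for (long m : minutes) {
            int w = effectiveWeight(m * 60_000L, DEFAULT_WARMUP_MS, DEFAULT_WEIGHT);
            System.out.println(m + " min -> weight " + w);
        }
        // prints:
        // 1 min -> weight 10
        // 5 min -> weight 50
        // 11 min -> weight 100
    }
}
```

Note the lower bound: a provider that has been up for only one second gets weight 1 rather than 0, so it still receives a trickle of traffic.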

Note: the consistenthash (consistent hashing) load-balancing strategy does not support warmup.
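
To see why a lower weight means less traffic under a weight-aware strategy, here is a simplified sketch of weighted random selection. It is an illustration under assumptions, not Dubbo's load-balancer code; the class and method names are hypothetical. Each provider owns a slice of the range [0, totalWeight) proportional to its weight, and a random offset decides which slice is hit.

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch of weight-proportional random selection. */
class WeightedPickSketch {

    /** Maps an offset in [0, totalWeight) to the index of the provider that owns it. */
    static int pickByOffset(int[] weights, int offset) {
        for (int i = 0; i < weights.length; i++) {
            offset -= weights[i];
            if (offset < 0) {
                return i;
            }
        }
        throw new IllegalArgumentException("offset must be below the total weight");
    }

    /** Picks a provider index at random, proportionally to its weight (total must be positive). */
    static int pick(int[] weights) {
        int total = 0;
        for (int w : weights) {
            total += w;
        }
        return pickByOffset(weights, ThreadLocalRandom.current().nextInt(total));
    }

    public static void main(String[] args) {
        // Index 0 is a provider one minute into warmup (weight 10); the others are fully warmed up.
        int[] weights = {10, 100, 100};
        int[] hits = new int[weights.length];
        for (int i = 0; i < 210_000; i++) {
            hits[pick(weights)]++;
        }
        System.out.println(Arrays.toString(hits)); // roughly in the ratio 10 : 100 : 100
    }
}
```

With these weights the freshly started provider receives about 10 out of every 210 requests, which is the gradual ramp-up the warmup mechanism is after.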

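Related configuration

The delay setting below postpones service export, which covers the case where Dubbo is ready before SpringBoot has finished starting.

dubbo:
  provider:
    delay: 5000 # delayed export in milliseconds; default is null (no delay)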

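Other issues

Even after addressing all of the above, restarts still caused a large number of timeouts. Inspecting the client logs revealed the following: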

/XXX:57330 -> /XXXX:20880 is established., dubbo version: 2.7.7, current host: XXXX
2021-09-07 15:01:07.748 [NettyClientWorker-1-16] INFO  o.a.d.r.t.netty4.NettyClientHandler   -  [DUBBO] The connection of /XXXX:57332 -> /XXXX:20880 is established., dubbo version: 2.7.7, current host: XXXX

# A quick count shows that the client established 3600 long-lived connections at startup
$ less /u01/logs/order-service-api_XXX/dubbo.log  | grep NettyClientWorker- |grep  '2021-09-07 15' | wc -l
3600

With that question in mind, I looked at the source:

DubboProtocol#getClients

private ExchangeClient[] getClients(URL url) {
        boolean useShareConnect = false;

        // read the configured number of connections; defaults to 0 when not configured
        int connections = url.getParameter(CONNECTIONS_KEY, 0);
        List<ReferenceCountExchangeClient> shareClients = null;
        // if not configured, connection is shared, otherwise, one connection for one service
        if (connections == 0) {
            // note: if the provider configures connections, shared connections are not used, and shareconnections set on the consumer has no effect
            useShareConnect = true;

            /*
             * The xml configuration should have a higher priority than properties.
             */
            String shareConnectionsStr = url.getParameter(SHARE_CONNECTIONS_KEY, (String) null);
            connections = Integer.parseInt(StringUtils.isBlank(shareConnectionsStr) ? ConfigUtils.getProperty(SHARE_CONNECTIONS_KEY,
                    DEFAULT_SHARE_CONNECTIONS) : shareConnectionsStr);
            shareClients = getSharedClient(url, connections);
        }

        ExchangeClient[] clients = new ExchangeClient[connections];
        for (int i = 0; i < clients.length; i++) {
            if (useShareConnect) {
                clients[i] = shareClients.get(i);

            } else {
                // create and initialize a new connection
                clients[i] = initClient(url);
            }
        }

        return clients;
    }

The problem was that our provider was configured with:

dubbo:
  provider:
    connections: 200

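To explain the code above: if connections is not configured, shared connections are used, and their number is determined by the consumer's shareconnections setting (default 1). Conversely, if connections is configured, that many long-lived connections are established for each service.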
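
The branch described above can be reduced to a small decision function. This is a hypothetical model for illustration only: the names (ConnectionPlanSketch, Plan, plan) are not part of Dubbo, and the default of 1 is taken from the explanation above.

```java
/** Hypothetical model of the shared vs. dedicated connection decision. */
class ConnectionPlanSketch {

    static final int DEFAULT_SHARE_CONNECTIONS = 1;

    /** Result of the decision: connection mode and number of connections. */
    static final class Plan {
        final boolean shared;
        final int count;

        Plan(boolean shared, int count) {
            this.shared = shared;
            this.count = count;
        }

        @Override
        public String toString() {
            return (shared ? "shared" : "dedicated per service") + " x" + count;
        }
    }

    /**
     * @param connections      value of "connections" on the URL, 0 when unset
     * @param shareConnections consumer-side "shareconnections", null when unset
     */
    static Plan plan(int connections, Integer shareConnections) {
        if (connections == 0) {
            // Nothing configured: reuse shared connections, sized by the consumer.
            int count = shareConnections == null ? DEFAULT_SHARE_CONNECTIONS : shareConnections;
            return new Plan(true, count);
        }
        // connections is set: shareConnections is ignored entirely.
        return new Plan(false, connections);
    }

    public static void main(String[] args) {
        System.out.println(plan(0, null)); // shared x1
        System.out.println(plan(0, 4));    // shared x4
        System.out.println(plan(200, 4));  // dedicated per service x200
    }
}
```

The last call is the situation described in this article: once connections is set to 200, the consumer's shareconnections value no longer matters.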

Now let's look at what initClient does:

private ExchangeClient initClient(URL url) {

        // client type setting.
        String str = url.getParameter(CLIENT_KEY, url.getParameter(SERVER_KEY, DEFAULT_REMOTING_CLIENT));

        url = url.addParameter(CODEC_KEY, DubboCodec.NAME);
        // enable heartbeat by default
        url = url.addParameterIfAbsent(HEARTBEAT_KEY, String.valueOf(DEFAULT_HEARTBEAT));

        // BIO is not allowed since it has severe performance issue.
        if (str != null && str.length() > 0 && !ExtensionLoader.getExtensionLoader(Transporter.class).hasExtension(str)) {
            throw new RpcException("Unsupported client type: " + str + "," +
                    " supported client type is " + StringUtils.join(ExtensionLoader.getExtensionLoader(Transporter.class).getSupportedExtensions(), " "));
        }

        ExchangeClient client;
        try {
            // check whether lazy connection is configured
            if (url.getParameter(LAZY_CONNECT_KEY, false)) {
                client = new LazyConnectExchangeClient(url, requestHandler);

            } else {
                // without lazy connection, the long-lived connection is established immediately
                client = Exchangers.connect(url, requestHandler);
            }

        } catch (RemotingException e) {
            throw new RpcException("Fail to create remoting client for service(" + url + "): " + e.getMessage(), e);
        }

        return client;
    }

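As the code shows, without lazy connection the long-lived connection is established right away. So every time a consumer restarts, it establishes (number of services) * 200 * (number of provider containers) long-lived connections. We have 3 services and 6 provider containers, which comes to exactly 3600 long-lived connections.

What about a provider restart? ZK notifies the consumers (about 60 Docker containers), and each of them connects to the newly started provider container. One consumer establishes 200 * 3 connections, so 36000 long-lived connections are created in total.

So every service restart requires establishing a large number of long-lived connections, which makes connection setup take very long (about 10 s by a rough estimate).

Fix: reduce the connection pool size. Load testing showed that a value of 2 is enough.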

dubbo:
  provider:
    connections: 2
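
The connection counts in this section are plain multiplication, so the effect of the change is easy to check. The following worked example uses the figures from the text (3 services, 6 provider containers, about 60 consumer containers); the class and method names are hypothetical.

```java
/** Worked example: long-lived connections opened during a restart. */
class ConnectionCountExample {

    /** Connections one restarting consumer opens towards all providers. */
    static long onConsumerRestart(int services, int connectionsPerService, int providers) {
        return (long) services * connectionsPerService * providers;
    }

    /** Connections all consumers open towards one restarting provider. */
    static long onProviderRestart(int consumers, int connectionsPerService, int services) {
        return (long) consumers * connectionsPerService * services;
    }

    public static void main(String[] args) {
        int services = 3;
        int providers = 6;
        int consumers = 60;
        for (int connections : new int[] {200, 2}) {
            System.out.println("connections=" + connections
                    + " consumerRestart=" + onConsumerRestart(services, connections, providers)
                    + " providerRestart=" + onProviderRestart(consumers, connections, services));
        }
        // prints:
        // connections=200 consumerRestart=3600 providerRestart=36000
        // connections=2 consumerRestart=36 providerRestart=360
    }
}
```

Dropping connections from 200 to 2 cuts both figures by a factor of 100.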

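Alternatively, leave connections unconfigured on the provider and let the consumer decide how many long-lived connections to use. When many long-lived connections are needed, lazy connection can be used. As a guideline, the total number of long-lived connections established at the moment a provider restarts should not exceed 500. With these fixes in place, the restart timeout problem was finally resolved.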
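
The idea behind lazy connection can be sketched without any Dubbo classes. This is a minimal illustration under assumptions, not LazyConnectExchangeClient itself: Connection, LazyClient and the connector supplier are hypothetical stand-ins. The client is created cheaply at startup and only opens its underlying connection on the first request.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical sketch of connect-on-first-use. */
class LazyConnectSketch {

    /** Stand-in for a real long-lived connection. */
    interface Connection {
        String send(String request);
    }

    /** Opens the underlying connection on the first request instead of at construction time. */
    static final class LazyClient {
        private final Supplier<Connection> connector;
        private volatile Connection connection;

        LazyClient(Supplier<Connection> connector) {
            this.connector = connector;
        }

        String send(String request) {
            Connection c = connection;
            if (c == null) {
                synchronized (this) {
                    if (connection == null) {
                        connection = connector.get(); // the expensive connect happens here, once
                    }
                    c = connection;
                }
            }
            return c.send(request);
        }
    }

    public static void main(String[] args) {
        AtomicInteger opened = new AtomicInteger();
        Supplier<Connection> connector = () -> {
            opened.incrementAndGet();
            return request -> "echo:" + request;
        };

        LazyClient[] clients = new LazyClient[200];
        for (int i = 0; i < clients.length; i++) {
            clients[i] = new LazyClient(connector);
        }
        System.out.println("opened after startup: " + opened.get());    // 0
        clients[0].send("ping");
        System.out.println("opened after first call: " + opened.get()); // 1
    }
}
```

The trade-off is that the first request on each client pays the connection setup cost, so lazy connection spreads the work out rather than removing it.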

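Summary

The Dubbo graceful restart problem turned out to be a big pitfall. It also shows how important it is to understand why a configuration parameter is set the way it is; otherwise it can cause unpredictable problems.

We also ran into a thread pool pitfall, which I will cover in the next article.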
