记SpringBoot应用无法注册到Nacos的排查与反思

一. 概述

       公司最近在做组件的总体升级,本人需要负责其中的一些基础组件。其中有一项就是注册中心,配置中心的切换。目前公司主要使用的注册中心是Eureka,配置中心是Spring Cloud Config,经过讨论,最终决定使用Nacos进行替换。Nacos作为配置中心和注册中心, 提供了强大的动态服务发现、服务配置、服务管理等功能,而且提供了简洁大方的控制台供用户操作,Nacos在开源界有强大的生态圈,更多的介绍请参考官网:https://nacos.io/zh-cn/docs/what-is-nacos.html

因此,需要将Eureka上的旧服务迁移到Nacos上。在迁移的过程中,应用所使用的组件版本为:

组件名称 版本号
SpringBoot版本 2.0.1.RELEASE
SpringCloud版本 Finchley.RELEASE
Alibaba版本 2.0.1.RELEASE
Druid版本 1.1.0

改造后,发现SpringBoot应用启动后,直接卡住,日志如下所示:

2019-12-04 15:39:59.412  INFO 28684 --- [ost-startStop-1] o.s.web.context.ContextLoader            : Root WebApplicationContext: initialization completed in 2186 ms
2019-12-04 15:39:59.556  INFO 28684 --- [ost-startStop-1] o.s.b.w.servlet.ServletRegistrationBean  : Servlet dispatcherServlet mapped to [/]
2019-12-04 15:39:59.558  INFO 28684 --- [ost-startStop-1] o.s.b.w.servlet.ServletRegistrationBean  : Servlet statViewServlet mapped to [/druid/*]
2019-12-04 15:39:59.562  INFO 28684 --- [ost-startStop-1] o.s.b.w.servlet.FilterRegistrationBean   : Mapping filter: 'characterEncodingFilter' to: [/*]
2019-12-04 15:39:59.563  INFO 28684 --- [ost-startStop-1] o.s.b.w.servlet.FilterRegistrationBean   : Mapping filter: 'hiddenHttpMethodFilter' to: [/*]
2019-12-04 15:39:59.563  INFO 28684 --- [ost-startStop-1] o.s.b.w.servlet.FilterRegistrationBean   : Mapping filter: 'httpPutFormContentFilter' to: [/*]
2019-12-04 15:39:59.563  INFO 28684 --- [ost-startStop-1] o.s.b.w.servlet.FilterRegistrationBean   : Mapping filter: 'requestContextFilter' to: [/*]
2019-12-04 15:39:59.563  INFO 28684 --- [ost-startStop-1] o.s.b.w.servlet.FilterRegistrationBean   : Mapping filter: 'webStatFilter' to urls: [/*]

       日志输出到  Mapping filter: 'webStatFilter' to urls: [/*] 后,就不再输出,而且在Nacos上查看发现应用并没有注册上去。下面开始排查。

二. 排查过程

1. 梳理排查思路

       一开始先思考,是应用代码的问题,还是环境的问题。该应用在本地可以正常启动并注册到Nacos上,而部署到服务器上时,却无法正常启动,说明很可能是环境的问题。接下来需要进行验证。可以先在另一台服务器上进行部署,如果也可以,那么证明是这台服务器的环境问题,如果也不行,那么证明有可能不是环境问题。此时就需要继续考虑其他因素。

       于是我尝试将应用部署到另一台服务器,结果应用这次能正常启动,验证了我的观点:第一次部署应用的服务器环境存在问题。环境问题可能是操作系统版本,软件版本,或者是网络策略等问题。但是这里比较棘手的一点是,应用是直接卡住了,并没有任何error日志输出(这很头疼)。于是只能先尝试进行日志分析。

2. 应用日志分析

看了一下日志,发现应用日志卡在这一行:

2019-12-04 15:39:59.563  INFO 28684 --- [ost-startStop-1] o.s.b.w.servlet.FilterRegistrationBean   : Mapping filter: 'webStatFilter' to urls: [/*]

       这里的filter信息很重要:是一个webStatFilter。通过查阅知识可得,这是一个Druid的Filter。那么我就好奇了,为什么应用会卡在一个数据源的Filter上呢?难道是数据库连不上?

       想到这里,我就有点激动了,感觉到很可能是这个问题,但是日志却没有任何输出,这个确实很奇怪。因此,我决定再进一步,直接跟踪系统调用,看下操作系统到底做了啥?

3. 跟踪系统调用

于是,我通过命令查看系统调用:

strace -f -o strace.out nohup java -jar xxx.jar  &

这里的 -o 参数,可以把日志输出到 strace.out。拿到文件一看,我得到意想不到的收获!

30095 futex(0x7f38610c7ff4, FUTEX_WAIT_BITSET_PRIVATE, 1, {603078, 717936721}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
30095 futex(0x7f38610c7fc8, FUTEX_WAKE_PRIVATE, 1) = 0
30095 futex(0x7f38610c7ff4, FUTEX_WAIT_BITSET_PRIVATE, 1, {603078, 728500415}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
30095 futex(0x7f38610c7fc8, FUTEX_WAKE_PRIVATE, 1) = 0
30095 futex(0x7f38610c7ff4, FUTEX_WAIT_BITSET_PRIVATE, 1, {603078, 739076043}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
30095 futex(0x7f38610c7fc8, FUTEX_WAKE_PRIVATE, 1) = 0
30095 futex(0x7f38610c7ff4, FUTEX_WAIT_BITSET_PRIVATE, 1, {603078, 749474753}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
30095 futex(0x7f38610c7fc8, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
30085 <... futex resumed> )             = -1 ETIMEDOUT (Connection timed out)
30095 <... futex resumed> )             = 0
30085 futex(0x7f386018c128, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
30095 futex(0x7f38610c7ff4, FUTEX_WAIT_BITSET_PRIVATE, 1, {603078, 759949491}, ffffffff <unfinished ...>
30085 <... futex resumed> )             = 0
30085 futex(0x7f386018c154, FUTEX_WAIT_BITSET_PRIVATE, 1, {603078, 800171372}, ffffffff <unfinished ...>
30095 <... futex resumed> )             = -1 ETIMEDOUT (Connection timed out)
30095 futex(0x7f38610c7fc8, FUTEX_WAKE_PRIVATE, 1) = 0
30095 futex(0x7f38610c7ff4, FUTEX_WAIT_BITSET_PRIVATE, 1, {603078, 770506451}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
30095 futex(0x7f38610c7fc8, FUTEX_WAKE_PRIVATE, 1) = 0
30095 futex(0x7f38610c7ff4, FUTEX_WAIT_BITSET_PRIVATE, 1, {603078, 780956053}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
30095 futex(0x7f38610c7fc8, FUTEX_WAKE_PRIVATE, 1) = 0
30095 futex(0x7f38610c7ff4, FUTEX_WAIT_BITSET_PRIVATE, 1, {603078, 791363319}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
30095 futex(0x7f38610c7fc8, FUTEX_WAKE_PRIVATE, 1) = 0
30095 futex(0x7f38610c7ff4, FUTEX_WAIT_BITSET_PRIVATE, 1, {603078, 801613419}, ffffffff <unfinished ...>
30085 <... futex resumed> )             = -1 ETIMEDOUT (Connection timed out)
30085 futex(0x7f386018c128, FUTEX_WAKE_PRIVATE, 1) = 0
30085 futex(0x7f386018c154, FUTEX_WAIT_BITSET_PRIVATE, 1, {603078, 850747694}, ffffffff <unfinished ...>
30095 <... futex resumed> )             = -1 ETIMEDOUT (Connection timed out)
30095 futex(0x7f38610c7fc8, FUTEX_WAKE_PRIVATE, 1) = 0
30095 futex(0x7f38610c7ff4, FUTEX_WAIT_BITSET_PRIVATE, 1, {603078, 812186450}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
30095 futex(0x7f38610c7fc8, FUTEX_WAKE_PRIVATE, 1) = 0
30095 futex(0x7f38610c7ff4, FUTEX_WAIT_BITSET_PRIVATE, 1, {603078, 822697279}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
30095 futex(0x7f38610c7fc8, FUTEX_WAKE_PRIVATE, 1) = 0
30095 futex(0x7f38610c7ff4, FUTEX_WAIT_BITSET_PRIVATE, 1, {603078, 833305155}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
30095 futex(0x7f38610c7fc8, FUTEX_WAKE_PRIVATE, 1) = 0
30095 futex(0x7f38610c7ff4, FUTEX_WAIT_BITSET_PRIVATE, 1, {603078, 843815920}, ffffffff) = -1 ETIMEDOUT (Connection timed out)

可以看到,futex 疯狂的Connection timed out。。。

于是,我顺藤摸瓜,根据进程号30095的进程,继续往上查找,找到最顶部:

30107 read(5, "PK\3\4\n\0\0\10\10\0[m\353N\177\240\202\253\4\2\0\0\324\3\0\0-\0\0\0", 30) = 30
30107 lseek(5, 31513185, SEEK_SET)      = 31513185
30107 read(5, "\215S]O\23Q\20=\323\226n\273.BZJ\21EE\245\264EX*\302\v\306DQ\202\246\362"..., 516) = 516
30107 lseek(3, 48793319, SEEK_SET)      = 48793319
30107 read(3, "PK\3\4\n\0\0\10\0\0;\211|L2\247\2100j\r\0\0j\r\0\0!\0\0\0", 30) = 30
30107 lseek(3, 48793382, SEEK_SET)      = 48793382
30107 read(3, "\312\376\272\276\0\0\0004\0\241\7\0?\1\0\23$assertionsDisab"..., 3434) = 3434
30095 <... futex resumed> )             = -1 ETIMEDOUT (Connection timed out)
30081 mmap(0x7f3851de0000, 65536, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0 <unfinished ...>
30095 futex(0x7f38610c7fc8, FUTEX_WAKE_PRIVATE, 1) = 0
30081 <... mmap resumed> )              = 0x7f3851de0000
30095 futex(0x7f38610c7ff4, FUTEX_WAIT_BITSET_PRIVATE, 1, {603055, 953559058}, ffffffff <unfinished ...>

看到上面是进程号30107的进程,于是继续向上查找,最终找到这几行:

30107 socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 15
30107 setsockopt(15, SOL_IPV6, IPV6_V6ONLY, [0], 4) = 0
30107 setsockopt(15, SOL_TCP, TCP_NODELAY, [1], 4) = 0
30107 setsockopt(15, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
30107 connect(15, {sa_family=AF_INET6, sin6_port=htons(3306), inet_pton(AF_INET6, "::ffff:10.0.24.241", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28 <unfinished ...>
30107 getsockname(15, {sa_family=AF_INET6, sin6_port=htons(33917), inet_pton(AF_INET6, "::ffff:10.0.17.211", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
30107 ioctl(15, FIONREAD, [0])          = 0
30107 recvfrom(15,  <unfinished ...>
30082 futex(0x7f386897d4c0, FUTEX_WAKE_PRIVATE, 1) = 0
30095 <... futex resumed> )             = -1 ETIMEDOUT (Connection timed out)

可以看到,通过socket创建一个socket,然后通过connect连接,这里的端口为3306,表明是MySQL的数据库连接!

到这里,真相大白了。

应用使用了Druid,Druid在启动时,通过Filter,创建数据库连接,然后连接一直超时,然后就一直重试,然后应用就卡在那里了。

4. 问题解决

开通服务器到数据库的网络策略,然后问题解决。

另外,升级Druid的版本到1.1.10。这个版本的Druid会输出Error日志,以供分析(同时观察到这个版本使用的人数是最多的)。

三. 总结与反思

1. 排查问题时,多利用日志,有时候会有意想不到的收获。(webStatFilter)

2. 不要太在意表象的东西,有时候里面的内容才是我们需要关心的。(strace)

3. 编码时一些规范的遵守:如果不输出日志或者日志输出不合理,那么可以想象,排查问题时,一定很头疼。(尤其error级别)

发布了19 篇原创文章 · 获赞 20 · 访问量 5841

猜你喜欢

转载自blog.csdn.net/qq_15898739/article/details/103393272