Symptom
Pod IPs on node2 cannot be pinged.
Troubleshooting
# Check calico: calico-node is Running on every node, including node2
[root@master ~]# kubectl get pod -A -owide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-system calico-node-5c4tc 1/1 Running 0 35m 192.168.158.128 master <none> <none>
kube-system calico-node-dnwxf 1/1 Running 0 8m20s 192.168.158.130 node2 <none> <none>
kube-system calico-node-szl4f 1/1 Running 0 35m 192.168.158.129 node1 <none> <none>
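# Check the routes and network state on each node: the other nodes all have tunl0 routes to the remote pod subnets, but node2 has none, which points to a problem with calico on node2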
[root@node1 ~]# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 192.168.158.2 0.0.0.0 UG 0 0 0 ens33
10.244.104.0 192.168.158.130 255.255.255.192 UG 0 0 0 tunl0
10.244.219.64 192.168.158.128 255.255.255.192 UG 0 0 0 tunl0
10.244.166.128 0.0.0.0 255.255.255.192 U 0 0 0 *
10.244.166.158 0.0.0.0 255.255.255.255 UH 0 0 0 calid976e52937f
10.244.166.159 0.0.0.0 255.255.255.255 UH 0 0 0 caliba2f9f8d937
10.244.166.160 0.0.0.0 255.255.255.255 UH 0 0 0 calib210296b50c
10.244.166.161 0.0.0.0 255.255.255.255 UH 0 0 0 cali81c0718f725
10.244.166.172 0.0.0.0 255.255.255.255 UH 0 0 0 cali181376187f2
10.244.166.173 0.0.0.0 255.255.255.255 UH 0 0 0 cali95cbb982c5d
10.244.166.175 0.0.0.0 255.255.255.255 UH 0 0 0 caliea486d05ffa
169.254.0.0 0.0.0.0 255.255.0.0 U 1002 0 0 ens33
172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
192.168.158.0 0.0.0.0 255.255.255.0 U 0 0 0 ens33
# node2 has no tunl0 routes
[root@node2 ~]# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 192.168.158.2 0.0.0.0 UG 100 0 0 ens33
10.244.104.0 0.0.0.0 255.255.255.192 U 0 0 0 *
10.244.104.33 0.0.0.0 255.255.255.255 UH 0 0 0 cali37cbc773c09
10.244.104.42 0.0.0.0 255.255.255.255 UH 0 0 0 cali11c4f9f51a0
10.244.104.43 0.0.0.0 255.255.255.255 UH 0 0 0 cali477a596b3da
10.244.104.44 0.0.0.0 255.255.255.255 UH 0 0 0 cali12d4a061371
10.244.104.45 0.0.0.0 255.255.255.255 UH 0 0 0 cali17b404083e4
10.244.104.46 0.0.0.0 255.255.255.255 UH 0 0 0 cali7a6fbb26daa
172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
192.168.158.0 0.0.0.0 255.255.255.0 U 100 0 0 ens33
[root@node2 ~]# systemctl status network
# Check the calico log on node2, which shows the error in the screenshot below
kubectl logs -f -n kube-system calico-node-dnwxf
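The tunl0 comparison above can be scripted so that it is easy to repeat on every node. The following is only a minimal sketch: count_tunl0_routes is a hypothetical helper name, and it assumes route -n style output where the interface is the last column. In a three-node cluster using IP-in-IP, as in the node1 output above, each node would be expected to show one tunl0 route per remote node.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count routes whose outgoing interface (last column)
# is tunl0, reading `route -n` style output from stdin.
count_tunl0_routes() {
  awk '$NF == "tunl0" { n++ } END { print n + 0 }'
}

# Usage on a node (run manually):
#   route -n | count_tunl0_routes
# A result of 0, as on node2 here, suggests calico has not programmed
# the tunnel routes on that node.
```

A node that prints 0 while the others print a non-zero count is the one to inspect first.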
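Resolution
I searched on Baidu and found other people who had hit the same problem, though their fixes differed. This issue, https://github.com/projectcalico/calico/issues/3134,
mentions the NetworkManager service. Checking my own nodes, I found that NetworkManager was turned off on all of them except node2, where it was still running. After I stopped NetworkManager on node2, calico on that node returned to normal and stopped reporting errors, and the pod IPs on node2 could be pinged again.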
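The fix can be sketched as shell functions. Option A mirrors what was done on node2. Option B is an alternative that was not used here: as far as I recall, Calico's documentation describes telling NetworkManager to leave Calico's interfaces unmanaged, and the file path and interface patterns below should be checked against the docs for the Calico version in use. The function names are hypothetical, and nothing runs until a function is called as root.

```shell
#!/usr/bin/env bash
# Option A: stop NetworkManager and keep it from starting at boot
# (what this write-up did on node2). Make sure another service manages
# the host NIC first, or the node may lose its network configuration.
disable_network_manager() {
  systemctl stop NetworkManager
  systemctl disable NetworkManager
}

# Option B (assumption, not tested in this write-up): print a NetworkManager
# drop-in that marks Calico's interfaces as unmanaged.
calico_nm_conf() {
  cat <<'EOF'
[keyfile]
unmanaged-devices=interface-name:cali*;interface-name:tunl*;interface-name:vxlan.calico
EOF
}

# Example for option B (run manually as root):
#   calico_nm_conf > /etc/NetworkManager/conf.d/calico.conf
#   systemctl restart NetworkManager
```

After either option, running route -n on node2 again should show whether the tunl0 routes have appeared.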