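Monitoring raised an alert that one MGR node had failed. On inspection, LVS had already failed over and was now pointing at MGR's new write node. The investigation follows.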
/var/log/messages
Mar 27 16:51:05 db10 kernel: crond invoked oom-killer: gfp_mask=0x3000d0, order=2, oom_score_adj=0
Mar 27 16:51:05 db10 kernel: crond cpuset=/ mems_allowed=0-1
Mar 27 16:51:05 db10 kernel: CPU: 35 PID: 12090 Comm: crond Tainted: G OE ------------ 3.10.0-693.21.1.el7.x86_64 #1
Mar 27 16:51:05 db10 kernel: Hardware name: Inspur SA5212M4/YZMB-00370-109, BIOS 4.1.16 06/21/2018
Mar 27 16:51:05 db10 kernel: Call Trace:
Mar 27 16:51:05 db10 kernel: [<ffffffff816ae7c8>] dump_stack+0x19/0x1b
Mar 27 16:51:05 db10 kernel: [<ffffffff816a9b90>] dump_header+0x90/0x229
Mar 27 16:51:05 db10 kernel: [<ffffffff810ecec2>] ? ktime_get_ts64+0x52/0xf0
Mar 27 16:51:05 db10 kernel: [<ffffffff8114140f>] ? delayacct_end+0x8f/0xb0
Mar 27 16:51:05 db10 kernel: [<ffffffff8118a884>] oom_kill_process+0x254/0x3d0
Mar 27 16:51:05 db10 kernel: [<ffffffff8118a32d>] ? oom_unkillable_task+0xcd/0x120
Mar 27 16:51:05 db10 kernel: [<ffffffff8118a3d6>] ? find_lock_task_mm+0x56/0xc0
Mar 27 16:51:05 db10 kernel: [<ffffffff8118b0c6>] out_of_memory+0x4b6/0x4f0
Mar 27 16:51:05 db10 kernel: [<ffffffff816aa694>] __alloc_pages_slowpath+0x5d6/0x724
Mar 27 16:51:05 db10 kernel: [<ffffffff811912a5>] __alloc_pages_nodemask+0x405/0x420
Mar 27 16:51:05 db10 kernel: [<ffffffff8108859d>] copy_process+0x1dd/0x1970
Mar 27 16:51:05 db10 kernel: [<ffffffff81121930>] ? audit_filter_rules.isra.8+0x280/0xf90
Mar 27 16:51:05 db10 kernel: [<ffffffff81089ee1>] do_fork+0x91/0x320
Mar 27 16:51:05 db10 kernel: [<ffffffff8108a1f6>] SyS_clone+0x16/0x20
Mar 27 16:51:05 db10 kernel: [<ffffffff816c0ad4>] stub_clone+0x44/0x70
Mar 27 16:51:05 db10 kernel: [<ffffffff816c0715>] ? system_call_fastpath+0x1c/0x21
Mar 27 16:51:05 db10 kernel: Mem-Info:
Mar 27 16:51:05 db10 kernel: active_anon:32289123 inactive_anon:180550 isolated_anon:0#012 active_file:960 inactive_file:195 isolated_file:0#012 unevictable:0 dirty:48 writeback:0 unstable:0#012 slab_reclaimable:59079 slab_unreclaimable:32778#012 mapped:13096 shmem:534843 pagetables:66034 bounce:0#012 free:96590 free_pcp:105 free_cma:0
Mar 27 16:51:05 db10 kernel: Node 0 DMA free:13540kB min:8kB low:8kB high:12kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15984kB managed:15900kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Mar 27 16:51:05 db10 kernel: lowmem_reserve[]: 0 1680 64143 64143
Mar 27 16:51:05 db10 kernel: Node 0 DMA32 free:250600kB min:1176kB low:1468kB high:1764kB active_anon:1442100kB inactive_anon:464kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1934208kB managed:1722948kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB shmem:1740kB slab_reclaimable:11840kB slab_unreclaimable:7640kB kernel_stack:368kB pagetables:1132kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Mar 27 16:51:05 db10 kernel: lowmem_reserve[]: 0 0 62462 62462
Mar 27 16:51:05 db10 kernel: Node 0 Normal free:54592kB min:43744kB low:54680kB high:65616kB active_anon:62871276kB inactive_anon:371740kB active_file:12kB inactive_file:24kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:65011712kB managed:63961888kB mlocked:0kB dirty:0kB writeback:0kB mapped:1028kB shmem:1190332kB slab_reclaimable:124084kB slab_unreclaimable:45492kB kernel_stack:4768kB pagetables:92984kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Mar 27 16:51:05 db10 kernel: lowmem_reserve[]: 0 0 0 0
Mar 27 16:51:05 db10 kernel: Node 1 Normal free:68040kB min:45176kB low:56468kB high:67764kB active_anon:64843172kB inactive_anon:349996kB active_file:0kB inactive_file:160kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:67108864kB managed:66056756kB mlocked:0kB dirty:192kB writeback:0kB mapped:50080kB shmem:947300kB slab_reclaimable:100392kB slab_unreclaimable:77980kB kernel_stack:28736kB pagetables:170020kB unstable:0kB bounce:0kB free_pcp:640kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:55 all_unreclaimable? no
Mar 27 16:51:05 db10 kernel: lowmem_reserve[]: 0 0 0 0
Mar 27 16:51:05 db10 kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 1*64kB (U) 1*128kB (U) 0*256kB 0*512kB 1*1024kB (U) 2*2048kB (UM) 2*4096kB (M) = 13540kB
Mar 27 16:51:05 db10 kernel: Node 0 DMA32: 264*4kB (UEM) 403*8kB (UEM) 475*16kB (UEM) 342*32kB (UEM) 391*64kB (UEM) 300*128kB (UEM) 208*256kB (UEM) 107*512kB (UEM) 45*1024kB (EM) 5*2048kB (E) 0*4096kB = 250600kB
Mar 27 16:51:05 db10 kernel: Node 0 Normal: 13593*4kB (UEM) 22*8kB (UM) 9*16kB (M) 2*32kB (M) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 54756kB
Mar 27 16:51:05 db10 kernel: Node 1 Normal: 16649*4kB (UEM) 8*8kB (UM) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 66660kB
Mar 27 16:51:05 db10 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Mar 27 16:51:05 db10 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Mar 27 16:51:05 db10 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Mar 27 16:51:05 db10 kernel: Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Mar 27 16:51:05 db10 kernel: 535067 total pagecache pages
Mar 27 16:51:05 db10 kernel: 0 pages in swap cache
Mar 27 16:51:05 db10 kernel: Swap cache stats: add 0, delete 0, find 0/0
Mar 27 16:51:05 db10 kernel: Free swap = 0kB
Mar 27 16:51:05 db10 kernel: Total swap = 0kB
Mar 27 16:51:05 db10 kernel: 33517692 pages RAM
Mar 27 16:51:05 db10 kernel: 0 pages HighMem/MovableOnly
Mar 27 16:51:05 db10 kernel: 578319 pages reserved
Mar 27 16:51:05 db10 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Mar 27 16:51:05 db10 kernel: [ 6050] 0 6050 35461 19476 75 0 0 systemd-journal
Mar 27 16:51:05 db10 kernel: [ 6075] 0 6075 30235 80 28 0 0 lvmetad
Mar 27 16:51:05 db10 kernel: [ 6094] 0 6094 10898 172 24 0 -1000 systemd-udevd
Mar 27 16:51:05 db10 kernel: [11985] 0 11985 4845 104 15 0 0 irqbalance
Mar 27 16:51:05 db10 kernel: [11988] 995 11988 25173 71 20 0 0 chronyd
Mar 27 16:51:06 db10 kernel: [11989] 81 11989 6709 161 21 0 -900 dbus-daemon
Mar 27 16:51:06 db10 kernel: [12004] 0 12004 31998 151 22 0 0 smartd
Mar 27 16:51:06 db10 kernel: [12006] 996 12006 2144 37 10 0 0 lsmd
Mar 27 16:51:06 db10 kernel: [12009] 0 12009 186971 9901 237 0 0 rsyslogd
Mar 27 16:51:06 db10 kernel: [12016] 0 12016 1105 39 8 0 0 rngd
Mar 27 16:51:06 db10 kernel: [12034] 0 12034 6620 99 19 0 0 systemd-logind
Mar 27 16:51:06 db10 kernel: [12068] 0 12068 5955 48 17 0 0 atd
Mar 27 16:51:06 db10 kernel: [12090] 0 12090 31058 165 19 0 0 crond
Mar 27 16:51:06 db10 kernel: [12242] 0 12242 1055 19 7 0 0 supervise
Mar 27 16:51:06 db10 kernel: [12243] 0 12243 28807 54 14 0 0 run
Mar 27 16:51:06 db10 kernel: [12260] 0 12260 139002 3217 93 0 0 tuned
Mar 27 16:51:06 db10 kernel: [12273] 0 12273 27021 242 54 0 -1000 sshd
Mar 27 16:51:06 db10 kernel: [12316] 0 12316 27523 33 10 0 0 agetty
Mar 27 16:51:06 db10 kernel: [12319] 0 12319 20378 199 38 0 0 hooagentd
Mar 27 16:51:06 db10 kernel: [12324] 0 12324 80468 586 57 0 0 hooagent
Mar 27 16:51:06 db10 kernel: [12804] 0 12804 22895 259 43 0 0 master
Mar 27 16:51:06 db10 kernel: [12831] 89 12831 22965 281 45 0 0 qmgr
Mar 27 16:51:06 db10 kernel: [13103] 0 13103 828994 4025 115 0 0 wonder-agent
Mar 27 16:51:06 db10 kernel: [20985] 0 20985 175106 1241 72 0 -1000 logmon
Mar 27 16:51:06 db10 kernel: [18570] 42583 18570 32515 159 19 0 0 screen
Mar 27 16:51:06 db10 kernel: [18571] 42583 18571 29229 485 15 0 0 bash
Mar 27 16:51:06 db10 kernel: [22385] 42583 22385 32515 153 19 0 0 screen
Mar 27 16:51:06 db10 kernel: [22386] 42583 22386 29230 485 16 0 0 bash
Mar 27 16:51:06 db10 kernel: [22416] 42583 22416 32515 154 20 0 0 screen
Mar 27 16:51:06 db10 kernel: [22417] 42583 22417 29230 485 13 0 0 bash
Mar 27 16:51:06 db10 kernel: [12032] 0 12032 28326 102 13 0 0 mysqld_safe
Mar 27 16:51:06 db10 kernel: [13363] 33173 13363 74431932 31903076 64367 0 0 mysqld
Mar 27 16:51:06 db10 kernel: [33949] 0 33949 14918 7466 33 0 0 mysqld_exporter
Mar 27 16:51:06 db10 kernel: [ 6287] 0 6287 663221 5068 121 0 0 bbmon
Mar 27 16:51:06 db10 kernel: [ 6621] 89 6621 22921 255 46 0 0 pickup
Mar 27 16:51:06 db10 kernel: [ 6957] 89 6957 22922 256 44 0 0 trivial-rewrite
Mar 27 16:51:06 db10 kernel: [ 7033] 0 7033 45072 238 45 0 0 crond
Mar 27 16:51:06 db10 kernel: [ 7045] 0 7045 28274 48 13 0 0 sh
Mar 27 16:51:06 db10 kernel: [ 7054] 0 7054 372238 1382 69 0 0 dbvip
Mar 27 16:51:06 db10 kernel: [ 7421] 0 7421 47770 1426 49 0 0 python
Mar 27 16:51:06 db10 kernel: [ 7422] 0 7422 4935 159 12 0 0 msval
Mar 27 16:51:06 db10 kernel: Out of memory: Kill process 5396 (mysqld) score 970 or sacrifice child
Mar 27 16:51:06 db10 kernel: Killed process 13363 (mysqld) total-vm:297727728kB, anon-rss:127612364kB, file-rss:0kB, shmem-rss:0kB
The direct cause is that the mysqld process below was killed by the OOM killer:
Mar 27 16:51:06 db10 kernel: Killed process 13363 (mysqld) total-vm:297727728kB, anon-rss:127612364kB, file-rss:0kB, shmem-rss:0kB
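Looking further up, the process table shows how much memory mysqld was using. Note the units: total_vm and rss are counted in 4 KiB pages, not in kB. The 74431932 in the line below is total_vm (about 284 GiB of virtual address space), while the rss of 31903076 pages is about 121.7 GiB, which matches anon-rss:127612364kB in the Killed line. The machine has 128G of physical memory and no swap (Total swap = 0kB), so mysqld alone held almost all of it: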
Mar 27 16:51:06 db10 kernel: [13363] 33173 13363 74431932 31903076 64367 0 0 mysqld
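To make the unit conversion concrete, here is a minimal Python sketch that parses one row of the OOM process table and converts the page counts to GiB. The function names are made up for illustration, and the 4 KiB page size is an assumption (the x86_64 default; `getconf PAGESIZE` confirms it on a given host).

```python
PAGE_KIB = 4  # assumption: 4 KiB pages, the default on x86_64


def pages_to_gib(pages: int) -> float:
    """Convert a page count from the OOM process table to GiB."""
    return pages * PAGE_KIB / (1024 * 1024)


def parse_oom_task_line(line: str) -> dict:
    """Parse one row of the '[ pid ] uid tgid total_vm rss ...' table.

    Everything after the closing ']' of the pid column is split on
    whitespace: uid, tgid, total_vm, rss, nr_ptes, swapents,
    oom_score_adj, name.
    """
    fields = line.split("]", 1)[1].split()
    return {
        "tgid": int(fields[1]),
        "total_vm_pages": int(fields[2]),
        "rss_pages": int(fields[3]),
        "name": fields[7],
    }


row = parse_oom_task_line(
    "Mar 27 16:51:06 db10 kernel: [13363] 33173 13363 "
    "74431932 31903076 64367 0 0 mysqld"
)
print(row["name"], "total_vm GiB:", round(pages_to_gib(row["total_vm_pages"]), 1))
print(row["name"], "rss GiB:", round(pages_to_gib(row["rss_pages"]), 1))
```

The same conversion applied to the "33517692 pages RAM" line gives just under 128 GiB, which agrees with the machine's physical memory.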
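Further up, the log also mentions Node 0, Node 1, hugepages_total, and swap. These relate mainly to NUMA and huge pages, and I am skipping both topics for now. Since mysqld was killed with its resident memory this close to the machine's physical memory, I first lowered MySQL's innodb_buffer_pool_size from 80G to 64G as a safeguard against a recurrence, and will continue the investigation from there.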
mysql> show variables like '%pool_size%';
+-------------------------+-------------+
| Variable_name           | Value       |
+-------------------------+-------------+
| innodb_buffer_pool_size | 85899345920 |
+-------------------------+-------------+
1 row in set (0.00 sec)

mysql> select 64*1024*1024*1024;
+-------------------+
| 64*1024*1024*1024 |
+-------------------+
|       68719476736 |
+-------------------+
1 row in set (0.00 sec)

mysql> set global innodb_buffer_pool_size=68719476736;
Query OK, 0 rows affected (0.00 sec)

mysql> show global variables like '%pool_size%';
+-------------------------+-------------+
| Variable_name           | Value       |
+-------------------------+-------------+
| innodb_buffer_pool_size | 68719476736 |
+-------------------------+-------------+
1 row in set (0.00 sec)
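Note that the configuration file must be updated as well, otherwise the old value comes back at the next restart. After the change, memory is released gradually rather than all at once: the buffer pool shrinks in the background, and memory that is still in use is not released.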
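One detail worth checking when picking a new value: MySQL keeps innodb_buffer_pool_size at a multiple of innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances and, as documented for online resizing, adjusts a non-conforming request up to the next multiple. The sketch below reproduces that arithmetic in Python. The helper name is hypothetical, and the defaults (128 MiB chunks, 8 instances) are assumptions, not values read from this server; the real ones come from `show variables`.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def adjusted_pool_size(requested: int,
                       chunk_size: int = 128 * MIB,  # assumed default
                       instances: int = 8) -> int:   # assumed default
    """Round a requested buffer pool size up to the next multiple of
    chunk_size * instances, mirroring MySQL's documented adjustment."""
    unit = chunk_size * instances
    return -(-requested // unit) * unit  # ceiling division, then scale back


requested = 64 * GIB
print(requested)                      # the value passed to SET GLOBAL
print(adjusted_pool_size(requested))  # unchanged if already a multiple
```

Under these assumed defaults, 64G is already a multiple of 1 GiB (128 MiB * 8), so the value set in the session is kept as-is. The progress of the background resize can be followed with `show status like 'Innodb_buffer_pool_resize_status'`.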