How to fix hung_task_timeout_secs and blocked for more than 120 seconds problem

Author:Skate
Time:2015/03/04

Symptom: the system hangs; it still answers ping, but ssh does not respond.

Check the message log:
[1379100.801689] [<ffffffff81536f95>] page_fault+0x25/0x30
[1379100.801693] INFO: task java:710923 blocked for more than 120 seconds.
[1379100.801766] Not tainted 2.6.32-042stab104.1 #1
[1379100.801835] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1379100.801963] java D ffff8800372d7200 0 710923 709954 67084186 0x00000000
[1379100.801968] ffff880e57e71cf0 0000000000000082 ffffea00021a8fc0 ffff880e57e71c68
[1379100.801972] ffffffff81155c60 ffff8800372d7200 ffffea00021a8fc0 ffff88100c409638
[1379100.801976] 00000007fa23bffc ffff880e57e71c78 ffffffff81155cd1 ffff880e57e71ca8
[1379100.801980] Call Trace:
[1379100.801984] [<ffffffff81155c60>] ? __lru_cache_add+0x40/0x90
[1379100.801988] [<ffffffff81155cd1>] ? lru_cache_add_lru+0x21/0x40
[1379100.801992] [<ffffffff81172c9c>] ? handle_pte_fault+0x65c/0x1040
[1379100.801996] [<ffffffff81536705>] rwsem_down_failed_common+0x95/0x1d0
[1379100.802000] [<ffffffff81536896>] rwsem_down_read_failed+0x26/0x30
[1379100.802004] [<ffffffff812a6a34>] call_rwsem_down_read_failed+0x14/0x30
[1379100.802008] [<ffffffff81535d94>] ? down_read+0x24/0x30
[1379100.802011] [<ffffffff8104dffe>] __do_page_fault+0x18e/0x480
[1379100.802015] [<ffffffff8106f0c8>] ? finish_task_switch+0xc8/0x120
[1379100.802019] [<ffffffff81539c2e>] do_page_fault+0x3e/0xa0
[1379100.802022] [<ffffffff81536f95>] page_fault+0x25/0x30


The load average on the host reached about 460.
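On Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones, so a load this high is consistent with many blocked tasks. The following is a minimal sketch for pulling the blocked tasks out of the kernel log and out of a process listing. The function names hung_tasks and d_state_tasks are hypothetical helpers, not standard tools, and the ps invocation in the comments assumes procps.

```sh
#!/bin/sh
# Print "command pid" for every hung-task report in kernel log text on stdin.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than.*/\1 \2/p'
}

# Read "state pid command" rows on stdin and print "pid command" for rows
# whose state starts with D (uninterruptible sleep).
d_state_tasks() {
    awk '$1 ~ /^D/ { print $2, $3 }'
}

# Typical use on the affected host:
#   dmesg | hung_tasks | sort | uniq -c
#   ps -e -o state= -o pid= -o comm= | d_state_tasks
```

Feeding the log shown above through hung_tasks reports the java task with PID 710923.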

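Linux buffers file writes in memory as dirty pages before writing them to disk.
The referenced article gives the default limit as 40% of available memory; the
actual value of vm.dirty_ratio depends on the kernel and distribution. Once that
limit is reached, the kernel flushes the outstanding data to disk and every
process that writes is blocked until its data has been written, so all following
I/O becomes synchronous. The 120 seconds in the message is
kernel.hung_task_timeout_secs: the kernel reports any task that stays blocked in
uninterruptible sleep (state D) for longer than that. In this case the I/O
subsystem is not fast enough to flush the data within 120 seconds. While it
responds slowly, more requests keep arriving, system memory fills up, and the
tasks serving those requests (here the java process) stay blocked long enough to
trigger the error above.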
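To see whether dirty pages are actually piling up, the Dirty and Writeback counters in /proc/meminfo can be watched while the problem occurs. A minimal sketch, assuming a Linux host; dirty_kb is a hypothetical helper name, and the optional file argument exists only so the function can be tried on a saved copy of meminfo.

```sh
#!/bin/sh
# Print the Dirty and Writeback counters (values are in kB) from a
# meminfo-format file. Defaults to /proc/meminfo.
dirty_kb() {
    awk '$1 == "Dirty:" || $1 == "Writeback:" { print $1, $2 }' "${1:-/proc/meminfo}"
}

# Typical use: sample once per second while the load is high.
#   while true; do dirty_kb; sleep 1; done
```

If Dirty stays large and drains slowly, the disk is not keeping up with the write rate.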


Solution:

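1. Lowering vm.dirty_ratio and vm.dirty_background_ratio can avoid this problem

Apply immediately (the values are lost at the next reboot):
# sysctl -w vm.dirty_ratio=10
# sysctl -w vm.dirty_background_ratio=5

Make the change permanent by adding the settings to /etc/sysctl.conf:
# vi /etc/sysctl.conf
vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

Load the file without waiting for a reboot:
# sysctl -p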
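After changing the settings it is worth confirming that the running values and the persisted values agree. A minimal sketch: conf_value is a hypothetical helper that reads the last assignment of a key from a sysctl.conf-style file, and it assumes simple "key = value" lines with # or ; comments.

```sh
#!/bin/sh
# Print the last value assigned to a key in a sysctl.conf-style file.
# Usage: conf_value KEY [FILE]   (FILE defaults to /etc/sysctl.conf)
conf_value() {
    key=$1
    file=${2:-/etc/sysctl.conf}
    awk -F= -v k="$key" '
        /^[ \t]*[#;]/ { next }                  # skip comment lines
        {
            name = $1; val = $2
            gsub(/[ \t]/, "", name)             # strip blanks around the key
            gsub(/^[ \t]+|[ \t]+$/, "", val)    # trim the value
            if (name == k) found = val          # later lines override earlier ones
        }
        END { if (found != "") print found }
    ' "$file"
}

# Typical use: compare the running kernel value with the persisted one.
#   [ "$(sysctl -n vm.dirty_ratio)" = "$(conf_value vm.dirty_ratio)" ] \
#       && echo "vm.dirty_ratio matches" || echo "vm.dirty_ratio differs"
```

A mismatch usually means the file was edited but not yet loaded, or a value was set with sysctl -w and never written to the file.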

2. Find the processes that consume the most resources, then optimize them
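One way to find such processes is to rank the process list by CPU or memory share. A minimal sketch, assuming procps ps; top_by is a hypothetical helper that sorts rows on stdin by one numeric column, highest first. The command name is placed last so that names containing spaces do not shift the numeric columns.

```sh
#!/bin/sh
# Sort whitespace-separated rows on stdin by numeric column COL, descending,
# and keep the first N rows.
# Usage: top_by COL [N]   (N defaults to 10)
top_by() {
    col=$1
    n=${2:-10}
    LC_ALL=C sort -k "${col},${col}nr" | head -n "$n"
}

# Typical use (columns: pid, %cpu, %mem, command):
#   ps -e -o pid= -o pcpu= -o pmem= -o comm= | top_by 2 5   # top 5 by CPU
#   ps -e -o pid= -o pcpu= -o pmem= -o comm= | top_by 3 5   # top 5 by memory
```

This ranks CPU and memory only; for a problem caused by slow writeback, per-process disk I/O is the more relevant figure and needs a dedicated I/O monitoring tool.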


Reference: http://www.blackmoreops.com/2014/09/22/linux-kernel-panic-issue-fix-hung_task_timeout_secs-blocked-120-seconds-problem/


-------end-------


Reposted from blog.csdn.net/wyzxg/article/details/44236263