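Preface
Our game server process has abruptly disappeared twice recently. The surrounding log context held nothing useful: the log simply stops dead at a certain point, and the code at that point is simple enough to rule out a crash caused by faulty logic. That reminded me of an earlier incident in which several processes were deployed on the same server and one was killed by the kernel for using too much memory, so I guessed the cause was the same this time.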
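How to check
A web search turned up two ways to check: filter the system log directly, or use the dmesg command. Note: the commands below were tested on CentOS 7.
Searching the system log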
[root@demo]# grep "Out of memory" /var/log/messages
Apr 4 10:32:30 hk-dev kernel: Out of memory: Kill process 2434 (Game9) score 212 or sacrifice child
Apr 4 10:33:53 hk-dev kernel: Out of memory: Kill process 2476 (git) score 381 or sacrifice child
Apr 4 10:33:53 hk-dev kernel: Out of memory: Kill process 2777 (git) score 381 or sacrifice child
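The matching lines can be condensed to just the fields of interest. The following is a minimal sketch, assuming the "Kill process PID (NAME) score N" wording shown above (other kernel versions phrase this message differently); `oom_summary` is a hypothetical helper name, not an existing tool.

```shell
# Hypothetical helper: read log lines on stdin and print pid, name and score
# for every OOM-killer line. Assumes the message wording shown in this post.
oom_summary() {
    sed -n 's/.*Out of memory: Kill process \([0-9][0-9]*\) (\(.*\)) score \([0-9][0-9]*\).*/pid=\1 name=\2 score=\3/p'
}

# Example with one of the log lines above:
echo "Apr 4 10:32:30 hk-dev kernel: Out of memory: Kill process 2434 (Game9) score 212 or sacrifice child" | oom_summary
# prints: pid=2434 name=Game9 score=212
```

On a real machine the input would come from the log itself, for example `grep "Out of memory" /var/log/messages | oom_summary`.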
Searching with the dmesg command
[root@demo]# dmesg -T | grep "Out of memory"
[Mon Aug 30 12:06:25 2021] Out of memory: Kill process 22437 (git) score 400 or sacrifice child
[Wed Sep 22 20:23:52 2021] Out of memory: Kill process 29780 (Game6) score 161 or sacrifice child
[Wed Mar 29 15:54:31 2023] Out of memory: Kill process 29093 (git) score 388 or sacrifice child
[Tue Apr 4 10:24:05 2023] Out of memory: Kill process 2434 (Game9) score 212 or sacrifice child
[Tue Apr 4 10:25:29 2023] Out of memory: Kill process 2476 (git) score 381 or sacrifice child
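When the kernel log holds many such entries, it can help to see which process names are killed most often. This is a rough sketch under the same assumption about the message wording; `oom_kill_counts` is a hypothetical name chosen for illustration.

```shell
# Hypothetical helper: count OOM kills per process name from lines on stdin,
# most frequently killed name first. Assumes the message wording shown above.
oom_kill_counts() {
    sed -n 's/.*Out of memory: Kill process [0-9][0-9]* (\(.*\)) score [0-9][0-9]*.*/\1/p' |
        sort | uniq -c | sort -rn
}

# Example with three of the dmesg lines above:
printf '%s\n' \
    "[Mon Aug 30 12:06:25 2021] Out of memory: Kill process 22437 (git) score 400 or sacrifice child" \
    "[Wed Sep 22 20:23:52 2021] Out of memory: Kill process 29780 (Game6) score 161 or sacrifice child" \
    "[Wed Mar 29 15:54:31 2023] Out of memory: Kill process 29093 (git) score 388 or sacrifice child" |
    oom_kill_counts
```

On a live system the same function could be fed with `dmesg -T | oom_kill_counts`.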
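Why the process was killed
The Linux kernel has a mechanism called the OOM killer, short for Out Of Memory killer, a vivid name. When the machine is about to run out of memory, the kernel picks a process that is using a lot of memory, typically the largest consumer, and kills it so that the system as a whole can keep running.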
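How the kernel detects that memory is low and then picks and kills a process can be read in the kernel source file linux/mm/oom_kill.c (I confirmed that this file exists on 2023-4-4 23:24:07). When the system runs short of memory, out_of_memory() is triggered and calls select_bad_process() to choose a process to kill. The choice is made by calling oom_badness(), and both the algorithm and the idea behind it are crude but simple: find the process that uses the most memory.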
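To get a feel for which processes the kernel currently rates as the worst offenders, the per-process score it exposes can be read from /proc. The sketch below assumes a Linux system with /proc mounted; `top_oom_scores` is a hypothetical helper name, and the value in /proc/PID/oom_score may be scaled differently from the raw number computed inside the kernel.

```shell
# Hypothetical helper: print "score pid command" for the N processes with the
# highest oom_score (default 5). Processes that exit while the loop runs may
# show up with an empty score; that is harmless for a quick look.
top_oom_scores() {
    for d in /proc/[0-9]*; do
        [ -r "$d/oom_score" ] || continue
        printf '%s %s %s\n' \
            "$(cat "$d/oom_score" 2>/dev/null)" \
            "${d#/proc/}" \
            "$(cat "$d/comm" 2>/dev/null)"
    done | sort -rn | head -n "${1:-5}"
}

top_oom_scores 5
```

A process near the top of this list is a likely candidate the next time memory runs out.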
The source of oom_badness() is as follows:
/**
 * oom_badness - heuristic function to determine which candidate task to kill
 * @p: task struct of which task we should calculate
 * @totalpages: total present RAM allowed for page allocation
 *
 * The heuristic for determining which task to kill is made to be as simple and
 * predictable as possible. The goal is to return the highest value for the
 * task consuming the most memory to avoid subsequent oom failures.
 */
long oom_badness(struct task_struct *p, unsigned long totalpages)
{
    long points;
    long adj;

    if (oom_unkillable_task(p))
        return LONG_MIN;

    p = find_lock_task_mm(p);
    if (!p)
        return LONG_MIN;

    /*
     * Do not even consider tasks which are explicitly marked oom
     * unkillable or have been already oom reaped or the are in
     * the middle of vfork
     */
    adj = (long)p->signal->oom_score_adj;
    if (adj == OOM_SCORE_ADJ_MIN ||
            test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
            in_vfork(p)) {
        task_unlock(p);
        return LONG_MIN;
    }

    /*
     * The baseline for the badness score is the proportion of RAM that each
     * task's rss, pagetable and swap space use.
     */
    points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
        mm_pgtables_bytes(p->mm) / PAGE_SIZE;
    task_unlock(p);

    /* Normalize to oom_score_adj units */
    adj *= totalpages / 1000;
    points += adj;

    return points;
}
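The arithmetic at the end of the function can be replayed in userspace to see how oom_score_adj shifts the result. The following is only a sketch of that calculation, not kernel code: `badness` is a hypothetical function, all inputs are made-up illustrative numbers, and a 4096-byte page size is assumed unless a sixth argument is given.

```shell
# Userspace replay of the oom_badness() arithmetic (illustration only).
# Arguments: rss_pages swap_pages pgtable_bytes oom_score_adj totalpages [page_size]
badness() {
    rss=$1 swap=$2 pgtable_bytes=$3 adj=$4 totalpages=$5 page_size=${6:-4096}
    # Baseline: resident pages + swapped-out pages + page-table pages.
    points=$(( $rss + $swap + $pgtable_bytes / $page_size ))
    # As in the kernel, totalpages / 1000 is computed first, then scaled by adj.
    echo $(( $points + $adj * ($totalpages / 1000) ))
}

# Illustrative machine with 4194304 pages (16 GiB at 4 KiB per page) and a
# process holding 1000000 resident pages plus 8 MiB of page tables.
badness 1000000 0 8388608 0 4194304      # prints 1002048
badness 1000000 0 8388608 -500 4194304   # prints -1094952
```

With oom_score_adj at 0 the score is simply the memory footprint in pages; a negative adjustment lowers the score and makes the process a less likely victim, while a positive one raises it.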
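Summary
- If a process that has been running for a while suddenly vanishes, the kernel may have grown jealous of its memory appetite and killed it off
- Processes killed for running out of memory can be found directly in the system log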
grep "Out of memory" /var/log/messages
- They can also be found with the dedicated command
dmesg -T | grep "Out of memory"
- I just looked at Linus's linux repository; there were commits as recently as yesterday. The updates really never end
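A whole lifetime holds only a few happy days
The road splits in two; take whichever side you like
Not afraid, not afraid, just not afraid; I am young
Strong wind, heavy rain, blazing sun; I still dare to fight on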