As DevOps/Ops, you maintain DB instances or RAM-intensive services. You see OOM issues occasionally, don’t you? Yes, the scary Out-Of-Memory issues.
Nobody enjoys OOM issues. When one does happen, what should you check? More importantly, how do you monitor for OOM issues and get alerted before one actually happens?
Here are some of my thoughts. Take a look and discuss with me!
Original Article: http://dennyzhang.com/monitor_oom
What is OOM? It happens when a machine runs very low on memory. The OS doesn’t want to end up in a kernel panic, so as self-protection it chooses one victim, usually the process using the most RAM, kills it, and reclaims its memory.
How to confirm an OOM issue? When it happens, the kernel log will have “Killed process” entries, so we can grep for them like this: dmesg -T | grep -C 5 -i 'killed process'.
denny@devops:/# dmesg -T | grep -C 5 -i 'killed process'
...
[Tue Feb 21 00:16:39 2017] Out of memory: Kill process 12098 (java) score 655 or sacrifice child
[Tue Feb 21 00:16:39 2017] Killed process 12098 (java) total-vm:223934456kB, anon-rss:17696224kB, file-rss:1153744kB
...
Which process has been killed or sacrificed? When the OS kills the process, it will:
- List all processes, before killing the victim.
- Log the process id. In our example, we know pid(12098) has been killed.
So use “grep $pid” to find out which process gets killed.
denny@devops:/# export pid=12098
denny@devops:/# dmesg -T | grep -C 1 "\[$pid\]"
[Tue Feb 21 00:16:39 2017]  0 11740 3763 42 12 3 0 0 rpc.idmapd
[Tue Feb 21 00:16:39 2017]  999 12098 55983614 4712492 11091 105 0 0 java
[Tue Feb 21 00:16:39 2017]  0 22050 3998 629 14 3 0 0 tmux
In the above example, we know a “java” program has been killed. Well, I admit, it’s not crystal clear which one. The good thing is that the OOM killer favors processes using a huge amount of RAM, so in practice we can usually guess which process was killed.
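The “Killed process” line in the sample log already carries both the pid and the command name, so a small filter can pull them out directly. This is only a sketch: the function name list_oom_kills is made up for illustration, and it assumes the log wording matches the sample above.

```shell
# Print "<pid> <command>" for every OOM kill found in kernel log text on stdin.
# Assumes the "Killed process <pid> (<name>)" wording shown in the sample log.
# Lines saying "Kill process ..." (the announcement) are deliberately not matched.
list_oom_kills() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Usage on a live system (reading the kernel log may require root):
#   dmesg -T | list_oom_kills
```

Feeding it the two log lines from the example prints one line with the pid and “java”. The command name in the kernel log is short, so it still won’t tell two java services apart.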
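OOM Exclusion: show mercy to my critical processes. The OS chooses its victim by scoring all processes. We can explicitly lower the score of certain processes, so they are more likely to survive while other, less critical processes are sacrificed. This could buy us more time before things get worse.
Adjust the score through the per-process file /proc/$pid/oom_score_adj. The higher the value you set for oom_score_adj, the more likely the process will be killed first. Valid values range from -1000 to 1000, and -1000 exempts the process from the OOM killer entirely. (The value -17 was the “disable” setting of the older oom_adj file; written to oom_score_adj, it only lowers the score slightly.) Lowering the value typically requires root.
echo -17 > /proc/$pid/oom_score_adj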
But remember there is no guarantee that your processes will be safe. The machines may run into very low RAM, and the OS might need to sacrifice more processes, yours included.
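To make the adjustment less error-prone, here is a minimal sketch of two helpers that inspect and set the value. The function names and the optional “proc root” argument are my own assumptions for illustration (the extra argument just lets you try the functions against a fake directory); only the /proc files themselves come from the kernel.

```shell
# Show the kernel's current OOM score and the adjustment for a process.
# $1 = pid, $2 = proc root (hypothetical override, defaults to /proc).
show_oom_score() {
    oom_pid=$1
    oom_root=${2:-/proc}
    printf 'pid=%s oom_score=%s oom_score_adj=%s\n' "$oom_pid" \
        "$(cat "$oom_root/$oom_pid/oom_score")" \
        "$(cat "$oom_root/$oom_pid/oom_score_adj")"
}

# Set oom_score_adj after checking the value is an integer in -1000..1000.
# $1 = pid, $2 = new value, $3 = proc root (hypothetical override).
set_oom_score_adj() {
    oom_pid=$1
    oom_adj=$2
    oom_root=${3:-/proc}
    oom_digits=${oom_adj#-}          # strip one leading minus sign
    case $oom_digits in
        ''|*[!0-9]*)
            echo "not an integer: $oom_adj" >&2
            return 2 ;;
    esac
    if [ "$oom_adj" -lt -1000 ] || [ "$oom_adj" -gt 1000 ]; then
        echo "oom_score_adj must be between -1000 and 1000" >&2
        return 2
    fi
    printf '%s\n' "$oom_adj" > "$oom_root/$oom_pid/oom_score_adj"
}

# Usage (as root), for a hypothetical critical process:
#   set_oom_score_adj "$pid" -500
#   show_oom_score "$pid"
```

The setting is per process and is lost on restart, so in practice it would have to be reapplied from whatever starts the service.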
How to avoid OOM? Well, you have to make sure processes won’t take all the RAM. Usually this means:
- Reasonable capacity planning. If your cluster needs 100GB RAM in total but you physically have only 90GB, it simply won’t work.
- Enforce a RAM quota for given services. Let’s say process A can use at most 2GB RAM, and process B at most 4GB. If you’re using Java, -Xmx (maximum heap size) and -Xms (initial heap size) would be your friends.
Honestly speaking, I think you will have little room to avoid OOM, if there is not enough available RAM in your servers.
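For non-Java services, one simple way to sketch a quota is the shell’s ulimit builtin. The helper below is an illustration only: the name run_with_mem_limit is made up, and it caps virtual address space, which is not the same thing as resident RAM.

```shell
# Run a command with a cap on its virtual address space, given in KiB.
# The limit is applied inside a subshell, so the calling shell keeps its own limits.
run_with_mem_limit() {
    mem_limit_kb=$1
    shift
    (
        ulimit -v "$mem_limit_kb" || exit 1
        exec "$@"
    )
}

# Usage, with hypothetical commands:
#   run_with_mem_limit 4194304 ./process_b      # roughly 4GB of address space
# For a JVM, the heap flags are usually the better tool:
#   java -Xms512m -Xmx2g -jar app.jar
```

A process that hits the cap sees its allocations fail rather than being killed by the OS. JVMs tend to reserve far more address space than they actually touch, so for Java I would stay with -Xmx rather than ulimit.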
How to detect OOM issues? Here is what I do:
- Monitor OS memory usage. OOM only happens when free memory is very low at the OS level, so monitoring this helps us stay on top of potential OOM issues. If you don’t have scripts checking this, take a look at check_linux_stats.pl.
- Monitor memory usage of critical processes. Usually OOM happens only if certain heavy-weight processes keep taking more and more RAM. See how to monitor process memory usage: check_proc_mem.sh.
- Monitor the system log to detect OOM incidents. We can keep polling system logs and get alerted when an OOM does happen. Currently I don’t have scripts to check this. If it turns out to be a strong requirement, I’d be happy to implement one and open source it. So leave me comments below. Let me hear your voice!
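As a rough starting point for the first and third items, here is a minimal sketch of two checks. It is not one of the scripts mentioned above. The function names are made up, the thresholds are placeholders, and the exit codes follow the Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), which is my assumption about how the alert would be wired up.

```shell
# Percentage of RAM still available, read from a meminfo-format file.
# $1 = file to read (hypothetical override, defaults to /proc/meminfo).
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END {
             if (total > 0 && avail != "") { printf "%d\n", avail * 100 / total }
             else { exit 1 }
         }' "${1:-/proc/meminfo}"
}

# Alert when available RAM drops to or below the given percentages.
# $1 = warning threshold, $2 = critical threshold, $3 = meminfo file (optional).
check_mem_available() {
    warn_pct=$1
    crit_pct=$2
    if ! avail_pct=$(mem_available_pct "${3:-/proc/meminfo}"); then
        echo "UNKNOWN: cannot read memory statistics"
        return 3
    fi
    if [ "$avail_pct" -le "$crit_pct" ]; then
        echo "CRITICAL: only ${avail_pct}% of RAM available"
        return 2
    elif [ "$avail_pct" -le "$warn_pct" ]; then
        echo "WARNING: only ${avail_pct}% of RAM available"
        return 1
    fi
    echo "OK: ${avail_pct}% of RAM available"
}

# Report CRITICAL if kernel log text on stdin contains any "Killed process" line.
check_oom_log() {
    oom_kills=$(grep -ci 'killed process' || true)
    if [ "$oom_kills" -gt 0 ]; then
        echo "CRITICAL: $oom_kills OOM kill(s) found in kernel log"
        return 2
    fi
    echo "OK: no OOM kills found in kernel log"
}

# Usage:
#   check_mem_available 20 10
#   dmesg -T | check_oom_log
```

One known gap: the log check keeps alerting for as long as the old entry stays in the kernel log, so a real version would need to remember which entries it has already reported.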