Who Ate My RAM?
One of our newest servers, with a hefty 256GB of RAM, recently began killing processes via the oomkiller.
According to free, only half of the RAM was in use (125GB). About 4GB was
free, with the remainer used by the file cache.
I’m used to seeing unexpected “free RAM” numbers like this and have been assured that the kernel is simply not wasting RAM. If it’s not needed, use it to cache files to save on disk I/O. That make sense.
However… why is the oomkiller being called instead of flushing the file cache?
I came up with all kinds of amazing and wrong theories: maybe the RAM is fragmented (is that even a thing?!?), maybe there is a spike in RAM and the kernel can’t flush the cache quickly enough (I really don’t think that’s a thing). Maybe our kvm-manager has a weird bug (nope, but that didn’t stop me from opening a spurious bug report).
I learned lots of cool things, like the oomkiller report includes a
table of the memory in use by each process (via the rss column) - and you
have to muliply that number by 4096 because it’s in 4K pages.
That’s how I discovered that the oomkiller was killing off processes with only half the memory in use.
I also learned that lsof sometimes lists the same open file multiple times,
which made me think a bunch of files were being opened repeatedly causing a
memory problem, but really it amounted to nothing.
That last thing I learned, courtesy of an askubuntu
post is that
the /dev filesystem is allocated by default exactly half the RAM on the
system. What a coincidence! That is exactly how much RAM is useable on the
server.
And, on the server in question, that filesystem is full. What?!? Normally, that filesystem should be using 0 bytes because it’s not a real filesystem. But in our case a process created a 127GB file there - it was only stopped because the file system filled up.