Identifying resource hogs
Suppose you manage a multi-user server with wily and unpredictable users running compromisable web sites and email accounts with loads of email.
Suppose your server blows up at 3:03 pm.
Suppose you ask: Was there a single compromised user account that caused this mess and if so, which user?
How would you figure it out?
If you are lucky enough to be logged into the server when it is blowing up and lucky enough that the server is responsive enough to run commands for you, then you have some options.
vmstat 1 nicely shows you whether you are CPU bound or disk i/o bound. The first two columns show the number of processes waiting to run because there is not enough CPU (r) and the number blocked because there is not enough disk i/o (b).
pidstat -d 1 shows which processes are reading from and writing to disk, and when you cancel it with ctrl-c it summarizes what you have been watching, making it easier to pinpoint who is doing the most writing.
And good ole top can show you which users are consuming the most CPU.
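As a rough illustration of reading the r and b columns, here is a minimal Python sketch that averages them over a few captured lines of vmstat 1 output. The function names and the rule of thumb for the verdict (more blocked than runnable processes suggests disk i/o, more runnable processes than CPUs suggests CPU) are my own assumptions, not something vmstat itself reports.

```python
from statistics import mean


def parse_vmstat_sample(line):
    """Return (r, b) from one numeric line of `vmstat 1` output."""
    fields = line.split()
    return int(fields[0]), int(fields[1])


def summarize(lines, ncpus):
    """Average r and b over the numeric samples and guess the bottleneck."""
    samples = []
    for line in lines:
        fields = line.split()
        # Skip the two header lines, which do not start with a number.
        if fields and fields[0].isdigit():
            samples.append(parse_vmstat_sample(line))
    if not samples:
        return None
    avg_r = mean(r for r, _ in samples)
    avg_b = mean(b for _, b in samples)
    # Hypothetical rule of thumb, not an official threshold.
    if avg_b > avg_r:
        verdict = "disk i/o bound"
    elif avg_r > ncpus:
        verdict = "cpu bound"
    else:
        verdict = "no obvious contention"
    return avg_r, avg_b, verdict
```

Feeding it the lines of a saved vmstat run along with the number of CPUs gives a quick first guess, but it says nothing about which user is responsible.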
But these tools are often misleading. When your system is under heavy contention, all kinds of processes get backed up and these tools often just show a mess of processes desperately trying to run amidst a giant resource shortage. It’s hard to pinpoint the user that may have started the problem.
Also, these tools are useless if you can’t log in to the server or if you arrive at the scene after the storm has passed.
munin and sar
Both munin and sar (provided by sysstat) can record a history of usage. And, both can tell you, for example, whether your system was CPU bound or disk i/o bound and exactly when the problem started.
However, I can’t seem to convince either (out of the box at least) to track such usage on a per-user basis.
GNU Accounting utilities
Now we are getting somewhere. The acct package is specifically designed to record usage information on a per-user basis.
However, it suffers from a few problems considering our use case:
It has a subtly different goal:
acct wants to account for total resource usage at the end of the day. I want to measure per-user resource usage at an exact point in time.
The acct package works in an elegant fashion. It enables a feature of the kernel that causes the kernel to add data to a file every time a process ends. The data includes the pid, uid, total CPU usage, average memory consumption, and the date and time the process began.
This approach means you don’t have to poll running processes and you always get accurate information.
For the purposes of pin-pointing who is consuming resources when, this works great for short running processes.
But for long running processes, if you chart it by the date/time provided (which is the time the process started), you get a giant jump in resource usage when the process starts. If a process runs for 30 minutes and consumes massive resources during the last minute of its life, that resource usage will get reported 29 minutes before it happened.
Confusing to munin. The date/time on the data is when the process started, however, it is reported to the kernel file when the process ends. For the purposes of munin graphing, we would have to record it when the process ends or else we would be reporting times in the past.
A much bigger problem is the lack of disk i/o. Although the spec seems to include disk i/o, reporting disk i/o does not seem to be available on Linux, and sadly disk i/o is almost always the cause of our resource problems. The “io” column in dump-acct is always 0.
The information is reported in binary form, making the raw file a bit hard to read. The tools that come with the acct package interpret that file (thanks!) but do so in a way that is hard to parse (in particular, dates are human readable and only include when the process began, not when it ended, so you can’t effectively limit output by date range).
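To make the timestamp problem concrete, here is a small Python sketch that charts usage by when a process ended rather than when it began. It assumes the accounting records have already been decoded into a hypothetical AcctRecord with a start time, an elapsed time, and total CPU seconds. That class and its field names are illustrative only and are not the on-disk format or the output of any acct tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class AcctRecord:
    # Hypothetical, already-decoded record (not the binary acct layout).
    uid: int
    started: datetime
    elapsed_seconds: float
    cpu_seconds: float


def end_time(rec):
    """Derive when the process ended from its start and elapsed time."""
    return rec.started + timedelta(seconds=rec.elapsed_seconds)


def cpu_by_user_and_minute(records):
    """Sum CPU seconds per (uid, minute), keyed by process end time."""
    usage = defaultdict(float)
    for rec in records:
        minute = end_time(rec).replace(second=0, microsecond=0)
        usage[(rec.uid, minute)] += rec.cpu_seconds
    return dict(usage)
```

A process that starts at 2:33 pm and runs for 30 minutes lands in the 3:03 pm bucket instead of the 2:33 pm one. This still lumps a process's whole lifetime of usage into a single minute, so it only moves the spike to the end of a long running process rather than spreading it out.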
I’ve spent years writing various polling programs that either run process-monitoring tools or read pid statistics directly from /proc, from cron jobs or constantly running processes, and that collect and record the output. However, all these scripts suffer from either being inaccurate, because they depend on polling running processes, or consuming too many resources themselves, because they sit in a constant loop measuring things.
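For reference, a single pass of the /proc polling approach might look like the following Python sketch. It assumes a Linux kernel that exposes /proc/<pid>/io with read_bytes and write_bytes counters, and that the script runs as root so it can read other users' files. The function names and the proc_root parameter are my own, added so the code can be pointed at a test directory.

```python
import os
from collections import defaultdict


def parse_proc_io(text):
    """Parse the contents of /proc/<pid>/io into a dict of integers."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def poll_io_by_uid(proc_root="/proc"):
    """One polling pass: sum (read_bytes, write_bytes) per uid."""
    totals = defaultdict(lambda: [0, 0])
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        pid_dir = os.path.join(proc_root, entry)
        try:
            uid = os.stat(pid_dir).st_uid
            with open(os.path.join(pid_dir, "io")) as fh:
                io = parse_proc_io(fh.read())
        except OSError:
            continue  # process exited mid-scan, or permission denied
        totals[uid][0] += io.get("read_bytes", 0)
        totals[uid][1] += io.get("write_bytes", 0)
    return {uid: tuple(pair) for uid, pair in totals.items()}
```

The counters are cumulative per process, so a real poller has to keep the previous pass and subtract. Any process that starts and exits between two passes is never seen at all, which is exactly the inaccuracy described above, and polling more often to compensate makes the poller itself more expensive.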