I've found bottleneck analysis to be something I can only understand in small, bite-sized chunks - really it's a continuous process of peeling back layers.

Thanks to yet another conversation with dkg, I've peeled back more layers.

Some general ideas to keep in mind:

The load average as reported by top is essentially the number of processes that are trying to run. As Wikipedia puts it:

An idle computer has a load number of 0 and each process that is using CPU or waiting for CPU adds to the load number by 1.

It's a good indicator of a sluggish machine, but it doesn't tell us why processes can't run. Two common reasons: the CPU is overtaxed, or processes need to read from or write to the disk (commonly referred to as disk i/o, for disk input/output) but there is so much disk activity that they are stuck waiting for the kernel to hand them the data they asked for or to commit the data they want to write.
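A quick way to see those load numbers without opening top is uptime (or, on Linux, /proc/loadavg directly). The rule-of-thumb comparison against the CPU count is my gloss, not something from the conversation:

```shell
# The three numbers are the 1-, 5-, and 15-minute load averages.
uptime

# As a rough rule of thumb, compare the 1-minute figure against the
# number of CPUs (nproc on Linux): a load well above the CPU count
# means processes are queueing for something - CPU or disk.
nproc
```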

So, in comes vmstat (typically run with vmstat 1 to show a report every second - it does not require root to run). The first two columns are the most important. The first, "r", is "The number of processes waiting for run time." The second, "b", is "The number of processes in uninterruptible sleep" (or blocked), which technically isn't always due to i/o, but that seems like the most likely cause. (The quotes come from the vmstat man page.)
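To make the columns concrete, here's a sketch that pulls r and b out of a captured line of vmstat output. The numbers in the sample are made up for illustration; on a live machine you'd pipe vmstat 1 itself instead of the heredoc-style variable:

```shell
# Two header lines, then one data line, as vmstat prints them.
# Column 1 is r (runnable), column 2 is b (uninterruptible sleep).
vmstat_sample='procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  7  20480  51234  10240 204800   12    8   950  1200  300  800 10  5 20 65  0'

# Skip the two header lines and print just r and b.
echo "$vmstat_sample" | awk 'NR==3 {print "runnable:", $1, "blocked:", $2}'
# -> runnable: 3 blocked: 7
```

Here r=3 with b=7 would point toward disk contention rather than CPU contention, per the rule below.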

If the number in the "r" column is high, but the number in the "b" column is low, then the contention is probably over the CPU. If the number in the "b" column is high, then most likely it's a disk i/o problem.

If you have a large number in the "b" column, you'll want to pay attention to the si and so columns under swap. If those numbers are consistently above 0, then the problem is probably swapping: the machine doesn't have enough RAM, so the kernel is using the hard drive as overflow memory to compensate.
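In vmstat's layout, si and so are the seventh and eighth columns of each data line. A minimal sketch of that check, using a made-up data line rather than live output:

```shell
# Hypothetical vmstat data line; fields 7 and 8 are si (swapped in)
# and so (swapped out), in KB/s.
vmstat_line=' 0 12  81920   4096   2048  65536  240  310    90   400  500 1500  5 10  0 85  0'

# Flag any non-zero swap activity.
echo "$vmstat_line" | awk '{ if ($7 > 0 || $8 > 0)
                               print "swapping: si=" $7 " so=" $8;
                             else
                               print "no swap activity" }'
# -> swapping: si=240 so=310
```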

If the si and so numbers are zero, then the problem may be processes that are hard disk intensive.

In this case, running ps and grepping for the state "D" (for example: ps aux | grep " D ") will produce a list of the blocked processes in question. Running lsof against those processes can tell you what files they have open, which might give some clues as to which process is the culprit. NOTE: if you are experiencing high load, running lsof will only add to your load problems!
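One wrinkle: grep " D " can also match a command name containing " D ", and misses compound states like "D+" or "Ds". A slightly more careful variant, assuming a ps that supports -eo (Linux procps does), is to have ps print only the columns you need and match on the state field itself:

```shell
# stat=, pid=, comm= suppress the headers; awk keeps any process whose
# state field *starts* with D, so "D", "D+", "Ds" etc. all match.
ps -eo stat=,pid=,comm= | awk '$1 ~ /^D/ {print $2, $3}'

# Then, for a suspect PID (12345 is a placeholder), list its open files:
# lsof -p 12345
```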

One more point of confusion: you might wonder how there could be, say, 10 processes in the "b" column while, on the far right, 0% of the CPU time is "wa" (meaning waiting, usually on i/o). That can happen if a bunch of processes are waiting on i/o and the kernel says: well, instead of sitting around idle waiting on i/o, I'm going to take care of the processes that do not require i/o until the i/o bottleneck gets cleared up. In that scenario, the CPU will spend 0% of its time waiting even though a number of processes are blocked because of i/o problems.
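That scenario looks like this in vmstat terms (again a made-up data line; b is field 2, wa is field 16):

```shell
# 10 processes blocked on i/o, yet wa is 0 because user + system time
# (us=70, sy=30) keep the CPU fully busy with non-i/o work.
line=' 4 10      0  81920   2048  65536    0    0  2000  3000  900 2500 70 30  0  0  0'
echo "$line" | awk '{print "b=" $2, "wa=" $16}'
# -> b=10 wa=0
```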

Hope this is helpful for others. Thanks dkg and man and wikipedia!