LVM Cache Surprises

2022-03-03 8-minute read

By far the biggest LVM Cache surprise is just how well it works.

Between 2010 and 2020, my single biggest and most consistent headache managing servers at May First was disk i/o. We run a number of physical hosts with encrypted disks, each providing a dozen or so sundry KVM guests. And they consume a lot of disk i/o.

This problem kept me awake at night and made me want to put my head on the table and cry during the day as I monitored the output of vmstat 1 and watched each disk i/o death spiral unfold.

We tried everything. Turned off fscks, turned off monthly RAID checks. Switched to less intensive backup systems. Added solid state drives and tried to strategically distribute them to our database partitions and other read/write-heavy services. Added tmpfs file systems where possible.

But, the sad truth was: we simply did not have the resources to pay for the infrastructure that could support the disk i/o our services demanded.

Then, we discovered LVM caching (cue Hallelujah). We started provisioning SSD partitions to back our busiest spinning-disk logical volumes and presto. Ten years of agony gone in a puff of smoke!

I don’t know which individuals are responsible for writing the LVM caching code but if you see this: THANK YOU! Your contributions to the world are noticed, appreciated and have had an enormous impact on at least one individual.

Some surprises

Filters

For the last two years, with the exception of one little heart attack, LVM caches have gone very smoothly.

Then, last week we upgraded 13 physical servers straight through from stretch to bullseye.

It went relatively smoothly for the first half of our servers (the older ones hosting fewer resources). But, after rebooting the first server that uses LVM caching, we noticed that the cached disk wasn’t accessible.

No problem, we reasoned. We’ll just uncache it. Except that didn’t work either. We tried every argument we could find on the Internet, but lvm insisted that the block device from the SSD volume group (which provides the caching device) was not available. Running pvs showed an “unknown” device and vgs reported similar errors. Now I started to panic a bit. The server had been shut down cleanly, so surely all the data had been flushed to disk. But how could we get at that data? We started a restore from backup because we really thought that data was gone forever.
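For context, the uncache step itself is normally a one-liner. The volume group and LV names below are invented for illustration; this is a sketch of what we were attempting, not our exact commands:

    # Detach the cache and delete the cache-pool LV in one step:
    lvconvert --uncache vg_spinning/guest1-disk

    # Or detach the cache but keep the cache-pool LV around:
    lvconvert --splitcache vg_spinning/guest1-disk

    # Both refused to run for us because the device backing the cache was
    # reported as missing; pvs and vgs told the same story:
    pvs
    vgs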

Then we had a really great theory: the caching logical volume comes from the SSD volume group, which gets decrypted after the spinning disk volume group.

Maybe there’s a timing issue? When the spinning disk volume group comes online, the caching logical volume is not yet available.

So, we booted into busybox, and manually decrypted the SSD volume first, followed by the spinning disk volume. Alas, no dice.
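What we did from busybox was roughly the following; the device and mapping names are hypothetical:

    # Unlock the SSD container first, then the spinning-disk container:
    cryptsetup luksOpen /dev/sdb2 ssd_crypt
    cryptsetup luksOpen /dev/sda2 hdd_crypt

    # Then ask LVM to rescan and activate whatever it can find:
    vgscan
    vgchange -ay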

Now that we were fully desperate, we decided to restore the lvm configuration file for the entire spinning disk volume group. This felt kinda risky since we might be damaging all the currently working logical volumes, but it seemed like the only option we had.
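For reference, the restore we had in mind is the standard vgcfgrestore procedure; the volume group name here is a placeholder:

    # LVM keeps plain-text metadata backups under /etc/lvm/backup and
    # /etc/lvm/archive. List the backups it has for the spinning-disk VG:
    vgcfgrestore --list vg_spinning

    # Restore the VG metadata from a chosen backup file:
    vgcfgrestore -f /etc/lvm/backup/vg_spinning vg_spinning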

The main problem was that busybox didn’t seem to have the lvm config tool we needed to restore the configuration from our backup (I think it might be there but it was late and we couldn’t figure it out). And, our only readily available live install media was a Debian stretch disk via debirf.

Debian stretch is pretty old and we really would have preferred to have the most modern tools available, but we decided to go with what we had.

And, that was a good thing, because as soon as we booted into stretch and decrypted the disks, the lvm volume suddenly appeared, happy as ever. We uncached it and booted into the host system and there it was.

We went to bed confused but relieved.

The next morning my co-worker figured it out: filtering.

During the stretch days we occasionally ran into an annoying problem: logical volumes from guests would suddenly pop up on the host. This was mostly an annoyance, but it also opened the door to serious mistakes if you accidentally grabbed a guest’s volume and used it on the host.

The LVM folks seem to have noticed this problem and introduced a new default filter that tries to only show you the devices you should be seeing.

Unfortunately for us, this new filter removed logical volumes from the list of available physical volumes. That does make sense for most people. But, not for us. It sounds a bit weird, but our setup looks like this:

  • One volume group derived from the spinning disks
  • One volume group derived from the SSD disks

Then we carve out logical volumes from each for each guest.

Once we discovered LVM caching, we carved out SSD logical volumes to be used as caches for the spinning logical volumes.
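As best I can reconstruct it, the provisioning looks roughly like this; all names are invented and this is a sketch rather than our actual scripts. The wrinkle is that LVM wants the cache to live in the same volume group as the LV it caches, so the SSD logical volume itself gets used as a physical volume for the spinning-disk volume group:

    # Carve a cache-sized LV out of the SSD volume group:
    lvcreate -n guest1-cache -L 50G vg_ssd

    # Turn that LV into a PV and add it to the spinning-disk VG:
    pvcreate /dev/vg_ssd/guest1-cache
    vgextend vg_spinning /dev/vg_ssd/guest1-cache

    # Build a cache pool on the SSD-backed PV and attach it to the
    # spinning-disk LV that needs the help:
    lvcreate --type cache-pool -n guest1-pool -l 100%PVS \
        vg_spinning /dev/vg_ssd/guest1-cache
    lvconvert --type cache --cachepool vg_spinning/guest1-pool \
        vg_spinning/guest1-disk

Which is also why losing sight of the SSD LV didn’t just break the cache: it left the whole spinning-disk volume group with a missing physical volume.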

In retrospect, if we could start over, we would probably do it differently.

In any event, once we discovered the problem, we used the handy configuration options in lvm.conf to tweak the filters to include our cache disks and, once again, everything was back to working.
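I won’t swear this is exactly what changed between releases, but the knobs we ended up touching live in the devices section of /etc/lvm/lvm.conf; the patterns below are illustrative, not our production config:

    devices {
        # Newer lvm2 defaults to not scanning LVs for PV signatures.
        # Our SSD cache LVs are used as PVs, so they need to be scanned:
        scan_lvs = 1

        # And/or explicitly accept the cache devices in the filter while
        # rejecting everything else we do not want scanned:
        global_filter = [ "a|^/dev/vg_ssd/.*-cache$|", "a|^/dev/mapper/.*_crypt$|", "r|.*|" ]
    }

Since LVM on these hosts is also activated from the initramfs at boot, the edited lvm.conf has to make it in there as well (update-initramfs -u on Debian) before the fix helps at the point that bit us.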

Saturated SSDs

The other surprise seems unrelated to the upgrade. We have a physical server that has been suffering from disk i/o problems despite our use of LVM caching.

Our answer, of course, was to add more LVM caches to the spinning logical volumes that seemed to be suffering.

But somehow this was making things even worse.

Then, we finally just removed the LVM caches from all the spinning disks and presto, disk i/o problems seemed to go away. What? Isn’t that the opposite of what’s supposed to happen?

We’re still trying to figure this one out, but it seems that our SSDs are saturated, in which case adding them as a caching volume really is going to make things worse.

We’re still not sure why they are saturated when none of the SSDs on our other hosts are saturated, but a few theories include:

  • They are doing more writing and/or it’s a different kind of writing. I’m still not sure I have the right tool to compare this host with other hosts (see the quick check sketched after this list). And, this host is our only MySQL network database server, hosting hundreds of GB of databases - all writing/reading directly onto the SSDs.

  • They are broken or substandard SSDs (smartctl doesn’t uncover any problems, but maybe it’s a bad model?)
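One way to test the saturation theory, and to compare this host against the others, is simply to watch the per-device numbers while the trouble is happening; this is a generic check, nothing specific to LVM cache:

    # Extended per-device statistics every 5 seconds; on a saturated SSD the
    # %util column sits near 100 and the await column climbs:
    iostat -x 5

    # A rough view of which processes (and therefore which guests, since
    # each guest is a qemu process on the host) are generating the i/o:
    iotop -ao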

I’ll update this post as we learn more but welcome any suggestions in the comments.

Update: 2022-03-07

Two more possible causes:

  • Our use of the writeback feature: LVM cache has a nice feature that caches writes to smooth them out before they hit the underlying disk. Maybe our disks are simply writing more than the SSDs can handle, and not using writeback is our solution (see the sketch after this list). This server supports a guest with an unusually large disk.

  • Maybe we haven’t allocated a big enough LVM cache for the given volume, so its contents are constantly being evicted?
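Both of these should be checkable. Something along these lines would do it; the LV name is invented and the exact reporting fields may vary across lvm2 versions:

    # Occupancy and hit/miss counters for a cached LV; a cache that is both
    # full and mostly missing suggests it is too small for the working set:
    lvs -o+cache_total_blocks,cache_used_blocks,cache_dirty_blocks,cache_read_hits,cache_read_misses vg_spinning/guest1-disk

    # Switch an existing cached LV from writeback to writethrough, so writes
    # are not acknowledged until they hit the spinning disk and the SSD no
    # longer absorbs bursts of writes:
    lvconvert --cachemode writethrough vg_spinning/guest1-disk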

Update: 2022-06-20

We figured it out shortly after my last update, but I never came back to update this post. After turning off lvm caching entirely, everything went back to normal during regular working hours (phew!).

But… the backup took close to 9 hours to complete. Comparable servers take a couple of hours to back up. So, we started backing up only half of the top-level directories. If the backup went through normally, we added back half of the directories we had previously omitted.

Over the course of about 7 days we narrowed it down to just one top-level directory and, after perusing that directory, I found a 20GB file. That’s not terribly unusual and certainly should not be causing this level of crisis (we use an incremental backup system, so only the parts of that file that have changed should get backed up).

But… this was no ordinary 20GB file. It was the file written to by a WordPress site with debug enabled. And, most importantly, it had been writing debug errors to this file for close to five years. That means two important things for us:

  • It most likely was fragmented all over the disk
  • It changed every day

From the backup perspective, it means that every time the backup ran, the entire file had to be read to see what changed, which required pulling 20GB in the most inefficient way possible, causing a major spike in disk i/o.

From the LVM cache perspective, it means that every time this file was written to (i.e. every time this lousy WordPress site logged a debug message), the file had to be read from the spinning disk into the SSD cache. I imagine that lvm cache doesn’t have an “only load what’s changed” feature and instead simply re-reads the entire file every time it changes.

Mystery solved. LVM cache still rocks. And now we have a new DDoS vector for very patient people.