LVM Cache woops
At May First, disk I/O has been our most serious bottleneck for many years. We have plenty of RAM, disk space and even CPU.
But when too much data is being written to our spinning disks, everything grinds to a halt.
As we have been adding SSDs to our servers, we’ve recently begun experimenting with SSD-backed LVM caches. This approach has had a tremendous impact, resolving most of our disk I/O problems.
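For readers who have not set one up, here is a rough sketch of what attaching an SSD-backed cache to a logical volume can look like. It only prints the commands rather than running them. The function name `print_cache_setup`, the SSD device path and the cache size are all assumptions for illustration, not a record of what we actually ran.

```shell
#!/bin/sh
# Sketch only: print (do not run) the LVM commands that would attach an
# SSD-backed cache to an existing logical volume.
# Assumptions: $3 is an SSD physical volume already added to the volume
# group, and $4 is the desired cache pool size. Both are hypothetical values.
print_cache_setup() {
    vg=$1; lv=$2; ssd=$3; size=$4
    # Step 1: create a cache pool on the SSD.
    echo "lvcreate --type cache-pool -L $size -n ${lv}_cachepool $vg $ssd"
    # Step 2: attach the cache pool to the origin volume.
    echo "lvconvert --type cache --cachepool $vg/${lv}_cachepool $vg/$lv"
}
```

For example, `print_cache_setup vg_claudette0 home /dev/sdb 20G` prints the two commands for a 20G cache on a hypothetical `/dev/sdb`. Review them before running anything as root.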
However, this morning we rebooted one of those virtual guests and I almost had a heart attack:
0 claudette:~# mount /home
mount: special device /dev/mapper/vg_claudette0-home does not exist
32 claudette:~# lvs vg_claudette0/home
  LV   VG            Attr       LSize   Pool             Origin       Data%  Meta%  Move Log Cpy%Sync Convert
  home vg_claudette0 Cwi---C--- 309.00g [home_cachepool] [home_corig]
0 claudette:~# ls /dev/mapper/
control             vg_claudette0-swap_1  vg_claudette0-var
vg_claudette0-root  vg_claudette0-tmp     vg_claudette0-var+lib+mysql
0 claudette:~#
Wah! lvm ate our data!
Let’s remove the cache and return to the way it was:
0 claudette:~# lvconvert --uncache vg_claudette0/home
/usr/sbin/cache_check: execvp failed: No such file or directory
Check of pool vg_claudette0/home_cachepool failed (status:2). Manual repair required!
Failed to active cache locally vg_claudette0/home.
5 claudette:~#
Wah! That doesn’t work either!
Let’s repair:
0 claudette:~# lvconvert --repair vg_claudette0/home_cachepool
Using default stripesize 64.00 KiB.
Operation not permitted on cache pool LV vg_claudette0/home_cachepool.
Operations permitted on a cache pool LV are:
--splitcache (operates on cache LV)
5 claudette:~#
What is happening!!
I booted into a live rescue disk with a more modern version of lvm that really should support the --repair option:
0 debirf-rescue:~# lvconvert --repair vg_claudette0/home_cachepool
/dev/vg_claudette0/lvol1: not found: device not cleared
Aborting. Failed to wipe start of new LV.
WARNING: If everything works, remove vg_claudette0/home_cachepool_meta0 volume.
WARNING: Use pvmove command to move vg_claudette0/home_cachepool_cmeta on the best fitting PV.
0 debirf-rescue:~#
Help! Help!
Wait… thanks to a cool-headed colleague, it turns out the only problem was that thin-provisioning-tools was not installed on the host.
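The first error was the real clue: lvconvert tried to run /usr/sbin/cache_check and could not find it. On Debian that binary ships in the thin-provisioning-tools package. Below is a minimal sketch of a check you could run before rebooting a machine with cached volumes. The function name `check_cache_tools` is hypothetical, and it assumes cache_check is expected somewhere on the PATH.

```shell
#!/bin/sh
# Hypothetical helper: report whether the cache_check binary that LVM
# calls when activating a cached volume is available.
# Assumption: looking on PATH is sufficient (the error above shows LVM
# looking for /usr/sbin/cache_check specifically).
check_cache_tools() {
    if command -v cache_check >/dev/null 2>&1; then
        echo "ok: cache_check found"
        return 0
    fi
    echo "missing: cache_check (on Debian, install thin-provisioning-tools)"
    return 1
}
```

Running `check_cache_tools` on any machine with LVM cache volumes before a reboot would have saved me the heart attack.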
Feel free to review the whole fiasco as it unfolded.