LVM Cache woops
At May First, disk i/o has been our most serious bottleneck for many years. We have plenty of RAM, disk space and even CPU.
But when too much data is being written to our spinning disks, everything grinds to a halt.
As we have been adding SSD disks to our servers, we’ve recently begun experimenting with adding SSD-backed lvm caches. This approach has had a tremendous impact - resolving most of our disk i/o problems.
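For readers who have not set one up, here is a rough sketch of what attaching an SSD-backed cache to an existing volume can look like. It only prints the two commands rather than running them. The volume group and cache pool names match the output later in this post, but the SSD device `/dev/sdb1` and the 20G size are made-up placeholders, and this is not necessarily the exact procedure we followed.

```shell
#!/bin/sh
# Print (do not run) the two lvm commands that attach an SSD-backed
# cache pool to an existing logical volume.
# Arguments: volume group, logical volume, SSD physical volume, cache size.
# The SSD physical volume is assumed to already be part of the volume group.
print_cache_setup() {
    vg=$1; lv=$2; ssd_pv=$3; size=$4
    # Step 1: create the cache pool on the SSD physical volume.
    echo "lvcreate --type cache-pool -L $size -n ${lv}_cachepool $vg $ssd_pv"
    # Step 2: attach the cache pool to the slow origin volume.
    echo "lvconvert --type cache --cachepool $vg/${lv}_cachepool $vg/$lv"
}

# Hypothetical values: /dev/sdb1 and 20G are placeholders.
print_cache_setup vg_claudette0 home /dev/sdb1 20G
```

Printing the commands first makes it easy to review them before running anything as root against real data.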
However, this morning we rebooted one of those virtual guests and I almost had a heart attack:
    0 claudette:~# mount /home
    mount: special device /dev/mapper/vg_claudette0-home does not exist
    32 claudette:~# lvs vg_claudette0/home
      LV   VG            Attr       LSize   Pool             Origin       Data%  Meta%  Move Log Cpy%Sync Convert
      home vg_claudette0 Cwi---C--- 309.00g [home_cachepool] [home_corig]
    0 claudette:~# ls /dev/mapper/
    control             vg_claudette0-swap_1  vg_claudette0-var
    vg_claudette0-root  vg_claudette0-tmp     vg_claudette0-var+lib+mysql
    0 claudette:~#
Wah! lvm ate our data!
Let’s remove the cache and return to the way it was:
    0 claudette:~# lvconvert --uncache vg_claudette0/home
      /usr/sbin/cache_check: execvp failed: No such file or directory
      Check of pool vg_claudette0/home_cachepool failed (status:2). Manual repair required!
      Failed to active cache locally vg_claudette0/home.
    5 claudette:~#
Wah! That doesn’t work either!
    0 claudette:~# lvconvert --repair vg_claudette0/home_cachepool
      Using default stripesize 64.00 KiB.
      Operation not permitted on cache pool LV vg_claudette0/home_cachepool.
      Operations permitted on a cache pool LV are:
      --splitcache (operates on cache LV)
    5 claudette:~#
What is happening!!
I booted into a live rescue disk with a more modern version of lvm that really should support the
    0 debirf-rescue:~# lvconvert --repair vg_claudette0/home_cachepool
      /dev/vg_claudette0/lvol1: not found: device not cleared
      Aborting. Failed to wipe start of new LV.
    WARNING: If everything works, remove vg_claudette0/home_cachepool_meta0 volume.
    WARNING: Use pvmove command to move vg_claudette0/home_cachepool_cmeta on the best fitting PV.
    0 debirf-rescue:~#
Wait… thanks to a cool-headed colleague, it turns out the only problem was that
thin-provisioning-tools was not installed on the host.
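In hindsight, the `cache_check: execvp failed` line in the first error was the real clue. A small check like the following, run before rebooting a host with a cached volume, might have saved the panic. This is only a sketch: the function name is made up, it assumes `/usr/sbin` is on the PATH (as it normally is for root), and the `apt-get` hint assumes a Debian-style system.

```shell
#!/bin/sh
# Report whether the helper binary that lvm runs to check cache pool
# metadata is available on this host.
# Optional argument: helper name (defaults to cache_check).
check_cache_helper() {
    helper=${1:-cache_check}
    if command -v "$helper" >/dev/null 2>&1; then
        echo "$helper found"
    else
        # Assumption: on Debian, thin-provisioning-tools ships cache_check.
        echo "$helper missing: try 'apt-get install thin-provisioning-tools'"
        return 1
    fi
}

# Do not abort the surrounding script if the helper is missing.
check_cache_helper || true
```

Once the package is installed, activating the volume again (for example with `lvchange -ay vg_claudette0/home`) should let lvm run its metadata check and bring the device back under /dev/mapper.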
Feel free to review the whole fiasco as it unfolded.