Working with RAID
We’ve been working with a giant backup server with 8 disks in a complex series of RAID arrays.
Last week, we had two disks fail from the same 7-disk RAID 5 array within 24 hours of each other, causing the array to fail (and the whole server to stop responding).
When we rebooted the machine, /proc/mdstat reported the RAID in question as inactive, with two disks missing:
md2 : inactive hde hdk3 hdi1 hda1 hdm
      1367569600 blocks
It was missing sda3 and hdc1.
Those disks had other partitions on the system that were working fine. And a few read tests on the partitions in question seemed fine too. Hm.
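A quick way to spot arrays in this state is to scan /proc/mdstat for the word "inactive". This is a sketch: the sample line below stands in for the real file, which you would read directly on the server.

```shell
# Sketch: flag any md array that /proc/mdstat lists as inactive.
# The sample line mimics the md2 entry above; on a live system,
# point awk at /proc/mdstat itself.
mdstat_sample='md2 : inactive hde hdk3 hdi1 hda1 hdm'
echo "$mdstat_sample" | awk '$1 ~ /^md/ && $3 == "inactive" {print $1, "is inactive"}'
```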
mdadm --examine /dev/sda3
The superblock on sda3 reported that it was healthy, but that hdc1 was faulty and removed.
The same test on hdc1 reported that it and all disks were healthy.
So, it appears as though hdc1 went down first, followed by sda3.
We began the recovery with the --re-add command:
mdadm /dev/md2 --re-add /dev/hdc1
mdadm /dev/md2 --re-add /dev/sda3
Then, we tried to bring it up again:
0 root@iz:~# mdadm --assemble /dev/md2 --uuid=15a4aefd:d0a95db7:934e8ae1:fce89514
mdadm: device /dev/md2 already active - cannot assemble it
1 root@iz:~#
Whoops. Stop the array first:
0 root@iz:~# mdadm --stop /dev/md2
mdadm: stopped /dev/md2
0 root@iz:~#
Then, try again:
root@iz:~# mdadm --assemble /dev/md2 --uuid=15a4aefd:d0a95db7:934e8ae1:fce89514
mdadm: /dev/md2 assembled from 5 drives - not enough to start the array.
1 root@iz:~#
Still not working: a 7-disk RAID 5 needs at least six members to start. Try again with --force:
0 root@iz:~# mdadm --assemble /dev/md2 --force \
    --uuid=15a4aefd:d0a95db7:934e8ae1:fce89514
mdadm: forcing event count in /dev/sda3(2) from 36300 upto 36308
mdadm: clearing FAULTY flag for device 0 in /dev/md2 for /dev/sda3
mdadm: /dev/md2 has been started with 6 drives (out of 7).
0 root@iz:~#
Bingo. We're not sure why it only took back one of the two disks, but we chose to copy the data off in a hurry and worry about that later.
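Once the array is running degraded, the remaining disk can be added back (for example with `mdadm /dev/md2 --add /dev/hdc1`) and the resync watched in /proc/mdstat, which prints a progress line like the sample below. A sketch of pulling out just the percentage; the numbers here are illustrative, not from this server:

```shell
# Sketch: extract the rebuild percentage from a /proc/mdstat recovery line.
# The sample text stands in for the real file during a resync.
sample='      [==>..................]  recovery = 12.6% (97043392/766198784) finish=133.7min speed=83403K/sec'
echo "$sample" | grep -o 'recovery = [0-9.]*%'
```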