Hi folks - this is an open call to May First/People Link Members inviting feedback on our backup strategy. All thoughts and comments welcome!

About 2 - 3 weeks ago we changed our backup strategy. We had been running two offsite backup servers in our Sunset Park office (in addition to one backup server in our hosting facility in Delaware, which only provides a redundant backup of the After Downing Street database). We needed a change for two reasons:

Our new strategy was to build a single backup server with multiple disks using a RAID array. That means if one disk fails, the system continues running without a problem. This single backup server, named Iz, was brought online with 8 hard disks about 2 - 3 weeks ago.

We immediately ran into overheating problems, which caused the entire server to emit a continuous ear-piercing beep from the depths of hell.

We solved this problem by opening the case and fixing a couple of broken fans.

This worked for a couple of weeks. Then, last Friday, one disk failed. Since it is a redundant system, our backups continued running smoothly. On Tuesday, Daniel and I spent the entire day replacing the faulty disk. It should have taken 30 minutes, but we were slowed down by three things: our RAID setup is very complicated, our bootloader was confused about which of the 8 disks actually contained our kernel, and, most troubling, we discovered hardware failures related to the hard drive controllers, leaving us still unsure whether it was actually a disk failure or a controller failure. By the end of the day, Iz was back in place, with the disk replaced (plugged into a different controller). The particular disk that failed was acting as a spare on the giant RAID array, so the replacement disk was almost immediately incorporated into the RAID array and everything went back to normal.

The very next morning another disk failed. This time the failed disk was not a spare. Iz continued functioning properly (as it should) and, in addition, incorporated the spare into the RAID array. Incorporating a new disk into a RAID array means "syncing" the data from the functioning RAID array to the newly incorporated disk. This process went on all day Wednesday and through the night, delaying several of our normal backups, since the backups and the syncing process were competing for resources on the server. On Thursday morning (yesterday) I noticed that several of our backup scripts were still running and the new disk was still syncing.

At this point we suspected that heat was the cause of the hardware failures and decided that the best course of action would be to spread out the hard drives inside the box (they were all placed tightly next to each other in two cages during our original installation).

However, in what order do we proceed? Until the syncing completes, one more hard drive failure will cause us to lose all the data on the entire server (yes, it is backup data, but re-copying that data is no small feat). If we let the backups run their course, the syncing will take longer. The longer we wait to spread out the hard drives, the higher the odds that another drive will fail.

In the end, I stopped the still-running backups to allow the syncing to occupy all the system resources. By 8:00 pm Thursday night, it was still only 85% done, reporting that it needed another 10 hours. Nervous that another hard drive might fail, I shut down Iz and spread out the hard drives. When I rebooted, the RAID array didn't come back properly. On Alfredo's excellent advice, I went home rather than push my luck trying to fix this by myself after a 12-hour work day.

This morning (Friday), with a much clearer head, I found the drive that didn't come up properly, reseated the cables, and now Iz is back up and running. It's still syncing, but it appears that it should complete before the backups run again tonight, meaning we should be back in shape.

For now.

The whole point of this long post is to propose a new way of doing backups. Alfredo was really pushing Daniel and me early this week to reconsider Iz - I wasn't ready to listen to that before, but now I'm seeing the wisdom in his concerns, so here are some ideas:

New problems we have with our new backup strategy:

Proposed new strategy:

We have about 450 GB of backup data currently. About 375 GB is from our Telehouse server and the remaining 75 GB is from member office backups.
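For anyone who wants to play with these numbers when thinking about how to split the data, here is a minimal Python sketch of the arithmetic. It only uses the two figures quoted above; the labels in the dictionary and the `summarize` helper are illustrative names, not anything from our actual backup scripts.

```python
# Approximate sizes in GB, taken from the figures quoted in this post.
# The keys are just illustrative labels.
backup_gb = {"Telehouse server": 375, "member office backups": 75}


def summarize(sizes):
    """Return the total size and each source's percentage share of it."""
    total = sum(sizes.values())
    shares = {name: round(100 * gb / total, 1) for name, gb in sizes.items()}
    return total, shares


total, shares = summarize(backup_gb)
print(f"total: {total} GB")
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

Roughly five sixths of what we back up comes from the Telehouse server, so that is the part any new layout has to size for first.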

Proposed Steps:

Now we can retire Richie and Iz and reuse their hard drives to build more backup servers.