The future of May First/People Link backups

2007-01-08

Hi folks - this is an open call to May First/People Link Members inviting feedback on our backup strategy. All thoughts and comments welcome!

About 2 - 3 weeks ago we changed our backup strategy. We had been running two offsite backup servers in our Sunset Park office (in addition to one backup server in our hosting facility in Delaware, which only provides a redundant backup of the After Downing Street database). We needed a change for two reasons:

Both of the Sunset Park servers’ hard drives were nearly at capacity.
Neither server was running a redundant disk array. That means that a single hard drive failure would require us to re-copy all the backup data - a process that would take either 4 - 5 weeks over the Internet, or a physical trip to our rack in Telehouse plus a trip to each of the 3 - 4 member offices that do a remote backup to our servers.

Our new strategy was to build a single backup server with multiple disks using a RAID array. That means if one disk fails, the system continues running without a problem. This single backup server, named Iz, was brought online with 8 hard disks about 2 - 3 weeks ago.

We immediately ran into overheating problems, which caused the entire server to emit a continuous ear-piercing beep from the depths of hell.

We solved this problem by opening the case and fixing a couple of broken fans.

This worked for a couple of weeks. Then, last Friday, one disk failed. Since it is a redundant system, our backups continued running smoothly. On Tuesday, Daniel and I spent the entire day replacing the faulty disk. It should have taken 30 minutes, but we were slowed down by three things: our RAID setup is very complicated; our bootloader was confused about which of the 8 disks actually contained our kernel; and, most troubling, we discovered hardware failures related to the hard drive controllers, leaving us still unsure whether it was actually a disk failure or a controller failure. By the end of the day, Iz was back in place, with the disk replaced (plugged into a different controller). The particular disk that failed was acting as a spare on the giant RAID array, so the replacement disk was almost immediately incorporated into the RAID array and everything went back to normal.

The very next morning another disk failed. This time the failed disk was not a spare. Iz continued functioning properly (as it should) and additionally it incorporated the spare into the RAID array. Incorporating a new disk into a RAID means “syncing” the data from the functioning RAID array to the newly incorporated disk. This process went on all day Wednesday and through the night, causing several of our normal backups to be delayed since the normal backups and the syncing process were both competing for resources on the server. On Thursday morning (yesterday) I noticed that several of our backup scripts were still running and the new disk was still syncing.

At this point we suspected that heat was the cause of the hardware failures and decided that the best course of action would be to spread out the hard drives inside the box (they were all placed tightly next to each other in two cages during our original installation).

However, in what order do we proceed? Until the syncing completes, one more hard drive failure will cause us to lose all the data on the entire server (yes, it is backup data, but re-copying that data is no small feat). If we let the backups run their course, the syncing will take longer. The longer we wait to spread out the hard drives, the higher the odds that another drive will fail.

In the end, I stopped the still-running backups to allow the syncing to occupy all the system resources. By 8:00 pm Thursday night, it was still only 85% done, reporting that it needed another 10 hours. Nervous that another hard drive might fail, I shut down Iz and spread out the hard drives. When I rebooted, the RAID array didn’t come back properly. On Alfredo’s excellent advice, I went home rather than push my luck trying to fix this by myself after a 12-hour work day.

This morning (Friday), with a much clearer head, I discovered the drive that didn’t come up properly, reseated the cables, and now Iz is back and running. It’s still syncing, but it appears that it should complete before the backups run again tonight, meaning we should be back in shape.

For now.

The whole point of this long post is to propose a new way of doing backups. Alfredo was really pushing Daniel and me early this week to reconsider Iz - I wasn’t ready to listen to that before, but now I’m seeing the wisdom in his concerns, so here are some ideas:

New problems we have with our new backup strategy:

Iz has hardware problems (particularly related to the hard drive controllers). It’s very possible that all the hardware problems are related to heat generated by eight hard drives, and figuring out a way to cool down Iz will solve them. However, it might be that Iz has faulty hardware, or that the heat we’ve already subjected Iz to has caused permanent damage. In any event, I don’t have a lot of confidence in Iz.
We can withstand one disk failure. That’s a huge improvement. However, two disk failures will cause us to lose all data because we now only have one backup server for all our data. Related: if Iz has a motherboard failure, we will be offline until we can find a replacement server that can handle 8 disks. And - all backup activity will stop until we do that.
Our only backup is offsite. That’s a good place for a single backup (much better than onsite!). However - if one of our servers fails at Telehouse - we will have to either copy the replacement data over the Internet (which will take hours if not days), or physically carry Iz to Telehouse and re-copy. This is not such a great strategy.
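
To get a feel for why restoring over the Internet is so painful, here is a rough back-of-the-envelope sketch in Python. The 1.5 Mbit/s sustained link speed is purely an assumption for illustration (it is not a measured figure), and transfer_days is a hypothetical helper. The 450 GB size is roughly the total amount of backup data we currently hold.

```python
def transfer_days(size_gb: float, link_mbit_per_s: float) -> float:
    """Days needed to copy size_gb (decimal gigabytes) over a link
    that sustains link_mbit_per_s megabits per second."""
    if size_gb < 0 or link_mbit_per_s <= 0:
        raise ValueError("size must be >= 0 and link speed must be > 0")
    bits = size_gb * 1e9 * 8                  # gigabytes -> bits
    seconds = bits / (link_mbit_per_s * 1e6)  # bits / (bits per second)
    return seconds / 86400                    # seconds -> days


if __name__ == "__main__":
    ASSUMED_LINK_MBIT = 1.5  # assumption for illustration only
    days = transfer_days(450, ASSUMED_LINK_MBIT)
    print(f"450 GB at {ASSUMED_LINK_MBIT} Mbit/s: about {days:.1f} days")
```

At that assumed speed the full 450 GB works out to roughly four weeks, which is in the same ballpark as the 4 - 5 week estimate for re-copying everything. A faster or slower link changes the answer proportionally.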

Proposed new strategy:

We have about 450 GB of backup data currently. About 375 GB is from our Telehouse server and the remaining 75 GB is from member office backups.

We phase out Iz in favor of multiple backup servers, each with a RAID 5 array of three 300 GB disks. This will give us 600 GB of usable space per backup server, with one disk’s worth of capacity devoted to redundancy. It will also spread out our backups over multiple machines.
We place backup servers both at Telehouse and Sunset Park. All of our Telehouse production servers will back up to both the Telehouse and Sunset Park servers. All of our member backups will only go to Sunset Park (at Telehouse we’re charged for bandwidth, so backing up members to Telehouse is not economically feasible).
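
For anyone wondering where the 600 GB figure comes from: RAID 5 spends one disk’s worth of space on parity, so usable space is the disk count minus one, times the disk size. A minimal sketch of that arithmetic (raid5_usable_gb is a hypothetical helper name, and it assumes all disks in the array are the same size):

```python
def raid5_usable_gb(disk_count: int, disk_size_gb: float) -> float:
    """Usable capacity of a RAID 5 array of equally sized disks.

    RAID 5 needs at least three disks and uses the equivalent of
    one disk for parity, so it survives a single disk failure.
    """
    if disk_count < 3:
        raise ValueError("RAID 5 needs at least 3 disks")
    if disk_size_gb <= 0:
        raise ValueError("disk size must be positive")
    return (disk_count - 1) * disk_size_gb


if __name__ == "__main__":
    # Three 300 GB disks, as proposed for each new backup server.
    print(raid5_usable_gb(3, 300), "GB usable")
```

The same formula shows the trade-off of adding disks: each extra disk adds its full size to usable space, but the array still only tolerates one failure.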

Proposed Steps:

Install a single new backup server in Telehouse and start backing up our production servers. We’ll be using 375 of the 600 available GB, so we should be in good shape for now.
As a temporary measure, make a single backup of all member backup data currently on Iz to Richie. Richie is one of our old backup servers. If we delete all the May First Telehouse servers’ data from Richie (which we can safely do once we have the Telehouse backup server in place), there will be plenty of room on Richie. This step is designed to allow us to recover if we have two hard drive failures on Iz, which would result in losing member backup data.
Install a single backup server in Sunset Park to take over for Iz.
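
As a sanity check on the steps above, here is a small sketch that adds up what each new server would hold and how much room is left. The sizes are the approximate figures from this post (375 GB of Telehouse data, 75 GB of member data, 600 GB usable per server); the dictionary layout and the free_space_gb helper are just illustrative assumptions, not an existing tool.

```python
USABLE_GB_PER_SERVER = 600  # three 300 GB disks in RAID 5

# Which data sets each proposed server would hold, in GB.
PLAN = {
    "Telehouse backup server": {"Telehouse production": 375},
    "Sunset Park backup server": {
        "Telehouse production": 375,
        "member offices": 75,
    },
}


def free_space_gb(datasets: dict, usable_gb: float = USABLE_GB_PER_SERVER) -> float:
    """Space left on a server after storing every data set in `datasets`."""
    return usable_gb - sum(datasets.values())


if __name__ == "__main__":
    for server, datasets in PLAN.items():
        used = sum(datasets.values())
        free = free_space_gb(datasets)
        print(f"{server}: {used} GB used, {free} GB free "
              f"({used / USABLE_GB_PER_SERVER:.0%} full)")
```

With these numbers the Sunset Park server ends up fuller than the Telehouse one because it carries the member backups as well, so that is probably the machine to watch first as our data grows.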

Now we can retire Richie and Iz and re-use their hard drives for building more backup servers.