Two Terabyte disks: prepare for major changes

2010-01-08 5-minute read

When we bought our first 2 TiB disk we had no idea what was in store.

Over the last several months we have been tearing our hair out over painfully slow performance caused by I/O bottlenecks. Our desperate attempts to remedy the situation, by moving terabytes of data from slow disks to better-performing ones, only compounded the problem.

Over time, we’ve started to piece together the underlying issues and figure out a strategy for properly using these disks, which has caused yet more slowdowns during new server installations.

Below is an attempt to explain, at a high level, why previously routine tasks, such as installing Debian on new machines or replacing dead hard disks, have become significantly more complicated and now take far longer than before.

Partition tables

Up to now, most of us have created partition tables (the information stored on a disk that tells an OS or BIOS what partitions exist, where to find them, etc.) in the Master Boot Record or MBR. That’s the first 512-byte sector of the disk. When you run the Debian installer or just about any disk partitioning utility, it stores the partition info in the MBR.

The MBR approach, however, has a limitation: it can’t handle disk partitions larger than 2 TiB (well… technically that means you could still use the MBR with a 2 TiB disk because no single partition would be larger than 2 TiB… however, given the rate of growth in disk size, it seems like now is the time to tackle this problem).
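The 2 TiB figure can be reproduced with a little shell arithmetic. This is only a sketch, and it rests on one fact not spelled out above: the MBR records sector numbers as 32-bit values, and the disks discussed here expose 512-byte sectors. The variable name is ours.

```shell
# Assumption: MBR sector numbers are 32 bits wide and sectors are 512 bytes.
# The largest addressable span is then 2^32 sectors * 512 bytes.
mbr_limit_bytes=$(( (1 << 32) * 512 ))

echo "$mbr_limit_bytes bytes"                    # 2199023255552 bytes
echo "$(( mbr_limit_bytes / (1 << 40) )) TiB"    # 2 TiB
```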

Fortunately, a new partition table layout has been created: GPT, or GUID Partition Table.

The bad news is that not all of our favorite tools can handle GPT partitioned disks, and some will fail spectacularly.

Grub

Although legacy Grub (Grub 1) supposedly supports GPT, when making this change we’ve opted to switch to Grub 2, which its web site proudly proclaims has been rewritten from scratch. Ugh. Although legacy Grub users will recognize a few bits and pieces from Grub 1, it’s a steep learning curve.

When combined with GPT, that learning curve includes a significant departure from how we previously installed Grub on a disk. If you want to use Grub 2 with GPT, you need to create a small partition for Grub on the disk (in addition to your regular /boot partition) and add a flag on that partition called bios_grub. Note: Grub 2 only needs this partition if you are using a GPT-partitioned disk.

Then, when you run grub-install, it will be installed into that partition in a way that will properly boot your operating system.

Partition alignment

If all of this wasn’t enough…

With the introduction of 2 TiB disks, disk manufacturers are beginning to change the way they are writing data.

Previously, disks wrote data in 512-byte sectors. As a result, all disk utilities of the recent past have religiously created partitions and all other forms of dividing up a disk on 512-byte boundaries.

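Some manufacturers of 2 TiB disks, however, are writing data in 4096-byte sectors. That means if you create a partition that does not start on a 4096-byte boundary you are essentially screwed: writes that straddle a physical sector force the disk to read, modify and rewrite that sector, and performance suffers.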

Consider a disk in which the following pipes represent 512-byte boundaries and each [Pn ] represents a partition properly aligned along those boundaries:

|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | 
[P1     ][P2        ][P3                        ]

All partitions neatly start at the beginning of a 512-byte block. Every time the disk wants to write, it can easily fit the data into the sectors.

A 2 TiB disk that uses 4096-byte sectors, however, needs to be divided along 4096-byte boundaries, displayed below with the middle row of pipes. As you can see, your beautifully aligned partitions are now a mis-aligned mess:

|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | 
|                       |                       |
[P1     ][P2        ][P3                        ]
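The same picture can be put into numbers. The sketch below checks whether a partition's starting sector (counted in 512-byte sectors) falls on a 4096-byte boundary. The function name is ours, and sector 63 is used only as an illustration: it is the traditional starting point for the first partition under old MBR-style tools, not something taken from our own disks.

```shell
# Succeeds (exit status 0) if a partition starting at the given 512-byte
# sector begins exactly on a 4096-byte boundary.
starts_on_4k() {
    [ $(( $1 * 512 % 4096 )) -eq 0 ]
}

# 63 is the traditional MBR-era start sector (illustrative assumption);
# 2048 is the start used in the layout later in this post.
for s in 63 2048; do
    if starts_on_4k "$s"; then
        echo "sector $s: aligned"
    else
        echo "sector $s: starts $(( s * 512 % 4096 )) bytes past a 4096-byte boundary"
    fi
done
```

A partition starting at sector 63 begins 3584 bytes into a physical sector, so every 4096-byte block in it straddles two physical sectors.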

What you can do

We haven’t yet fully tested mdadm, cryptsetup and lvm to ensure that they create data on 4096-byte boundaries. Initial poking around suggests that they do - but more work is needed to be certain.

The version of parted we’re using (squeeze), on the other hand, will not attempt to align your partitions on 4096-byte boundaries for you. You need to do that yourself by specifying the exact, properly aligned boundaries.

We have a write-up with the new steps for creating a Debian server using 2 TiB disks.

The summary is: when partitioning your disk using parted:

  • switch the unit to sectors (unit s)
  • ensure that the starting sector is divisible by both 8 and 512
  • ensure that the ending sector + 1 is divisible by both 8 and 512 (so that the next sector start point is properly aligned)
  • ensure that the size is divisible by both 8 and 512.
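The rules above are easy to check mechanically. Here is a minimal sketch, assuming start and end are given in 512-byte sectors with the end inclusive, as parted prints them under unit s. The function name is ours. Since 8 sectors of 512 bytes make one 4096-byte physical sector, and any multiple of 512 is also a multiple of 8, a single test against 512 covers both conditions.

```shell
# Check one partition against the three rules: the start, the sector after
# the end, and the size must all be multiples of 512 sectors.
# Arguments: start sector, end sector (inclusive).
check_partition() {
    start=$1 end=$2
    size=$(( end - start + 1 ))
    for n in "$start" "$(( end + 1 ))" "$size"; do
        # A multiple of 512 is automatically a multiple of 8.
        [ $(( n % 512 )) -eq 0 ] || return 1
    done
    return 0
}

check_partition 2048 4095 && echo "2048-4095: ok"
check_partition 63 4095 || echo "63-4095: misaligned"
```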

For the math challenged, here’s a functional layout of a GPT partitioned disk:

~ # parted /dev/sda unit s p
Model: ATA WDC WD20EADS-00R (scsi)
Disk /dev/sda: 3907029168s
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start     End          Size         File system  Name      Flags    
 1      2048s     4095s        2048s                     biosboot  bios_grub
 2      4096s     1052671s     1048576s                  boot      raid     
 3      1052672s  3905974271s  3904921600s               pv        raid  

You can get there with these commands:

parted /dev/sda mklabel gpt

parted /dev/sda unit s mkpart biosboot 2048 4095 
parted /dev/sda set 1 bios_grub on 

parted /dev/sda unit s mkpart boot 4096 1052671 
parted /dev/sda set 2 raid on 

parted /dev/sda unit s mkpart pv 1052672 3905974271
parted /dev/sda set 3 raid on 
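If you would rather think in sizes than in end sectors, the end values can be derived instead of worked out by hand. The helper below is a hypothetical convenience of ours, not part of parted: it only prints the mkpart command and does not touch the disk. It assumes /dev/sda and sizes in 512-byte sectors, as in the layout above.

```shell
# Print (do not run) the parted mkpart command for a partition, given its
# name, start sector and size in sectors. parted's end sector is
# inclusive, so end = start + size - 1.
mkpart_cmd() {
    name=$1 start=$2 size=$3
    end=$(( start + size - 1 ))
    echo "parted /dev/sda unit s mkpart $name $start $end"
}

mkpart_cmd biosboot 2048 2048        # ... mkpart biosboot 2048 4095
mkpart_cmd boot 4096 1048576         # ... mkpart boot 4096 1052671
mkpart_cmd pv 1052672 3904921600     # ... mkpart pv 1052672 3905974271
```

As long as the start and the size are both multiples of 512, the next partition's start (end + 1) is aligned as well.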

I’m not sure what the disk size limitations of GPT are… but I hope we don’t reach them any time soon.
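For the curious, a back-of-the-envelope estimate. It assumes that GPT records sector addresses as 64-bit values and that sectors stay at 512 bytes. The number is too large for shell integers, so the sketch works with exponents instead (1 ZiB is 2^70 bytes).

```shell
# Assumption: 64-bit sector addresses, 512-byte (2^9) sectors.
# 2^64 sectors * 2^9 bytes = 2^73 bytes, and 1 ZiB = 2^70 bytes.
exp=$(( 64 + 9 ))
zib=$(( 1 << (exp - 70) ))
echo "2^$exp bytes = $zib ZiB"    # 2^73 bytes = 8 ZiB
```

That is roughly four billion times the MBR limit, so we should be fine for a while.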