2.6 Persistent Storage

With the exception of BIOS code and configuration, all data stored in memory is lost when power is cycled. To store data persistently, a non-volatile storage medium is required. Specifically, data from a system's main memory is usually stored on some sort of disk when applications are not using it, and is loaded again when the application needs it.

2.6.1 Local Hard Disks

Most clusters have a hard disk on each node for some storage. This is usually used in addition to a central data storage facility. Hard disks are magnetic storage media that interface with some sort of storage bus. A hard drive contains several platters, and data is read from these platters as they rotate. Logic in the drive reorders read and write requests based on the geometry of the disk to provide better collective performance. The drive also contains a memory cache, which avoids repeated reads of the same data from the platters.

Disks also have an interface to any of a number of disk buses. The three most common buses currently in use for commodity disks are IDE (or EIDE or ATA), SCSI, and Serial ATA. IDE disks are the most common. Controllers are integrated into nearly every motherboard sold today. These controllers support two devices per bus and typically include two buses, for a total of four devices. The fastest of these buses, UDMA133 (Ultra DMA 133), runs at rates up to 133 MB/s. IDE devices are typically implemented with less logic on each drive, leading to higher host CPU utilization during I/O when compared with SCSI.

SCSI disks are typically used in servers. Everything but the bus interface logic is nearly identical in many disks, regardless of disk interface bus. Many vendors sell multiple versions of many of their drives, one for each bus type. That said, the major technical difference between IDE and SCSI disks is the obvious one: the data bus. SCSI buses support many more devices and run at higher speeds. Current SCSI buses support up to fifteen devices plus the controller, which functions as a SCSI device as well. Current-generation SCSI buses operate at rates up to 320 MB/s. This higher data rate is needed because of the larger number of devices sharing a single bus. At this point, the largest practical difference between IDE and SCSI disks is cost; SCSI disks are more expensive.
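The relationship between bus rate and device count can be checked with a little arithmetic. The sketch below uses only the peak rates and device limits quoted above, and assumes (as a simplification) that every device on a fully populated bus is active at once and shares the bus evenly, with no protocol overhead. The function name is illustrative.

```python
def per_device_share(bus_rate_mb_s: float, devices: int) -> float:
    """Even share of peak bus bandwidth per device, in MB/s.

    Simplifying assumption: all devices transfer at the same time
    and there is no protocol overhead.
    """
    if devices < 1:
        raise ValueError("a bus needs at least one device")
    return bus_rate_mb_s / devices


# Figures quoted in the text: UDMA133 with two devices per bus,
# SCSI at 320 MB/s with up to fifteen devices per bus.
ide_share = per_device_share(133, 2)
scsi_share = per_device_share(320, 15)

print(f"IDE  (2 devices):  {ide_share:.1f} MB/s per device")
print(f"SCSI (15 devices): {scsi_share:.1f} MB/s per device")
```

Under these assumptions, a fully loaded SCSI bus leaves each device a smaller share than a fully loaded IDE bus, which is why the higher aggregate rate is needed.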

Serial ATA, or SATA, is the newest commodity disk standard. New, high-end motherboards are beginning to incorporate SATA controllers. Nominally, Serial ATA is similar to IDE/ATA. Those older standards are now referred to collectively as Parallel ATA, or PATA. SATA is poised to take over the market segment of PATA; drives are not quite price competitive at this time, but their prices are close enough that in the next few months, they should drop to PATA levels. Serial ATA, as the name suggests, is a serial bus, as opposed to the parallel buses used by PATA and SCSI. Hence, the cables attached to drives are smaller and run faster: current SATA connections function at . Because SATA buses are only used by two devices, the aggregate data rate doesn't need to be as high as those on parallel buses to perform comparably. Because of the serial nature of SATA, bus speeds are expected to increase rapidly when compared with parallel buses like PATA and SCSI. SATA is natively hot-pluggable, and its cables are far smaller than the ribbon cables used by PATA and SCSI. The increased speed of SATA buses doesn't provide a real benefit at this point; most drives don't function at speeds high enough to congest a high-speed PATA controller.

The same basic disk technology is used in disks using any of the three previously mentioned buses. Hence, the basic measures of performance are the same as well. The platters in disks spin at a variety of rates. The faster the platters spin, the faster data can be read off of the disk, and the sooner data on the far side of a platter arrives under the head. Rotational speeds range from 5,400 RPM to 15,000 RPM. The faster the platters rotate, the lower the latency and the higher the bandwidth. The other main indicator of performance of a disk is the amount of cache included in the on-disk controller. As was mentioned previously, this cache is used to avoid disk reads when particular blocks on the disk are requested multiple times.
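The effect of rotational speed on latency can be estimated directly. On average, a requested sector is half a rotation away from the head, so the average rotational latency is half the time of one full rotation. The sketch below computes this for the two extremes quoted above; the 7,200 RPM entry is an assumed intermediate value added for illustration, and seek time is deliberately ignored.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency in milliseconds.

    On average the platter must turn half a rotation before the
    requested sector passes under the head.
    """
    seconds_per_rotation = 60.0 / rpm
    return 0.5 * seconds_per_rotation * 1000.0


# 5,400 and 15,000 RPM come from the text; 7,200 RPM is an
# assumed intermediate value for comparison.
for rpm in (5400, 7200, 15000):
    print(f"{rpm:>6} RPM: {avg_rotational_latency_ms(rpm):.2f} ms")
```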

2.6.2 RAID

RAID, or Redundant Array of Inexpensive Disks, is a mechanism by which the performance and storage properties of individual disks can be aggregated. Aggregation may be done for a variety of reasons. Simplification of disk layout is the most common: the group of disks appears to be a single larger disk. This approach is commonly used when the individual disks are not as large as the data that will be stored. Performance is another common reason, since multiple disks working in parallel can deliver more throughput than a single disk. The last reason RAID is used is to guard against hardware failure. When multiple disks are used in a RAID set, data can be stored in multiple places. This approach allows the system to continue functioning with no loss of data after disk faults. These solutions can be implemented in software, usually as an operating system driver, or in hardware, typically consisting of disk controllers, a processor that handles RAID functions, and a host connection. Hardware solutions tend to be more expensive but also tend to perform better without impacting host CPU utilization. Software solutions typically allow more flexibility, but some RAID levels carry enough computational overhead to consume a large share of host CPU resources.

A variety of allocation schemes are used in RAID systems. With RAID0, or striping, data is striped across multiple disks. The result of this striping is a logical storage device that has the capacity of each of the disks times the number of disks present in the array. This array performs differently from a single larger disk. Reads are accelerated: consecutive blocks of data reside on different disks, so interleaving reads between disks lets several disks transfer data at once, and with two disks read performance can approach double that of a single disk. Write performance is similarly accelerated, as writes are spread across the disks in the same way. RAID0 stores no redundant data, so the failure of any one disk loses the contents of the array.
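Striping can be sketched as a simple address mapping from a logical block number to a disk and an offset on that disk. The sketch below assumes a stripe unit of exactly one block and round-robin placement; real implementations typically use a larger, configurable stripe unit, and the function name is illustrative.

```python
from typing import Tuple


def locate_block(logical_block: int, num_disks: int) -> Tuple[int, int]:
    """Map a logical block number to (disk index, block offset on that disk).

    Assumes a stripe unit of one block and round-robin placement.
    """
    if logical_block < 0 or num_disks < 1:
        raise ValueError("invalid block number or disk count")
    disk = logical_block % num_disks
    offset = logical_block // num_disks
    return disk, offset


# Eight consecutive logical blocks on a four-disk array: each run of
# four neighbouring blocks lands on four different disks.
for block in range(8):
    disk, offset = locate_block(block, num_disks=4)
    print(f"logical block {block} -> disk {disk}, offset {offset}")
```

Because neighbouring logical blocks map to different disks, a large sequential transfer keeps all of the disks busy at the same time.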

With RAID1, or mirroring, complete copies of the data are stored in multiple locations. With two copies, the usable capacity of such a RAID set is half of its raw capacity. In this configuration, reads are accelerated in a similar manner to RAID0, because any copy can satisfy a read, but writes are slowed, as new data needs to be transmitted multiple times, once to each part of the mirror.

The third common RAID level is RAID5. It works similarly to RAID0, in that data is spread across multiple disks, with one addition: one disk's worth of capacity is used to store parity information. For each stripe of data blocks stored across N-1 drives in an N-drive array, a parity block is computed and stored on the remaining drive. (RAID5 implementations rotate the parity block from drive to drive across stripes, rather than dedicating a single disk to parity.) This allows the array to continue functioning after a single drive failure, as the parity block, combined with the surviving data blocks, can be used to reconstruct the block from the failed drive. Read performance on RAID5 volumes tends to be quite good, but write performance lags behind mirrors because of the overhead of parity computation. This overhead can cause performance problems when using software RAID.
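The recovery property of parity can be demonstrated in a few lines. The sketch below assumes the parity block is the byte-wise XOR of the data blocks in a stripe, which is the usual RAID5 parity computation; the block contents and the helper name are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length byte blocks."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a four-disk array: three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the disk that held data[1]: XOR the surviving data
# blocks with the parity block to rebuild the missing block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])
```

Every write must update the parity block as well as the data block, which is the source of the write overhead described above.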

RAID is typically used on storage nodes in clusters, because of its performance and capacity advantages over standalone disks. These disk I/O characteristics are not of prime importance on compute nodes, so RAID is not typically configured there.
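The capacity trade-offs among the three RAID levels described above can be summarized in a small helper. This is a sketch under stated assumptions: all disks are the same size, RAID1 is a two-way mirror, and RAID5 gives up one disk's worth of capacity to parity. The function name and the 100 GB disk size are illustrative.

```python
def usable_capacity(level: int, num_disks: int, disk_gb: float) -> float:
    """Usable capacity in GB for RAID levels 0, 1, and 5 (equal-size disks)."""
    if level == 0:
        # Striping: every disk contributes its full capacity.
        return num_disks * disk_gb
    if level == 1:
        # Two-way mirroring (assumed): half of the raw capacity.
        if num_disks < 2 or num_disks % 2:
            raise ValueError("two-way mirroring needs an even number of disks")
        return num_disks * disk_gb / 2
    if level == 5:
        # One disk's worth of capacity holds parity.
        if num_disks < 3:
            raise ValueError("RAID5 needs at least three disks")
        return (num_disks - 1) * disk_gb
    raise ValueError("only RAID levels 0, 1, and 5 are handled here")


for level in (0, 1, 5):
    print(f"RAID{level}, 4 x 100 GB: {usable_capacity(level, 4, 100):.0f} GB usable")
```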

2.6.3 Nonlocal Storage

Nonlocal storage is used in similar ways to local storage. Data that needs to survive system power cycles is stored there. The physical medium on which data is stored is similar, if not identical, to the hard disk technology described in the preceding sections: the difference lies in the data transport layer. In the case of nonlocal storage, the storage device bus traffic is transmitted across a network to a central depot of storage. This network may or may not be dedicated to storage; standards exist for protocols of both types.

iSCSI is a protocol that encapsulates SCSI commands and data inside IP packets. These are typically transmitted over Ethernet. It allows a single network to be used for disk I/O and regular network traffic; however, sharing the network in this way can form a serious performance bottleneck. Fibre Channel is similar to iSCSI in character, but uses a dedicated network and data protocol.

Network filesystems are the most common form of nonlocal storage in clusters. Examples include NFS and PVFS. (PVFS is discussed in detail in Section 19.) Network filesystems transmit persistent data across a network, but differ from the previous two storage types in the nature of the data being transmitted. Network filesystems transmit data with filesystem semantics across the network; the previous two protocols transmit block-based data.
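The difference between block-based and filesystem semantics shows up in what a client has to name in each request. The toy, in-memory sketch below is purely illustrative and does not model any real protocol: both classes and all of their methods are hypothetical. A block device is addressed by block number and always moves whole fixed-size blocks, while a file server is addressed by path, byte offset, and length.

```python
class ToyBlockDevice:
    """Block semantics: the client addresses fixed-size, numbered blocks."""

    def __init__(self, num_blocks: int, block_size: int = 512):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]

    def read_block(self, lba: int) -> bytes:
        return self.blocks[lba]

    def write_block(self, lba: int, data: bytes) -> None:
        if len(data) != self.block_size:
            raise ValueError("block devices transfer whole blocks only")
        self.blocks[lba] = data


class ToyFileServer:
    """Filesystem semantics: the client names a file and a byte range."""

    def __init__(self):
        self.files = {}

    def write(self, path: str, offset: int, data: bytes) -> None:
        buf = bytearray(self.files.get(path, b""))
        if len(buf) < offset:
            buf.extend(bytes(offset - len(buf)))  # zero-fill any gap
        buf[offset:offset + len(data)] = data
        self.files[path] = bytes(buf)

    def read(self, path: str, offset: int, length: int) -> bytes:
        return self.files[path][offset:offset + length]


dev = ToyBlockDevice(num_blocks=4)
dev.write_block(2, b"x" * 512)        # request names a block number

srv = ToyFileServer()
srv.write("/home/a.txt", 0, b"hello")  # request names a path and offset
print(srv.read("/home/a.txt", 1, 3))
```

With block semantics, the client itself must run a filesystem on top of the device to decide which blocks belong to which file; with filesystem semantics, that bookkeeping stays on the server.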