What you will learn from this tip: The importance of replacing a failed RAID drive immediately and the possible consequences of not doing so in a timely manner.
As the name implies, RAID has built-in redundancy to protect data from drive failures. However, to maintain that protection, it is vital to replace any failed or failing drives in the array as quickly as possible.
The most pressing reason to replace a drive quickly is that, until you do, the array has little or no protection against another drive failure. While the common redundant RAID levels, such as RAID-1 and RAID-5, can withstand a single drive failure, they cannot, in general, handle a second failed drive.
There is a common assumption that drive failures are completely independent, but this isn't entirely true. For one thing, the drives in an array are usually identical, often down to the same manufacturing lot, so a defect in one is likely to be present in the others. Defects aside, electronic components commonly show relatively high failure rates at the beginning and end of their lives, and drives installed together reach those periods together. In other words, if one drive in an array has failed, the likelihood of another failure increases significantly.
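To put rough numbers on the exposure, here is a minimal Python sketch. It assumes exactly the independent, constant-rate failure model that the paragraph above calls optimistic, so the results are best read as a lower bound. The function name, the drive count and the MTBF figure are hypothetical values chosen for illustration, not vendor data.

```python
import math

def second_failure_probability(surviving_drives, exposure_hours, mtbf_hours):
    """Chance that at least one surviving drive fails during the exposure window.

    Assumption: failures are independent, with a constant rate of
    1 / mtbf_hours per drive. Real arrays tend to do worse than this.
    """
    combined_rate = surviving_drives / mtbf_hours
    return 1.0 - math.exp(-combined_rate * exposure_hours)

# Hypothetical array: 7 surviving drives, each rated at a 500,000-hour MTBF.
for hours in (4, 24, 168):
    p = second_failure_probability(7, hours, 500_000)
    print(f"{hours:>4} h unprotected: {p:.4%}")
```

Under this model the risk grows almost linearly with the time the array runs unprotected, which is the arithmetic behind replacing the drive promptly. Correlated failures of the kind described above only make the real figure higher.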
A second reason to act quickly is performance. In RAID levels that use parity, such as RAID-3 and RAID-5, performance suffers when a drive fails. To continue functioning, the array must reconstruct the failed drive's data on the fly from the data and parity stored on the surviving drives. This takes time and results in significantly slower performance, on reads of the missing data as well as on writes.
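The reconstruction itself is a simple XOR across the surviving blocks of each stripe. The Python sketch below illustrates the idea with one stripe across three hypothetical data drives. The helper name and the sample data are assumptions for illustration, and a real controller does this in hardware or firmware over far larger blocks.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe spread across three hypothetical data drives.
data = [b"RAID", b"tips", b"1234"]

# The parity block is the XOR of all data blocks in the stripe.
parity = xor_blocks(data)

# Suppose the second drive fails. Its block is recovered by XORing
# the surviving data blocks with the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```

Every read of a missing block requires reading the matching block on every surviving drive, which is why a degraded array slows down.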
To reduce vulnerability, nearly all RAID vendors, including EMC Corp. and Promise Technology Inc., offer arrays with hot spares. When a drive fails, the array automatically begins to rebuild itself using the already installed spare drive. Whether your arrays use hot spares or not, it's a good practice to keep spare drives of the proper characteristics (capacity, rotation speed, etc.) on hand so you can quickly replace any drive that fails. If the data stored on the array is critical, but not critical enough to warrant hot spares, the array should at least allow for hot-swapping drives, that is, replacing them without having to shut down the array.
Since hot spares generally aren't used until a drive fails, it is a good idea to test them regularly to make sure they're fully functional. Most vendors include a test function on the controller, and you should use it regularly.
Keep in mind that swapping the drive is only part of the recovery process. The array still has to rebuild itself by putting the data from the failed drive onto the new drive. With mirrored arrays (RAID-1), this is a relatively quick process, since the data is simply copied from the surviving mirror. Parity-based RAID levels take longer, sometimes much longer, to restore. How long depends on the size of the drive, the amount of data to be restored, the characteristics of the array and controller, and whether your system has kept a log of changed blocks or files. It isn't uncommon for the restoration process to take hours and, in extreme cases, days.
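A back-of-the-envelope estimate shows how rebuilds stretch from hours into days. The Python sketch below assumes the whole replacement drive is rewritten at a steady rate. The 4,000 GB capacity and the rebuild rates are hypothetical figures, and actual rates depend on the controller and on how much production I/O the array is serving at the same time.

```python
def estimate_rebuild_hours(drive_capacity_gb, rebuild_rate_mb_per_s):
    """Rough time to rewrite an entire replacement drive at a sustained rate."""
    total_mb = drive_capacity_gb * 1000  # decimal units, as drive vendors quote them
    seconds = total_mb / rebuild_rate_mb_per_s
    return seconds / 3600

# Hypothetical 4,000 GB drive at three sustained rebuild rates.
for rate in (100, 25, 5):
    hours = estimate_rebuild_hours(4000, rate)
    print(f"{rate:>3} MB/s: {hours:.1f} h")
```

With these assumed figures the rebuild takes roughly 11 hours at 100 MB/s and more than nine days at 5 MB/s, a rate a busy array might be throttled to.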
About the author: Rick Cook has been writing about mass storage since the days when the term meant an 80 K floppy disk. The computers he learned on used ferrite cores and magnetic drums. For the last 20 years, he has been a freelance writer specializing in storage and other computer issues.