Self-healing storage explained

Today's IT environment requires a highly reliable foundation of storage where most failures are prevented, and if not prevented, then resolved in place. A self-healing array that eliminates up to 70 percent of drive failures and can then cover the remaining legitimate failures will minimize overall data vulnerability.

Ever wonder what the number one cause of drive failures is? Answer: Nothing.

That's right. Nothing. Drive manufacturers report that some 70% of the drives returned to them by disk array makers have nothing wrong with them. Why? Because heat and vibration can cause intermittent errors in storage arrays, and the only remedy array manufacturers have for these intermittent errors is to fail the drive.

Why does this matter to you, a storage administrator? The drive is under warranty, so why should you care? Here are three reasons why you should.

  1. A drive failure costs you time: packing the drive, arranging for the manufacturer to pick it up, buying more hot spares to replace the ones just consumed, installing them, labeling them, and so on. If there's one thing a busy storage professional does not need, it's more paperwork.
  2. You suffer the performance degradation and time penalty of a RAID rebuild (up to tens of hours for high-capacity drives). During the rebuild, you run the risk of a second drive failure (likely another false positive) and, as a result, could suffer a complete data loss and a roll-tape event (a system failure where you have to recover all the data from tape).
  3. Although no one will ever admit that they have done this, you risk the possibility of physically removing the wrong drive (which can also lead to complete data loss and another roll-tape event).
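To put the rebuild-window risk in perspective, here is a rough back-of-the-envelope sketch; the drive capacity and rebuild rate are illustrative assumptions, not figures from any particular array:

```python
# Rough estimate of a RAID rebuild window. Capacity and throughput figures
# below are illustrative assumptions, not vendor specifications.

def rebuild_hours(capacity_tb: float, rebuild_mb_per_s: float) -> float:
    """Hours needed to rewrite one drive at a sustained rebuild rate."""
    capacity_mb = capacity_tb * 1_000_000  # decimal TB -> MB
    return capacity_mb / rebuild_mb_per_s / 3600

# A deep 8 TB drive rebuilt at a throttled 50 MB/s while the array
# keeps serving production I/O:
print(f"{rebuild_hours(8, 50):.1f} hours of degraded, at-risk operation")
```

Every one of those hours is time in which a second failure, real or false, turns a routine swap into a restore from tape.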

The fundamental task in any storage system -- reading and writing data -- is performed by disk drives. Any interruption of this basic task has a ripple effect across all aspects of storage management, reducing performance, requiring human intervention and increasing the risk of service outage or data loss.

A storage system that itself could automatically resolve erroneous disk drive failures would save everyone time and money, and eliminate the introduction of unneeded risk into the storage environment. What can be done to make a system self-healing?

Modifying the drive enclosure
The best way to heal a drive is to avoid failure conditions in the first place. Before dealing with any of the software issues, it makes sense to modify the physical enclosure to eliminate potential causes of failure. Today's systems can place 12 or more drives (all powered on) in an off-the-shelf 3U chassis, which creates heat and vibration. Reducing heat and reducing vibration are the two biggest steps a supplier can take to improve drive reliability.

Excessive drive vibration stems from the way today's external arrays are put together. Drives are tightly packed into a single drive bay, then mounted on drive sleds for easy access and removal. As a result, the drives are all mounted in the same orientation, the platters all spin in the same direction and the heads all seek in unison. This produces excessive harmonic vibration, which leads to enough read/write errors for the controller to presume a drive failure. These "failed" drives often work properly once they are sent back to the drive manufacturer.

A drive that vibrates excessively can fail outright. Vibration can also cause neighboring drives to miss reads or writes, so the external controller designates them as failed. This second issue is of real concern because it can cause a double drive failure: the vibrating drive first fails a drive in an adjacent slot and then fails itself. A double drive failure on a RAID 5 system requires that data be restored from another source, such as tape; no rebuild is possible at that point.
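The parity arithmetic makes this concrete. RAID 5 keeps one XOR parity block per stripe, so one lost block is recoverable but two are not; a toy sketch with small integers standing in for whole blocks:

```python
# RAID 5 keeps one parity block per stripe: the XOR of the data blocks.
# Toy example with small integers standing in for whole blocks.

d0, d1, d2 = 0b1010, 0b0110, 0b1100
parity = d0 ^ d1 ^ d2

# Single failure: any one lost block is the XOR of everything that survives.
rebuilt_d1 = d0 ^ d2 ^ parity
assert rebuilt_d1 == d1  # rebuild succeeds

# Double failure: with d0 AND d1 both gone, the survivors only yield
# d0 ^ d1, never either block individually -- data must come back from tape.
combined = d2 ^ parity
assert combined == d0 ^ d1
```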

Array makers can minimize vibration by packing the components rigidly so there is less movement from the spinning drives, and by designing the individual drive bay or housing so that it has the same rigidity throughout. In hot-swap systems, the drive bay is often looser in the front than in the back, which amplifies vibration for the front half of the drives.

The only way manufacturers can significantly reduce drive vibration is to redesign the way their drive shelves are packed. There are two ways they can do this. First, the drives must counter-rotate (meaning they must be installed front to back), alternating throughout the array shelf. Doing so naturally dampens vibration and reduces or eliminates enclosure torque. Two companies that counter-mount their drives are Xiotech and Copan Systems.

The second step is to build a better drive shelf and drive sled system that provides more consistent rigidity so the drives cannot vibrate. The combination of these two techniques can reduce vibration significantly.

Minimize heat buildup
The second method of precluding a drive failure is to minimize heat buildup. Manufacturers can do this by increasing and improving airflow in the drive enclosure. When you see how tightly packed most drive enclosures are, you may wonder how they get any airflow across the drive surfaces. One solution is to stop putting all the drives side by side in the front of a drive bay. Staggering the placement deeper into the drive bay increases airspace between drives, thereby improving airflow and reducing vibration.

While a physical redesign of the array hardware layout can significantly reduce the number of failures up front, other drive failures can be addressed by increasing the intelligence of the array system so it has the ability to heal itself.

The easiest step in creating a self-healing array is to power-cycle the drive, akin to rebooting a desktop workstation. In a self-healing drive system, the first attempt to repair a drive that is showing signs of failure is to automatically reset or power-cycle it in a manner that has little or no impact on normal operations. The key is to complete the whole process within the application's time-out thresholds, using cache to manage I/O during the recovery. Once the drive comes back on, it is tested to see if it is operating normally; if so, it is returned to service. This can all happen without user intervention.

Most of the time a simple reset or power cycle will fix the problem. While most array systems and controllers cannot do this, companies like Xiotech are leading the charge.

Process of remanufacturing
If the drive reset/power cycle does not clear the problem, a self-healing system should have the ability to go through a complete remanufacturing process. This includes recalibrating the heads, performing a low-level format and rewriting servo (control) tracks. In most cases, the steps of power-cycling the drive and performing the remanufacturing process will bring the drive back online, saving the storage administrator significant time and expense.
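The reset-then-remanufacture escalation described above can be sketched as a simple retry ladder. Everything here is hypothetical scaffolding: the callables stand in for firmware or controller operations an actual array would invoke.

```python
# Hypothetical sketch of the self-healing escalation ladder: reset/power-
# cycle first, then a full remanufacturing pass, and fail smart only if
# both fall short. The callables stand in for controller/firmware actions.

def heal_drive(is_healthy, power_cycle, remanufacture):
    """Return the name of the action that restored the drive, else 'fail_smart'."""
    for attempt in (power_cycle, remanufacture):
        attempt()            # try the repair in place
        if is_healthy():     # re-test the drive after each attempt
            return attempt.__name__
    return "fail_smart"      # escalate to a granular fail and rebuild

# Simulated drive whose fault survives a power cycle but not remanufacturing:
state = {"faults": 2}
def power_cycle():   state["faults"] -= 1   # clears transient errors only
def remanufacture(): state["faults"] = 0    # recalibrate, low-level format
def is_healthy():    return state["faults"] == 0

print(heal_drive(is_healthy, power_cycle, remanufacture))  # remanufacture
```

The same ladder degrades gracefully: a drive that neither attempt can restore falls through to the fail-smart path described below.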

A drive enclosure that reduces heat and vibration, combined with drive remanufacturing capabilities, should eliminate most drive failures. But drive failure can still occur in even the most drive-friendly environments. If a drive does eventually fail, the next logical step is to fail smart. The three aspects of failing smart include:

  1. Recovering data at a granular level, such as failing a single disk surface instead of the whole drive when a head fails. This minimizes the amount of data that must be copied or rebuilt, reducing recovery time.
  2. Putting the intelligence for drive and RAID management into the drive enclosure. Rebuilds are highly processor-intensive; putting the horsepower to manage the rebuild at the drive enclosure level distributes the load for the RAID rebuild process and allows I/O destined for other enclosures to proceed unimpaired, so other production workloads are not delayed unnecessarily.
  3. Improving spare-in-place technology. Dedicated hot-spare drives sit idle yet always powered on, wasting capacity and energy. With today's technology, there is no need for a dedicated spare drive: spare capacity should be spread across all available drives in the array, allowing maximum use of that capacity (i.e., all drives handle the workload, not just the non-spare drives) while minimizing power consumption.
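The capacity math behind the third point is simple to sketch. The drive counts and sizes below are assumed for illustration, not taken from any specific product:

```python
# Capacity math for spare-in-place vs. dedicated hot spares. Drive counts
# and sizes are illustrative assumptions, not from any specific product.

drives, spare_equivalents, drive_tb = 16, 2, 4.0

# Dedicated hot spares: two whole drives sit idle, powered on, doing nothing.
active_spindles_dedicated = drives - spare_equivalents      # 14 serve I/O

# Spare-in-place: reserve the same 8 TB of spare space, but spread it across
# every drive, so all 16 spindles serve I/O and rebuild writes fan out
# across the whole array instead of hammering one replacement drive.
reserve_per_drive_tb = spare_equivalents * drive_tb / drives

print(active_spindles_dedicated)   # 14
print(reserve_per_drive_tb)        # 0.5 TB reserved on each of 16 drives
```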

Today's IT environments require a highly reliable foundation of storage where most failures are prevented... and if not prevented, then resolved in place. A self-healing array that eliminates up to 70 percent of drive failures and can then cover the remaining legitimate failures will increase system administrator productivity and minimize overall data vulnerability.

About the author: George Crump is founder of Storage Switzerland, an analyst firm focused on the virtualization and storage marketplaces. It provides strategic consulting and analysis to storage users, suppliers, and integrators. An industry veteran of more than 25 years, Crump has held engineering and executive management positions at various IT industry manufacturers and integrators. Prior to Storage Switzerland, he was CTO at one of the nation's largest integrators.