Self-healing storage explained

19 Aug 2008 | George Crump, Contributor

Ever wonder what the number one cause of drive failures is? Answer: Nothing.

That's right. Nothing. Drive manufacturers report that some 70% of the drives returned to them by makers of disk arrays have nothing wrong with them. Why? Because heat and vibration can cause intermittent errors in storage arrays, and the only remedy that array manufacturers have for these intermittent errors is to fail the drive.

Why does this matter to you, a storage administrator? The drive is under warranty, so why should you care? Here are three reasons why you should.

  1. A drive failure costs you time: packing the drive, arranging for the manufacturer to pick it up, buying more hot spares to replace the ones just consumed, installing them, labeling them, and so on. If there's one thing a busy storage professional does not need, it's more paperwork.
  2. You suffer the performance degradation and time penalty of doing a RAID rebuild (up to tens of hours for deep drives). During this RAID rebuild, you run the risk of a second drive failure (probably also a false positive) and, as a result, could suffer a complete data loss and a roll-tape event (a system failure where you have to recover all the data from tape).
  3. Although no one will ever admit that they have done this, you risk the possibility of physically removing the wrong drive (which can also lead to complete data loss and another roll-tape event).

The fundamental task of any storage system -- reading and writing of data -- is performed by disk drives. Any interruption of this basic task has a ripple effect across all aspects of storage management by reducing performance, requiring human intervention and increasing the risk of service outage or data loss.
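
To put the "tens of hours" in the second reason into perspective, here is a minimal sketch of the rebuild-time arithmetic. The drive capacity and rebuild rates are hypothetical figures chosen only for illustration; they are not taken from any vendor's specifications.

```python
def rebuild_hours(capacity_gb: float, rebuild_mb_per_s: float) -> float:
    """Hours needed to rewrite a whole drive at a sustained rebuild rate."""
    seconds = (capacity_gb * 1000.0) / rebuild_mb_per_s
    return seconds / 3600.0


# Hypothetical figures (assumptions, not measurements).
CAPACITY_GB = 1000.0     # a 1 TB drive
IDLE_RATE_MB_S = 50.0    # rebuild rate when the array is otherwise idle
BUSY_RATE_MB_S = 10.0    # rebuild rate when throttled by production I/O

idle_hours = rebuild_hours(CAPACITY_GB, IDLE_RATE_MB_S)
busy_hours = rebuild_hours(CAPACITY_GB, BUSY_RATE_MB_S)

print(f"idle array: {idle_hours:.1f} hours")
print(f"busy array: {busy_hours:.1f} hours")
```

Under these assumptions, the rebuild window grows from a few hours to more than a day once production I/O competes with the rebuild, and the array is exposed to a second failure for that entire time.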

A storage system that itself could automatically resolve erroneous disk drive failures would save everyone time and money, and eliminate the introduction of unneeded risk into the storage environment. What can be done to make a system self-healing?

Modifying the drive enclosure
The best way to heal a drive is to avoid failure conditions in the first place. Before dealing with any of the software issues, it makes sense to modify the physical enclosure to make sure you are eliminating potential causes of failure. Today's systems can place 12 or more drives (all powered on) in an off-the-shelf 3U chassis. This creates heat and vibration. Reducing heat and reducing vibration are the two biggest steps a supplier can take to improve drive reliability.

Excessive drive vibration is caused by the way today's external arrays are put together. Drives are tightly packed into a single drive bay, then mounted on drive sleds for easy access and removal. This means the drives are all mounted in the same orientation, with the disks all spinning and the heads all seeking in the same direction. The result is excessive harmonic vibration, which leads to enough read/write errors for the controller to presume a drive failure. These "failed" drives often end up working properly once they are sent back to the drive manufacturer.

Vibration can cause the drive that is vibrating too much to fail. It can also cause neighboring drives to skip on reads or writes, hence the external controller will designate them as failed. This second issue is of real concern because it can cause a double drive failure by first failing a drive in an adjacent slot and then failing itself. Double drive failure on a RAID 5 system requires that data be restored from another source, such as tape. No rebuild is possible at this point.
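
The reason a second failure is fatal on RAID 5 can be shown with a small sketch of single-parity XOR. The three-data-block stripe and its contents are made up for illustration, and real arrays rotate parity across drives and work on much larger blocks.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks plus one parity block (illustrative contents).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# One drive lost (the one holding data[1]):
# XOR of the surviving blocks and parity reproduces the missing block.
rebuilt = xor_blocks([data[0], data[2], parity])

# Two drives lost (data[0] and data[1]):
# what remains only yields the XOR of the two missing blocks,
# which is not enough to recover either one.
leftover = xor_blocks([data[2], parity])

print(rebuilt == data[1])                        # single failure is recoverable
print(leftover == xor_blocks(data[:2]))          # only the combined XOR survives
```

With one parity block per stripe there is one equation and, after a double failure, two unknowns, which is why the data has to come back from another source.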

Array makers can minimize vibration by rigidly packing the components so there is less movement from the spinning drives, and by designing the individual drive bay or housing so that it has the same rigidity throughout. Often in hot swap systems, the drive bay is looser in the front than the back, which amplifies vibrations for the front half of the drives.

The only way manufacturers can significantly reduce drive vibration is to redesign the way their drive shelves are packed. There are two ways they can do this. First, the drives must counter-rotate (meaning they must be installed front to back), alternating throughout the array shelf. Doing so naturally dampens vibration and reduces or eliminates enclosure torque. Two companies that counter-mount their drives are Xiotech and Copan Systems.

The second way is to build a better drive shelf and drive sled system that provides more consistent rigidity so the drives cannot vibrate. The combination of these two techniques can reduce vibration significantly.

Minimize heat buildup
The second method of precluding a drive failure is to minimize heat buildup. Manufacturers can do this by increasing and improving airflow in the drive enclosure. When you see how tightly packed most drive enclosures are, you may wonder how they get any airflow across the drive surfaces. One solution is to stop putting all the drives side by side in the front of a drive bay. Staggering the placement deeper into the drive bay increases airspace between drives, thereby improving airflow and reducing vibration.

While the physical redesign of the array hardware layout can significantly reduce the number of failures up front, other drive failures can be addressed by increasing the intelligence of the array system so it has the ability to heal itself.

The easiest step in creating a self-healing array is to power-cycle the drive (akin to rebooting a desktop workstation), which usually fixes the problem. In the case of a self-healing drive system, the first attempt to repair a drive that is showing signs of failure is to automatically reset or power-cycle the drive in a manner that has little or no impact on normal operations. The key is to have the whole process performed within the application time-out thresholds, using cache to manage I/Os during the recovery. Once the drive comes back on, it is tested to see if it is operating normally. If so, it is returned to service. This can all be made to happen without user intervention.

Most of the time a simple reset or power cycle will fix the problem. While most array systems and controllers cannot do this, companies like Xiotech are leading the charge.

Process of remanufacturing
If the drive reset/power cycle does not clear the problem, a self-healing system should have the ability to go through a complete remanufacturing process. This includes recalibrating the heads, performing a low-level format and rewriting servo (control) tracks. In most cases, the steps of power-cycling the drive and performing the remanufacturing process will bring the drive back online, saving the storage administrator significant time and expense.
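
The escalation described above (power-cycle, test, remanufacture, test, and only then fail the drive) can be sketched as a small routine. This is an illustrative model only, not any vendor's firmware logic; `heal_drive`, `Outcome` and `SimulatedDrive` are hypothetical names invented for this example.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    RETURNED_AFTER_RESET = "returned to service after power cycle"
    RETURNED_AFTER_REMANUFACTURE = "returned to service after remanufacture"
    FAILED = "failed; rebuild required"


def heal_drive(power_cycle: Callable[[], None],
               remanufacture: Callable[[], None],
               passes_test: Callable[[], bool]) -> Outcome:
    """Try the cheapest repair first and escalate only if the drive still fails its test."""
    power_cycle()
    if passes_test():
        return Outcome.RETURNED_AFTER_RESET
    remanufacture()  # recalibrate heads, low-level format, rewrite servo tracks
    if passes_test():
        return Outcome.RETURNED_AFTER_REMANUFACTURE
    return Outcome.FAILED


class SimulatedDrive:
    """Stand-in for a real drive; 'fixed_by' names the step that cures it, if any."""

    def __init__(self, fixed_by):
        self.fixed_by = fixed_by
        self.healthy = False

    def power_cycle(self):
        self.healthy = self.fixed_by == "reset"

    def remanufacture(self):
        self.healthy = self.fixed_by == "remanufacture"

    def passes_test(self):
        return self.healthy


def run(fixed_by):
    drive = SimulatedDrive(fixed_by)
    return heal_drive(drive.power_cycle, drive.remanufacture, drive.passes_test)


for cause in ("reset", "remanufacture", None):
    print(cause, "->", run(cause).value)
```

In a real array the first step would also have to finish inside the application time-out, with cache absorbing I/O in the meantime; the sketch leaves timing out to keep the escalation order visible.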

A drive enclosure that reduces heat and vibration, combined with drive remanufacturing capabilities, should eliminate most drive failures. But drive failure can still occur in even the most drive-friendly environments. If a drive does eventually fail, the next logical step is to fail smart. The three aspects of failing smart include:

  1. Recovering data at a granular level, such as failing a surface instead of a whole drive if a head fails. This minimizes the amount of data that has to be copied out or rebuilt to reduce the time it takes to recover.
  2. Putting the intelligence for drive and RAID management into the drive enclosure. Rebuilds are highly processor-intensive; putting the horsepower to manage the rebuild at the drive enclosure level distributes the load for the RAID rebuild process and allows I/O destined for other enclosures to proceed unimpaired. This also ensures that unnecessary delay is not placed onto other production workloads.
  3. Improving spare-in-place technology. Having drives that sit idle and are always powered-on wastes capacity and energy. With today's technology, there is no need for a drive to be a hot spare. Spare capacity should be spread across all available drives in the array, allowing for maximum use of that capacity (i.e., use all the drives to handle the workload, not just the non-spare drives), while minimizing further power consumption.
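
A rough sketch of the arithmetic behind the first and third points follows. All figures are hypothetical, and the model assumes data is spread evenly across a drive's recording surfaces, which real drives do not guarantee.

```python
def data_to_recover_gb(drive_gb: float, surfaces: int, failed_surfaces: int) -> float:
    """Data affected when only some recording surfaces are failed (even-spread assumption)."""
    return drive_gb * failed_surfaces / surfaces


def spare_per_drive_gb(drive_gb: float, drives: int) -> float:
    """One drive's worth of spare capacity spread evenly across every drive in the shelf."""
    return drive_gb / drives


# Hypothetical shelf: twelve 1 TB drives, each with 4 platters (8 surfaces).
DRIVE_GB = 1000.0
SURFACES = 8
DRIVES = 12

whole_drive_gb = data_to_recover_gb(DRIVE_GB, SURFACES, SURFACES)
one_surface_gb = data_to_recover_gb(DRIVE_GB, SURFACES, 1)
reserve_gb = spare_per_drive_gb(DRIVE_GB, DRIVES)

active_with_hot_spare = DRIVES - 1      # dedicated spare sits idle
active_with_distributed_spare = DRIVES  # every drive serves the workload

print(f"fail whole drive: recover {whole_drive_gb:.0f} GB")
print(f"fail one surface: recover {one_surface_gb:.0f} GB")
print(f"distributed spare: reserve {reserve_gb:.1f} GB on each of {DRIVES} drives")
```

Under these assumptions, failing a single surface cuts the data to be copied out to one-eighth of the drive, and distributing the spare keeps all 12 drives working instead of 11.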

Today's IT environments require a highly reliable foundation of storage where most failures are prevented ... and if not prevented, then resolved in place. A self-healing array that eliminates up to 70 percent of drive failures and can then cover the remaining legitimate failures will increase system administrator productivity and minimize overall data vulnerability.

About the author: George Crump is founder of Storage Switzerland, an analyst firm focused on the virtualization and storage marketplaces. It provides strategic consulting and analysis to storage users, suppliers, and integrators. An industry veteran of more than 25 years, Crump has held engineering and executive management positions at various IT industry manufacturers and integrators. Prior to Storage Switzerland, he was CTO at one of the nation's largest integrators.
