Several storage system vendors claim their products can detect and repair hard disk problems automatically. Here's how they do it and the low-down on how well they work.
A fundamental change in the basic building blocks of storage is occurring, one that's as groundbreaking today as RAID was when it was introduced 20 years or so ago. The revolutionary development is commonly referred to as "autonomic self-healing storage," and it promises greater reliability from disk systems than ever before.
Autonomic self-healing storage might sound more like a trumped-up term than a fundamental change. After all, hasn't it been around for a while in the form of RAID, redundant array of independent nodes (RAIN), snapshots, continuous data protection (CDP) and mirroring?
If you define self-healing as the ability to restore from a failure situation, you'd be right. All of those familiar technologies are designed to restore data from a failure situation. But to be a bit more precise, those technologies are actually self-healing data, not self-healing storage. They restore data when there's a storage failure and mask storage failures from the apps -- they don't restore the actual storage hardware.
Self-healing storage is more accurately defined as transparently restoring both the data and storage from a failure. That might seem like splitting hairs, but it's not. It's the difference between treating the symptoms and fixing the cause.
What happens when a disk fails
The lowest common denominator in standard storage systems today is the hard disk drive (HDD). The hard disk drive is the only electro-mechanical device in the storage system, and it has the highest probability of failure or lowest mean time between failures (MTBF) (see "MTBF: The odds of failure," below). It's well documented that the HDD component is the Achilles' heel of a storage system.
MTBF: The odds of failure
A disk manufacturer's hard disk drive (HDD) mean time between failures (MTBF) rating enables you to forecast the useful operational life of a hard disk drive. When there are a lot of HDDs in the system, the probability of HDD failures increases. The general formula for calculating average time between drive failures within a system is as follows:
Average time between drive failures = MTBF of a single drive / number of drives in the system
Using the manufacturers' MTBF numbers (approximately 1.5 million hours for enterprise-class Fibre Channel and SAS HDDs, and approximately 600,000 hours MTBF for SATA HDDs), a system with 240 enterprise drives should expect a hard disk drive failure every 260 days: 1,500,000/240 = 6,250 hours or about 260 days (roughly two HDDs per year or approximately a 0.8% replacement rate). If the HDDs are SATA, the system should expect a HDD failure every 104 days (roughly four HDDs per year or approximately a 1.67% replacement rate).
Unfortunately, manufacturer MTBF numbers don't reliably reflect real-world MTBFs. The Computer Science department at Carnegie Mellon University in Pittsburgh ran stress tests of 100,000 Fibre Channel, SAS and SATA hard disk drives. Their published testing results determined that a typical drive (Fibre Channel, SAS or SATA) has a realistic MTBF of approximately six years or 52,560 hours. Using Carnegie Mellon's MTBF numbers, a storage system with 240 HDDs can expect a drive failure approximately every nine to 10 days (approximately 40 HDDs per year or an annual replacement rate of 16.67%).
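The sidebar's arithmetic is easy to reproduce. The short Python sketch below is only an illustration: the function names are made up for this example, and it assumes failures are spread evenly over time, which real drives don't guarantee.

```python
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 8760


def system_failure_interval_hours(drive_mtbf_hours, drive_count):
    """Average time between drive failures across the whole system."""
    return drive_mtbf_hours / drive_count


def annual_replacement_rate(drive_mtbf_hours):
    """Fraction of the drive population expected to fail per year."""
    return HOURS_PER_YEAR / drive_mtbf_hours


if __name__ == "__main__":
    scenarios = [
        ("Enterprise FC/SAS, vendor rating", 1_500_000),
        ("SATA, vendor rating", 600_000),
        ("Carnegie Mellon estimate", 52_560),
    ]
    for label, mtbf in scenarios:
        interval = system_failure_interval_hours(mtbf, 240)
        days = interval / HOURS_PER_DAY
        rate = annual_replacement_rate(mtbf) * 100
        print(f"{label}: one failure every {days:.0f} days, "
              f"{rate:.2f}% of drives per year")
```

For 240 drives this gives intervals of about 260, 104 and 9 days. The unrounded replacement rates for the vendor ratings come out slightly lower than the 0.8% and 1.67% quoted above, which are based on rounding up to whole drives per year.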
The unprecedented growth of data in most companies has led to an explosion of storage systems and hard disk drives. It's statistically proven that as HDDs proliferate so will the number of hard disk drive failures, which can lead to lost data. Analyzing what happens when a HDD fails illustrates the issue:
1) A hard disk drive fails.
2) The drive must be physically replaced, either manually or from an online pool of drives.
3) Depending on the RAID set level, the HDD's data is rebuilt on the spare:
4) The time it takes to rebuild the HDD's data depends on the hard disk drive's capacity, speed and RAID type.
- SATA drives typically have a rated non-recoverable read error rate of 1 in 10^14: roughly 1 out of 100,000,000,000,000 bits will have a non-recoverable read error. This means that a seven-drive RAID 5 group with 1 TB SATA drives will have approximately a 50% chance of failing during a rebuild, resulting in the loss of the data in that RAID group.
- Enterprise-class drives (Fibre Channel or SAS) are rated at 1 in 10^15 for non-recoverable read errors, which translates into less than a 5% chance of the RAID 5 group having a failure during a rebuild.
- RAID 6 eliminates the risk of data loss should a second HDD fail. You pay for that peace of mind with decreased write performance vs. RAID 5, and an additional parity drive in the RAID group.
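To get a feel for where rebuild-risk figures like these come from, here's a rough Python estimate. It's a simplified model, not a vendor's method: it assumes every bit on the six surviving drives must be read, that bit errors are independent, and that a single non-recoverable read error is enough to fail the rebuild. The function name is hypothetical.

```python
import math


def rebuild_failure_probability(surviving_drives, drive_capacity_bytes,
                                bits_per_error):
    """Chance of at least one non-recoverable read error during a rebuild.

    Assumes independent bit errors and a full read of every surviving drive.
    """
    bits_read = surviving_drives * drive_capacity_bytes * 8
    # 1 - (1 - 1/N)^bits, computed with log1p/expm1 to stay accurate
    # when 1/N is far smaller than floating-point epsilon.
    return -math.expm1(bits_read * math.log1p(-1.0 / bits_per_error))


if __name__ == "__main__":
    ONE_TB = 10**12
    sata = rebuild_failure_probability(6, ONE_TB, 10**14)
    enterprise = rebuild_failure_probability(6, ONE_TB, 10**15)
    print(f"SATA (1 in 10^14): {sata:.1%}")
    print(f"Enterprise (1 in 10^15): {enterprise:.1%}")
```

Under these assumptions the enterprise case lands just under 5%, in line with the figure above. The SATA case works out to roughly 38%, which is lower than the approximately 50% quoted but in the same range; the gap depends on how much data is assumed to be read and how errors are modeled.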
Eventually, the hard disk drive is sent back to the factory. Using the MTBF example in "MTBF: The odds of failure," this suggests that there'll be approximately 40 HDD "service events" per year.
Most storage admins might be surprised by what happens when a HDD is sent back to the factory. After being run through the factory's failure analysis process, the results for the vast majority of failed hard disk drives (somewhere between 67% and 90%) will be "no failure found" -- the HDD is fine. But the service event still took place and the RAID data rebuild still had to occur. That's a lot of operational hassle for "no trouble found."
Undetected data corruption
Another problem with HDDs that's rarely mentioned but actually quite prevalent is "silent data corruption." Silent data corruptions are storage errors that go unreported and undetected by most storage systems, resulting in corrupt data being provided to an application with no warning, logging, error messages or notification of any kind.
Most storage systems don't detect these errors, which occur on average with 0.6% of SATA HDDs and 0.06% of enterprise HDDs over 17 months (from "An Analysis of Data Corruption in the Storage Stack," L.N. Bairavasundaram et al., presented at FAST '08 in San Jose, Calif.). Silent data corruption occurs when the RAID doesn't detect data corruption errors such as misdirected or lost writes. It also occurs with a torn write -- data that's partially written and merges with older data, so the data ends up part original data and part new data. Because the hard disk drive doesn't recognize the errors, the storage system isn't aware of them either, so there's no attempt at a fix.
Autonomic self-healing systems
Among this new breed of systems, some tackle end-to-end error detection and correction, including silent data corruption. Other systems take the same approach, but add sophisticated algorithms that attempt to "heal-in-place" failed HDDs before requiring a RAID data rebuild. A final group of systems matches those capabilities and ups the ante with the new concept of "fail-in-place" so that in the rare circumstance when a HDD truly fails (i.e., it's no longer usable), no service event is required to replace the hard disk drive for a RAID data rebuild.
End-to-end error detection and correction
Vendors and products offering end-to-end error detection and correction include DataDirect Networks Inc.'s Silicon Storage Architecture (S2A) with its QoS and SATAssure; EMC Corp.'s Symmetrix DMX-4 with its Double Checksum; NEC's D-Series support of the American National Standards Institute's new T10 DIF (Data Integrity Field) standard for enterprise Fibre Channel or SAS HDDs, and their proprietary Extended Data Integrity Feature (EDIF) for SATA hard disk drives; Panasas Inc.'s ActiveStor with Vertical Parity for SATA HDDs; Sun Microsystems Inc.'s Zettabyte File System (ZFS)-based systems when volumes are mirrored; and Xiotech Corp.'s Emprise 5000 (aka Intelligent Storage Element), which is also based on the T10 DIF standard (see "Self-healing storage products," below).
T10 DIF is a relatively new standard and only applies to SCSI protocol HDDs (SAS and Fibre Channel) (see "Inside ANSI's T10 DIF spec," below). The T10 DIF standard is being incorporated into quite a few storage systems scheduled for release in 2009 and 2010. However, there's no standard spec for end-to-end error detection and correction for SATA hard disk drives at this time. That's why DataDirect Networks, EMC and NEC devised their own SATA end-to-end error detection and correction methodologies.
Inside ANSI's T10 DIF
The American National Standards Institute's (ANSI) T10 DIF (Data Integrity Field) specification calls for data to be written in blocks of 520 bytes instead of the current industry standard 512 bytes. The eight additional bytes or "DIF" provide a super-checksum that's stored on disk with the data. The DIF is checked on every read and/or write of every sector. This makes it possible to detect and identify data corruption or errors, including misdirected, lost or torn writes. ANSI T10 DIF provides three types of data protection:
When errors are detected, they can then be fixed by the storage system's standard correction mechanisms.
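The Python sketch below shows how eight extra bytes per 512-byte block can expose a torn or misdirected write. It's an illustration under stated assumptions, not a storage system's implementation: the field layout used here (a 2-byte CRC guard, a 2-byte application tag and a 4-byte reference tag carrying the low 32 bits of the block address) reflects the commonly described DIF format rather than anything spelled out in this article, and the function names are invented for the example.

```python
import struct

T10_DIF_POLY = 0x8BB7  # CRC-16 polynomial commonly cited for the DIF guard
BLOCK_SIZE = 512


def crc16_t10dif(data):
    """Bitwise CRC-16 (initial value 0, no reflection, no final XOR)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def protect(block, app_tag, lba):
    """Return a 520-byte sector: 512 data bytes plus an 8-byte DIF."""
    if len(block) != BLOCK_SIZE:
        raise ValueError("expected a 512-byte block")
    guard = crc16_t10dif(block)
    ref_tag = lba & 0xFFFFFFFF
    return block + struct.pack(">HHI", guard, app_tag, ref_tag)


def verify(sector, expected_lba):
    """Return a list of problems found; an empty list means the sector checks out."""
    block, dif = sector[:BLOCK_SIZE], sector[BLOCK_SIZE:]
    guard, _app_tag, ref_tag = struct.unpack(">HHI", dif)
    problems = []
    if guard != crc16_t10dif(block):
        problems.append("guard mismatch: data is corrupt or torn")
    if ref_tag != (expected_lba & 0xFFFFFFFF):
        problems.append("reference tag mismatch: possible misdirected write")
    return problems


if __name__ == "__main__":
    old = protect(b"\x00" * BLOCK_SIZE, app_tag=0, lba=7)
    new = protect(b"\xff" * BLOCK_SIZE, app_tag=0, lba=7)
    torn = new[:256] + old[256:]  # half new data, half old data, old DIF
    print(verify(new, expected_lba=7))   # clean sector
    print(verify(torn, expected_lba=7))  # torn write
    print(verify(new, expected_lba=8))   # sector found at the wrong address
```

The guard covers the data itself, while the reference tag ties the block to the address it was meant for, so data that lands in the wrong place fails the check even though its contents are intact.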
DataDirect Networks' S2A SATAssure software does a Reed-Solomon error-correction calculation on every read operation and then compares HDD data to parity to ensure data consistency. SATAssure repairs the data if an inconsistency is detected, then passes it back to the requesting app and rewrites it to the HDD. All of this happens in real-time. EMC DMX-4 uses a double checksum that's very similar to Oracle Corp.'s industry-proven double checksum that minimizes database corruptions.
NEC's D-Series EDIF is modeled on ANSI T10 DIF. The difference is that EDIF is specifically modified for SATA's Integrated Drive Electronics (IDE) protocol.
Panasas' Vertical Parity is designed to maintain individual hard disk drive reliability. Vertical Parity isolates and repairs (using redundant information in the horizontal RAID stripe) torn, lost or misdirected writes on SATA HDDs at the disk level before they're seen by the RAID array.
Sun's ZFS is now used in several unified storage systems (Sun's 4500 and 7000 Series, and the new OnStor Inc. Pantera LS 2100). ZFS utilizes its own end-to-end error-detection algorithms to sniff out silent data corruption. It requires mirrored volumes and corrects the detected silent data corruption by copying the uncorrupted data from the good volume.
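The mirror-based repair idea can be sketched in a few lines of Python. This is a toy model, not ZFS code: the class name is hypothetical, the two "volumes" are plain dictionaries, and SHA-256 stands in for whatever checksum a real system uses. The point it illustrates is that a checksum kept apart from the data lets a read tell the good copy from the bad one and rewrite the bad one.

```python
import hashlib


class MirroredStore:
    """Toy two-way mirror that verifies a checksum on every read."""

    def __init__(self):
        self.sides = [{}, {}]  # two mirror copies: block number -> bytes
        self.checksums = {}    # stored separately from the data blocks

    def write(self, block_no, data):
        self.checksums[block_no] = hashlib.sha256(data).digest()
        for side in self.sides:
            side[block_no] = data

    def _is_good(self, block_no, data):
        return hashlib.sha256(data).digest() == self.checksums[block_no]

    def read(self, block_no):
        """Return verified data, repairing any mirror copy that fails the check."""
        for side in self.sides:
            data = side[block_no]
            if self._is_good(block_no, data):
                for other in self.sides:
                    if not self._is_good(block_no, other[block_no]):
                        other[block_no] = data  # heal from the good copy
                return data
        raise IOError(f"block {block_no}: every mirror copy is corrupt")


if __name__ == "__main__":
    store = MirroredStore()
    store.write(0, b"payroll records")
    store.sides[0][0] = b"payroll rec0rds"  # simulate silent corruption
    print(store.read(0))                    # good data comes back
    print(store.sides[0][0])                # the bad copy has been rewritten
```

Without the second copy, the checksum could still detect the corruption but there would be nothing to repair it from, which is why mirroring matters in this approach.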
Does end-to-end error correction work?
User evidence over the past 18 months suggests that HDD error-correction methods work. Interviews with IT organizations storing petabytes of data (where silent data corruption is statistically more likely to be noticed) in mission-critical applications such as government labs, high-energy particle research, digital film/video production and delivery, seismic processing and so on, have revealed high levels of satisfaction. Perhaps the most telling remark came from an IT manager who wishes to remain anonymous: "I don't worry about silent data corruption anymore because it's no longer an issue for us."
In traditional disk subsystem designs, sector errors cause the HDD to be marked as failed. A failed HDD initiates a RAID data rebuild process that degrades performance and takes a long time. It can also be expensive, as there may still be useful life in the hard disk drive.
A heal-in-place system goes through a series of automated repair sequences designed to eliminate or reduce most of the "no failure found" HDD failures, as well as the subsequent unnecessary and costly RAID data rebuilds. As of now, there are five systems that provide heal-in-place capabilities: Atrato Inc.'s Velocity1000 (V1000), DataDirect Networks' S2A series, NEC's D-Series, Panasas' ActiveStor and Xiotech's Emprise 5000. Each provides a proven, albeit completely different, heal-in-place technology.
Atrato's V1000 uses fault detection, isolation and recovery (FDIR) technology. FDIR continuously monitors component and system health, and couples it with self-diagnostics and autonomic self-healing. Atrato uses FDIR to correlate SATA drive performance with its extensive database of operational reliability testing (ORT) performed on more than 100,000 SATA hard disk drives. FDIR uses decision logic based on that extensive ORT history, stress testing and failure analysis to detect SATA HDD errors. It then leverages Atrato Virtualization Software (AVS) to deal with detected latent sector errors (non-recoverable sectors temporarily or permanently inaccessible). AVS' automated background drive maintenance commonly prevents many of these errors. When it doesn't, it remaps at a sector level using spare capacity on the virtual spare SATA HDDs. This enables many of those SATA HDDs with sector errors to avoid being forced into a full failure mode permanently, and allows those SATA hard disk drives to be restored to full performance.
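Sector-level remapping is simple to picture in code. The Python sketch below is a generic illustration, not Atrato's AVS: the class, its methods and the spare-sector numbers are all hypothetical. It shows the basic decision, which is to redirect a bad sector to spare capacity when any is left and to escalate to a full drive failure only when it isn't.

```python
class SectorRemapper:
    """Toy remap table that redirects bad sectors to a pool of spares."""

    def __init__(self, spare_sectors):
        self.spares = list(spare_sectors)  # unused spare sector addresses
        self.remap = {}                    # bad sector -> spare sector

    def resolve(self, sector):
        """Translate a logical sector to where its data actually lives."""
        return self.remap.get(sector, sector)

    def handle_sector_error(self, sector):
        """Remap in place if possible; return False if the drive must be failed."""
        if not self.spares:
            return False
        self.remap[sector] = self.spares.pop(0)
        return True


if __name__ == "__main__":
    remapper = SectorRemapper(spare_sectors=[1000, 1001])
    print(remapper.handle_sector_error(42))  # remapped, drive stays in service
    print(remapper.resolve(42))              # reads now go to the spare
    print(remapper.resolve(43))              # healthy sectors are untouched
```

A real system would also have to reconstruct the lost sector's contents from RAID redundancy before serving reads from the spare; that step is left out here.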
DataDirect Networks' S2A's heal-in-place approach to disk failure attempts several levels of HDD recovery before a hard disk drive is removed from service. It begins by keeping a journal of all writes to each HDD showing behavior aberrations and then attempts recovery operations. When recovery operations succeed, only a small portion of the HDD requires rebuilding using the journaled information. Having less data to rebuild greatly reduces overall rebuild times and eliminates a service event.
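The payoff of journaling writes to a suspect drive shows up in how little has to be rebuilt afterward. The Python sketch below is a hypothetical illustration of that idea, not DataDirect Networks' code: it tracks only which blocks were written while the drive was misbehaving, then reports what needs rebuilding depending on whether recovery succeeded.

```python
class SuspectDriveJournal:
    """Tracks blocks written while a drive is being diagnosed and recovered."""

    def __init__(self):
        self.dirty_blocks = set()

    def record_write(self, block_no):
        """Note a write that arrived while the drive was out of normal service."""
        self.dirty_blocks.add(block_no)

    def blocks_to_rebuild(self, total_blocks, recovered):
        """Only journaled blocks if recovery worked, otherwise the whole drive."""
        if recovered:
            return sorted(self.dirty_blocks)
        return list(range(total_blocks))


if __name__ == "__main__":
    journal = SuspectDriveJournal()
    for block in (5, 9, 5):
        journal.record_write(block)
    partial = journal.blocks_to_rebuild(total_blocks=1_000_000, recovered=True)
    full = journal.blocks_to_rebuild(total_blocks=1_000_000, recovered=False)
    print(len(partial), "blocks to rebuild after a successful recovery")
    print(len(full), "blocks to rebuild after a failed recovery")
```

In this example, a successful recovery means rebuilding two blocks instead of a million, which is the difference between a brief resync and a full RAID rebuild.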
NEC's D-Series Phoenix technology detects sector errors, but allows operation to continue with the other HDDs in the RAID group. If an alternative sector can be assigned, the hard disk drive is allowed to return to operation with the RAID group, avoiding a complete rebuild. Phoenix technology maintains performance throughout the detection and repair process.
Panasas' ActiveScan feature continuously monitors data objects, RAID parity, disk media and the disk drive attributes. When it detects a potential problem with HDD blocks, the data is moved to spare blocks on the same disk. Future hard disk drive failure is predicted through the use of HDD SMART attribute statistical analysis, permitting action to be taken that protects data before a failure occurs. When a hard disk drive failure is predicted, user-set policies facilitate preemptively migrating the data to other HDDs. This eliminates or mitigates the need for reconstruction.
Xiotech's Emprise 5000, or ISE, is architected to proactively and reactively provide autonomic self-healing storage. ISE preventive and remedial component repair takes place within its sealed DataPacs (storage capacity modules). It never requires manual intervention to pull failed drives. ISE provides in-place automatic data migration (when required), power cycling, factory remanufacturing and component re-calibration; only the surfaces of affected heads with allocated space, as opposed to entire disk drives, are rebuilt in very fast parallel processes. The result is the equivalent of a factory-remanufactured HDD, and the only components ever taken out of service are those that are beyond repair. Everything else is restored to full activity and performance.
Does autonomic self-healing work?
Based on interviews with users and on vendors' historical service data, autonomic self-healing works. The numbers show a decrease in RAID data rebuilds and service calls by as much as 30% to 50%. For Atrato and Xiotech, there are never any HDD replacement service calls because of their fail-in-place technologies.
Fail-in-place is a fairly new concept aimed at resolving some prickly side effects of hot-plug or hot-swap HDDs in storage systems. Examples of these side effects include pulling the wrong drive and causing inadvertent data loss; delaying the replacement of a failed HDD, which defers rebuild starts and increases data loss risk; or using spare drives that may not have been recently tested, which may result in a second hard disk drive failure.
The basic concept of fail-in-place is to redefine and increase the smallest field-replaceable unit (FRU) from being a HDD to being a storage pack. A storage pack is a collection of hard disk drives operating in concert with a certain percentage of capacity allocated for sparing. HDD failures are automatically rebuilt from the allocated capacity. There are currently only two vendors supplying fail-in-place storage systems: Atrato (with its V1000) and Xiotech (with the Emprise 5000 or ISE). Both systems feature end-to-end error detection and correction, as well as autonomic self-healing.
Both vendors' product architectures are based on the concept of available user capacity being tightly coupled with enclosure lifecycle within a single FRU. An enclosure's lifecycle is the timeframe in which the enclosed raw capacity will be available to an application. The total enclosure capacity also includes an allowance for anticipated sparing requirements over the warranted capacity life of the enclosure (three years for Atrato and five years for Xiotech).
The differences between the two implementations are reflective of each vendor's product philosophies. Atrato makes 2.5-inch SATA drives enterprise-capable with their ORT, end-to-end error correction and detection, autonomic self-healing, high densities per enclosure, and with clever vibration and cooling methods. They improve performance by combining 160 drives for up to 80 TB in a 3U enclosure that provides up to 12,500 IOPS and 1.5 GBps throughput from a single enclosure.
Xiotech's focus is on providing increased reliability and performance from enterprise Fibre Channel and SAS 3.5-inch and 2.5-inch drives. The baseline FRU is a sealed DataPac of 10 3.5-inch or 20 2.5-inch Fibre Channel or SAS HDDs for up to 16 TB in 3U. Each ISE has dual removable DataPacs, power supplies with cooling, 96-hour battery backup and active-active RAID controllers. Unlike standard storage subsystems, ISE DataPacs feature innovations such as sophisticated vibration reduction and improved cooling; Xiotech exploits the internal structure of all of the components to fully leverage very advanced drive and system telemetry. DataPac drives feature special firmware that relieves the burden of device compatibility required of all other storage subsystems. The result of the tightly knit control within the DataPac is a highly reliable "super disk" that has demonstrated more than a 100-fold increase in reliability vs. a typical storage system drive bay (based on Xiotech's test of 208 ISEs containing 5,900 drives for 15 months with no service events).
Does fail-in-place work?
Atrato and Xiotech have proven that fail-in-place definitely works. Their product testing and customer testimonials indicate these technologies can virtually eliminate HDD replacement service calls. That translates to lower costs, less risk of lost data and fewer application disruptions.
Self-healing storage solves tangible operational problems in the data center. It reduces service events, costs, management, data loss risk and application disruptions. And most importantly, it works. Ten years from now, self-healing storage will be considered a minimum requirement just as RAID is today.
BIO: Marc Staimer is president of Dragon Slayer Consulting.