This article can also be found in the Premium Editorial Download "Storage magazine: Exploring systems that detect and repair hard disk problems automatically."
Autonomic self-healing systems
Among this new breed of systems, some tackle end-to-end error detection and correction, including silent data corruption. Other systems take the same approach, but add sophisticated algorithms that attempt to "heal-in-place" failed HDDs before requiring a RAID data rebuild. A final group of systems matches those capabilities and ups the ante with the new concept of "fail-in-place" so that in the rare circumstance when an HDD truly fails (i.e., it's no longer usable), no service event is required to replace the hard disk drive for a RAID data rebuild.
End-to-end error detection and correction
Vendors and products offering end-to-end error detection and correction include DataDirect Networks Inc.'s Silicon Storage Architecture (S2A) with its QoS and SATAssure; EMC Corp.'s Symmetrix DMX-4 with its Double Checksum; NEC's D-Series, which supports the American National Standards Institute's new T10 DIF (Data Integrity Field) standard for enterprise Fibre Channel or SAS HDDs, and its proprietary Extended Data Integrity Feature (EDIF) for SATA hard disk drives; Panasas Inc.'s ActiveStor with Vertical Parity for SATA HDDs; Sun Microsystems Inc.'s Zettabyte File System (ZFS)-based systems when volumes are mirrored; and Xiotech Corp.'s Emprise 5000 (aka Intelligent Storage Element), which is also based on the T10 DIF standard (see "Self-healing storage products," below).
Click here to view the "Self-healing storage products" PDF.
T10 DIF is a relatively new standard and applies only to SCSI-protocol HDDs (SAS and Fibre Channel) (see "Inside ANSI's T10 DIF spec," below). The T10 DIF standard is being incorporated into quite a few storage systems scheduled for release in 2009 and 2010. However, there's no standard spec for end-to-end error detection and correction for SATA hard disk drives at this time. That's why DataDirect Networks, EMC and NEC devised their own SATA end-to-end error detection and correction methodologies.
Inside ANSI's T10 DIF spec
The American National Standards Institute's (ANSI) T10 DIF (Data Integrity Field) specification calls for data to be written in blocks of 520 bytes instead of the current industry-standard 512 bytes. The eight additional bytes, or "DIF," provide a super-checksum that's stored on disk with the data. The DIF is checked on every read and/or write of every sector. This makes it possible to detect and identify data corruption or errors, including misdirected, lost or torn writes. ANSI T10 DIF provides three types of data protection:
- Logical block guard, for verifying the data actually written to disk
- Logical block application tag, to ensure writing to the correct logical unit (virtual LUN)
- Logical block reference tag, to ensure writing to the correct virtual block
When errors are detected, they can then be fixed by the storage system's standard correction mechanisms.
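To make the three DIF checks (guard, application tag and reference tag) more concrete, here is a minimal Python sketch of how an 8-byte field might be appended to a 512-byte sector and checked on read. The field widths (2-byte guard, 2-byte application tag, 4-byte reference tag), the CRC polynomial 0x8BB7 and the use of the low 32 bits of the block address as the reference tag reflect a common reading of the T10 specification rather than anything stated in this article, and the function names `protect` and `verify` are hypothetical.

```python
import struct
from typing import List

SECTOR_SIZE = 512          # data bytes per sector
T10_DIF_POLY = 0x8BB7      # assumed guard CRC polynomial (CRC-16/T10-DIF)


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with initial value 0 and no bit reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def protect(data: bytes, lba: int, app_tag: int) -> bytes:
    """Return the 520-byte block: 512 data bytes plus the 8-byte DIF."""
    if len(data) != SECTOR_SIZE:
        raise ValueError("expected exactly one 512-byte sector")
    guard = crc16_t10dif(data)
    ref_tag = lba & 0xFFFFFFFF  # assumption: low 32 bits of the block address
    return data + struct.pack(">HHI", guard, app_tag, ref_tag)


def verify(block: bytes, expected_lba: int, expected_app_tag: int) -> List[str]:
    """Return the names of the DIF checks that failed (empty list if none)."""
    data, dif = block[:SECTOR_SIZE], block[SECTOR_SIZE:]
    guard, app_tag, ref_tag = struct.unpack(">HHI", dif)
    errors = []
    if guard != crc16_t10dif(data):
        errors.append("guard")            # data changed after it was protected
    if app_tag != expected_app_tag:
        errors.append("application tag")  # block belongs to another logical unit
    if ref_tag != (expected_lba & 0xFFFFFFFF):
        errors.append("reference tag")    # block landed at the wrong address
    return errors
```

In this sketch a flipped data bit fails the guard check, while a block read back from the wrong address fails the reference tag check, which is how a misdirected write would show up.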
DataDirect Networks' S2A SATAssure software performs a Reed-Solomon error-correction calculation on every read operation and then compares HDD data to parity to ensure data consistency. If SATAssure detects an inconsistency, it repairs the data, passes it back to the requesting application and rewrites it to the HDD. All of this happens in real time. EMC's DMX-4 uses a double checksum that's very similar to Oracle Corp.'s industry-proven double checksum, which minimizes database corruption.
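The check-on-read idea can be illustrated with a small Python sketch. It is not DataDirect's implementation, whose details aren't public in this article; it swaps in the familiar RAID-6-style P+Q Reed-Solomon code over GF(2^8), which is enough to locate and repair one silently corrupted data block in a stripe. The helper names `compute_parity` and `read_and_repair` are hypothetical.

```python
from typing import List, Optional, Tuple

# Log/antilog tables for GF(2^8), polynomial 0x11D, generator 2.
GF_EXP = [0] * 510
GF_LOG = [0] * 256
_x = 1
for _i in range(255):
    GF_EXP[_i] = _x
    GF_LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    GF_EXP[_i] = GF_EXP[_i - 255]


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8)."""
    if a == 0 or b == 0:
        return 0
    return GF_EXP[GF_LOG[a] + GF_LOG[b]]


def compute_parity(blocks: List[bytearray]) -> Tuple[bytes, bytes]:
    """P is the XOR of all data blocks; Q weights block i by 2**i in GF(2^8)."""
    size = len(blocks[0])
    p, q = bytearray(size), bytearray(size)
    for index, block in enumerate(blocks):
        coeff = GF_EXP[index]
        for pos, byte in enumerate(block):
            p[pos] ^= byte
            q[pos] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)


def read_and_repair(blocks: List[bytearray], p: bytes, q: bytes) -> Optional[int]:
    """Compare data to stored parity; repair one bad data block in place.

    Returns the index of the repaired block, or None if the stripe is consistent.
    """
    new_p, new_q = compute_parity(blocks)
    bad = None
    fixes = []
    for pos in range(len(p)):
        sp = p[pos] ^ new_p[pos]   # error value, if a data block is wrong
        sq = q[pos] ^ new_q[pos]   # error value scaled by 2**(bad index)
        if sp == 0 and sq == 0:
            continue
        if sp == 0 or sq == 0:
            raise ValueError("a parity block itself is inconsistent")
        index = (GF_LOG[sq] - GF_LOG[sp]) % 255
        if index >= len(blocks) or (bad is not None and index != bad):
            raise ValueError("more than one block appears corrupted")
        bad = index
        fixes.append((pos, sp))
    for pos, error in fixes:
        blocks[bad][pos] ^= error  # rewrite the corrected bytes
    return bad
```

A real array would then return the repaired data to the host and write it back to the drive; the sketch stops at correcting the in-memory copy.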
NEC's D-Series EDIF is modeled on ANSI T10 DIF. The difference is that EDIF is specifically modified for SATA's Integrated Drive Electronics (IDE) protocol.
Panasas' Vertical Parity is designed to maintain individual hard disk drive reliability. Vertical Parity isolates and repairs (using redundant information in the horizontal RAID stripe) torn, lost or misdirected writes on SATA HDDs at the disk level before they're seen by the RAID array.
Sun's ZFS is now used in several unified storage systems (Sun's 4500 and 7000 Series, and the new OnStor Inc. Pantera LS 2100). ZFS utilizes its own end-to-end error-detection algorithms to sniff out silent data corruption. It requires mirrored volumes and corrects the detected silent data corruption by copying the uncorrupted data from the good volume.
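The mirrored self-healing behavior described here can be sketched in a few lines of Python. This is an illustration of the general pattern (a checksum kept apart from the data, verified on every read, with repair from the good copy), not ZFS code; the `MirroredStore` class and its use of SHA-256 are assumptions made for the example.

```python
import hashlib
from typing import Dict, List


class MirroredStore:
    """Toy two-way mirror that verifies a checksum on every read."""

    def __init__(self, copies: int = 2) -> None:
        self.mirrors: List[Dict[int, bytes]] = [dict() for _ in range(copies)]
        # Checksums live apart from the data they cover, so a bad block
        # cannot vouch for itself.
        self.checksums: Dict[int, bytes] = {}

    def write(self, block_id: int, data: bytes) -> None:
        self.checksums[block_id] = hashlib.sha256(data).digest()
        for mirror in self.mirrors:
            mirror[block_id] = data

    def read(self, block_id: int) -> bytes:
        expected = self.checksums[block_id]
        bad_copies = []
        for index, mirror in enumerate(self.mirrors):
            data = mirror[block_id]
            if hashlib.sha256(data).digest() == expected:
                # Heal any copies that failed verification before this one.
                for bad in bad_copies:
                    self.mirrors[bad][block_id] = data
                return data
            bad_copies.append(index)
        raise IOError("every copy of the block failed its checksum")
```

If the first copy has been silently corrupted, `read` returns the data from the second copy and overwrites the bad one, so the next read finds both copies intact.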
This was first published in June 2009