Storage, heal thyself

Enterprises are keeping more data online for longer periods, and much of it ends up on high-capacity SATA disk drives that take a long time to rebuild when a drive fails. Self-healing storage systems, an emerging class of storage products, pick up where today's RAID configurations leave off, using heuristics, journaling, read parity correction and distributed data resiliency in conjunction with RAID 6 for higher levels of performance and data protection.

Arun Taneja, consulting analyst and founder of the Taneja Group, Hopkinton, MA, finds that SATA disk drives are at the core of most data protection strategies. The question companies need to ask themselves, says Taneja, is whether or not RAID alone is an adequate level of data protection when it comes to SATA. "RAID does not do anything to minimize the horrendous rebuild times of a single 1TB drive failure in an array group," he says.

Mark Seager, assistant department head for advanced technology at the Lawrence Livermore National Laboratory (LLNL) in Livermore, CA, says the individual SATA disk drives on his DataDirect Networks S2A storage system occasionally act quirky. "From time to time, individual SATA drives go catatonic from a half a second to a couple of seconds," says Seager.

DataDirect Networks uses heuristics and journaling to address this unresponsive condition on SATA drives. When a drive fails to respond, the S2A storage system flags the drive as suspect rather than immediately treating it as failed. It takes the drive offline, journals the writes destined for that drive, and attempts to reset and reinitialize the drive. Only if those attempts don't succeed does it mark the drive as failed.
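
The handling described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not DataDirect Networks' implementation: the `Drive` and `SuspectDriveHandler` classes, the `max_resets` limit and the three state names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Drive:
    """Hypothetical drive model: a block map plus a health flag."""
    blocks: dict = field(default_factory=dict)
    responsive: bool = True
    recovers_on_reset: bool = True  # assumption: simulates a transient hang

    def reset(self) -> bool:
        # Stand-in for a real reset/reinitialize sequence.
        self.responsive = self.recovers_on_reset
        return self.responsive


class SuspectDriveHandler:
    """Journals writes while a drive is unresponsive, then resets it."""

    def __init__(self, drive, max_resets=3):
        self.drive = drive
        self.max_resets = max_resets  # hypothetical retry limit
        self.journal = []             # writes held while the drive is offline
        self.state = "online"         # online | suspect | failed

    def write(self, block, data):
        if self.state == "online" and not self.drive.responsive:
            self.state = "suspect"    # take the drive offline, don't fail it yet
        if self.state == "online":
            self.drive.blocks[block] = data
        elif self.state == "suspect":
            self.journal.append((block, data))
        else:
            raise IOError("drive failed; rebuild required")

    def try_recover(self):
        if self.state != "suspect":
            return self.state
        for _ in range(self.max_resets):
            if self.drive.reset():
                # Drive came back: replay journaled writes and skip the rebuild.
                for block, data in self.journal:
                    self.drive.blocks[block] = data
                self.journal.clear()
                self.state = "online"
                return self.state
        self.state = "failed"         # only now would a rebuild start
        return self.state
```

A drive that hangs briefly would move from "online" to "suspect" on the next write, and a successful `try_recover()` would replay the journal and return it to service without a rebuild.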

"If the drive has truly failed, this process causes no harm and the S2A storage system starts the rebuild. If the drive still works, the S2A storage system saves the drive and users are spared a lengthy rebuild time," say Josh Goldstein, DataDirect Networks' VP of product marketing.

Enterprise-class SAS and Fibre Channel (FC) drives include error correction, such as parity checks on reads, in their firmware; SATA drives have no such feature. Self-healing storage systems compensate for this in different ways, depending on whether they are intended for production use or for archiving and backing up data.


LLNL uses multiple DataDirect Networks storage systems in production, attached to several SUSE Linux supercomputers. Data on all of the drives is accessed constantly; as it is read, the DataDirect Networks storage systems monitor the data coming from each SATA drive and perform parity corrections in real time. With thousands of SATA drives on these systems, errors occur almost daily.
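
To make the idea of correcting errors as data is read more concrete, here is a minimal Python sketch. It assumes a single XOR parity chunk per stripe plus a per-chunk CRC used to locate the bad chunk. Both are illustrative assumptions; the article does not describe how the S2A detects or repairs bad data internally, and the function names are hypothetical.

```python
import zlib
from functools import reduce


def xor_chunks(chunks):
    """Byte-wise XOR of equal-length chunks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))


def make_stripe(data_chunks):
    """Return (parity, checksums) for a stripe of equal-length data chunks."""
    parity = xor_chunks(data_chunks)
    checksums = [zlib.crc32(c) for c in data_chunks]
    return parity, checksums


def read_stripe(chunks, parity, checksums):
    """Verify each chunk on read and repair a single bad one from parity.

    Returns (chunks, repaired_index); repaired_index is None if all were clean.
    """
    bad = [i for i, c in enumerate(chunks) if zlib.crc32(c) != checksums[i]]
    if not bad:
        return list(chunks), None
    if len(bad) > 1:
        raise ValueError("single parity cannot repair more than one chunk")
    i = bad[0]
    others = [c for j, c in enumerate(chunks) if j != i]
    fixed = list(chunks)
    # XOR of the good chunks and the parity reproduces the missing chunk.
    fixed[i] = xor_chunks(others + [parity])
    return fixed, i
```

A system doing this on every read would also learn which drive returned the bad chunk, which is the kind of signal that could feed proactive drive replacement.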

"We try to be proactive in identifying bad drives, so if we suspect one is going bad, we find it better to replace a drive before it fails," says Seager.

NEC's Hydrastor targets backup and archive applications, relying on proprietary technology rather than RAID 5 or RAID 6. The Hydrastor distributes data across storage nodes based on the total number of potential failures from which an administrator wants the Hydrastor to recover. "Administrators can simulate up to 25% of the drives in the Hydrastor failing and still keep full data integrity," says Karen Dutch, NEC's general manager of advanced storage products.
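
The trade-off behind choosing a failure tolerance can be illustrated with some simple arithmetic. The sketch below assumes a generic scheme in which data is split into fragments and any subset of a fixed size is enough to reconstruct it. This is not necessarily how NEC's proprietary technology works, and the function name and fragment counts are hypothetical.

```python
def resiliency_profile(total_fragments, tolerated_failures):
    """Capacity cost of surviving a given number of fragment losses.

    Assumes a generic k-of-n scheme: n fragments are stored and any
    k = n - tolerated_failures of them can reconstruct the data.
    """
    if not 0 <= tolerated_failures < total_fragments:
        raise ValueError("tolerated_failures must be in [0, total_fragments)")
    data_fragments = total_fragments - tolerated_failures
    return {
        "data_fragments": data_fragments,
        # Fraction of fragments that can be lost without losing data.
        "failure_fraction": tolerated_failures / total_fragments,
        # Raw capacity consumed per unit of user data.
        "storage_overhead": total_fragments / data_fragments,
    }
```

Under these assumptions, `resiliency_profile(12, 3)` describes a layout that survives the loss of 3 of 12 fragments (25%) while consuming about 1.33 units of raw capacity per unit of data.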

As SATA disk drives become more integral to how companies store data, self-healing storage systems can help to keep data residing on SATA disk drives viable and accessible. Although a gap between FC and SATA drives still exists, "self-healing storage systems squeeze out FC-like performance and availability characteristics from SATA storage systems," says Taneja.

--Jerome M. Wendt
