
SATA takes on a life of its own

Companies tend to focus on the positive aspects of using SATA disk drives for a growing portion of their enterprise storage needs, but as some companies are finding out, managing thousands or tens of thousands of SATA disk drives can take on a life of its own.

Recently, I spoke to Lawrence Livermore National Laboratory (LLNL), which is a huge DataDirect Networks user. By huge, I mean it runs multiple DataDirect Networks storage systems, with the total number of SATA disk drives in production numbering in the tens of thousands, possibly up to a hundred thousand. More impressive, LLNL uses these storage systems in conjunction with some of the world's fastest supercomputers, including BlueGene/L, currently rated #1 among the world's fastest computers.

The issue that crops up when companies own tens of thousands of disk drives — SATA or FC — is the growing task of managing failed disk drives. Companies such as Nexsan Technologies report failure rates of less than half of 1% of all SATA disk drives they have deployed in the field. Those numbers sound impressive until one encounters environments like LLNL that may have tens of thousands of SATA disk drives. Applied to a fleet that size, a 0.5% annual failure rate means a SATA disk drive can statistically be expected to fail about every other day, which is in line with LLNL's experience.
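The arithmetic behind that "every other day" figure is easy to check. A minimal sketch, assuming independent failures and a uniform 0.5% annual failure rate (the fleet size of 40,000 is an illustrative figure within the "tens of thousands" range discussed above):

```python
# Back-of-the-envelope check of the failure-interval claim.
# Assumes independent failures and a uniform 0.5% annual failure rate.

def days_between_failures(num_drives, annual_failure_rate=0.005):
    """Expected days between individual drive failures across a fleet."""
    failures_per_year = num_drives * annual_failure_rate
    return 365 / failures_per_year

# A fleet of 40,000 drives at a 0.5% annual rate sees a failure
# roughly every other day.
print(round(days_between_failures(40_000), 1))  # -> 1.8
```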

This is in no way intended to reflect negatively on DataDirect Networks. If users were to deploy a similar number of disk drives from any other SATA storage system provider, be it Excel Meridian, Nexsan Technologies or Winchester Systems, they could expect similar SATA disk drive failure rates.

The cautionary note for users here is twofold. First, be sure your disk management practices keep up with your growth in disk drives. Replacing a disk drive may not sound like a big deal, but consider what is involved with a disk drive replacement:

  • Discovering the disk drive failure
  • Contacting and scheduling time for the vendor to replace the disk drive
  • Monitoring the rebuild of the spare disk drive
  • Determining if there is application impact during the disk drive rebuild
  • Physically changing out the disk drive

Assuming a 0.5% annual failure rate, companies with hundreds of disk drives will repeat this process about once a year, those with thousands of disk drives about once a quarter, and those with tens of thousands about once a week. Once a company crosses the 10,000-drive threshold, it needs to seriously consider dedicating a person at least part-time just to monitor and manage disk drive replacements, regardless of which vendor's storage system it selects.
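The replacement cadences above follow directly from multiplying fleet size by the failure rate; the specific fleet sizes below are illustrative picks within each tier:

```python
# Expected drive replacements per year at a 0.5% annual failure rate.

def replacements_per_year(num_drives, annual_failure_rate=0.005):
    return num_drives * annual_failure_rate

for fleet in (200, 800, 10_000):
    print(fleet, replacements_per_year(fleet))
# 200 drives    -> 1 per year
# 800 drives    -> 4 per year (about quarterly)
# 10,000 drives -> 50 per year (about weekly)
```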

The other cautionary note is that the more disk drives one deploys, the more likely it becomes that two or even three disk drives in the same RAID group will fail before the rebuild of a previously failed disk drive is complete. Companies, now more than ever, need to ensure they are using RAID-6 for their SATA disk drive array groups and, when crossing the 10,000 disk drive threshold, should consider the new generation of SATA storage systems from companies such as DataDirect Networks and NEC. These systems give companies more data protection and recovery options for their SATA disk drives.
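The RAID-6 recommendation can be made concrete with a simple probability sketch. The 8-drive group size, 24-hour rebuild window, and per-drive failure rate here are illustrative assumptions, not figures from LLNL:

```python
from math import comb

def prob_extra_failures(k, remaining, window_hours, afr=0.005):
    """Probability that at least k of the remaining drives in a RAID
    group fail during the rebuild window, assuming independent failures."""
    p = afr * window_hours / (365 * 24)  # per-drive failure prob in the window
    return sum(comb(remaining, i) * p**i * (1 - p)**(remaining - i)
               for i in range(k, remaining + 1))

# Hypothetical 8-drive group: after one failure, 7 drives remain and the
# rebuild is assumed to take 24 hours.  Single-parity RAID loses data if
# one more drive fails; RAID-6 loses data only if two more fail.
raid5_risk = prob_extra_failures(1, 7, 24)
raid6_risk = prob_extra_failures(2, 7, 24)
print(f"single parity: {raid5_risk:.2e}  dual parity: {raid6_risk:.2e}")
```

Under these assumptions the dual-parity risk per incident is several orders of magnitude smaller, which is the entire case for RAID-6 at scale: the per-group risk is multiplied by thousands of rebuild events per year.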

Join the conversation

I agree with Jerome's cautionary notes. While the number of extremely large SATA deployments is relatively small, medium deployments are just as vulnerable to loss as disk sizes increase. Medium deployments of SATA or FC should consider the mean time between failures for their media, early failure detection criteria and, especially, the rebuild time for large disks. The new 750 GB and 1 TB SATA drives take a long time to rebuild. If you configure RAID sets with a large number of drives (>5) per RAID group, the probability of dual drive failures increases due to the long rebuild time for these drives. The mean time to failure is still small, but if all of the disks are new and started duty cycles at the same time, the peak failures should be predictable and extra caution should be exercised.
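The rebuild-time concern in this comment is easy to quantify. A rough sketch, where the 30 MB/s sustained rebuild rate (a drive rebuilding while still serving application I/O) is an assumed figure:

```python
# Rough rebuild-time estimate for large drives.  The sustained rebuild
# rate is an assumption; real rates vary with load and controller.

def rebuild_hours(capacity_gb, rebuild_mb_per_s):
    """Hours to sequentially rewrite a drive at a given rate
    (capacity in decimal GB, i.e. 1 GB = 1000 MB)."""
    return capacity_gb * 1000 / rebuild_mb_per_s / 3600

print(round(rebuild_hours(750, 30), 1))   # 750 GB drive -> ~6.9 hours
print(round(rebuild_hours(1000, 30), 1))  # 1 TB drive   -> ~9.3 hours
```

At slower effective rates the rebuild stretches toward a full day or more, which is exactly the window in which a second failure in the same group becomes a concern.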
Doesn't LLNL require shredding of drives? Even when a drive goes bad, it could still have recoverable data on it. In a secure site like LLNL, procedures should call for proper disposal of replaced drives, adding even more cost to a replacement.
This user's cautionary tale is nothing more than simple and obvious math, and has nothing to do with SATA as a technology. As drives get larger, the number required to provide a given level of storage is reduced, and redundancy is the normal procedure for any large operation. The drive issues cited are normal with any technology, and have been with us for a very long time. I manage a room of machines that have been running for many years, and the modern machines as well as the early ones all run RAID, regardless of the technology used in the drive array. For performance we use RAID-10, which can frequently see multiple drive failures. We use a quality drive controller for many systems (3ware) and software RAID for many others (typically SCSI-based systems). I don't see many multiple failures, but there have been a few. In one case the machine was simply a test machine, and the programmers wanted it to fail, so they ignored its dead drive for two years... And some of our large RAID-10 systems would see multiple failures when a critical fan would fail... In all of our RAID-10 arrays we have been lucky in never losing two drives from a single mirrored pair. But the point is that drives fail... they always have, and they probably always will. Sounding an alarm because someone may have a lot of drives spinning is silly. Anyone with that many drives will have a suitably sized staff keeping the systems running. I have seen rooms full of spinning hard drives for years. When I was a teen, I remember a tour of a university lab with a whole room full of disk drives, feeding a system in the next room that took about 20 hours to fully boot (after several media changes)... And there was a staff of people there to fix it when things went bad. The only thing that has changed is drives are smaller, faster, and hold more data, with a single drive holding more data than 100 rooms did in the '70s. I just don't get what the big deal is...
Is this the first time this author has ever stopped to do the math on drive failures? I have 48 drives in the room I am in now (my home office) and several drives have failed over time with this equipment... SATA has brought with it better drives with more speed and fewer failures... yet they still fail, so I'm supposed to be scared? Geez...