The previous tip in this series covered the first two issues inherent in the planning for long-term data archiving: deciding what to archive and how to make an archive ingest the data efficiently.
In this tip, we will consider the remaining three issues: How will data be stored? How will data integrity be maintained? How will technology issues be addressed?
Deciding how data will be stored should actually be considered in two parts. One part has to do with the software container or wrapper that will be used. This will play an important role in mitigating the potential problem of finding a way to read an archived file after the software used to create it has been rendered obsolete. In a previous tip on long-term data archiving I wrote about the possibility of using commercial software containers, such as Adobe's Portable Document Format (PDF), and keeping source code in escrow as a hedge against future non-readability -- or alternatively, using XML or standardized object wrappers.
Attention needs to be paid to this step in order to avoid falling into the trap of having to "un-ingest" all of your data from the archive repeatedly so you can re-wrap it using newer versions of your preferred container. The archivists for a national museum experienced this problem a few years ago after they selected early Adobe PDF formats for their archive containers. To their dismay, Adobe changed the container a whopping 30 times in the first year.
The other part of the question of how archival data will be stored has to do with media technology. The choices these days seem to come down to tape, disk or cloud (which is usually either a tape- or disk-based repository remotely hosted and offered as a service).
The trends, according to an Enterprise Strategy Group survey last year, appear to favor the use of disk arrays as the preferred archive repository technology. However, cloud vendors have also begun to enter the archive space, offering archival storage as a service with a compelling cost model.
A disk-based archive has the appeal of being comparatively simple to implement, though this is by no means certain. Newer technologies such as high-capacity SATA disks, on-array features like deduplication and compression, and scale-out (aka clustering) architectures combine to provide a more economical disk solution compared with past archive-to-disk offerings. That said, a disk-based archive still suffers from a cost model that needs to be considered closely. Disk must be powered continuously (even with the low-power modes available on some drives) at 7 W to 21 W per disk. Recent studies from INSIC (covering the total cost of ownership for a five-year, 100 TB disk archive) and from Clipper Group (covering the TCO for a disk-based archive of the same size for a 12-year period) showed that the cost of energy alone was greater than the TCO for a comparably sized archive leveraging tape technology. Aside from the costs of platform acquisition, operations, energy and maintenance, problems with disk-based archiving are generally linked to another factor: disk media vulnerability. Disk is highly vulnerable to failure from a variety of causes, and it fails at rates substantially higher than the failure stats advertised by vendors.
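The continuous power draw of 7 W to 21 W per disk can be turned into a rough annual energy figure. The following is a minimal sketch, not a TCO model: the disk count and the electricity price are hypothetical inputs chosen for illustration, the function names are my own, and cooling, controllers and other overhead are left out.

```python
HOURS_PER_YEAR = 24 * 365  # disks in an archive array are powered continuously


def annual_energy_kwh(disk_count: int, watts_per_disk: float) -> float:
    """Energy drawn in one year by always-on disks, in kilowatt-hours."""
    return disk_count * watts_per_disk * HOURS_PER_YEAR / 1000.0


def annual_energy_cost(disk_count: int, watts_per_disk: float,
                       price_per_kwh: float) -> float:
    """Yearly energy bill for the disks alone (no cooling or controllers)."""
    return annual_energy_kwh(disk_count, watts_per_disk) * price_per_kwh


# Hypothetical inputs: 100 disks and 0.10 currency units per kWh.
DISKS = 100
PRICE_PER_KWH = 0.10

for watts in (7.0, 21.0):  # the per-disk range quoted in the text
    kwh = annual_energy_kwh(DISKS, watts)
    cost = annual_energy_cost(DISKS, watts, PRICE_PER_KWH)
    print(f"{watts:>4.0f} W per disk: {kwh:,.0f} kWh per year, cost {cost:,.2f}")
```

Substituting your own drive count and utility rate gives a first approximation of the energy line item that the INSIC and Clipper Group studies examined in far more detail.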
Another growing concern has to do with bit error rates in disk. Undetected bit errors, sometimes called silent corruption, occur, by some estimates, once in about every 67 TB of disk.
A bit error might occur on an unused portion of the disk where it has no consequence, or it may affect a single file, rendering it unreadable. In the worst-case scenario, the bit error could occur in a RAID stripe or on a parity disk and could render all the data in the RAID set unusable.
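To get a feel for what one silent error per 67 TB implies for an archive of a given size, the estimate can be turned into an expected count. This is only an illustrative sketch: it reads the estimate as one error per 67 TB of data held, the function names are hypothetical, and treating errors as independent events (a Poisson model) is my assumption rather than part of the estimate.

```python
import math

TB_PER_SILENT_ERROR = 67.0  # estimate quoted in the text


def expected_silent_errors(terabytes: float,
                           tb_per_error: float = TB_PER_SILENT_ERROR) -> float:
    """Average number of undetected bit errors for the given amount of data."""
    return terabytes / tb_per_error


def chance_of_at_least_one(terabytes: float,
                           tb_per_error: float = TB_PER_SILENT_ERROR) -> float:
    """Probability of one or more errors, ASSUMING independent (Poisson) errors."""
    return 1.0 - math.exp(-expected_silent_errors(terabytes, tb_per_error))


for size_tb in (10, 100, 1000):
    print(f"{size_tb:>5} TB: "
          f"{expected_silent_errors(size_tb):.2f} expected errors, "
          f"{chance_of_at_least_one(size_tb):.0%} chance of at least one")
```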
Tape is a bit sturdier from the standpoint of undetectable bit errors. Contemporary tape media sports a bit error rate of between one in 10^17 (about one error in every 12.5 petabytes of media capacity) and, with read/verify passes used after writing media, one in 10^27 -- which is to say, infinitesimal.
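The 12.5 petabyte figure follows directly from the bit error rate if the rate is read as one error per 10^17 bits. A short sketch of the conversion, using decimal petabytes (10^15 bytes) and a hypothetical function name:

```python
BITS_PER_BYTE = 8
BYTES_PER_PETABYTE = 10 ** 15  # decimal petabyte


def petabytes_per_error(exponent: int) -> float:
    """Media capacity, in PB, that holds one expected bit error at 1 in 10**exponent."""
    bits_per_error = 10 ** exponent
    return bits_per_error / BITS_PER_BYTE / BYTES_PER_PETABYTE


# One error in 10^17 bits, as quoted for contemporary tape media.
print(f"{petabytes_per_error(17)} PB per error")
```

The same function applied to an exponent of 27 gives a capacity ten billion times larger, which is why that rate is described as infinitesimal.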
Finally, from an investment perspective, tape trumps disk in a couple of ways. First, tape technology changes about every seven years, and every generation of tape is designed explicitly to be read/write compatible with the previous generation and read compatible with the generation before that. Disk arrays, by contrast, guarantee no backwards compatibility, and arrays are generally declared "end-of-life" by their manufacturers within 17 months of delivery to market.
Cloud-based archive services, which may be either disk- or tape-based, have the appeal of providing archive space seemingly on the cheap. "One penny per [gigabyte] of data stored," is a line from the brochure of one vendor. However, a closer examination reveals that the service costs one cent per gigabyte, per month, and costs more if more than 5% of the archived data is retrieved by the subscriber in a given year.
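The difference between the headline price and the real price is easy to see with a little arithmetic. The sketch below assumes decimal gigabytes and uses hypothetical function names. Because the retrieval surcharge is not stated, the code only flags when the 5% threshold is crossed and does not try to price it.

```python
def annual_storage_cost(gigabytes: float,
                        price_per_gb_month: float = 0.01) -> float:
    """Yearly storage fee at a per-gigabyte, per-month rate."""
    return gigabytes * price_per_gb_month * 12


def exceeds_retrieval_allowance(retrieved_gb: float, stored_gb: float,
                                allowance: float = 0.05) -> bool:
    """True when yearly retrievals pass the share that is included in the price."""
    return retrieved_gb > stored_gb * allowance


STORED_GB = 100_000  # a 100 TB archive, expressed in decimal gigabytes

print(f"Storage fee per year: {annual_storage_cost(STORED_GB):,.2f}")
print("Surcharge applies after retrieving 8 TB:",
      exceeds_retrieval_allowance(8_000, STORED_GB))
```

A "penny per gigabyte" for a 100 TB archive therefore works out to a recurring charge in the thousands per year before any retrieval fees.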
Archive integrity, changing technology
So, storage media choice matters as much as software container choice when a platform is being designed for long-term data archiving. Wrapped up in this analysis of media options are two additional issues: assuring archive integrity and selecting methods to cope with technology change. In fact, it is not uncommon for vendors to emphasize media durability to answer both issues, but the issue is not that clear-cut.
Media durability -- whether defined as the length of time that media will retain the electromagnetic states of recorded bits, or as the service life of a media or drive component or of an entire array or library -- is beside the point. Tape, properly maintained, has a shelf life of about 30 years, while disk has a life expectancy, according to manufacturers, of about five years. This doesn't necessarily mean that you will refresh your tapes only every 30 years or your disks every five years.
The truth is that most companies using tape-based archives will migrate data to newer tape about every two generations (or 14 years), and more frequently when optimizing the media used by the archive. Disk users like to keep arrays in service for five to seven years, though renewing three-year warranty and maintenance agreements tends to cost as much as an entirely new array. Moreover, there is no certainty that a disk-based archive can be readily migrated to fresh hardware, even when the new hardware is from the same vendor as the existing hardware. Cross-platform migrations can be daunting in the extreme, as many folks could testify who have tried to migrate data from an EMC Centera platform or from an Isilon Networks (EMC) rig to, say, a NetApp Filer.
Bottom line: You guarantee integrity in an archive by periodically testing the files or objects stored, and repairing or replacing data from a backup copy when errors are detected. Replacing media (and other parts) can usually keep an array up and running for five years or so, while a tape library can stay in service for a much longer period of time and provides guaranteed media exchangeability between two generations of drive equipment.
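Periodic testing of stored files is commonly done by recording a checksum for each file at ingest and recomputing it later. The following is a minimal sketch of that idea, not a description of any particular archive product: the function names, the choice of SHA-256 and the directory layout (an archive tree plus a backup tree with the same relative paths) are all assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archive objects never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every file under root, keyed by relative path."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_damaged(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return relative paths that are missing or no longer match the manifest."""
    damaged = []
    for rel, expected in manifest.items():
        path = root / rel
        if not path.is_file() or sha256_of(path) != expected:
            damaged.append(rel)
    return damaged


def restore_from_backup(root: Path, backup_root: Path, damaged: list[str]) -> None:
    """Replace each damaged file with the copy held in the backup tree."""
    for rel in damaged:
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(backup_root / rel, target)
```

In use, `build_manifest` would run once at ingest, `find_damaged` on a schedule, and `restore_from_backup` only for the paths that fail. A real deployment would also want to verify the backup copy against the manifest before trusting it.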