Tracking regulatory requirements for data retention is tough enough, but restoring data from long-term archives is likely to be even harder.
Some legislative requirements mandate that data be kept for as long as 70 years--that's the easy part. Having to restore a 70-year-old file is the hard part.
Twenty years ago, mainframe backups were done on nine-track, half-inch tape, while PC backups were likely put on 5.25-inch floppies or audio-cassette tapes. Trying to read any of those media today would be difficult, even assuming the media hadn't deteriorated past readability.
Besides obvious issues like the device and removable media used for archiving, the data format used by the backup program and the format of the data itself, there are additional issues such as the usable lifespan of the media, the necessary keys for any encrypted data and--a really big one--the ability to find the name of the data file you want. If that's not enough to discourage you, you may also have to discover where the archive is stored and what specific piece of media actually holds the file.
Text documents formatted using XML, OpenDocument or RTF standard formats should be readable in the long term, and even Microsoft Word and Excel documents will probably be readable for quite a long time. Microsoft Corp. recently announced it will support the OpenDocument Format (ODF) standard via a translator that will convert documents from the proprietary Open XML format that Microsoft prefers.
Text documents are relatively easy to read, even if you have to do some conversion or drop out formatting. Graphics, on the other hand, are less simple. Even open standards like TIFF and JPEG have many varieties, not all of which can be read by any given program. Unfortunately, open graphics-format standards like Portable Network Graphics (PNG) aren't yet supported by many commercial applications. Adobe Systems Inc.'s PDF/A is supposed to address this issue, but given that many older PDFs can crash Adobe Reader 7.0, this is not yet a perfect solution.
Time is not on your side
According to Stephanie Balaouras, senior analyst in the computing systems research group at Forrester Research, Cambridge, MA, the first issue in creating an archiving plan is to define what data needs to be archived and why, whether it's for legal discovery, regulatory compliance or business requirements. Because discovery, compliance and business requirements may have extremely different or even conflicting criteria for data retention, the issue can become very complex. To manage the archiving process, large organizations are beginning to create specialized archiving positions with titles such as archiving officer or digital-preservation officer.
Long-term archiving can be very expensive. In addition to data, you must build what amounts to a museum of old tape drives, associated parts and software to restore the old tapes. Of course, you don't necessarily have to hoard old archival equipment; there are service companies that specialize in data recovery, but it's an expensive proposition and there's no guarantee they'll actually have the equipment your media requires. The process becomes even more complex when there's a requirement to keep data in a read-only format for its required retention period.
Databases are a special case for archiving. Unlike relatively small flat files, databases can be very large (up to many terabytes), change often and, in some cases, should be archived in a way that documents every change to each record over time. Additionally, restoring a single record to previous states usually depends on a proprietary scheme unique to each database vendor. Because enterprises often have multiple databases on different software platforms, it's very difficult to develop a unified retention strategy.
Overlapping retention requirements are another large issue. Even within a single archiving application that's used to create retention policies, the process of retaining read-only copies of data for a period and then purging the data when it expires is complicated by the need to satisfy retention policies for different standards. Consider, for example, information on a single customer that's maintained across several files and in two databases. Three different data-retention policies require different pieces of the files and database records to be retained for a given time and then purged. One of the standards requires data to be retained for seven years and then purged, another has a 12-year period and the third requires retaining the data for 70 years. How do you keep track of which regulation has precedence, and which pieces of each file or database record apply to each standard? (See "Long-term formats," this page.)
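One way to picture the bookkeeping is the minimal Python sketch below. It assumes the simplest possible precedence rule: a piece of data can't be purged while any policy still requires it, so the longest applicable period governs. The policy labels and field names are hypothetical, and a real conflict between a mandatory purge date and a longer retention requirement would need legal review rather than a max() call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetentionPolicy:
    name: str          # hypothetical label, not a real regulation
    years: int         # how long covered data must be retained
    fields: frozenset  # pieces of the customer record the policy covers


def purge_year_by_field(policies, created_year):
    """Map each covered field to the earliest year it may be purged.

    Assumption: when policies overlap, the longest retention period wins.
    """
    purge = {}
    for policy in policies:
        expiry = created_year + policy.years
        for field_name in policy.fields:
            purge[field_name] = max(purge.get(field_name, expiry), expiry)
    return purge


# The 7-, 12- and 70-year periods come from the example above;
# the field assignments are invented for illustration.
POLICIES = [
    RetentionPolicy("policy_a", 7, frozenset({"name", "address", "orders"})),
    RetentionPolicy("policy_b", 12, frozenset({"orders", "payments"})),
    RetentionPolicy("policy_c", 70, frozenset({"name", "account_history"})),
]

schedule = purge_year_by_field(POLICIES, created_year=2006)
for field_name in sorted(schedule):
    print(field_name, schedule[field_name])
```

Even in this toy case, a single customer record ends up with three different purge dates, which is why the tracking has to be done per field rather than per file.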
If you're responsible for data retention in your organization, you need to address the strategy before the tactical considerations. You'll also need to decide if you should attempt to create a complete plan for recovering data from archives, including media, tape drives, backup applications and data applications, or opt for an alternative approach, such as migrating archived data to SATA-based, second-tier disk storage. Given how rapidly per-gigabyte pricing is dropping, dedicated SATA storage may be less expensive than documenting and managing all of the parts of a tape-based archive solution.
For companies maintaining data on tape, there are several media issues to consider. For example, let's assume there's a formal archiving plan in place; tapes are tested to ensure they're readable and are then rotated offsite and stored in a secure facility. Years later, you receive a request for data that has long since been deleted from your servers; however, you have a catalog of archived data, so you know it's available on a tape that can be requested from the archive company.
When it arrives, you realize you're no longer using drives that support that tape format. You pay twice the original retail price for a drive that will read the tape, but discover it uses a SCSI interface four generations back. You look for a SCSI host bus adapter that will work with the drive, but the only one you can find is an ISA bus card, so you end up buying an old PC that has ISA bus slots. However, the latest version of Windows no longer supports the drivers for that card.
You find drivers that work with Windows 2000, and then track down an old copy of Windows 2000 and install it on the server. Luckily, you held onto the old versions of your backup application, so you can finally restore the files. But your sense of accomplishment is short-lived because the files are in a proprietary format created using a program you no longer have ... and the software company is no longer in business. With no other alternatives, you turn to a conversion service that transforms the data into something readable, only to discover that the formatting and fields don't match the other data you're trying to reconcile it with. You spend another few weeks reformatting the data to match.
Organizations like Data Recovery Services of Irving, TX, and ActionFront Data Recovery Labs Inc., a subsidiary of Seagate Technology, can recover data from old tapes or disks, figure out the backup formats and even translate the data into newer file formats. But "you have to look at the overall cost of recovery," notes Scott Selley, senior consultant at ActionFront. "It might be cheaper to re-key the data or use OCR to re-create it from paper."
Forrester Research's Balaouras says companies are increasingly addressing this problem by migrating archives from tape to inexpensive second- or third-tier disk storage. Although this addresses the issues of old tape drives and interfaces, there's still the issue of old data formats at the operating system and application level. Balaouras says data should be renewed on a regular basis to update data formats to those readable by current versions of applications.
Existing databases aren't architected for long-term data storage. Though some vendors have proprietary solutions, there are few ways to manage data retention across multiple databases over long periods of time. Databases can hold multiple types of data--transactional, hierarchical and relational--but standard archiving solutions are of little use because all of the files in a database are accessed constantly and data is spread throughout the entire database.
So far, according to Jack Olsen, chief technology officer at Neon Enterprise Software Inc., which provides enterprise data availability software and services, there's "no satisfactory solution" that provides transparent access to data stored in a universal file format; the data shouldn't need to be imported into the same database it started in to restore it. The archive has to be self-sufficient. Considering data may be retained for five to 10 years, the archive has to be independent of the original apps and database structural definitions.
[Sidebar: Dos and don'ts for creating a long-term archive]
Planning for the future
To ensure you'll have a reasonable chance to recover archived data 30 or more years from now, you must first identify the different requirements for data retention and then use those requirements to define policies (see "Dos and don'ts for creating a long-term archive," this page). Then decide what kind of data management application you'll need and start a test bed to validate rule sets and policies. You'll need to work out a way to pull old unstructured archives into the new management system, bringing all the old archives under the same system. Consider rewriting data formats and backup formats on a regular basis to avoid orphaned data that can no longer be read.
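The regular format refresh suggested above can be driven from a catalog. The following Python sketch is illustrative only: the catalog layout, the list of currently readable formats and the five-year refresh interval are all assumptions, not recommendations from any vendor or standard.

```python
from dataclasses import dataclass


@dataclass
class ArchiveEntry:
    path: str
    data_format: str
    last_refreshed: int  # year the file was last rewritten in a current format


# Both values are assumptions chosen for illustration.
READABLE_FORMATS = {"pdf/a", "xml", "rtf"}
REFRESH_INTERVAL_YEARS = 5


def needs_refresh(entry, current_year):
    """Flag entries that are overdue for rewriting or in an unsupported format."""
    stale = current_year - entry.last_refreshed >= REFRESH_INTERVAL_YEARS
    unreadable = entry.data_format.lower() not in READABLE_FORMATS
    return stale or unreadable


catalog = [
    ArchiveEntry("contracts/2001-004.rtf", "rtf", 2001),
    ArchiveEntry("reports/q3.xml", "xml", 2005),
    ArchiveEntry("ledger/old.dat", "legacy-binary", 2005),
]

at_risk = [e.path for e in catalog if needs_refresh(e, current_year=2006)]
print(at_risk)
```

Running a check like this on a schedule turns "orphaned data" from a surprise at restore time into a routine work queue.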
For the long term, data formats such as Adobe Systems Inc.'s PDF/A, Microsoft Corp.'s Office Open XML file format and XML-based standards like OpenDocument should ensure that data continues to be readable. Developers are beginning to embrace these long shelf-life data formats, so there's a good chance the applications your company uses can be updated or enhanced to add these capabilities.
On the storage format side, it's critical to create meta data describing what files are being stored and how they were created. According to Forrester Research's Balaouras, the Storage Networking Industry Association (SNIA) will be instrumental in creating both information lifecycle management (ILM) standards and the eXtensible Access Method (XAM) standard, which gives ILM applications a standard interface and meta data structure to communicate with object-based storage systems. Meta data stored with each object identifies the owner, the application that created the file, data format and so forth. The standard specifically addresses both long-term retention standards and data security.
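As a rough illustration of what meta data stored with each object might look like, here is a minimal Python sketch that builds a descriptive record and serializes it as JSON. It is not the XAM interface; the field names, the example application name and the SHA-256 checksum (added here so the object can later be checked for corruption) are all assumptions.

```python
import hashlib
import json


def build_metadata(payload, owner, application, data_format):
    """Describe an archived object: who owns it, what created it, what it is."""
    return {
        "owner": owner,
        "created_by_application": application,
        "data_format": data_format,
        "size_bytes": len(payload),
        # Assumption: a checksum is kept so readability can be verified later.
        "sha256": hashlib.sha256(payload).hexdigest(),
    }


def verify(payload, metadata):
    """Return True if the payload still matches the recorded checksum."""
    return hashlib.sha256(payload).hexdigest() == metadata["sha256"]


payload = b"quarterly report"
record = build_metadata(payload, "finance", "ExampleWriter 3.1", "rtf")

# Plain-text JSON keeps the description itself readable without special tools.
sidecar = json.dumps(record, indent=2, sort_keys=True)
print(sidecar)
```

Keeping the description in a plain, self-describing text format means the record that explains the file doesn't become an archiving problem of its own.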
Storage management products from companies like CA Inc. and IBM Corp./Tivoli can use the meta data associated with files to determine how long a piece of data is archived and what policies apply. This is increasingly important when administrators are faced with archiving millions of e-mails, as well as all of the other content created within an organization. There's no way a human can individually set policies for that much data.
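The idea of letting meta data select the policy can be sketched in a few lines of Python. This is not how any particular product works; the rule table, the meta data keys and the retention periods are hypothetical, and unmatched items are deliberately returned as None so they can be routed to a person.

```python
# Each rule pairs the meta data values it matches with a retention period.
# All rules here are invented for illustration.
RULES = [
    ({"data_format": "email"}, 7),
    ({"owner": "finance"}, 12),
]


def retention_years(metadata, rules):
    """Return the longest retention period among matching rules, or None."""
    matches = [
        years
        for criteria, years in rules
        if all(metadata.get(key) == value for key, value in criteria.items())
    ]
    return max(matches) if matches else None


finance_mail = {"owner": "finance", "data_format": "email"}
sales_mail = {"owner": "sales", "data_format": "email"}
sales_memo = {"owner": "sales", "data_format": "rtf"}

for item in (finance_mail, sales_mail, sales_memo):
    print(item, retention_years(item, RULES))
```

The rules are written once and applied to every object, which is the only approach that scales to millions of e-mails.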
There should also be policies in place for long-term storage of encryption keys. As more regulations designed to protect customer and proprietary data require some form of encryption, the practice will undoubtedly become more prevalent. While it's possible that data-recovery companies will be able to bypass current encryption standards 50 years from now, it might still be more expensive to re-create the data. Archiving the necessary keys as part of the overall archive process should prevent this problem.
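A simple audit can confirm that keys are actually being archived alongside the data. The Python sketch below assumes a hypothetical archive manifest in which each encrypted object records a key ID, plus a separate inventory of archived key IDs; it handles identifiers only, never key material.

```python
def missing_keys(manifest, archived_key_ids):
    """Return paths of encrypted objects whose key ID isn't in the key archive."""
    return sorted(
        entry["path"]
        for entry in manifest
        if entry.get("key_id") is not None
        and entry["key_id"] not in archived_key_ids
    )


# Hypothetical manifest: key_id is None for unencrypted objects.
manifest = [
    {"path": "hr/payroll-2005.db", "key_id": "key-0017"},
    {"path": "public/brochure.pdf", "key_id": None},
    {"path": "crm/customers.xml", "key_id": "key-0042"},
]
archived_key_ids = {"key-0017"}

unrecoverable = missing_keys(manifest, archived_key_ids)
print(unrecoverable)
```

Any path the audit reports is data that may be intact on the media and still unreadable, so the check is worth running each time the archive is written.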