An archiving deployment requires long-term retention management while avoiding vendor lock-in.
A data management paradox exists today that, if unresolved, portends serious consequences long into the future. We have more information available to us than ever before. If you fast forward 50 years or 100 years, how much of this information will be accessible and in usable form? Given current data management practices--in terms of both physical storage and logical data representation--it's questionable whether electronic information created and stored today will be usable even 10 years or 15 years from now.
This stands in stark contrast to all previous periods in human existence, when data was recorded in visual form with symbols ranging from cave drawings, cuneiform and hieroglyphics to modern alphabets. While all data representations involve some degree of decoding and interpretation, electronic information depends on it so heavily that the data can be rendered entirely meaningless without one or more layers of metadata, which are often completely dissociated from the body of the data. A string of ones and zeros can be interpreted to mean anything (or nothing), and requires an external interpreter to decode it, whether it represents a JPEG image, a Word doc or a database table.
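To make the point concrete, here is a minimal Python sketch that reads the same four bytes as text, as two different integers and as a floating-point number. The byte values and the function name `interpretations` are arbitrary choices made for illustration; nothing about them comes from a particular file format.

```python
import struct


def interpretations(raw: bytes) -> dict:
    """Decode the same 4 bytes under several different assumptions."""
    if len(raw) != 4:
        raise ValueError("this sketch expects exactly 4 bytes")
    return {
        # Read the bytes as ASCII characters.
        "ascii": raw.decode("ascii", errors="replace"),
        # Read them as an unsigned 32-bit integer, least significant byte first.
        "uint32_le": struct.unpack("<I", raw)[0],
        # Same integer type, most significant byte first.
        "uint32_be": struct.unpack(">I", raw)[0],
        # Read them as a 32-bit IEEE 754 float.
        "float32_le": struct.unpack("<f", raw)[0],
    }


if __name__ == "__main__":
    sample = bytes([0x42, 0x4D, 0x20, 0x31])  # arbitrary example bytes
    for label, value in interpretations(sample).items():
        print(f"{label:>10}: {value!r}")
```

Each result is equally "correct." Without metadata stating which interpretation the creator intended, a future reader has no way to choose among them.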
This doesn't even touch upon the other challenges of long-term data retention and preservation. Think back just 20 years and consider how you would provide legal assurance that a document originally stored on a 5.25-inch DOS floppy disk using WordStar on an Intel 286 PC is the "same" as one that has been repeatedly transferred and updated, and that now exists as a Microsoft Word 2007 file stored on a SAN and viewed on a Windows Vista laptop. And that's a relatively trivial example!
Beyond the concerns of historians and archivists, why should we care? By the time this problem surfaces within our organizations, many of us probably expect to have moved on to bigger and better things (such as retirement). And because nearly all organizations face some data-retention challenge, it's tempting to assume a wait-and-see attitude. However, the steps we take now will greatly affect the magnitude of the problem facing us (or our successors) in the future.
Just as IT managers focused on backup without requisite attention to recovery for many years, long-term archiving has traditionally been performed without sufficient consideration of retrievability. In the past, technology and cost constraints often made effective, long-term data retention prohibitive, leading to the loss of such critical electronic data as the NASA moon landing tapes. As times changed, it became more practical to retain data, first on tape and later on disk. In many organizations, the approach was to save everything, primarily via backup, resulting in thousands of tapes with no practical means of retrieving information. In this state, retained data is of little or no value, yet its very existence represents a potentially huge liability. That led to the next phase in the evolution of data archiving: indexing.
Typically beginning with email, firms are investing in hardware and software that enable them not only to save data but also to index it. But before selecting a technology, it's critical to understand the actual drivers and proposed use for long-term retention, and to cultivate a strong awareness of the limitations inherent in the current generation of solutions. With the rush to address legal and regulatory concerns, some organizations have bypassed business-driver, policy and cost-risk analyses, and moved directly to technology selection. They may experience a serious case of buyer's remorse in a few years.
With archiving, more so than with other data management functions, lack of planning and poorly understood requirements have particularly far-reaching consequences. The combination of long retention periods and current technology design traits for hardware and software can make it extremely difficult (and often cost prohibitive) to migrate data, resulting in a de facto product lock-in.
Content-addressed storage (CAS) devices have been adopted by some as a preferred target device for archival data. Features like single-instance store and guaranteed immutability through WORM are among the benefits of CAS. But current CAS formats are proprietary and, with some CAS solutions, the only way to access data is via the vendor's API. As a result, unlike traditional storage devices where data can be migrated at a file or LUN level, migration from CAS may require the export and then re-import of all of those years of email or other data; in a worst-case scenario, it may even mean moving the data through the original email archiving app and then out to another device.
Email archiving apps store their data in proprietary formats. Some firms have found after months or years of operation that their chosen email archiving app doesn't satisfy their requirements or that a newer, better solution is available. Unfortunately, the cost and impact of migration and conversion often leave them with no practical alternative but to support multiple email archiving apps for many years to come.
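For readers unfamiliar with the idea, the following Python sketch illustrates the two CAS properties mentioned above: single-instance store and write-once behavior. It is a toy, in-memory model built on assumptions. The class name `ContentAddressedStore`, its methods and the choice of SHA-256 are illustrative and do not represent any vendor's API or on-disk format.

```python
import hashlib


class ContentAddressedStore:
    """Toy in-memory store: an object's address is a hash of its content."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content itself, so identical
        # content always maps to the same address (single-instance store).
        address = hashlib.sha256(data).hexdigest()
        # setdefault never overwrites an existing entry, and the class
        # offers no update or delete method (write once, read many).
        self._objects.setdefault(address, bytes(data))
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]

    def __len__(self) -> int:
        return len(self._objects)


if __name__ == "__main__":
    store = ContentAddressedStore()
    first = store.put(b"Quarterly report attachment")
    second = store.put(b"Quarterly report attachment")  # duplicate content
    print(first == second, len(store))
```

Note that the caller gets back an opaque address rather than a file path. That is convenient for the archive, but it also hints at why moving data out of a real CAS device tends to go through the device's own interface.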
Standards to the rescue?
Addressing current technology limitations is one factor in solving the archiving puzzle. To their credit, some vendors acknowledge that these constraints are inhibiting wide-scale adoption and are taking steps to address the issue. For example, the Storage Networking Industry Association's (SNIA) Data Management Forum is driving standardization efforts for Fixed Content Aware Storage, as well as hosting a task force focusing on the 100 Year Archive.
A critical part of the SNIA effort involves promoting a common storage specification known as the eXtensible Access Method (XAM) that could provide the missing data portability for CAS and archival apps. XAM is a self-describing format that encapsulates metadata with its associated data (à la XML), and enables data to be accessed by an app or utility other than the one that created it. Great concept, right? Unfortunately, establishing a standard is anything but easy.
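The general idea of a self-describing object can be sketched in a few lines of Python. To be clear, this is not XAM and does not follow the XAM specification; it is a hypothetical envelope, using JSON and invented field names such as `media_type` and `created_by`, meant only to show metadata traveling with the data it describes.

```python
import base64
import hashlib
import json


def wrap(payload: bytes, media_type: str, created_by: str) -> str:
    """Bundle a payload with the metadata needed to interpret it later."""
    envelope = {
        "metadata": {
            "media_type": media_type,    # how to interpret the bytes
            "created_by": created_by,    # originating application
            "encoding": "base64",        # how the payload is carried
            "sha256": hashlib.sha256(payload).hexdigest(),
        },
        "data": base64.b64encode(payload).decode("ascii"),
    }
    return json.dumps(envelope, sort_keys=True)


def unwrap(envelope_text: str) -> tuple[dict, bytes]:
    """Recover metadata and payload without the creating application."""
    envelope = json.loads(envelope_text)
    metadata = envelope["metadata"]
    payload = base64.b64decode(envelope["data"])
    # Reject the object if the payload no longer matches its recorded hash.
    if hashlib.sha256(payload).hexdigest() != metadata["sha256"]:
        raise ValueError("payload does not match recorded checksum")
    return metadata, payload


if __name__ == "__main__":
    text = wrap(b"Dear team, ...", "text/plain", "example-mail-archiver")
    meta, body = unwrap(text)
    print(meta["media_type"], len(body))
```

Any utility that understands the envelope can read the object, which is the portability property the standard is after.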
To a large extent, the key to long-term data management lies at its point of creation: the application. In most cases, only the app has sufficient knowledge regarding data dependencies and the business logic to effectively manage the data. As a result, for many categories of data, the problem can't be completely addressed without support from within the application. Application vendors traditionally haven't made long-term retention their highest priority.
What you can do now
For users, it's time to get your house in order. This means understanding requirements, completing a cost/risk analysis, establishing policies, and developing procedures to classify and manage data at a high level.
Here are a few items to consider:
- Understand current policies and practices regarding paper records. Inevitably, there will be differences between paper and electronic data retention, but this is often a useful place to start.
- Develop a retrieval/recovery policy before or in conjunction with retention; for example, define who needs to search, extract and delete data, along with policies for use and expected response times.
- Determine associated data-preservation policy requirements, such as immutability, authentication and security.
- Before investing in technology, develop a policy-driven reference architecture that addresses requirements for ingestion, access, retention, retrieval, security, etc. The more specific you are with vendors, the better your solution will be.
- Don't always assume that you need WORM. Understanding the regulatory requirements in this area can save money and future headaches.
- Look for data pruning opportunities. The "archive everything" approach will be unsustainable and costly. Effort invested here will ultimately play an enormous role in the future usefulness of archived data and the cost to maintain it.
- Push your vendors to support and adopt archiving standards. In product RFPs, have them describe how the data will be retrieved and presented in five years, 10 years and 25 years. This will be fun, if not educational.
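Several of the items above (classification, retention periods, WORM requirements and pruning) can be written down as explicit policy before any product is chosen. The Python sketch below shows one possible shape for such a policy table. The record classes, retention periods and the `is_prunable` helper are hypothetical examples, not recommendations or regulatory guidance.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class RetentionPolicy:
    record_class: str      # classification assigned to the data
    retention_days: int    # how long the data must be kept
    worm_required: bool    # whether immutable storage is actually mandated


# Hypothetical policy table; real values come from legal and business review.
POLICIES = {
    "email": RetentionPolicy("email", 365 * 3, worm_required=False),
    "financial": RetentionPolicy("financial", 365 * 7, worm_required=True),
}


def is_prunable(record_class: str, created: date, today: date,
                policies: dict = POLICIES) -> bool:
    """Return True once a record has outlived its retention period."""
    policy = policies.get(record_class)
    if policy is None:
        # Unclassified data is kept until someone classifies it.
        return False
    return today >= created + timedelta(days=policy.retention_days)


if __name__ == "__main__":
    today = date(2024, 1, 1)
    print(is_prunable("email", date(2020, 1, 1), today))
    print(is_prunable("financial", date(2020, 1, 1), today))
```

Even a table this small forces the questions that matter: which classes of data exist, how long each must be kept, and which ones genuinely need WORM.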