One of the most challenging tasks facing storage managers is the development of a strategy for archiving data. Deciding what should be archived, when it should be archived and for how long goes to the core of the storage management process. You also have to understand the business value of data - perhaps more than you currently do. But when done properly, archiving can be a lifesaver to businesses requiring access to historic information for regulatory or audit purposes. Conversely, when it isn't done right, it can cost a company dearly in lost revenue, fines and other penalties.
To avoid these problems, you need a comprehensive strategy built around solid policies about data retention - something you'll need to develop with business managers. But there are also a host of factors directly under the control of storage managers: the tools you use, the formats you choose and the procedures to execute your strategy.
Archiving vs. backup
When many administrators hear the term archive, they think backup. That's where the trouble often begins.
"Sure, we do archiving," I was recently told by an IT manager. "Every quarter, we send full backups off for seven years," he stated confidently.
I asked him a few follow-up questions: How would he handle specific requests for three- or four-year-old data? What would the process be for retrieving it? This quickly left him feeling somewhat less confident.
One reason that the term "archive" is often misused is that many products that claim to do archiving provide different capabilities. At one end of the spectrum, there are a number of backup products that treat archiving as simply a backup followed by a deletion of the data from primary storage - a rather scary thought. This definition of archiving is really intended to assist in removal of old data cluttering up servers. A more effective approach to addressing this particular problem is through the use of storage resource management (SRM) or hierarchical storage management (HSM) tools.
So what is archiving, anyway?
A more useful definition of archiving is "the long-term storage of a point-in-time copy of information for a specific business purpose." This contrasts with backup in that backups are intended primarily to protect against short-term data loss, such as accidental deletion, device failure and data corruption.
Some strong candidates for archival data include periodic corporate financial information retained for auditing purposes, medical patient information retained for compliance with the Health Insurance Portability and Accountability Act of 1996 (HIPAA), or data pertaining to clinical trials of a new drug wending its way through the FDA drug approval process.
The long-term nature of archived data presents a number of problems. Some may seem obvious, while others are less so. Here are some fundamental concerns:
- Can the media format be read? How many of you still have QIC tape drives in-house? How about 9-track tape? Today, we have various tape formats and numerous generational variants within a given format. Tape drives typically can't read media older than a generation or two. For long-term retention, some thought must be given to maintaining devices for long-term recovery, or migrating data to newer media. This is further complicated in some regulated industries, where migration can raise validation and authentication issues.
- Is the media still valid? The lifespan of magnetic tape media is dependent on a number of factors, but the bottom line is that if data is being maintained for a long time, steps must be taken to ensure long-term integrity. This includes maintaining proper environmental control, refreshing volumes as needed and similar tasks.
- Can the data be utilized after it's restored? This goes to the heart of the matter. The data must be in a reasonably portable format that doesn't depend on a now-obsolete version of an application or operating platform. Old data may depend on a specific application version, an operating system and even the processor architecture in use when the data was stored.
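One way to support the media-validity concern above is to record a checksum for each archived file when it is written, then re-verify it whenever a volume is refreshed or migrated. The sketch below assumes a simple in-memory manifest of SHA-256 digests; the function names, file names and manifest layout are illustrative and not features of any particular product.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file under `root` (by relative path) to its digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Return the names of files that are missing or no longer match."""
    root = Path(root)
    damaged = []
    for name, expected in manifest.items():
        path = root / name
        if not path.is_file() or sha256_of(path) != expected:
            damaged.append(name)
    return damaged


# Demonstration on a throwaway directory standing in for an archive volume.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "ledger_2019.csv").write_text("id,amount\n1,120.0\n")
    (root / "ledger_2020.csv").write_text("id,amount\n2,75.5\n")
    manifest = build_manifest(root)
    clean_result = verify_manifest(root, manifest)
    # Simulate silent corruption of one file after the manifest was taken.
    (root / "ledger_2020.csv").write_text("id,amount\n2,99.9\n")
    damaged_result = verify_manifest(root, manifest)
```

In practice the manifest would be stored separately from the media it describes, so that a damaged volume cannot also damage its own record of what it should contain.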
While there aren't a large number of tools to assist specifically with data archiving, there are some worthy of consideration. Start with your current backup software. Some backup products - such as IBM's Tivoli Storage Manager - have specific features designed to handle archiving, including the ability to do the following:
- Attach a descriptive label to a group of archived files, browse by using this label and, if desired, retrieve the entire group of files en masse;
- Designate dedicated storage pools for archiving;
- Easily define and assign retention policies for an archived file different from the backed up copy;
- Track archived volume location and expiration;
- Archive files without requiring their deletion from primary storage.
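To make these capabilities concrete, here is a minimal sketch of an archive catalog that groups files under a descriptive label, carries its own retention period and tracks volume location and expiration. It is a hypothetical data model for illustration only; it does not reflect the interface of Tivoli Storage Manager or any other product, and the labels, paths and volume names are made up.

```python
from dataclasses import dataclass
from datetime import date


def add_years(d, years):
    """Shift a date by whole years, mapping Feb 29 to Feb 28 if needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


@dataclass(frozen=True)
class ArchiveEntry:
    label: str         # descriptive label shared by a group of files
    path: str          # original location on primary storage
    archived_on: date
    retain_years: int  # retention set independently of any backup policy
    volume: str        # archive volume holding the copy

    @property
    def expires_on(self):
        return add_years(self.archived_on, self.retain_years)


def by_label(catalog, label):
    """Return every entry in a labeled group, for retrieval en masse."""
    return [e for e in catalog if e.label == label]


def expired(catalog, today):
    """Return entries whose retention period has ended as of `today`."""
    return [e for e in catalog if e.expires_on <= today]


catalog = [
    ArchiveEntry("FY2018-Q4-financials", "/finance/gl.csv", date(2019, 1, 15), 7, "VOL0012"),
    ArchiveEntry("FY2018-Q4-financials", "/finance/ap.csv", date(2019, 1, 15), 7, "VOL0012"),
    ArchiveEntry("trial-interim", "/clinical/interim.dat", date(2015, 6, 1), 3, "VOL0003"),
]
group = by_label(catalog, "FY2018-Q4-financials")
due = expired(catalog, date(2020, 1, 1))
```

Keeping the retention period on the archive entry itself, rather than inheriting it from the backup schedule, is what lets an archived copy outlive the backed-up copy of the same file.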
Data held in a database is among the most problematic to archive. While it is relatively easy to archive files, how do you archive the records contained within a database file? The most common current practice is to retain a copy of the entire database. This has at least two disadvantages:
- The entire database needs to be restored to retrieve the desired data. Imagine all of the problems associated with restoring your five-year-old Oracle database. Suffice it to say, I don't have enough space to list them here.
Database growth is a problem that has a ripple effect throughout storage. As databases expand, they become slow and unwieldy, consume more disk space and are increasingly difficult to back up and restore. If there were a way to prune and store records from databases, it could have a significant impact on the storage infrastructure.
A more effective approach to database archiving is to export data using SQL. Exported data is more portable and easier to retrieve, and the capability is readily available in most databases. In addition, there are tools available that improve on this process by making it more automated and manageable.
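As a rough illustration of the SQL export approach, the sketch below uses Python's built-in sqlite3 and csv modules to copy rows older than a cutoff date into a CSV file and then prune them from the live table. The `orders` table, its columns and the `archive_old_rows` function are hypothetical; a production database would need its own export tooling, plus validation of the exported copy before anything is deleted.

```python
import csv
import io
import sqlite3


def archive_old_rows(conn, cutoff, out):
    """Write rows dated before `cutoff` to `out` as CSV, then delete them.

    Hypothetical schema: orders(id, placed_on, amount), with placed_on
    stored as ISO-8601 text so string comparison matches date order.
    """
    cur = conn.execute(
        "SELECT id, placed_on, amount FROM orders "
        "WHERE placed_on < ? ORDER BY id",
        (cutoff,),
    )
    writer = csv.writer(out)
    # A header row keeps the export readable without the source database.
    writer.writerow([col[0] for col in cur.description])
    rows = cur.fetchall()
    writer.writerows(rows)
    # Prune only after the export has been written; commit on success.
    with conn:
        conn.execute("DELETE FROM orders WHERE placed_on < ?", (cutoff,))
    return len(rows)


# Demonstration against an in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, placed_on TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2019-03-01", 120.0), (2, "2021-07-15", 75.5), (3, "2024-01-10", 42.0)],
)
conn.commit()

buffer = io.StringIO()
archived = archive_old_rows(conn, "2022-01-01", buffer)
remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

A plain-text export like this can be read years later without the original database engine, which addresses the portability concern, and pruning the exported rows keeps the live database smaller.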
Another promising development in data archiving is the emergence of content management tools. Designed to work at the application level, these products are aware of the relationships and context of data within a specific application and can be used to store data in a readily retrievable form (see a sampling of these in "Some applications have tailored archive tools," this page).
Know your data
The key requirement for effective archiving is to develop an understanding of your data - or more accurately, the value of your data. A system of data classification leads to intelligent policy management with regard to primary storage (e.g., disk) and secondary storage (backup and archive).
Data classification isn't an easy undertaking for an organization. It requires business units and other application owners to make decisions about what's important and what isn't. Data classification complicates life for IT organizations - and especially storage administrators - by forcing them to consider tiers of offerings instead of the simple one-size-fits-all approach. Most challenging, perhaps, is that it forces diverse groups in an organization to communicate with one another.
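A classification scheme ultimately reduces to a lookup from data class to storage policy. The sketch below shows one possible shape for such a table; the class names, tiers and retention periods are placeholders chosen for illustration, since the real values have to come from business owners and applicable regulations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoragePolicy:
    primary_tier: str      # where the live copy resides
    backup_frequency: str  # short-term protection
    archive: bool          # whether a long-term copy is kept
    retain_years: int      # 0 when no archive copy is kept


# Placeholder classes and values, not recommendations.
POLICIES = {
    "regulated": StoragePolicy("tier-1 disk", "daily", True, 7),
    "business-critical": StoragePolicy("tier-1 disk", "daily", True, 3),
    "operational": StoragePolicy("tier-2 disk", "weekly", False, 0),
}


def policy_for(data_class):
    """Look up the policy for a class, refusing unclassified data."""
    try:
        return POLICIES[data_class]
    except KeyError:
        raise ValueError(f"unclassified data: {data_class!r}") from None


# Classes that receive a long-term archive copy under this sample table.
archive_classes = sorted(name for name, p in POLICIES.items() if p.archive)
```

Rejecting unclassified data outright, rather than falling back to a default, forces the classification conversation to happen before the data lands in storage.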
Is it worth all this trouble? The alternative to data classification is what I refer to as the cross-your-fingers approach to storage management. With regard to archiving, this translates to a policy of "save everything, and hope that you never need to retrieve it." This may work with small quantities of data, but it's extremely costly in most organizations and can prove extremely risky as well. The result could essentially be the same as no long-term protection. The question you must answer is: How good are you at finding needles in haystacks?