Database archiving done right

Once you have decide what to archive in your database, the hard part begins. Discover how to archive your database in the least costly manner possible.

When is a gigabyte not a gigabyte?
Like a soldier in the front lines, a gigabyte of production data is supported by many others behind it. To keep the production data available 24x7, behind the scenes are local mirrored copies, remote mirrored copies for disaster recovery and copies for each parallel development and test effort, plus multiple copies of all of these.

Consider the following best practices advocated by various vendors for mission-critical production data: Each production volume is protected by mirrored storage. To provide fast recovery from errors, two additional rolling snapshots of the data are kept online. To enable disaster recovery and business continuance, two copies of the data are kept on a remote storage system. One or both of those copies will be mirrored to protect the storage on the remote side in case production operations must be transferred to the recovery site. To facilitate non-disruptive backups of the production data, another independent copy of the data may exist (the life of the rolling snapshot is typically not long enough to be used to backup a large database). Add another copy to do backups at the recovery site--although it may be possible to double-dip and use the space from one of the remote copies used for replicating the data prior to the disaster.

Now, consider the tape backup of production data. There will be at least one full copy of the data, followed by multiple versions of incremental changes to enable the business to recover data back to a given point in time. Assuming 25% of the database contents are modified in one form or another in the course of a week, the equivalent of another full copy of the data is required to protect a month of data. Best practice is to keep one copy of the tapes on site and another in off-site storage, adding another two full copies of data. The total so far is 13 copies of production data, under full view and control of the IT department.

As springtime approaches, perennial household chores such as spring cleaning take center stage. Unlike an annual once-over for a messy house, deciding what to clean in a relational database is an ongoing problem.

File-oriented data can be managed using familiar operating system utilities, but from a storage administrator's perspective, a database is a monolithic container. The storage administrator only knows the size of the container and where it's located. Managing its contents is the domain of the database administrator. But make no mistake--all databases need to be archived for three main reasons:

  1. Regulatory compliance. For many organizations, the primary reason for considering an archiving strategy is to address regulatory concerns and risks. These days, almost everyone is aware of the increased focus on the need to retain and retrieve certain kinds of information and the penalties associated with the inability to do so. Many of these regulations dictate long-term retention requirements that place a significant burden on organizations attempting to comply.

  2. Managing data growth. Another factor is the continuing--and often uncontrolled--growth of data. Corporate databases are becoming the primary repositories for critical enterprise data, and are growing at an estimated 60% to 125% annually. Many primary-tier applications are built on relational databases, and the use of advanced data protection techniques such as rolling snapshot volumes and remote replication are most often applied to database volumes. It's surprising how many copies of the data may exist when you add backups to the mix (see "When is a gigabyte not a gigabyte?").

  3. Application Performance. One of the most critical drivers for database archiving is application performance. Quite simply, as databases grow, they slow down. Typically, 50% or more of the data residing in databases is historical or inactive. Yet when database searches or lookups are performed, this inactive data is processed and combed through along with current data, resulting in a significantly slower application response. And of course, the performance of ancillary activities such as backup and recovery is also affected (see "Why upgrade when you can archive?").
Issues and challenges
Archiving data that resides within a database presents some significant challenges. The first challenge is to determine what needs to be archived and when: in other words, data classification and policy development. This can be a complex issue because in most organizations, arriving at an answer is a multidepartmental and multifunctional effort. Cross-functional teams of IT infrastructure and application groups and lines of business, as well as functional areas such as finance and legal, are required to classify data and establish policies for movement and retention. Then, there's the whole range of technical and process issues. These include:

Heterogeneity. Most organizations deploy a wide range of applications that are often built on several different databases. This could include legacy products or older versions of current products. Archiving this range of information could require a variety of tools and processes, thereby increasing complexity.

Application complexity. Even after completing data classification and establishing policies, the rules associated with the business logic of the application and internal relationships and dependencies must be considered. This requires a comprehensive understanding of the application.

Retrieval considerations. Storing years of database backup tapes isn't difficult; retrieving a particular set of information is the hard part. The difficulty of retrieving this information increases in direct correlation to the age of the data. Issues such as media readability and compatibility, system dependencies and application versions must be considered. Above all, there's the problem of locating and identifying the specific information among the many generations of data.

Data destruction. The news is rife with high-profile investigations demonstrating that in many situations, it's just as important to destroy data as it is to retain it. Data retained for longer periods than required for regulatory purposes may become a liability. If an investigation or legal proceeding takes place and some relevant data is discovered, it may need to be produced, even if there was no legal obligation to retain it. In addition to potential liabilities, the cost of retrieving the data can be substantial.

When is a gigabyte not
a gigabyte? (continued)
The reality is that other copies usually exist, and are controlled by different organizations. Each parallel line of development will have its own snapshot of a stable copy of the production data, plus another copy to test against. Both are RAID-protected like the production data. An independent test organization will have similar storage requirements. The stable copy of data is probably backed up to tape for both organizations. In this scenario, a single development and test effort would add up to ten more copies of the data. The line of business staff may request additional backups of the production data before any significant change to an application, perhaps twice a month.

Viewed across the entire enterprise, one can see how each gigabyte of production data could require an additional 25 copies of the data to support it. Are you running a tight shop, and using tape-based disaster recovery to save money on storage? It would still be hard to operate with less then ten copies of the data.

Where is the technology today?
Specialized database archiving technology is relatively young, with dedicated third-party products entering the market in late 1999 and 2000. The enterprise faced with problematic database growth must choose among three options: ad hoc methods, administrative tools native to their database applications or specialized third-party archiving products (see "Comparison of database archiving approaches").

Comparison of database archiving approaches

Ad hoc archiving typically begins as a reaction to intolerable circumstances. The database has grown so large that batch processing windows are being missed, and ever-expanding storage and processing requirements catch the attention of senior management. The quick response is the equivalent of an emergency liposuction: A highly skilled team or individual combs through the database, pulling out as much old or unused data as they safely can. Time, cost and expertise constraints limit the depth and breadth of this exercise, so only the easy targets are found.

The operation continues until the database is lean enough to meet minimal operational requirements. If done with some consideration for the future, the purged data will be kept in a separate instance of the database, so referential integrity will (or may) survive, as will some ability for future access. However, at the next crisis a different team may respond using different techniques, and over time it's a near certainty that the purged data will be either lost or become so disjointed as to be unusable. Building and supporting a customized tool to handle the situation effectively is often beyond the budgets and skill sets available.

Native database administrative tools start to look attractive after the enterprise has dealt with a couple of crisis situations. Tools native to database vendors have the advantage of working well with the application and are supported by expert technologists. Oracle, SAP and other application vendors have their own tools and methodologies. The tools make it easier to extract and manage data; continued access by the application is straightforward. However, it isn't uncommon for a major upgrade of a database product to require production data to be brought forward into a new format incompatible with the previous version. This is no small task.

Third-party database archiving products

Maintaining access to archived data requires that it be migrated with production data, which adds to the support burden. However, many native database management tools weren't designed with archiving in mind, and lack the notion of higher-level policies to drive and automate the archiving process. This can be overcome by building custom scripts and procedures to manage the process, but this brings us back to the long-term support issues involved with the ad hoc approach. In a single-vendor shop, this may be manageable. In a large shop with several different applications or databases, the problem is managing the separate solutions across all of the different platforms. An enterprise with active DB2, Oracle, SAP, Siebel and PeopleSoft applications is faced with developing and managing separate archiving solutions for each flavor, and the costs quickly become untenable.

A small number of third-party products are now available that are specially designed for database archiving, such as products offered by Princeton Softech, in Princeton, NJ, and OuterBay Technologies, Cupertino, CA (see "Third-party database archiving products"). These products feature policy engines to help capture and manage the business rules that drive the archiving process. They have tools that facilitate extracting data and managing archives. Some also have monitoring and reporting facilities to track data growth and make projections on database and archive size. They provide a way for the application to easily access archived data as needed.

Archived data can be kept online in application-native format or in an application-independent format that still preserves referential integrity. The latter is especially useful for moving less-used data to nearline or offline storage for extended periods of time, without the necessity of bringing archived data forward with each new release of the application. Once an archiving policy for the database has been established, the archiving product handles movement of data between application and archive, and performs translation between application versions as needed. Additionally, these specialized third-party tools cover an increasing number of database and enterprise resource planning (ERP) applications, so that one archiving tool can serve the entire enterprise.

Why upgrade when you can archive?
Here is a typical database performance scenario: A key application gradually becomes less and less responsive over time. It's gotten to the point that the system, storage and database administrators are meeting to discuss the problem. Like the blind men describing the different parts of an elephant, each begins to analyze the problem from his own perspective.

The system administrator analyzes CPU, memory, paging and IO utilization, and (if trending data is available) confirms that utilization has been steadily increasing. The storage administrator reviews capacity and performance data, such as storage area network (SAN) switch port utilization, storage system cache parameters and physical disk utilization. It's discovered that the amount of data has grown significantly, and while there's plenty of SAN bandwidth, there's definitely a high rate of disk activity for the physical storage assigned to the database. The DBA begins to examine the problem from the application perspective and finds that queries and batch jobs are taking much longer to complete.

At this point, the next step is usually to go through several iterations of tuning. This often includes bringing in expertise from the vendors involved to make sure the system is optimized. Also, it isn't uncommon to begin to hear talk of the need for hardware upgrades. After several attempts, the conclusion is drawn that tuning options have been exhausted, and the only option is to upgrade.

However, there is another option: purge and archive the database. There are three classes of database archiving tools: ad hoc, native and third-party products.

An ad hoc process should only be considered for those rare instances when an archive will be only required infrequently and there isn't a need for long-term retention or frequent retrieval. Another consideration is the number of applications: More than one or two ad hoc database archiving processes will likely prove to be too difficult to manage.

Native tools are typically used for less complex, custom applications where data relationships are well understood within the organization. As long as data retrieval and retention requirements remain low, this approach is acceptable.

Third-party tools can make the archiving process more repeatable. They are particularly suited to large commercial applications, and where either regular access and/or long-term retention of archived data are required.

Getting started
Developing an archiving strategy is a significant undertaking. As mentioned earlier, a cross-functional team will be required to ensure that the business and technical needs are adequately addressed. The phases for developing and deploying an archiving strategy include:

  • Assess your archiving needs
  • Understand application data characteristics
  • Classify data
  • Establish data handling and retention policies for each class of data
  • Determine requirements for archival data format and retrieval
  • Evaluate tools
  • Conduct a pilot program
  • Wide-range rollout
The greatest challenge to database archiving isn't technology. The greatest obstacle for most organizations is coming to an agreement on what to archive and establishing appropriate policies for archived data. To address this, the first step is to assess your needs. Is the primary factor regulatory compliance and long-term retention, or is it capacity and performance? For some industries, compliance issues may require adopting records-management applications in addition to data archiving techniques.

Application data characteristics and dependencies can greatly impact the feasibility and cost to implement archiving. A major application like PeopleSoft can have thousands of tables, and understanding the business rules and logic to determine how to archive this data isn't trivial. Application complexity is a major driver for the adoption of third-party archiving tools. The classification of data and the development of policies for retention, migration to the archive and capabilities for retrieval are essential.

The phases of product evaluation and piloting should not just focus on the technology. They should also include the development and testing of standard operating procedures and the identification of roles and responsibilities needed to ensure that archiving policy requirements can be met. A wide-range rollout of an archiving solution demands regular monitoring and measurement to ensure policy compliance and evaluate whether performance and data capacity levels are meeting expectations.

The benefits of a successfully deployed database archiving strategy can be far-reaching. Performance improvements, better storage management and improved data retention are significant paybacks. Third-party database archiving products are starting to play a more prominent role in automating the archiving process. Take the necessary time to properly evaluate, design and test these new database archiving applications to achieve success.

Dig Deeper on Long-term archiving