Using a backup program to create archive files isn't a good idea, because trying to find specific information in backups is costly and time consuming.
A bottle of grape juice left on a shelf long enough will ferment, but no one would call it wine. Similarly, it's possible to restore data from old backups, but no one should call them archives. Simply put, backups make lousy archives.
Archives are for the logical retrieval of information; that is, to retrieve information grouped in a logical way. For example, with archives you can store reference data such as:
- The CAD drawings, parts lists and other manufacturing information for a widget your company used to make
- All of the information pertaining to a former customer
- All information related to a closed project, account, law case, etc.
- Tax returns, financial records or other records for a particular year
Archives also serve a second purpose: the logical storage of active data. Suppose, for example, it was discovered that a critical safety part was removed from a particular widget's design. It would be important to see every version of the specification, along with information about who changed it. And what about the common practice of electronic discovery of e-mail systems? Think about the discovery requests that can occur when someone in management is accused of harassment or discrimination, a trader is accused of promising financial returns or a company is charged with colluding with its competitors. Such accusations may result in e-discovery requests that look like the following:
- All e-mails from employee A to employees B, C and D for the last year
- All e-mails and instant messages from all traders to all customers for the last three years that contain the words "promise," "guarantee," "vow," "assure" or "warranty"
- All e-mails that left a company going to domains X, Y and Z, or to certain specific e-mail addresses
|Turning backups into archives|
|Another common question is what to do when switching away from using backups as archives. What should you do with all of the old tapes in the old backup format(s)? The answer is the same as it is for changing backup formats. Your only real alternative is to restore the oldest version of the data being archived, archive it, delete it and then restore the next version, repeating until every version is in the archive. It's not pretty, but it's reality. The good news is that every backup you turn into an archive means storage savings.|
Old backups aren't enough
The most common way data is archived is by keeping backups for a long time. Weekly or monthly full backups are performed, and then the backup is kept from one year to 50 years, depending on business requirements. There couldn't be a worse way to archive.
There are many difficulties with using backups as archives. The most common use of backups as archives is for the retrieval of reference data. The assumption is that if someone asks for widget ABC's parts (or some other piece of reference data), the appropriate files can just be restored from the system where they used to reside. The first problem with that scenario is remembering where the files were several years ago.
Even if you can remember where the files belong, the number of operating systems or application versions that have come and gone in the intervening time can stymie the effort. To restore files that were backed up from "Apollo" five years ago, the first requirement is a system named Apollo. Someone also has to handle any authentication issues between the backup server and the new Apollo because it isn't the same Apollo it backed up from five years ago. Depending on the backup software and OS in question, the new Apollo may also need to be running the same version of the OS and applications the old Apollo was running five years ago. Otherwise, there may be incompatibilities in the file system or database being restored.
Satisfy electronic discovery requests
Backups are also used to satisfy electronic discovery requests, which can be even more challenging. Let's use the most common electronic discovery request as an example: a request for e-mails that match a particular pattern and were sent via an Exchange server. (The following also applies to other e-mail systems, such as Lotus Notes or SMTP-based mail servers.) There are two big problems with using backups to satisfy such a request. The first is that it's impossible to retrieve all e-mails sent or received by a particular person. It's only possible to restore the e-mails that were in the Exchange server when backups were made. If the discovery request is looking for an e-mail that somebody sent, deleted and then cleared from their Deleted Items folder, it wouldn't be on that night's backup, and thus would never show up when attempting to retrieve it weeks, months or years later. It would therefore be technically impossible to meet the discovery request using backups. This means that even after doing your best to successfully satisfy the discovery request, a plaintiff may claim you haven't proven your case.
The second problem with using backups to satisfy an Exchange electronic discovery request is that it's very difficult to retrieve months or years of e-mails using backups. Suppose, for example, a company performs a full backup of its Exchange server once a week, and for compliance reasons it stores these backups for seven years. If the company received an electronic discovery request for e-mails from the last seven years, it would need to perform many restores of its entire Exchange server to satisfy the request. The first step would be to restore the Exchange server to an alternate server using last week's backup. Next, you would have to run a query against Exchange to look for the e-mails in question, saving them to a .pst file. You would then have to restore the Exchange server using the backup from two weeks ago, rerun the query and create another .pst file. It would be necessary to restore the entire Exchange server 364 times (seven years multiplied by 52 weeks) before you're done. And almost every step in this process will have to be done manually.
The above scenario isn't impossible to accomplish, but the recovery effort will entail an incredible amount of time and money. A plaintiff in a civil suit or the government doesn't care how much it costs the defendant; your company has a court order to produce the e-mails regardless of cost.
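The restore count in the scenario above is simple arithmetic, and a short sketch makes it easy to rerun with your own retention policy. The function names and the eight-hours-per-restore figure below are assumptions made purely for illustration; they don't come from any product or survey.

```python
WEEKS_PER_YEAR = 52

def full_restores_needed(retention_years, fulls_per_year=WEEKS_PER_YEAR):
    """One full restore is needed for every full backup kept in retention."""
    return retention_years * fulls_per_year

def estimated_effort_hours(restores, hours_per_restore):
    """Rough manual effort; hours_per_restore is a site-specific guess."""
    return restores * hours_per_restore

restores = full_restores_needed(7)
print(restores)                             # 364
print(estimated_effort_hours(restores, 8))  # 2912 (assumes 8 hours per restore)
```

Plugging in your own restore time, including the query and .pst export steps, gives a first-order estimate of what a single request could cost.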
|Which is best for archiving: Disk or tape?|
An archive system encounters the same issues as a backup system if tape is used as its primary storage medium. One solution might be to use content-addressed storage (CAS) as the primary storage device for archives. If the product supports a standard file-system interface, such as NFS or CIFS, as well as single-instance storage and delta-block technologies, it could solve a number of problems.
First, a disk product using single-instance storage and delta-block incremental technologies will be less expensive to operate than a tape-based system because you can't apply delta-block technologies to tape-based systems. Second, if the CAS device supports a file-system interface, then migrating between storage systems should be relatively simple. With a tape-based system, you have to copy all data from the old tape format to the new tape format. With a file-system-based system, you simply copy data from the older device to the newer device.
Finally, you could potentially solve the format issue. If archive products can support the discovery of existing CAS systems, you could theoretically switch archive products with no ill effects. The raw data would still be accessible via the file-system interface, and the meta data could be imported--or the new archive system could grab the meta data from the CAS device. Your mileage will definitely vary, but solutions are available.
Other backup bugaboos
Backups are also an extremely inefficient way to store archives. While an archive system will make sure it has only one or two copies of a particular version of a file, a backup system usually has no such logic. If a company is using weekly full backups as archives (or creating "archives" with its backup product but not deleting the original files), and storing its archives for seven years, it will have 364 copies of many of its files stored on tape--even if those files never changed. This leads to an incredible amount of media waste.
Another strike against using backups as archives is the number of times a company changes backup formats and tape formats over the years. Almost every company using backups as its archives has a number of older tape and backup formats it must continue to support for archive purposes. While older tape formats can be converted with a lot of copying, converting older backup formats is another story. Most people choose to hold onto both old tape formats and old backup formats, and hope they never have to read them.
The most important feature of an archiving system is that it contain enough meta data to allow information to be retrieved in logical ways. For example, meta data can include the author or business unit that created an item. (An item can be any piece of archived information, such as a file, a record from a database or an e-mail.) Meta data might also contain the project the item is attached to or some other logical grouping. An e-mail archive system would include who sent and received an e-mail, the subject of the e-mail and other appropriate meta data. Finally, an archive system may import the full text of the item into its database, allowing for full-text searches against the archive. This can be useful, especially if multiple formats can be supported. It's particularly expedient to be able to do a full text search against all e-mails, Word documents, PDF files, etc.
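To make the idea of meta data plus full-text search concrete, here is a minimal in-memory sketch. The class names, field names and sample items are all hypothetical; a real archive product would use a proper database and format-aware text extraction rather than the simple word split shown here.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class ArchivedItem:
    item_id: int
    kind: str                     # e.g. "email", "file", "db_record"
    body: str                     # full text of the item
    metadata: dict = field(default_factory=dict)

class ArchiveIndex:
    """Toy index: exact-match meta data lookup plus a word-level text index."""

    def __init__(self):
        self.items = {}
        self.by_metadata = defaultdict(set)  # (key, value) -> item ids
        self.by_word = defaultdict(set)      # word -> item ids

    def add(self, item):
        self.items[item.item_id] = item
        for key, value in item.metadata.items():
            self.by_metadata[(key, value.lower())].add(item.item_id)
        for word in re.findall(r"[a-z0-9']+", item.body.lower()):
            self.by_word[word].add(item.item_id)

    def search(self, words=(), **metadata):
        """Items matching ALL meta data filters and ANY of the given words."""
        ids = set(self.items)
        for key, value in metadata.items():
            ids &= self.by_metadata.get((key, value.lower()), set())
        if words:
            hits = set()
            for word in words:
                hits |= self.by_word.get(word.lower(), set())
            ids &= hits
        return sorted(ids)

index = ArchiveIndex()
index.add(ArchivedItem(1, "email", "I guarantee a 12% return",
                       {"sender": "trader1", "recipient": "customerA"}))
index.add(ArchivedItem(2, "email", "Lunch on Friday?",
                       {"sender": "trader1", "recipient": "customerA"}))
index.add(ArchivedItem(3, "file", "Widget ABC parts list",
                       {"project": "widget ABC", "author": "engineering"}))

print(index.search(words=["promise", "guarantee"], sender="trader1"))  # [1]
print(index.search(project="widget abc"))                              # [3]
```

The first query mirrors an e-discovery request (one sender, a list of suspect words); the second mirrors a reference-data retrieval by project. Both run against the same stored items.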
Another important feature of archive systems is their ability to store a predetermined number of copies of an archived item. A company can then decide how many copies to keep. For example, if a firm is storing its archives on a RAID-protected system, it may choose to have one copy on disk and another on a removable medium such as optical or tape.
Two types of archivers
Archive systems can be roughly divided into two categories depending on the way they store data. The first is the traditional, low-retrieval archive system attached to your backup software package. Such an archive system lets you make an archive of a selected group of files and attach limited meta data to it, such as "widget XYZ," and then have the archive system delete the backup files in question. The good thing is that it allows the attachment of meta data and can reduce multiple copies in the archive by deleting the duplicate backup files as they're archived. The bad news is that if you want to search archives using different types of meta data--such as owner, time frame, etc.--you need to create multiple archives. The main use for this type of archive is to save space by deleting files attached to projects or entities that are no longer active.
The second--and newer--category of archive systems realizes that any archived item might need to be retrieved for different reasons and would thus require different meta data. To support multiple types of retrievals, it's important to store the actual archived item only once, but with all of its meta data in a searchable database. Such a system realizes that a given archived item might be put into the archive not to save space, but to allow it to be searched for logically. Unlike their predecessors that stored the only copies of reference data, newer archive programs store an extra copy of the data, leaving the original in place.
As discussed previously, one of the problems with using backups as archives is that they won't have all occurrences of a file or message; they'll have only those items that were available when the backup was made. Some of the newer archive systems solve this problem by archiving data automatically. For example, every e-mail that comes in or is sent out is captured by the archiving system. Every time a file is saved, a version of the file is sent to the archive system.
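The difference between a nightly backup and capture-on-arrival can be shown with a small simulation. This is only a sketch: the `Mailbox` class and its `journal` hook are hypothetical stand-ins, not the API of Exchange or any archive product.

```python
class Mailbox:
    """Toy mail store with an optional journal that receives every message."""

    def __init__(self, journal=None):
        self.messages = {}      # msg_id -> text currently in the mailbox
        self.journal = journal  # list acting as the archive, or None

    def deliver(self, msg_id, text):
        self.messages[msg_id] = text
        if self.journal is not None:
            # Captured at send/receive time, before the user can delete it.
            self.journal.append((msg_id, text))

    def delete(self, msg_id):
        self.messages.pop(msg_id, None)

    def nightly_backup(self):
        # A backup sees only what is still present when it runs.
        return dict(self.messages)

archive = []
mailbox = Mailbox(journal=archive)
mailbox.deliver(1, "Quarterly numbers attached")
mailbox.deliver(2, "I promise you a 20% return")
mailbox.delete(2)                     # sent, then deleted the same day

backup = mailbox.nightly_backup()
print(2 in backup)                    # False
print(any(msg_id == 2 for msg_id, _ in archive))  # True
```

The deleted message never reaches the backup, but the journal copy is still available to satisfy a later request.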
Another advantage of newer archive systems is their use of single-instance store and delta incremental concepts. They store only one copy of a file or e-mail, no matter where it came from or who it went to. (Of course, the archiving system records who it came from or who it was sent to.) If that file or e-mail is then changed and sent/stored again, the archiving application will store only the changed bytes in the new version. Single-instance store saves a lot of disk space.
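A minimal sketch of single-instance storage with block-level deltas follows. It assumes fixed-size blocks identified by SHA-256 digest, which is a simplification chosen for illustration; commercial products may work at the byte level or use variable-size blocks. The class and method names are hypothetical.

```python
import hashlib

class SingleInstanceStore:
    """Stores each unique block once; a version is just a list of digests."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}    # digest -> block bytes (stored once)
        self.versions = {}  # (name, version number) -> list of digests

    def put(self, name, data):
        """Store a new version of `name`; return how many new blocks were written."""
        digests = []
        new_blocks = 0
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                self.blocks[digest] = block
                new_blocks += 1
            digests.append(digest)
        version = sum(1 for n, _ in self.versions if n == name) + 1
        self.versions[(name, version)] = digests
        return new_blocks

    def get(self, name, version):
        """Reassemble a stored version from its blocks."""
        return b"".join(self.blocks[d] for d in self.versions[(name, version)])

store = SingleInstanceStore(block_size=4)
print(store.put("spec.txt", b"AAAABBBBCCCC"))       # 3 (all blocks are new)
print(store.put("copy-of-spec.txt", b"AAAABBBBCCCC"))  # 0 (duplicate file)
print(store.put("spec.txt", b"AAAAXXXXCCCC"))       # 1 (only the changed block)
print(store.get("spec.txt", 1))                     # b'AAAABBBBCCCC'
```

Fixed-size blocks keep the sketch short, but an insertion near the start of a file shifts every later block, so this approach works best for in-place changes.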
Regarding the format issues of backups as archives, many archive systems still grapple with those issues (see "Turning backups into archives"). Many people still store their archives on tape and, as time passes, may change their archive software. Therefore, this problem could persist even for archives (see "Which is best for archiving: Disk or tape?").
Newer archiving systems also serve as a hierarchical storage management-like system, automatically deleting large, older files and e-mails, and invisibly replacing them with stubs that automatically retrieve the appropriate content when accessed. This is one of the main business justifications used to sell e-mail archive software. In addition to satisfying e-discovery requests, you can save a lot of space by archiving redundant and unneeded e-mails and attachments.
Surveys show that more than 90% of typical e-mail storage is consumed by attachments. If you can store only one copy of an attachment across multiple e-mail servers (and Exchange Storage Groups) and replace it with a stub, then you can save a lot of storage. If you add delta-block incrementals to that, you can save even more storage.
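The stub mechanism can be sketched in a few lines. Everything here is hypothetical: mailboxes are plain dictionaries, attachments are byte strings, and the size threshold is arbitrary. The point is only to show one stored copy serving several mailboxes, with transparent retrieval on access.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Stub:
    digest: str  # key of the archived content
    size: int    # original size, kept for display purposes

class AttachmentArchive:
    """Keeps one copy of each distinct attachment, keyed by SHA-256 digest."""

    def __init__(self):
        self._content = {}

    def archive(self, data):
        digest = hashlib.sha256(data).hexdigest()
        self._content.setdefault(digest, data)  # store only the first copy
        return Stub(digest, len(data))

    def fetch(self, stub):
        return self._content[stub.digest]

    def stored_bytes(self):
        return sum(len(data) for data in self._content.values())

def stub_large_attachments(mailbox, archive, min_size):
    """Replace attachments of at least min_size bytes with stubs; return bytes freed."""
    freed = 0
    for name, value in list(mailbox.items()):
        if isinstance(value, bytes) and len(value) >= min_size:
            mailbox[name] = archive.archive(value)
            freed += len(value)
    return freed

def read_attachment(mailbox, name, archive):
    """Return the content whether it is still local or has been stubbed."""
    value = mailbox[name]
    return archive.fetch(value) if isinstance(value, Stub) else value

archive = AttachmentArchive()
mailbox_a = {"report.pdf": b"x" * 1000, "note.txt": b"hi"}
mailbox_b = {"report.pdf": b"x" * 1000}

freed = (stub_large_attachments(mailbox_a, archive, 500)
         + stub_large_attachments(mailbox_b, archive, 500))
print(freed)                   # 2000 bytes removed from the mail stores
print(archive.stored_bytes())  # 1000 bytes kept once in the archive
print(read_attachment(mailbox_a, "report.pdf", archive) == b"x" * 1000)  # True
```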
If your company has more than one employee, it wouldn't be hard to build a business case for archiving. And if you're using backups as archives, you could be in for a pretty rough time when you get an electronic discovery request. Perhaps you should look at an e-mail archiving product or an enterprise content management product today.