Despite the growing interest in and adoption of disk-based backup, most companies still rely on tape for their daily backup and archival needs. Yet restoring critical files from tape is never easy.
Exacerbating the tape restore problem is that few companies proactively monitor, report on and remediate issues within their tape-based backup environments. Doing so requires a great deal of effort and manpower, as well as an understanding of the tape infrastructure. In many cases, time is the limiting factor, leaving risky restores as an unavoidable consequence. To minimize the chances of an unsuccessful or time-consuming restore, it's essential that you prepare by optimizing your backup infrastructure for recovering data. This involves developing best practices for the backup infrastructure and refining overall operational approaches.
Best practices for better restores
Critical factors that are related to restore performance include backup application configuration, network configuration, media management and the client environment during a restore. The following guidelines will help increase the likelihood of successful file restores.
It's important to reduce disk drive contention for restored data. During restores, you should disable applications that may be accessing the same disks to which the data is being restored. You should disable packet-reading software as well. Then there's virus-protection software, which, when set to its highest protection level, scans every incoming and newly created file. During a restore, the recovered files appear as new files and are scanned, which significantly slows the restore.
Some clients have too much data to back up over the network within an allocated backup window. For those hosts, backing up to dedicated tape drives can reduce the amount of time required to back up and recover data. Also, when possible, tune the network buffer size of the client's network card to match the tape drive buffer. This ensures that the recovered packets do not overrun or underfill the buffers. It also will help to modify the data transfer buffer sizes to match the tape drives. If data is sent in packets that are too small, the drives will end up spinning cycles waiting for data, and there will be empty space between data blocks on the tape. The further data is spread out on the tape, the longer it will take to restore.
You should also make sure you match the throughput of the host bus adapter to the drive throughput. If you attach 10 LTO-2 drives (30MB/sec each, for a total of 300MB/sec) to one 1Gb/sec host bus adapter (a theoretical maximum of 128MB/sec), data won't stream to the drives during backup. The sporadic nature of the data transfer will spread the data blocks across the tape, requiring even more time to restore the data.
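The arithmetic behind this mismatch is easy to check. The short sketch below uses only the figures quoted above; the function and variable names are illustrative, and the calculation ignores protocol overhead, so treat the result as a best case rather than a measured value.

```python
def max_streaming_drives(hba_mb_per_sec: float, drive_mb_per_sec: float) -> int:
    """Return how many drives one adapter can feed at full drive speed."""
    return int(hba_mb_per_sec // drive_mb_per_sec)


# Figures from the example above.
HBA_MB_PER_SEC = 128    # 1Gb/sec host bus adapter, theoretical maximum
DRIVE_MB_PER_SEC = 30   # one LTO-2 drive
DRIVES_ATTACHED = 10

demand = DRIVES_ATTACHED * DRIVE_MB_PER_SEC
streaming = max_streaming_drives(HBA_MB_PER_SEC, DRIVE_MB_PER_SEC)
per_drive_share = HBA_MB_PER_SEC / DRIVES_ATTACHED

print(f"Aggregate drive demand: {demand} MB/sec")
print(f"Drives one adapter can stream: {streaming}")
print(f"Share per drive with all {DRIVES_ATTACHED} attached: {per_drive_share:.1f} MB/sec")
```

With these figures, one adapter can keep four drives streaming, while 10 attached drives would each receive about 12.8MB/sec, well under half of what an LTO-2 drive needs.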
It's also important to expire your media regularly. While the type of media you use does not typically affect restore times, the condition of the media does. As media experiences more and more read/write passes, its integrity begins to break down, which can cause media errors. It's possible that data will be written to tape successfully, but then won't be readable because of the media's degradation. Be sure to clean your drives as well. If the backup fails because of dirty tape drives, all the preparation in the world won't help you.
This next tip may sound obvious, but you should make certain that the drive is available. The throughput of new tape drives often exceeds the total throughput of the data sent to the drives (slower networks or too few host bus adapters are typical causes). The result may be that the 10 new LTO-2 drives you just purchased actually run slower than the DLT7000 drives they replaced. There are several ways you can remedy this problem, but most of them involve additional expenditures, such as upgrading the network infrastructure or replacing the backup server with a higher-end system.
There is an alternative way to create a balanced tape, network and backup server infrastructure, which is to reduce the total number of drives being used during a backup. Not only will this improve backup performance, it also will leave some drives free for restores in case they are needed. If a restore request is received during the backup window and all the drives are actively backing up data, either the restore isn't performed until the backups finish, or active backup jobs are killed to handle the restore request.
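One way to reason about how many drives to run is sketched below. It assumes you have already measured the sustained throughput your network, backup server and host bus adapters can deliver to the drives (called `feed_mb_per_sec` here). The function name, the parameters and the one-drive restore reserve are illustrative assumptions, not settings from any backup product.

```python
def plan_drive_allocation(total_drives: int,
                          feed_mb_per_sec: float,
                          drive_mb_per_sec: float,
                          restore_reserve: int = 1) -> dict:
    """Split drives between backup jobs and a reserve kept free for restores.

    Backup drives are capped both by what the feed can keep streaming and by
    the number of drives left after setting aside the restore reserve.
    """
    if restore_reserve > total_drives:
        raise ValueError("restore_reserve cannot exceed total_drives")
    streamable = int(feed_mb_per_sec // drive_mb_per_sec)
    backup_drives = max(0, min(streamable, total_drives - restore_reserve))
    return {
        "backup": backup_drives,
        "free_for_restores": total_drives - backup_drives,
    }


# 10 drives at 30MB/sec each, fed at 128MB/sec (figures used earlier in the article).
plan = plan_drive_allocation(total_drives=10, feed_mb_per_sec=128, drive_mb_per_sec=30)
print(plan)
```

Under these assumptions, only four drives are used for backup, each able to stream at full speed, and the remaining six stay available for restore requests.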
If the data is critical to your business operations (and therefore has a short recovery time objective), you should consider implementing additional solutions, such as snapshot or raw-disk backups, to improve backup and recovery performance. Either of these processes may incur additional expenditures, but if the data truly is mission-critical, it becomes easier to justify the cost of going beyond traditional tape backup.
[Table: Data and backup device mapping matrix]
Operational issues
Backup and restore times can be greatly affected by the number and size of the files you are backing up. Millions of small files pose a serious challenge for traditional tape backup products. When written to tape using typical backup products, a large number of small files can significantly impair the performance of even the fastest drives. It's not uncommon to see backup and restore speeds of 50KB/sec to 100KB/sec on tape drives rated to run at 15MB/sec to 30MB/sec. This dramatically increases backup and restore times and at the same time decreases the lifespan of the tape drives and media.
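To put those rates in perspective, the sketch below computes how long it would take to move a data set at each speed. The 100GB data set size is a hypothetical value chosen only for illustration; the rates are the ones quoted above.

```python
def transfer_hours(data_gb: float, rate_kb_per_sec: float) -> float:
    """Hours needed to move data_gb gigabytes at a sustained rate in KB/sec."""
    data_kb = data_gb * 1024 * 1024  # 1 GB = 1024 * 1024 KB
    return data_kb / rate_kb_per_sec / 3600


DATA_GB = 100  # hypothetical data set size, chosen only for illustration

# Rates taken from the paragraph above, converted to KB/sec.
rates_kb_per_sec = {
    "small files, 50KB/sec": 50,
    "small files, 100KB/sec": 100,
    "rated speed, 15MB/sec": 15 * 1024,
    "rated speed, 30MB/sec": 30 * 1024,
}

for label, rate in rates_kb_per_sec.items():
    print(f"{label}: {transfer_hours(DATA_GB, rate):.1f} hours")
```

Under these assumptions, the small-file rates stretch the job to hundreds of hours, while the rated drive speeds finish it in roughly one to two hours.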
Knowing that these kinds of small data files exist in your environment will help you prepare to recover your data successfully. If you want to make the best use of your primary disk and tape backup destinations, a good first step is to classify data based on its characteristics (file size and file volume) and its volatility, which affects how much data moves during incremental or cumulative incremental backups. As a rule, disk subsystems provide optimal performance for large numbers of small files, while tape works best for small numbers of large files. The reason for this boils down to random vs. sequential access to data on the respective device types.
Classifying backup clients into groups based on data characteristics creates a logical basis for segregating backup workloads among different types of target storage devices (see "Data and backup device mapping matrix").
Creating a general matrix for mapping client types to device types is one practical way for backup administrators to optimize utilization of tape devices for backup and restore operations, in cases when disk is already part of the picture. Because every backup environment is unique in terms of its data and its hardware and software infrastructure, coming up with an ideal backup data classification system requires extensive planning, measurement and adjustment along the way.
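A first pass at such a classification can be sketched in a few lines. The thresholds, client names and the `ClientProfile` structure below are hypothetical and would need to be tuned against measurements from your own environment. For brevity, the sketch considers only file count and average file size, and leaves out volatility.

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune them against your own measurements.
SMALL_FILE_KB = 64       # average file size at or below this counts as "small"
MANY_FILES = 1_000_000   # file count at or above this counts as "many"


@dataclass
class ClientProfile:
    name: str
    file_count: int
    total_gb: float

    @property
    def avg_file_kb(self) -> float:
        """Average file size in KB (0.0 for a client with no files)."""
        if self.file_count == 0:
            return 0.0
        return self.total_gb * 1024 * 1024 / self.file_count


def target_device(client: ClientProfile) -> str:
    """Send many-small-file clients to disk and everything else to tape."""
    if client.file_count >= MANY_FILES and client.avg_file_kb <= SMALL_FILE_KB:
        return "disk"
    return "tape"


clients = [
    ClientProfile("fileserver01", file_count=5_000_000, total_gb=200),  # about 42KB per file
    ClientProfile("database01", file_count=200, total_gb=800),          # a few very large files
]
for c in clients:
    print(f"{c.name}: {target_device(c)}")
```

Here the file server, with millions of small files, maps to disk, while the database host, with a small number of very large files, maps to tape.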
[Table: Common causes for tape failures]
Problem management
Tape devices are among the most active devices in the data center and, as such, are very prone to different types of mechanical problems. The majority of mechanical failures occur during startup or shutdown cycles, as opposed to steady-state operational failures. In a large tape library environment, thousands of start/stop actions occur daily among the library robotics, drives and tape media elements. Inevitably, things will go wrong (see "Common causes for tape failures"). However, the disastrous consequences of a failed restore can be avoided if you actively practice proper systems management. Properly managing tape infrastructure issues and controlling the associated risks is key to this.
Too often, organizations don't realize that they will have a problem restoring data until it's already too late. "Too late" usually means that the data has been lost and now must be recovered from tape. Regular, random testing of your restore procedures can help you avoid this eleventh-hour problem. Restore testing will help you identify potential bottlenecks, problem clients or data sets, and breakdowns in your process. Identifying these obstacles before you actually need the data gives you sufficient time to tune your environment and prepare for the inevitable restore request. By simply integrating restore testing into the application delivery process, you gain preproduction exposure to potential recovery pitfalls. As data volumes grow, ongoing recovery testing should be given the same weight as software upgrade testing in any mission-critical application environment.
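A random restore test can be as simple as picking a sample of files, restoring them and comparing checksums against values recorded at backup time. The sketch below assumes a catalog that maps file paths to SHA-256 checksums and a `restore` callable that returns a file's contents. Both are hypothetical stand-ins for whatever your backup application provides, and the in-memory "tape" exists only to make the example runnable.

```python
import hashlib
import random
from typing import Callable, Dict, List, Optional


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def run_restore_test(catalog: Dict[str, str],
                     restore: Callable[[str], bytes],
                     sample_size: int,
                     seed: Optional[int] = None) -> List[str]:
    """Restore a random sample of cataloged files and return the paths that
    failed to restore or whose contents no longer match the catalog checksum."""
    rng = random.Random(seed)
    paths = rng.sample(sorted(catalog), min(sample_size, len(catalog)))
    failures = []
    for path in paths:
        try:
            data = restore(path)
        except Exception:  # any restore error counts as a failed test
            failures.append(path)
            continue
        if sha256_hex(data) != catalog[path]:
            failures.append(path)
    return failures


# Demonstration with an in-memory stand-in for the tape (hypothetical data).
original = {"/data/a.txt": b"alpha", "/data/b.txt": b"bravo", "/data/c.txt": b"charlie"}
catalog = {path: sha256_hex(content) for path, content in original.items()}

tape = dict(original)
tape["/data/b.txt"] = b"br@vo"  # simulate a file damaged on the media


def fake_restore(path: str) -> bytes:
    return tape[path]


failures = run_restore_test(catalog, fake_restore, sample_size=3, seed=1)
print("Failed restores:", failures)
```

In practice the sample would be a small fraction of the catalog, drawn fresh on each run, so that over time the tests touch different clients, data sets and tapes.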
Prepare for disk
Over the next several years, disk-based backup could displace tape as the primary backup medium. Until then, it is critical to make sure that your current tape infrastructure continues to support all of your recovery needs. Restores are not just about technology: Overall operational procedures are also essential to successful backup and restore operations.
Rapidly introducing disk to replace a poorly managed tape infrastructure is a bad idea because it may very well exacerbate existing issues in your environment, at high cost and great risk to your organization. Inadequate management or weak operational practices are often the root cause of instability in tape infrastructures. If inadequate practices are carried over to a disk technology base, the underlying management problems will manifest themselves not only as restore failures; they may also affect your entire backup infrastructure, seriously threatening the integrity of your data.