Another bottleneck may occur in organizations where the majority of clients still back up over an IP LAN. The processing overhead associated with pushing a large, sustained amount of data over an IP network interface card (NIC) can tax client CPUs. The CPU load created by the IP processing overhead is frequently associated with iSCSI performance issues, but it can also impact backup performance. New backup architectures, such as SAN-based backups, send the data files to shared FC SAN-attached devices using a more efficient protocol optimized for the large sustained data movement associated with backups. SAN-based backups also reduce the overall load on the LAN because data files are copied to the backup devices using the FC SAN.
File system characteristics can be a source of slow backup performance. File systems with millions of small files (which are becoming more common) usually back up more slowly because of the overhead associated with recording the meta data for each file on the backup server and the time it takes for the file system to look for changed files.
Typically, the overhead of recording meta data is negligible because the ratio of data to new files is very high. However, in systems with large numbers of small files, this ratio reverses and the overhead impacts overall performance. File systems can cause bottlenecks during incremental backups, when the backup client needs to check the file system to identify which files have changed.
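The reversal described above can be illustrated with a rough estimate. This is a minimal sketch, assuming backup time is simply transfer time plus a fixed cost per file; the function name, the 30MB/sec path and the 5 ms per-file cost are hypothetical values chosen for illustration, not measurements.

```python
def estimated_backup_seconds(total_bytes, file_count,
                             throughput_bytes_per_sec, per_file_overhead_sec):
    """Estimate backup duration as transfer time plus per-file meta data time."""
    transfer = total_bytes / throughput_bytes_per_sec
    meta_data = file_count * per_file_overhead_sec
    return transfer + meta_data


MB = 1024 ** 2
GB = 1024 ** 3

# Same 100 GB of data in both cases; only the number of files differs.
# Assumed values: 30 MB/sec data path, 5 ms of meta data work per file.
few_large = estimated_backup_seconds(100 * GB, 1_000, 30 * MB, 0.005)
many_small = estimated_backup_seconds(100 * GB, 10_000_000, 30 * MB, 0.005)

print(f"1,000 files:      {few_large / 3600:.1f} hours")
print(f"10,000,000 files: {many_small / 3600:.1f} hours")
```

Under these assumptions, the meta data term is a few seconds for the large-file case but dominates the total for the small-file case, even though the amount of data is identical.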
Two-tier backup
In a traditional client/server backup architecture, data sent from a client to a backup server moves through the backup server to the target devices (see "Two-tiered backup architecture"). In traditional IP-based architectures with a large number of clients, a tremendous amount of data will pass through a single backup server. Backup servers' CPUs, memory, NICs or internal I/O buses are frequently maxed out in larger environments.
The introduction of two-tier backup architectures allows much of the load associated with moving data from the client to the backup target to be offloaded to dedicated data movers (storage nodes/media servers). The centralized backup server is still responsible for managing all of the meta data and shared library/robot control. The Network Data Management Protocol (NDMP) (see "NDMP speeds backup traffic") lets NAS appliances act as data movers, minimizing backup-generated LAN traffic.
At the far end of the backup data path, tape drives are often the focus of backup bottlenecks. Newer tape drives are fast, with some exceeding 30MB/sec. It isn't uncommon for a tape drive to achieve higher throughput rates than disk drives. But achieving maximum tape drive throughput depends on sufficient amounts of data being sent to the tape drive to sustain data streaming. If insufficient data is sent to the tape drive, back-hitching occurs, which greatly reduces overall throughput.
If too much data is sent to the tape drive, then the drive once again becomes a bottleneck. The amount of data written to a tape device is usually controlled by adjusting the number of simultaneous write sessions (also called multiplexing) to each tape device (see "Disk-based backups are more forgiving"). The downside to multiplexing is that restore performance is decreased because backup sets are interleaved on the tape.
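A back-of-the-envelope way to pick a starting multiplexing level is to divide the drive's streaming rate by what a typical client can deliver. The sketch below assumes every client sustains the same steady rate, which real clients rarely do; the function name and the sample rates are hypothetical.

```python
import math


def streams_to_keep_drive_streaming(drive_mb_per_sec, client_mb_per_sec):
    """Smallest number of write sessions whose combined rate meets the drive's rate."""
    if drive_mb_per_sec <= 0 or client_mb_per_sec <= 0:
        raise ValueError("rates must be positive")
    return math.ceil(drive_mb_per_sec / client_mb_per_sec)


# Assumed example: a 30 MB/sec drive fed by clients that each deliver 8 MB/sec.
print(streams_to_keep_drive_streaming(30, 8))   # 4
print(streams_to_keep_drive_streaming(30, 10))  # 3
```

Fewer sessions than this risks back-hitching; more sessions add no throughput because the drive is already saturated, and they make restores slower by interleaving more backup sets.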
A frequent backup mistake is letting backup clients temporarily mount tape drives through a shared target software option in an attempt to improve backup throughput. Typically, the throughput for the one client is improved because the tape drive is temporarily dedicated to a client. This eliminates tape drive contention and allows data to be moved to the target tape drives through the FC SAN, while eliminating the IP processing-intensive overhead. But improving backup speed for one client may decrease overall throughput to the tape drive because only a single backup client is writing to the drive. In this situation, the greater good of all systems may be sacrificed to benefit a few systems.
Eliminate backup windows
A different approach to solving the backup window problem is to create a snapshot or point-in-time (PIT) copy of data used for backup purposes. Once the PIT copy is created, normal data processing can resume because an image of the quiesced system has been captured.
Snapshots create a virtual copy of data. It's called "virtual" because the second copy is created only if blocks are changed after the copy was initiated. Because most data in a volume doesn't change daily, these snapshot copies don't take up a large amount of disk space. The additional disk space required is equal to the amount of changed data. Snapshot creation is typically scripted; it takes less than a minute to create the virtual copy of a volume. Once the copy is made, the primary data volume is available again for changes without any impact to the backups. Snapshots address two critical backup window issues:
- They provide the ability to resume data processing without fear of having open files skipped because a quiesced system is required only while the snapshot is created.
- You can start the next night's backup even though the backup from the night before is still running because each night's backup image is captured on a separate snapshot image.
Snapshot creation commands are often integrated into applications, such as databases, to allow for temporary quiescing of the application, creation of the snapshot and resumption of the application.
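The "virtual copy" behavior described above can be modeled with a toy copy-on-write volume. Copy-on-write is one common way to implement snapshots, but it is an assumption here, and the `Volume` class and its methods are hypothetical names used only for illustration.

```python
class Volume:
    """Toy block volume whose snapshots store only blocks changed after creation."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []

    def create_snapshot(self):
        # The snapshot starts empty: no data is copied at creation time.
        snap = {}  # block index -> contents at the moment of the snapshot
        self.snapshots.append(snap)
        return snap

    def write(self, index, data):
        # Preserve the old contents the first time a block changes after a snapshot.
        for snap in self.snapshots:
            if index not in snap:
                snap[index] = self.blocks[index]
        self.blocks[index] = data

    def read_snapshot(self, snap, index):
        # Changed blocks come from the snapshot, unchanged ones from the live volume.
        return snap[index] if index in snap else self.blocks[index]


vol = Volume(["a", "b", "c", "d"])
snap = vol.create_snapshot()
vol.write(1, "B")  # production keeps changing data after the snapshot

frozen = [vol.read_snapshot(snap, i) for i in range(len(vol.blocks))]
print(frozen)      # ['a', 'b', 'c', 'd']
print(vol.blocks)  # ['a', 'B', 'c', 'd']
print(len(snap))   # 1
```

The backup reads the frozen view while production continues to write, and the extra space consumed equals the number of changed blocks, one in this example.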
PIT copies, sometimes referred to as clones, are similar to snapshots because they also quickly create a copy of data that can be used for static backup images. The advantage of PIT copies is that a full copy of the data resides on a completely separate volume, whereas snapshots copy only the changed portion of a volume. This helps address the issue of resource contention caused by backups on production servers. The PIT copy volumes can then be mounted to surrogate clients. Disk drives and all client system resources driving the backups are separate from production disk drives and system resources. Like snapshots, PIT copy creation can be integrated into apps, ensuring clean copies of data quickly.
The size of the volumes typically has a minimal impact on how long it takes to create snapshots or PIT copies; therefore, this method scales easily as storage capacity grows. The downside of PIT copies is the higher cost associated with the increased disk capacity needed to create full copies of data on different drives.
Although snapshots and PIT copies reduce backup duration, another big benefit is the effect on restores: If a user is looking to retrieve data from last night's backups, the data doesn't have to be retrieved from tape because all the information resides on disk.
Restore performance issues
In optimizing data transfer for quick backups, admins may unknowingly create restore performance issues. This is particularly true when trying to reduce backup window durations by decreasing the ratio of fulls to incrementals, or by increasing the number of streams multiplexed to a tape drive target.
When decreasing the full to incremental ratio, restores may require more tape volumes to be mounted and read. In one incremental-forever scenario, we saw a customer recall more than 1,000 tape volumes to restore a single system. Most incremental-forever systems have controls to limit the number of tapes that a single file system can be spread across, but this requires additional configuration and data movement cycles. Fortunately, this data movement isn't typically associated with a backup window because it doesn't impact client operations. In addition, the proliferation of disk-based targets makes reading from a large number of incremental backups a non-issue (due to the random access nature of disk).
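To see why fewer fulls mean more media per restore, consider which backup images a complete restore has to read. This is a minimal sketch, assuming true incrementals (each holds only changes since the previous backup of any kind) rather than cumulative ones; the function name and the history format are hypothetical.

```python
def images_needed_for_restore(history):
    """Return the latest full backup plus every incremental taken after it."""
    for i in range(len(history) - 1, -1, -1):
        if history[i] == "full":
            return history[i:]
    raise ValueError("no full backup in history")


# Assumed schedules: one full followed by daily incrementals.
weekly = ["full"] + ["incr"] * 6
monthly = ["full"] + ["incr"] * 29

print(len(images_needed_for_restore(weekly)))   # 7
print(len(images_needed_for_restore(monthly)))  # 30
```

If each image lands on a different tape volume, the number of mounts grows with the length of the incremental chain, which is why incremental-forever products consolidate data in the background.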
Multiplexing a large number of backup streams to a single tape drive causes restores to degrade. Because multiplexing intersperses data from one client with another, the sequential nature of tape often requires reading all data on the tape to retrieve the bits associated with a single file system. Depending on the tape technology and priority of restore speed, it's not uncommon to have four or more streams multiplexed simultaneously to a single tape drive.
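The restore penalty from interleaving can be shown with a simple model. The sketch below assumes blocks are written round-robin and that a restore reads the tape sequentially from the start with no ability to skip; both are simplifications, and the function names are hypothetical.

```python
def multiplexed_tape(streams):
    """Interleave blocks from each client's stream round-robin onto one tape."""
    tape = []
    longest = max(len(blocks) for blocks in streams.values())
    for i in range(longest):
        for client, blocks in streams.items():
            if i < len(blocks):
                tape.append((client, blocks[i]))
    return tape


def blocks_read_for_restore(tape, client):
    """Count tape blocks read from the start through the client's last block."""
    last = max(i for i, (owner, _) in enumerate(tape) if owner == client)
    return last + 1


# Four clients, four blocks each, multiplexed to a single drive.
streams = {name: [f"{name}{n}" for n in range(4)] for name in "abcd"}
tape = multiplexed_tape(streams)

print(len(tape))                           # 16
print(blocks_read_for_restore(tape, "a"))  # 13
```

Restoring client "a" needs only four blocks, yet the drive passes over 13 of the 16 on the tape. Written alone, the same four blocks would have been contiguous.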
Where to start
If a large percentage of backups are running slowly, causing them to run beyond the desired backup window, look at components shared by all of the clients. This includes the backup network, backup servers and backup target device (tape or disk). If only a small percentage of clients are exceeding the desired backup window, look at client-side issues first. Start with just a few clients. What you learn from the first few will likely help troubleshoot the others.
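The triage rule above can be written down as a short decision function. This is only a sketch: the text does not define "large percentage," so the 50% threshold is an assumption, as are the function name and the input format.

```python
def where_to_start(client_over_window, threshold=0.5):
    """Suggest where to look first.

    client_over_window maps a client name to True if its backup overran
    the window. threshold is an assumed cutoff for "a large percentage."
    """
    if not client_over_window:
        raise ValueError("no clients supplied")
    slow = sorted(name for name, over in client_over_window.items() if over)
    fraction = len(slow) / len(client_over_window)
    if fraction >= threshold:
        # Most clients are slow, so suspect the components they share.
        return "shared", ["backup network", "backup servers", "backup target device"]
    # Only a few clients are slow, so start with a handful of them.
    return "client", slow[:3]


results = {"db01": True, "web01": False, "web02": False, "file01": False}
print(where_to_start(results))  # ('client', ['db01'])
```

The three-client limit in the second branch reflects the advice to start with just a few clients; it is an arbitrary number, not a recommendation from any backup product.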