The following tips will help lessen your big file backup problems.
It's a long-standing problem: As data piles up on a server, completing a successful backup becomes harder. Backup apps become bogged down with millions of files to examine, and network and CPU limits can stall throughput when transferring a gigantic file. Even if a backup job is successful, the data in a large file may have changed in the hours it took to create the backup image. Vendors and users are now applying new ideas and technologies to ensure that no data set is too big to back up.
The rapid creation and accumulation of stored data has pushed traditional backup approaches to their breaking point. Large amounts of storage capacity and advances in processing power have led users to believe that a virtually unlimited amount of data can be stored and protected, but most backup managers will admit that it's just not so. While tape drives have become larger and faster, and new technologies like LAN-free and disk-based backup have reduced the load, the old approach of "scan everything every night and back up what has changed" is failing.
Storage devices are limited by their interfaces. A fast disk drive or gigabit Ethernet network can transfer only a few dozen megabytes of data per second, and most are far slower. At that speed, copying the entire contents of a 300GB disk drive takes a few hours at best, even if no other factors are involved.
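The arithmetic behind that estimate can be sketched in a few lines. The 300GB capacity comes from the text; the three sustained rates are assumed values chosen to represent "a few dozen megabytes per second," and 1GB is treated as 1,000MB.

```python
def transfer_hours(capacity_gb: float, rate_mb_per_s: float) -> float:
    """Hours needed to copy capacity_gb at a sustained rate (1GB = 1,000MB)."""
    seconds = capacity_gb * 1000 / rate_mb_per_s
    return seconds / 3600


# Assumed sustained rates; real throughput is often lower.
for rate in (20, 30, 40):
    print(f"{rate} MB/s: {transfer_hours(300, rate):.1f} hours")
# 20 MB/s: 4.2 hours
# 30 MB/s: 2.8 hours
# 40 MB/s: 2.1 hours
```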
Backup systems mitigate this problem in a number of ways. Most examine the contents of the drive and copy only what has changed since the last backup, and this incremental backup approach can greatly reduce backup times. Multiple backup jobs can also run at the same time, taking advantage of the fact that the servers and disks backed up each night have their own interfaces. If that isn't enough, extra network connections, backup servers and tape drives can be added. These approaches have traditionally kept the world of backup afloat, but things are changing.
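As a rough illustration of the incremental idea, the sketch below selects files whose modification time is later than the previous backup. It is not how any particular backup product works: the function name `changed_since` is hypothetical, and real products typically compare files against a catalog rather than relying on timestamps alone.

```python
import os


def changed_since(root: str, last_backup_time: float) -> list[str]:
    """Return files under root modified after last_backup_time (epoch seconds)."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_mtime > last_backup_time:
                    changed.append(path)
            except OSError:
                # File vanished or is unreadable; skip it in this sketch.
                continue
    return sorted(changed)
```

Note that even this simple approach still has to visit every file to decide what changed, which is why scan time grows with file count even when little data is copied.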
Tom Woods, backup supervisor at Ford Motor Co., was faced with a monumental backup task that challenged traditional approaches. "We had one system with 18 million files and we had to back it up every four hours," says Woods, "but just scanning that many files with TSM [IBM Corp. Tivoli Storage Manager] took five hours." Another system had massive database files that took hours to stream to tape, and applications had to be quiesced to keep changing data from ruining the usefulness of the backup copy. Finally, Woods had to consider whether his backup copies could be restored in a timely fashion. "The NDMP [Network Data Management Protocol] method we tried with our NAS servers was reliable, but restoring data at 200GB per hour meant it would take four or five days to recover," he explains.
The problem with big backups
No matter how you slice it, backing up big file systems is a problem:
- Backup applications need a few moments to examine each file and determine if it should be backed up or not, and another moment to store a record of each backup in the database. Multiply these moments by a few million, and they add up quickly.
- Massive files generally can't be backed up in parallel, and traditional backup approaches copy them in their entirety even if just a few bytes have changed.
- Even if you can wait for the backup to complete, the backup copy might not be consistent with the latest copy of the file.
- Data is backed up so that it can be restored, but many methods for speeding backups make recovery time unacceptably long.
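The first point is easy to quantify. Using the Ford figures quoted earlier (18 million files, a five-hour scan), the sketch below works out the implied per-file cost and projects it onto a larger file count. The 50-million-file system is hypothetical, and the projection assumes the per-file cost stays constant, which may not hold in practice.

```python
def per_file_ms(file_count: int, scan_hours: float) -> float:
    """Average milliseconds spent per file for a scan of the given length."""
    return scan_hours * 3600 * 1000 / file_count


def scan_hours(file_count: int, ms_per_file: float) -> float:
    """Hours needed to scan file_count files at a fixed per-file cost."""
    return file_count * ms_per_file / 1000 / 3600


cost = per_file_ms(18_000_000, 5)
print(f"Per-file cost: {cost:.1f} ms")                      # Per-file cost: 1.0 ms
print(f"50 million files: {scan_hours(50_000_000, cost):.1f} hours")  # 13.9 hours
```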
Consistency and timeliness
Some backup systems handle these issues better than others, but all will have difficulty when faced with a single file system with millions of files or hundreds of gigabytes of data to back up. Point-in-time consistency across the files in a large file system isn't always required, but it can be critical: If an application keeps running during an eight-hour backup, the files it is using may no longer match the copies already captured in the backup. The prime solution to the problem of consistency is to cheat the constraints of time by creating a snapshot copy of the data to be backed up. Leveraging the technology included in many storage arrays and OSes, a snapshot-based backup can freeze the data set at a point in time and copy it to tape at its leisure. This technology ensures that the entire set of files to be backed up is consistent with respect to changes over time. But snapshot technology isn't a native component of a backup application; the particular type used must be supported by the backup system, or custom scripting is required.
A massive number of files
Sheer numbers can overwhelm any backup product (see "The problem with big backups," above). Sean O'Mahoney, manager of client/server information systems at Norton Healthcare in Louisville, KY, saw his Meditech Electronic Medical Record (EMR) file server grow to contain more than 25 million files in 1.3 million directories. "It took almost five hours just for Windows to count the files," explains O'Mahoney, "but we have trimmed the backup time for this half-terabyte LUN to around three hours." The fix was a simple one: Ignore the files and dump raw disk blocks to tape. Although it lacks an index of files, this solution fits fine because all of those files are part of a single massive app.
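The block-level idea can be sketched as a loop that reads fixed-size chunks from a source and writes them to a target without ever looking at individual files. This is only an illustration: the function name `copy_blocks` and the 1MB block size are assumptions, and in-memory streams stand in for the LUN and the tape device.

```python
import io


def copy_blocks(source, target, block_size: int = 1024 * 1024) -> int:
    """Copy source to target in fixed-size blocks; return the number of blocks."""
    copied = 0
    while True:
        block = source.read(block_size)
        if not block:
            break  # end of the source device or stream
        target.write(block)
        copied += 1
    return copied


# In-memory stand-ins for a raw volume and a tape device.
volume = io.BytesIO(b"x" * 2500)
tape = io.BytesIO()
print(copy_blocks(volume, tape, block_size=1024))  # 3
```

Because the loop never enumerates files, its run time depends on the volume size rather than the file count. The trade-off, as noted above, is that there is no per-file index to restore from.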
Another Norton Healthcare unit needed a different fix, however. It uses an EMC Corp. Celerra to house personal home directories and departmental shares, amounting to 5.5 million files and 2.5TB of data. "On this server, we split the backups into sections to shrink the backup window," says O'Mahoney. "Incremental backups now take under an hour, and we have recently received a recommendation from EMC to increase parallelism [to run several backup processes at the same time], so we hope to reduce the full backup time from the eight hours it takes today."
When faced with a massive number of files, splitting them up and running multiple backup jobs in parallel can be a big help if your client and backup server can handle the load. Bill Mote, systems engineer at Cincinnati-based Making Everlasting Memories L.L.C., saw huge benefits when parallelizing backup on a server containing millions of image files. "We split the directory tree into 10 smaller ones to improve performance, manageability and scalability for our application," says Mote. "Now we can point the backup application at a subset of the total data and run multiple jobs in parallel."
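A minimal sketch of the split-and-parallelize approach follows. It treats each top-level directory as its own job and runs several jobs at once. The names `backup_subtree` and `run_parallel_jobs` are hypothetical, the "backup" is only a stand-in that totals files and bytes, and a thread pool is used for simplicity where a real deployment would launch separate backup jobs.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def backup_subtree(path: str) -> tuple[str, int, int]:
    """Stand-in for one backup job: walk a subtree and total its files and bytes."""
    files = size = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            try:
                size += os.path.getsize(os.path.join(dirpath, name))
                files += 1
            except OSError:
                continue  # skip files that disappear mid-scan
    return path, files, size


def run_parallel_jobs(root: str, max_jobs: int = 4) -> list[tuple[str, int, int]]:
    """Run one job per top-level directory under root, max_jobs at a time.

    Files stored directly in root are not covered by this sketch.
    """
    subtrees = sorted(entry.path for entry in os.scandir(root) if entry.is_dir())
    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        return list(pool.map(backup_subtree, subtrees))
```

The `max_jobs` limit reflects the caveat in the text: parallelism only helps while the client and backup server can handle the load.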
IBM's TSM offers a special option, called journaling, for backing up a Windows client. "TSM normally examines each file and compares it to the database, creating a list of files to be backed up," explains John Haight, master consultant at Forsythe Solutions Group. "Although this is a very efficient process in general, when presented with millions of files it can take hours. In the case of journaling, the client keeps track of which files have changed and notifies the backup server, rather than scanning the whole file system." When journaling is enabled, the results can be dramatic. "The time required to scan our imaging system dropped from over 24 hours to just a few minutes when we tried TSM journaling," says Mote at Making Everlasting Memories.
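The principle can be shown with a toy change journal. This is not TSM's implementation: the `ChangeJournal` class and its methods are hypothetical, and the sketch only shows why a backup that consumes a list of recorded changes does work proportional to the number of changes rather than the number of files.

```python
class ChangeJournal:
    """Toy journal: records changed paths as they happen, so no scan is needed."""

    def __init__(self) -> None:
        self._changed: set[str] = set()

    def record_change(self, path: str) -> None:
        """Called whenever a file is created or modified."""
        self._changed.add(path)

    def drain(self) -> list[str]:
        """Return the paths to back up and reset the journal."""
        pending = sorted(self._changed)
        self._changed.clear()
        return pending


journal = ChangeJournal()
journal.record_change("/data/img/0001.jpg")
journal.record_change("/data/img/0002.jpg")
journal.record_change("/data/img/0001.jpg")  # repeated changes are recorded once
print(journal.drain())  # ['/data/img/0001.jpg', '/data/img/0002.jpg']
print(journal.drain())  # []
```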
But there are some issues with this technology. "Files that are deleted aren't cleaned up with TSM journaling, so a normal incremental is needed occasionally to clean this up," says Haight. Some experts also voice concern that the journal might be deleted in some instances, forcing a complete file system scan. And the technology is limited to a single OS and backup app.
Big files? No problem
Large files are easy to scan, but sending them to tape can take a great deal of time. Many options exist to speed the streaming of a large file to tape (see "Quick fixes for large backups," above), but some applications allow more intelligent and integrated backups. Agents can be used to extract native records from large databases, enabling incremental and online backups. These often have the added benefit of enhancing the usefulness of restored data, as smaller logical pieces can be recovered. For example, individual mailboxes or messages can be extracted using a native software agent, rather than attempting to back up the entire message store every day.
Ford Motor Co.'s Woods used Oracle Corp.'s Recovery Manager (RMAN) technology to allow the database to interact directly with the backup system. RMAN enables many advanced features, including multiplexing of backup data streams, encryption, compression, integration with snapshots and the ability to "freeze" database activity for a consistent copy. Woods streams this data to disk for maximum performance, copying it to tape later as needed.
There's no quick fix to the problem of large backups, but there are many effective approaches. If you have a large number of infrequently changed files, consider splitting the backup job to speed the scanning process. If these files need to get to tape, use a snapshot to gain extra time (see "Use snapshots," below). And if a few large files clog up the queue, check to see if there's an agent that can accelerate the process. Big backups need not jeopardize your data protection.
Use snapshots
When faced with the challenge of massive backups, not everyone agrees that traditional backup methods are the right approach. "Skip the traditional backup to tape, since these take forever to complete, especially with file systems with millions of files," suggests Edwinder Singh, business manager, Datacentre Solutions Group at Datacraft Asia of Singapore. "Use a snapshot or clone instead." This approach is gaining favor among those users with backup challenges. Snapshots are quick and have little or no impact on clients, as they usually leverage the resources of the storage array. "Snapshots take up less space than clones or mirrors, since they use pointers to the production volume," explains Singh. So a storage system can retain a large number of snapshots for point-in-time reference.
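The pointer idea can be illustrated with a toy copy-on-write model. Copy-on-write is one common snapshot design and is an assumption here, since the text says only that snapshots use pointers; the `Volume` and `Snapshot` classes are hypothetical and model blocks as list entries.

```python
class Snapshot:
    """Point-in-time view that stores only blocks overwritten after its creation."""

    def __init__(self, volume: "Volume") -> None:
        self.volume = volume
        self.saved: dict[int, str] = {}

    def preserve(self, index: int, old_data: str) -> None:
        # Keep the first (original) version of a block only.
        self.saved.setdefault(index, old_data)

    def read(self, index: int) -> str:
        # Unchanged blocks are read straight from the production volume.
        if index in self.saved:
            return self.saved[index]
        return self.volume.blocks[index]


class Volume:
    """Production volume that copies old block contents to snapshots on write."""

    def __init__(self, blocks: list[str]) -> None:
        self.blocks = list(blocks)
        self.snapshots: list[Snapshot] = []

    def snapshot(self) -> Snapshot:
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, index: int, data: str) -> None:
        for snap in self.snapshots:
            snap.preserve(index, self.blocks[index])
        self.blocks[index] = data


vol = Volume(["a", "b", "c"])
snap = vol.snapshot()
vol.write(1, "B")
print(snap.read(1), vol.blocks[1], len(snap.saved))  # b B 1
```

In this model the snapshot holds one saved block out of three, which is why snapshots are cheap. It also still reads the other two blocks from the production volume, so it is not an independent copy of the data.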
But there are some negative aspects to this approach. Snapshots lack the file catalog commonly found in backup applications and restoring files is a manual process, which makes recovery much more difficult. Singh also points out that "accessing your snapshots will affect the production volumes, as they both refer to the same set of data." This can be mitigated, but "losing your primary data in a disk crash will render your snapshots useless," he says, just when backups are needed most.