Scheduling backups is more of an art form than a science. It takes creativity to take a bird's-eye view of the backup and recovery infrastructure and then assign specific clients to designated storage nodes or media servers so that their backups meet their SLAs. Although scheduling appears simple, I have participated in more than a few storage area network (SAN) assessments where the recommendation was to rework the backup schedule instead of purchasing additional hardware.
Most modern enterprise backup packages use a multi-tier architecture in which some backup servers kick off jobs and collect metadata about the backed-up data, while others perform the actual job of managing the data being streamed to the tape drives. Most modern packages also allow administrators to create regular collections of backup clients - whether desktops or servers - that can be backed up together.
Using Legato's NetWorker as a sample product, scheduling clients for backup starts with creating NetWorker's fundamental submission mechanism: a group. A group is an object with user-defined characteristics that provides direction and applies policies to the clients within it. For example, you may have a group of Microsoft Exchange servers that all have the same backup window and should be written to the same tape media pool. When they belong to a single Legato group, these servers are backed up at the same time and share the same set of tapes within the designated pool.
Your schedule has to work around the capacity of your backup infrastructure. Let's work through a typical configuration that uses one NetWorker Server, three NetWorker Storage Nodes, an enterprise-class tape library using StorageTek (STK) 9840 tape drives and more than 100 Legato Clients (see "Building a backup infrastructure" sidebar).
Building a backup infrastructure
With modern backup software, a small number of servers can control backup schedules and metadata collection for a large number of clients and servers with attached tape drives. This example shows Legato's NetWorker package, but other major packages use similar architectures.
To determine the maximum number of clients or file systems that can be streamed concurrently to a single tape drive, set the number of allowable target sessions to a deliberately high number, say 15. This tells NetWorker to assign up to 15 file systems or volumes - depending on the OS - to the tape drive before it starts queuing.
Next, the administrator should start full backups of two clients that would normally run concurrently - in the same group - and then watch the real-time transfer rate of the tape drive in the NetWorker display. The administrator should then keep submitting one additional client system at a time until the observed transfer rate starts to level off or degrade. At that point, you have reached your data-specific limit on the number of clients that any one of your tape drives can support concurrently. Multiplying the result by the number of tape drives in your library yields the total number of clients that can be active at any one time.
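The ramp-up test above amounts to looking for the point where one more client no longer buys a meaningful gain in transfer rate. The sketch below is one way to read that point off a list of readings noted down by hand. The function name, the 5% gain threshold and the sample rates are assumptions made for illustration; they are not NetWorker features or measured figures.

```python
def find_saturation_point(rates_mb_s, min_gain=0.05):
    """Return the client count after which adding a client stops helping.

    rates_mb_s[i] is the drive transfer rate observed with i + 1
    concurrent clients. min_gain is an assumed threshold: a new client
    must raise the rate by at least this fraction to count as a gain.
    """
    if not rates_mb_s:
        raise ValueError("need at least one observation")
    for i in range(1, len(rates_mb_s)):
        previous, current = rates_mb_s[i - 1], rates_mb_s[i]
        if current < previous * (1 + min_gain):
            # The rate leveled off or degraded when client i + 1 was added,
            # so i clients is the practical per-drive limit.
            return i
    # The rate was still climbing at the last observation.
    return len(rates_mb_s)


# Invented readings for 1..5 concurrent clients, for illustration only.
observed = [4.0, 7.5, 9.6, 9.7, 9.1]
clients_per_drive = find_saturation_point(observed)
print(clients_per_drive)
```

With these invented readings the rate stops improving when the fourth client is added, so the per-drive limit comes out as three clients. Repeat the measurement for each data type, because the curve will differ.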
Keep in mind that, by default, a backup client sends four file systems or volumes to the storage node at a time, and each one consumes a target session on the tape drive. So if you determine that with StorageTek's 9840 tape drive you can submit up to three backup clients before the transfer rate starts to level off, that equates to a total of 12 file systems, or target sessions, per tape drive. And if your tape library is outfitted with eight 9840s, that equates to a maximum of 96 concurrent target sessions, or 24 concurrent clients of the same data type. Remember to complete this exercise for each of the data types stored in your environment, because the high-water mark is likely to change with the data type. Keep that in mind when adding client systems to their respective groups.
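The arithmetic in this example is easy to rerun for other drive counts or data types. A minimal sketch, using the figures from the text (three clients per drive, four sessions per client, eight drives); the function and key names are made up for illustration:

```python
def library_capacity(clients_per_drive, sessions_per_client, drives):
    """Work out concurrent session and client limits for a tape library."""
    sessions_per_drive = clients_per_drive * sessions_per_client
    return {
        "sessions_per_drive": sessions_per_drive,
        "total_sessions": sessions_per_drive * drives,
        "total_clients": clients_per_drive * drives,
    }


# Figures from the example: 3 clients per 9840 drive, 4 sessions per
# client (the default), 8 drives in the library.
capacity = library_capacity(clients_per_drive=3, sessions_per_client=4, drives=8)
print(capacity)
# {'sessions_per_drive': 12, 'total_sessions': 96, 'total_clients': 24}
```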
If you're fortunate, you have a separate back-end Gigabit Ethernet network that aggregates the client networks into your backup and recovery infrastructure. That keeps backup traffic off the production LAN and can also be leveraged when implementing network-attached storage (NAS) or iSCSI solutions in the future.
Usually, the backup clients' LAN configuration consists of a number of 100Mb/s LAN segments aggregated up to a Gigabit LAN, where the Legato Storage Nodes sit and field the backup data from the clients. When defining groups, you'll also want to consider the maximum throughput of the client networks being staged up to the Gigabit back-end network. In theory, a Gigabit interface carries ten times the traffic of a 100Mb/s link, so no more than 10 backup clients, each streaming at the full 100Mb/s, could feed the storage node's Gigabit interface at once. However, the real number is lower because of various performance losses in the network.
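The network ceiling can be estimated the same way. The sketch below divides the uplink bandwidth by the per-client bandwidth after applying a usable-bandwidth percentage. The 70% figure is purely an assumed placeholder for "various performance losses"; measure your own network before relying on any such number.

```python
def max_concurrent_clients(uplink_mbps, client_mbps, usable_pct=100):
    """Estimate how many full-rate clients one storage node uplink can take.

    usable_pct is an assumed percentage of nominal bandwidth that is
    actually available for backup data; integer math keeps the result exact.
    """
    if client_mbps <= 0 or not 0 < usable_pct <= 100:
        raise ValueError("client_mbps must be positive and usable_pct in 1..100")
    return (uplink_mbps * usable_pct) // (client_mbps * 100)


theoretical = max_concurrent_clients(uplink_mbps=1000, client_mbps=100)
derated = max_concurrent_clients(uplink_mbps=1000, client_mbps=100, usable_pct=70)
print(theoretical, derated)
# 10 7
```

The smaller of this network limit and the tape drive limit is the one that should shape the group.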
Client data is an important consideration because the type and size of the files that reside on the client can drastically affect the performance of the backup. For example, an application repository that houses lots of small files takes the backup application significantly longer to process, because it must interrogate the inodes of millions of files, than a huge Oracle data warehouse with only a couple hundred large files. Also relevant is the number of open() and close() system calls each data type requires. On clients with millions of small files, the backup spends a greater percentage of its time opening and closing files instead of actually sending data, which lowers throughput.
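A rough way to see the effect is to model backup time as a fixed cost per file (inode lookup plus open() and close()) added to the time needed to stream the bytes. This is a deliberately simplified sketch, not how NetWorker accounts for time, and the 2 ms per-file cost and 10 MB/s stream rate are assumed values chosen only to make the comparison visible.

```python
def estimated_backup_seconds(file_count, total_bytes,
                             per_file_overhead_s, stream_bytes_per_s):
    """Simplified model: per-file overhead plus raw streaming time."""
    overhead = file_count * per_file_overhead_s
    streaming = total_bytes / stream_bytes_per_s
    return overhead + streaming


GB = 10**9
MB = 10**6

# Same 100 GB of data in both cases; only the file count differs.
small_files = estimated_backup_seconds(5_000_000, 100 * GB, 0.002, 10 * MB)
large_files = estimated_backup_seconds(200, 100 * GB, 0.002, 10 * MB)

print(f"millions of small files: {small_files / 3600:.1f} h")
print(f"a few hundred large files: {large_files / 3600:.1f} h")
```

Under these assumptions the small-file client takes roughly twice as long to back up the same volume of data, which is why it helps to run the drive saturation test separately for each data type.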
Grouping backup clients
Now you can start to combine clients into their related groups. Within the group, provision eight clients to each of the storage nodes, creating a stripe effect similar to the layout of a disk file system. In this configuration, one group of 24 clients starts at once, and each of the three storage nodes fields data from eight of those clients and writes it to tape.
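The striping described here is a round-robin assignment of clients to storage nodes. A minimal sketch of that layout follows; the client and node names are hypothetical, and the result is only a planning aid, since the actual assignment is made in each client's NetWorker configuration.

```python
def stripe_clients(clients, storage_nodes):
    """Assign clients to storage nodes round-robin, like a disk stripe."""
    if not storage_nodes:
        raise ValueError("need at least one storage node")
    assignment = {node: [] for node in storage_nodes}
    for index, client in enumerate(clients):
        node = storage_nodes[index % len(storage_nodes)]
        assignment[node].append(client)
    return assignment


# Hypothetical names: one group of 24 clients across three storage nodes.
clients = [f"client{n:02d}" for n in range(1, 25)]
nodes = ["node-a", "node-b", "node-c"]
layout = stripe_clients(clients, nodes)

for node, members in layout.items():
    print(node, len(members), members[:3])
```

Each node ends up with eight clients. If the clients differ widely in size, ordering the list by data volume before striping spreads the large ones across nodes instead of stacking them on one.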
The higher the number of concurrently active clients, the greater the number of updates the NetWorker Server has to perform during the backups. That number climbs even further when backing up clients with millions of small files.
Large data centers often use an enterprise scheduler to make the backup of a client system dependent on some other event inside that system. For example, suppose an Oracle stored procedure executes daily on one of your client systems and has an inconsistent end time. Sooner or later, a backup will start before the stored procedure has completed, possibly rendering the backup copy useless. An enterprise scheduler can prevent the backup from kicking off until the stored procedure completes. However, when scheduling clients outside of the backup software's native scheduler, check with your backup and recovery vendor to ensure that all of your client information is still backed up.
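The dependency an enterprise scheduler enforces can be pictured as a simple gate: poll for the upstream job's completion and start the backup only once it has finished. The sketch below is a generic illustration, not the interface of any particular scheduler or of NetWorker. Both callables are hypothetical stand-ins that you would supply, for example a check for a completion marker written by the stored procedure and a wrapper around whatever command your vendor supports for starting a group.

```python
import time


def run_when_ready(is_upstream_done, start_backup,
                   poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Start the backup only after the upstream job reports completion.

    is_upstream_done and start_backup are caller-supplied callables
    (hypothetical stand-ins here). Returns True if the backup was started,
    False if the upstream job never finished within max_polls checks.
    """
    for _ in range(max_polls):
        if is_upstream_done():
            start_backup()
            return True
        sleep(poll_seconds)
    return False


# Simulated run: the stored procedure finishes on the third check.
checks = iter([False, False, True])
started = []
ok = run_when_ready(lambda: next(checks),
                    lambda: started.append("backup started"),
                    poll_seconds=0, sleep=lambda seconds: None)
print(ok, started)
```

Returning False on timeout, instead of starting anyway, keeps a late stored procedure from producing an inconsistent backup copy; the operator can then decide whether to rerun or skip that night's job.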