Many experienced Tivoli Storage Manager (TSM) administrators tout TSM's flexibility. However, with flexibility comes a level of complexity that can cause basic requirements to be overlooked. Below are some of the most common problem areas that can cause headaches for unsuspecting TSM administrators.
Undersized disk storage pools
A common mistake is to define disk pools that lack the capacity to hold one night's worth of backup data. When that happens, migration to tape starts as soon as the disk pool reaches its high migration threshold, rather than at the scheduled time through the "migrate" function. Unscheduled migrations can interfere with backups, causing them to fail or run beyond the allowable window.
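The sizing rule above comes down to simple arithmetic, sketched here in Python. This is only an illustration: the function names are hypothetical, the 90% threshold is an assumed value to be replaced with the pool's actual setting, and the pool is assumed to be empty when the backup window opens.

```python
def high_migration_limit_gb(pool_capacity_gb, high_mig_pct):
    """Amount of data (GB) the pool can hold before threshold migration starts."""
    return pool_capacity_gb * high_mig_pct / 100


def migration_starts_during_backups(pool_capacity_gb, nightly_backup_gb, high_mig_pct=90):
    """True if one night's backups would exceed the high migration threshold.

    Assumptions: the pool is empty when the backup window opens, and
    high_mig_pct (90 here) matches the threshold set on the pool.
    """
    return nightly_backup_gb > high_migration_limit_gb(pool_capacity_gb, high_mig_pct)


def minimum_pool_capacity_gb(nightly_backup_gb, high_mig_pct=90):
    """Smallest pool (GB) that absorbs a full night without exceeding the threshold."""
    return nightly_backup_gb * 100 / high_mig_pct


if __name__ == "__main__":
    # Hypothetical figures: 600 GB of nightly backups.
    print(migration_starts_during_backups(500, 600))   # pool too small
    print(migration_starts_during_backups(1000, 600))  # pool large enough
    print(minimum_pool_capacity_gb(600))
```

Feeding in the nightly backup volume actually observed over a few weeks, rather than an average, gives a more conservative result, since the busiest night is the one that triggers an unscheduled migration.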
Insufficient tape library capacity
A shortage of tape library capacity is one of the most common situations and is at the root of many TSM issues. It is often temporarily remedied by removing full tapes from the library to make room for scratch tapes, which seriously hinders the reclamation process when it attempts to access data on the ejected tapes. A lack of scratch tapes, on the other hand, will cause TSM database backups, client and storage pool backups, and data migration to fail. There are no quick fixes; there must be sufficient tape or disk-based virtual tape library (VTL) storage.
Tape storage pool collocation
Collocation should be reserved for backup clients hosting enough data to use at least one tape volume to near capacity. Servers with small amounts of data should not be pointed to collocated tape pools, as this will result in very poor tape utilization. Since version 5.3, TSM supports collocation by group, which is better suited for smaller data sets.
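The effect of collocation on tape utilization can be estimated with a rough model, sketched below in Python. The function name and the figures are hypothetical. The model assumes that each collocated client occupies its own tape volume or volumes, and that non-collocated data is packed onto as few volumes as possible; collocation by group falls somewhere in between.

```python
import math


def tape_utilization(client_data_gb, tape_capacity_gb, collocated):
    """Return (tapes_used, utilization) for a list of per-client data sizes in GB.

    Simplified model: with collocation each client gets its own tape(s);
    without it, all client data is packed together.
    """
    if not client_data_gb:
        return 0, 0.0
    total_gb = sum(client_data_gb)
    if collocated:
        tapes = sum(max(1, math.ceil(gb / tape_capacity_gb)) for gb in client_data_gb)
    else:
        tapes = max(1, math.ceil(total_gb / tape_capacity_gb))
    return tapes, total_gb / (tapes * tape_capacity_gb)


if __name__ == "__main__":
    small_clients = [20, 30, 50, 100]  # GB per client (hypothetical)
    print(tape_utilization(small_clients, 400, collocated=True))
    print(tape_utilization(small_clients, 400, collocated=False))
```

With these sample figures, four small clients on 400 GB volumes use four tapes when collocated and one tape otherwise, which is the poor utilization the paragraph above warns about.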
Tape reclamation
Reclamation is a process that is often not allowed to complete in environments facing a shortage of tape drives or library capacity. This typically leads to a process backlog resulting in even poorer tape utilization, which might have been the cause of the library capacity shortage in the first place.
TSM recovery log capacity
A TSM database set to run in "roll-forward" (linear) mode, which allows recovery to the latest point in time, can cause the recovery log to run out of capacity. The recovery log is flushed following a TSM database backup. Should there be no scratch tapes available in the library, the scheduled database backups to tape will fail, preventing the recovery log from being flushed. If this situation goes unnoticed for some time, the recovery log will reach its maximum capacity (currently 13 GB) and halt the TSM server (much like an RDBMS running out of redo log space). This is, of course, in addition to the fact that a failed database backup is never a good thing.
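How much time remains before the recovery log fills can be estimated from its current usage and growth rate. The Python sketch below uses the 13 GB maximum cited above; the function name, the growth rate, and the alert threshold are hypothetical, and the usage figures would have to come from whatever monitoring is already in place.

```python
def hours_until_log_full(log_used_gb, growth_gb_per_hour, log_max_gb=13.0):
    """Estimate the hours left before the recovery log reaches its maximum size.

    Assumes a constant growth rate and no successful database backup
    (and therefore no log flush) in the meantime.
    """
    if growth_gb_per_hour <= 0:
        return float("inf")  # log is not growing
    remaining_gb = max(0.0, log_max_gb - log_used_gb)
    return remaining_gb / growth_gb_per_hour


if __name__ == "__main__":
    hours_left = hours_until_log_full(log_used_gb=4.0, growth_gb_per_hour=0.5)
    print(f"Estimated hours until the recovery log is full: {hours_left:.1f}")
    if hours_left < 24:  # hypothetical alert threshold
        print("Warning: check that database backups are completing")
```

An estimate like this is only useful if someone acts on it, so it belongs in the same report that flags failed database backups.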
Backup data retention
In these days of regulatory compliance, overly liberal backup data retention parameters can cause a tape subsystem to quickly run out of capacity. This will only precipitate the other capacity-related issues mentioned earlier. A TSM backup environment should be sized for capacity only once the corporate backup retention policies have been set. TSM also provides an archive feature that should be used for compliance. Remember that backups are designed for data protection, not for data retention compliance.
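The link between retention settings and capacity can be illustrated with a deliberately crude model, sketched below in Python. It assumes that the retained data equals the active data plus one day's worth of changed data for every day of retention, and it ignores compression, reclamation overhead, and copy storage pools. All names and figures are hypothetical.

```python
import math


def retained_data_gb(active_data_gb, daily_change_gb, retention_days):
    """Rough estimate of stored backup data for a given retention period."""
    return active_data_gb + daily_change_gb * retention_days


def tapes_required(total_gb, tape_capacity_gb):
    """Number of tape volumes needed to hold total_gb, assuming full tapes."""
    return math.ceil(total_gb / tape_capacity_gb)


if __name__ == "__main__":
    # Hypothetical environment: 2,000 GB active data, 50 GB changed per day,
    # 400 GB tape volumes.
    for days in (30, 90, 365):
        total = retained_data_gb(2000, 50, days)
        print(days, "days:", total, "GB,", tapes_required(total, 400), "tapes")
```

Even in this simplified form, the gap between a 30-day and a 365-day policy shows why retention policies need to be settled before the library is sized.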
Include/exclude settings
TSM offers granular control over which files are backed up. A successful backup does not mean that all vital files were backed up; it only means that the files the software was instructed to back up actually were. It should never be assumed that all files on a given system are subject to backup; regular reviews of the TSM clients' include/exclude settings and periodic restore tests are in order.
Lack of monitoring and reporting
The absence of proper monitoring (or of someone to read the reports) allows serious problems to go unnoticed for some time. Newer TSM installations with ample tape storage capacity can easily fool a novice administrator into thinking that "it runs itself." A failed TSM database backup is best noticed well before that backup is needed.
This write-up cannot possibly provide answers to all possible TSM-related issues. However, through proactive monitoring, and by ensuring that the tape library is properly sized and houses an adequate number of tape drives and scratch tapes, most of the issues outlined in this paper can be avoided or, at least, caught before they become a serious problem.
This is, of course, not an exhaustive list of all the TSM configuration items requiring attention to maintain a healthy backup environment. More information can be found in the IBM Tivoli Storage Management Concepts Redbook (SG24-4877).
About the author: Pierre Dorion is a certified business continuity professional for Mainland Information Systems Inc.
This was first published in April 2007