In July's column, I offered 10 steps for better backup operations. The column generated many responses. My intent was to identify the fundamental activities that must be performed to ensure that backup data is properly managed and protected.
To review, here's the list:
- Plan proactively
- Establish a lifecycle operations calendar
- Review backup logs daily
- Protect your backup database or catalog
- Identify and resolve backup window failures daily
- Locate and back up orphan systems and volumes
- Centralize and automate backup management
- Create and maintain an open issues report
- Ensure that backup is integrated with the change control process
- Leverage your vendors effectively
When you publish such a list, inevitably some items are overlooked or discussed only briefly. I received some excellent feedback on additional areas worthy of discussion, so this month I'd like to delve into some of those suggestions.
What about restore?
The July column focused more on backup than restore. Some of you pointed out that there should be more emphasis on restore--after all, that's the whole point of backup. I could argue that by ensuring that the backup process is well managed, you're laying the foundation for successful data restores, but I agree that I could have spent more time on restoring data.
One reader said that in his company, the daily backup checklist includes the number of restores performed and the success rate. This is an excellent suggestion. He also said, however, that his company experiences only an 80% success rate for restores. Further discussion revealed that the majority of the problems were with second- and third-tier applications. While the specific reasons for failure vary, one common thread appears to be that these restores are performed by the platform group, where backup is a secondary responsibility.
This is a classic organizational problem. In effect, the system administrators are part of the "virtual" backup team, but don't necessarily have the focus or skills needed to perform this task effectively. This division of labor is common, and it's important that staff in these "part-time" backup roles get adequate training and that the backup-related configuration of the systems they're responsible for is reviewed and validated regularly.
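The reader's checklist idea, tracking how many restores were performed and how many succeeded, is easy to automate. What follows is only a minimal sketch in Python: the record layout (a tier label paired with a success flag) and the sample data are assumptions made for illustration, not the output of any backup product. Breaking the rate down by application tier helps show whether failures cluster in the second and third tiers.

```python
from collections import defaultdict


def restore_success_rates(restores):
    """Return {tier: (attempted, succeeded, rate)} for a list of restore records.

    Each record is a (tier, succeeded) pair. This layout is hypothetical;
    adapt it to whatever your restore log actually contains.
    """
    counts = defaultdict(lambda: [0, 0])  # tier -> [attempted, succeeded]
    for tier, succeeded in restores:
        counts[tier][0] += 1
        if succeeded:
            counts[tier][1] += 1
    return {
        tier: (attempted, ok, ok / attempted)
        for tier, (attempted, ok) in counts.items()
    }


# Illustrative data only, not figures from any real environment.
sample = [
    ("tier1", True), ("tier1", True),
    ("tier2", True), ("tier2", False),
    ("tier3", False), ("tier3", True), ("tier3", True),
]

for tier, (attempted, ok, rate) in sorted(restore_success_rates(sample).items()):
    print(f"{tier}: {ok}/{attempted} restores succeeded ({rate:.0%})")
```

A report like this, added to the daily checklist, makes it easier to see which groups or tiers need training or a configuration review.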
Restores should figure into the planning process. Item one on the list suggested that it's critical to consider the impact on backup when planning, budgeting and rolling out a new application. Application testing and acceptance must verify that the application and all of its related servers can be backed up and recovered successfully. If an application includes several tiers, the testing and release process should include documenting and testing backup and recovery for each component. The documentation should specify configuration details and recovery methodology. This should be validated and made part of any operational run-book for the servers. No server should be placed into production without demonstrating that it can be backed up and restored successfully.
Backup catalog protection
Item four discussed some basics concerning backup catalog protection. Here are additional steps to protect this critical component:
- Monitor your backup catalog and, at least weekly, ensure that the catalog or database and its underlying file system are not approaching maximum capacity.
- Ensure that you have multiple generations of catalog backups. The mechanism provided for catalog backup in some products tends to encourage reuse of the same tape on a repeated basis. This is dangerous, as an undetected corruption could render the backups unrecoverable.
- Be sure to print or e-mail a copy of the list of tapes required for catalog recovery daily.
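The capacity check in the first item above lends itself to a scheduled script. Here is a minimal sketch in Python using the standard library's `shutil.disk_usage`. The catalog path and the 80% threshold are placeholders I've chosen for illustration, and the script looks only at the underlying file system, not at any internal size limit the catalog database itself may have.

```python
import shutil


def catalog_capacity_warning(path, threshold=0.80):
    """Return a warning string if the file system holding `path` is at or
    above `threshold` (a fraction of total capacity); otherwise return None."""
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total
    if used_fraction >= threshold:
        return (f"{path}: {used_fraction:.0%} full, "
                f"at or above the {threshold:.0%} threshold")
    return None


# CATALOG_PATH is a placeholder; point it at the real catalog directory.
CATALOG_PATH = "."
warning = catalog_capacity_warning(CATALOG_PATH)
print(warning or f"{CATALOG_PATH}: catalog file system is below the threshold")
```

Run weekly or more often from a scheduler, a check like this can feed the same report or e-mail that carries the catalog recovery tape list.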
More about orphans
Discovering what's not backed up is a challenge, but the effort could avert disaster. Item six addressed finding orphan systems: scanning subnets, mapping to Active Directory or DNS dumps, and identifying system owners. But another category of orphans also requires attention: database orphans. Servers often support multiple database instances, and it's possible new database instances could be created and not properly added to a backup schedule, thus creating an orphan database.
Checking servers and identifying database instances and associated volumes for backup isn't typically done. None of the traditional backup applications is able to do this sort of discovery, so it's likely to require a custom scripting effort.
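The core of such a script is a comparison between what exists and what is scheduled. The sketch below, in Python, assumes you can already produce two lists of (server, instance) pairs: one from whatever discovery method suits your databases, the other exported from the backup schedule. Both inputs, and all of the names in the sample data, are hypothetical.

```python
def find_orphan_databases(discovered, scheduled):
    """Return sorted (server, instance) pairs that were discovered on servers
    but do not appear in the backup schedule.

    Names are compared case-insensitively and with surrounding whitespace
    removed, since inventories and backup schedules often differ in spelling.
    """
    def normalize(pairs):
        return {
            (server.strip().lower(), instance.strip().lower())
            for server, instance in pairs
        }

    return sorted(normalize(discovered) - normalize(scheduled))


# Hypothetical inventory and schedule, for illustration only.
discovered_instances = [
    ("dbsrv01", "SALES"),
    ("dbsrv01", "HR"),
    ("dbsrv02", "Reporting"),
]
scheduled_instances = [
    ("DBSRV01", "sales"),
    ("dbsrv02", "reporting"),
]

for server, instance in find_orphan_databases(discovered_instances, scheduled_instances):
    print(f"Orphan database: instance {instance} on {server} has no backup schedule")
```

The hard part in practice is the discovery step, which varies by database product. The comparison itself stays this simple.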
Follow your tracks
Item nine stressed the importance of including backup in change management, but changes within the backup application itself are often overlooked. Certainly infrastructure changes, software upgrades and the like would be tracked accordingly. But minor tuning and other administrative-level changes within the backup application must also be properly logged.
Some backup applications help by logging commands as they're executed. If each administrator has an individual application login, it's possible to track who made a change and when it occurred. Not all backup applications have this capability, so it's important to make change log entries for all changes made within the application.
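Where the backup application offers no command logging of its own, even a simple append-only log is better than nothing. The following Python sketch writes one JSON record per change. The field names, the log location and the sample entry are assumptions made for illustration, and a real environment would more likely record these entries in its existing change control system.

```python
import getpass
import json
import os
import tempfile
from datetime import datetime, timezone


def log_backup_change(log_path, description, reason, admin=None):
    """Append one change record to `log_path` as a line of JSON and return it.

    If `admin` is not given, the current operating system login name is used.
    """
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "admin": admin or getpass.getuser(),
        "description": description,
        "reason": reason,
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


# Hypothetical log location and change, for illustration only.
log_file = os.path.join(tempfile.gettempdir(), "backup_changes.log")
recorded = log_backup_change(
    log_file,
    description="Changed the start time of a nightly backup policy",
    reason="Backup window overlapped with batch processing",
    admin="jdoe",
)
print(f"Logged change by {recorded['admin']} at {recorded['timestamp']}")
```

The point of recording who, when, what and why for every tuning change is that a later backup failure can be traced back to the change that caused it.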
The versatile backup administrator
Backup (and restore) impacts, and is impacted by, numerous systems and applications, and must accommodate all of them. So a backup administrator must know and understand more than just the backup application.
Many tools are available for backing up applications, including specialized application agents; database dumps; host-, network- and storage-based snapshotting and mirroring; replication and others. Implementing the best one requires broad knowledge and understanding.
When an application must be recovered, responsibility for getting it back online must be clearly defined. But determining who has the required skills can be difficult, and it may not be a single person. Recovery often requires a team effort, but the backup administrator is often looked to for recovery expertise.
The backup administrator must be an expert in the backup application, and must also have an understanding of the technologies and applications in the environment. This includes storage and particularly storage area networks, IP networking, system administration, security, databases, system-level applications such as DNS and Active Directory, and the applications being protected.
Trying to get by with a mindset of just being responsible for moving the data usually isn't sufficient. The backup staff must cross-train and communicate with other disciplines to raise the awareness and competency of all concerned with regard to data protection.
Paying attention to policy
Policies pertaining to data protection and availability should dictate how backups function. Ideally, there should be strategic policies that support business objectives, which translate into tactical policies for IT. Too often, policies are not clearly defined and, as a result, either no clear rules exist or ad hoc rules are defined "on the fly." This can cause problems, resulting in a backup infrastructure that meets nobody's needs.
Rules must be established regarding the service levels provided, and the roles and responsibilities of all involved parties. By successfully performing the 10 basic backup operational steps and demonstrating competency, the backup group should develop enough professional respect and rapport with users and peers that rigorous enforcement of the rules will be understood.
Thanks to all who sent comments, and to GlassHouse consultant Jorgen Lie for many valuable suggestions.