10 basic steps for better backup
In the newspaper business, bad news sells. When it comes to backup, it's easy to focus on the bad news. There's simply so much of it: nightly failures, lost tapes and unrecoverable data.
But the news isn't all bad. There are shops where backups are completed successfully, where data is restored and backup operations run smoothly. The most evident common denominator in well-functioning backup infrastructures is effective process and control. Well-run environments have a clear understanding of the tasks to be performed and a consistent way to accomplish them.
How does your organization measure up in regard to the basics of backup operations? Here's a checklist of 10 areas you should focus on to build a more effective backup practice.
1. Plan ahead. Backup is one strategic component of data protection; others include mirrors, snapshots and replication. In most environments, traditional backup serves as the last resort for data recovery. But as a strategic element, backup planning should be a fundamental part of the overall storage plan.
Your backup infrastructure needs to be factored into the planning process for rolling out apps, servers and primary storage growth. Too often, changes in the environment aren't taken into account until the eleventh hour. This causes disruptions and has a detrimental impact on the overall backup operation.
Proper planning enables the backup team to fully understand an application's business requirements and design characteristics with respect to data protection. The backup policies and approach necessary for a database application that employs split mirrors and replication are considerably different from those needed for a file-based environment that has no additional data protection. Similarly, a large enterprise application deployed across multiple servers may have complex data interdependencies that require proper backup synchronization to enable a usable recovery.
2. Establish a lifecycle operations calendar. An effective backup operation requires certain tasks to be completed successfully every day. There are also weekly, monthly, quarterly and even annual tasks that are as important as daily tasks. While short-term tasks are highly tactical, long-term tasks tend to be more strategic. In an effective backup operations environment, all tasks should be documented and performed on schedule (see "The backup operations lifecycle").
Daily tasks are the operational fundamentals that most backup administrators are familiar with and include items such as:
- Job monitoring
- Success/failure reporting
- Problem analysis and resolution
- Tape handling and library management
- Performance analysis
- Capacity trending and planning
- Policy review and analysis
- Recovery testing and verification
- Architecture planning and validation
Evaluate your daily/weekly/monthly/as-needed tasks. Document them and make sure they're performed and reported on schedule.
Keep in mind that time flies. Before you know it, a year will have gone by and a complete annual cycle will have passed. It may seem tedious at first, but eventually you'll come to realize the benefits of a more optimized environment.
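A calendar like this can be checked mechanically once each task has a frequency and a last-completed date. The following is a minimal sketch under stated assumptions: the task names are borrowed from "The backup operations lifecycle," while the `INTERVALS` table (with months and quarters approximated as 30 and 90 days) and the `overdue_tasks` function are hypothetical names invented for this illustration.

```python
from datetime import date, timedelta

# Assumed intervals in days; monthly and quarterly are rough approximations.
INTERVALS = {"daily": 1, "weekly": 7, "monthly": 30, "quarterly": 90}


def overdue_tasks(tasks, today):
    """Return the names of tasks whose next due date has already passed.

    tasks maps a task name to a (frequency, last_done) pair.
    """
    late = []
    for name, (frequency, last_done) in tasks.items():
        due = last_done + timedelta(days=INTERVALS[frequency])
        if due < today:
            late.append(name)
    return late


# Hypothetical calendar entries for illustration only.
calendar = {
    "Validate backup activities": ("daily", date(2024, 2, 29)),
    "Validate backup system": ("weekly", date(2024, 2, 20)),
    "Validate backup process": ("monthly", date(2024, 2, 10)),
    "Validate backup solution": ("quarterly", date(2023, 11, 1)),
}

for task in overdue_tasks(calendar, today=date(2024, 3, 1)):
    print("Overdue:", task)
```

With the sample dates shown, the weekly and quarterly validations are reported as overdue. The point is less the script than the discipline: a task that isn't recorded with a date can't be reported on schedule.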
3. Review backup logs daily. A review of backup application error and activity logs is a key daily task--but one that's often easier said than done. Log analysis can be time-consuming, but it can pay extremely valuable dividends and is essential to reliable backup.
Backup problems tend to manifest themselves in a cascading effect. One event results in a series of subsequent problems that don't have an immediate, obvious linkage. For example: A backup job doesn't kick off because a required tape drive was never released from an earlier job. This prior job was backing up an application server executing an unscheduled batch process, consuming system resources and causing the backup to finish late. The system administrator responsible never informed the backup administrator to reschedule the backup.
It takes considerable skill and detective work to determine whether one is observing a root cause or a symptom of some other problem. You must also establish good communications and working relationships with system administrators, DBAs, network administrators and others to effectively troubleshoot complex problems.
The backup operations lifecycle
Daily: Validate backup activities
Weekly: Validate backup system
Monthly: Validate backup process
Quarterly/Annually: Validate backup solution
4. Protect your backup database or catalog. The catalog should be treated like any other critical application database. It should be mirrored, or at least RAID-protected, and you should verify successful multiple-copy backup of the database or catalog on a scheduled basis.
5. Identify and resolve backup window failures daily. Backup window failures are backups that complete successfully but exceed the expected backup window. Because the backup job itself completes, no errors are reported in the error log, so this problem is often overlooked. In addition to affecting production environments and creating user dissatisfaction, jobs that approach or exceed the backup window may be warning signs of impending capacity limits or performance bottlenecks. Recognizing and addressing these issues as early as possible can prevent future failures.
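Because these jobs never appear in the error log, the check has to compare completion times against the window itself. Here is a minimal sketch of that comparison, not a feature of any particular backup product: the `Job` record, the `window_report` function and the 90% warning threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Job:
    """Hypothetical record of one successfully completed backup job."""
    name: str
    start: datetime
    end: datetime


def window_report(jobs, window_start, window_end, warn_ratio=0.9):
    """Flag jobs that finished after, or close to the end of, the backup window.

    warn_ratio is the fraction of the window after which a finish time
    counts as "approaching" the limit (an assumed, tunable threshold).
    """
    warn_at = window_start + (window_end - window_start) * warn_ratio
    report = {}
    for job in jobs:
        if job.end > window_end:
            report[job.name] = "exceeded"
        elif job.end > warn_at:
            report[job.name] = "approaching"
    return report


# Hypothetical nightly window from 22:00 to 06:00.
window_start = datetime(2024, 1, 5, 22, 0)
window_end = datetime(2024, 1, 6, 6, 0)
jobs = [
    Job("db", datetime(2024, 1, 5, 22, 0), datetime(2024, 1, 6, 6, 30)),
    Job("files", datetime(2024, 1, 5, 23, 0), datetime(2024, 1, 6, 5, 30)),
    Job("mail", datetime(2024, 1, 5, 22, 30), datetime(2024, 1, 6, 2, 0)),
]
print(window_report(jobs, window_start, window_end))
```

Tracking the "approaching" jobs over several weeks is what turns this from a daily check into an early warning of capacity or performance limits.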
6. Locate and back up orphan systems and volumes. Your backup software invariably provides you with some level of reporting information about daily backup success. If you depend on this as the authoritative source on backup, then you're likely still at risk.
The backup application reports only on the servers it knows about. Large environments often have orphan systems--systems that have been brought into production but not incorporated into the backup plan. This can happen for a variety of reasons, but it's often the result of a business unit purchasing a system outside of IT's purview. The system may have been backed up independently at one time, but over time has fallen through the cracks. Usually these systems are discovered after it's too late: Data loss occurs and a restore request comes to IT for a system it knows nothing about.
Addressing this problem can be challenging and time-consuming. It entails regularly discovering and mapping new network addresses to nodes, filtering out unrelated addresses (additional NICs, network devices, printers and the like), identifying the locations and owners of these nodes and establishing policies for managing the addition of storage volumes. Regular reporting to system and application owners of exactly what's being backed up and what's not being backed up (by choice) is also critical.
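At its core, the discovery step is a comparison between what's on the network and what the backup application knows about. The sketch below illustrates that comparison only. It assumes the discovered hosts, the backup client list and the exclusions are already available as lists of names, and the `find_orphans` function and the host names are hypothetical.

```python
def find_orphans(discovered, backup_clients, excluded=()):
    """Return discovered hosts that are neither backed up nor deliberately excluded.

    Names are compared case-insensitively; the result is sorted for reporting.
    """
    known = {host.lower() for host in backup_clients}
    skip = {host.lower() for host in excluded}
    candidates = {host.lower() for host in discovered}
    return sorted(candidates - known - skip)


# Hypothetical inputs: a network scan, the backup client list, and
# devices intentionally left out (printers, network gear and so on).
discovered = ["app01", "DB02", "printer-3f", "hr-nas"]
backup_clients = ["app01", "db02"]
excluded = ["printer-3f"]

print(find_orphans(discovered, backup_clients, excluded))
```

Keeping the exclusion list explicit matters: it records which systems aren't backed up by choice, so anything left over is an orphan that needs an owner.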
7. Centralize and automate backup management as much as possible. A key to successful data protection is consistency. This doesn't mean that all data must be treated in the same manner. What it does mean is that all data of equivalent value and importance to the organization should be managed in a similar fashion. The orphan problem is an excellent example of an inconsistency that can result from non-centralized backup administration.
In many environments, backup operations for Unix and Windows servers are run independently. This organizational alignment may pre-date networked storage--but it's questionable whether the old arrangement still makes sense. Besides being inefficient, it suggests a different set of policies and procedures should be applied to data based on its operating platform. Is there any line-of-business owner who would apply that measure to data valuation?
Geographic considerations and functions within backup operations can be delegated, but given communication capabilities and the management tools available today, there's little justification for decentralized backup.
As the complexities of the backup infrastructure grow, automation can help by providing tools to facilitate repetitious processes. As discussed earlier, tasks such as checking logs on a scheduled basis are key. Automated alerts for previously identified errors in logs can make life easier. The inverse is also true--using automation to aggregate repetitive error entries in a log can be helpful.
In an unadulterated log, if I see one SCSI error, I see 1,000 of them. Scanning through all the entries of the same error can be daunting--so much so that I may be tempted to not perform the necessary daily log scan. Automation tools can successfully facilitate various activities if you identify the task to be performed and define the expected result.
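Collapsing repeated entries is a simple aggregation problem. The following sketch assumes a hypothetical log format in which each line contains the token " ERROR " followed by the message; real backup applications each have their own formats, so the parsing would need to be adapted. The `summarize_errors` function and the sample lines are invented for illustration.

```python
import re
from collections import Counter

_NUMBER = re.compile(r"\d+")


def summarize_errors(lines):
    """Count error messages, treating messages that differ only in numbers as one.

    Returns (message, count) pairs, most frequent first.
    """
    counts = Counter()
    for line in lines:
        _, marker, message = line.partition(" ERROR ")
        if not marker:
            continue  # not an error line in the assumed format
        counts[_NUMBER.sub("N", message.strip())] += 1
    return counts.most_common()


# Hypothetical log excerpt.
log = [
    "2024-01-05 01:02:03 ERROR SCSI error on drive 3",
    "2024-01-05 01:02:04 ERROR SCSI error on drive 3",
    "2024-01-05 01:02:09 ERROR SCSI error on drive 7",
    "2024-01-05 01:05:00 INFO job nightly-db started",
    "2024-01-05 02:00:00 ERROR media not found: slot 12",
]

for message, count in summarize_errors(log):
    print(f"{count:5d}  {message}")
```

Replacing digits with a placeholder is a deliberately blunt normalization. It groups the thousand SCSI errors into one line, at the cost of hiding which drive was involved, so the raw log should still be kept for follow-up.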
8. Create and maintain an open issues report. Finding and fixing problems like the ones I've discussed are tactical activities critical to backup success. However, the process of managing those problems effectively and establishing appropriate metrics indicative of backup quality is essential to drive systemic improvement of backup infrastructures.
In larger environments, problems may be tracked through a formal ticketing system. If you don't use such a system, an open-issues log can be an important tool to help a backup operation evolve from fire-fighting mode and ensure an optimized steady-state operation. Either way, regular reports detailing open problems that indicate the rate at which new problems are added and existing problems are closed can speak volumes about the overall health of the backup operation. A simple trending report with appropriate supporting data can uncover fundamental operational problems and help you reach an appropriate resolution.
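The trending report described here needs only two dates per problem: when it was opened and when, if ever, it was closed. Here is a minimal sketch, assuming issues are available as (opened, closed) date pairs with `None` for problems still open; the `backlog_trend` function and the weekly grouping by ISO week are illustrative choices, not a prescribed method.

```python
from collections import Counter
from datetime import date


def week_of(day):
    """Return (ISO year, ISO week number) for a date."""
    iso = day.isocalendar()
    return (iso[0], iso[1])


def backlog_trend(issues):
    """Summarize issues per week as (week, opened, closed, open_backlog) rows."""
    opened, closed = Counter(), Counter()
    for opened_on, closed_on in issues:
        opened[week_of(opened_on)] += 1
        if closed_on is not None:
            closed[week_of(closed_on)] += 1
    rows = []
    backlog = 0
    for week in sorted(set(opened) | set(closed)):
        backlog += opened[week] - closed[week]
        rows.append((week, opened[week], closed[week], backlog))
    return rows


# Hypothetical issue log: (date opened, date closed or None).
issues = [
    (date(2024, 1, 1), date(2024, 1, 3)),
    (date(2024, 1, 2), date(2024, 1, 10)),
    (date(2024, 1, 8), None),
    (date(2024, 1, 9), None),
    (date(2024, 1, 15), None),
]

for week, n_opened, n_closed, backlog in backlog_trend(issues):
    print(week, "opened:", n_opened, "closed:", n_closed, "backlog:", backlog)
```

A backlog that grows week after week, as it does in this sample, is the kind of signal that points to a systemic problem rather than an isolated failure.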
9. Ensure that backup is integrated with the change control process. Backup environments are by their nature highly dynamic. Unfortunately, within backup organizations, too often the change process for backup is equally dynamic. Just as backup must be part of the strategic planning process, on an operational level, backup must be part of an organization's formal change control process.
This implies a two-way relationship because changes directly and indirectly related to the backup infrastructure must be part of the notification, impact assessment and contingency planning process that's included within change control. Stories abound of unintended backup outages due to SAN switch topology or zoning changes, or system bottlenecks due to backup configuration modifications. They can and should be avoided with the proper process in place.
If a monthly outage window is necessary for the backup infrastructure to facilitate upgrades or verification tests, then this outage window shouldn't overlap with outage windows for other production systems. Demand for restores increases when systems change: system files are upgraded, and a change sometimes has to be backed out. If the backup infrastructure is down for maintenance at the same time, data can't be restored in a timely manner. The backup infrastructure is a production system, just like the most important application in an organization's environment, and it requires the same respect and support.
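Checking for overlapping outage windows is simple enough to automate as part of change control. The sketch below assumes each window is a (start, end) pair of datetimes; the `conflicting_windows` function and the system names are hypothetical.

```python
from datetime import datetime


def overlaps(a_start, a_end, b_start, b_end):
    """True if two time ranges share any time; ranges that merely touch do not."""
    return a_start < b_end and b_start < a_end


def conflicting_windows(backup_window, production_windows):
    """Return names of production systems whose outage overlaps the backup outage."""
    start, end = backup_window
    return [
        name
        for name, (other_start, other_end) in production_windows.items()
        if overlaps(start, end, other_start, other_end)
    ]


# Hypothetical maintenance schedule.
backup_outage = (datetime(2024, 1, 6, 1, 0), datetime(2024, 1, 6, 5, 0))
production_outages = {
    "erp": (datetime(2024, 1, 6, 4, 0), datetime(2024, 1, 6, 8, 0)),
    "web": (datetime(2024, 1, 6, 5, 0), datetime(2024, 1, 6, 7, 0)),
    "mail": (datetime(2024, 1, 13, 1, 0), datetime(2024, 1, 13, 3, 0)),
}

print(conflicting_windows(backup_outage, production_outages))
```

In this sample only the "erp" outage conflicts. A stricter policy might also treat back-to-back windows as conflicts, since restores are most likely to be requested right after a change.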
10. Leverage your vendors effectively. Backup environments are complex and get more so with the introduction of new technologies. Hardware and software vendors are racing to add new features and functionality in the struggle to differentiate themselves from one another.
While much of this technology can be helpful, and it certainly all sounds good, there's a considerable challenge in understanding the nuances of functionality of one technology option vs. another. For example, there are a significant number of different approaches to disk-based backup. Which one is right for your environment and what precisely is the impact?
A fundamental question that you must be able to answer is: Does your vendor have the right skills to support your needs? All technical problems get resolved eventually. If your technical problem isn't being resolved in a reasonable amount of time, then you may not be working with the right vendor. This becomes extremely apparent when multiple products from multiple vendors are integrated.
These 10 tasks may seem basic, but accomplishing them isn't always easy. They depend on a number of key elements: appropriate reporting and measurement capabilities, a high degree of staff competency within the backup organization and solid cross-functional communication. The impediments can be significant, including costs, resource availability, skill levels, organizational politics and a host of others.
If you can't accomplish all of these things, try to address the most critical. If time and resources are the issue, develop a plan to justify them. Against these hurdles you must consider the risk of unrecoverable data and major outages. After all, the news is full of those kinds of stories.