As business-critical applications continue to drive the demand for storage throughout the enterprise, managing the backup and recovery processes is becoming increasingly difficult. Although a well-thought-out backup and restore management plan (BRMP) can't ensure hassle-free operations, it can help eliminate costly mistakes, build consensus on how to allocate scarce storage resource dollars and dictate what to do when Murphy's Law kicks in and things get ugly.
To minimize the effects of unplanned downtime and to maximize data availability and recoverability, smart IT organizations must create, implement and maintain a BRMP. A BRMP provides a framework for understanding the backup environment, a vehicle for documenting the standard procedures to be followed for backup and restore operations and a repository for the corporate best practices and backup policy definitions that have been implemented. Here's how to create a BRMP, along with tips for best practices.
What a backup and restore management plan should contain
Staffing requirements. The BRMP should spell out staffing needs. What's required depends on several factors including backup schedules, backup windows and service level agreements with customers. Many large enterprises will require some form of coverage on a 24X7 basis. A 24X7 coverage model requires a minimum of seven to 10 people to adequately staff all shifts throughout the week, including management support.
Motivating staff isn't part of the BRMP, but it is a critical area of concern. Typically, the backup and restore function is delegated to a junior-level system or backup administrator. As part of the management plan, a training program should be implemented to ensure the people responsible for successful backups are up to speed on the latest enhancements and functions of the systems and backup software. Consider implementing a mentoring program in which senior-level personnel work closely with their less experienced co-workers to coach them and help with their career development. Another motivational initiative would be to provide a bonus structure based upon the percentage of successful backups, or the elimination of unplanned outages.
Define operational procedures. The management plan must contain the procedures for monitoring the backup infrastructure, for ensuring successful backup and recovery job completion, for complying with the change management process and for testing the restore process.
What should be monitored? All components of the backup infrastructure must be monitored to quickly identify and resolve any problems that surface. These components include backup and restore job status, backup servers and clients, automated libraries, LANs, network-attached storage (NAS), storage area networks (SANs), backup networks and the storage itself. Most of these components can be monitored via in-band and out-of-band communication methods such as real-time backup and restore activity monitors, error and event logs and SNMP traps aggregated up to enterprise-level frameworks.
What's an appropriate monitoring frequency? Unfortunately, many organizations only monitor their backup job status once a day, usually in the morning. The disadvantage of this approach is that by then the backup window has generally closed and failed backup jobs can't be restarted. If a restore is required, the enterprise would potentially lose a full day's worth of changes, resulting in lost revenue and productivity. To ensure the highest backup and restore success rates, the backup software should be monitored during the entire backup window.
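The cost of morning-only monitoring can be illustrated with a small sketch. It assumes a hypothetical backup window ending at 06:00 and a job that normally runs for three hours; the function name and the figures are illustrative and are not taken from any particular backup product.

```python
from datetime import datetime, timedelta


def can_restart(failure_seen_at: datetime,
                window_end: datetime,
                expected_duration: timedelta) -> bool:
    """Return True if a rerun started when the failure is noticed
    would still finish before the backup window closes."""
    return failure_seen_at + expected_duration <= window_end


# Hypothetical values: the window closes at 06:00 and the job takes 3 hours.
window_end = datetime(2024, 1, 2, 6, 0)
duration = timedelta(hours=3)

# Failure noticed at 01:30 by someone watching during the window.
seen_during_window = can_restart(datetime(2024, 1, 2, 1, 30), window_end, duration)

# The same failure noticed at 08:00 during a morning-only check.
seen_next_morning = can_restart(datetime(2024, 1, 2, 8, 0), window_end, duration)

print(seen_during_window, seen_next_morning)
```

Under these assumptions, only the failure noticed during the window leaves enough time for a rerun.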
What actions should be taken in the event of a backup or restore failure? What information must be captured to facilitate root cause analysis? When should the backup be restarted? These operational processes must be documented and tested. The process should also contain a technical and business escalation procedure defining whom to contact at the appropriate time.
How are changes to the environment recorded? A solid backup and restore management plan will outline a change management procedure requiring signature authority from all stakeholders. This process should be invoked for changes such as adding or removing backup clients, upgrading backup servers, adding additional capacity to storage subsystems, reconfiguring backup networks and capturing software/microcode revisions.
Are the restore procedures documented? How often is the restore process tested? It's imperative to have an up-to-date disaster recovery plan. An effective plan captures the actions to take for multiple levels of incidents ranging from a server crash to a full-blown disaster declaration. Test the plan at least every six months.
With myriad storage methods such as servers, disk and tape storage subsystems, storage area networks (SANs) and network-attached storage (NAS) topologies, successful backup and restore management can be a daunting task for even the most seasoned storage professionals. Every day, administrators wage a war against data corruption, virus attacks, network problems and a host of other incidents in a valiant effort to keep their mission-critical systems up and running.
Additionally, enterprise organizations face an array of other storage challenges, such as squeezing more data into shortened backup windows while meeting demanding service level agreements and performing ongoing backup infrastructure capacity planning. The 24X7 data access requirements of database and Web-based applications are forcing many organizations to rethink their traditional backup and restore strategies. Not only must these applications be backed up while online, but in most cases, they must be restored in less than half the time it takes to back them up.
Today's complex environments demand highly skilled IT professionals to ensure the backup solution is working as designed. Unfortunately, managing the backup and recovery environment is a job no one really wants. It can be a thankless job with high expectations for success and no tolerance for failure. A general perception among administrators is that no one has ever been promoted for ensuring successful backups. And, sorry to say, the opposite is all too true: Jobs have been lost as the result of unsuccessful backups.
Without proper backup schedules and retention policies, backup media can't be used efficiently, resulting in increased costs for data cartridges, automated libraries and off-site storage. Lack of media management policies can also result in lost or damaged backup media, impacting data availability and recoverability.
The following seven steps can help you create a BRMP.
Step 1: Understand the backup environment
Before a successful BRMP can be created, it's important to conduct a thorough assessment and inventory of the existing backup environment, including backup servers and clients, automated libraries, backup media and storage networking components. At a minimum, the following questions should be answered:
- Is the current infrastructure designed for backup and recovery? Most backup solutions are designed to move a fixed amount of data to backup media within a given backup window. While this is certainly an important consideration, the primary emphasis for solution design should be on ensuring that the business-critical applications can be restored quickly in the event of a disaster.
- Which systems are mission-critical? What are the availability requirements? What's the cost of downtime?
- What are the backup software and licensing requirements? Have enough licenses been purchased to satisfy the requirements?
- What are the database or application backup requirements? Is there a requirement for hot backup?
Step 2: Perform capacity planning
Once the assessment and inventory are completed and the backup infrastructure is understood and documented, the next step is to perform capacity planning. The purpose of capacity planning is to identify the sources of storage growth and perform a gap analysis to determine the differences between the current infrastructure capabilities vs. expected requirements. Important questions to answer at this stage include:
- What is the expected storage growth over the next six months and in one to three years?
- What are the anticipated increases in the number and types of backup clients?
- Will the current backup architecture and infrastructure scale to meet this growth?
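The gap analysis described in this step comes down to simple arithmetic. The sketch below assumes compound annual growth, which is only one possible growth model, and uses made-up figures (100 TB today, 50% annual growth, 200 TB of installed backup capacity) purely for illustration.

```python
def projected_capacity_tb(current_tb: float,
                          annual_growth_rate: float,
                          years: float) -> float:
    """Project storage under an assumed compound annual growth rate."""
    return current_tb * (1 + annual_growth_rate) ** years


def capacity_gap_tb(current_tb: float,
                    annual_growth_rate: float,
                    years: float,
                    installed_tb: float) -> float:
    """Return the shortfall between projected need and installed capacity.

    A result of 0.0 means the installed capacity covers the projection.
    """
    projected = projected_capacity_tb(current_tb, annual_growth_rate, years)
    return max(0.0, projected - installed_tb)


# Hypothetical figures for the planning horizons mentioned above.
for horizon in (0.5, 1, 3):
    need = projected_capacity_tb(100, 0.5, horizon)
    gap = capacity_gap_tb(100, 0.5, horizon, installed_tb=200)
    print(f"{horizon} years: need {need:.1f} TB, gap {gap:.1f} TB")
```

A nonzero gap at any horizon is a signal that the current architecture may not scale without new acquisitions.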
How often should backups occur?
Policy definition includes documenting backup schedules and windows for each client in the management plan. These policies will be dictated by customer and application requirements. For example, an enterprise may require a full backup of its critical Oracle database on a daily basis to maximize data availability and minimize restore times. In the absence of a defined requirement, a generally accepted backup schedule is as follows:
- Perform full backups of all data on a weekly basis. The fourth full backup per month becomes a monthly backup.
- Perform incremental backups on a daily basis - an incremental backup is defined as the backup of all data changed since the last backup.
- If possible, stagger the weekly full backups throughout the week to balance resource utilization.
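The schedule above can be expressed as a small rule. This is a minimal sketch, not any backup product's scheduler: it assumes each client has one assigned weekday for full backups, treats the fourth occurrence of that weekday in a month as the monthly backup, and staggers clients round-robin across the week. The function names are hypothetical.

```python
from datetime import date


def backup_type(day: date, full_weekday: int) -> str:
    """Classify the backup to run on `day` for a client whose weekly
    full backup falls on `full_weekday` (Monday=0 ... Sunday=6)."""
    if day.weekday() != full_weekday:
        return "incremental"
    # Days 1-7 hold the first occurrence of a weekday, 8-14 the second, etc.
    occurrence = (day.day - 1) // 7 + 1
    return "monthly full" if occurrence == 4 else "weekly full"


def staggered_full_weekday(client_index: int) -> int:
    """Assumed staggering scheme: spread clients round-robin over 7 days."""
    return client_index % 7


# January 2024: Sundays fall on the 7th, 14th, 21st and 28th.
print(backup_type(date(2024, 1, 7), full_weekday=6))
print(backup_type(date(2024, 1, 8), full_weekday=6))
print(backup_type(date(2024, 1, 28), full_weekday=6))
```

In months with a fifth occurrence of the weekday, this sketch treats the fifth full backup as an ordinary weekly full.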
Backup windows continue to shrink with application uptime becoming critical to employee productivity and revenue generation. Most storage administrators are looking for ways to reduce the backup window. They are deploying hot database backup agents, snapshot functionality, point-in-time copy and remote data replication solutions to safeguard business-critical data, in addition to the more conventional tape backup and recovery methods.
Effective media management policies are essential for ensuring data protection, while controlling media and storage costs. Media management policy includes setting appropriate retention periods for backups, tape duplication policies and thresholds for automated library management. Business groups or application requirements will drive retention policy definitions.
Consider the following rules of thumb for retention periods:
- Retention period for daily incremental backups - one month
- Retention period for weekly full backups - three months
- Retention period for monthly full backups - one year
These recommended minimum retention periods provide an enterprise with the ability to easily recover month-end, quarter-end, or year-end data while also employing an efficient media utilization plan.
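As a rough illustration, the rules of thumb above can be turned into expiry dates. The sketch approximates one month, three months and one year as 30, 90 and 365 days; that approximation and the names used here are assumptions, and real retention settings depend on the backup software and on business or regulatory requirements.

```python
from datetime import date, timedelta

# Assumed day counts approximating the retention periods listed above.
RETENTION_DAYS = {
    "incremental": 30,     # daily incremental: about one month
    "weekly full": 90,     # weekly full: about three months
    "monthly full": 365,   # monthly full: about one year
}


def expires_on(backup_date: date, kind: str) -> date:
    """Return the date on which a backup of the given kind may be recycled."""
    return backup_date + timedelta(days=RETENTION_DAYS[kind])


def is_expired(backup_date: date, kind: str, today: date) -> bool:
    """Return True once the retention period has fully elapsed."""
    return today >= expires_on(backup_date, kind)


taken = date(2024, 1, 1)
for kind in RETENTION_DAYS:
    print(kind, "expires on", expires_on(taken, kind))
```

Media holding only expired backups can then be returned to the scratch pool, which is what keeps automated libraries from filling to capacity.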
If backing up to tape, consider tape duplication for the weekly full and monthly full backups. Tape duplication is a process in which the primary backup tape is copied to a secondary tape after the backup is complete. Typically, one copy is kept on-site and the other copy is sent to an off-site vaulting provider for disaster recovery purposes. Some backup software packages will perform inline tape duplication by concurrently writing separate backup streams to two tape drives. Although tape duplication consumes more tape resources, it has two advantages: enhanced data availability in the event of a media failure on the primary copy, and with one copy on-site, faster local restore times.
Many organizations have purchased and deployed automated libraries to improve backup and recovery performance. Another benefit is the reduction of errors associated with manual handling of backup media. Unfortunately, management of these libraries is often overlooked. Without proper media management and retention policies, the libraries fill to capacity and require more human intervention. Proper automated library sizing is critical to ensure that the library is performing the function for which it was deployed in the first place - automation.
Step 3: Analyze current policies and procedures
The foundation of a successful BRMP is the documentation of policies and operational procedures. In this step, internal and external customer requirements for backup and recovery must be reviewed and documented. Questions that should be answered include:
- What are the service level commitments that must be met for application and data availability?
- What backup schedules and windows are needed? (See "How often should backups occur?" sidebar.)
- What are the appropriate retention policies for this data? Are there any regulatory requirements?
- What are the corporate requirements for a disaster recovery plan?
Step 4: Determine resource constraints
In an ideal world, an enterprise would have unlimited resources to accomplish its business objectives - including ensuring a successful backup and recovery. Unfortunately, this isn't the case. A realistic BRMP will take into account the business constraints most organizations face. Key resource areas that must be reviewed include personnel constraints, physical infrastructure constraints and financial constraints. Consider the following questions:
- Is there enough staff to effectively manage backup and restore operations? Do they have the right skill sets?
- Are there adequate data center resources (floor space, rack space, power, cooling, etc.) to accommodate potential increases in backup infrastructure components?
- Is there budgetary approval for any new acquisitions or improvements to the backup and restore infrastructure?
Step 5: Create a BRMP
At this point, there will be a wealth of information available to provide a baseline for the management plan. This includes information about the existing backup infrastructure, requirements for storage growth, backup policies and procedures and resource constraints. The last step before actually writing the plan is to define staffing requirements, operational procedures and the backup and media management policies (see "What a BRMP should contain"). Once that's accomplished, it's time to write the plan. The final step is to obtain consensus and approval for the plan.
No doubt, this is a formidable task that can take months. Most users want all of their data backed up and retained indefinitely, while the legal department usually wants a limited amount of backup data and short retention periods. A good management plan should reflect these disparate requirements. Reality and consensus lie somewhere in the middle.
Step 6: Implement the plan
Once the BRMP is completed and approved, it's time to implement the plan. Take a phased approach to implementation. First, hire and train the required operational staff or select an outsourcing vendor. Second, acquire and install any of the backup hardware and software identified in the capacity planning phase. Next, implement and test the operational procedures and backup policies in a controlled environment to avoid impacting production backups. This is also the time to implement and test any new backup management software tools. Be prepared to make some adjustments to the plan as required.
After testing is complete, schedule a full rollout of the policies and procedures across the enterprise. Consider using a professional project planning software package when implementing the BRMP. Don't make the same mistake other organizations have made - assuming that just because the project is approved and paid for, it will be properly implemented. Be proactive: Follow the project plan and stay on schedule.
Step 7: Monitor the management plan
Obviously, a company's business changes - in some cases, on almost a daily basis. New applications drive revenue and profit growth. And of course, the storage environment continues to grow at an exponential rate. Due to these ever-changing requirements, it's important to continuously monitor the backup and recovery management plan to ensure it's meeting the business and data protection needs of the enterprise.
Selecting the software to support your plan
Storage administrators require a robust set of software tools to properly monitor and manage the backup and recovery infrastructure. These tools include messaging and event notification frameworks such as HP's OpenView, Tivoli from IBM, and CA Unicenter, with backup and recovery software from vendors such as Veritas, Computer Associates, Legato, and Tivoli among others. While the leading software vendors provide a rich set of features and functionality in their products, a more holistic view is required for expert management of the backup infrastructure.
Many IT organizations are evolving into internal storage service providers. They are adding value to their organizations by offering expert storage and backup knowledge, improved quality of service and customized backup solutions to meet their customer requirements. As such, these organizations are looking for new software solutions that provide enhanced monitoring, reporting, asset management and chargeback capabilities. When researching backup and recovery management tools, look for the following functionality:
- Global view of the backup infrastructure. Many large enterprises have multiple data centers that are geographically dispersed. A consolidated, global view of the enterprise environment simplifies backup administration and reporting. A storage administrator may quickly identify information at risk in the event of failed backups, and take corrective action as required.
- Event-driven notification and response. The software management tool should provide cohesive in-band and/or out-of-band monitoring capability for all components in the backup and recovery infrastructure including backup servers, host clients, automated libraries and storage networks.
- Service level agreement compliance reporting. As internal service providers, IT organizations need to provide their internal and external customers with detailed reports demonstrating their performance against agreed-upon SLAs. The software management tools should provide the capability to report on backup and restore success rates, storage area network (SAN) or network-attached storage (NAS) utilization, amount of data backed up per client, backup window utilization and automated library and media utilization. Additionally, the software should provide a mechanism for billing or chargeback based on actual storage consumption and/or backups performed.
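Two of the reporting capabilities listed above, backup success rate and chargeback by data backed up, can be sketched in a few lines. The record layout, the function names and the per-gigabyte rate are all hypothetical; a real tool would pull these figures from the backup software's job logs.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    client: str
    succeeded: bool
    gb_backed_up: float


def success_rate(jobs: list[JobRecord]) -> float:
    """Percentage of jobs that completed successfully (0.0 if no jobs)."""
    if not jobs:
        return 0.0
    return 100.0 * sum(1 for j in jobs if j.succeeded) / len(jobs)


def gb_per_client(jobs: list[JobRecord]) -> dict[str, float]:
    """Total data backed up per client, counting successful jobs only."""
    totals: dict[str, float] = {}
    for job in jobs:
        if job.succeeded:
            totals[job.client] = totals.get(job.client, 0.0) + job.gb_backed_up
    return totals


def chargeback(jobs: list[JobRecord], rate_per_gb: float) -> dict[str, float]:
    """Bill each client at an assumed flat rate per gigabyte backed up."""
    return {client: gb * rate_per_gb for client, gb in gb_per_client(jobs).items()}


# Made-up job history for two clients.
jobs = [
    JobRecord("db01", True, 500.0),
    JobRecord("db01", False, 0.0),
    JobRecord("web01", True, 120.0),
    JobRecord("web01", True, 80.0),
]
print(f"Success rate: {success_rate(jobs):.1f}%")
print("Chargeback:", chargeback(jobs, rate_per_gb=0.10))
```

Comparing the computed success rate against the target in the SLA gives a simple compliance figure for each reporting period.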
Smart storage administrators should perform an internal audit of the plan on a quarterly basis. Some questions that should be brought up during the audit include: Is your current backup and recovery infrastructure meeting your needs? Do you have written and effective backup policies and operational procedures? Are the backups being performed successfully within the defined window? Are restores of file systems and databases successfully tested according to the defined schedule? Are service level agreements being met? Is the disaster recovery process tested on a semiannual basis?
Other questions you should ask yourself are: How often should you test your recovery procedures? What staffing levels are required for successful backup operations?
In addition, the plan must be flexible enough to handle backup growth. Will it accommodate system upgrades, additional backup clients, and new hardware, software and storage network components?
Navigating the labyrinth of backup and recovery is a significant challenge, with pitfalls at every turn. Creating, implementing, monitoring and maintaining a BRMP can help ensure that your organization's data is protected, available and recoverable.