Remember the backup/recovery salesman's pitch? "If there's a problem with a job, e-mail alerts, pager messages...
and SNMP traps will be sent immediately." It sounded like a great deal at the time, but as they say, "Be careful of what you wish for."
Industry analysts estimate that between 5% and 20% of backup jobs fail on a nightly basis, and the error messages resulting from these failed jobs can pile up faster than snow in a New England blizzard. Backup/recovery (B/R) applications can usually organize these error messages by nature and priority, but a storage administrator must still sift through possibly hundreds of messages to determine the problem's root cause, priority and remedy. An increasing number of backup reporting products on the market go beyond the reporting capabilities of backup hardware and software, and promise to lessen backup pain.
Backup reporting products
A comparison table of backup reporting products is available as a PDF.
B/R reporting tools emerged as a category in 2000-2001, driven by new vendors that have provided most of the innovation and momentum in the market. The newest B/R reporting tools are multivendor solutions as opposed to product-specific solutions. They can also be found in storage resource management (SRM) products and integrated products (see Backup reporting products, this page).
Among the multivendor products, Bocada Inc. can probably be credited with pioneering the backup reporting market. It has since been joined by other vendors, including Servergraph, SysDM Inc., Tavata Software Corp. and Tek-Tools Inc. In the next three to five years, you can expect some consolidation, most likely through the acquisition of smaller vendors by larger ones.
A second group of storage companies, including Computer Associates (CA) International Inc. and Veritas Software Corp., offer multivendor B/R reporting modules as part of a larger SRM product suite. In fact, backup reporting sits somewhere between backup and recovery management and SRM. Backup reporting tools are logically a subset of SRM, but are tightly linked to B/R. SRM products may cost substantially more than pure-play B/R reporting products, but they also offer broader capabilities for reporting and trend analysis on disk arrays, tape management and so on.
The third group of products includes those integrated with a specific B/R product. In some cases, such as with CommVault Galaxy, from CommVault Systems Inc., and Veritas NetBackup, the B/R reporting function is a separate module available for an additional cost. In other cases, such as with CA's ARCserve, EMC/Legato's NetWorker and IBM Tivoli Storage Manager (TSM), the reporting module is included with the base B/R product. CommVault's QNet product is packaged somewhat differently in that it's part of the firm's QiNetix SRM product; however, it's specific to CommVault's Galaxy backup app and tightly integrated with it.
In general, integrated products offer fundamental functionality like event notification, success/failure reporting and summarization. QNet extends this functionality by including capacity planning, predictive analysis, B/R costing and a "recovery readiness check." IBM's TSM module includes health monitoring, extensive reporting and a DR Manager. While this functionality may meet the day-to-day needs of many organizations, SRM and multivendor products take B/R reporting to the next level (see Which product is best?).
Several vendors tout their products as either agent or agentless architectures, and there are benefits and limitations to each. An agent architecture distributes a portion of the reporting app to various devices and allocates the workload across the infrastructure. In addition, distributed agents can continue to collect data even if the central server is unavailable. An agentless architecture eliminates the hassle of installing and maintaining agents on hundreds of devices. Which solution is best for an IT organization depends on the specific situation and personal preference.
Key B/R features
Four distinct steps build on each other to create a more robust, less effort-intensive storage environment:
- Historical analysis
- Performance tuning
- Predictive analysis and forecasting
- Process improvement
Historical analysis is the table-stakes portion of B/R reporting--without it, a vendor really doesn't have a product. Historical analysis reports on the spectrum of B/R issues, from job reporting to device reporting to the grouping of job-related problems. For example, this analysis should enable the storage admin to identify unusually problem-plagued elements and other items by exception. These might include application groups (e.g., application servers, clients, media servers), tape libraries and devices, file systems and databases. Even without more sophisticated event-correlation analysis, an administrator will be able to take steps to improve the worst-performing areas. The key element is effective filtering of information to allow the admin to home in on relevant events.
Historical reporting usually feeds into root-cause analysis, the first area where IT should look to improve operations. Hardware infrastructure is often the culprit, but problem areas may include poorly scheduled jobs, spindle contention in database volumes and poor media management. Each B/R reporting product can identify at least some of these root-cause elements. But root-cause analysis is not as easy as it sounds once you get beyond basic tape device/media failure. Although network bottlenecks and database design issues lie beyond the boundaries of storage management, resolving them is often critical to improving backup performance. SRM tools have the breadth to address some of these non-storage issues.
Most products, regardless of category, scan the B/R application's database, catalog and error log through APIs, command line interfaces (CLIs) or a combination of the two. This capability amounts to little more than a report writer, but it produces very useful information.
Homegrown scripts can also be used to get B/R information. In many cases, IT organizations don't want to abandon scripts because they're tailored to the organization in ways that no off-the-shelf package can be. For example, Tavata Enterprise Storage Manager (TESM) can incorporate user script output into its data collection database for additional reporting and data mining.
The WysDM product from SysDM has a very intriguing reporting function called the "Top 10 worst-performing components" of the IT environment. Although these lists will never be seen on the Late Show with David Letterman, a storage administrator will find them very useful for prioritizing actions to improve B/R.
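A ranking of this kind can be approximated from job history alone. The sketch below is not WysDM's method; it assumes a hypothetical list of (component, status) job records and simply counts failed jobs per component.

```python
from collections import Counter


def worst_components(job_records, limit=10):
    """Rank components by number of failed jobs, worst first.

    job_records is a hypothetical iterable of (component, status) pairs,
    where status is the string "failed" or "ok".
    """
    failures = Counter(
        component for component, status in job_records if status == "failed"
    )
    return failures.most_common(limit)


# Illustrative data only; the component names are made up.
records = [
    ("tape-drive-3", "failed"),
    ("media-server-1", "ok"),
    ("tape-drive-3", "failed"),
    ("db-client-7", "failed"),
    ("tape-drive-3", "failed"),
    ("db-client-7", "failed"),
    ("media-server-1", "failed"),
]
ranking = worst_components(records, limit=2)
```

A real tool would likely weight failures by severity or data volume; a raw count is only the simplest starting point.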
Within historical analysis, event correlation is the most sophisticated and useful element. Event correlation goes beyond identifying what broke by linking events such as poor network performance and bottlenecks to failed backup jobs. The Servergraph Product Suite stacks a variety of charts based on time and allows the administrator to correlate events using a movable vertical line that connects all of the charts. Although Servergraph's charts can be a bit cryptic to read, the capability can be a powerful one once you're accustomed to the view.
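The idea behind time-based correlation can be sketched in a few lines. This is an illustration, not Servergraph's implementation: the record layouts, the job and event names, and the five-minute slack are all assumptions.

```python
from datetime import datetime, timedelta


def correlate(failed_jobs, infra_events, slack=timedelta(minutes=5)):
    """Pair each failed job with infrastructure events that overlap it in time.

    failed_jobs: list of (job_name, failure_time) tuples.
    infra_events: list of (description, start_time, end_time) tuples.
    Both layouts are assumptions made for this sketch.
    """
    matches = {}
    for job_name, failed_at in failed_jobs:
        matches[job_name] = [
            description
            for description, start, end in infra_events
            if start - slack <= failed_at <= end + slack
        ]
    return matches


# Illustrative data only.
jobs = [
    ("nightly-db", datetime(2004, 3, 1, 1, 30)),
    ("fileserver", datetime(2004, 3, 1, 4, 0)),
]
events = [
    ("switch port congestion", datetime(2004, 3, 1, 1, 0), datetime(2004, 3, 1, 1, 45)),
]
linked = correlate(jobs, events)
```

Overlap in time suggests a link but does not prove cause; the administrator still has to confirm that the job actually used the affected path.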
Performance tuning is one of the most vexing elements of B/R. The basics of performance tuning begin with a bit of math with which the storage administrator determines how much data must be moved in what period of time, yielding a GB/min value. The following questions must be answered:
- Can the available tape drives handle the load?
- Is network bandwidth sufficient?
- Does the backup server have sufficient capacity?
- Can the disk arrays deliver the data fast enough?
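The arithmetic behind the first question can be written down directly. In this minimal sketch the data volume, the backup window and the per-drive rate are made-up figures, and the function names are hypothetical.

```python
import math


def required_throughput_gb_per_min(data_gb, window_hours):
    """Return the sustained rate needed to move data_gb within window_hours."""
    if window_hours <= 0:
        raise ValueError("backup window must be positive")
    return data_gb / (window_hours * 60)


def drives_needed(required_gb_per_min, drive_gb_per_min):
    """Return the minimum number of drives that can sustain the required rate."""
    if drive_gb_per_min <= 0:
        raise ValueError("drive throughput must be positive")
    return math.ceil(required_gb_per_min / drive_gb_per_min)


# Assumed example: 4,800 GB to move in an 8-hour window,
# with drives that each sustain 1.8 GB/min.
rate = required_throughput_gb_per_min(4800, 8)
drives = drives_needed(rate, 1.8)
```

The same required rate can then be compared against network, backup server and disk array throughput to answer the remaining questions.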
Several products offer performance-related information, including WysDM, TESM, Tek-Tools' Profiler Rx and Bocada's BackupReport. For example, tape drive pools can become unbalanced: some drives are 100% utilized and jobs are suspended while waiting for an available drive, while other drives sit idle. WysDM is one product that can identify such an imbalance. Similarly, Profiler Rx offers a module to monitor Fibre Channel switch performance, which can be useful for determining throughput bottlenecks within a backup zone.
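A drive pool imbalance of this kind is easy to express as a rule. The sketch below does not reflect how WysDM detects it; the utilization figures, drive names and the 95%/10% thresholds are assumptions.

```python
def find_imbalance(utilization, busy=0.95, idle=0.10):
    """Return (saturated, idle) drive lists if a pool has both, else two empty lists.

    utilization maps a drive name to the fraction of the backup window
    during which the drive was in use (0.0 to 1.0).
    """
    saturated = sorted(d for d, u in utilization.items() if u >= busy)
    idle_drives = sorted(d for d, u in utilization.items() if u <= idle)
    if saturated and idle_drives:
        return saturated, idle_drives
    return [], []


# Illustrative pool: two drives pinned at full use while one sits idle.
pool = {"drive-a": 1.00, "drive-b": 0.97, "drive-c": 0.05, "drive-d": 0.60}
saturated, idle_drives = find_imbalance(pool)
```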
Predictive analysis and forecasting go beyond historical analysis and performance tuning, which address the operational aspects of B/R improvement. Knowing what went wrong, and even why it went wrong, isn't enough to create a best-practice organization. Problems must not be merely solved, but avoided. Predictive analysis and forecasting address the risk aspect of B/R. In that sense, they are the most important element to the organization as a whole.
Using B/R reporting tools to mitigate risk begins with identifying holes in backup schemes. As noted earlier, organizations can miss recently added data elements in the backup schedule. Surprisingly, only a few B/R reporting tools have the capability to find missed data elements. Servergraph offers a "missed not scheduled" report that identifies missed backup nodes. CA's BrightStor SRM offers the most robust capability in this regard. When combined with ARCserve, BrightStor SRM not only finds backup holes, but creates policies that automatically detect and back up groups by application or users. (Note: Some B/R applications have what amounts to a "back up all" option that will back up everything on disk or node.)
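At its simplest, finding backup holes is a comparison between what exists and what is scheduled. The following sketch assumes two hypothetical inputs, an inventory of known nodes and the list of nodes in the backup schedule; it is not the method used by any product named above.

```python
def unscheduled_nodes(known_nodes, scheduled_nodes):
    """Return nodes that exist in the environment but appear in no backup schedule."""
    return sorted(set(known_nodes) - set(scheduled_nodes))


# Illustrative data: "app-server-9" was added recently and never scheduled.
inventory = ["db-1", "file-2", "app-server-9", "mail-4"]
schedule = ["db-1", "file-2", "mail-4"]
missed = unscheduled_nodes(inventory, schedule)
```

The hard part in practice is keeping the inventory itself current, which is where the broader discovery features of SRM tools come in.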
Predictive analysis can also ensure service level agreement (SLA) compliance. Several B/R reporting products chart when a backup job was completed against the SLA's time allocated for the job. Using this chart, administrators can project when the backup window will be exceeded. By using performance tuning analysis, administrators can identify why the window will be exceeded. Moreover, service level compliance applies to recovery analysis as well as disaster recovery planning. Several products provide "recovery preparedness" reports, including BrightStor SRM, TESM, BackupReport and QNet.
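One simple way to project when a backup window will be exceeded is to fit a straight line to recent job durations. This sketch uses an ordinary least-squares fit as an assumption; the products mentioned may use different models, and the durations and window below are invented.

```python
def projected_crossing_night(durations_min, window_min):
    """Fit a line to nightly durations and project when it reaches the window.

    durations_min: job duration in minutes for consecutive nights (night 0, 1, ...).
    Returns the night index (possibly fractional) at which the fitted trend
    reaches window_min, or None if the trend is flat or shrinking.
    """
    n = len(durations_min)
    if n < 2:
        raise ValueError("need at least two nights of history")
    mean_x = (n - 1) / 2
    mean_y = sum(durations_min) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(durations_min))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    return (window_min - intercept) / slope


# Illustrative history: the job grows by 10 minutes a night against a 400-minute window.
crossing = projected_crossing_night([300, 310, 320, 330, 340], 400)
```

A result smaller than the number of nights already observed means the trend line has already crossed the window.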
WysDM offers a unique capability whereby it can report on "deviations from normal operations" even when those operations complete successfully. For example, the program sends an alert to the storage administrator if a job takes longer than normal. After these jobs are identified, root-cause analysis can be applied to solve the problem before an actual failure occurs.
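A basic version of this check compares the latest run against the job's own history. The sketch below is an assumption about how such a rule might look, not a description of WysDM: it flags a job whose duration exceeds the historical mean by more than k sample standard deviations.

```python
from statistics import mean, stdev


def is_abnormal(history_min, latest_min, k=3.0):
    """Return True if latest_min exceeds the historical mean by more than k standard deviations."""
    if len(history_min) < 2:
        raise ValueError("need at least two past runs")
    threshold = mean(history_min) + k * stdev(history_min)
    return latest_min > threshold


# Illustrative history in minutes; the job normally takes about an hour.
history = [60, 62, 58, 61, 59]
slow_run_flagged = is_abnormal(history, 90)
normal_run_flagged = is_abnormal(history, 63)
```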
Process improvement is the key to achieving best-practice operations, regardless of how many tools, reports or forecasts are applied. Bocada specifically markets BackupReport as a product that identifies B/R process deficiencies. BackupReport ties SLA and business metrics to operational results; it also detects bottlenecks, device errors and other problems, and correlates them to specific targets. Armed with this information, an administrator can begin to refine and improve company backup processes over a sustained period of time.
Any discussion of storage issues would seem incomplete without addressing compliance-related issues, and B/R reporting products are no exception. Indeed, these products can provide a unique capability that allows organizations to perform compliance risk analysis, regardless of the regulation.
Two factors are at the heart of all regulatory compliance: data retention for a specific period of time, and the ability to restore that data if required. B/R reporting tools can help by verifying how many images of a specific data set exist, at what points in time the data images were taken (i.e., are the images continuous or are there unexplained gaps), and the likelihood of a successful restore.
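The image-count and gap checks described here reduce to a scan over backup dates. This minimal sketch assumes a hypothetical list of image dates for one data set and a required interval of one day between images; real retention rules vary by regulation.

```python
from datetime import date, timedelta


def find_coverage_gaps(image_dates, max_interval=timedelta(days=1)):
    """Return (earlier, later) pairs of consecutive images spaced more than max_interval apart."""
    ordered = sorted(set(image_dates))
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > max_interval:
            gaps.append((earlier, later))
    return gaps


# Illustrative data: daily images with January 4 and 5 missing.
images = [
    date(2004, 1, 1),
    date(2004, 1, 2),
    date(2004, 1, 3),
    date(2004, 1, 6),
    date(2004, 1, 7),
]
image_count = len(set(images))
gaps = find_coverage_gaps(images)
```

The likelihood of a successful restore cannot be derived from dates alone; it would need media and verification data that this sketch does not model.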
Although many products have the raw capabilities to provide compliance-readiness testing, the necessary correlation capabilities and reporting elements are still being developed. Profiler Rx, BackupReport, BrightStor SRM and TESM are currently marketed as specifically having this capability at some level. Servergraph has a very interesting feature the company calls a "hog factor" report, which can tell how many images of a given data set are stored. However, the company markets this feature primarily for reducing and eliminating unnecessary save sets. It's not a huge leap to correlate this information back to a compliance-related backup set, but Servergraph doesn't currently take the final step to identify gaps in the save set coverage.
Disk-to-disk backup is rapidly becoming more popular, and B/R vendors claim to be able to differentiate between actual tape volumes and virtual tape volumes. These virtual volumes will be included during restore readiness testing in most cases. Moreover, Profiler Rx, as well as SRM products, can report on logical snapshot and other replication copies. However, they don't correlate these disk-based data copies to B/R.
Despite significant benefits associated with improved backup operations, this isn't a pain-free process. Organizations should plan for a gradual phase-in of changes, beginning with a reduction in the daily care and feeding of backup. This will reduce wasted effort and gradually improve processes. Think of it like you would an aspirin: gradual relief until the pain is gone.