Industry analysts estimate that between 5% and 20% of backup jobs fail on a nightly basis, and the error messages resulting from these failed jobs can pile up faster than snow in a New England blizzard. Backup/recovery (B/R) applications can usually organize these error messages by nature and priority, but a storage administrator must still sift through possibly hundreds of messages to determine the problem's root cause, priority and remedy. An increasing number of backup reporting products on the market go beyond the reporting capabilities of backup hardware and software, and promise to lessen backup pain.
B/R reporting tools emerged in 2000-2001, driven by new vendors that have provided most of the innovation and market momentum. The newest B/R reporting tools are multivendor solutions as opposed to product-specific solutions. B/R reporting capabilities can also be found in storage resource management (SRM) products and integrated products (see "Backup reporting products").
Among multivendor products, Bocada Inc. can probably be credited with pioneering the backup reporting field. It has since been joined by other vendors, including Servergraph Inc., SysDM Inc., Tavata Software Corp. and Tek-Tools Inc. In the next three to five years, you can expect some consolidation, most likely through the acquisition of smaller vendors by larger ones.
A second group of companies, including Computer Associates (CA) International Inc. and Veritas Software Corp. (now owned by Symantec), offer multivendor B/R reporting modules as part of larger SRM suites. In fact, backup reporting sits somewhere between B/R management and SRM. Backup reporting tools are logically a subset of SRM, but are tightly linked to B/R. SRM products may cost substantially more than pure-play B/R reporting products, but they offer broader capabilities for reporting and trend analysis on disk arrays, tape management and so on.
The third group of products includes those integrated with a specific B/R product. In some cases, such as with CommVault Galaxy, from CommVault Systems Inc., and Veritas' NetBackup, the B/R reporting function is a separate module offered at an additional cost. In other cases, such as with CA's ARCserve, EMC/Legato's NetWorker and IBM's Tivoli Storage Manager (TSM), the reporting module is part of the base B/R product. CommVault's QNet product is packaged somewhat differently in that it's part of the firm's QiNetix SRM product; however, it's specific to CommVault's Galaxy backup app and tightly integrated with it.
In general, integrated products offer basic functionality like event notification, success/failure reporting and summarization. QNet expands on this functionality with capacity planning, predictive analysis, B/R costing and a "recovery readiness check." IBM's TSM module includes health monitoring, extensive reporting and a DR Manager. While this functionality may meet the day-to-day needs of many organizations, SRM and multivendor products take B/R reporting to the next level (see "Which product is best?").
The products have either agent or agentless architectures, and there are benefits and limitations to each. An agent architecture distributes a portion of the reporting app to various devices and spreads the workload around. Distributed agents can also continue to collect data even if the central server is unavailable. An agentless architecture eliminates the hassle of installing and maintaining agents on hundreds of devices.
Key B/R reporting features
There are distinct steps that build upon each other to create a more robust, less effort-intensive storage environment. They are:
- Historical analysis
- Performance tuning
- Predictive analysis and forecasting
- Process improvement
Historical analysis is the table-stakes portion of B/R reporting; without it, a vendor really doesn't have a product. Historical analysis reports on the spectrum of B/R issues, from job reporting to device reporting to the grouping of job-related problems. For example, this analysis should enable the storage admin to identify unusually problem-plagued elements and other items by exception. These might include application groups (e.g., app servers, clients, media servers), tape libraries and devices, file systems and databases. Even without more sophisticated event-correlation analysis, an administrator will be able to take steps to improve the worst-performing areas. The key element is effective filtering of information to allow the admin to home in on relevant events.
Historical reporting usually feeds into root-cause analysis, the first area where IT should look to improve operations. Hardware infrastructure is often the culprit, but problem areas may include poorly scheduled jobs, spindle contention in database volumes and poor media management. Each B/R reporting product can identify at least some of these root-cause elements. But root-cause analysis can become complex once you get beyond basic tape device/media failure. Finding network bottlenecks and database design issues extends beyond typical storage management, but it's often critical to improving backup performance. SRM tools have the breadth to address some of these non-storage issues.
Most products, regardless of category, scan the B/R application's database, catalog and error log. This functionality is enabled by APIs or command-line interfaces, or a combination of the two. This capability is essentially a report writer, but it produces useful information.
Homegrown scripts can also be used for B/R reporting. Often, IT organizations don't want to abandon scripts because they're tailored to the organization in ways that no off-the-shelf package can be. Some reporting products accommodate this; for example, Tavata Enterprise Storage Manager (TESM) can incorporate script output into its data collection database for additional reporting and data mining.
SysDM's WysDM product has an intriguing reporting function called the "Top 10 worst-performing components" of the IT environment. Although these lists aren't likely to show up on the Late Show with David Letterman, a storage administrator will find them useful for prioritizing actions to improve B/R.
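A "worst performers" list of this kind can be approximated from any job history that records a component and a status. The following is a minimal sketch, not WysDM's actual method; the record layout (`component`, `status`) and the sample component names are assumptions made for illustration.

```python
from collections import Counter

def worst_components(job_records, top_n=10):
    """Rank components by number of failed jobs, worst first."""
    failures = Counter(
        rec["component"] for rec in job_records if rec["status"] == "failed"
    )
    # Sort by failure count (descending), then by name so ties are stable.
    return sorted(failures.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Hypothetical job history; a real tool would read this from the
# B/R application's catalog or error log.
jobs = [
    {"component": "tape-drive-03", "status": "failed"},
    {"component": "media-server-1", "status": "ok"},
    {"component": "tape-drive-03", "status": "failed"},
    {"component": "client-db02", "status": "failed"},
]

print(worst_components(jobs, top_n=2))
```

Ranking by raw failure count is the simplest choice; a failure rate (failures divided by jobs run) would be fairer to heavily used components.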
Within historical analysis, event correlation is the most sophisticated and useful element. Event correlation goes beyond identifying what broke by linking events such as poor network performance and bottlenecks to failed backup jobs. The Servergraph Product Suite stacks a variety of charts based on time and allows correlation of events using a movable vertical line that connects all of the charts. The charts can be a bit cryptic to read, but the capability can be a powerful analysis tool once you're accustomed to the view.
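At its simplest, event correlation is a time-window join between failed jobs and infrastructure events. The sketch below illustrates that idea only; it is not how Servergraph or any other product implements it, and the field names (`name`, `failed_at`, `kind`, `at`) and the 30-minute window are assumptions.

```python
from datetime import datetime, timedelta

def correlate(failed_jobs, events, window=timedelta(minutes=30)):
    """Map each failed job to the events seen within `window` before it failed."""
    result = {}
    for job in failed_jobs:
        result[job["name"]] = [
            ev["kind"]
            for ev in events
            # Keep events at or before the failure, no older than the window.
            if timedelta(0) <= job["failed_at"] - ev["at"] <= window
        ]
    return result

# Hypothetical sample data.
failed = [{"name": "nightly-db", "failed_at": datetime(2024, 1, 10, 2, 15)}]
events = [
    {"kind": "network_congestion", "at": datetime(2024, 1, 10, 2, 0)},
    {"kind": "drive_error", "at": datetime(2024, 1, 10, 0, 30)},
    {"kind": "switch_reboot", "at": datetime(2024, 1, 10, 2, 20)},
]

print(correlate(failed, events))
```

Only the congestion event falls inside the window here; the drive error is too old and the switch reboot happened after the failure.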
Performance tuning is one of the most vexing elements of B/R. The basics of performance tuning begin with a bit of math that enables the storage administrator to determine how much data must be moved in what period of time, yielding a GB/min value. The following questions must be answered:
- Can the available tape drives handle the load?
- Is network bandwidth sufficient?
- Does the backup server have sufficient capacity?
- Can the disk arrays deliver the data fast enough?
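The arithmetic behind these questions is short enough to sketch: divide the data volume by the backup window to get the required GB/min, then compare that figure against what each component can sustain. The capacity figures below are made-up illustrations, not vendor ratings.

```python
def required_gb_per_min(data_gb, window_hours):
    """Throughput needed to move data_gb within the backup window."""
    return data_gb / (window_hours * 60)

def bottlenecks(required, capacities):
    """Return the components whose sustained GB/min is below the requirement."""
    return sorted(name for name, cap in capacities.items() if cap < required)

# Hypothetical environment: 4,800 GB to back up in an 8-hour window.
required = required_gb_per_min(data_gb=4800, window_hours=8)

# Assumed sustained throughput per component, in GB/min.
capacities = {
    "tape_drives": 12.0,
    "network": 7.5,
    "backup_server": 15.0,
    "disk_arrays": 9.0,
}

print(required)                           # 10.0
print(bottlenecks(required, capacities))  # ['disk_arrays', 'network']
```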
Several products offer performance-related information, including Bocada's BackupReport, Tek-Tools' Profiler Rx, TESM and WysDM. For example, tape drive pools can become unbalanced, whereby some drives are 100% utilized and jobs are suspended while waiting for an available drive, and other drives sit idle. WysDM, for one, can identify such an imbalance. Similarly, Profiler Rx offers a module to monitor Fibre Channel switch performance, which can be useful for determining throughput bottlenecks within a backup zone.
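The drive-pool imbalance described above reduces to a simple check: some drives saturated while others sit nearly idle in the same period. A rough sketch follows, assuming per-drive utilization is available as a fraction of the backup window; the thresholds and drive names are arbitrary illustrations, not values taken from WysDM.

```python
def pool_imbalance(utilization, busy=0.95, idle=0.10):
    """Return (saturated, idle) drive lists when both conditions coexist."""
    saturated = sorted(d for d, u in utilization.items() if u >= busy)
    idle_drives = sorted(d for d, u in utilization.items() if u <= idle)
    # An imbalance needs both: busy drives alone just means the pool is full.
    if saturated and idle_drives:
        return saturated, idle_drives
    return [], []

# Hypothetical utilization for one night's backup window.
utilization = {"drive-1": 1.0, "drive-2": 0.98, "drive-3": 0.05, "drive-4": 0.6}

print(pool_imbalance(utilization))
```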
Predictive analysis and forecasting go beyond historical analysis and performance tuning, which address the operational aspects of B/R improvement. Knowing what went wrong, and even why it went wrong, isn't enough to create a best-practice organization. Problems must not be merely solved, but avoided. Predictive analysis and forecasting address the risk aspect of B/R.
Using B/R reporting tools to mitigate risk begins with identifying holes in backup schemes. As noted earlier, organizations can miss recently added data elements in the backup schedule. Surprisingly, only a few B/R reporting tools can find missed data elements. Servergraph offers a "missed, not scheduled" report that identifies missed backup nodes. CA's BrightStor SRM offers the most robust capability in this regard. When combined with ARCserve, BrightStor SRM not only finds backup holes, but creates policies that automatically detect and back up groups by application or users. (Note: Some B/R applications have what amounts to a "back up all" option that backs up everything on disk or node.)
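Finding backup holes comes down to comparing what exists against what is scheduled. The sketch below shows that comparison as a set difference; the node names are hypothetical, and a real tool would pull both lists from an asset inventory and the B/R application's schedule.

```python
def unscheduled_nodes(inventory, scheduled):
    """Return nodes that exist in the inventory but appear in no backup schedule."""
    return sorted(set(inventory) - set(scheduled))

# Hypothetical inventory and schedule.
inventory = ["app01", "app02", "db01", "db02", "file01"]
scheduled = ["app01", "db01", "file01"]

print(unscheduled_nodes(inventory, scheduled))  # ['app02', 'db02']
```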
Predictive analysis can also ensure service level agreement (SLA) compliance. Several B/R reporting products chart a backup job's completion time against the time allocated by an SLA. Using this chart, administrators can project when the backup window will be exceeded. By using performance tuning analysis, administrators can identify why the window will be exceeded. Moreover, service level compliance applies to recovery analysis and disaster recovery planning. Several products provide "recovery preparedness" reports, including BackupReport, BrightStor SRM, QNet and TESM.
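Projecting when a job will outgrow its window can be done with a straight-line fit over past run times. The following is a simplified sketch using ordinary least squares; it assumes at least two evenly spaced runs and steady linear growth, which real workloads may not follow, and the sample durations are invented.

```python
import math

def first_run_over_window(durations_min, window_min):
    """Fit a line to past durations and return the index (0-based run number)
    of the first run projected to exceed the window, or None if not growing."""
    n = len(durations_min)
    mean_x = (n - 1) / 2
    mean_y = sum(durations_min) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(durations_min))
    slope = sxy / sxx
    if slope <= 0:
        return None  # Durations are flat or shrinking; no projected breach.
    intercept = mean_y - slope * mean_x
    # Smallest integer run index whose projected duration is above the window.
    return math.floor((window_min - intercept) / slope) + 1

# Hypothetical: four observed runs growing 10 minutes each, 360-minute SLA window.
print(first_run_over_window([300, 310, 320, 330], 360))  # 7
```

With runs 0 through 3 observed, a result of 7 means the window is projected to be exceeded four runs after the latest one.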
WysDM offers a unique capability to report on "deviations from normal operations" even when those operations complete successfully. For example, the program sends an alert if a job takes longer than normal. Root-cause analysis can then be applied to resolve the problem before an actual failure occurs.
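One common way to flag a job that "takes longer than normal" is to compare its latest duration against the mean and standard deviation of its history. The sketch below shows that general technique; it is not a description of WysDM's algorithm, and the threshold of three standard deviations and the sample durations are assumptions.

```python
from statistics import mean, stdev

def is_abnormal(history_min, latest_min, k=3.0):
    """True if the latest duration exceeds the historical mean by k std devs."""
    mu = mean(history_min)
    sigma = stdev(history_min)  # Sample std dev; needs at least two data points.
    return latest_min > mu + k * sigma

# Hypothetical history of successful run times, in minutes.
history = [60, 62, 58, 61, 59]

print(is_abnormal(history, 90))  # True: far outside the usual range
print(is_abnormal(history, 61))  # False: within normal variation
```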
Process improvement is the key to achieving best-practice operations, regardless of how many tools, reports or forecasts are applied. Bocada specifically markets BackupReport as a product to identify B/R process deficiencies. BackupReport ties SLA and business metrics to operational results; it also detects bottlenecks, device errors and other problems, and correlates them to specific targets. With this information, an administrator can begin to refine and improve company backup processes over a sustained period of time.
Which product is best?
When should you consider an integrated product, a multivendor product or a storage resource management (SRM) tool? Here are some rules of thumb:
- If your shop is standardized on a backup/recovery (B/R) vendor that offers integrated reporting, start there. It's often free and will give you a good feel for what you need, like and don't like before shopping for a fuller featured reporting tool.
- If you're ready to tackle a broad range of storage management reporting beyond B/R, consider an SRM tool. SRM products are more complicated and generally more expensive, but the payback from an integrated SRM product is often greater than that from a B/R reporting tool; SRM products also have other long-term advantages.
- If you have numerous B/R applications, don't want to tackle SRM or want to pursue best of breed, you should consider a multivendor backup reporting solution.
If you have three or more B/R applications (some firms have as many as six), you should consider consolidation to reduce training and maintenance costs.
Future considerations
Any discussion of storage issues wouldn't be complete without addressing compliance-related issues, and B/R reporting products are no exception. Indeed, these products can provide a unique capability that allows organizations to perform compliance risk analysis, regardless of the regulation.
Two factors are at the heart of regulatory compliance: data retention for a specific period of time, and the ability to restore that data if required. B/R reporting tools can help by verifying how many images of a specific data set exist, at what points in time the data images were taken and the likelihood of a successful restore.
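The image-count and point-in-time checks described here can be sketched as a scan for gaps in a data set's backup history over the retention period. This is an illustrative sketch only; the seven-day maximum gap and the sample dates are assumptions, and actual retention rules depend on the regulation in question.

```python
from datetime import date

def coverage_gaps(image_dates, start, end, max_gap_days=7):
    """Return (gap_start, gap_end) pairs where consecutive backup images
    within [start, end] are more than max_gap_days apart."""
    in_range = sorted(d for d in image_dates if start <= d <= end)
    # Bracket the images with the period boundaries so edge gaps are caught too.
    points = [start] + in_range + [end]
    return [
        (a, b) for a, b in zip(points, points[1:]) if (b - a).days > max_gap_days
    ]

# Hypothetical images of one data set during January 2024.
images = [date(2024, 1, 1), date(2024, 1, 8), date(2024, 1, 29)]

print(len(images))  # number of images held for the data set
print(coverage_gaps(images, date(2024, 1, 1), date(2024, 1, 31)))
```

Here the only gap reported runs from January 8 to January 29, a 21-day stretch with no image.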
While many products have the raw capabilities to provide compliance-readiness testing, the necessary correlation capabilities and reporting elements are still being developed. BackupReport, BrightStor SRM, Profiler Rx and TESM are marketed as specifically having this capability at some level. Servergraph has an interesting feature it calls a "hog factor" report, which shows how many images of a given data set are stored. However, the company markets this feature primarily for reducing and eliminating unnecessary save sets. It's not a huge leap to correlate this information back to a compliance-related backup set, but Servergraph doesn't currently take the final step to identify gaps in the save set coverage.
Disk-to-disk backup is rapidly becoming more popular, and B/R vendors claim to be able to differentiate between actual tape volumes and virtual tape volumes. These virtual volumes will be included during restore readiness testing in most cases. Moreover, Profiler Rx, as well as SRM products, can report on logical snapshot and other replication copies. However, they don't correlate these disk-based data copies to B/R.
Despite the significant benefits associated with improved backup operations, getting there isn't a pain-free process. Organizations should plan for a gradual phase-in of changes, beginning with a reduction in the daily care and feeding of backup. This will reduce wasted effort and gradually improve processes. Think of it as you would an aspirin: gradual relief until the pain is gone.