This article can also be found in the Premium Editorial Download "Storage magazine: How to scale up with storage clusters."
Remember the backup/recovery salesman's pitch? "If there's a problem with a job, e-mail alerts, pager messages and SNMP traps will be sent immediately." It sounded like a great deal at the time, but as they say, "Be careful of what you wish for."
Industry analysts estimate that between 5% and 20% of backup jobs fail on a nightly basis, and the error messages resulting from these failed jobs can pile up faster than snow in a New England blizzard. Backup/recovery (B/R) applications can usually organize these error messages by nature and priority, but a storage administrator must still sift through possibly hundreds of messages to determine each problem's root cause, priority and remedy. An increasing number of backup reporting products on the market go beyond the reporting capabilities of backup hardware and software, and promise to lessen backup pain.
Backup reporting products
Click here for a comparison table about backup reporting products (PDF).
B/R reporting tools emerged as a category in 2000-2001, driven by new vendors that have provided most of the innovation and momentum in the market. The newest B/R reporting tools from this first group are multivendor solutions as opposed to product-specific solutions. B/R reporting capabilities can also be found in storage resource management (SRM) products and integrated products (see Backup reporting products, this page).
A second group of storage companies, including Computer Associates (CA) International Inc. and Veritas Software Corp., offers multivendor B/R reporting modules as part of a larger SRM product suite. In fact, backup reporting sits somewhere between B/R management and SRM. Backup reporting tools are logically a subset of SRM, but are tightly linked to B/R. SRM products may cost substantially more than pure-play B/R reporting products, but they also offer broader capabilities for reporting and trend analysis on disk arrays, tape management and so on.
The third group of products includes those integrated with a specific B/R product. In some cases, such as with CommVault Galaxy, from CommVault Systems Inc., and Veritas NetBackup, the B/R reporting function is a separate module available for an additional cost. In other cases, such as with CA's ARCserve, EMC/Legato's NetWorker and IBM Tivoli Storage Manager (TSM), the reporting module is included with the base B/R product. CommVault's QNet product is packaged somewhat differently in that it's part of the firm's QiNetix SRM product; however, it's specific to CommVault's Galaxy backup app and tightly integrated with it.
In general, integrated products offer fundamental functionality like event notification, success/failure reporting and summarization. QNet extends this functionality by including capacity planning, predictive analysis, B/R costing and a "recovery readiness check." IBM's TSM module includes health monitoring, extensive reporting and a DR Manager. While this functionality may meet the day-to-day needs of many organizations, SRM and multivendor products take B/R reporting to the next level (see Which product is best?).
Several vendors tout their products as either agent or agentless architectures, and there are benefits and limitations to each. An agent architecture distributes a portion of the reporting app to various devices and allocates the workload across the infrastructure. In addition, distributed agents can continue to collect data even if the central server is unavailable. An agentless architecture eliminates the hassle of installing and maintaining agents on hundreds of devices. Which solution is best for an IT organization depends on the specific situation and personal preference.
Key B/R features
There are distinct steps that build upon each other to create a more robust, less effort-intensive storage environment. They are:
- Historical analysis
- Performance tuning
- Predictive analysis and forecasting
- Process improvement
Historical analysis is the table-stakes portion of B/R reporting--without it, a vendor really doesn't have a product. Historical analysis reports on the spectrum of B/R issues, from job reporting to device reporting to the grouping of job-related problems. For example, this analysis should enable the storage admin to identify unusually problem-plagued elements and other items by exception. These might include application groups (e.g., application servers, clients, media servers), tape libraries and devices, file systems and databases. Even without more sophisticated event-correlation analysis, an administrator will be able to take steps to improve the worst-performing areas. The key element is effective filtering of information to allow the admin to home in on relevant events.
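To make reporting by exception concrete, here is a minimal Python sketch. It assumes job history has already been exported as simple (client, device, succeeded) records; the sample data, the function names and the 50% threshold are all hypothetical and don't come from any product mentioned in this article.

```python
from collections import defaultdict

# Hypothetical job records: (client, device, succeeded). The names and values
# are illustrative only, not taken from any particular B/R product.
JOBS = [
    ("app01", "lib-A", True),
    ("app01", "lib-A", False),
    ("app01", "lib-A", False),
    ("db01", "lib-B", True),
    ("db01", "lib-B", True),
    ("web01", "lib-A", True),
    ("web01", "lib-B", False),
    ("web01", "lib-B", True),
]


def failure_rates(jobs, key_index):
    """Return {element: fraction of failed jobs}, grouped by one column.

    key_index 0 groups by client, key_index 1 groups by device.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for job in jobs:
        element = job[key_index]
        totals[element] += 1
        if not job[2]:
            failures[element] += 1
    return {element: failures[element] / totals[element] for element in totals}


def exceptions(rates, threshold):
    """Keep only the elements whose failure rate exceeds the threshold."""
    return sorted(element for element, rate in rates.items() if rate > threshold)


if __name__ == "__main__":
    # Report only the clients that fail more than half the time.
    print(exceptions(failure_rates(JOBS, 0), 0.5))
```

The same two functions can be pointed at the device column instead of the client column, which is the kind of regrouping an admin would use to tell a bad client from a bad tape library.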
Historical reporting usually feeds into root-cause analysis, the first area where IT should look to improve operations. Hardware infrastructure is often the culprit, but problem areas may include poorly scheduled jobs, spindle contention in database volumes and poor media management. Each B/R reporting product can identify at least some of these root-cause elements. But root-cause analysis is not as easy as it sounds once you get beyond basic tape device/media failure. Network bottlenecks and database design issues lie beyond the boundaries of storage management, but finding them is often critical to improving backup performance. SRM tools have the breadth to address some of these non-storage issues.
Most products, regardless of category, scan the B/R application's database, catalog and error log. This functionality is enabled by APIs, command-line interfaces (CLIs) or a combination of the two. This capability is little more than a report writer, but it produces very useful information.
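As a rough illustration of the "report writer" idea, the Python sketch below parses an error log and summarizes the errors by client and message. The log format here is invented for the example; every real B/R application has its own log layout, CLI and API, so the regular expression and field names should be read as assumptions.

```python
import re
from collections import Counter

# Hypothetical log format, invented for illustration. Real B/R products each
# have their own formats, CLIs and APIs.
SAMPLE_LOG = """\
2005-04-01 02:13:07 ERROR client=app01 job=1042 msg=media write error
2005-04-01 02:15:44 INFO client=db01 job=1043 msg=backup completed
2005-04-01 02:31:10 WARN client=web01 job=1044 msg=file skipped
2005-04-01 03:02:55 ERROR client=app01 job=1045 msg=media write error
2005-04-01 03:40:12 ERROR client=web01 job=1046 msg=network timeout
"""

LINE_RE = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<severity>[A-Z]+) "
    r"client=(?P<client>\S+) job=(?P<job>\d+) msg=(?P<msg>.*)$"
)


def parse_log(text):
    """Turn each well-formed log line into a dict; skip anything else."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            records.append(match.groupdict())
    return records


def error_summary(records):
    """Count ERROR records per (client, message) pair."""
    return Counter(
        (r["client"], r["msg"]) for r in records if r["severity"] == "ERROR"
    )


if __name__ == "__main__":
    for (client, msg), count in error_summary(parse_log(SAMPLE_LOG)).most_common():
        print(f"{client}: {msg} ({count})")
```

Collapsing repeated errors into counts is what turns hundreds of raw messages into a handful of lines an administrator can act on.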
Homegrown scripts can also be used to get B/R information. In many cases, IT organizations don't want to abandon scripts because they're tailored to the organization in ways that no off-the-shelf package can be. Some products accommodate this: Tavata Enterprise Storage Manager (TESM), for example, can incorporate user script output into its data collection database for additional reporting and data mining.
The WysDM product from WysDM Software Inc. has an intriguing reporting function that lists the "Top 10 worst-performing components" of the IT environment. Although these lists will never be seen on the Late Show with David Letterman, a storage administrator will find them very useful for prioritizing actions to improve B/R.
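The idea behind a worst-performers list is simple enough to sketch in a few lines of Python. This is not how WysDM builds its report; it's a generic ranking over hypothetical failure counts, with made-up component names, meant only to show why such a list helps with prioritization.

```python
# Hypothetical failed-job counts per component over some reporting period.
FAILED_JOBS = {
    "tape-drive-3": 14,
    "media-server-1": 9,
    "client-app01": 22,
    "tape-library-A": 4,
    "client-db02": 9,
}


def worst_components(failed_jobs, n=10):
    """Return the n components with the most failures, worst first.

    Ties are broken alphabetically so the ranking is stable between runs.
    """
    ranked = sorted(failed_jobs.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]


if __name__ == "__main__":
    for rank, (component, count) in enumerate(worst_components(FAILED_JOBS, 3), 1):
        print(f"{rank}. {component}: {count} failed jobs")
```

Raw failure counts are only one possible measure; a real report might rank by failure rate, missed backup windows or data left unprotected instead.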
Within historical analysis, event correlation is the most sophisticated and useful element. Event correlation goes beyond identifying what broke by linking events such as poor network performance and bottlenecks to failed backup jobs. The Servergraph Product Suite stacks a variety of charts based on time and allows the administrator to correlate events using a movable vertical line that connects all of the charts. Although Servergraph's charts can be a bit cryptic to read, the capability can be a powerful one once you're accustomed to the view.
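A simple form of event correlation can be sketched as a time-window match: pair each failed backup with any infrastructure event that began shortly before it. The Python below does that, assuming both kinds of events carry timestamps. The events, the 15-minute window and the function name are hypothetical, and this is not a description of how Servergraph or any other product correlates events.

```python
from datetime import datetime, timedelta

# Hypothetical events; the timestamps and descriptions are made up.
NETWORK_EVENTS = [
    (datetime(2005, 4, 1, 2, 10), "switch-7 link saturation"),
    (datetime(2005, 4, 1, 5, 0), "router-2 packet loss"),
]
FAILED_BACKUPS = [
    (datetime(2005, 4, 1, 2, 13), "app01 nightly"),
    (datetime(2005, 4, 1, 3, 40), "web01 nightly"),
    (datetime(2005, 4, 1, 5, 20), "db01 nightly"),
]


def correlate(failures, events, window=timedelta(minutes=15)):
    """Pair each failed job with events that began within `window` before it.

    Closeness in time suggests a link but does not prove a cause.
    """
    pairs = []
    for failed_at, job in failures:
        for event_at, event in events:
            if timedelta(0) <= failed_at - event_at <= window:
                pairs.append((job, event))
    return pairs


if __name__ == "__main__":
    for job, event in correlate(FAILED_BACKUPS, NETWORK_EVENTS):
        print(f"{job} failed shortly after: {event}")
```

Widening the window catches slower-moving problems at the cost of more coincidental matches, which is why a visual tool that lets the admin slide a time marker across stacked charts can be easier to work with than a fixed rule.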
This was first published in April 2005