Published: 15 Jun 2003
College students may be too busy partying in the wee hours of the night to notice when the university servers go down, but Lou Ramirez isn't. Every night, administrative systems at the University of Southern California (USC), in Los Angeles, are shut down for nearly four hours while approximately 1TB of Sun Microsystems StorEdge T3 array enterprise storage is backed up. Serving more than 28,800 students and 3,800 faculty spread over 50 USC departments, the administrative systems--custom-built applications based on UniVerse databases running on more than 30 Sun servers--manage student records and grades, university financials, alumni communications and other critical information.
For Ramirez, associate director of administrative information services at USC, the four-hour backup window meant the applications were staying down much longer than she liked. Aiming to close the window to a mere crack, USC recently implemented Sun's StorEdge Availability Suite to begin doing backups using point-in-time snapshots.
The move has reduced downtime and laid the foundation for a more robust data availability model. In this model, databases are quickly duplicated using Sun's StorEdge Instant Image snapshot software. One mass data copying job will initially establish the working data set, while regular delta updates will ensure the snapshots stay up to date as data changes.
Once the snapshot has been made, USC will back it up onto tape--a process that is no longer time-sensitive because the primary system no longer has to go down during the backup. Maintenance of a complete audit trail of point-in-time backups will allow the system to be restored to its condition at any point in the past.
"It's going to provide higher reliability and protection when it comes to data recovery," Ramirez says. "We can be back up and running within half an hour using the snapshot, and get our users back onto the systems faster."
Getting your data from here to there
Both software and hardware vendors offer a number of snapshot and replication solutions. Here are the tools to look for, depending on your need.
Different types of snapshots
As data volumes steadily increase, the need to back up that data--often within shrinking timeframes mandated by a growing demand for 24 x 7 operation--has made snapshotting and replication critical tools for storage managers. Both have been available for some time as features of proprietary storage management applications, but their use has become easier thanks to their integration deeper into enterprise storage systems.
Depending on the technology in use and the dictates of the storage environment, snapshotting can take several forms. Volume-based snapshots work by maintaining a mirrored secondary volume. When the snapshot is executed, the mirror is logically split from the primary volume and can then be backed up. Because data remains internal to the storage unit, the procedure is completed within seconds or minutes, compared with minutes or hours using conventional tape-based backups.
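The split-mirror idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not any vendor's implementation: the `MirroredVolume` class and its methods are hypothetical names, and the `resync` step (copying only blocks changed since the split) is an assumption about how a mirror might be rejoined afterward.

```python
class MirroredVolume:
    """Toy model of a volume-based (split-mirror) snapshot.

    Hypothetical illustration only; not a real storage API.
    """

    def __init__(self, num_blocks):
        self.primary = [b""] * num_blocks
        self.mirror = [b""] * num_blocks
        self.split = False
        self.dirty = set()  # blocks written on the primary since the split

    def write(self, block, data):
        """Write to the primary; keep the mirror in step unless it is split."""
        self.primary[block] = data
        if self.split:
            self.dirty.add(block)
        else:
            self.mirror[block] = data

    def take_snapshot(self):
        """Logically split the mirror, which now holds a frozen copy."""
        self.split = True
        return self.mirror

    def resync(self):
        """Rejoin the mirror by copying only the blocks changed since the split."""
        for block in self.dirty:
            self.mirror[block] = self.primary[block]
        self.dirty.clear()
        self.split = False
```

The split itself only flips a flag, which is why the snapshot completes quickly; the slow tape job then reads the frozen mirror while the primary keeps taking writes.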
Providing more flexible snapshots
Snapshots are a quick way of putting some distance between the real-time data processing environment and the army of tape drives or near-line disks being rolled out to manage backups. If a snapshot is copied to faster near-line disk, it can also be used as a data source by other applications for tasks such as data mining and reporting. This approach is valuable because it provides a stable data set for analysis while the primary database continues to process transactions.
Snapshots can also be file-based, a more granular approach in which changes to specific directories or files are recorded, then duplicated onto remote copies once the snapshot is executed. This is a faster method of replicating just the data that has changed, and keeps snapshot times down to a minimum.
In a point-in-time snapshot, systems maintain an ongoing log of changes to the volume. This log--rather than the full data set itself--is duplicated at snapshot time. This approach provides a full record of data changes and the ability to roll back the data environment to any point in the past.
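A minimal Python sketch of the change-log approach follows. It is an illustration under assumptions rather than a description of any product: `LoggedVolume` and `state_at` are hypothetical names, block contents are assumed to be byte strings, and timestamps are assumed never to decrease from one write to the next.

```python
class LoggedVolume:
    """Toy volume that logs every change so earlier states can be rebuilt.

    Hypothetical illustration only; not a real storage API.
    """

    def __init__(self):
        self.blocks = {}
        self.log = []  # entries: (timestamp, block, old_value, new_value)

    def write(self, timestamp, block, data):
        """Record the old and new contents, then apply the write."""
        old = self.blocks.get(block)  # None means the block did not exist yet
        self.log.append((timestamp, block, old, data))
        self.blocks[block] = data

    def state_at(self, timestamp):
        """Return the volume contents as of `timestamp` by undoing later writes."""
        state = dict(self.blocks)
        for ts, block, old, _new in reversed(self.log):
            if ts <= timestamp:
                break  # everything earlier was already applied at that time
            if old is None:
                state.pop(block, None)
            else:
                state[block] = old
        return state
```

Only the log entries need to be duplicated at snapshot time, and walking the log backward yields the volume as it stood at any earlier timestamp without touching the live data.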
Snapshotting is a simple, elegant solution that's saving USC and many other organizations precious hours nightly. Its considerable value to customers has made it a standard feature in broader storage volume management suites from EverStor, FalconStor, Legato, StoreAge, Veritas, and many other independent software vendors (ISVs).
But they're not alone: Leading storage hardware vendors--including EMC, Hewlett-Packard, Hitachi Data Systems, IBM, Network Appliance, and Sun Microsystems--have all embedded proprietary snapshot capabilities into their respective storage boxes. This has allowed for highly optimized snapshotting, but has also restricted snapshots to use on the same host platform because there's currently no standard for snapshot structure.
Some ISVs have gone part of the way towards resolving this issue by providing support for snapshots from specific vendors. But Microsoft is aiming to eliminate this problem altogether with the creation of Volume Shadow Copy Service (VSS), which will debut in Windows Server 2003 and may become a de facto standard for snapshot structure.
Snapshotting isn't the only option for time-starved storage managers. Increasingly, it's being complemented by live replication, which synchronizes data between two or more computers in real time. Replication has long been possible through use of dedicated mirroring interface cards, within RAID arrays or as a software component on many key enterprise systems.
Both embedded and software-based approaches have their respective benefits--vendor independence is a key benefit of software solutions. For application hosting provider BlueStar Solutions, of Cupertino, CA, software replication--in the form of Veritas Volume Manager--has helped customers replicate their data between BlueStar data centers in Dallas and Phoenix.
The company's customers--which include Autodesk, eBay, and Solvay Pharmaceuticals--use a wide variety of storage devices including EMC Clariion, Hitachi 9500, IBM Fast, and other boxes totaling more than 150TB of storage. Given the broad range of devices it had to replicate between, software-based replication was the natural choice for BlueStar.
"We have customers with a RTO [Recovery Time Objective] of four hours or less, and the only way for us to do that is to give them an environment where there's replication between our data centers," says Bill Augustadt, BlueStar's CTO.
However it's implemented, replication occurs in either a synchronous or asynchronous manner. In a synchronous setup, data is replicated between primary and secondary storage arrays continuously and in real time over the fastest path available--Fibre Channel in a SAN, any IP network when data is being moved between sites or the device's backplane when replicating between volumes.
However, synchronous replication is inherently unsuited to covering longer distances--many set 10 km as the practical limit--because latency and interference on intervening cables can disrupt the smooth flow of data. Furthermore, closed-loop replication tends to work only with a second, similarly configured box from the same vendor. This may be fine for many companies, but can present problems in situations where mergers and acquisitions have introduced a heterogeneous storage environment.
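The latency side of the distance problem can be estimated with simple arithmetic. The Python sketch below covers propagation delay only and rests on assumptions: light in optical fiber travels at roughly 5 microseconds per kilometer, each synchronous write waits for at least one round trip before it is acknowledged, and the function names and the 400-microsecond local write time are illustrative values, not measurements.

```python
# Approximate one-way propagation delay of light in optical fiber.
FIBER_US_PER_KM = 5.0


def sync_write_penalty_us(distance_km, round_trips=1):
    """Extra microseconds a synchronous write waits for the remote acknowledgment."""
    return 2 * distance_km * FIBER_US_PER_KM * round_trips


def max_serial_writes_per_sec(distance_km, local_write_us, round_trips=1):
    """Upper bound on back-to-back writes when each must wait for the remote site."""
    total_us = local_write_us + sync_write_penalty_us(distance_km, round_trips)
    return 1_000_000 / total_us


if __name__ == "__main__":
    for km in (1, 10, 100, 1000):
        penalty = sync_write_penalty_us(km)
        rate = max_serial_writes_per_sec(km, local_write_us=400)
        print(f"{km:>5} km: +{penalty:.0f} us per write, about {rate:.0f} serial writes/s")
```

Real links add switch, protocol and queuing delays on top of this floor, and some protocols need more than one round trip per write, so treat the result as a best case.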
"People are really looking at disaster recovery plans and realizing that their traditional tape back or restore strategies may not meet RTOs or backup windows," says Matt Fairbanks, senior manager of technical marketing with Veritas, which incorporates replication as a pay-to-use option within its Volume Manager software.
Answering the cost question
The need to maintain an identical secondary storage array is a major problem for storage managers wanting to introduce replication. This approach presents a major expense that may be difficult to justify if the equipment is intended to just sit and wait for a disaster.
In a recent survey of storage managers, StorageTek, of Louisville, CO, found that just 9.4% were mirroring all their data; 37.7% mirrored critical systems; 35.8% mirrored some critical systems and 17.0% weren't mirroring data at all. When asked why they didn't mirror more systems, 83.3% of respondents named the cost of storage as an inhibitor to their fully mirroring data volumes, while 53.8% cited implementation costs as a culprit.
Clearly, a better value proposition will increase the chance of getting funding. If you can plan on using some of the backup data center's capacity to receive and process snapshots--particularly if those snapshots are going to return business value by being fed into data mining applications--the cost can be spread over several important applications.
In this model, the disaster recovery (DR) site becomes not just a big warehouse of idle disks, but a secondary data center where stored snapshots feed both backups and detailed analysis of point-in-time data copies by analytical applications. Recruiting all this extra computing capacity to the cause makes a DR investment a little more palatable.
"We've been seeing people saving millions of dollars over a traditional hardware-based approach," says Fairbanks. "If you deploy replication at a disaster recovery location, [the location] serves two functions. That makes it much easier to justify the cost as opposed to having a completely dark data center that's not normally used."
As the current uncertain political climate forces companies to assess and reassess their disaster recovery plans, the need to physically separate copies of data is becoming an issue. However, stubborn distance limitations--and the high cost of the fiber that gets around them--have hindered implementation of synchronous replication across long distances. Instead, more distance-tolerant asynchronous replication--embodied in standalone software solutions from various vendors--is proving to be a much better answer. (For more information on the pros and cons of synchronous and asynchronous replication, see Marc Farley's "Cost-effective business continuity.")
More bandwidth means faster replication but, of course, costs more. "We can calculate [usage] and build a pipe that allows us to meet RTOs with a master file that drains to a slave at a speed that guarantees we have no more than an hour, for example, in the pipe at a time," says BlueStar's Augustadt, who says customers typically budget for 80% utilization of their replication links. "Bandwidth determines how fast the log can drain, not how fast you write to it."
In some cases, companies assume that if they have a big database to replicate, they need a fat bandwidth pipe. That's a misconception: Replication should ideally be performing delta updates, and therefore transferring just a small subset of the entire database.
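A rough sizing calculation shows why the rate of change, not the size of the database, drives the link. The Python sketch below is a simplified model under assumptions: the function names are hypothetical, gigabytes are decimal, the 80% utilization default echoes the budgeting figure Augustadt cites, and the drain-time estimate ignores any new writes that arrive while a backlog is draining.

```python
def required_link_mbps(changed_gb_per_hour, utilization=0.8):
    """Link speed in megabits/s needed so the replication log drains as fast as it fills."""
    megabits_per_hour = changed_gb_per_hour * 8 * 1000  # decimal GB to megabits
    return megabits_per_hour / 3600 / utilization


def drain_time_hours(backlog_gb, link_mbps, utilization=0.8):
    """Hours needed to drain a backlog, ignoring writes that arrive meanwhile."""
    usable_mbps = link_mbps * utilization
    return backlog_gb * 8 * 1000 / usable_mbps / 3600


if __name__ == "__main__":
    # Illustrative numbers only: a 1TB database in which 1% of the data changes each hour.
    changed = 1000 * 0.01  # GB of deltas per hour
    link = required_link_mbps(changed)
    print(f"Deltas: {changed:.0f} GB/hour, link needed: {link:.1f} Mbit/s")
    print(f"One hour of backlog drains in {drain_time_hours(changed, link):.1f} hours")
```

Replacing the assumed change rate with measured figures from the replication log gives a first estimate of the pipe needed to keep no more than an hour of data in flight.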
Most companies focus on optimizing their replication rather than throwing bandwidth at the problem, says Kelly Polanski, vice president of application availability marketing with Legato Systems. "It's very rare that we find companies capable of investing in lots of bandwidth," she says. "That means efficient use of the network--throttling, scheduling so that the network can be shared with WAN usage--is very instrumental in getting a disaster recovery plan in place. Replication provides a way to get data offsite as fast as you can."
Getting data out the door
A high-speed metropolitan connection was just the ticket for America West, America's eighth-largest airline, which is currently setting up a data replication strategy that will improve its business continuity planning.
With some 6TB of EMC Symmetrix storage in place, America West has long relied on EMC's TimeFinder snapshotting tool to generate two hourly snapshots of its Systems Operations and Control (SOC) center, a Phoenix facility that uses custom applications to manage myriad details of every flight.
In its current configuration, the snapshots are copied from one Symmetrix volume to another. A backup server can be booted from the snapshots if necessary. "This is a key component of our business continuity plan," says Joe Beery, senior VP and CIO of America West. "These systems drive the daily operation of the airline, and they're what we consider to be our mission-critical applications."
As the company's recent SAN investment takes hold, its replication is being shifted onto four T1 lines that link the SOC with a secondary data center 6.5 miles away. Other types of data, including data from the airline's Unisys mainframe, will eventually be replicated as well by using an Inrange Technologies ESCON interface that moves the mainframe data over the T1 lines.
Snapshot and replication tools have become important allies in the fight to ensure business continuity. By mapping out business processes to identify which applications need to be fully replicated and which can be backed up less frequently, it's possible to bring a high level of redundancy to key applications at a manageable cost--and then expand the scope of data protection as disk space and corporate priorities allow.
Remember, however, that replication only works well if you have tried-and-true procedures for failing over your primary data center to the secondary site--and back again.
"It's very easy to use this stuff wrong," says James Staten, director of marketing for networked storage with Sun Microsystems. "The majority of customers think they have a DR plan in place, then realize they've never tested it and don't know for sure that it's going to work."