Snapshots provide many benefits across all types of applications and storage requirements. They can greatly enhance and simplify the data recovery process by reducing recovery time and providing more recovery points, which tightens the recovery point objective (RPO).
They can enhance QA by reducing testing cycle times, and provide a basis for efficient updates of disaster recovery (DR) sites. But implementing snapshots requires planning and application analysis to determine what type of snapshot to use and how many to take.
A snapshot is a point-in-time image of a collection of data. In general, snapshots can be broken into two types: full copy and differential copy. While most full-copy snapshots use similar techniques, there are several ways to implement differential snapshots, each with its advantages and disadvantages. The interesting thing about both techniques is that some of the advantages are also disadvantages.
With a full-copy snapshot, the entire contents of a data set are copied to a different set of spindles. Copying the data may be done on a continuous basis, so the creation of the actual snapshot can happen quickly. Examples include EMC Corp.'s TimeFinder/Mirror, Hitachi Data Systems' (HDS) ShadowImage and Network Appliance (NetApp) Inc.'s SnapMirror. The primary advantage of this approach is that the entire data set is maintained on a different set of spindles, providing a high level of protection if primary data is destroyed. It also allows the data in the snapshot to be accessed without placing any additional I/O load on the primary data spindles, a distinct advantage when the snapshot is used as the basis of a tape backup.
Simplified capacity management is another advantage of a full-copy snapshot. If you have a 1TB source, then you need a 1TB destination. As long as the destination is expanded with the source, everything continues to work.
The disadvantage of using a completely separate set of spindles is the cost. Each snapshot requires 100% of the disk space associated with the source, and the cost can escalate sharply if there's a need to maintain multiple snapshots. A possible solution is to store the full-copy snapshot on lower-cost media such as ATA disk. The viability of doing this depends on the motivation for creating the snapshot in the first place. If the snapshot is used as the basis of a tape backup, then ATA could be a good solution. If the snapshot is to be used in the event of a disaster, the ATA disks may not be able to provide the performance required in a production environment.
|Differential snapshots: Copy-on-write|
Differential-copy snapshots only store changes to a file system. As existing files are deleted or modified, the disk blocks associated with those changes are preserved. With this approach, significantly less disk space is required to maintain the snapshot. Examples of this approach are EMC's TimeFinder/Snap, HDS' Copy-on-Write Snapshot, Microsoft Corp.'s Volume Shadow Copy Service (VSS) and NetApp's Snapshot. Depending on the technique used, the creation of a differential snapshot can happen almost instantaneously.
The primary advantage of this approach is that less disk space is required. Depending on the application, the storage overhead for maintaining the snapshot could be as little as 3% of the primary volume size. This potentially allows many snapshots to be stored. But, as with full-copy snapshots, the primary advantage of differential-copy snapshots is also their primary disadvantage.
A differential snapshot requires access to the primary volume's data blocks to reconstruct the point-in-time image of the file system because all unchanged data exists only on the primary spindles. There are three considerations when choosing differential snapshots: performance, permanence and space management.
Performance. Performance can be a concern when the snapshot is created and later when the snapshot is accessed. Depending on the underlying technique used, the creation of the snapshot can result in additional I/O load on the primary (production) data spindles. This load could potentially impact users or systems accessing that data.
Permanence. A differential snapshot requires access to the primary data set's blocks to reconstruct the point-in-time image of the file system. If the primary volumes are lost, then all associated snapshots are also lost.
Space management. This is the trickiest of the three. A predetermined amount of capacity needs to be set aside to accommodate the changed data, which we'll call the "snap reserve." How a storage system reacts when the snap reserve is not large enough depends on the manufacturer. Typically, one of two things happens: the oldest snapshot is deleted or free space in the primary file system is consumed. Neither is an appealing choice.
The size of the snap reserve is a function of how quickly the data changes in the primary file system and the number of snapshots to be retained. The rate of change is a function of what the primary file system is used for. For network-attached storage (NAS), Windows home directories and public shares have a data change rate that's typically 3% to 5% of the volume size per day. For storage area networks (SANs), the change rate is dependent on the application using the storage.
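As a rough illustration of that relationship, the sketch below estimates a snap reserve from the daily change rate and the number of days of snapshots retained. It assumes, pessimistically, that each day's changes touch different blocks and must all be preserved; the function name and the 1TB volume are hypothetical, and real products account for space differently.

```python
def estimate_snap_reserve_gb(volume_gb, daily_change_rate, days_retained):
    """Rough reserve estimate: each retained day preserves its own changed blocks."""
    if volume_gb < 0 or days_retained < 0:
        raise ValueError("volume size and retention must be non-negative")
    if not 0.0 <= daily_change_rate <= 1.0:
        raise ValueError("daily_change_rate is a fraction between 0 and 1")
    return volume_gb * daily_change_rate * days_retained


# Hypothetical 1TB NAS volume, one week of snapshots, at the 3% and 5% rates cited above.
low_estimate = estimate_snap_reserve_gb(1000, 0.03, 7)
high_estimate = estimate_snap_reserve_gb(1000, 0.05, 7)
print(f"Snap reserve: roughly {low_estimate:.0f} GB to {high_estimate:.0f} GB")
```

Because changes often hit the same blocks repeatedly, an estimate like this tends to run high; measuring the actual change rate on the volume is the safer basis for sizing.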
|Network Appliance goes virtual|
Full copy. All full-copy snapshot techniques are essentially the same. In many ways, they're just another use of traditional volume mirroring. The first step is to initialize the relationship between the two data sets. Most systems allow modifications to be made to the primary data set while the initialization occurs. When the initialization is complete, the two data sets are kept in sync using synchronous or asynchronous updates. When it's time to use the snapshot, the application accessing the primary data set is quiesced. The method used to quiesce the application is application dependent; for example, Oracle tablespaces would be put into backup mode. If the updates are done asynchronously, a final update to the snapshot data set is done. Finally, the relationship is broken and the destination becomes writable so that another host can access the data. Many products allow access to the destination in a read-only mode without breaking the relationship with the primary data set (see Full-copy snapshots).
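The sequence just described (initialize, keep in sync, quiesce, final update, break the relationship) can be modeled in a few lines. This is a toy in-memory sketch, not any vendor's implementation: the class `MirrorPair`, its method names and the `quiesce`/`resume` callbacks are all hypothetical, and the updates are modeled as asynchronous.

```python
class MirrorPair:
    """Toy model of a full-copy snapshot built on asynchronous mirroring."""

    def __init__(self, primary_blocks):
        self.primary = list(primary_blocks)
        self.mirror = list(primary_blocks)   # initialization: full copy of the source
        self.pending = []                    # asynchronous updates not yet applied
        self.is_split = False

    def write(self, index, data):
        self.primary[index] = data
        if not self.is_split:
            self.pending.append((index, data))  # queue the change for the mirror

    def sync(self):
        for index, data in self.pending:
            self.mirror[index] = data
        self.pending.clear()

    def split_snapshot(self, quiesce, resume):
        """Quiesce the application, apply the final update, break the relationship."""
        quiesce()
        try:
            self.sync()
            self.is_split = True
        finally:
            resume()                         # always release the application
        return self.mirror


events = []
pair = MirrorPair(["a", "b", "c"])
pair.write(1, "B")                           # queued for the mirror
snapshot = pair.split_snapshot(lambda: events.append("quiesce"),
                               lambda: events.append("resume"))
pair.write(2, "C")                           # after the split: primary only
print(snapshot, pair.primary, events)
```

The point of the sketch is the ordering: the application is held quiet only for the final update and the split, and writes made after the split never reach the snapshot copy.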
Differential: Copy-on-write. The most popular technique for creating a differential snapshot is the copy-on-write method. The first step is to specify a snap reserve area, which is usually on a different set of spindles than the primary data set. Microsoft's VSS implementation, by contrast, allows a portion of the primary data disk to be set aside as the snap reserve.
The second step is to initialize or enable the snapshot service. This notifies the storage subsystem to track changes to the primary data set. As changes are made to the primary data set, the blocks of data affected by those changes are copied to the snap reserve location. There are two things to consider when using copy-on-write snapshots. The first involves performance. A file delete will cause all the blocks associated with that file to be read off the primary data spindles and then written to the snap reserve area. An overwrite of a block of data in the primary data set will result in two additional I/Os: a read of the old data block and a write of the old data block to the snap reserve location. The second involves the size of the snap reserve. Most storage systems will proactively delete the oldest snapshot to free up space when the snap reserve nears full capacity (see Differential snapshots: Copy-on-write).
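The extra I/O is easy to see in a small model. The following is a minimal in-memory sketch of copy-on-write with a single snapshot; the class name `CowVolume` and its counters are hypothetical, and real arrays track changes at a much lower level.

```python
class CowVolume:
    """Toy copy-on-write volume supporting one snapshot at a time."""

    def __init__(self, blocks):
        self.blocks = list(blocks)   # primary data set
        self.reserve = {}            # snap reserve: block index -> preserved old data
        self.snapshot_active = False
        self.extra_reads = 0
        self.extra_writes = 0

    def create_snapshot(self):
        # Nothing is copied up front, which is why creation is nearly instantaneous.
        self.reserve = {}
        self.snapshot_active = True

    def write(self, index, data):
        # Only the first change to a block after the snapshot triggers a copy.
        if self.snapshot_active and index not in self.reserve:
            old_data = self.blocks[index]   # extra read from the primary spindles
            self.extra_reads += 1
            self.reserve[index] = old_data  # extra write to the snap reserve
            self.extra_writes += 1
        self.blocks[index] = data

    def read_snapshot(self, index):
        # Changed blocks come from the reserve, unchanged blocks from the primary.
        if index in self.reserve:
            return self.reserve[index]
        return self.blocks[index]


volume = CowVolume(["a", "b", "c", "d"])
volume.create_snapshot()
volume.write(1, "B")
volume.write(1, "B2")   # second overwrite of the same block: no additional copy
snapshot_view = [volume.read_snapshot(i) for i in range(4)]
print(snapshot_view, volume.blocks, volume.extra_reads, volume.extra_writes)
```

Note that `read_snapshot` falls back to the primary blocks for anything unchanged, which is the dependency on the primary volume discussed earlier.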
NetApp's approach. NetApp uses a slightly different approach to implement its snapshot technology. It first specifies a snap reserve, a percentage of the primary data volume set aside for snapshots. The snap reserve is used only for space accounting on a NetApp device. Next, the snapshot is created. As with the copy-on-write approach, the snapshot view points to the existing data blocks in the primary data set. As changes are made to the primary data set, the blocks of data affected by those changes remain in place. This is the key difference between the copy-on-write approach and what NetApp does. The new information is written to free space in the primary data set (even if snapshots are turned off). The obvious benefit is that enabling snapshots has no effect on write performance. The size of the snap reserve must be monitored, however; if the snap reserve is exceeded, free space in the primary data set is consumed. Excessive deletes or overwrites, coupled with maintaining several snapshots, can cause the primary data set to reach 100% capacity even though no "new" data has been written to the volume (see "Network Appliance goes virtual" on this page).
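To see how overwrites alone can fill a volume under this scheme, consider the toy model below. It is a sketch of the general "leave old blocks in place, write new data to free space" idea, not NetApp's actual on-disk format; the class `RedirectOnWriteVolume` and its methods are hypothetical.

```python
class RedirectOnWriteVolume:
    """Toy volume where new data always goes to a free block."""

    def __init__(self, physical_blocks):
        self.storage = [None] * physical_blocks
        self.active = {}       # live view: logical block -> physical block
        self.snapshots = []    # each snapshot is a frozen copy of the block map

    def _blocks_in_use(self):
        used = set(self.active.values())
        for snapshot_map in self.snapshots:
            used.update(snapshot_map.values())
        return used

    def used_fraction(self):
        return len(self._blocks_in_use()) / len(self.storage)

    def create_snapshot(self):
        self.snapshots.append(dict(self.active))   # copies pointers, not data

    def write(self, logical, data):
        used = self._blocks_in_use()
        free = next((p for p in range(len(self.storage)) if p not in used), None)
        if free is None:
            raise RuntimeError("volume full")
        self.storage[free] = data    # old block stays where it is
        self.active[logical] = free

    def read(self, logical, snapshot=None):
        block_map = self.active if snapshot is None else self.snapshots[snapshot]
        return self.storage[block_map[logical]]


volume = RedirectOnWriteVolume(physical_blocks=8)
for i in range(4):
    volume.write(i, f"old{i}")
before = volume.used_fraction()      # half the volume holds data
volume.create_snapshot()
for i in range(4):
    volume.write(i, f"new{i}")       # pure overwrites, no "new" files
after = volume.used_fraction()       # snapshot still pins the old blocks
print(before, after, volume.read(0), volume.read(0, snapshot=0))
```

In this model a write costs one I/O whether or not a snapshot exists, but every overwritten block stays allocated until the snapshot that references it is deleted.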
|A sampling of SAN and NAS snapshot products|
|Implementing snapshots for e-mail storage|
The most popular use of snapshots is to create more frequent recovery points to reduce the overall RPO beyond what can be achieved with tape-based solutions. This is the domain of the differential-copy snapshot.
With a SAN, creating several differential-copy snapshots throughout the day provides multiple recovery points. Instead of recovering the production database from tape and rolling 10 hours of logs forward, you may be able to recover the database from a 45-minute-old snapshot and roll the logs forward from that point. This assumes a logical corruption, not a physical loss of data. This example is not as straightforward as the NAS example. Here are several points to consider.
- How easy is it to create a consistent image of the application's data? For the snapshot to be useful, application data needs to be in a consistent state when the snapshot is created.
- Does the application have a "hot backup" mode? Most database-type applications have a "hot backup" mode that ensures data files and associated logs are in a state that allows them to be backed up cleanly. If the application doesn't have such a mode, it should be shut down for the data in the snapshot to be useful.
- What impact does the hot backup mode have on application performance? If the application's hot backup mode will have a significant negative effect on application performance, then a synchronous full-copy snapshot may be a better alternative.
- How long will it take to create the snapshot? If it takes 30 minutes to create the snapshot, it doesn't make sense to do a snapshot every hour. Ideally, snapshots should take only seconds.
- Are third-party software tools needed to create a consistent image? Sometimes third-party tools provide a better interface for coordinating snapshot creation. Often, the same effect could be achieved by writing custom scripts. If a packaged product meets the organization's needs, the ease of management typically outweighs the cost of purchasing the utility.
- Will the data files need to be checked before they can be used in production? In many cases, this check can be done immediately after the snapshot is created. Consistency checks typically generate considerable I/O and CPU load. Ideally, the check is done by another host that connects to the snapshot.
- For large applications, will multiple snapshots need to be created at the same time and across several different storage devices? Large applications typically have several storage locations spread across SAN, NAS or locally attached disks. An analysis is required to determine how "synchronous" the snapshot creation associated with all of the storage locations needs to be. For example, document management systems typically have a database on a SAN or direct-attached storage (DAS) disk that stores the locations of documents. The documents may be stored on a NAS server. It's important that every document reference in the database snapshot have a corresponding document in the NAS snapshot. Otherwise, if the system is restored to a previous state using a snapshot, it will be necessary to verify that documents referenced in the database exist on the NAS device.
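For the document management example in the last point, the verification step amounts to comparing two lists. A minimal sketch, assuming the document paths referenced in the database snapshot and the file paths present in the NAS snapshot have already been exported as plain lists (the function name and sample paths are hypothetical):

```python
def find_missing_documents(db_references, nas_paths):
    """Return documents the database snapshot references but the NAS snapshot lacks."""
    return sorted(set(db_references) - set(nas_paths))


db_snapshot_refs = ["/docs/a.pdf", "/docs/b.pdf", "/docs/c.pdf"]
nas_snapshot_files = ["/docs/a.pdf", "/docs/c.pdf", "/docs/old.pdf"]

missing = find_missing_documents(db_snapshot_refs, nas_snapshot_files)
print(missing)
```

An empty result means the two snapshots are consistent with each other for restore purposes; extra files on the NAS side, such as `/docs/old.pdf` here, are harmless.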
QA and debug
Quality assurance (QA) and debug environments benefit significantly from both full-copy and differential-copy snapshots. The following steps are typical of a QA/debug scenario:
1. Create a dump of production data
2. Copy dump to test bed
3. Load dump data into test system
4. Run through first pass of QA/debug tests
5. Reload original dump data
6. Repeat Steps 3 and 4 until done
7. Repeat Steps 1 through 5 for next QA cycle
Many times, Step 1 represents only a portion of the production environment. This is because it would be too expensive to recreate the production environment, or loading the dump data into the test system takes too long. Using a combination of full-copy and differential snapshots, the process could change to the following:
1. Update full-copy snapshot of production data
2. Break mirror relationship
3. Create a differential snapshot of full-copy snapshot data
4. Load dump data into test system
5. Create a differential snapshot of the test system configuration
6. Run through first pass of QA/debug tests
7. Restore test system to differential snapshot created in Step 5
8. Repeat Steps 3 and 4 until done
9. Repeat Steps 1 through 5 for next QA cycle
Eliminating the production system dump and multiple reloads of the dump data shortens the QA/debug cycle. This makes it easier to test using the full system data, and allows more tests to be completed in a shorter timeframe.
|Application integration enhances snapshot value|
Full-copy snapshots are the obvious choice for DR; however, differential-copy snapshots can play a role. There are two update methods for a DR site: asynchronous and synchronous. Asynchronous updates are less expensive and can be as simple as shipping backup tapes to a remote site. Synchronous updates are typically reserved for the most critical applications. The disadvantage of synchronous replication is that it replicates a corrupted database as quickly as it does a clean database. Differential snapshots can be used at both the DR site and the production site. In the event the primary site becomes unavailable, and/or the data set is corrupted, a rollback could be performed at either the DR or production site.
An additional advantage to using differential snapshots at the DR site is that one of these snapshots could be used as the basis of a tape backup. That way, the production data is kept synchronous and a point-in-time image of the data can be sent to tape without requiring a third copy of the data.
Here is one way to use snapshots in a NAS storage environment with, for example, a data center and two remote sites. The two remote sites use differential-copy snapshots to provide seven days' worth of local file recovery. Five hourly snapshots are created during the day to allow the user to recover files lost the same day they were created. Each night, a full-copy snapshot is replicated to the data center. This full-copy snapshot would send only the blocks of data that have changed at the remote site since the last update to the data center. The NAS device at the data center is used as the source of the tape backup. This lets you eliminate tape backup systems at the remote sites for the data stored on the NAS devices. The NAS device at the data center also keeps 30 differential snapshots of the remote site data. This combination allows files to be recovered directly from disk for a period of 30 days. Any data older than 30 days would have to be recovered from tape.
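The retention tiers in this scenario reduce to a simple lookup by age. The sketch below encodes the seven-day and 30-day windows from the example; the function name and the tier labels are hypothetical.

```python
def recovery_source(age_days, local_retention_days=7, datacenter_retention_days=30):
    """Pick where to recover a file from, based on how old the needed version is."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= local_retention_days:
        return "remote-site snapshot"
    if age_days <= datacenter_retention_days:
        return "data center snapshot"
    return "tape"


for age in (0, 7, 8, 30, 31):
    print(f"version from {age} days ago: recover from {recovery_source(age)}")
```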
This scenario also provides a DR location for the remote site data. In the event of a disaster, users could be redirected to the data center. The ease of doing this depends on how users access the local NAS data. The redirection can happen almost transparently using Microsoft's Distributed File System (DFS) in the Windows space. For Unix clients, a combination of Domain Name System (DNS) updates and IP aliases can achieve similar results.
Many NAS devices allow self-directed restores. In this case, users have access to the snapshot file system and can recover files on their own. Most organizations, especially large ones, tend to keep this information from their users because of the confusion it may introduce. Additionally, each NAS device has its own access method and naming conventions for snapshots. In a multivendor environment, the user education process would not be worth the effort. With the release of Microsoft Shadow Copy Client, this may change. The Shadow Copy Client provides a common recovery interface for NAS snapshots through the familiar Properties dialog box. When this client is installed, a new tab labeled Previous Versions is shown when the properties of a directory or file on a network drive are displayed. This interface is currently supported by Windows Server 2003 and NetApp NAS devices when used in conjunction with a Windows desktop client.
Using snapshots in a database environment allows a database administrator to roll back a production database to one of several recovery points during the day. The storage at a secondary site is used for QA and DR, and forms the basis of the nightly tape backup.
The choice of using differential snapshots at the primary site depends on the update and delete rate associated with the database. A high rate of changes and deletes may force the use of full-copy snapshots. As with all applications, the performance impact of using differential snapshots needs to be understood prior to implementation. If asynchronous updates are going to the secondary site, the update interval should be coordinated with the differential snapshot schedule at the primary site. This reduces the amount of time the database needs to be in hot backup mode.
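When a rollback is needed, choosing among the day's recovery points is a matter of finding the latest snapshot taken before the corruption and rolling logs forward from there. A minimal sketch, with a hypothetical function name and made-up timestamps:

```python
from datetime import datetime, timedelta


def pick_recovery_point(snapshot_times, corruption_time):
    """Return the latest snapshot before the corruption and the log window to replay."""
    candidates = [t for t in snapshot_times if t < corruption_time]
    if not candidates:
        return None              # no usable snapshot: fall back to tape
    snapshot = max(candidates)
    return snapshot, corruption_time - snapshot


# Hypothetical schedule: snapshots every two hours during the business day.
snapshots = [datetime(2024, 1, 15, hour, 0) for hour in (8, 10, 12, 14)]
corrupted_at = datetime(2024, 1, 15, 12, 45)

snapshot, replay_window = pick_recovery_point(snapshots, corrupted_at)
print(f"Restore snapshot from {snapshot:%H:%M}, roll logs forward {replay_window}")
```

The shorter the interval between snapshots, the smaller the log window that has to be replayed, which is the trade-off against the snapshot overhead discussed above.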
Coordinating snapshots to support critical applications that typically span multiple servers, operating systems and storage systems can be daunting. But the big win may be with mid-tier applications, which often require far less analysis to determine the impact of snapshots and may use only tape for recovery.