Snapshots can enhance QA by reducing testing cycle times, and provide a basis for efficient updates of disaster recovery (DR) sites. But implementing snapshots requires planning and application analysis to determine what type of snapshot to use and how many to take.
A snapshot is a point-in-time image of a collection of data. In general, snapshots can be broken into two types: full copy and differential copy. While most full-copy snapshots use similar techniques, there are several ways to implement differential snapshots, each with its advantages and disadvantages.
With a full-copy snapshot, the entire contents of a data set are copied to a different set of spindles. Copying the data may be done on a continuous basis, so the creation of the actual snapshot can happen quickly. Examples include EMC Corp.'s TimeFinder/Mirror, Hitachi Data Systems' (HDS) ShadowImage and Network Appliance (NetApp) Inc.'s SnapMirror. The primary advantage of this approach is that the entire data set is maintained on a different set of spindles, providing a high level of protection if primary data is destroyed. It also allows the data in the snapshot to be accessed without placing any additional I/O load on the primary data spindles, a distinct advantage when the snapshot is used as the basis of a tape backup.
Simplified capacity management is another advantage of a full-copy snapshot. If you have a 1TB source, then you need a 1TB destination. As long as the destination expands with the source, everything continues to work.
The disadvantage of using a completely separate set of spindles is the cost. Each snapshot requires 100% of the disk space associated with the source, and costs can escalate sharply if multiple snapshots must be maintained. An alternative is to store the full-copy snapshot on lower-cost media such as ATA disk. The viability of doing this depends on why the snapshot was created in the first place. If the snapshot is used as the basis for tape backup, then ATA could be a good fit. If the snapshot is to be used in the event of a disaster, ATA disks may not provide the performance required in a production environment.
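The capacity arithmetic for full-copy snapshots can be sketched as follows. This is a minimal illustration: the per-terabyte prices are hypothetical placeholders chosen only to show how cost scales with the number of copies and the media tier. They are not figures from any vendor.

```python
def full_copy_capacity_tb(source_tb: float, copies: int) -> float:
    """Each full-copy snapshot needs 100% of the source capacity."""
    if source_tb < 0 or copies < 0:
        raise ValueError("source size and copy count must be non-negative")
    return source_tb * copies


def full_copy_cost(source_tb: float, copies: int, price_per_tb: float) -> float:
    """Media cost of keeping `copies` full-copy snapshots on one tier."""
    return full_copy_capacity_tb(source_tb, copies) * price_per_tb


# Hypothetical prices per TB, for illustration only.
PRIMARY_TIER_PRICE = 10_000.0
ATA_TIER_PRICE = 3_000.0

for copies in (1, 2, 4):
    primary = full_copy_cost(1.0, copies, PRIMARY_TIER_PRICE)
    ata = full_copy_cost(1.0, copies, ATA_TIER_PRICE)
    print(f"{copies} full copies of a 1TB source: "
          f"{primary:,.0f} on primary-class disk, {ata:,.0f} on ATA")
```

Cost grows linearly with the number of copies on either tier, which is why the purpose of the snapshot (backup source vs. DR) should drive the choice of media.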
Differential-copy snapshots
Differential-copy snapshots store only changes to a file system. As existing files are deleted or modified, the disk blocks associated with those changes are preserved. With this approach, significantly less disk space is required to maintain the snapshot. Examples of this approach are EMC's TimeFinder/Snap, HDS' Copy-on-Write Snapshot, Microsoft Corp.'s Volume Shadow Copy Service (VSS) and NetApp's Snapshot. Depending on the technique used, the creation of a differential snapshot can happen almost instantaneously.
The key advantage of this approach is that less disk space is required. The storage needed for maintaining the snapshot could be as little as 3% of the primary volume size, which could allow many snapshots to be stored. But the primary advantage of differential-copy snapshots is also their main disadvantage.
A differential snapshot requires access to the primary volume's data blocks to reconstruct the point-in-time image of the file system because all unchanged data exists only on the primary spindles. There are three considerations when choosing differential snapshots: performance, permanence and space management.
PERFORMANCE. Performance can be a concern when the snapshot is created and later when it's accessed. Depending on the underlying technique used, creating the snapshot can result in additional I/O load on the primary (production) data spindles. Accessing the snapshot later also generates reads against those spindles, because unchanged blocks exist only there. This load could potentially impact users or systems accessing that data.
PERMANENCE. A differential snapshot requires access to the primary data set's blocks to reconstruct the point-in-time image of the file system. If the primary volumes are lost, then all associated snapshots are also lost.
SPACE MANAGEMENT. This is the trickiest of the three. A predetermined amount of capacity needs to be set aside to accommodate the changed data, which we'll call the "snap reserve." How a storage system reacts when the snap reserve isn't large enough depends on the manufacturer. Typically, one of two things happens: the oldest snapshot is deleted or free space in the primary file system is consumed. Neither is an appealing choice.
[Sidebar: Differential snapshots: Copy-on-write]
The size of the snap reserve is a function of how quickly the data changes in the primary file system and the number of snapshots to be retained. The rate of change is a function of what the primary file system is used for. For NAS, Windows home directories and public shares have a data change rate that's typically 3% to 5% of the volume size per day. For SANs, the change rate is dependent on the application using the storage.
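A rough sizing calculation based on these two factors might look like the sketch below. It assumes one snapshot per day and that each day's changes must be preserved in full (no credit for the same blocks changing on successive days), so it's a planning estimate rather than a vendor formula. The function name and the 1,000GB volume are hypothetical.

```python
def snap_reserve_estimate_gb(volume_gb: float,
                             daily_change_rate: float,
                             days_retained: int) -> float:
    """Estimate snap reserve as volume size x daily change rate x days kept.

    Assumes one snapshot per day and that every day's changed blocks
    must be preserved separately.
    """
    if not 0 <= daily_change_rate <= 1:
        raise ValueError("daily_change_rate must be a fraction between 0 and 1")
    if volume_gb < 0 or days_retained < 0:
        raise ValueError("volume size and retention must be non-negative")
    return volume_gb * daily_change_rate * days_retained


# The 3% to 5% daily change rates cited for NAS home directories and shares.
for rate in (0.03, 0.05):
    reserve = snap_reserve_estimate_gb(1000, rate, 7)
    print(f"{rate:.0%} daily change, 7 days retained: about {reserve:.0f}GB reserve")
```

For SAN volumes, the change rate would have to be measured for the specific application before an estimate like this means anything.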
Techniques for creating snapshots
FULL COPY. All full-copy snapshot techniques are essentially the same. In many ways, they're just another use of traditional volume mirroring. The first step is to initialize the relationship between the two data sets. Most systems allow modifications to be made to the primary data set while the initialization occurs. When the initialization is complete, the two data sets are kept in sync using synchronous or asynchronous updates. When it's time to use the snapshot, the application accessing the primary data set must be quiesced. The method used to quiesce the app is application dependent. If the updates are done asynchronously, a final update to the snapshot data set is done. Finally, the relationship is broken and the destination becomes writable so that another host can access the data. Many products allow access to the destination in a read-only mode without breaking the relationship with the primary data set (see "Full-copy snapshots").
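The full-copy lifecycle just described (initialize, keep in sync, final update, break) can be modeled with a toy class. This is a sketch under stated assumptions, not any product's implementation: the class and method names are hypothetical, data sets are plain dictionaries, updates are asynchronous, and quiescing the application is left to the caller.

```python
class MirrorPair:
    """Toy model of a full-copy snapshot built on asynchronous mirroring."""

    def __init__(self, primary):
        self.primary = dict(primary)
        self.mirror = {}
        self.pending = {}              # changes not yet applied to the mirror
        self.state = "uninitialized"

    def initialize(self):
        """Copy the whole primary data set to the mirror."""
        self.mirror = dict(self.primary)
        self.pending.clear()
        self.state = "synchronized"

    def write(self, key, value):
        """Application write; queued for the mirror while the pair is in sync."""
        self.primary[key] = value
        if self.state == "synchronized":
            self.pending[key] = value

    def flush(self):
        """Apply queued asynchronous updates to the mirror."""
        self.mirror.update(self.pending)
        self.pending.clear()

    def split(self):
        """Final update, then break the relationship (caller quiesces the app first)."""
        if self.state != "synchronized":
            raise RuntimeError("pair must be synchronized before it can be split")
        self.flush()
        self.state = "split"
        return self.mirror             # now independent of the primary


pair = MirrorPair({"block0": "a", "block1": "b"})
pair.initialize()
pair.write("block1", "B")              # queued asynchronously
snapshot = pair.split()                # final update applied, relationship broken
pair.write("block0", "A")              # no longer reaches the snapshot
print(snapshot)
```

After the split, the snapshot holds the data as of the final update, and later primary writes leave it untouched.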
DIFFERENTIAL: COPY-ON-WRITE. The most popular technique for creating a differential snapshot is the copy-on-write method. The first step is to specify a snap reserve area, which is usually on a different set of spindles than the primary data set.
The second step is to initialize or enable the snapshot service. This notifies the storage subsystem to track changes to the primary data set. As changes are made to the primary data set, the blocks of data affected by those changes are copied to the snap reserve location. There are two things to consider when using copy-on-write snapshots. The first involves performance. A file delete will cause all the blocks associated with that file to be read off the primary data spindles and then written to the snap reserve area. An overwrite of a block of data in the primary data set will result in two additional I/Os: a read of the old data block and a write of the old data block to the snap reserve location. The second involves the size of the snap reserve. Most storage systems will proactively delete the oldest snapshot to free up space when the snap reserve nears full capacity (see "Differential snapshots: Copy-on-write," above).
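Both considerations can be seen in a toy copy-on-write model. The sketch below is illustrative only: the class name, the chained layout (an older snapshot reads through newer ones before falling back to the primary) and the "delete oldest when the reserve is full" policy are assumptions made for the example, and a file delete is modeled simply as a write to each of the file's blocks.

```python
class CowVolume:
    """Toy copy-on-write volume with a fixed-size snap reserve (in blocks)."""

    def __init__(self, blocks, reserve_blocks):
        self.blocks = list(blocks)
        self.reserve_blocks = reserve_blocks
        self.snapshots = {}   # snapshot id -> {block index: preserved old data}, oldest first
        self.extra_ios = 0    # I/Os added by snapshot maintenance
        self._next_id = 0

    def create_snapshot(self):
        snap_id = self._next_id
        self._next_id += 1
        self.snapshots[snap_id] = {}   # nothing is copied at creation time
        return snap_id

    def reserve_used(self):
        return sum(len(saved) for saved in self.snapshots.values())

    def write(self, index, data):
        if self.snapshots:
            newest_id = next(reversed(self.snapshots))
            if index not in self.snapshots[newest_id]:
                # Reserve full: delete the oldest snapshots until there is room.
                while self.snapshots and self.reserve_used() >= self.reserve_blocks:
                    del self.snapshots[next(iter(self.snapshots))]
                if newest_id in self.snapshots:
                    # Read the old block, then write it to the reserve.
                    self.snapshots[newest_id][index] = self.blocks[index]
                    self.extra_ios += 2
        self.blocks[index] = data

    def read_snapshot(self, snap_id, index):
        if snap_id not in self.snapshots:
            raise KeyError(f"snapshot {snap_id} no longer exists")
        # Use the first preserved copy at or after this snapshot, else the primary.
        for sid, saved in self.snapshots.items():
            if sid >= snap_id and index in saved:
                return saved[index]
        return self.blocks[index]


vol = CowVolume(["a", "b", "c", "d"], reserve_blocks=2)
first = vol.create_snapshot()
vol.write(0, "A")                      # first overwrite: 2 extra I/Os
vol.write(0, "AA")                     # already preserved: no extra I/O
print(vol.read_snapshot(first, 0), vol.extra_ios)
second = vol.create_snapshot()
vol.write(1, "B")                      # reserve now full (2 of 2 blocks)
vol.write(2, "C")                      # forces deletion of the oldest snapshot
print(sorted(vol.snapshots), vol.extra_ios)
```

In this run the first snapshot is dropped as soon as a third block must be preserved, which is the kind of silent loss of a recovery point that makes reserve sizing important.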
NETAPP'S APPROACH. NetApp uses a slightly different approach for its snapshot technology. It first specifies a snap reserve, a percentage of the primary data volume set aside for snapshots. The snap reserve is used only for space accounting on a NetApp device. Next, the snapshot is created. As with the copy-on-write approach, the snapshot view points to the existing data blocks in the primary data set. As changes are made to the primary data set, the blocks of data affected by the changes remain in place. This is the key difference between the copy-on-write approach and NetApp's. The new information is written to free space in the primary data set (even if snapshots are turned off). The obvious benefit is that enabling snapshots doesn't affect write performance. But the size of the snap reserve must be monitored; if the snap reserve is exceeded, free space in the primary data set is consumed. Excessive deletes or overwrites, coupled with maintaining several snapshots, can cause the primary data set to reach 100% capacity even though no "new" data has been written to the volume (see "Network Appliance goes virtual," this page).
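The space behavior described here can be illustrated with a toy model in which new data always goes to free blocks and a snapshot is only a saved copy of the block map. This is a sketch of the general idea as the paragraph describes it, not NetApp's actual implementation. The class and method names are hypothetical, and the snap reserve (which the text says is only an accounting device) is left out.

```python
class RedirectVolume:
    """Toy volume: writes go to free blocks; snapshots pin the old blocks in place."""

    def __init__(self, total_blocks):
        self.free = set(range(total_blocks))
        self.contents = {}    # physical block -> data
        self.active = {}      # logical block -> physical block (live file system)
        self.snapshots = {}   # snapshot id -> frozen copy of the block map
        self._next_id = 0

    def _release_if_unreferenced(self, phys):
        pinned = phys in self.active.values() or any(
            phys in block_map.values() for block_map in self.snapshots.values())
        if not pinned:
            del self.contents[phys]
            self.free.add(phys)

    def write(self, logical, data):
        if not self.free:
            raise OSError("volume full")
        phys = min(self.free)          # new data lands in free space
        self.free.remove(phys)
        self.contents[phys] = data
        old = self.active.get(logical)
        self.active[logical] = phys
        if old is not None:
            self._release_if_unreferenced(old)   # old block stays if a snapshot uses it

    def create_snapshot(self):
        snap_id = self._next_id
        self._next_id += 1
        self.snapshots[snap_id] = dict(self.active)
        return snap_id

    def delete_snapshot(self, snap_id):
        block_map = self.snapshots.pop(snap_id)
        for phys in set(block_map.values()):
            self._release_if_unreferenced(phys)

    def read_snapshot(self, snap_id, logical):
        return self.contents[self.snapshots[snap_id][logical]]

    def used_blocks(self):
        return len(self.contents)


vol = RedirectVolume(total_blocks=6)
for i, data in enumerate(["a", "b", "c"]):
    vol.write(i, data)
snap = vol.create_snapshot()
for i, data in enumerate(["A", "B", "C"]):
    vol.write(i, data)                 # overwrites only; no copy of old data is made
print(vol.used_blocks(), vol.read_snapshot(snap, 0))
try:
    vol.write(0, "AA")
except OSError as exc:
    print("write failed:", exc)
vol.delete_snapshot(snap)
print(vol.used_blocks())
```

The live file system never holds more than three blocks of data, yet the volume fills because the snapshot pins the three original blocks. Deleting the snapshot releases them.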
Data recovery points
The most popular use of snapshots is to create more frequent recovery points, lowering the recovery point objective (RPO) below what can be achieved with tape-based solutions. This is the domain of the differential-copy snapshot.
[Table: A sampling of SAN and NAS snapshot products]
With a SAN, creating several differential-copy snapshots throughout the day provides multiple recovery points. Instead of recovering the production database from tape and rolling 10 hours of logs forward, you may be able to recover the database from a 45-minute-old snapshot and roll the logs forward from that point. This assumes a logical corruption, not a physical loss of data. This example is not as straightforward as the NAS example. Here are several points to consider.
- How easy is it to create a consistent image of the application's data? For a useful snapshot, the data must be in a consistent state when a snapshot is created.
- Does the application have a "hot backup" mode? Most database apps have this mode to ensure data files and associated logs are in a state that allows them to be backed up cleanly. If the app lacks such a mode, it should be shut down for the data in the snapshot to be useful.
- What impact does the hot backup mode have on application performance? If the hot backup mode has a significant negative effect on performance, then a synchronous full-copy snapshot may be a better choice.
- How long will it take to create the snapshot? If it takes 30 minutes to create the snapshot, it doesn't make sense to do a snapshot every hour. Ideally, snapshots should take only seconds.
- Are third-party software tools needed to create a consistent image? Sometimes third-party tools provide a better interface for coordinating snapshot creation. If a packaged product meets a company's needs, the ease of management typically outweighs the utility's cost.
- Will the data files need to be checked before they can be used in production? In many cases, this check can be done immediately after the snapshot is created. Consistency checks typically generate considerable I/O and CPU load. Ideally, the check is done by another host that connects to the snapshot.
- For large applications, will multiple snapshots need to be created at the same time and across several different storage devices? Large apps typically have several storage locations spread across SAN, NAS or locally attached disks. An analysis is required to determine how "synchronous" the snapshot creation associated with all of the storage locations needs to be. For example, document management systems typically have a database on a SAN or DAS disk that stores the locations of documents. The documents may be stored on a NAS server. It's important that every document reference in the database snapshot have a corresponding document in the NAS snapshot. Otherwise, if the system is restored to a previous state using a snapshot, it will be necessary to verify that documents referenced in the database exist on the NAS device.
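The cross-device check in the document management example comes down to a set comparison. The sketch below assumes the document paths referenced by the database snapshot and the file listing of the NAS snapshot have already been extracted into plain lists. The function name and sample paths are hypothetical.

```python
def missing_documents(db_references, nas_files):
    """Return paths referenced in the database snapshot but absent from the NAS snapshot."""
    return sorted(set(db_references) - set(nas_files))


# Hypothetical sample data: the NAS snapshot was taken before c.pdf was stored.
db_snapshot_refs = ["/docs/a.pdf", "/docs/b.pdf", "/docs/c.pdf"]
nas_snapshot_files = ["/docs/a.pdf", "/docs/b.pdf"]

print(missing_documents(db_snapshot_refs, nas_snapshot_files))  # ['/docs/c.pdf']
```

An empty result means every reference in the database snapshot resolves to a document in the NAS snapshot. Anything else has to be repaired or recovered from another source after a restore.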
QA and debug
Quality assurance (QA) and debug environments benefit significantly from both full-copy and differential-copy snapshots. The following steps are typical of a QA/debug scenario:
1. Create a dump of production data
2. Copy dump to test bed
3. Load dump data into test system
4. Run through first pass of QA/debug tests
5. Reload original dump data
6. Repeat Steps 3 and 4 until done
7. Repeat Steps 1 through 5 for next QA cycle
Many times, Step 1 represents only a portion of the production environment. This is because it would be too expensive to recreate the production environment, or loading the dump data into the test system takes too long. Using a combination of full-copy and differential snapshots, the process could change to the following:
1. Update full-copy snapshot of production data
2. Break mirror relationship
3. Create a differential snapshot of full-copy snapshot data
4. Load dump data into test system
5. Create a differential snapshot of the test system configuration
6. Run through first pass of QA/debug tests
7. Restore test system to differential snapshot created in Step 5
8. Repeat Steps 6 and 7 until done
9. Repeat Steps 1 through 8 for next QA cycle
Eliminating the production system dump and multiple reloads of the dump data shortens the QA/debug cycle. This makes it easier to test using the full system data, and allows more testing in a shorter timeframe.
Disaster recovery
Full-copy snapshots are the obvious choice for DR; however, differential-copy snapshots can play a role. There are two update methods for a DR site: asynchronous and synchronous. Asynchronous updates are less expensive and can be as simple as shipping backup tapes to a remote site. Synchronous updates are typically reserved for the most critical applications. The disadvantage of synchronous replication is that it replicates a corrupted database as quickly as it does a clean database. Differential snapshots can be used at both the DR and production sites. If the primary site becomes unavailable, and/or the data is corrupted, a rollback could be performed at either site.
Another advantage to using differential snapshots at the DR site is that one of the snapshots could be used for tape backup. That way, production data is kept synchronous and a point-in-time image of the data can be sent to tape without requiring a third copy of the data.
NAS snapshots
Here is one way to use snapshots in a NAS storage environment with, for example, a data center and two remote sites. The two remote sites use differential-copy snapshots to provide seven days' worth of local file recovery. Five hourly snapshots are created during the day to allow the user to recover files lost the same day they were created. Each night, a full-copy snapshot is replicated to the data center. This full-copy snapshot would send only the blocks of data that have changed at the remote site since the last update to the data center. The NAS device at the data center is used as the source of the tape backup. This lets you eliminate tape backup systems at the remote sites for the data stored on the NAS devices. The NAS device at the data center also keeps 30 differential snapshots of the remote site data. This combination allows files to be recovered directly from disk for a period of 30 days. Any data older than 30 days would have to be recovered from tape.
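The retention tiers in this scenario can be expressed in a few lines. This is a sketch under the assumptions stated in the text (seven days of snapshots at the remote site, 30 days at the data center, tape beyond that). The function names and sample dates are hypothetical.

```python
from datetime import datetime, timedelta

REMOTE_RETENTION_DAYS = 7        # differential snapshots kept at the remote site
DATA_CENTER_RETENTION_DAYS = 30  # differential snapshots kept at the data center


def recovery_source(age_days: float) -> str:
    """Where a file version from `age_days` ago would be recovered from."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= REMOTE_RETENTION_DAYS:
        return "remote-site snapshot"
    if age_days <= DATA_CENTER_RETENTION_DAYS:
        return "data-center snapshot"
    return "tape"


def expired(snapshot_times, now, retention_days):
    """Snapshots older than the retention window, i.e. candidates for deletion."""
    cutoff = now - timedelta(days=retention_days)
    return [taken for taken in snapshot_times if taken < cutoff]


now = datetime(2024, 1, 31)
nightly = [now - timedelta(days=d) for d in range(40)]   # 40 nightly snapshots

for age in (2, 14, 45):
    print(f"{age} days old -> {recovery_source(age)}")
print(len(expired(nightly, now, DATA_CENTER_RETENTION_DAYS)), "snapshots past retention")
```

A snapshot taken exactly 30 days ago is still inside the window in this sketch. Whether the boundary is inclusive is a policy choice, not something the scenario specifies.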
This scenario also provides a DR location for the remote site data. In the event of a disaster, users could be redirected to the data center. The ease of doing this depends on how users access the local NAS data. The redirection can happen almost transparently using Microsoft's Distributed File System (DFS). For Unix clients, a combination of Domain Name System (DNS) updates and IP aliases can achieve similar results.
Many NAS devices allow self-directed restores, giving users access to the snapshot file system to recover files on their own. Many organizations won't offer this capability because of the confusion it may cause. Also, each NAS device has its own access method and naming conventions for snapshots. In a multivendor environment, the user education process wouldn't be worth the effort. But Microsoft's Shadow Copy Client may change that. The Shadow Copy Client provides a common recovery interface for NAS snapshots via the familiar Properties dialog box. When this client is installed, a new Previous Versions tab appears when displaying the properties of a directory or file on a network drive. This interface is currently supported by Windows Server 2003 and NetApp NAS devices used with a Windows desktop client.
Database snapshots
Using snapshots in a database environment allows a database administrator to roll back a production database to one of several recovery points during the day. The storage at a secondary site is used for QA and DR, and forms the basis of the nightly tape backup.
The choice of using differential snapshots at the primary site depends on the update and delete rate associated with the database. A high rate of changes and deletes may force the use of full-copy snapshots. As with all applications, the performance impact of using differential snapshots needs to be assessed prior to implementation. If asynchronous updates are going to the secondary site, the update interval should be coordinated with the differential snapshot schedule at the primary site. This reduces the amount of time the database needs to be in hot backup mode.
Coordinating snapshots to support critical applications that typically span multiple servers, operating systems and storage systems can be daunting. But the big win may be with mid-tier applications, which often require far less analysis to determine the impact of snapshots and may use only tape for recovery.