The skinny on data deduplication

Data deduplication technology identifies and eliminates redundant data, drastically reducing the amount of disk needed to store the deduped data. This inside look reveals the differences among deduplication systems and explains what key features you need to consider.

Data deduplication products drastically cut the amount of data you need to back up, but the way these systems reduce and store data varies.

Data deduplication changes all the rules in secondary storage. Most notably, it belies the rules that say every gigabyte of primary storage is represented by 10 GB of backups, and the canard that tape is cheaper than disk.

There's been a flurry of debate about deduplication--both for and against--that has generated confusion, fear, uncertainty, doubt and misconceptions about the technology. Simply put, deduplication technologies identify and eliminate redundant data, significantly reducing the amount of disk needed to store the deduped data. Though various deduplication systems eliminate redundant data differently, all of the approaches look at the data on a subfile (block) level to determine if the system has seen the data before. If it hasn't, it stores it. If it has seen the data before, it ensures that it's stored only once and all other references to that data will just be pointers.

For example, a deduplication system would store the following data only one time:

  • The same file backed up from five different servers
  • Five percent of a weekly full backup if 95% of it was duplicate blocks of data stored last week
  • A daily full backup of a database that doesn't support incremental backups (most of it would be duplicate blocks from the day before)
  • Incremental backups of files that change constantly, such as a spreadsheet that's updated every day
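The store-once, point-thereafter behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fixed 4 KB block size, the use of SHA-1 as the identifier and the `DedupStore` class are all hypothetical and do not describe any vendor's product.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products choose their own


class DedupStore:
    """Toy block store: each unique block is kept once, keyed by its hash."""

    def __init__(self):
        self.blocks = {}   # hash -> block bytes (stored only once)
        self.backups = {}  # backup name -> list of hashes (the "pointers")

    def write(self, name, data):
        refs = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha1(block).hexdigest()
            # Store the block only if it has not been seen before.
            self.blocks.setdefault(digest, block)
            refs.append(digest)
        self.backups[name] = refs

    def read(self, name):
        # Rebuild the backup by following its pointers.
        return b"".join(self.blocks[d] for d in self.backups[name])


store = DedupStore()
monday = b"A" * 4096 + b"B" * 4096 + b"C" * 4096
tuesday = b"A" * 4096 + b"B" * 4096 + b"D" * 4096  # only the last block changed
store.write("monday", monday)
store.write("tuesday", tuesday)
print(len(store.blocks))  # 4 unique blocks stored for 6 blocks backed up
```

Both backups remain fully restorable, yet the two unchanged blocks occupy disk only once.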

Is CAS the same as deduplication?
Content-addressable storage (CAS) uses the same techniques as deduplication systems to uniquely identify data, but it has a very different purpose. As its name implies, a CAS system creates an address for a particular file or e-mail based on its content. To do this, a CAS system computes a single hash (e.g., MD5 or SHA-1) for each file or e-mail, and then uses that as the unique identifier for that object. When the file or e-mail is retrieved from the CAS system, the hash is recalculated and compared against the original value to verify that the data didn't change.

The purpose of the unique identifier is to verify immutability. Its primary purpose isn't to eliminate redundant data; however, this is done in some CAS systems. When data is eliminated, it's done only at the file level. A deduplication system looks at data on a much more granular level. For example, several versions of the same PowerPoint file would result in several different files in a CAS system. A deduplication system would recognize that much of each version was the same, and store only the new, unique blocks each time.
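A minimal sketch of the CAS idea, assuming SHA-1 as the identifier and an in-memory dictionary as the object store (the `ContentAddressedStore` class is hypothetical, not any product's API):

```python
import hashlib


class ContentAddressedStore:
    """Toy CAS: an object's address is the hash of its entire content."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        address = hashlib.sha1(content).hexdigest()
        self._objects[address] = content
        return address

    def get(self, address: str) -> bytes:
        content = self._objects[address]
        # Recalculate the hash on retrieval to verify immutability.
        if hashlib.sha1(content).hexdigest() != address:
            raise ValueError("content no longer matches its address")
        return content


cas = ContentAddressedStore()
v1 = cas.put(b"slide deck, version 1")
v2 = cas.put(b"slide deck, version 2")
print(v1 != v2)  # True: two versions of a file are two separate objects
```

Because the whole object is hashed, a one-byte change produces a new address and a full second copy, which is why file-level CAS saves far less space than block-level deduplication.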

Perhaps the biggest benefit deduplication brings to the table is the ability to have onsite and offsite backups without touching a single tape. A deduplicating virtual tape library (VTL) stores only the new, unique blocks from each night's backups. Those new, unique blocks could then be easily replicated to a second VTL residing outside the main data center; replication becomes more practical when you're replicating only new, unique blocks.

Data can be deduplicated at the target or source. A system that deduplicates at the target, such as a VTL, uses your current backup software. The backup system operates as usual, and the target identifies and eliminates redundant data sent by the backup system.

To use deduplication at the source, you must install backup client software from the deduplication vendor. That client then communicates with a backup server running the same software. If the client and server determine that data on the client has already been stored on the backup server, that data isn't sent to the backup server, saving disk space and network bandwidth.
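The client/server exchange might look something like the following sketch. The `BackupServer` class, its `missing` and `upload` methods and the use of SHA-1 are assumptions made for illustration; real products use their own protocols.

```python
import hashlib


class BackupServer:
    """Toy backup server that remembers every chunk it has stored."""

    def __init__(self):
        self.chunks = {}

    def missing(self, digests):
        # Tell the client which chunks the server does not yet hold.
        return [d for d in digests if d not in self.chunks]

    def upload(self, digest, chunk):
        self.chunks[digest] = chunk


def client_backup(server, chunks):
    """Send only the chunks the server lacks; return the bytes sent."""
    by_digest = {hashlib.sha1(c).hexdigest(): c for c in chunks}
    sent = 0
    for digest in server.missing(list(by_digest)):
        server.upload(digest, by_digest[digest])
        sent += len(by_digest[digest])
    return sent


server = BackupServer()
first = client_backup(server, [b"os files", b"app files"])
second = client_backup(server, [b"os files", b"new data"])
print(first, second)  # 17 8: the second backup sends only the new chunk
```

Only hashes cross the network for data the server already holds, which is where the bandwidth savings come from.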

All deduplication systems have three primary tasks: fingerprinting, redundancy identification and redundancy elimination.

During the fingerprinting phase, the data is split into chunks, which are essentially blocks of various sizes. We'll refer to this process as chunking. The purpose of fingerprinting is to look at the incoming data to see if it's similar to previous data so that it may be chunked in a way that will result in the greatest amount of commonality. If you can imagine someone lining up two nearly identical fingerprints on top of each other, you'll get the basic idea, as well as an understanding of why this is called the fingerprinting stage.
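One widely described way to produce variable-size chunks is content-defined chunking, where a chunk boundary is placed wherever a rolling value computed over the last few bytes matches a pattern. The sketch below illustrates that general technique only; it isn't a description of how any particular vendor fingerprints data, and the `chunk` function, its window size, mask and minimum chunk size are all assumed values.

```python
def chunk(data: bytes, window: int = 16, mask: int = 0x3F, min_size: int = 32):
    """Split data where the sum of the last `window` bytes matches `mask`."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte that left the window
        long_enough = i + 1 - start >= min_size
        if long_enough and (rolling & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # whatever remains is the final chunk
    return chunks


sample = bytes((i * 37 + 11) % 251 for i in range(2000))
pieces = chunk(sample)
print(b"".join(pieces) == sample)  # True: chunking never loses data
```

Because boundaries depend on the bytes themselves rather than on fixed offsets, inserting data near the start of a backup tends to shift only nearby boundaries, so later chunks can still match what was stored before.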

Most fingerprinting systems are content-agnostic; that is, they look at what the backup image looks like, not necessarily what's contained within the backup image. Some fingerprinting systems, however, are content-aware, meaning they interpret the backup image and can view its content as it was originally backed up. This allows the system to fingerprint to a more granular level using file names, path names and other metadata (see "Is CAS the same as deduplication?").

Redundancy identification
The next step, redundancy identification, takes each chunk and determines if it has been seen and stored before. If so, it will just create another reference to the chunk. If not, it will store the chunk in the data store. There are four basic methods used to identify redundant chunks.

SHA-1: Originally (and still) used as a method for creating cryptographic signatures for security purposes, SHA-1 creates a 160-bit value that's considered statistically unique for each chunk of data. If two chunks have the same SHA-1 hash, they should contain the same information.

MD5: MD5 is a 128-bit hash that was also designed for cryptographic purposes. Although many security experts are recommending the use of stronger hashes for cryptographic reasons, that doesn't diminish its value for use in data deduplication.

Custom: Some vendors use custom methods to identify unique data. For example, they might have their own hash function that's used to identify redundancy candidates. Content-aware systems can use methods other than hashing to identify redundancy.

Bit-level comparison: The best way to ensure two chunks of data are the same is to perform a bit-level comparison of the two blocks. The downside to this method is the I/O required to read and compare both blocks.

Some vendors use multiple methods to identify redundant chunks. For example, Diligent Technologies Corp. and Sepaton Inc. use a custom method to identify redundancy candidates, and then follow with a bit-level comparison. FalconStor Software uses SHA-1 to identify redundant blocks, and its VTL can be configured to run an additional MD5 check on any redundancy candidates.
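Combining a hash lookup with a bit-level comparison could be sketched as follows. The `is_duplicate` function and the dictionary-based index are hypothetical; the point is only that the hash selects a redundancy candidate and the byte-for-byte check confirms it.

```python
import hashlib


def is_duplicate(candidate: bytes, stored: dict) -> bool:
    """Return True only if the chunk matches a stored chunk bit for bit."""
    digest = hashlib.sha1(candidate).digest()
    existing = stored.get(digest)
    if existing is None:
        return False              # no redundancy candidate with this hash
    return existing == candidate  # bit-level comparison of the two chunks


stored = {hashlib.sha1(b"abc").digest(): b"abc"}
print(is_duplicate(b"abc", stored))  # True
print(is_duplicate(b"abd", stored))  # False
```

The extra comparison costs a read of the stored chunk, which is the I/O penalty mentioned above.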

Hash collisions occur when two different chunks produce the same hash. It's widely acknowledged in cryptographic circles that a determined hacker could create two blocks of data that would have the same MD5 hash. If a hacker could do that, they might be able to create a fake cryptographic signature. That's why many security experts are turning to SHA-1. Its bigger key space makes it much more difficult for a hacker to crack. However, at least one group has already been credited with creating a hash collision with SHA-1.

The ability to forcibly create a hash collision means absolutely nothing in the context of deduplication. What matters is the chance that two random chunks would have a hash collision. With 128-bit and 160-bit key spaces, the odds of that happening are 1 in 2^128 with MD5 and 1 in 2^160 with SHA-1. That's roughly 1 in 10^38 and 1 in 10^48, respectively. If you assume that there's less than a yottabyte (1 billion petabytes, or 10^24 bytes) of data on the planet Earth, then the number of possible SHA-1 values is more than 10^24 times greater than the number of bytes in the known computing universe.

Let's compare those odds with the odds of an unrecoverable read error on a typical disk--approximately 1 in 100 trillion, or 1 in 10^14. Worse still is data miscorrection, where error-correcting codes step in and believe they have corrected an error, but miscorrect it instead. Those odds are approximately 1 in 10^21. So you have a 1 in 10^21 chance of writing data to disk, having the data written incorrectly and not even knowing it. Everybody's OK with these numbers, so there's little reason to worry about the 1 in 10^48 chance of a SHA-1 hash collision.
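The arithmetic behind these comparisons is easy to check. The snippet below takes the disk error figures quoted above as given and uses only exact integer math for the key spaces; the variable names are ours.

```python
md5_space = 2 ** 128           # possible MD5 values
sha1_space = 2 ** 160          # possible SHA-1 values
read_error_odds = 10 ** 14     # 1 in 100 trillion, as quoted in the text
miscorrection_odds = 10 ** 21  # as quoted in the text

print(f"{md5_space:.2e}")   # 3.40e+38
print(f"{sha1_space:.2e}")  # 1.46e+48

# How many times less likely is a random SHA-1 collision than each disk error?
print(f"{sha1_space / read_error_odds:.2e}")     # 1.46e+34
print(f"{sha1_space / miscorrection_odds:.2e}")  # 1.46e+27
```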

If you want to talk about the odds of something bad happening and not knowing it, keep using tape. Everyone who has worked with tape for any length of time has experienced a tape drive writing something that it then couldn't read. Compare that to successful deduplication disk restores. According to Avamar Technologies Inc. (recently acquired by EMC Corp.), none of its customers has ever had a failed restore. Hash collisions are a nonissue.

Redundancy elimination
Once a deduplication device has identified a redundant chunk of data, it must decide how to record the existence of that chunk. There are two ways to do that. Reverse referencing keeps the original occurrence of the chunk and makes each additional occurrence a pointer to it. Forward referencing writes the latest occurrence of the chunk to the system, then makes the previous occurrence a pointer to the most recent one. It's unclear at this time whether either of these methods will have an impact on the performance of older or newer restores. Be sure to test this feature when evaluating a potential deduplication solution: determine if there's a performance difference when restoring newer or older versions of the file system or database you're backing up.
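The difference between the two referencing schemes can be shown with a toy model in which every occurrence of a chunk is an entry that holds either the data or a pointer to another entry. The entry layout and the function names `add_reverse`, `add_forward` and `resolve` are assumptions made for this sketch.

```python
def add_reverse(entries, data):
    """Reverse referencing: keep the first copy; later copies point back to it."""
    for index, entry in enumerate(entries):
        if entry == ("data", data):
            entries.append(("pointer", index))
            return
    entries.append(("data", data))


def add_forward(entries, data):
    """Forward referencing: the newest copy holds the data; the old one points to it."""
    new_index = len(entries)
    for index, entry in enumerate(entries):
        if entry == ("data", data):
            entries[index] = ("pointer", new_index)
    entries.append(("data", data))


def resolve(entries, index):
    """Follow pointers until the entry that actually holds the data is reached."""
    kind, value = entries[index]
    while kind == "pointer":
        kind, value = entries[value]
    return value


rev, fwd = [], []
for _ in range(3):  # the same chunk arrives in three successive backups
    add_reverse(rev, b"chunk")
    add_forward(fwd, b"chunk")
print(rev)  # [('data', b'chunk'), ('pointer', 0), ('pointer', 0)]
print(fwd)  # [('pointer', 1), ('pointer', 2), ('data', b'chunk')]
```

In this toy model the newest backup reads its data directly under forward referencing, while the oldest backup does so under reverse referencing, which is one reason to test restores of both old and new versions.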

In-band vs. out-of-band
Another important differentiator among deduplication products is whether they work in-band or out-of-band. That is, do they deduplicate the data as they're writing it to the array or VTL (in-band), or is deduplication a secondary process that may run asynchronously (out-of-band). There are advantages and disadvantages to each method.

The advantage to the in-band method is that it works with the data only one time. The drawback is that, depending on the implementation, it could slow down the incoming backup. The in-band camp argues that while they'll probably slow down the backup somewhat, when they're done, they're done. The out-of-band camp still has important work to do after the backup completes: deduplicate the data.

The out-of-band method has to write the original data, read it, identify its redundancies, and then write one or more pointers if it's redundant. The advantage to this is that you can apply more parallel processes (and processors) to the problem, whereas the in-band method can apply only one process per backup stream. The disadvantage is that the data is written and read more than once, and the multiple reads and writes could cause contention for disk. In addition, the out-of-band method requires slightly more disk than an in-band setup because an out-of-band system must have enough disk to hold the latest set of backups before they're deduplicated. The out-of-band camp counters that slowing down the original backup is unacceptable, and that they'll be able to deduplicate the data in time for tomorrow's backup.
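The two approaches can be contrasted with a small sketch. The function names, the list used as a staging area and the dictionary used as the deduplicated store are all hypothetical simplifications.

```python
import hashlib


def inband_backup(chunks, store):
    """In-band: deduplicate each chunk as it arrives, before it is written."""
    for chunk in chunks:
        store.setdefault(hashlib.sha1(chunk).digest(), chunk)


def outofband_backup(chunks, staging):
    """Out-of-band, step 1: land the backup on disk exactly as received."""
    staging.extend(chunks)


def outofband_dedupe(staging, store):
    """Out-of-band, step 2: read the staged data back and deduplicate it later."""
    while staging:
        chunk = staging.pop()
        store.setdefault(hashlib.sha1(chunk).digest(), chunk)


nightly = [b"a" * 10, b"a" * 10, b"b" * 10]

inband_store = {}
inband_backup(nightly, inband_store)

staging, outofband_store = [], {}
outofband_backup(nightly, staging)
staged_before = len(staging)  # all three chunks occupy disk until the pass runs
outofband_dedupe(staging, outofband_store)
print(len(inband_store), staged_before, len(outofband_store))  # 2 3 2
```

Both approaches end with the same two unique chunks; the out-of-band path simply needs room for the full backup in the meantime and touches the data a second time.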

You probably shouldn't dismiss a vendor simply because it uses in-band or out-of-band methods, but definitely test the different deduplication methods to determine how fast they work in your environment. Remember to test the product against many slower backups as well as a smaller number of backups where speed matters. Some systems perform well for single streams, but don't scale for many streams. Some work well only when you send them many streams, but don't perform well with a very fast single stream. Finally, test the deduplication product with enough data to see whether it will handle the amount of data you back up every day. If it doesn't get the deduplication job done every day in time for the next night's backup, you're going to be in trouble.

One final area to consider is whether the vendor's implementation of deduplication is scalable beyond a single instance. Multiple instances of deduplication engines are nowhere near as effective as a single, large deduplication engine. While there's some data, such as the operating system and applications, that's common among all systems, backups sometimes move between different targets in a large backup system. If those multiple targets don't share a single, large deduplication engine, the amount of deduplication performed will be greatly reduced.
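A toy calculation shows why separate engines dedupe less effectively than one shared engine. The chunk names and the `stored_chunks` helper are invented for illustration.

```python
def stored_chunks(backups_by_engine):
    """Total chunks kept when each engine maintains its own private index."""
    return sum(len(set().union(*backups)) for backups in backups_by_engine)


week1 = ["os", "app", "db-v1"]
week2 = ["os", "app", "db-v2"]

# The two backups land on different targets with separate engines.
print(stored_chunks([[week1], [week2]]))  # 6
# Both backups land on targets that share one engine.
print(stored_chunks([[week1, week2]]))    # 4
```

When a backup moves to a target with its own engine, that engine has never seen the common chunks, so it stores them again.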

For a typical data center performing weekly full backups and daily incremental backups with a mix of database and file system data, a deduplication system could reduce the amount of storage needed for its backups by 20:1 or more. Those performing monthly full backups will see a lower deduplication ratio. But not all deduplication engines are the same (see "What RAID levels does the dedupe device support?").

Test multiple deduplication products for performance and scalability. Besides the obvious tests that the disk device can successfully back up and restore the data it's given, make sure you test single-stream backup performance as well as the maximum performance of a given disk device. Some deduplication products will perform similarly to others if you give them enough streams, but individual streams can be slowed by some methods. If you're able to implement one of these systems, it will allow you to do a lot more with disk than you could without it.

About the author: W. Curtis Preston is vice president, data protection services at GlassHouse Technologies, Framingham, MA. He's also the author of "Using SANs and NAS," "Unix Backup and Recovery" and the "Storage Security Handbook."
