This article can also be found in the Premium Editorial Download "Storage magazine: The top email archiving strategies for storage managers."
The dedupe stage is getting crowded and confusing. Avoid trouble by knowing where to look for it.
Deduplication promises to reduce the transfer and storage of redundant data, which optimizes network bandwidth and storage capacity. Storing data more efficiently on disk lets you retain data for longer periods or "recapture" data to protect more apps with disk-based backup, increasing the likelihood that data can be recovered rapidly. Transferring less data over the network also improves performance. Reducing the data transferred over a WAN connection may allow organizations to consolidate backup from remote locations or extend disaster recovery to data that wasn't previously protected. The bottom line is that data dedupe can save organizations time and money by enabling more data recovery from disk and reducing the footprint and power and cooling requirements of secondary storage. It can also enhance data protection.
Sounds great, doesn't it? But the devil is in the details. For starters, vendors offering data dedupe technology are often polarized when it comes to their approaches. There are debates over deduplication in backup software vs. in the hardware storing backup data, inline vs. post-process deduplication and even the type of hash algorithm used. These topics are discussed in vendor blogs and marketing materials, which can cause a lot of confusion among users.
The fine print
In data protection processes, dedupe is a feature available in backup apps and disk storage systems to reduce disk and bandwidth requirements. Data dedupe technology examines data to identify and eliminate redundancy. For example, data dedupe may use a hash algorithm to create a unique fingerprint for a data object and check that fingerprint against a master index. Unique data is written to storage; when the data has been seen before, only a pointer to the previously written copy is stored.
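The fingerprint-and-index idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name DedupeStore, the use of SHA-256 and the in-memory dictionary standing in for the master index are all hypothetical choices, not a description of any vendor's product.

```python
import hashlib


class DedupeStore:
    """Toy in-memory store (hypothetical): unique objects are kept once,
    and every write is recorded as a pointer (the fingerprint)."""

    def __init__(self):
        self.index = {}    # fingerprint -> stored data (stands in for the master index)
        self.catalog = []  # one pointer per object written, in write order

    def write(self, data: bytes) -> bool:
        """Store data; return True if it was new, False if only a pointer was added."""
        fingerprint = hashlib.sha256(data).hexdigest()
        is_new = fingerprint not in self.index
        if is_new:
            self.index[fingerprint] = data
        self.catalog.append(fingerprint)
        return is_new

    def read(self, position: int) -> bytes:
        """Follow the pointer recorded for the object at this write position."""
        return self.index[self.catalog[position]]

    def stored_bytes(self) -> int:
        """Bytes actually held after redundancy is removed."""
        return sum(len(data) for data in self.index.values())
```

Writing the same object twice grows the catalog by two pointers but leaves only one copy in the index, which is where the capacity savings come from.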
File-level dedupe (or single-instance storage) removes duplicated data at the file level by checking file attributes and eliminating redundant copies of files stored on backup media. This method delivers less capacity reduction than other methods, but it's simple and fast.
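As a rough sketch of the file-level approach, the Python below keeps the first file seen for each combination of attributes and records a pointer for every later match. The FileEntry type, the single_instance function and the choice of file name, size and modification time as the attributes are assumptions made for illustration. An attribute match alone can't prove two files are identical, so a content hash would be a natural extra check.

```python
from typing import NamedTuple


class FileEntry(NamedTuple):
    """Hypothetical record of the attributes a backup app might inspect."""
    path: str
    size: int   # bytes
    mtime: int  # modification time, seconds since the epoch


def single_instance(entries):
    """Keep the first file seen for each (name, size, mtime) key.

    Returns (stored_paths, pointers), where pointers maps each redundant
    path to the path of the copy that was actually stored.
    """
    kept = {}      # attribute key -> path of the stored copy
    pointers = {}  # redundant path -> path of the stored copy
    for entry in entries:
        name = entry.path.rsplit("/", 1)[-1]
        key = (name, entry.size, entry.mtime)
        if key in kept:
            pointers[entry.path] = kept[key]
        else:
            kept[key] = entry.path
    return list(kept.values()), pointers
```

The work per file is a single dictionary lookup, which is why this method is simple and fast; the cost is that a one-byte change to a file makes the whole file look new.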
Deduplicating at the sub-file level (block level) carves the data into chunks. In general, the block or chunk is "fingerprinted" and its unique identifier is then compared to the index. With smaller block sizes, there are more chunks and, therefore, more index comparisons and a higher potential to locate and eliminate redundancy (and produce higher reduction ratios). One tradeoff is I/O stress, which can be greater with more comparisons; in addition, the index grows larger with smaller chunks, which could decrease backup performance. Recovery performance can also suffer because the chunks have to be reassembled to restore the data.
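The chunk-size tradeoff is easy to see in a small Python sketch. It assumes fixed-size chunks and SHA-256 fingerprints, and the function names dedupe_blocks and restore are hypothetical; products may chunk and fingerprint data quite differently.

```python
import hashlib


def dedupe_blocks(data: bytes, chunk_size: int):
    """Split data into fixed-size chunks and keep each unique chunk once.

    Returns (index, recipe): index maps fingerprint -> chunk, and recipe is
    the ordered list of fingerprints needed to rebuild the original data.
    """
    index = {}
    recipe = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        index.setdefault(fingerprint, chunk)  # one index comparison per chunk
        recipe.append(fingerprint)
    return index, recipe


def restore(index, recipe) -> bytes:
    """Reassemble the original data by following the recipe in order."""
    return b"".join(index[fingerprint] for fingerprint in recipe)


sample = b"AAAABBBBAAAACCCC"
for size in (8, 4, 2):
    index, recipe = dedupe_blocks(sample, size)
    stored = sum(len(chunk) for chunk in index.values())
    print(size, len(recipe), len(index), stored)
```

For the 16-byte sample, 8-byte chunks need two comparisons and store all 16 bytes, 4-byte chunks need four comparisons and store 12 bytes, and 2-byte chunks need eight comparisons and store six bytes. Fewer bytes are stored as the chunks shrink, but each step doubles the number of lookups and lengthens the recipe that has to be walked at restore time.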
Byte-level reduction is a byte-by-byte comparison of new files and previously stored files. While this method is the only one that guarantees full redundancy elimination, the performance penalty could be high. Some vendors have taken other approaches. A few concentrate on understanding the format of the backup stream and evaluating duplication with this "content-awareness."
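A byte-by-byte comparison of a new file against a stored one can be sketched as a simple positional delta in Python. This is a deliberately naive illustration: the function names byte_delta and apply_delta are hypothetical, and the code compares bytes only at matching offsets, so an insertion near the start of a file would make everything after it look changed.

```python
def byte_delta(old: bytes, new: bytes):
    """Compare new against old byte by byte at matching offsets.

    Returns (new_length, patches), where patches is a list of
    (offset, changed_bytes) runs that differ from the stored version.
    """
    patches = []
    i = 0
    n = len(new)
    while i < n:
        if i < len(old) and old[i] == new[i]:
            i += 1
            continue
        start = i
        # Extend the run while bytes differ or the old version has ended.
        while i < n and not (i < len(old) and old[i] == new[i]):
            i += 1
        patches.append((start, new[start:i]))
    return n, patches


def apply_delta(old: bytes, delta) -> bytes:
    """Rebuild the new version from the stored version plus the delta."""
    length, patches = delta
    # Truncate or zero-pad the old version to the new length, then patch it.
    buffer = bytearray(old[:length].ljust(length, b"\0"))
    for offset, changed in patches:
        buffer[offset:offset + len(changed)] = changed
    return bytes(buffer)
```

Every byte of the new file is examined, which hints at why the performance penalty of this method could be high compared with checking one fingerprint per chunk.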
This was first published in June 2008