Hot Spots: Data deduplication drama


The dedupe stage is getting crowded and confusing. Avoid trouble by knowing where to look for it.

Data deduplication for secondary storage is prompting as much passion from vendors as confusion among users. Why? Because dedupe is one of those influential, crucial technologies that come along only every few years.

Deduplication promises to reduce the transfer and storage of redundant data, which optimizes network bandwidth and storage capacity. Storing data more efficiently on disk lets you retain data for longer periods or "recapture" capacity to protect more apps with disk-based backup, increasing the likelihood that data can be recovered rapidly. Transferring less data over the network also improves performance. Reducing the data transferred over a WAN connection may allow organizations to consolidate backup from remote locations or extend disaster recovery to data that wasn't previously protected. The bottom line is that data dedupe can save organizations time and money by enabling more data recovery from disk and by reducing the footprint, power and cooling requirements of secondary storage. It can also enhance data protection.

Sounds great, doesn't it? But the devil is in the details. For starters, vendors offering data dedupe technology are often polarized when it comes to their approaches. There are debates over deduplication in backup software vs. in the hardware storing backup data, inline vs. post-process deduplication and even the type of hash algorithm used. These topics are discussed in vendor blogs and marketing materials, which can cause a lot of confusion among users.

The fine print
The first point of confusion lies in the many ways storage capacity can be optimized. Data deduplication is often a catch-all category for technologies that optimize capacity. Archiving, single-instance storage, incremental "forever" backup, delta differencing and compression are just a few technologies or methods employed in the data protection process to eliminate redundancy and reduce the amount of data transferred and stored. Unfortunately, firms have to wade through a lot of marketing hype to understand what's being offered by vendors who toss around these terms.

In data protection processes, dedupe is a feature available in backup apps and disk storage systems to reduce disk and bandwidth requirements. Data dedupe technology examines data to identify and eliminate redundancy. For example, data dedupe may create a unique identifier (a "fingerprint") for a data object with a hash algorithm and check that fingerprint against a master index. Unique data is written to storage; for duplicate data, only a pointer to the previously written data is stored.
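To make the fingerprint-and-index idea concrete, here is a minimal Python sketch. It is an illustration only, not any vendor's implementation: the `DedupeStore` class, its in-memory dictionary index and the choice of SHA-256 as the hash algorithm are all assumptions made for the example.

```python
import hashlib


class DedupeStore:
    """Toy in-memory store (hypothetical): a master index maps fingerprints to data."""

    def __init__(self):
        self.index = {}     # fingerprint -> stored bytes (the "master index")
        self.pointers = []  # one fingerprint per object written, in write order

    def write(self, data: bytes) -> bool:
        """Record an object; return True if it was unique, False if a duplicate."""
        fingerprint = hashlib.sha256(data).hexdigest()
        is_new = fingerprint not in self.index
        if is_new:
            self.index[fingerprint] = data  # unique data is written once
        self.pointers.append(fingerprint)   # every write keeps only a pointer
        return is_new

    def read(self, position: int) -> bytes:
        """Follow the pointer for the object written at the given position."""
        return self.index[self.pointers[position]]


store = DedupeStore()
results = [
    store.write(b"quarterly report"),
    store.write(b"budget"),
    store.write(b"quarterly report"),  # duplicate: only a pointer is added
]
```

Three objects are written, but the index holds only two, and the third object is still readable through its pointer.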

Another issue is the level of granularity the dedupe solution offers. Dedupe can be performed at the file, block and byte levels. There are tradeoffs for each method, including computational time, accuracy, level of duplication detected, index size and, potentially, the scalability of the solution.

File-level dedupe (or single-instance storage) removes duplicated data at the file level by checking file attributes and eliminating redundant copies of files stored on backup media. This method delivers less capacity reduction than other methods, but it's simple and fast.
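A rough Python sketch shows why file-level dedupe is simple but limited. Note the assumptions: the `file_level_dedupe` function is hypothetical, and it fingerprints whole-file contents with SHA-256 rather than checking file attributes, purely to keep the example self-contained.

```python
import hashlib


def file_level_dedupe(files):
    """Keep one stored copy per distinct file; hypothetical helper for illustration."""
    stored = {}   # fingerprint -> the single stored copy
    catalog = {}  # file name -> fingerprint of its contents
    for name, content in files.items():
        fingerprint = hashlib.sha256(content).hexdigest()
        stored.setdefault(fingerprint, content)
        catalog[name] = fingerprint
    return stored, catalog


files = {
    "a/report.doc": b"x" * 1000,
    "b/report.doc": b"x" * 1000,           # exact copy: stored once
    "c/report_v2.doc": b"x" * 999 + b"y",  # one byte differs: stored in full
}
stored, catalog = file_level_dedupe(files)
stored_bytes = sum(len(content) for content in stored.values())
```

The exact copy costs nothing extra, but the file that differs by a single byte is stored again in its entirety, which is where the sub-file methods pick up additional savings.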

Deduplicating at the sub-file level (block level) carves the data into chunks. In general, the block or chunk is "fingerprinted" and its unique identifier is then compared to the index. With smaller block sizes, there are more chunks and, therefore, more index comparisons and a higher potential to locate and eliminate redundancy (and produce higher reduction ratios). One tradeoff is I/O stress, which can be greater with more comparisons; in addition, the size of the index will be larger with smaller chunks, which could result in decreased backup performance. Performance can also be impacted because the chunks have to be reassembled to recover the data.
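The chunk-size tradeoff can be seen in a small Python sketch. It assumes fixed-size chunks and SHA-256 fingerprints, and the `chunk_dedupe` and `reassemble` helpers are hypothetical names; real products may chunk and index data quite differently.

```python
import hashlib


def chunk_dedupe(data, chunk_size):
    """Split data into fixed-size chunks and keep one copy of each unique chunk."""
    index = {}   # fingerprint -> chunk bytes
    recipe = []  # ordered fingerprints needed to rebuild the data
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        index.setdefault(fingerprint, chunk)  # one index comparison per chunk
        recipe.append(fingerprint)
    return index, recipe


def reassemble(index, recipe):
    """Recovery has to stitch the chunks back together in order."""
    return b"".join(index[fingerprint] for fingerprint in recipe)


# Sample data: 4 KB blocks laid out as A, B, A, C.
payload = b"A" * 4096 + b"B" * 4096 + b"A" * 4096 + b"C" * 4096

for size in (8192, 4096):
    index, recipe = chunk_dedupe(payload, size)
    stored = sum(len(chunk) for chunk in index.values())
    assert reassemble(index, recipe) == payload
    print(f"chunk_size={size}: comparisons={len(recipe)}, "
          f"index entries={len(index)}, bytes stored={stored}")
```

On this sample, 8 KB chunks find no redundancy because the repeated "A" block is buried inside two different chunks. Halving the chunk size doubles the comparisons and grows the index, but it exposes the repeat and cuts the bytes stored.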

Byte-level reduction is a byte-by-byte comparison of new files and previously stored files. While this method is the only one that guarantees full redundancy elimination, the performance penalty could be high. Some vendors have taken other approaches. A few concentrate on understanding the format of the backup stream and evaluating duplication with this "content-awareness."
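As a deliberately crude illustration of byte-by-byte comparison, the Python sketch below compares a new file against a previously stored one and keeps only the bytes that differ. The `byte_delta` and `apply_delta` helpers are hypothetical, and matching only a common prefix and suffix is a simplifying assumption; it is far less thorough than what a real byte-level product would do.

```python
def byte_delta(old, new):
    """Compare byte by byte; return (prefix_len, suffix_len, changed_middle)."""
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    middle = new[prefix:len(new) - suffix]  # the only bytes that must be stored
    return prefix, suffix, middle


def apply_delta(old, delta):
    """Rebuild the new file from the stored old file plus the delta."""
    prefix, suffix, middle = delta
    tail = old[len(old) - suffix:] if suffix else b""
    return old[:prefix] + middle + tail


old_file = b"x" * 1000
new_file = b"x" * 500 + b"Y" + b"x" * 499  # a single byte changed

delta = byte_delta(old_file, new_file)
rebuilt = apply_delta(old_file, delta)
```

Here a one-byte change yields a one-byte delta, but every byte of both files had to be examined to find it, which hints at the performance penalty.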

This was first published in June 2008
