Catching up with deduplication


This article can also be found in the Premium Editorial Download "Storage magazine: Surprise winner: BlueArc earns top NAS quality award honors."


Deduplication backup products differ in how they recognize and reduce duplicate data. Here's how to pick the product that will best fit into your environment.

Backup has seen the future and it's disk. As the backup target gradually but persistently changes from tape to disk, data deduplication is becoming a key component of the backup process. Because vendors implement deduplication differently, the fear, uncertainty and doubt surrounding deduplication products have increased, as have the questions about when to deploy which product.

Deduplication resides in the backup process in two primary places: backup software and disk libraries. Asigra Inc.'s Televaulting, EMC Corp.'s Avamar and Symantec Corp.'s Veritas NetBackup PureDisk are backup software products that deduplicate data at the host level, minimizing the amount of data that needs to be sent over corporate networks to backup targets or replicated to disaster recovery sites. Disk libraries from Data Domain Inc., Diligent Technologies Corp., Quantum Corp. and Sepaton Inc. deduplicate data at the target, which allows companies to deploy disk libraries without disrupting current backup processes.

With the underlying deduplication algorithms essentially the same across both sets of products, the real issues are how each product implementation affects performance and data management in the short and long term. Neither approach is yet ideal for all backup requirements, so a crossover period is emerging in which some storage managers will likely use both backup software and disk library methods for specific needs. Hidden issues such as undeduplicating data to store it on tape, integration with enterprise backup software products, and the ability to selectively turn off deduplication to accommodate specific compliance requirements and preexisting encryption conditions should be evaluated closely to determine whether those issues outweigh the benefits of deduplication.

Data reduction and compression algorithms
Backup software and disk library products deduplicate data in similar ways, with most using a combination of data-reduction and compression algorithms. Both types of deduplication approaches initially identify whether chunks of data or files are the same by first performing a file-level compare or using a hashing algorithm such as MD5 or SHA-1. Unique files or data chunks are preserved, while duplicate files or data chunks may be optionally rechecked. This recheck is done using a bit-level comparison or secondary hash to ensure the data is truly a duplicate and not a rare hash collision. This first stage in the deduplication process typically reduces data stores by factors of approximately 10 or more over time.
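The identify-then-recheck flow described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the fixed 4 KB chunk size, the in-memory dictionary used as a chunk store, and the function names deduplicate and restore are all assumptions made for the example. It uses SHA-1 for the first-pass match and a byte-level comparison as the optional recheck.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size; real products vary


def deduplicate(data, store, chunk_size=CHUNK_SIZE):
    """Store each unique chunk once and return the list of chunk digests
    (a "recipe") needed to rebuild the original data."""
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha1(chunk).hexdigest()  # first-pass identification
        existing = store.get(digest)
        if existing is None:
            store[digest] = chunk  # unique chunk: preserve it
        elif existing != chunk:
            # Recheck: same hash but different bytes means a hash collision.
            raise ValueError("hash collision detected for " + digest)
        recipe.append(digest)  # duplicates are recorded by reference only
    return recipe


def restore(recipe, store):
    """Rebuild the original data from its recipe of chunk digests."""
    return b"".join(store[digest] for digest in recipe)


if __name__ == "__main__":
    store = {}
    backup = b"A" * CHUNK_SIZE * 3 + b"B" * CHUNK_SIZE
    recipe = deduplicate(backup, store)
    print(len(recipe), "chunks referenced,", len(store), "chunks stored")
```

In this sketch a backup made of three identical chunks and one different chunk is recorded as four references but only two stored chunks, which is where the reduction comes from as similar backups accumulate.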

Achieving data-reduction factors of 20 times or greater requires the product to also compress the unique deduplicated files or data chunks. To accomplish this, vendors use a lossless data compression algorithm, such as Huffman coding or Lempel-Ziv coding, which executes against the unique file or deduplicated data chunk. Compression squeezes out redundancy such as runs of leading zeros or spaces to shrink the data's footprint before it's stored.
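As a rough illustration of this second stage, the sketch below compresses a unique chunk with Python's standard zlib module, whose DEFLATE format combines Lempel-Ziv (LZ77) matching with Huffman coding. The function names and the sample chunk are assumptions for the example; the article does not say which compression library any particular product uses.

```python
import zlib


def compress_chunk(chunk):
    """Losslessly compress one unique chunk before it is stored."""
    return zlib.compress(chunk, 9)  # level 9 favors size over speed


def decompress_chunk(blob):
    """Recover the exact original bytes of a stored chunk."""
    return zlib.decompress(blob)


if __name__ == "__main__":
    # A chunk padded with zeros and spaces, the kind of redundancy
    # that compression removes.
    chunk = b"0" * 1000 + b" " * 1000 + b"customer record 42"
    packed = compress_chunk(chunk)
    print(len(chunk), "bytes before,", len(packed), "bytes after")
```

Because the compression is lossless, decompressing the stored blob returns the original chunk byte for byte, so restores are unaffected.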

This was first published in June 2007
