This article can also be found in the Premium Editorial Download "Storage magazine: Surprise winner: BlueArc earns top NAS quality award honors."
Deduplication backup products differ in how they recognize and reduce duplicate data. Here's how to pick the product that will best fit into your environment.
Backup has seen the future and it's disk. As the backup target gradually, but persistently, shifts from tape to disk, data deduplication is becoming a key component of the backup process. Because vendors implement deduplication differently, the fear, uncertainty and doubt surrounding deduplication products have increased, as have the questions about when to deploy which product.
Deduplication resides in the backup process in two primary places: backup software and disk libraries. Asigra Inc.'s Televaulting, EMC Corp.'s Avamar and Symantec Corp.'s Veritas NetBackup PureDisk are backup software products that deduplicate data at the host level, minimizing the amount of data that needs to be sent over corporate networks to backup targets or replicated to disaster recovery sites. Disk libraries from Data Domain Inc., Diligent Technologies Corp., Quantum Corp. and Sepaton Inc. deduplicate data at the target, which allows companies to deploy disk libraries without disrupting current backup processes.
With the underlying deduplication algorithms essentially the same across both sets of products, the real issues are how each product's implementation impacts performance and data management in the short and long term. Neither approach is yet ideal for all backup requirements, so a crossover between the two is likely as the products mature.
Data reduction and compression algorithms
Backup software and disk library products deduplicate data in similar ways, with most using a combination of data-reduction and compression algorithms. Both approaches first identify whether chunks of data or files are identical, either by performing a file-level compare or by using a hashing algorithm such as MD5 or SHA-1. Unique files or data chunks are preserved, while duplicate files or data chunks may be optionally rechecked. This recheck is done using a bit-level comparison or a secondary hash to ensure the data is truly a duplicate and not a rare hash collision. This first stage in the deduplication process typically reduces data stores by factors of approximately 10 or more over time.
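The hash-and-recheck stage described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: it hashes each chunk with SHA-1, stores unique chunks once, and performs the optional bit-level recheck before declaring a chunk a duplicate. The chunk data and function name are made up for the example.

```python
import hashlib

def deduplicate(chunks):
    """Store each unique chunk once, keyed by its SHA-1 digest.

    On a digest match, a byte-for-byte comparison rules out the
    (rare) hash collision; collision *handling* is elided here.
    """
    store = {}   # digest -> chunk bytes (the single stored copy)
    refs = []    # per-chunk reference into the store
    for chunk in chunks:
        digest = hashlib.sha1(chunk).hexdigest()
        if digest in store and store[digest] == chunk:
            refs.append(digest)      # true duplicate; nothing new stored
        else:
            store[digest] = chunk    # unique chunk; keep it
            refs.append(digest)
    return store, refs

# Three chunks arrive, but only two are unique, so only two are stored.
store, refs = deduplicate([b"block-A", b"block-B", b"block-A"])
```

The same index-by-digest idea applies whether the chunks are whole files (file-level dedup) or sub-file blocks; only the chunking policy changes.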
Achieving data-reduction factors of 20 times or greater requires the product to compress the unique deduplicated files or data chunks. To accomplish this, vendors use a lossless data compression algorithm, such as Huffman coding or Lempel-Ziv coding, which executes against each unique file or deduplicated data chunk. Compression squeezes out items like leading zeros or spaces to reduce the data to its smallest possible footprint before it's stored.
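Python's standard zlib module demonstrates this second stage: its DEFLATE format combines Lempel-Ziv matching with Huffman coding, the two lossless schemes named above. The sample chunk below is invented for illustration; the point is that repetitive content (runs of zeros and spaces) compresses sharply, and decompression recovers the original bytes exactly, which is what "lossless" guarantees.

```python
import zlib

# A deliberately repetitive chunk: leading zeros and trailing spaces.
chunk = (b"0000000000payload    ") * 100

# DEFLATE = Lempel-Ziv matching + Huffman coding (lossless).
compressed = zlib.compress(chunk, level=9)

# Lossless round trip: the original chunk is recovered bit-for-bit.
restored = zlib.decompress(compressed)
```

In a deduplicating backup product this step runs only against chunks already known to be unique, so the compression cost is paid once per stored chunk rather than once per backup.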
This was first published in June 2007