For example, one significant argument is whether inline deduplication is more efficient than post-processing dedupe. Dedupe requires processing, which takes time and resources; the real issue is where to spend that time, at the start of the backup process or at the end, and which CPU you want to absorb the overhead.
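To make the distinction concrete, here is a minimal sketch of the two approaches; the fixed block size, SHA-1 hashing and in-memory structures are illustrative assumptions, not any vendor's implementation. The dedupe logic is identical in both cases; what differs is when it runs and which system pays the CPU cost.

```python
import hashlib

BLOCK_SIZE = 8 * 1024  # illustrative fixed block size; real products vary


def _blocks(data):
    for i in range(0, len(data), BLOCK_SIZE):
        yield data[i:i + BLOCK_SIZE]


def inline_backup(data, index, store):
    """Inline dedupe: hash and filter every block as it arrives, so
    duplicates never hit disk. The CPU cost lands inside the backup window."""
    for block in _blocks(data):
        digest = hashlib.sha1(block).hexdigest()
        if digest not in index:
            index.add(digest)
            store.append(block)


def post_process_backup(data, landing_zone):
    """Post-process, step 1: land the raw backup as fast as possible."""
    landing_zone.append(data)


def post_process_dedupe(landing_zone, index, store):
    """Post-process, step 2: a background pass dedupes the landing zone
    later, after the backup window has closed."""
    while landing_zone:
        for block in _blocks(landing_zone.pop()):
            digest = hashlib.sha1(block).hexdigest()
            if digest not in index:
                index.add(digest)
                store.append(block)
```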
The City of Lenexa, Kan., prefers post-processing deduplication. "It's just a question of how fast we can get our data onto the box," said Michael Lawrence, CISO and network administrator for the city. The box is an ExaGrid Systems Inc. storage device used for virtual tape backup. With data deduplication technology, the city can keep 15 days' worth of backups on the ExaGrid. Once the data lands there, it can be deduped, further backed up to actual tape or processed in other ways.
"Vendors will tout incredible ratios, but that may not be realistic for you," said Tim Malfara, storage architect at GSI Commerce Solutions Inc. in King of Prussia, Pa. Not every workload or backup benefits from data deduplication. GSI Commerce opted not to deploy deduplication. "The biggest backup areas we have, high-rez images and structured databases, don't dedupe well," Malfara said.
The City of Lenexa's Lawrence doesn't yet know what his dedupe ratio will be. "The ratio gets better over time," he noted, because the chance of newly arriving data being a duplicate of previously stored data increases as more backups are made.
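A rough back-of-the-envelope model shows why the ratio climbs; the 2% daily change rate and the backup counts below are made-up figures for illustration, not the city's actual data.

```python
def cumulative_dedupe_ratio(num_backups, daily_change_rate=0.02):
    """Rough model: the first full backup stores everything; each later
    backup adds only the fraction of blocks that changed that day.
    Returns logical data protected : physical data stored."""
    logical = num_backups                                # N full backups
    physical = 1 + (num_backups - 1) * daily_change_rate
    return logical / physical


for n in (1, 5, 15, 30):
    print(f"{n:2d} backups -> roughly {cumulative_dedupe_ratio(n):.1f}:1")
# With the assumed 2% daily change rate, the ratio climbs from 1:1 toward
# roughly 12:1 after 15 backups and keeps improving as retention grows.
```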
Another debate focuses on the particular dedupe algorithms: proprietary or public. Algorithms may seem exotic, but the science behind hash-based and content-aware approaches is widely known and debated online. Because the underlying techniques are so well understood, you'll end up with roughly the same performance regardless of the algorithm.
Public algorithms, such as SHA-1 or MD5, are good for most situations. There are so many other points in the backup process where latency creeps in or bits are dropped that a marginally better algorithm hardly matters. Many storage managers don't even know which data deduplication algorithm their product uses.
Hash collisions are another worry that grows with the environment: the risk is that two different blocks produce the same hash and the system discards a block it should have kept. Although the risk is statistically real, you don't need to lose sleep over it.
W. Curtis Preston, executive editor of TechTarget's Storage Media Group and an independent backup expert, did the math in his blog and found that with 95 exabytes of data there's a 0.00000000000001110223024625156540423631668090820313% chance your system will discard a block from a hash collision that it should have kept. The chance that the corrupted block will actually be needed in a restore is even more remote.
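For readers who want to run that kind of back-of-the-envelope estimate themselves, the standard birthday-bound approximation is easy to sketch; the 8 KB block size and 160-bit (SHA-1) hash below are our assumptions, so the result is a ballpark figure rather than Preston's exact calculation.

```python
def collision_probability(total_bytes, block_size=8 * 1024, hash_bits=160):
    """Birthday-bound approximation: with n stored blocks and a b-bit hash,
    P(at least one collision) is roughly n * (n - 1) / 2**(b + 1)."""
    n = total_bytes // block_size
    return n * (n - 1) / 2 ** (hash_bits + 1)


ninety_five_exabytes = 95 * 1024 ** 6
p = collision_probability(ninety_five_exabytes)
print(f"collision probability: about {p:.2e} ({p * 100:.2e} percent)")
# With these assumed parameters the result is vanishingly small, in the
# same ballpark as the figure Preston cites.
```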
"And if you have something less than 95 exabytes of data, then your odds don't appear in 50 decimal places," reads a quote from Preston's blog. "I think I'm OK with these odds."
Four simple steps to maximize your data deduplication experience

So what can you do to maximize your dedupe experience? Here are four simple steps:
1. Know your data. Is it structured database data, graphical data or general office files? Some types of data, such as general office files, lend themselves to deduplication better than others.
2. Test dedupe with your own data and insist that vendors demonstrate their systems with a large chunk of it. Better yet, ask them to let you demo the system with your data for a month before committing to a purchase.
3. Don't bother deduping compressed data. Deduplication is just another form of compression, so compressed data, in effect, has already been deduped; the short demonstration after this list illustrates the effect.
4. Understand that deduplication is a feature, not a product. You don't have to buy a dedupe product to get deduplication. The capability is increasingly being incorporated into a range of storage products, including virtual tape libraries (VTLs), backup software and storage arrays.
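To see why compressed data gains little from dedupe (step 3), here is a small experiment you can run; zlib and the made-up file contents below are just stand-ins for whatever compression and data you actually have.

```python
import hashlib
import zlib

CHUNK = 4 * 1024  # illustrative dedupe block size


def chunk_hashes(data):
    """Hash fixed-size chunks, the way a block-level deduper would."""
    return {hashlib.sha1(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)}


# Two "files" that are nearly identical; a deduper loves this.
original = b"customer record: ACME Corp, net-30 terms\n" * 20000
revised = original.replace(b"net-30", b"net-60", 1)  # one tiny edit

raw_overlap = chunk_hashes(original) & chunk_hashes(revised)

zipped_a = zlib.compress(original)
zipped_b = zlib.compress(revised)
zip_overlap = chunk_hashes(zipped_a) & chunk_hashes(zipped_b)

print(f"identical raw chunks:        {len(raw_overlap)}")
print(f"identical compressed chunks: {len(zip_overlap)}")
# The raw versions share essentially every chunk; once compressed, the two
# streams typically share few or none, so there is little left to dedupe.
```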
With the right data in the right situation, data deduplication works well. While dedupe continues to be used primarily to reduce backup volumes, the technology should eventually expand and may even be applied to archiving.
This was first published in July 2009