| Data deduplication products can dramatically lower capacity requirements, but picking the best one for your needs can be tricky.
Exaggerated claims, rapidly changing technology and persistent myths make navigating the deduplication landscape treacherous. But the rewards of a successful dedupe installation are indisputable.
"We're seeing the growing popularity of secondary storage and archival systems with single-instance storage," says Lauren Whitehouse, analyst at Enterprise Strategy Group (ESG), Milford, MA. "A couple of deduplication products have even appeared for use with primary storage."
The technology is maturing rapidly. "We looked at deduplication two years ago and it wasn't ready," says John Wunder, director of IT at Milpitas, CA-based Magnum Semiconductor, which makes chips for media processing. Recently, Wunder pulled together a deduplication process by combining pieces from Diligent Technologies Corp. (deduplication engine), Symantec Corp. (Veritas NetBackup) and Quatrio (servers and storage).
Assembling the right pieces requires a clear understanding of the different dedupe technologies, thorough testing of products prior to production, and close attention to major product changes such as the introduction of hybrid deduplication (see "Dedupe alternatives," below) and the emergence of global deduplication.
"Global deduplication is the process of fanning in multiple sources of data and performing deduplication across those sources," says ESG's Whitehouse. Currently, each appliance maintains its own index of duplicate data. Global deduplication requires a way to share those indexes across appliances (see "Global deduplication," below).
Storage capacity optimization
Deduplication reduces capacity requirements by replacing repetitive data patterns with small symbols or pointers; the key to those symbols is stored in an index. When the deduplication engine encounters a pattern, it checks the index to see if it has encountered it before. The more repetitive patterns the engine discovers, the more it can reduce the storage capacity required, although the index itself can still grow quite large.
The more granular the deduplication engine gets, the greater the likelihood it will find repetitive patterns, which saves more capacity. "True deduplication goes to the sub-file level, noticing blocks in common between different versions of the same file," explains W. Curtis Preston, VP of data protection at GlassHouse Technologies Inc., Framingham, MA. Single-instance storage, a form of deduplication, works at the file level.
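The mechanics are easier to see in a rough sketch of sub-file, hash-based deduplication (the 8KB block size and SHA-1 fingerprints below are illustrative assumptions; products vary widely): the engine splits incoming data into blocks, fingerprints each one, checks the index, and stores only blocks it hasn't seen before.

    import hashlib

    BLOCK_SIZE = 8 * 1024
    index = {}   # fingerprint -> stored block

    def dedupe(data: bytes):
        stored = duplicates = 0
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            fp = hashlib.sha1(block).hexdigest()
            # Byte-comparing on a fingerprint hit is an optional secondary check
            # against hash collisions.
            if fp in index and index[fp] == block:
                duplicates += 1                   # only a pointer would be kept
            else:
                index[fp] = block                 # first time seen: store the block
                stored += 1
        return stored, duplicates

    # Two "versions" of the same file that differ only in the last block.
    v1 = b"A" * BLOCK_SIZE * 3
    v2 = b"A" * BLOCK_SIZE * 2 + b"B" * BLOCK_SIZE
    print(dedupe(v1))   # (1, 2) -- three identical blocks collapse to one
    print(dedupe(v2))   # (1, 2) -- only the changed block is stored

File-level single-instance storage would instead fingerprint whole files, so changing a single byte forces the entire file to be stored again.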
One persistent claim is that in-line deduplication is better than post-processing. "If your backups aren't slowed down, and you don't run out of hours in the day, does it matter which method you chose? I don't think so," declares Preston.
Magnum Semiconductor's Wunder says his in-line dedupe works just fine. "If there's a delay, it's very small; and since we're going directly to disk, any delay doesn't even register."
The realistic answer is that it depends on your specific data, your deduplication deployment environment and the power of the devices you choose. "The in-line approach with a single box only goes so far," says Preston. And without global dedupe, throwing more boxes at the problem won't help. Today, says Preston, "post-processing is ahead, but that will likely change. By the end of the year, Diligent [now an IBM company], Data Domain [Inc.] and others will have global dedupe. Then we'll see a true race."
It's also a misconception that post-process dedupe can begin only after all backups have been completed. Post-process systems typically wait only until a given virtual tape is no longer in use before deduping it, not until every tape in the backup is finished, says Preston. Deduping can start on the first tape as soon as the system starts backing up the second. "By the time it dedupes the first tape, the next tape will be ready for deduping," he says.
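In other words, post-process dedupe can be pipelined behind the backup job. A toy sketch of the idea (Python; tape names and structure are made up for illustration):

    from queue import Queue
    from threading import Thread

    finished_tapes = Queue()

    def dedupe_worker():
        while True:
            tape = finished_tapes.get()
            if tape is None:
                break        # backup job is finished; nothing left to dedupe
            print(f"deduping {tape} while the next tape is still being written")

    worker = Thread(target=dedupe_worker)
    worker.start()

    for n in range(1, 4):
        tape = f"virtual-tape-{n}"
        print(f"backing up {tape}")
        finished_tapes.put(tape)   # hand the finished tape to the dedupe engine immediately

    finished_tapes.put(None)       # signal the worker that the backup job is done
    worker.join()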
Don't be swayed by vendors' ultra-high deduplication ratio claims. Figuring out your ratio isn't simple, and the ratios vendors claim are highly manipulated. "The extravagant ratios some vendors claim--up to 400:1--are really getting out of hand," says Whitehouse. The "best" ratio depends on the nature of the specific data and how frequently it changes over a period of time.
"Suppose you dedupe a data set consisting of 500 files, each 1GB in size, for the purpose of backup," says Dan Codd, CTO at EMC Corp.'s Software Group. "The next day one file is changed. So you dedupe the data set and back up one file. What's your backup ratio? You could claim a 500:1 ratio."
Grey Healthcare Group, a New York City-based healthcare advertising agency, works with many media files, some exceeding 2GB in size. The company was storing its files on a 13TB EqualLogic (now owned by Dell Inc.) iSCSI SAN, and backing it up to a FalconStor Software Inc. VTL and eventually to LTO-2 tape. Using FalconStor's post-processing deduplication, Grey Healthcare was able to reduce 175TB to 2TB of virtual disk over a period of four weeks, "which we calculate as better than a 75:1 ratio," says Chris Watkis, IT director.
Watkis realizes that the same deduplication process results could be calculated differently using various time frames. "So maybe it was 40:1 or even 20:1. In aggregate, we got 175TB down to 2TB of actual disk," he says.
Another myth is that proprietary algorithms deliver the best results. Algorithms, whether proprietary or open, fall into two general categories: hash-based, which fingerprints data and keeps pointers to the original data in an index; and content-aware, which compares incoming data against the most recent backup.
"The science of hash-based and content-aware algorithms is widely known," says Neville Yates, CTO at Diligent. "Either way, you'll get about the same performance."
Yates, of course, claims Diligent takes a different approach. Its algorithm, he explains, uses small amounts of data that can be kept in memory, even when dealing with a petabyte of data, which speeds performance. Magnum Semiconductor's Wunder, a Diligent customer, deals with files that typically run approximately 22KB and says Diligent's approach delivered good results; he didn't find it necessary to dig any deeper into the algorithms.
"We talked to engineers from both Data Domain and ExaGrid Systems Inc. about their algorithms, but we really were more interested in how they stored data and how they did restores from old data," says Michael Aubry, director of information systems for three central California hospitals in the 19-hospital Adventist Health Network. The specific algorithms each vendor used never came up.
FalconStor opted for public algorithms, like SHA-1 or MD5. "It's a question of slightly better performance [with proprietary algorithms] or more-than-sufficient performance for the job [with public algorithms]," says John Lallier, FalconStor's VP of technology. Even the best algorithms still remain at the mercy of the transmission links, which can lose bits, he adds.
Finally, there's the worry that hash collisions increase data bit-error rates as the environment grows. Statistically this appears to be true, but don't lose sleep over it. Concerns about hash collisions apply only to deduplication systems that use a hash to identify redundant data. Vendors that use a secondary check to verify a match, or that don't use hashes at all, don't have to worry about hash collisions.
GlassHouse Technologies' Preston did the math on his blog and found that with 95 exabytes of data there's a 0.00000000000001110223024625156540423631668090820313% chance your system will discard a block it should keep as a result of a hash collision. The chance the corrupted block will actually be needed in a restore is even more remote.
"And if you have something less than 95 exabytes of data, then your odds don't appear in 50 decimal places," says Preston. "I think I'm OK with these odds."
Adventist Health got lucky. It held lengthy onsite meetings with engineers from Data Domain and ExaGrid and, based on those meetings and its internal analysis, opted for ExaGrid. Once the decision was made, Adventist Health's Aubry called Data Domain as a courtesy. Data Domain wouldn't give up and offered to send an appliance.
"I was a little nervous I might have made a wrong decision. We put in both and ran a bake off," says Aubry. ExaGrid was already installed on Adventist Health's routed network. It put the Data Domain appliance on a private network connected to its media server.
"I was expecting Data Domain to outperform because of the private network," he says. Measuring the time it took to complete the end-to-end process, ExaGrid performed 20% faster, much to Aubry's relief as he was already committed to buying the ExaGrid.
Just about every consumer cliché applies to deduplication today: buyer beware, try before you buy, your mileage may vary, past performance is no indicator of future performance, one size doesn't fit all and so on. Fortunately, the market is competitive and price is negotiable. With the technology-industry analyst firm The 451 Group projecting the market to surpass $1 billion by 2009, up from $100 million just three years earlier, dedupe is hot. Shop around. Informed storage managers should be able to get a deduplication product that fits their needs at a competitive price.