Dedupe myths and methods

Exaggerated claims, rapidly changing technology and persistent myths make navigating the deduplication landscape treacherous. We list the top five dedupe myths and provide tips to help you get a deduplication product that fits your organization's needs at a competitive price.

This article can also be found in the Premium Editorial Download: Storage magazine: Using two midrange backup apps at once:

Data deduplication products can dramatically lower capacity requirements, but picking the best one for your needs can be tricky.

Still, the rewards of a successful dedupe installation are indisputable.

"We're seeing the growing popularity of secondary storage and archival systems with single-instance storage," says Lauren Whitehouse, analyst at Enterprise Strategy Group (ESG), Milford, MA. "A couple of deduplication products have even appeared for use with primary storage."

The technology is maturing rapidly. "We looked at deduplication two years ago and it wasn't ready," says John Wunder, director of IT at Milpitas, CA-based Magnum Semiconductor, which makes chips for media processing. Recently, Wunder pulled together a deduplication process by combining pieces from Diligent Technologies Corp. (deduplication engine), Symantec Corp.'s Veritas NetBackup (backup software) and Quatrio (servers and storage).

Assembling the right pieces requires a clear understanding of the different dedupe technologies, thorough testing of products before they go into production, and close attention to major product changes such as the introduction of hybrid deduplication (see "Dedupe alternatives," below) and the emergence of global deduplication.

Dedupe alternatives

Until recently, deduplication was performed either in-line or post-processing. Now vendors are blurring those boundaries.

  • FalconStor Software Inc. offers what it calls a hybrid model, in which it begins post-process deduping of a backup job that spans a series of tapes without waiting for the entire backup process to be completed, thereby speeding the post-processing effort.

  • Quantum Corp. offers what it calls adaptive deduplication, which starts as in-line processing, with the data being deduped as it's written. It adds a buffer that grows dynamically when the incoming data volume outpaces processing, and it dedupes the data in that buffer in post-processing style.

"Global deduplication is the process of fanning in multiple sources of data and performing deduplication across those sources," says ESG's Whitehouse. Currently, each appliance maintains its own index of duplicate data. Global deduplication requires a way to share those indexes across appliances (see "Global deduplication," below).

Global deduplication
"Global deduplication is the process of fanning in multiple sources of data and performing deduplication across those sources," says Lauren Whitehouse, analyst at Enterprise Strategy Group (ESG), Milford, MA. Global dedupe generally results in higher ratios and allows you to scale input/output. The global dedupe process differs when you're deduping on the target side or the source side, notes Whitehouse.
  • Target side: Replicate indexes of multiple silos to a central, larger silo to produce a consolidated index that ensures only unique files/segments are transported.

  • Source side: Fan in indexes from remote offices/ branch offices (ROBOs) and dedupe to create a central, consolidated index repository.
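The consolidated index described above can be pictured as a union of the per-site indexes. The following is a minimal sketch under stated assumptions, not any vendor's implementation: the site names, block contents and the `build_index` and `consolidate` helpers are all hypothetical, and each index is simplified to a set of SHA-256 digests.

```python
import hashlib


def build_index(blocks):
    """Return the set of SHA-256 digests for the blocks one appliance has stored."""
    return {hashlib.sha256(b).hexdigest() for b in blocks}


def consolidate(indexes):
    """Merge per-site indexes into one global index and count the overlap."""
    global_index = set()
    per_site_total = 0
    for idx in indexes.values():
        per_site_total += len(idx)
        global_index |= idx
    # Entries that appear at more than one site are stored once globally.
    return global_index, per_site_total - len(global_index)


# Hypothetical block contents held at three sites (illustrative only).
sites = {
    "site_a": [b"os-image", b"mail-db", b"report-q1"],
    "site_b": [b"os-image", b"mail-db", b"cad-files"],
    "site_c": [b"os-image", b"payroll"],
}
indexes = {name: build_index(blocks) for name, blocks in sites.items()}
global_index, cross_site_duplicates = consolidate(indexes)

print("blocks indexed per site:", {n: len(i) for n, i in indexes.items()})
print("unique blocks globally: ", len(global_index))
print("duplicates across sites:", cross_site_duplicates)
```

In this toy data set the three sites index eight blocks between them, but only five are unique. The three redundant copies are what separate, unshared indexes would fail to eliminate.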

Storage capacity optimization
Deduplication reduces capacity requirements by analyzing the data for repetitive patterns, storing each unique pattern once and replacing later occurrences with shorter symbols. This is a CPU-intensive process.

The key to the symbols is stored in an index. When the deduplication engine encounters a pattern, it checks the index to see if it has encountered it before. The more repetitive patterns the engine discovers, the more it can reduce the storage capacity required, although the index can still grow quite large.

The more granular the deduplication engine gets, the greater the likelihood it will find repetitive patterns, which saves more capacity. "True deduplication goes to the sub-file level, noticing blocks in common between different versions of the same file," explains W. Curtis Preston, VP of data protection at GlassHouse Technologies Inc., Framingham, MA. Single-instance storage, a form of deduplication, works at the file level.
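The difference between file-level and sub-file granularity can be illustrated with a small sketch. It assumes fixed 4KB blocks and SHA-256 digests as the index keys; both are choices made for this illustration, and commercial engines may use variable-size segments or other techniques. The function names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen for illustration


def file_level_stored(files):
    """Bytes stored when only whole identical files are collapsed."""
    index = {}  # file digest -> file length
    for data in files:
        index.setdefault(hashlib.sha256(data).digest(), len(data))
    return sum(index.values())


def block_level_stored(files):
    """Bytes stored when identical fixed-size blocks are collapsed across files."""
    index = {}  # block digest -> block length
    for data in files:
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            index.setdefault(hashlib.sha256(block).digest(), len(block))
    return sum(index.values())


# Two versions of the same 40-block file; version 2 overwrites one block in place.
v1 = b"".join(bytes([i]) * BLOCK_SIZE for i in range(40))
v2 = v1[:5 * BLOCK_SIZE] + b"\xff" * BLOCK_SIZE + v1[6 * BLOCK_SIZE:]

print("raw bytes:         ", len(v1) + len(v2))
print("file-level stored: ", file_level_stored([v1, v2]))
print("block-level stored:", block_level_stored([v1, v2]))
```

Because the two versions differ, file-level single instancing keeps both in full. Block-level dedupe keeps the 40 original blocks plus the one changed block. Note that the change here is an in-place overwrite; an insertion would shift every later fixed-size block, which is one reason some engines use variable-size segments.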

Dedupe myths
Because deduplication products are relatively new, are based on different technologies and algorithms, and are upgraded often, a number of myths have grown up around various forms of the technology.

In-line deduplication is better than post-processing. "If your backups aren't slowed down, and you don't run out of hours in the day, does it matter which method you chose? I don't think so," declares Preston.

Magnum Semiconductor's Wunder says his in-line dedupe works just fine. "If there's a delay, it's very small; and since we're going directly to disk, any delay doesn't even register."

The realistic answer is that it depends on your specific data, your deduplication deployment environment and the power of the devices you choose. "The in-line approach with a single box only goes so far," says Preston. And without global dedupe, throwing more boxes at the problem won't help. Today, says Preston, "post-processing is ahead, but that will likely change. By the end of the year, Diligent [now an IBM company], Data Domain [Inc.] and others will have global dedupe. Then we'll see a true race."

Post-process dedupe happens only after all backups have been completed. Post-process systems typically wait only until a given virtual tape is no longer in use before deduping it, not until every tape in the backup is finished, says Preston. Deduping can start on the first tape as soon as the system starts backing up to the second. "By the time it dedupes the first tape, the next tape will be ready for deduping," he says.

Vendors' ultra-high deduplication ratio claims. Figuring out your ratio isn't simple, and ratios claimed by vendors are highly manipulated. "The extravagant ratios some vendors claim--up to 400:1--are really getting out of hand," says Whitehouse. The "best" ratio depends on the nature of the specific data and how frequently it changes over a period of time.

"Suppose you dedupe a data set consisting of 500 files, each 1GB in size, for the purpose of backup," says Dan Codd, CTO at EMC Corp.'s Software Group. "The next day one file is changed. So you dedupe the data set and back up one file. What's your backup ratio? You could claim a 500:1 ratio."

Grey Healthcare Group, a New York City-based healthcare advertising agency, works with many media files, some exceeding 2GB in size. The company was storing its files on a 13TB EqualLogic (now owned by Dell Inc.) iSCSI SAN, and backing it up to a FalconStor Software Inc. VTL and eventually to LTO-2 tape. Using FalconStor's post-processing deduplication, Grey Healthcare was able to reduce 175TB to 2TB of virtual disk over a period of four weeks, "which we calculate as better than a 75:1 ratio," says Chris Watkis, IT director.

Watkis realizes that the same deduplication process results could be calculated differently using various time frames. "So maybe it was 40:1 or even 20:1. In aggregate, we got 175TB down to 2TB of actual disk," he says.
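The arithmetic behind these ratios is simple, which is why the choice of time frame matters so much. The sketch below uses only figures quoted above. The `dedupe_ratio` helper is a hypothetical name, and the "both days" case assumes the first backup stores the full 500GB with no internal duplication.

```python
def dedupe_ratio(logical, stored):
    """Ratio of data presented for backup to data actually written to disk."""
    return logical / stored


# Grey Healthcare's aggregate figures: 175TB reduced to 2TB.
aggregate = dedupe_ratio(175, 2)
print(f"aggregate: {aggregate:.1f}:1")

# Codd's example: 500 files of 1GB each, one file changes the next day.
full_set_gb = 500 * 1
changed_gb = 1
second_day_only = dedupe_ratio(full_set_gb, changed_gb)
# Assumption: the first full backup stored all 500GB, so count it as well.
both_days = dedupe_ratio(2 * full_set_gb, full_set_gb + changed_gb)
print(f"second day only: {second_day_only:.0f}:1")
print(f"both days:       {both_days:.2f}:1")
```

Measured over the second day alone, the example yields 500:1. Measured over both days, under the stated assumption, it yields roughly 2:1 for exactly the same data.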

Proprietary algorithms deliver the best results. Algorithms, whether proprietary or open, fall into two general categories: hash-based, which generates pointers to the original data in the index; and content-aware, which compares incoming data with the latest backup.

"The science of hash-based and content-aware algorithms is widely known," says Neville Yates, CTO at Diligent. "Either way, you'll get about the same performance."

Yates, of course, claims Diligent uses yet another approach. Its algorithm, he explains, uses small amounts of data that can be kept in memory, even when dealing with a petabyte of data, thereby speeding performance. Magnum Semiconductor's Wunder, a Diligent customer, deals with files that typically run approximately 22KB and found that Diligent's approach delivered good results. He didn't find it necessary to dig any deeper into the algorithms.

"We talked to engineers from both Data Domain and ExaGrid Systems Inc. about their algorithms, but we really were more interested in how they stored data and how they did restores from old data," says Michael Aubry, director of information systems for three central California hospitals in the 19-hospital Adventist Health Network. The specific algorithms each vendor used never came up.

FalconStor opted for public algorithms, like SHA-1 or MD5. "It's a question of slightly better performance [with proprietary algorithms] or more-than-sufficient performance for the job [with public algorithms]," says John Lallier, FalconStor's VP of technology. Even the best algorithms still remain at the mercy of the transmission links, which can lose bits, he adds.

Hash collisions increase data bit-error rates as the environment grows. Statistically this appears to be true, but don't lose sleep over it. Concerns about hash collisions apply only to deduplication systems that use a hash to identify redundant data. Vendors that use a secondary check to verify a match, or that don't use hashes at all, don't have to worry about hash collisions.
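A secondary check of the kind mentioned above can be as simple as a byte-for-byte comparison whenever the hash matches. The following is a minimal sketch, assuming an in-memory store; the `VerifyingStore` class and its methods are hypothetical and do not describe any particular product. A deliberately weak one-byte hash is injected so that a collision is easy to trigger.

```python
import hashlib


class VerifyingStore:
    """Block store that byte-compares on every hash hit before discarding a block."""

    def __init__(self, hash_fn=None):
        # hash_fn is injectable so a weak hash can force collisions in a demo.
        self._hash = hash_fn or (lambda b: hashlib.sha1(b).digest())
        self._buckets = {}  # digest -> list of distinct blocks sharing that digest

    def put(self, block):
        """Store the block if unseen and return (digest, slot) as its reference."""
        digest = self._hash(block)
        bucket = self._buckets.setdefault(digest, [])
        for slot, existing in enumerate(bucket):
            if existing == block:  # secondary check: true byte-for-byte match
                return digest, slot
        bucket.append(block)  # new data, or a collision kept in its own slot
        return digest, len(bucket) - 1

    def get(self, ref):
        digest, slot = ref
        return self._buckets[digest][slot]


# A one-byte "hash" (the first byte only) makes collisions trivial to demonstrate.
store = VerifyingStore(hash_fn=lambda b: b[:1])
ref_a = store.put(b"alpha")
ref_b = store.put(b"avocado")  # collides with b"alpha" under the weak hash
ref_c = store.put(b"alpha")    # genuine duplicate

print(ref_a == ref_c, ref_a != ref_b)
print(store.get(ref_b))
```

The genuine duplicate resolves to the existing reference, while the colliding block is kept separately and can still be restored intact. The trade-off is an extra read and comparison on every hash hit.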

GlassHouse Technologies' Preston did the math on his blog and found that with 95 exabytes of data there's a 0.00000000000001110223024625156540423631668090820313% chance your system will discard a block it should keep as a result of a hash collision. The chance the corrupted block will actually be needed in a restore is even more remote.
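The order of magnitude of such odds can be estimated with the standard birthday-bound approximation, p ≈ n(n-1) / (2 · 2^b), for n blocks and a b-bit hash. The sketch below is not a reproduction of Preston's calculation: the 8KB block size and the 160-bit hash width (the size of a SHA-1 digest) are assumptions made here, and different choices give different figures.

```python
def collision_probability(total_bytes, block_bytes, hash_bits):
    """Birthday-bound approximation n(n-1) / (2 * 2**hash_bits); valid when small."""
    n = total_bytes // block_bytes  # number of blocks hashed
    return n * (n - 1) / (2 * 2 ** hash_bits)


EXABYTE = 10 ** 18
BLOCK_BYTES = 8 * 1024  # assumed block size
HASH_BITS = 160         # assumed hash width (SHA-1 digest size)

blocks_hashed = 95 * EXABYTE // BLOCK_BYTES
p = collision_probability(95 * EXABYTE, BLOCK_BYTES, HASH_BITS)
print(f"blocks hashed: {blocks_hashed:.3e}")
print(f"approximate probability of any collision: {p:.3e}")
```

Under these assumptions the probability of even one collision anywhere in the 95 exabytes comes out well below one in a quadrillion.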

"And if you have something less than 95 exabytes of data, then your odds don't appear in 50 decimal places," says Preston. "I think I'm OK with these odds."

Dedupe tips
Sorting out the deduplication myths is just the first part of a storage manager's job. The following tips will help managers deploy deduplication while avoiding common pitfalls.

  1. Know your data. "People don't have accurate data on their daily changes and retention periods," says Wunder. That data, however, is critical in estimating what kind of dedupe ratio you'll get and planning how much disk capacity you'll need. "We planned for a 60-day retention period to keep the cost down," he says.

    "The vendors will do capacity estimates and they're pretty good," says ESG's Whitehouse. Adventist Health's Aubry, for example, asked Data Domain and ExaGrid to size a deduplication solution. "We told them what we knew about the data and asked them to look at our data and what we were doing. They each came back with estimates that were comparable," says Aubry. Almost two years later the estimates have still proven pretty accurate.

  2. Know your applications. Not all deduplication products handle all applications equally. Special data structures, unusual data formats, variable-length data and other quirks in how an application treats its data can all fool a dedupe product.

    When Philadelphia law firm Duane Morris LLP finally got around to using Avamar Technologies' Axiom (now EMC Avamar) for deduplication, the company had a surprise: "It worked for some applications, but it didn't work with Microsoft Exchange," says Duane Morris CIO John Sroka.

    Avamar had no problem deduping the firm's 6 million Word documents, but when it hit Exchange data "it saw the Exchange data as completely new each time, no duplication," he reports. (The latest version of Avamar dedupes Exchange data.) Duane Morris, however, won't bother to upgrade Avamar. "We're moving to Double-Take [from Double-Take Software Inc.] to get real-time replication," says Sroka, which is what the firm wanted all along.

  3. Avoid deduping compressed data. As a corollary to the above tip, "it's a waste of time to try to dedupe compressed files. We tried and ended up with some horrible ratios," says Kevin Fiore, CIO at Thomas Weisel Partners LLC, a San Francisco investment bank. A Data Domain user for more than two years, the company gets ratios as high as 35:1 with uncompressed file data. With database applications and others that compress files, the ratios fell into the single digits.

    When deduping a mix of applications, Thomas Weisel Partners experiences acceptable ratios ranging from 12:1 to 16:1. Similarly, data the company doesn't keep very long isn't worth deduping at all. Unless the data is kept long enough to be backed up multiple times, there's little to gain from deduplication for that data.

  4. Avoid the easy fix. "There's a point early in the process where companies go for a quick fix, an appliance. Then they find themselves plopping in more boxes when they have to scale. At some point, they can't get the situation under control," says ESG's Whitehouse. Appliances certainly present an easy solution, but until the selected appliance supports some form of global dedupe, a company will find itself managing islands of deduplication. In the process, it will miss opportunities to remove duplicate data that is stored on more than one appliance.

    Magnum Semiconductor's Wunder quickly spotted this trap. "We looked at Data Domain, but we realized it wouldn't scale. At some point we would need multiple appliances at $80,000 apiece," he says.

  5. Test dedupe products with a large quantity of your real data. "This kind of testing is time consuming, so many companies avoid it. Usually a company will try the product with little bits of data, and the results won't compare with large data sets," says GlassHouse Technologies' Preston. Ideally, you should demo the product onsite by having it do real work for a month or so before opting to buy it. However, most vendors won't go along with this unless they believe they're on the verge of losing the sale.
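Tip 3's warning about compressed data can be checked on a small scale before running a full product test. The sketch below is an illustration under stated assumptions, not a benchmark: it uses fixed 4KB blocks, SHA-256 digests, zlib as a stand-in for whatever compression an application applies, and synthetic text built from a hypothetical vocabulary. It compares how many blocks two nearly identical versions share, first raw and then after compression.

```python
import hashlib
import random
import zlib

BLOCK_SIZE = 4096  # assumed fixed block size


def blocks(data):
    """Split data into fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def shared_fraction(old, new):
    """Fraction of the new version's blocks already present in the old version."""
    seen = {hashlib.sha256(b).digest() for b in blocks(old)}
    new_blocks = blocks(new)
    hits = sum(hashlib.sha256(b).digest() in seen for b in new_blocks)
    return hits / len(new_blocks)


# Build 1MB of compressible lowercase text from a small hypothetical vocabulary.
rng = random.Random(0)
words = [b"backup", b"restore", b"tape", b"index", b"block", b"ratio", b"silo"]
text = b" ".join(rng.choice(words) for _ in range(250_000))
v1 = text[:256 * BLOCK_SIZE]
# Version 2 overwrites eight bytes near the start; the length is unchanged.
v2 = v1[:100] + b"XXXXXXXX" + v1[108:]

raw_shared = shared_fraction(v1, v2)
compressed_shared = shared_fraction(zlib.compress(v1), zlib.compress(v2))
print(f"raw blocks shared:        {raw_shared:.1%}")
print(f"compressed blocks shared: {compressed_shared:.1%}")
```

The raw versions share every block except the one that was edited. The compressed versions cannot share a larger fraction than that and will usually share far less, because a small change early in the input tends to alter the compressed stream from that point onward. How much less depends on the data and the compressor settings.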

Adventist Health got lucky. It made a decision based on lengthy onsite meetings with engineers from Data Domain and ExaGrid. Based on those meetings and its internal analysis, it opted for ExaGrid. Once the decision was made, Adventist Health's Aubry called Data Domain as a courtesy. Data Domain wouldn't give up and offered to send an appliance.

"I was a little nervous I might have made a wrong decision. We put in both and ran a bake off," says Aubry. ExaGrid was already installed on Adventist Health's routed network. Adventist Health put the Data Domain appliance on a private network connected to its media server.

"I was expecting Data Domain to outperform because of the private network," he says. Measuring the time it took to complete the end-to-end process, ExaGrid performed 20% faster, much to Aubry's relief as he was already committed to buying the ExaGrid.

Just about every consumer cliché applies to deduplication today: buyer beware, try before you buy, your mileage may vary, past performance is no indicator of future performance, one size doesn't fit all and so on. Fortunately, the market is competitive and price is negotiable. With the technology-industry analyst firm The 451 Group projecting the market to surpass $1 billion by 2009, up from $100 million just three years earlier, dedupe is hot. Shop around. Informed storage managers should be able to get a deduplication product that fits their needs at a competitive price.

This was first published in September 2008
