Dedupe myths and methods


This article can also be found in the Premium Editorial Download "Storage magazine: Using two midrange backup apps at once."

Storage capacity optimization
Deduplication reduces capacity requirements by analyzing the data for repeated patterns. Each distinct pattern is stored only once, and every repeat is replaced by a shorter symbol that points to the stored copy. This is a CPU-intensive process.

The key to the symbols is stored in an index. When the deduplication engine encounters a pattern, it checks the index to see if it has encountered it before. The more repetitive patterns the engine discovers, the more it can reduce the storage capacity required, although the index can still grow quite large.
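The index lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any vendor's implementation: the fixed 4-byte chunk size, the use of SHA-256 digests as the "symbols," and the function names `dedupe` and `restore` are all hypothetical choices made for the example.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; assumed and unrealistically small so the example is easy to follow


def dedupe(data, index):
    """Split data into fixed-size chunks and store each distinct chunk once.

    Returns the list of symbols (hash digests) needed to rebuild the data.
    """
    symbols = []
    for start in range(0, len(data), CHUNK_SIZE):
        chunk = data[start:start + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in index:   # pattern not seen before: store it once
            index[digest] = chunk
        symbols.append(digest)    # seen or not, only the symbol is recorded
    return symbols


def restore(symbols, index):
    """Rebuild the original data by looking each symbol up in the index."""
    return b"".join(index[s] for s in symbols)


index = {}
data = b"ABCDABCDABCDWXYZ"
symbols = dedupe(data, index)

print("chunks referenced:", len(symbols))
print("chunks actually stored:", len(index))
print("round trip OK:", restore(symbols, index) == data)
```

Note that the index gains one entry for every distinct chunk, which is why it can grow large even when the data dedupes well.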

The more granular the deduplication engine gets, the greater the likelihood it will find repetitive patterns, which saves more capacity. "True deduplication goes to the sub-file level, noticing blocks in common between different versions of the same file," explains W. Curtis Preston, VP of data protection at GlassHouse Technologies Inc., Framingham, MA. Single-instance storage, a form of deduplication, works at the file level.
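To make the difference in granularity concrete, here is a rough Python sketch that compares file-level single-instance storage with sub-file dedupe on three small "files": an original, an edited version and an exact copy. The 8-byte fixed block size and both function names are assumptions made for illustration; real products may chunk data quite differently, for example with variable-size blocks.

```python
import hashlib

BLOCK_SIZE = 8  # bytes; hypothetical and tiny, chosen only for illustration


def stored_bytes_file_level(files):
    """Single-instance storage: keep one copy of each distinct whole file."""
    unique = {hashlib.sha256(f).hexdigest(): len(f) for f in files}
    return sum(unique.values())


def stored_bytes_block_level(files):
    """Sub-file dedupe: keep one copy of each distinct fixed-size block."""
    unique = {}
    for f in files:
        for i in range(0, len(f), BLOCK_SIZE):
            block = f[i:i + BLOCK_SIZE]
            unique[hashlib.sha256(block).hexdigest()] = len(block)
    return sum(unique.values())


v1 = b"AAAAAAAA" b"BBBBBBBB" b"CCCCCCCC" b"DDDDDDDD"
v2 = b"AAAAAAAA" b"BBBBBBBB" b"XXXXXXXX" b"DDDDDDDD"  # same file, one block edited
copy_of_v1 = bytes(v1)                                # exact duplicate of v1

files = [v1, v2, copy_of_v1]
raw = sum(len(f) for f in files)
file_level = stored_bytes_file_level(files)
block_level = stored_bytes_block_level(files)

print("raw bytes:", raw)
print("stored with file-level dedupe:", file_level)
print("stored with block-level dedupe:", block_level)
```

File-level dedupe eliminates only the exact copy, while block-level dedupe also recovers the three blocks the edited version shares with the original.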

Dedupe myths
Because deduplication products are relatively new, are based on different technologies and algorithms, and are upgraded often, a number of myths have grown up around the various forms of the technology.

In-line deduplication is better than post-processing. "If your backups aren't slowed down, and you don't run out of hours in the day, does it matter which method you chose? I don't think so," declares Preston.

Magnum Semiconductor's Wunder says his in-line dedupe works just fine. "If there's a delay, it's very small; and since we're going directly to disk, any delay doesn't even register."

The realistic answer is that it depends on your specific data, your deduplication deployment environment and the power of the devices you choose. "The in-line approach with a single box only goes so far," says Preston. And without global dedupe, throwing more boxes at the problem won't help. Today, says Preston, "post-processing is ahead, but that will likely change. By the end of the year, Diligent [now an IBM company], Data Domain [Inc.] and others will have global dedupe. Then we'll see a true race."

Post-process dedupe happens only after all backups have been completed. Not so, says Preston: post-process systems typically wait only until a given virtual tape is no longer in use before deduping it, not until every tape in the backup has been written. Deduping can start on the first tape as soon as the system starts backing up the second. "By the time it dedupes the first tape, the next tape will be ready for deduping," he says.
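The overlap Preston describes can be illustrated with a back-of-the-envelope timeline in Python. Everything here is a hypothetical simplification: the per-tape backup and dedupe times are invented, tapes are assumed to be written one after another, and a single dedupe engine is assumed to work on one tape at a time.

```python
def finish_time_pipelined(backup_hours, dedupe_hours):
    """Dedupe each tape as soon as its backup is done and the engine is free."""
    backup_done = 0.0
    dedupe_done = 0.0
    for b, d in zip(backup_hours, dedupe_hours):
        backup_done += b  # tapes are written one after another
        # Dedupe starts when the tape is released and the engine is idle.
        dedupe_done = max(dedupe_done, backup_done) + d
    return dedupe_done


def finish_time_wait_for_all(backup_hours, dedupe_hours):
    """The mythical case: dedupe starts only after every tape is written."""
    return sum(backup_hours) + sum(dedupe_hours)


backup_hours = [2.0, 2.0, 2.0]  # hypothetical time to write each virtual tape
dedupe_hours = [1.5, 1.5, 1.5]  # hypothetical time to dedupe each tape

pipelined = finish_time_pipelined(backup_hours, dedupe_hours)
serial = finish_time_wait_for_all(backup_hours, dedupe_hours)

print("hours until everything is deduped, tape by tape:", pipelined)
print("hours if dedupe waited for all backups:", serial)
```

With these made-up numbers, only the dedupe of the last tape falls outside the backup window.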

Vendors' ultra-high deduplication ratio claims. Figuring out your ratio isn't simple, and ratios claimed by vendors are highly manipulated. "The extravagant ratios some vendors claim--up to 400:1--are really getting out of hand," says Whitehouse. The "best" ratio depends on the nature of the specific data and how frequently it changes over a period of time.

"Suppose you dedupe a data set consisting of 500 files, each 1GB in size, for the purpose of backup," says Dan Codd, CTO at EMC Corp.'s Software Group. "The next day one file is changed. So you dedupe the data set and back up one file. What's your backup ratio? You could claim a 500:1 ratio."
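Codd's example is easy to work through in Python. The sketch below rests on two assumptions that are not in his quote: the first backup finds no duplicates at all, and the changed file shares nothing with its previous version. Under those assumptions it compares the ratio for the second backup alone with the ratio across both backups.

```python
FILES = 500
FILE_GB = 1

logical_day1 = FILES * FILE_GB  # data protected by the first backup
stored_day1 = FILES * FILE_GB   # assumption: first backup finds no duplicates
logical_day2 = FILES * FILE_GB  # the same 500 files are protected on day two
stored_day2 = 1 * FILE_GB       # assumption: the changed file is entirely new data

# Ratio for the second backup taken on its own.
per_backup_ratio = logical_day2 / stored_day2

# Ratio across everything protected and everything stored so far.
cumulative_ratio = (logical_day1 + logical_day2) / (stored_day1 + stored_day2)

print(f"Day-two backup alone: {per_backup_ratio:.0f}:1")
print(f"Both backups together: {cumulative_ratio:.2f}:1")
```

The same two backups support either figure, which is why a quoted ratio means little unless the vendor says what data it covers and over what period.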

This was first published in September 2008
