Where does deduplication belong in backup? (Hot Spots)

Should you go with a software-based approach that allows for policy-based deduplication or a hardware-based approach because it can be implemented quickly and easily?

Data growth is stretching backup windows and driving more organizations to implement disk-to-disk backup. That shift has created tremendous interest in data deduplication: the capacity savings mean data can be retained longer on disk, which increases the likelihood of a disk-based recovery rather than a slower, manual, tape-based one.

While deduplication has been a feature of several backup offerings for years, the technology has been most widely adopted in backup hardware, such as virtual tape libraries (VTLs) and network-attached storage (NAS)-based disk targets. Deduplication implementations in backup software, by contrast, have required organizations to switch out legacy solutions, a path the hardware-based deduplication vendors are quick to point out isn't always desirable. Now that mainstream backup software vendors such as CommVault, EMC Corp., IBM Corp. and Symantec Corp. are incorporating data deduplication into their backup products (reducing the disruption caused by implementing deduplication), the question is being asked again: Where does deduplication belong in backup?

Software-based deduplication

Software-based approaches are differentiated in a few ways. First, they have knowledge about the data in the backup stream; they can look at patterns in the data stream (the bytes that make up a file) and determine the optimal segment boundaries, which maximizes the likelihood of identifying duplicates. In short, backup software understands the content, whereas target-side deduplication solutions typically don't. Targets simply receive a "blob" of data from the backup application. Those target-side deduplication devices that are content-aware typically have to extract the metadata associated with the backup and "reverse engineer" the backup stream to understand its contents.
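To make the idea of choosing segment boundaries from the data itself more concrete, here is a minimal Python sketch of content-defined segmenting compared with fixed-size segmenting. It is not taken from any vendor's product: the function names, window size, bit mask and size limits are all illustrative assumptions, and the CRC32 check stands in for whatever boundary test a real backup application might use.

```python
import hashlib
import random
import zlib

# All constants below are illustrative assumptions, not vendor settings.
WINDOW = 16                    # trailing bytes examined at each position
MASK = (1 << 6) - 1            # cut when the low 6 hash bits are zero
MIN_SIZE, MAX_SIZE = 16, 512   # segment size limits


def content_defined_segments(data):
    """Cut where a hash of the trailing WINDOW bytes matches a pattern."""
    segments, start = [], 0
    for i in range(len(data)):
        size = i - start + 1
        if size < MIN_SIZE:
            continue
        # Recomputed at every position for clarity; a production
        # implementation would more likely use a rolling hash.
        window = data[max(0, i - WINDOW + 1): i + 1]
        if (zlib.crc32(window) & MASK) == 0 or size >= MAX_SIZE:
            segments.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        segments.append(data[start:])
    return segments


def fixed_segments(data, size=64):
    """Cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared_fingerprints(a_segments, b_segments):
    """Count distinct segments present in both lists, by SHA-256 digest."""
    a = {hashlib.sha256(s).digest() for s in a_segments}
    b = {hashlib.sha256(s).digest() for s in b_segments}
    return len(a & b)


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4000))
edited = b"XYZ" + original   # the same data with three bytes inserted up front

cdc_shared = shared_fingerprints(content_defined_segments(original),
                                 content_defined_segments(edited))
fixed_shared = shared_fingerprints(fixed_segments(original),
                                   fixed_segments(edited))
print("content-defined segments shared:", cdc_shared)
print("fixed-size segments shared:", fixed_shared)
```

Because the content-defined boundaries follow the bytes rather than their offsets, a small insertion near the start of a file shifts the cut points along with the data, so later segments can still match ones already stored. Fixed-size segments all move out of alignment after the same insertion.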

Second, integration with the backup software allows for policy-based deduplication. Deduplication can be disabled for selected data sets where it doesn't make sense to turn it on (such as an MRI image) or for other data types (like databases) where you don't want to interfere with performance.
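As a rough illustration of what policy-based deduplication could look like, the Python sketch below checks a data set name against an ordered list of rules and takes the first match. The rule format, the `DedupRule` and `should_deduplicate` names, and the file patterns (including `.dcm` as a stand-in for MRI images) are hypothetical; actual backup products expose policies through their own interfaces.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class DedupRule:
    pattern: str       # glob matched against the data set name (hypothetical scheme)
    deduplicate: bool  # whether deduplication is enabled for matches
    reason: str        # note for the administrator


# Ordered policy: the first matching rule wins. Patterns are assumptions.
POLICY = [
    DedupRule("*.dcm", False, "imaging data where deduplication doesn't make sense"),
    DedupRule("db/*", False, "databases: avoid interfering with performance"),
    DedupRule("*", True, "default: deduplicate everything else"),
]


def should_deduplicate(data_set, policy=POLICY):
    """Return the deduplication setting of the first rule matching data_set."""
    for rule in policy:
        if fnmatchcase(data_set, rule.pattern):
            return rule.deduplicate
    return True  # no rule matched: fall back to deduplicating


for name in ("scans/patient1.dcm", "db/orders.mdf", "home/report.docx"):
    print(name, "->", "deduplicate" if should_deduplicate(name) else "skip")
```

Putting the catch-all rule last keeps the exceptions explicit, and the decision is made before any data is segmented or fingerprinted.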

One of the drawbacks of a software-based approach is that adopting a deduplication feature could require an upgrade of the backup application and/or client agents. Another factor is that deduplication may be processor-intensive and, when performed at the source application server, it may compete with and slow down applications. The scalability and performance of the media server performing deduplication could also be limiting factors. It's important to investigate the upper limits of deduplication "pools" and performance capabilities for large volumes of data.

This was first published in March 2009
