Hot Spots: Data deduplication drama


This article can also be found in the Premium Editorial Download "Storage magazine: The top email archiving strategies for storage managers."

Download it now to read this article plus other related content.

Where and when
The work of data dedupe can be performed at one or more places between the data source and the target storage destination. Dedupe occurring at the app or file-server level (before the backup data is transmitted across the network) is referred to as client-side deduplication (a must-have if bandwidth reduction is important). Alternatively, dedupe of the backup stream can happen at the backup server, which can be referred to as proxy deduplication, or on the target device, which is called target deduplication.

Deduplication can be timed to occur before data is written to the disk target (inline processing) or after data is written to the disk target (post-processing).

Post-process deduplication will write the backup image to a disk cache before starting to dedupe. This lets the backup complete at full disk performance. Post-process dedupe requires disk cache capacity sized for the backup data that's not deduplicated plus the additional capacity to store deduped data. The size of the cache depends on whether the dedupe process waits for the entire backup job to complete before starting deduplication or if it starts to deduplicate data as it's written and, more importantly, when the deduplication process releases storage space.

Inline dedupe could negatively impact backup performance when the app uses a fingerprint database that grows over time.

Requires Free Membership to View

Inline approaches inspect and dedupe data on the way to the disk target. Performance degradation depends on several factors, including the method of fingerprinting, granularity of dedupe, where the inline processing occurs, network performance, how the dedupe technology workload is distributed and more.

Long-term viability
Many of today's most popular hardware-based approaches may solve the immediate problem of reducing data in disk-to-disk backup environments, but they mask the issues that will arise as the environment expands and evolves.

The issue is software vs. hardware. On the hardware side, purpose-built appliances offer faster deployments, integrating with existing backup software and providing a plug-and-play experience. The compromise? There are limitations when it comes to flexibility and scalability. Additional appliances may need to be added as demand for capacity increases, and the resulting appliance "sprawl" not only adds management complexity and overhead, but may limit deduplication to each individual appliance.

With software approaches, disk capacity may be more flexible. Disk storage is virtualized, appearing as a large pool that scales seamlessly. In a software scenario, the impact on management overhead is less and the effect on deduplication may be greater since deduplication occurs across a larger data set than most individual appliance architectures.

Software-based client-side and proxy dedupe optimize performance by distributing dedupe processing across a large number of clients or media servers. Target dedupe requires powerful, purpose-built storage appliances as the entire backup load needs to be processed on the target. Because software implementations offer better workload distribution, inline dedupe performance may be improved over hardware-based equivalents.

Choosing a software or hardware approach may depend on your current backup software implementation. If the backup software in place doesn't have a dedupe feature or option, switching to one that does may pose challenges.

This was first published in June 2008

There are Comments. Add yours.

TIP: Want to include a code block in your comment? Use <pre> or <code> tags around the desired text. Ex: <code>insert code</code>

REGISTER or login:

Forgot Password?
By submitting you agree to receive email from TechTarget and its partners. If you reside outside of the United States, you consent to having your personal data transferred to and processed in the United States. Privacy
Sort by: OldestNewest

Forgot Password?

No problem! Submit your e-mail address below. We'll send you an email containing your password.

Your password has been sent to: