This article can also be found in the Premium Editorial Download "Storage magazine: The top email archiving strategies for storage managers."
Download it now to read this article plus other related content.
Where and when
Deduplication can be timed to occur before data is written to the disk target (inline processing) or after data is written to the disk target (post-processing).
Post-process deduplication will write the backup image to a disk cache before starting to dedupe. This lets the backup complete at full disk performance. Post-process dedupe requires disk cache capacity sized for the backup data that's not deduplicated plus the additional capacity to store deduped data. The size of the cache depends on whether the dedupe process waits for the entire backup job to complete before starting deduplication or if it starts to deduplicate data as it's written and, more importantly, when the deduplication process releases storage space.
Inline dedupe could negatively impact backup performance when the app uses a fingerprint database that grows over time.
| Inline approaches inspect and dedupe data on the way to the disk target. Performance degradation depends on several factors, including the method of fingerprinting, granularity of dedupe, where the inline processing occurs, network performance, how the dedupe technology workload is distributed and more.
The issue is software vs. hardware. On the hardware side, purpose-built appliances offer faster deployments, integrating with existing backup software and providing a plug-and-play experience. The compromise? There are limitations when it comes to flexibility and scalability. Additional appliances may need to be added as demand for capacity increases, and the resulting appliance "sprawl" not only adds management complexity and overhead, but may limit deduplication to each individual appliance.
With software approaches, disk capacity may be more flexible. Disk storage is virtualized, appearing as a large pool that scales seamlessly. In a software scenario, the impact on management overhead is less and the effect on deduplication may be greater since deduplication occurs across a larger data set than most individual appliance architectures.
Software-based client-side and proxy dedupe optimize performance by distributing dedupe processing across a large number of clients or media servers. Target dedupe requires powerful, purpose-built storage appliances as the entire backup load needs to be processed on the target. Because software implementations offer better workload distribution, inline dedupe performance may be improved over hardware-based equivalents.
Choosing a software or hardware approach may depend on your current backup software implementation. If the backup software in place doesn't have a dedupe feature or option, switching to one that does may pose challenges.
This was first published in June 2008