This article can also be found in the Premium Editorial Download "Storage magazine: What you need to know about data dedupe tools for backup."
In a relatively short time, data deduplication has revolutionized disk-based backup, but the technology is still evolving with new applications and more choices than ever.
Data deduplication technology identifies and eliminates redundant data segments so that backups consume significantly less storage capacity. It lets organizations hold onto months of backup data to ensure rapid restores (better recovery time objective [RTO]) and lets them back up more frequently to create more recovery points (better recovery point objective [RPO]). Companies also save money by using less disk capacity and by optimizing network bandwidth.
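To make the idea of eliminating redundant segments concrete, here is a minimal sketch in Python. It is an illustration only, not any vendor's implementation: the `DedupStore` class, its method names and the tiny 4-byte segment size are all hypothetical, chosen so the effect is easy to see. Each segment is fingerprinted with SHA-256, a unique segment is stored once, and every backup is kept as a list of fingerprints that can be replayed on restore.

```python
import hashlib


class DedupStore:
    """Toy segment store (hypothetical): each unique segment is kept once."""

    def __init__(self, segment_size=4096):
        self.segment_size = segment_size
        self.segments = {}  # fingerprint -> segment bytes
        self.backups = {}   # backup name -> ordered list of fingerprints

    def write_backup(self, name, data):
        recipe = []
        for offset in range(0, len(data), self.segment_size):
            segment = data[offset:offset + self.segment_size]
            fingerprint = hashlib.sha256(segment).hexdigest()
            # Store the segment only if this fingerprint has not been seen.
            self.segments.setdefault(fingerprint, segment)
            recipe.append(fingerprint)
        self.backups[name] = recipe

    def restore(self, name):
        # Rebuild the backup by replaying its list of fingerprints.
        return b"".join(self.segments[fp] for fp in self.backups[name])

    def stored_bytes(self):
        return sum(len(segment) for segment in self.segments.values())


# Two nightly backups that differ only in their last segment.
store = DedupStore(segment_size=4)
monday = b"AAAABBBBCCCC"
tuesday = b"AAAABBBBDDDD"
store.write_backup("mon", monday)
store.write_backup("tue", tuesday)

logical_bytes = len(monday) + len(tuesday)  # what the backup app wrote
physical_bytes = store.stored_bytes()       # what the store actually holds
```

In this toy case the two backups total 24 bytes, but only four unique 4-byte segments (16 bytes) are held on disk, and both backups can still be restored in full.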
Deduplication was first adopted by companies with tight backup windows and those looking to reduce tape usage. The primary considerations were seamless integration with incumbent backup apps and processes, and ease of implementation.
In the next wave of adoption, concerns shifted to scaling capacity and performance. Vendors beefed up disk capacity, performance, network connectivity and system interfaces, and also improved deduplication processing. Recovery was improved with the use of optimized replication.
With ongoing data growth and highly distributed environments, organizations and data dedupe vendors have been driven to investigate other ways to optimize deduplication, including new architectures, packaging and deduplication techniques.
Deduplication is definitely desirable
Research from Milford, Mass.-based Enterprise Strategy Group (ESG) reveals that the use of deduplication is increasing. Thirty-eight percent of survey respondents cited adoption of deduplication in 2010 vs. 13% in 2008. By 2012, another 40% plan to adopt deduplication (ESG Research Report, Data Protection Trends, January 2008 and ESG Research Report, Data Protection Trends, April 2010).
In addition, according to the ESG Research Report 2011 IT Spending Intentions, data reduction ranked in the top one-third of all storage priorities for enterprise-scale organizations (those with 1,000 or more employees).
While debates continue about the nuances of deduplication products (file vs. virtual tape library [VTL] interface, source vs. target, hardware vs. software, inline vs. post-process, fixed-block size vs. variable-block size), it’s important to remember that the goal of any deduplication approach is to store less data.
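Of the distinctions listed above, fixed-block vs. variable-block size is the easiest to show in a few lines. The sketch below is a simplified illustration under stated assumptions, not how any particular product works: it assumes variable-size blocks are cut wherever a checksum of the last few bytes matches a pattern (an approach often called content-defined chunking), and the function names, window size and mask are hypothetical. It compares how many blocks survive unchanged when a single byte is inserted at the front of a data stream.

```python
import random
import zlib


def fixed_chunks(data, size=64):
    """Split data into blocks of a fixed size."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data, window=16, mask=0x3F):
    """Cut after any position where the CRC-32 of the preceding
    `window` bytes has its low six bits all zero (illustrative rule)."""
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing block with no cut point
    return chunks


# Seeded pseudo-random data so the comparison is repeatable.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
edited = b"!" + original  # one byte inserted at the very front

shared_fixed = len(set(fixed_chunks(original)) & set(fixed_chunks(edited)))
shared_variable = len(set(variable_chunks(original)) & set(variable_chunks(edited)))
```

Because the insertion shifts every fixed-size block by one byte, `shared_fixed` should be far smaller than `shared_variable`: the variable-size cut points depend only on nearby content, so every block after the first cut point lines up again. Either way, the aim is the one stated above: store less data.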
Target deduplication systems
Products that deduplicate at the end of the backup data path are called target deduplication systems. They’re often storage appliances with disk storage or gateways that can be paired with any disk.
Target dedupe vendors include EMC Corp., ExaGrid Systems Inc., FalconStor Software Inc., Fujitsu, GreenBytes Inc., Hewlett-Packard (HP) Co., IBM, NEC Corp., Quantum Corp., Sepaton Inc. and Symantec Corp. What often distinguishes these products is their underlying architecture. Aside from appliance vs. gateway differences (FalconStor and IBM offer gateways), another key factor is whether they’re single- or multi-node configurations.
With a single-node architecture, performance and capacity scaling is limited to an upper threshold for the configuration. While some of these products can be sized to handle tremendous scale, you may have to over-purchase now to accommodate future growth. When the upper limit is hit, a “forklift” upgrade is required to move up in performance or capacity, or another deduplication unit must be added. The latter option results in deduplication “islands” because backup data isn’t compared for redundancy across systems.
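The cost of deduplication islands can be sketched with a small Python example. The data, the `unique_segments` helper and the 4-byte segment size are hypothetical and chosen only for illustration. Two backup streams that share most of their content are first stored on two independent units, each with its own index, and then on a single unit with one shared index.

```python
import hashlib


def unique_segments(backups, segment_size=4):
    """Return the set of segment fingerprints one index would hold
    after ingesting all of the given backup streams."""
    index = set()
    for data in backups:
        for offset in range(0, len(data), segment_size):
            segment = data[offset:offset + segment_size]
            index.add(hashlib.sha256(segment).digest())
    return index


# Hypothetical backup streams from two servers with a common OS and app.
server_a = b"OSOSAPP1DATA"
server_b = b"OSOSAPP1LOGS"

# Two separate units: neither index knows what the other has stored.
island_total = len(unique_segments([server_a])) + len(unique_segments([server_b]))

# One unit (or a global index): the shared segments are stored once.
global_total = len(unique_segments([server_a, server_b]))
```

Here the two islands hold six segments between them, while a single index holds four, because the common segments are never compared across systems.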
Vendors with a single-node architecture include EMC, Fujitsu, GreenBytes and Quantum. EMC does offer the Data Domain Global Deduplication Array (GDA), a composite system consisting of two DD890 devices that appear as a single system to the backup application. EMC might argue that GDA meets the criteria to be considered a multi-node configuration with global deduplication, but it has two controllers, two deduplication indexes and two storage silos. The devices also aren’t in a high-availability configuration; in fact, if one DD890 goes down, then neither DD890 is available.
EMC distributes a portion of deduplication processing upstream from its appliance, but only for EMC backup apps and backup apps that support Symantec OpenStorage Technology (OST). For example, at the media server, EMC performs pre-processing, creating 1 MB chunks to compare with the deduplication index. If the pattern of the content contained in the large chunks has redundancy, the data is broken down into the more traditional 8 KB chunks, compressed, and transferred to one DD890 controller or the other for further processing, depending on where there’s a better chance of eliminating redundant data.
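The routing idea in that description can be sketched roughly as follows. This is not EMC's algorithm; it is a simplified illustration under several assumptions. The `Node` class and `route_and_send` function are hypothetical, the large chunk is used only as the unit of routing, and "a better chance of eliminating redundant data" is approximated by counting how many of a chunk's small-segment fingerprints each controller's index already holds. The 1 MB and 8 KB defaults come from the article; the demonstration uses much smaller sizes.

```python
import hashlib
import zlib

COARSE = 1024 * 1024  # 1 MB pre-processing chunk, per the article
FINE = 8 * 1024       # 8 KB segment, per the article


class Node:
    """Hypothetical stand-in for one controller with its own index."""

    def __init__(self, name):
        self.name = name
        self.index = {}  # fingerprint -> compressed segment

    def overlap(self, fingerprints):
        # How many of these segments this node has already stored.
        return sum(1 for fp in fingerprints if fp in self.index)


def route_and_send(data, nodes, coarse=COARSE, fine=FINE):
    """For each coarse chunk, pick the node that already holds the most
    of its fine segments, then store only the segments that node lacks.
    Returns a list of (node name, new segments stored) per coarse chunk."""
    report = []
    for start in range(0, len(data), coarse):
        big = data[start:start + coarse]
        pieces = [big[i:i + fine] for i in range(0, len(big), fine)]
        fps = [hashlib.sha256(piece).digest() for piece in pieces]
        # Ties go to the first node in the list.
        target = max(nodes, key=lambda node: node.overlap(fps))
        new_segments = 0
        for fp, piece in zip(fps, pieces):
            if fp not in target.index:
                target.index[fp] = zlib.compress(piece)
                new_segments += 1
        report.append((target.name, new_segments))
    return report


# Tiny sizes so the behavior is visible: 16-byte chunks, 4-byte segments.
a, b = Node("a"), Node("b")
route_and_send(b"AAAABBBBCCCCDDDD", [a], coarse=16, fine=4)  # seed node a
route_and_send(b"WWWWXXXXYYYYZZZZ", [b], coarse=16, fine=4)  # seed node b

# A new stream that mostly resembles what node b already holds.
result = route_and_send(b"WWWWXXXXYYYYQQQQ", [a, b], coarse=16, fine=4)
```

In this run the new stream is routed to node `b`, which already holds three of its four segments, so only one new segment is compressed and stored. Note that each node still keeps its own index, which matches the article's point about two deduplication indexes and two storage silos.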
This was first published in August 2011