In a relatively short time, data deduplication has revolutionized disk-based backup, but the technology is still evolving with new applications and more choices than ever.
Data deduplication technology identifies and eliminates redundant data segments so that backups consume significantly less storage capacity. It lets organizations hold onto months of backup data to ensure rapid restores (better recovery time objective [RTO]) and lets them back up more frequently to create more recovery points (better recovery point objective [RPO]). Companies also save money by using less disk capacity and by optimizing network bandwidth.
Deduplication was first adopted by companies with tight backup windows and those looking to reduce tape usage. The primary considerations were seamless integration with incumbent backup apps and processes, and ease of implementation.
In the next wave of adoption, concerns shifted to scaling capacity and performance. Vendors beefed up disk capacity, performance, network connectivity and system interfaces, and also improved deduplication processing. Recovery was improved with the use of optimized replication.
APIs and open standards
Symantec Corp.’s OpenStorage Technology (OST) is an API for NetBackup (Versions 6.5 and higher) and Backup Exec 2010. Target deduplication system vendors leverage the API to write a software plug-in module that’s installed on the backup media server to communicate with the storage device, creating tighter integration between the backup software and target storage. It enables features such as intelligent capacity management, media server load balancing, reporting and lifecycle policies. It also delivers optimized duplication -- network-efficient replication and direct disk-to-tape duplication that’s monitored and cataloged by the backup software. EMC Corp. offers similar functionality for EMC NetWorker; however, to date, the benefits are only extended to EMC Data Domain deduplication systems.
APIs facilitate interoperability, but could the industry take it one step further with a deduplication standard? A standard algorithm, similar to compression today, could emerge and open-source software could be the vehicle for it to develop and gain a following. The lobby for a standard is fueled by the need to seamlessly, efficiently and rapidly move data between disk and tape (without having to un-deduplicate or rehydrate the data), as well as to improve recovery operations. Any of the dedupe technologies added to open-source backup apps -- such as Bacula and Amanda -- and open-source ZFS and SDFS file systems could one day emerge as a standard.
With ongoing data growth and highly distributed environments, organizations and data dedupe vendors have been driven to investigate other ways to optimize deduplication, including new architectures, packaging and deduplication techniques.
Deduplication is definitely desirable
Research from Milford, Mass.-based Enterprise Strategy Group (ESG) reveals that the use of deduplication is increasing. Thirty-eight percent of survey respondents cited adoption of deduplication in 2010 vs. 13% in 2008. By 2012, another 40% plan to adopt deduplication (ESG Research Report, Data Protection Trends, January 2008 and ESG Research Report, Data Protection Trends, April 2010).
In addition, according to the ESG Research Report 2011 IT Spending Intentions, data reduction ranked in the top one-third of all storage priorities for enterprise-scale organizations (those with 1,000 or more employees).
While debates continue about the nuances of deduplication products (file vs. virtual tape library [VTL] interface, source vs. target, hardware vs. software, inline vs. post-process, and fixed-block vs. variable-block size), it’s important to remember that the goal of any deduplication approach is to store less data.
Target deduplication systems
Products that deduplicate at the end of the backup data path are called target deduplication systems. They’re often storage appliances with disk storage or gateways that can be paired with any disk.
Target dedupe vendors include EMC Corp., ExaGrid Systems Inc., FalconStor Software Inc., Fujitsu, GreenBytes Inc., Hewlett-Packard (HP) Co., IBM, NEC Corp., Quantum Corp., Sepaton Inc. and Symantec Corp. What often distinguishes these products is their underlying architecture. Aside from appliance vs. gateway differences (FalconStor and IBM offer gateways), another key factor is whether they’re single- or multi-node configurations.
With a single-node architecture, performance and capacity scaling is limited to an upper threshold for the configuration. While some of these products can be sized to handle tremendous scale, you may have to over-purchase now to accommodate future growth. When the upper limit is hit, a “forklift” upgrade is required to move up in performance or capacity, or another deduplication unit must be added. The latter option results in deduplication “islands” because backup data isn’t compared for redundancy across systems.
Vendors with a single-node architecture include EMC, Fujitsu, GreenBytes and Quantum. EMC does offer the Data Domain Global Deduplication Array (GDA), a composite system consisting of two DD890 devices that appear as a single system to the backup application. EMC might argue that GDA meets the criteria to be considered a multi-node configuration with global deduplication, but it has two controllers, two deduplication indexes and two storage silos. The devices also aren’t in a high-availability configuration; in fact, if one DD890 goes down, then neither DD890 is available. EMC distributes a portion of deduplication processing upstream from its appliance, but only for EMC backup apps and backup apps that support Symantec OpenStorage Technology (OST). For example, at the media server, EMC performs pre-processing, creating 1 MB chunks to compare with the deduplication index. If the pattern of the content contained in the large chunks has redundancy, the data is broken down into the more traditional 8 KB chunks, compressed, and transferred to one DD890 controller or the other for further processing, depending on where there’s a better chance of eliminating redundant data.
In a multi-node architecture, a product can manage multiple dedupe systems as one. This approach also provides linear throughput and capacity scaling, high availability and load balancing. There’s a reduction in administrative overhead and, importantly, global deduplication is typical. ExaGrid EX Series, FalconStor File-interface Deduplication System (FDS), HP’s Virtual Library Systems (VLS), IBM ProtecTIER, NEC HYDRAstor, Sepaton DeltaStor and Symantec NetBackup 5000 Series all have multi-node configurations and support global deduplication. The modular architectures of these products deliver impressive aggregate performance and let you grow the systems seamlessly.
Global refers to the domain of comparison for deduplication. Identification of duplicates occurs in one of two ways. Within a single domain, backup data passing through an individual system is compared only with other data passing through that same system. With global deduplication, backup data passing through an individual system is compared with data passing through that system as well as data passing through the other systems in the domain. Global deduplication can result in higher deduplication ratios because there are more comparisons and, therefore, more chances to find duplicate data.
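The difference between deduplication “islands” and a global domain can be sketched in a few lines of Python. This is an illustration only, not any vendor’s design: the `DedupeNode` class, the sample streams and the use of SHA-256 fingerprints are all assumptions made for the example.

```python
import hashlib


def fingerprint(segment: bytes) -> str:
    """Identify a segment by its SHA-256 digest (an assumed choice of hash)."""
    return hashlib.sha256(segment).hexdigest()


class DedupeNode:
    """Hypothetical dedupe system that stores only segments its index hasn't seen."""

    def __init__(self, index=None):
        # A private index models an "island"; passing in a shared set
        # models a global deduplication domain.
        self.index = set() if index is None else index
        self.stored_segments = 0

    def ingest(self, segments):
        for segment in segments:
            fp = fingerprint(segment)
            if fp not in self.index:
                self.index.add(fp)
                self.stored_segments += 1


def total_stored(nodes):
    return sum(node.stored_segments for node in nodes)


# Two backup streams that share two of their three segments.
stream_a = [b"os-image", b"app-binaries", b"db-page-1"]
stream_b = [b"os-image", b"app-binaries", b"db-page-2"]

# Islands: each node compares data only against its own index.
islands = [DedupeNode(), DedupeNode()]
islands[0].ingest(stream_a)
islands[1].ingest(stream_b)

# Global: both nodes consult one shared index.
shared_index = set()
global_nodes = [DedupeNode(shared_index), DedupeNode(shared_index)]
global_nodes[0].ingest(stream_a)
global_nodes[1].ingest(stream_b)

print(total_stored(islands))       # 6 segments stored
print(total_stored(global_nodes))  # 4 segments stored
```

With separate indexes, the common segments are stored once per node; with a shared index, they are stored once for the whole domain.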
Symantec’s appliance is a new entrant in the target deduplication system field through a joint venture with Huawei. Symantec maintains a unique position in the data protection market as the only vendor to offer integrated deduplication in its own backup software- and hardware-based products as well as catalog-level integration with backup target devices of third-party vendors via its OST interface.
Deduplication in backup software
While originally limited to so-called “next-generation” backup apps like EMC’s Avamar, deduplication in backup software is now pervasive. Backup software products with deduplication include Arkeia Network Backup, Asigra Cloud Backup, Atempo Time Navigator, CA ARCserve, Cofio Software AIMstor, CommVault Simpana, Druva InSync and Phoenix, EMC Avamar, i365 EVault, IBM Tivoli Storage Manager (TSM), Quest Software NetVault Backup, Symantec Backup Exec and NetBackup, and Veeam Backup & Replication.
In some software products, client agents running on application servers identify and transfer only unique data to the backup media server and target storage device, reducing network traffic. Other software solutions deduplicate the backup stream at the backup server, removing any potential performance burden from production application servers. In either case, the deduplication domain is limited to data protected by the backup application; multiple backup applications in the same environment create deduplication silos.
Global deduplication isn’t a given with software approaches either. First of all, not all vendors employ the same techniques for identifying duplicates. Some deduplicate by employing delta differencing (e.g., Asigra), which compares data segments for the same backup set over time. Deltas identify the blocks in the current set that are unique relative to the previous backup of that set, and only those blocks are transferred. Delta differencing doesn’t make comparisons across different sets (i.e., no global deduplication).
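A minimal Python sketch of block-level delta differencing follows. It is a generic illustration and does not describe Asigra’s implementation; the 4-byte block size, the `delta` function and the sample data are assumptions chosen to keep the example readable.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, for illustration only


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE):
    """Hash each fixed-size block of a backup set."""
    return [
        hashlib.sha256(data[i:i + block_size]).digest()
        for i in range(0, len(data), block_size)
    ]


def delta(previous: bytes, current: bytes, block_size: int = BLOCK_SIZE):
    """Return {block_number: block} for blocks that changed or were added
    relative to the previous backup of the SAME set.
    (Truncation of the set is not handled in this sketch.)"""
    previous_hashes = block_hashes(previous, block_size)
    changed = {}
    for number, start in enumerate(range(0, len(current), block_size)):
        block = current[start:start + block_size]
        is_new = number >= len(previous_hashes)
        if is_new or hashlib.sha256(block).digest() != previous_hashes[number]:
            changed[number] = block
    return changed


monday = b"AAAABBBBCCCC"
tuesday = b"AAAAXXXXCCCCDDDD"

print(delta(monday, tuesday))    # {1: b'XXXX', 3: b'DDDD'}

# A different backup set has no previous version to compare against,
# so every block is transferred even if another set already holds it.
print(len(delta(b"", tuesday)))  # 4
```

The second call shows why this technique alone provides no global deduplication: the comparison never looks beyond the set’s own history.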
Another approach is to use a hash algorithm. Some vendors segment the backup stream into fixed blocks (anywhere from 8 KB to 256 KB), generate a hash value and compare it to a central index of hashes calculated for previously seen blocks. A unique hash indicates unique data that should be stored. A repeated hash signals redundant data, so a pointer to the unique data is stored instead. Other vendors rely on variable block sizes that help increase the odds that a common segment will be detected even after a file is modified. This approach finds natural patterns or break points that might occur in a file and then segments the data accordingly. Even if blocks shift when a file is changed, this approach is more likely to find repeated segments. The trade-off? A variable-length approach may require a vendor to track and compare more than just one unique ID for a segment, which could affect index size and computational time.
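The contrast between fixed and variable block sizes can be shown with a small Python sketch. It is a simplified illustration under stated assumptions, not any vendor’s algorithm: the break-point rule (a CRC-32 of the last 16 bytes with its low 10 bits equal to zero), the 4 KB fixed block size and the function names are all hypothetical, and real products also enforce minimum and maximum segment sizes.

```python
import hashlib
import random
import zlib


def fixed_chunks(data: bytes, size: int = 4096):
    """Cut the stream at fixed offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data: bytes, window: int = 16, mask: int = 0x3FF):
    """Cut the stream wherever a checksum of the last `window` bytes has
    its low bits all zero, so break points depend on content, not offset."""
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) & mask == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def dedupe(chunks, index: set) -> int:
    """Add unseen chunk hashes to the central index; return how many were new."""
    new = 0
    for chunk in chunks:
        digest = hashlib.sha256(chunk).digest()
        if digest not in index:
            index.add(digest)
            new += 1
    return new


def new_chunks_after_edit(chunker, original: bytes, edited: bytes) -> int:
    index = set()
    dedupe(chunker(original), index)       # first backup fills the index
    return dedupe(chunker(edited), index)  # second backup: count unique chunks


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
edited = b"!" + original  # one byte inserted at the front shifts everything

fixed_new = new_chunks_after_edit(fixed_chunks, original, edited)
variable_new = new_chunks_after_edit(variable_chunks, original, edited)
print(fixed_new, variable_new)
```

Because the inserted byte shifts every fixed block boundary, the fixed-block pass treats essentially the whole file as new. The content-defined break points move with the data, so only the segments around the insertion (at most two in this sketch) are stored again.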
Arkeia Software uses another approach it calls progressive deduplication. This method optimizes deduplication with a sliding-window block size and a two-phase progressive-matching deduplication technique. Files are divided into fixed blocks, but the blocks can overlap so that when a file is changed, the block boundaries accommodate the insertion of bytes. Arkeia adds another level of optimization by automatically assigning fixed block sizes (from 1 KB to 32 KB) based on file type. The technique also uses a sliding window to determine duplicate blocks at every byte location in a file. Progressive deduplication is designed to achieve high reduction ratios and to minimize false positives while accelerating processing.
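The general idea of checking for known fixed-size blocks at every byte location can be sketched as follows. This is not Arkeia’s implementation, whose internals aren’t described here. The sketch assumes that the two phases are a cheap checksum lookup followed by a strong hash confirmation, and the function names, the 8-byte block size and the sample data are hypothetical.

```python
import hashlib
import zlib


def build_index(data: bytes, block_size: int):
    """Index the fixed blocks of an earlier backup: weak checksum -> {strong hash: offset}."""
    index = {}
    for offset in range(0, len(data) - block_size + 1, block_size):
        block = data[offset:offset + block_size]
        strong = hashlib.sha256(block).digest()
        index.setdefault(zlib.adler32(block), {})[strong] = offset
    return index


def match_blocks(new_data: bytes, index, block_size: int):
    """Slide a block-sized window one byte at a time over the new data.
    Returns ("ref", old_offset) for known blocks and ("raw", bytes) otherwise."""
    ops, literal, pos = [], bytearray(), 0
    while pos + block_size <= len(new_data):
        window = new_data[pos:pos + block_size]
        # Phase 1 (assumed): cheap checksum rules out most windows quickly.
        # A production version would roll this checksum instead of recomputing it.
        candidates = index.get(zlib.adler32(window))
        # Phase 2 (assumed): a strong hash guards against false positives.
        offset = candidates.get(hashlib.sha256(window).digest()) if candidates else None
        if offset is not None:
            if literal:
                ops.append(("raw", bytes(literal)))
                literal.clear()
            ops.append(("ref", offset))
            pos += block_size  # jump past the matched block
        else:
            literal.append(new_data[pos])
            pos += 1           # slide the window by one byte
    literal.extend(new_data[pos:])
    if literal:
        ops.append(("raw", bytes(literal)))
    return ops


old = b"AAAAAAAABBBBBBBBCCCCCCCC"    # three 8-byte blocks
new = b"AAAAAAAAxyBBBBBBBBCCCCCCCC"  # two bytes inserted after the first block
ops = match_blocks(new, build_index(old, 8), 8)
print(ops)
# [('ref', 0), ('raw', b'xy'), ('ref', 8), ('ref', 16)]
```

Even though the insertion pushes the later blocks off their original boundaries, the sliding window realigns with them, and only the two inserted bytes are stored as new data.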
Deduplication’s growing pains
As deduplication technology has matured, users have experienced most of the growing pains. Growing data volumes that tax backup and recovery have been a catalyst for performance and scale improvements, and have shifted attention to scale-out architectures for deduplication solutions. And replacing tape devices at remote and branch offices created requirements for optimized site-to-site replication, as well as a way to track those duplicate copies in the backup catalog.
In its most recent Data Protection Trends research report, ESG surveyed end users regarding their deduplication selection criteria, and cost was the top purchase consideration. Some of the issues affecting cost include the following:
- Some backup software vendors add deduplication as a no-cost feature (CA and IBM TSM), while others charge for it.
- There are hidden costs, such as the added fee to enable replication between deduplication systems. And the recovery site has to be a duplicate (or nearly so) of the system at the primary location, which can double fees. There are exceptions, such as Symantec 5000 Series appliances, which include device-to-device replication at no charge. Symantec also licenses its product based on the front-end capacity of the data being protected vs. the back-end capacity of the data being stored, so replicated copies don’t incur additional costs.
- Target deduplication system vendors bundle their storage hardware with the deduplication software, so refreshing the hardware platform means the software is repurchased. Again, Symantec takes a different approach, licensing software and hardware separately.
Users drive new dedupe developments
In addition to Arkeia’s progressive deduplication approach, other developments have been pushing the dedupe envelope. CommVault’s ability to deduplicate on physical tape media is one such example. In spite of the initial hype regarding disk-only data protection and the potential to eliminate tape, for most companies the reality is that tape is still an obvious, low-cost choice for long-term data retention. Dedupe has been considered only a disk-centric process due to the need for the deduplication index and all unique data to be available and accessible to “rehydrate” what’s stored. That means when deduplicated data is copied or moved from the deduplication store to tape media, it must be reconstituted, reversing all the benefits of data reduction. But CommVault’s Simpana software enables archival copies of deduplicated data without rehydration, requiring less tape media. Importantly, data can be recovered from tape media without having to first recover the entire tape to disk.
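Why rehydration depends on the index and the unique data can be seen in a small Python sketch. It is a generic illustration, not CommVault’s design; the `deduplicate` and `rehydrate` functions, the in-memory `store` dictionary and the 4-byte block size are assumptions made for the example.

```python
import hashlib


def deduplicate(data: bytes, store: dict, block_size: int = 4):
    """Keep one copy of each unique block in `store` and return a recipe
    (an ordered list of block hashes) that stands in for the original data."""
    recipe = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        key = hashlib.sha256(block).hexdigest()
        store.setdefault(key, block)  # store the block only if it is unseen
        recipe.append(key)
    return recipe


def rehydrate(recipe, store: dict) -> bytes:
    """Rebuild the original data; every referenced block must be reachable."""
    return b"".join(store[key] for key in recipe)


store = {}
backup = b"AAAABBBBAAAABBBBAAAA"
recipe = deduplicate(backup, store)

print(len(recipe), len(store))             # 5 block references, 2 unique blocks
print(rehydrate(recipe, store) == backup)  # True
```

The recipe alone is useless without the store it points into, which is why a copy to tape traditionally meant writing out the fully rehydrated data.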
When source deduplication approaches gained traction, the key benefits touted were the end-to-end efficiency of backing up closer to the data source (content-awareness, network bandwidth savings and faster backups) and distributing deduplication processing across the environment (vs. having the proverbial four-lane highway hit the one-lane bridge downstream at the target deduplication system). These two themes are evident in HP’s StoreOnce deduplication strategy and EMC Data Domain’s Boost approach.
With the introduction of IBM Linear Tape File System (LTFS), a data format that provides a file system interface to data stored on LTO-5 tape media, tape can be treated more like an external disk device. With LTFS, data doesn’t have to be written in a tape format, so the data is independent of the application that wrote it. Tape may also be a more appropriate long-term storage medium for incompressible data types, such as medical images and video files. Does LTFS offer an opportunity for dedupe vendors to integrate tape as a long-term storage tier for deduplicated data? The jury’s still out on that one, as we’ll have to see if vendors adopt it.
While HP Data Protector software doesn’t have deduplication built into its backup architecture today, users can benefit from HP’s StoreOnce deduplication strategy. StoreOnce is a modular component that runs as a service in a file system. It can be integrated with HP Data Protector backup software and HP’s scale-out file system or embedded in HP infrastructure components. The StoreOnce algorithm involves two steps: sampling large data sequences (approximately 10 MB) to determine the likelihood of duplicates and routing them to the best node for deduplication, and then doing a hash and compare on smaller chunks. HP’s dedupe strategy is differentiated because it’s portable, scalable and global. The implication is that dedupe deployments can extend across a LAN or WAN and among storage systems without flip-flopping data between rehydrated and deduplicated states.
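The two-step pattern described above (sample a large sequence, route it to the most promising node, then hash and compare smaller chunks) can be sketched in Python. This is not HP’s StoreOnce code. The `Node` class, the every-other-hash sampling rule, the 4-byte chunk size and the node names are all assumptions chosen to keep the illustration short.

```python
import hashlib

CHUNK = 4  # unrealistically small, for illustration only


def chunk_hashes(sequence: bytes, chunk: int = CHUNK):
    """Hash each small chunk of a large data sequence."""
    return [
        hashlib.sha256(sequence[i:i + chunk]).digest()
        for i in range(0, len(sequence), chunk)
    ]


def sample(hashes, every: int = 2):
    """Cheap sample of the sequence: every Nth chunk hash (hypothetical rule)."""
    return hashes[::every]


class Node:
    """Hypothetical deduplication node with its own chunk index."""

    def __init__(self, name: str):
        self.name = name
        self.index = set()

    def likely_duplicates(self, sampled) -> int:
        return sum(1 for h in sampled if h in self.index)

    def store(self, hashes) -> int:
        """Hash-and-compare step: keep only unseen chunks, return how many."""
        new = set(hashes) - self.index
        self.index |= new
        return len(new)


def route_and_store(sequence: bytes, nodes):
    hashes = chunk_hashes(sequence)
    sampled = sample(hashes)
    # Step 1: route to the node whose index matches the most sampled hashes.
    best = max(nodes, key=lambda node: node.likely_duplicates(sampled))
    # Step 2: full hash-and-compare on the chosen node only.
    return best.name, best.store(hashes)


nodes = [Node("node-a"), Node("node-b")]
nodes[0].store(chunk_hashes(b"AAAABBBBCCCCDDDD"))
nodes[1].store(chunk_hashes(b"WWWWXXXXYYYYZZZZ"))

print(route_and_store(b"WWWWXXXXYYYYQQQQ", nodes))  # ('node-b', 1)
```

Sampling keeps the routing decision cheap, and sending similar data to the same node preserves most of the benefit of a global index without every node having to consult every other node’s index.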
EMC Data Domain’s Boost option enables Data Domain to perform deduplication pre-processing earlier in the backup flow with NetBackup, Backup Exec, EMC Avamar or EMC NetWorker. A Data Domain software component is installed on the backup server or application client. The tasks performed there help improve deduplication performance by distributing the workload while introducing network efficiency between the backup server or application client and the Data Domain system.
What’s in store for deduplication?
Disk-based data protection addresses backup window issues and deduplication addresses the cost of disk used in backup configurations. But new capture techniques, such as array-based snapshots, are emerging to meet high-performance requirements for those organizations with little or no backup window and minimal downtime tolerance. In many cases, block-level incremental capture and deduplication are baked into snapshot products. NetApp’s Integrated Data Protection products (SnapMirror, SnapProtect and SnapVault), coupled with NetApp FAS-based deduplication, eliminate the need for deduplication in backup software or target deduplication systems.
Similarly, Actifio Virtual Data Pipeline (VDP) takes a full image-level backup and continuous block-level incrementals thereafter, and deduplicates and compresses the data so a third-party data reduction application isn’t needed. Nimble Storage takes a similar approach. It combines primary and secondary storage in a single solution, leverages snapshot- and replication-style data protection, and employs capacity optimization techniques to reduce the footprint of backup data. These approaches undermine traditional-style backup and, therefore, traditional deduplication techniques.
BIO: Lauren Whitehouse is a senior analyst focusing on data protection software and systems at Enterprise Strategy Group, Milford, Mass.