This article can also be found in the Premium Editorial Download "Storage magazine: Testing data deduplication backup tools."
Hardware-based products propelled deduplication into the mainstream, but now that most backup apps include dedupe, you'll have to carefully evaluate the options.
By Lauren Whitehouse
Data growth grabs most of today's IT headlines, and many IT organizations believe data protection is one of the key contributors to the staggering data capacities that need to be managed. Why? Data protection processes make lots of copies -- at least once per day, but sometimes multiple times daily -- and keep them locally for operational recovery. Copies of copies are also sent offsite for disaster recovery (DR) purposes. Most backup and replication solutions perform these processes inefficiently, copying an entire file again even when only a small portion of its data has changed. Maintaining daily, weekly, monthly and yearly backup copies means that dozens of copies of the same data may be stored, often for extended periods of time. It's this propagation of data that makes data deduplication a compelling technology for secondary storage environments. While the deduplication spotlight has been focused to date on hardware products that optimize storage capacity, the addition of dedupe capabilities in several backup apps could shift the focus in 2009.
As more organizations implement disk in the backup process to overcome the performance and reliability shortcomings of tape-based protection, data deduplication has emerged as a force to improve the economic feasibility of retaining data longer on disk (possibly eliminating tape) or increasing the number of workloads using disk as an interim stop on the way to longer-term retention on tape. Deduplication technology conserves storage space by writing only unique (new or changed) data to disk and linking it via pointers to the previously stored unchanged data.
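To make the "unique data plus pointers" idea concrete, here is a minimal sketch in Python. It isn't how any particular product works: the fixed 4 KB chunk size, the use of SHA-256 fingerprints and the DedupeStore name are all assumptions made for illustration, and commercial products commonly use variable-size chunking and on-disk indexes instead.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, chosen only for illustration


class DedupeStore:
    """Toy content-addressed store (hypothetical): each unique chunk is kept once."""

    def __init__(self, chunk_size=CHUNK_SIZE):
        self.chunk_size = chunk_size
        self.chunks = {}   # fingerprint -> chunk bytes (the unique data "on disk")
        self.backups = {}  # backup name -> ordered list of fingerprints (the pointers)

    def write_backup(self, name, data):
        """Store a backup and return the number of new bytes actually written."""
        pointers = []
        new_bytes = 0
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            if fingerprint not in self.chunks:
                # New or changed data: write it once.
                self.chunks[fingerprint] = chunk
                new_bytes += len(chunk)
            # Unchanged data costs only a pointer to the stored chunk.
            pointers.append(fingerprint)
        self.backups[name] = pointers
        return new_bytes

    def restore(self, name):
        """Rebuild a backup by following its pointers in order."""
        return b"".join(self.chunks[fp] for fp in self.backups[name])
```

In this sketch, a second backup that differs from the first in a single 4 KB chunk adds only 4 KB to the store, yet both backups can still be restored in full.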
Dedupe approaches compared
Hardware vendors spearheaded dedupe adoption with powerful, purpose-built deduplication appliances that process backup data before or after it's written to disk. Benign to the existing backup environment, this hardware-based approach made deploying dedupe relatively easy. Research from the Enterprise Strategy Group has found that the ability to integrate with existing backup processes and overall ease of use are more important adoption factors to organizations than specific technical considerations, such as a deduplication ratio or the granularity of deduplication.
The premium placed on seamless integration with existing data protection practices, along with IT's historic resistance to change when it comes to backup software, meant that backup software vendors offering deduplication had a harder time gaining mindshare in the data center. When EMC Corp.'s Avamar came to market touting a better, more efficient way to back up data, the company faced an obstacle that was hard to overcome: reluctance to walk away from existing backup applications. IT organizations could clearly understand the benefits, but weren't motivated to initiate a technology change that would have a ripple effect on the operational aspects -- people and process -- of the data protection environment. EMC Avamar has therefore had to take a more circuitous route to the data center, providing a bandwidth- and storage-optimized backup solution for remote and branch offices, as well as an efficient data protection alternative for server virtualization environments.
However, the integration of acquired deduplication products by EMC (Avamar) and Symantec Corp. (PureDisk) with NetWorker and Veritas NetBackup, respectively, and the recent introductions of native dedupe by CA, CommVault and IBM Corp. have a lot of IT organizations wondering which is the best implementation of deduplication: hardware or software? Bottom line: It's not a one-size-fits-all scenario.
Factors to consider
Cost, performance, scalability and the deduplication domain are just a few of the factors to weigh when deciding whether a backup application's built-in dedupe capability or a feature built into a backup storage system will best serve your environment.
Cost. Presumably, an investment made in technology that can reduce storage capacity requirements by a factor of 20 will be easily justified. But is there an added fee to enable the feature, whether it's a backup app capability or an "add-on" feature in a hardware device? Is an upgrade to a higher version or model required? Even if deduplication is standard in the product (hardware or software), what other cost implications are there for implementing it (e.g., will it require additional network, server or storage resources)?
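A back-of-the-envelope calculation can help frame these cost questions. The sketch below is illustrative only: the dedupe_savings function name and every figure in the example are hypothetical, and the 20:1 ratio is simply the factor mentioned above, not a number to expect from any given product.

```python
def dedupe_savings(logical_tb, ratio, cost_per_tb, added_costs=0.0):
    """Estimate physical capacity and net savings from deduplication.

    logical_tb  -- backup data to retain before deduplication, in TB
    ratio       -- assumed reduction factor (20 means 20:1)
    cost_per_tb -- assumed cost of one TB of backup disk
    added_costs -- license fees, upgrades, extra servers or network, etc.
    """
    physical_tb = logical_tb / ratio
    capacity_saved_tb = logical_tb - physical_tb
    net_savings = capacity_saved_tb * cost_per_tb - added_costs
    return physical_tb, net_savings
```

With hypothetical inputs of 200 TB of retained backups, a 20:1 ratio, $1,000 per TB and $50,000 in added licensing and hardware, the sketch yields 10 TB of physical capacity and $140,000 in net savings. A lower ratio on your own data, or higher added costs, shifts the result quickly, which is why the questions above matter.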
Performance. Deduplication comes in all shapes and sizes as backup workloads have different requirements. Deduplication may be mixed and matched, taking advantage of features of both software and hardware products. Source-side dedupe in backup software may make the most sense for remote systems because it delivers greater network efficiency, while target-side approaches may make more sense for workloads with the most stringent backup windows.
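The network-efficiency argument for source-side dedupe can be sketched as follows. This is a simplified model under stated assumptions: fixed 4 KB chunks, SHA-256 fingerprints, and a hypothetical bytes_over_network function. It also ignores the small overhead of exchanging fingerprints between the client and the backup target.

```python
import hashlib


def fingerprints(data, chunk_size=4096):
    """Return the SHA-256 fingerprint of each fixed-size chunk, in order."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def bytes_over_network(data, known_fingerprints, source_side, chunk_size=4096):
    """Estimate payload bytes sent from the client to the backup target.

    known_fingerprints -- fingerprints of chunks the target already stores
    source_side        -- True: client dedupes before sending;
                          False: target dedupes after receiving everything
    """
    if not source_side:
        # Target-side: the full backup stream crosses the network.
        return len(data)

    sent = 0
    seen = set(known_fingerprints)
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in seen:
            # Only chunks the target lacks are transmitted.
            sent += len(chunk)
            seen.add(fp)
    return sent
```

In this model, a remote office sending a backup in which one chunk out of four has changed transmits a quarter of the data with source-side dedupe, while a target-side approach transmits all of it but leaves the fingerprinting work off the client.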
Scalability. While deduplication should mitigate the need to expand storage capacity, the impact of growth on the dedupe environment should be thought through. You need to determine how easy or difficult it is to expand the deployed product, and whether expansion will introduce silos of storage (and thereby limit deduplication) and increase management overhead. And does scaling require a forklift upgrade or can it be achieved more seamlessly?
Deduplication domain. You also need to consider the scope of the deduplication effort. Will your dedupe effort be limited to the confines of a single container -- whether it's logical or physical -- or are your goals broader?
Such a wealth of deduplication options provides ample choices, but it can also lead to some confusion. Vendors have the opportunity to educate users about deduplication technology in general, and specifically how their own solutions approach the task. And you need to understand your backup environment and requirements before short-listing solutions. Vet the vendors and their products, check their references and, most importantly, test the products using your own data over several backup cycles.
BIO: Lauren Whitehouse is an analyst focusing on backup and recovery software and replication solutions at Enterprise Strategy Group, Milford, Mass.
This was first published in May 2009