Data Deduplication Explained:

How data deduplication eases storage requirements

09 Apr 2007 | Stephen J. Bigelow, Features Writer


Data is flooding the enterprise. Storage administrators are struggling to handle a spiraling volume of documents, audio, video and images, along with an alarming proliferation of large email attachments. Adding storage is not always the best solution; storage costs money and the sheer number of files eventually burdens the company's backup and disaster recovery (DR) plans.

Rather than finding ways to store more data, companies are turning to data reduction technologies such as data deduplication. This article explains the basic principles of data deduplication and looks at some of the implementation issues for data deduplication technology.

Understanding data deduplication

Data deduplication is a means of reducing storage space. It works by eliminating redundant data and ensuring that only one unique instance of the data is actually retained on storage media, such as disk or tape. Redundant data is replaced with a pointer to the unique data copy.

Data deduplication, sometimes called intelligent compression or single-instance storage, is often used in conjunction with other forms of data reduction. Traditional compression has been around for about three decades, applying mathematical algorithms to data in order to simplify large or repetitious parts of a file -- effectively making a file smaller. Similarly, delta differencing reduces the total volume of stored data by comparing the new and old iteration of a file and saving only the data that had changed. Together, these techniques can optimize the use of storage space.

Benefits of data deduplication

When properly implemented, data deduplication lowers the amount of storage space required, which results in lower disk expenditures. More efficient use of disk space also allows for longer disk retention periods, which offers a better recovery time objective (RTO) for a longer time and reduces the need for tape backups. Data deduplication also reduces the data that must be sent across a WAN for remote backups, replication and disaster recovery.

Data deduplication primarily operates at the file, block and even the bit levels. File deduplication is relatively easy to understand: If two files are exactly alike, one copy of the file is stored and subsequent iterations receive pointers to the saved file. However, file deduplication is not very efficient because the change of even a single bit results in a totally different copy of the entire file being stored.
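File-level deduplication as described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `dedupe_files`, the sample file names, and the choice of SHA-256 as the fingerprint are all hypothetical and are not taken from any product mentioned in this article.

```python
import hashlib

def dedupe_files(files):
    """Keep one copy of each distinct file body; every name points at a copy."""
    store = {}      # fingerprint -> file contents (the single retained instance)
    pointers = {}   # file name -> fingerprint (the pointer to the retained copy)
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()  # SHA-256 is an assumption
        store.setdefault(digest, data)             # store only if not seen before
        pointers[name] = digest
    return store, pointers

# Hypothetical sample data: two identical files and one that differs by one byte.
files = {
    "report_v1.doc": b"quarterly numbers" * 1000,
    "report_copy.doc": b"quarterly numbers" * 1000,
    "report_v2.doc": b"quarterly numbers" * 999 + b"Quarterly numbers",
}
store, pointers = dedupe_files(files)
stored_bytes = sum(len(body) for body in store.values())
```

The exact duplicate costs nothing extra, but the one-byte change in `report_v2.doc` forces a second full copy to be stored, which is the inefficiency the paragraph above describes.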

By comparison, block and bit deduplication look within a file and save unique iterations of each block. If a file is updated, only the changed data is saved. This behavior makes block and bit deduplication far more efficient. "It's an order of magnitude difference in terms of the amount of storage that it [block deduplication] saves in a typical environment," says W. Curtis Preston, vice president of data protection at GlassHouse Technologies Inc. Other analysts note that deduplication can achieve reduction ratios ranging from 10-to-1 to 50-to-1. However, block and bit deduplication take more processing power and use a much larger index to track the individual blocks.
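A minimal sketch of block-level deduplication follows. It assumes fixed-size 4 KB blocks and SHA-256 fingerprints, both of which are illustrative choices; commercial products may chunk data differently. The class name `BlockStore` and its methods are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen only for illustration

def split_blocks(data, size=BLOCK_SIZE):
    """Cut data into consecutive fixed-size blocks (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

class BlockStore:
    def __init__(self):
        self.blocks = {}  # fingerprint -> block bytes (the index plus the data)

    def write(self, data):
        """Store data and return its recipe: the ordered list of fingerprints."""
        recipe = []
        for block in split_blocks(data):
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # only unseen blocks are stored
            recipe.append(digest)
        return recipe

    def read(self, recipe):
        """Rebuild the original data by following the recipe."""
        return b"".join(self.blocks[d] for d in recipe)

# Ten distinct blocks, then the same file with a single byte changed in block 3.
original = b"".join(bytes([i]) * BLOCK_SIZE for i in range(10))
changed = bytearray(original)
changed[3 * BLOCK_SIZE] = 255
updated = bytes(changed)

block_store = BlockStore()
recipe_v1 = block_store.write(original)
recipe_v2 = block_store.write(updated)
```

After both writes the store holds 11 blocks rather than 20: the second version adds only the one block that changed. The cost is the larger index, since there is now one entry per block instead of one per file.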

Data deduplication platforms must contend with the issue of "hash collisions." Each chunk of data is processed using a hash algorithm, such as MD5 or SHA-1, generating a number that is intended to be unique for each piece. The resulting hash number is then compared with an index of the existing hash numbers. If that hash number is already in the index, the piece of data is treated as a duplicate and does not need to be stored again. Otherwise, the new hash number is added to the index and the new data is stored. In rare cases, the hash algorithm may produce the same hash number for two different chunks of data.

When such a hash collision occurs, the system fails to store the new data because it has already seen that hash number. This is called a false positive and can result in data loss. Some vendors combine hash algorithms to reduce the possibility of a hash collision. Some vendors are also examining metadata to identify data and prevent collisions.
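The false positive can be reproduced with a toy example. The sketch below uses a deliberately weak 8-bit hash (`weak_hash`, a made-up function) so that a collision is easy to trigger; real platforms use far stronger algorithms. The safeguard shown, a byte-for-byte comparison on every index hit, is a different technique from the combined-hash and metadata approaches mentioned above and is included only to illustrate the failure mode. All class names are hypothetical.

```python
def weak_hash(chunk):
    # Deliberately poor hash so a collision is easy to show; not a real algorithm.
    return sum(chunk) % 256

class NaiveIndex:
    """Trusts the hash alone: an index hit means 'duplicate, skip it'."""
    def __init__(self, hash_fn):
        self.hash_fn = hash_fn
        self.index = {}  # hash -> stored chunk

    def store(self, chunk):
        key = self.hash_fn(chunk)
        if key not in self.index:
            self.index[key] = chunk
        return key  # on a collision the new chunk is silently dropped

    def load(self, key):
        return self.index[key]

class VerifyingIndex:
    """Compares bytes on every hit, so colliding chunks are both kept."""
    def __init__(self, hash_fn):
        self.hash_fn = hash_fn
        self.index = {}  # hash -> list of distinct chunks sharing that hash

    def store(self, chunk):
        h = self.hash_fn(chunk)
        bucket = self.index.setdefault(h, [])
        for position, existing in enumerate(bucket):
            if existing == chunk:          # true duplicate
                return (h, position)
        bucket.append(chunk)               # same hash, different data
        return (h, len(bucket) - 1)

    def load(self, key):
        h, position = key
        return self.index[h][position]

# b"ab" and b"ba" contain the same bytes in a different order, so they collide.
naive = NaiveIndex(weak_hash)
naive.store(b"ab")
lost_key = naive.store(b"ba")

safe = VerifyingIndex(weak_hash)
safe.store(b"ab")
kept_key = safe.store(b"ba")
```

Reading back `lost_key` from the naive index returns `b"ab"`, not the `b"ba"` that was written, which is the data loss described above. The verifying index returns the correct chunk at the price of an extra comparison on each hit.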

Implementing data deduplication

The data deduplication process is usually implemented in hardware within the actual storage system, but it is also appearing in backup software. Hardware-based implementations are usually easier to deploy and are geared to reducing storage at the disk level within the appliance or storage system.

Software-based implementations also reduce data, but the reduction is performed at the backup server. This minimizes the bandwidth between the backup server and backup system, which is particularly handy if the backup system is located remotely. "Users get end-to-end benefits when deduplicating data at the source -- less data traverses the WAN, LAN and SAN," says Lauren Whitehouse, analyst at the Enterprise Strategy Group. However, deploying deduplication in a new backup application is more disruptive because it involves installing lightweight agents on the systems that must be backed up, in addition to installing the new backup engine.
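The bandwidth saving from deduplicating at the source can be illustrated with a small simulation. This is a sketch under assumptions, not how any named product works: the `BackupServer` class, the `backup` function, the 1 KB chunk size and the SHA-256 fingerprints are all hypothetical, and the small cost of sending the fingerprints themselves is ignored.

```python
import hashlib

CHUNK = 1024  # assumed chunk size, for illustration only

def chunks(data):
    return [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]

class BackupServer:
    """Stands in for the remote backup system."""
    def __init__(self):
        self.store = {}  # fingerprint -> chunk

    def missing(self, digests):
        """Tell the client which fingerprints the server has never seen."""
        return {d for d in digests if d not in self.store}

    def upload(self, digest, chunk):
        self.store[digest] = chunk

def backup(data, server):
    """Send only chunks the server lacks; return the chunk bytes transmitted."""
    by_digest = {hashlib.sha256(c).hexdigest(): c for c in chunks(data)}
    sent = 0
    for digest in server.missing(by_digest):
        server.upload(digest, by_digest[digest])
        sent += len(by_digest[digest])
    return sent

server = BackupServer()
day1 = b"".join(bytes([i]) * CHUNK for i in range(8))  # eight distinct chunks
day2 = day1 + bytes([99]) * CHUNK                      # one new chunk appended

sent_day1 = backup(day1, server)
sent_day2 = backup(day2, server)
```

The first backup transmits all 8,192 bytes. The second full backup covers 9,216 bytes but transmits only the 1,024 bytes that are new, which is why this approach helps most when the backup system sits across a WAN.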

Caveats regarding data deduplication

There is no universal approach to data deduplication; results can vary dramatically, depending on factors such as the storage environment and, of course, which dedupe product is selected. Data deduplication only makes sense when long-term retention is involved, usually for backup and archive tasks. Short-term retention sees little benefit because there is nothing to deduplicate against.

Preston cautions against the misinformation circulating among deduplication vendors and suggests focusing on issues of performance, capacity and cost. With due diligence, you can identify potential performance and compression issues in your environment. "Let's say you're backing up seismic data or medical imaging data -- this data tends to not dedupe very well," he says. He also advises users to test a prospective data deduplication platform with various types of backups and restores, and see how it functions under actual circumstances.

Scalability is another issue for data deduplication deployment, especially in terms of performance as the data deduplication system grows. Performance might have been an issue as early hash indexes grew large and additional time was needed to look up each block, but Preston calls that FUD (fear, uncertainty and doubt) marketing now. "All of the vendors that I am aware of that are currently shipping or about to ship have addressed this [scaling issue] in one way or another," he says. Nevertheless, he recommends you check with your data deduplication vendor on the matter.

From a management perspective, data deduplication should not present any noticeable increase in overhead. "It [management] shouldn't be any more or less than just a standard VTL [virtual tape library]." When multiple deduplication devices are needed, however, there could be an incremental increase in management effort.

Impact of data deduplication

The Appalachian and coastal areas of South Carolina are enticing attractions to tourists and regional industry. Advertising, communication and literature have emerged as key assets to the Department of Parks, Recreation and Tourism -- the agency responsible for promoting tourism as an industry and maintaining an extensive park system throughout the state. The agency originally had an EMC Corp. storage area network (SAN) hosting a total of 4 terabytes (TB), of which 1.2 TB comprised the actual working data set of databases and files, while 2 TB was allocated for disk backups before being relegated to DLT. Like many IT organizations, the agency sought ways to mitigate the increasing storage demands of its media and other data.

After investigating numerous data deduplication vendors, the agency settled on Data Domain Inc.'s 430 appliance for disk backup tasks. With 2 TB of onboard storage, the 430 replaced the 2 TB that had previously been set aside on the SAN. The reduction in space was dramatic with bit level deduplication. "With the compression and deduplication, I think we're using about 900 MB," says Bernie Robichau, the agency's systems administrator and security officer. The space reduction was a welcome cost savings, but it also allowed much longer backup retention on disk. "If someone had requested a two-week old file, I would have never been able to get that from a disk-based backup because I couldn't keep two sets of backups on our allocated 2 TB of hard drive [SAN] storage," Robichau notes. "Now someone can request a file from three weeks ago or six weeks ago, and it's immediately available."

Robichau says that installation of the data deduplication platform was relatively quick and easy, requiring only about four hours of onsite engineering work and minimal configuration. The agency's existing CommVault Systems Inc. backup infrastructure proved to be fully compatible -- backup agents were simply pointed to the new appliance rather than the EMC SAN. "The backups worked just as they always did, but we're consuming far less disk space and much more retention than we ever did before," he says. While the deduplication appliance requires almost no management time, Robichau notes as much as 75% labor savings in tape overhead, such as cartridge rotation, cleaning and storage. The only remaining tape effort involves full backups on weekends and systematic cartridge rotation to an offsite location.

Although there are no immediate plans to upgrade storage on the 430 appliance, the attention is clearly focused on disaster recovery. Earlier disaster recovery plans were put on hold because of their complexity. However, the 430 supports replication easily and Robichau expects to replicate the 430 to a duplicate appliance and eliminate backup tapes entirely sometime in the next fiscal year or beyond. "There's no planning beyond synchronizing an identical appliance on site and putting it in one of our remote locations."

Denver-based IT hosting provider Data393 Holdings LLC was drowning in customer data. Its challenge: to keep its data protection business running smoothly, along with other services, like managed server hosting, managed firewalls and load balancing. However, its backup environment was formidable, handling 20,000 backups per month with each customer protecting 20 GB to 100 GB. Even with 4.5 TB of protected storage, Data393 Holdings could only keep two weeks of retention. To make matters even more challenging, its StorageTek L700 and L11000 tape libraries were managed by an outsourced provider, requiring a full-time engineer at the hosting provider.

But it was ongoing restoration problems that forced Data393 Holdings into action. "Our success rate from backups, at the lowest point, was roughly 70%," says Steve Merkel, senior systems engineer. "And far too often, we couldn't hit [restore] the exact day they wanted." Poor performance of the tape backup process also plagued the organization, with full backup windows often exceeding 18 hours. These problems also translated into significant customer support costs. It became clear to Merkel that disk storage was the key to beating reliability and performance woes, and data deduplication would be essential to reduce the total volume of storage needed for full and incremental customer backups.

Data393 Holdings opted for Avamar Technologies' Axion software running on a cluster of 11 Dell 2850s offering about 10 TB of total storage. The actual deployment involved a forklift upgrade, but Merkel reports that the system was up and running in just a few days after installing agents on almost 400 backup servers and migrating necessary data.

The move to data deduplication brought several significant benefits, most notably a reduction in storage requirements. While it might have taken 350 GB to protect 100 GB of customer data without deduplication (full and incremental backups), with data deduplication, it actually takes less storage than the data it's protecting. "I'm using about 7 TB of storage to protect roughly 8 TB of data," Merkel says. "That includes anywhere from two weeks to one year of retention [daily full backups]."
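The figures Merkel quotes can be turned into a rough comparison. The short calculation below uses only the numbers in the paragraph above; the function name is hypothetical, and the 350 GB figure is his estimate of what it "might have taken," so the result is indicative rather than measured.

```python
def storage_per_unit_protected(stored, protected):
    """Units of backup storage consumed per unit of data being protected."""
    return stored / protected

without_dedupe = storage_per_unit_protected(350, 100)  # 350 GB to protect 100 GB
with_dedupe = storage_per_unit_protected(7, 8)         # 7 TB to protect 8 TB

improvement = without_dedupe / with_dedupe             # relative storage reduction
estimated_tb_without = 8 * without_dedupe              # 8 TB at the old overhead
```

On these numbers, backup storage falls from 3.5 times the protected data to 0.875 times, a fourfold reduction. Protecting the same 8 TB at the old overhead would have taken an estimated 28 TB.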

Backup time was also slashed; in some cases an 18-hour backup window fell to 1-1/2 hours, while the backup and restoration success rate was improved to 98% or more. Before, two full-time engineers were needed. After the deployment, that requirement fell to one half-time engineer. "We wanted to have an ROI [return on investment] of 24 months, and we hit payback at 20 months," says Merkel.

Today, the 4.5 TB of protected data has grown to about 7.6 TB protected by data deduplication. About 2 TB of that protected data is replicated to a smaller Avamar deployment at a disaster recovery site in St. Louis. Data393 continues to use tape for long-term archival backups. Merkel expects the amount of protected data to double in the foreseeable future, though less storage will be required to handle the growth.

The future of data deduplication

In the near term, industry experts see data deduplication filling an important role in disaster recovery: saving disk storage space by replicating the data of one deduplication platform to another located off site. This reduces the need to move tapes back and forth, which can be particularly valuable when replicating hundreds of terabytes of data.

Other analysts note that the separate "point products," like VTL, will address backup window performance, while data deduplication addresses the issue of storage capacity. Whitehouse says, "Next-generation backup solutions fix both, deduplicating data as it's sourced from the backup target and improving the efficiency of its transfer across a LAN/WAN to the central disk repository." Deduplication is now common in VTLs and will appear as a feature of traditional backup products.



All Rights Reserved, Copyright 2000 - 2009, TechTarget