Data is flooding the enterprise. Storage administrators are struggling to handle a spiraling volume of documents, audio, video and images, along with an alarming proliferation of large email attachments. Adding storage is not always the best solution; storage costs money and the sheer number of files eventually burdens the company's
Rather than finding ways to store more data, companies are turning to data reduction technologies such as data deduplication. This article explains the basic principles of data deduplication and looks at some of the implementation issues for data deduplication technology.
Understanding data deduplication
Data deduplication is a means of reducing storage space. It works by eliminating redundant data and ensuring that only one unique instance of the data is actually retained on storage media, such as disk or tape. Redundant data is replaced with a pointer to the unique data copy.
deduplication, sometimes called intelligent compression or single-instance storage, is often used in conjunction with other forms of data reduction. Traditional compression has been around for about three decades, applying mathematical algorithms to data in order to simplify large or repetitious parts of a file -- effectively making a file smaller. Similarly, delta differencing reduces the total volume of stored data by comparing the new and old iteration of a file and saving only the data that had changed. Together, these techniques can optimize the use of storage space.
Benefits of data deduplication
When properly implemented, data deduplication lowers the amount of storage space required, which results in less disk expenditures. More efficient use of disk space also allows for longer disk retention periods, which offers better recovery time objective (RTO) for a longer time and reduces the need for tape backups. Data deduplication also reduces the data that must be sent across a WAN for remote backups, replication and disaster recovery.
Data deduplication primarily operates at the file, block and even the bit levels. File deduplication is relatively easy to understand: If two files are exactly alike, one copy of the file is stored and subsequent iterations receive pointers to the saved file. However, file deduplication is not very efficient because the change of even a single bit results in a totally different copy of the entire file being stored.
By comparison, block and bit deduplication looks within a file and saves unique iterations of each block. If a file is updated, only the changed data is saved. This behavior makes block and bit deduplication far more efficient. "It's an order of magnitude difference in terms of the amount of storage that it [block deduplication] saves in a typical environment," says W. Curtis Preston, vice president of data protection at GlassHouse Technologies Inc. Other analysts note that deduplication can achieve compression ratios ranging from 10-to-1 to 50-to-1. However, block and bit deduplication take more processing power and use a much larger index to track the individual blocks.
Data deduplication platforms must contend with the issue of "hash collisions." Each chunk of data is processed using a hash algorithm, such as MD5 or SHA-1, generating a unique number for each piece. The resulting hash number is then compared with an index of the existing hash numbers. If that hash number is already in the index, the piece of data is a duplicate and does not need to be stored again. Otherwise, the new hash number is added to the index and the new data is stored. In rare cases, the hash algorithm may produce the same hash number for two different chunks of data.
When such a hash collision occurs, the system fails to store the new data because it sees that hash number already. This is called a false positive and can result in data loss. Some vendors combine hash algorithms to reduce the possibility of a hash collision. Some vendors are also examining metadata to identify data and prevent collisions.
Implementing data deduplication
The data deduplication process is usually implemented in hardware within the actual storage system, but it is also appearing in backup software. Hardware-based implementations are usually easier to deploy and are geared to reducing storage at the disk level within the appliance or storage system.
Software-based implementations also reduce data, but the reduction is performed at the backup server. This minimizes the bandwidth between the backup server and backup system, which is particularly handy if the backup system is located remotely. "Users get end-to-end benefits when deduplicating data at the source -- less data traverses the WAN, LAN and SAN," says Lauren Whitehouse, analyst at the Enterprise Strategy Group. However, deploying deduplication in a new backup application is more disruptive because it involves installing lightweight agents on the systems that must be backed up, in addition to installing the new backup engine.
Caveats regarding data deduplication
There is no universal approach to data deduplication; results can vary dramatically, depending on factors such as the storage environment and, of course, which dedupe product is selected. Data deduplication only makes sense when long-term retention is involved, usually for backup and archive tasks. Short-term retention sees little benefit because there is nothing to deduplicate against.
Preston cautions against the misinformation circulating between deduplication vendors and suggests focusing on issues of performance, capacity and cost. With due diligence, you can identify potential performance and compression issues in your environment. "Let's say you're backing up seismic data or medical imaging data -- this data tends to not dedupe very well," he says. He also advices users to test a prospective data deduplication platform with various types of backups and restores, and see how it functions under actual circumstances.
Scalability is another issue for data deduplication deployment, especially in terms of performance as the data deduplication system grows. Performance might have been an issue as early hash indexes grew large and additional time was needed to look up each block, but Preston calls that FUD (fouled up data) marketing now. "All of the vendors that I am aware of that are currently shipping or about to ship have addressed this [scaling issue] in one way or another," he says. Nevertheless, he recommends you check with your data deduplication vendor on the matter.
From a management perspective, data deduplication should not present any noticeable increase in overhead. "It [management] shouldn't be any more or less than just a standard VTL [virtual tape library]." When multiple deduplication devices are needed, however, there could be an incremental increase in management effort.
Impact of data deduplication
The Appalachian and coastal areas South Carolina are enticing attractions to tourists and regional industry. Advertising, communication and literature have emerged as key assets to the Department of Parks, Recreation and Tourism -- the agency responsible for promoting tourism as an industry and maintaining an extensive park system throughout the state. The agency originally had an EMC Corp. storage area network (SAN) hosting a total of 4 terabytes (TB), of which 1.2 TB comprised the actual working data set of databases and files, while 2 TB was allocated for disk backups before being relegated to DLT. Like many IT organizations, the agency sought ways to mitigate the increasing storage demands of its media and other data.
After investigating numerous data deduplication vendors, the agency settled on Data Domain Inc.'s 430 appliance for disk backup tasks. With 2 TB of onboard storage, the 430 replaced the 2 TB that had previously been set aside on the SAN. The reduction in space was dramatic with bit level deduplication. "With the compression and deduplication, I think we're using about 900 MB," says Bernie Robichau, the agency's systems administrator and security officer. The space reduction was a welcome cost savings, but it also allowed much longer backup retention on disk. "If someone had requested a two-week old file, I would have never been able to get that from a disk-based backup because I couldn't keep two sets of backups on our allocated 2 TB of hard drive [SAN] storage," Robichau notes. "Now someone can request a file from three weeks ago or six weeks ago, and it's immediately available."
Robichau says that installation of the data deduplication platform was relatively quick and easy, requiring only about four hours of onsite engineering work and minimal configuration. Its current CommVault System Inc. backup infrastructure proved to be fully compatible -- backup agents were simply pointed to the new appliance rather than the EMC SAN. "The backups worked just as they always did, but we're consuming far less disk space and much more retention than we ever did before," he says. While the deduplication appliance requires almost no management time, Robichau notes as much as 75% labor savings in tape overhead, such as cartridge rotation, cleaning and storage. The only remaining tape effort involves full backups on weekends and systematic cartridge rotation to an offsite location.
Although there are no immediate plans to upgrade storage on the 430 appliance, the attention is clearly focused on disaster recovery. Previous considerations of complex disaster recovery plans were put on hold due to complexity. However, the 430 supports replication easily and Robichau expects to replicate the 430 to a duplicate appliance and eliminate backup tapes entirely sometime in the next fiscal year or beyond. "There's no planning beyond synchronizing an identical appliance on site and putting it in one of our remote locations."
Denver-based IT hosting provider Data393 Holdings LLC was drowning in customer data. Its challenge: to keep its data protection business running smoothly, along with other services, like managed server hosting, managed firewalls and load balancing. However, its backup environment was formidable; handling 20,000 backups per month with each customer protecting 20 GB to100 GB. Even with 4.5 TB of protected storage, Data393 Holdings could only keep two weeks of retention. To make matters even more challenging, its StorageTek L700 and L11000 tape libraries were managed by an outsourced provider, requiring a full-time engineer at the hosting provider.
But, it was ongoing restoration problems that forced Data393 Holdings into action. "Our success rate from backups, at the lowest point, was roughly 70%," says Steve Merkel, senior systems engineer. "And far too often, we couldn't hit [restore] the exact day they wanted." Poor performance of the tape backup process also plagued the organization, with full backup windows often exceeding 18 hours. These problems also translated into significant customer support costs. It became clear to Merkel that disk storage was the key to beating reliability and performance woes, and data deduplication would be essential to reduce the total volume of storage needed for full and incremental customer backups.
Data393 Holdings opted for Avamar Technologies' Axion software running on a cluster of 11 Dell 2850s offering about 10 TB of total storage. The actual deployment involved a forklift upgrade, but Merkel reports that the system was up and running in just a few days after installing agents on almost 400 backup servers and migrating necessary data.
The move to data deduplication brought several significant benefits, most notably a reduction in storage requirements. While it might have taken 350 GB to protect 100 GB of customer data without deduplication (full and incremental backups), with data deduplication, it actually takes less storage than the data it's protecting. "I'm using about 7 TB of storage to protect roughly 8 TB of data," Merkel says. "That includes anywhere from two weeks to one year of retention [daily full backups]."
Backup time was also slashed; in some cases an 18-hour backup window fell to 1-1/2 hours, while the backup and restoration success rate was improved to 98% or more. Before, two full-time engineers were needed. After the deployment, that requirement fell to one half-time engineer. "We wanted to have an ROI [return on investment] of 24 months, and we hit payback at 20 months," says Merkel.
Today, the 4.5 TB of protected data has grown to about 7.6 TB protected by data deduplication. About 2 TB of that protected data is replicated to a smaller Avamar deployment at a disaster recovery site in St. Louis. Data393 continues to use tape for long-term archival backups. Merkel expects the amount of protected data to double in the foreseeable future, though less storage will be required to handle the growth.
The future of data deduplication
In the near term, industry experts see data deduplication filling an important role in disaster recovery: saving disk storage space by replicating the data of one deduplication platform to another located off site. This reduces the need to move tapes back and forth, which can be particularly valuable when replicating hundreds of terabytes of data.
Other analysts note that the separate "point products," like VTL, will address backup window performance, while data deduplication addresses the issue of storage capacity. Whitehouse says, "Next-generation backup solutions fix both, deduplicating data as it's sourced from the backup target and improving the efficiency of its transfer across a LAN/WAN to the central disk repository." Deduplication is now common in VTLs and will appear as a feature of traditional backup products.
This was first published in April 2007