Published: 09 Jun 2008
The dedupe stage is getting crowded and confusing. Avoid trouble by knowing where to look for it.
Deduplication promises to reduce the transfer and storage of redundant data, which optimizes network bandwidth and storage capacity. Storing data more efficiently on disk lets you retain data for longer periods or "recapture" data to protect more apps with disk-based backup, increasing the likelihood that data can be recovered rapidly. Transferring less data over the network also improves performance. Reducing the data transferred over a WAN connection may allow organizations to consolidate backup from remote locations or extend disaster recovery to data that wasn't previously protected. The bottom line is that data dedupe can save organizations time and money by enabling more data recovery from disk and reducing the footprint and power and cooling requirements of secondary storage. It can also enhance data protection.
Sounds great, doesn't it? But the devil is in the details. For starters, vendors offering data dedupe technology are often polarized when it comes to their approaches. There are debates over deduplication in backup software vs. in the hardware storing backup data, inline vs. post-process deduplication and even the type of hash algorithm used. These topics are discussed in vendor blogs and marketing materials, which can cause a lot of confusion among users.
The fine print
In data protection processes, dedupe is a feature available in backup apps and disk storage systems to reduce disk and bandwidth requirements. Data dedupe technology examines data to identify and eliminate redundancy. For example, data dedupe may use a hash algorithm to create a unique fingerprint for each data object and check that fingerprint against a master index. Unique data is written to storage; for data that has been seen before, only a pointer to the previously written data is stored.
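The fingerprint-and-index idea can be sketched in a few lines of Python. This is a toy illustration, not any vendor's implementation: the class name `DedupeStore`, its methods and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


class DedupeStore:
    """Toy store (hypothetical): keeps one copy of each unique data object."""

    def __init__(self):
        self.index = {}     # fingerprint -> stored bytes (index and storage combined)
        self.pointers = []  # one fingerprint per object written, in write order

    def write(self, data: bytes) -> bool:
        """Record an object; return True if it was new, False if a duplicate."""
        fingerprint = hashlib.sha256(data).hexdigest()
        is_new = fingerprint not in self.index
        if is_new:
            # Unique data: write it to storage once.
            self.index[fingerprint] = data
        # Unique or duplicate, the object itself is represented by a pointer.
        self.pointers.append(fingerprint)
        return is_new

    def read(self, position: int) -> bytes:
        """Recover the object written at the given position via its pointer."""
        return self.index[self.pointers[position]]


store = DedupeStore()
results = [store.write(obj) for obj in (b"alpha", b"beta", b"alpha")]
print(results)           # [True, True, False]
print(len(store.index))  # 2 unique objects stored for 3 writes
```

The third write stores nothing new; reading it back follows the pointer to the copy written first.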
File-level dedupe (or single-instance storage) removes duplicated data at the file level by checking file attributes and eliminating redundant copies of files stored on backup media. This method delivers less capacity reduction than other methods, but it's simple and fast.
Deduplicating at the sub-file level (block level) carves the data into chunks. In general, the block or chunk is "fingerprinted" and its unique identifier is then compared to the index. With smaller block sizes, there are more chunks and, therefore, more index comparisons and a higher potential to locate and eliminate redundancy (and produce higher reduction ratios). One tradeoff is I/O stress, which can be greater with more comparisons; in addition, the size of the index will be larger with smaller chunks, which could result in decreased backup performance. Performance can also be impacted because the chunks have to be reassembled to recover the data.
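The block-size tradeoff can be made concrete with a small experiment. The sketch below assumes fixed-size chunks and SHA-256 fingerprints (real products may use variable-size chunking and other hashes), and the function name `dedupe_stats` and the sample data are invented for illustration. It backs up the same 16 KB of data twice, with a single byte changed in the second copy.

```python
import hashlib


def dedupe_stats(backups, chunk_size):
    """Chunk each backup at a fixed size and count what a dedupe index would hold."""
    index = set()        # fingerprints of chunks already stored
    logical_bytes = 0    # bytes presented for backup
    stored_bytes = 0     # bytes actually written after dedupe
    for data in backups:
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            logical_bytes += len(chunk)
            fingerprint = hashlib.sha256(chunk).digest()
            if fingerprint not in index:
                index.add(fingerprint)
                stored_bytes += len(chunk)
    return {
        "index_entries": len(index),
        "stored_bytes": stored_bytes,
        "logical_bytes": logical_bytes,
    }


# First backup: 16,384 bytes with no internal repetition.
first = b"".join(i.to_bytes(4, "big") for i in range(4096))
# Second backup: identical except for one changed byte.
second = bytearray(first)
second[5000] = 0xFF
second = bytes(second)

for size in (4096, 512, 64):
    print(size, dedupe_stats([first, second], size))
```

With 4,096-byte chunks the one-byte change forces a whole 4 KB chunk to be stored again, but the index holds only five entries. With 64-byte chunks only 64 extra bytes are stored, at the cost of 257 index entries and far more lookups.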
Byte-level reduction is a byte-by-byte comparison of new files and previously stored files. While this method is the only one that guarantees full redundancy elimination, the performance penalty could be high. Some vendors have taken other approaches. A few concentrate on understanding the format of the backup stream and evaluating duplication with this "content-awareness."
Where and when
Deduplication can be timed to occur before data is written to the disk target (inline processing) or after data is written to the disk target (post-processing).
Post-process deduplication writes the backup image to a disk cache before starting to dedupe. This lets the backup complete at full disk performance. Post-process dedupe requires disk cache capacity sized for the backup data that hasn't yet been deduplicated, plus the additional capacity to store deduped data. The size of the cache depends on whether the dedupe process waits for the entire backup job to complete before starting deduplication or starts to deduplicate data as it's written and, more importantly, on when the deduplication process releases storage space.
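As a rough worked example of that sizing rule, the sketch below assumes the worst case: the whole backup job lands in the cache before dedupe starts, and no cache space is released until dedupe finishes. The function name, parameters and figures are hypothetical and chosen only to show the arithmetic.

```python
def post_process_capacity_tb(backup_tb, reduction_ratio, retained_deduped_tb=0.0):
    """Worst-case disk needed for one post-process dedupe cycle (illustrative model)."""
    landing_tb = backup_tb                    # cache for data not yet deduplicated
    new_deduped_tb = backup_tb / reduction_ratio  # what that data shrinks to
    return landing_tb + new_deduped_tb + retained_deduped_tb


# Hypothetical figures: a 10 TB backup, 10x reduction,
# and 20 TB of deduped data already retained on the system.
needed = post_process_capacity_tb(10, 10, retained_deduped_tb=20)
print(needed)  # 31.0
```

A product that dedupes while data is still arriving, or that frees cache space sooner, would need less than this model suggests.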
Inline approaches inspect and dedupe data on the way to the disk target. This could negatively impact backup performance, particularly when the app uses a fingerprint database that grows over time. The degree of performance degradation depends on several factors, including the method of fingerprinting, granularity of dedupe, where the inline processing occurs, network performance, how the dedupe technology workload is distributed and more.
Another issue is software vs. hardware. On the hardware side, purpose-built appliances offer faster deployments, integrating with existing backup software and providing a plug-and-play experience. The compromise? There are limitations when it comes to flexibility and scalability. Additional appliances may need to be added as demand for capacity increases, and the resulting appliance "sprawl" not only adds management complexity and overhead, but may limit deduplication to each individual appliance.
With software approaches, disk capacity may be more flexible. Disk storage is virtualized, appearing as a large pool that scales seamlessly. In a software scenario, the impact on management overhead is smaller and the effect on deduplication may be greater, since deduplication occurs across a larger data set than in most individual appliance architectures.
Software-based client-side and proxy dedupe optimize performance by distributing dedupe processing across a large number of clients or media servers. Target dedupe requires powerful, purpose-built storage appliances as the entire backup load needs to be processed on the target. Because software implementations offer better workload distribution, inline dedupe performance may be improved over hardware-based equivalents.
Choosing a software or hardware approach may depend on your current backup software implementation. If the backup software in place doesn't have a dedupe feature or option, switching to one that does may pose challenges.
Reduction ratios
Beware of reduction ratio reports from vendors. In a recent survey, ESG found that among survey participants using dedupe technology, most reported reduction ratios in the 10x to 20x range (see "Capacity reduction," this page). Data reduction rates vary depending on a number of factors, such as data type, change rate and retention period, and whether or not dedupe occurs across multiple data sets.
A vendor could accurately claim reduction ratios of more than 100x, but some explanation of that number would be required. For example, calculating deduplication ratios for client-side deduplication of a full backup in a low change-rate environment, such as a Windows system, could result in 100x daily reduction ratios. However, this isn't the reduction ratio for backup storage and is a great example of a liberty vendors take when issuing quantifiers. You should ask your deduplication vendor for the method used to arrive at their reduction ratio number.
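The gap between a daily ratio and a storage ratio is easy to see with a simple model. The sketch below is an assumption-laden illustration, not a vendor's method: it supposes a daily full backup, a fixed daily change rate, a first full backup stored with no reduction and only changed data stored afterward. The function names and figures are hypothetical.

```python
def daily_ratio(full_gb, change_rate):
    """Data protected per day divided by data actually sent that day."""
    return full_gb / (full_gb * change_rate)


def cumulative_ratio(full_gb, change_rate, days):
    """Total data protected divided by total data held on backup storage."""
    logical_gb = full_gb * days
    # First full is stored whole; each later day adds only the changed data.
    stored_gb = full_gb + full_gb * change_rate * (days - 1)
    return logical_gb / stored_gb


# Hypothetical environment: 1,000 GB full backup, 1% daily change, 30 days retained.
print(round(daily_ratio(1000, 0.01), 1))           # 100.0
print(round(cumulative_ratio(1000, 0.01, 30), 1))  # 23.3
```

Under these assumptions both numbers describe the same system: 100x on any given day, but roughly 23x for the storage consumed over 30 days. Which one a vendor quotes makes a large difference.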
Dedupe improves the value proposition of disk-based data protection because it eliminates the redundancy traditionally seen in secondary storage processes. Given the explosive growth rates of data and the cost of power and cooling, implementing dedupe is becoming an imperative.
Choosing a deduplication strategy isn't a simple task. Technology maturity varies considerably and the vendor landscape is changing quickly with new entrants and a recent spate of acquisitions. As solutions are considered, cut through the vendor hype by requesting real-world references and proof points. Test backup performance and, more importantly, restore performance. Conducting your performance testing and due diligence early should save you from unnecessary drama later on.