Toigo: Deduplication and compression are short-term capacity fixes

Learn why analyst Jon Toigo says the value of dedupe and compression software is lowering, and why other capacity tools are a better bet.

In the toolkit available to storage administrators who want to improve allocation and utilization efficiency in their storage infrastructures, deduplication and compression are often the first tools reached for. However, these two technologies are best considered short-term fixes, providing temporary relief from capacity shortfalls by squeezing more data into the same amount of space. There are several more strategic tools available, but let's examine these two, especially in light of all the attention they have received recently.

Though limited in terms of the extent and duration of their impact, tactical capacity maximization tools have gotten a lot of ink in the past few years. Deduplication, for example, is heralded by some vendors as a "capacity expander" that, through the reduction of the physical space used to store data, enables a single drive to store the data that once occupied the space of several drives. Let's take a closer look.

Deduplication was originally aimed at backup files stored to disks that were configured as tape caches or virtual tape libraries (VTLs). Deduplicating backup workloads may have made sense, given that most full backups contain a considerable percentage of static data -- that is, data that has not changed since the prior backup. Essentially, deduplication processes mimic, in terms of their impact, a backup strategy often referred to as incremental or change backup, in which only changed data is backed up after a full backup has been conducted. Deduplication merely processes the incremental backup in a different way: by comparing the new full backup to the previous one and "reducing," or eliminating, the bits that are the same.
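The mechanism described above -- hashing pieces of a backup stream and physically storing only the pieces not already in the repository -- can be sketched in a few lines. This is a toy fixed-block illustration, not any vendor's implementation; shipping products use variable-length chunking and persistent, crash-safe hash indexes:

```python
import hashlib

def dedupe(chunks):
    """Store each unique chunk once; record every chunk by its hash.

    A toy fixed-block sketch of deduplication for illustration only.
    """
    store = {}    # content hash -> unique chunk data
    recipe = []   # ordered hashes needed to rebuild the stream
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:      # only never-seen data consumes space
            store[digest] = chunk
        recipe.append(digest)
    return store, recipe

# Two "full backups" in which most data is static between runs
backup = [b"static-os-files", b"static-app-files", b"changed-db-page"] * 2
store, recipe = dedupe(backup)
```

Six logical chunks reduce to three stored chunks, and replaying the recipe against the store reconstructs the original stream exactly -- which is why restore-time behavior, not just the reduction ratio, matters.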

Vendors used deduplication algorithms to charge considerably more money for a stand of comparatively cheap (usually consumer-grade SATA) disk drives. An early model of a deduplicating VTL had an MSRP of $410,000 for approximately $3,000 worth of drives and shelf hardware, the justification for the huge price hike being the value-added software (the deduplication software) included on the rig. Once it was determined that few companies would realize sufficient capacity and cost savings from deduplication to justify the price of the rig, the appeal of the strategy began to fade. Also, we have to consider the isolation of deduplication functionality to a single array. Once that array is filled, a second array running an entirely separate deduplication process is required, thus making the value case even less sustainable.

Compression offers another approach, albeit one that is much more straightforward and less subject to the challenges of identifying and deduplicating similar bits of data. IBM's 2010 acquisition of Storwize, an industry leader in the compression space, introduced a family of storage arrays featuring in-line compression technology. Such compression also has a way of stretching capacity by squeezing the same amount of data into a smaller amount of disk turf. The sensible question to ask is whether equipment with compression technology built in yields sufficient capacity savings to justify its price tag. IBM appears to be migrating some of its Storwize technology to its virtualization engine, the SAN Volume Controller, to scale the compression functionality across multiple stands of disk instead of isolating it on one array controller.
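How much capacity compression actually buys depends heavily on the data being stored, which is easy to demonstrate with a general-purpose compressor (a rough illustration using Python's zlib, not the in-line engine of any particular array):

```python
import os
import zlib

# Repetitive data (logs, databases with padded fields) compresses dramatically
text = b"the same log line repeated\n" * 1000
text_ratio = len(text) / len(zlib.compress(text))

# Random-looking data (already compressed or encrypted) barely shrinks,
# and can even grow slightly due to container overhead
noise = os.urandom(27000)
noise_ratio = len(noise) / len(zlib.compress(noise))

print(f"repetitive: {text_ratio:.0f}:1, random: {noise_ratio:.2f}:1")
```

A capacity-planning exercise that assumes one uniform compression ratio across mixed workloads will therefore overstate the savings.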

At the end of the day, neither deduplication nor compression does much more than buy some time so that more strategic measures can be taken to manage capacity. It is also worth noting that initiatives are afoot within most file-system development projects to build both deduplication and compression directly into the file system for optional use by storage administrators. If these efforts are successful, there may be no need in the future to purchase proprietary technologies for delivering these functions that are isolated on specific array controllers. Gone will be the exorbitant price tags and the vendor lock-ins for such functionality.


Join the conversation



I strongly agree with you, Toigo. Deduplication is just another form of incremental backup. I even consider full daily dedupe backup jobs worse than incrementals, because they incur CPU overhead to scan all files, dissect them into pieces, apply the dedupe algorithm, and then compare the hash results with the hash database before deciding whether a given stream of bytes needs to be stored or not.
Sometimes I even find it too time-consuming to run full dedupe jobs. I would rather run daily incrementals with a weekly full synthetic backup job, and then periodically ship from disk to tape for DR/archival purposes.

Additionally, I don't prefer to enable deduplication on primary storage unless there is an SSD tier to which hot storage blocks can be migrated, like EMC FAST Cache or NetApp Flash Pool.

Interesting perspective, but not sure I agree with everything you say. Your focus on deduplication is on the backup scenario. While I agree that the type of backups performed will have a dramatic impact on dedupe rates, you will still see benefits when backing up the same data from different machines (VM images, OS, applications, etc.). There are still great opportunities for optimization. You don't address the value of deduplication on primary storage at all (HDD or Flash). If you reduce the data at the source, before the backup, then you save not only on your most expensive storage (primary), but also on your backup (time and capacity). This is most prevalent when you look at the virtual server and desktop growth trends. Why would you not want to leverage technology that can save you up to 35x up front?

As for compression, this technology has been around forever and is fully vetted. As you know, some data compresses better than other data (databases and "big data" etc.) while some data dedupes better (virtual environments, user data, email, etc.) than others. By combining compression with deduplication, you get the best opportunity to efficiently store data. The technology is available today. Why wouldn't you want to deploy it, as long as it can scale and not impact overall system performance?
I am so sorry but how is extending existing storage by 5-10 times a short term fix? How is reducing the purchase price by 5-10 times a short term fix?

If you could download an app to your iPhone that would give you 80GB instead of the purchased 16GB, you would not do it? Is that a "short term" fix? This is a very flawed article with some secondary interests. Dedupe and compression are very valuable for flash/DRAM and flash-accelerated hybrid arrays like Tegile and some others. They turn the cost per GB, even for flash, upside down.
In addition to what previous commenters mentioned, there might also be a performance benefit:
* if you dedupe primary storage (and have a good dedupe ratio, e.g. in a virtual environment) you can have quite a (read-)performance gain, if your caching mechanisms are dedupe-aware.
* And if your storage system's performance benefits from available space (as with WAFL/ZFS/BTRFS), your write performance will also benefit from good dedupe ratios, since there are more freely available blocks. You will also have fewer 'fragmentation' issues.
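The read-cache point above can be sketched as a toy content-addressed cache (hypothetical names throughout; a real array would take the content hash from its dedupe metadata rather than from a plain dict): when many logical blocks map to one deduplicated block, a single cached copy serves them all.

```python
class DedupeAwareCache:
    """Toy read cache keyed by content hash (illustrative sketch only).

    Identical blocks referenced by many logical addresses occupy a
    single cache slot, so one physical read can serve many VMs.
    """
    def __init__(self, hash_map, backend):
        self.hash_map = hash_map  # logical address -> content hash (dedupe metadata)
        self.backend = backend    # content hash -> block data "on disk"
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def read(self, address):
        key = self.hash_map[address]          # consult dedupe metadata
        if key in self.cache:
            self.hits += 1                    # served from the shared slot
        else:
            self.misses += 1                  # one physical read, then cached
            self.cache[key] = self.backend[key]
        return self.cache[key]

# Ten VM boot images whose first block deduplicates to one stored block
hash_map = {f"vm{i}/block0": "hash-of-os-block" for i in range(10)}
backend = {"hash-of-os-block": b"shared-os-block"}
cache = DedupeAwareCache(hash_map, backend)
for address in hash_map:
    cache.read(address)
```

One physical read satisfies all ten logical reads, which is the gain a dedupe-aware cache captures in a dense virtual environment.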
I cannot agree with you about deduplication storage scalability:
"Also, we have to consider the isolation of deduplication functionality to a single array. Once that array is filled, a second array running an entirely separate deduplication process is required, thus making the value case even less sustainable."
This has not been true for years. Several systems (like HYDRAstor) scale out.
There are a few misstatements in your article, and you neglected to state the important caveats of relying on deduplication. Also, some of the other folks who commented made a few misstatements, as well.

While it's true that we first saw deduplication in the marketplace addressing the backup space, it had nothing to do with VTL. In fact, Data Domain, the first commercial vendor to bring dedupe to the data center (2004-ish), didn't even support VTL until a few versions past its "1.0".

Your comment, "Once it was determined that few companies would realize sufficient capacity and cost savings from deduplication to justify the price of the rig, the appeal of the strategy began to fade." is just garbage. It's irresponsible reporting, at best. Do you really believe that? The strategy is as alive and well as any! Today, vendors such as XtremIO and SolidFire have leveraged dedupe to attempt to bring the cost of flash down to the levels of spinning disk. NetApp has been using it on primary storage for years, and EMC recently entered the dedupe-on-primary-storage game with VNX2. The "appeal" is alive and well.

""Also, we have to consider the isolation of deduplication functionality to a single array. Once that array is filled, a second array running an entirely separate deduplication process is required, thus making the value case even less sustainable." is also garbage and irresponsible reporting. As mentioned, Hydrastor, as well as Data Domain have had the notion of a global dedupe realm for years.

Your reference to a $410K solution also sounds dubious. I was one of the early adopters of dedupe and my entire acquisition, which INCLUDED REMOTE REPLICATION targets, cost about that - a solution to back up my entire data center! Yes, you're paying for software on these arrays, but what storage vendor ISN'T a software company? Aside from the market liking them as software vendors, most storage vendors take commodity parts, slap a pretty bezel on their shelves, pop some software on them, then mark up the solution considerably.

What you failed to warn your readers about is the resource cost of deduplication. Deduplication does not reduce on-system performance requirements -- it increases them. It takes a considerable amount of CPU cycles to run deduplication. Most organizations make storage purchase decisions that they must live with for a number of years. Day One, you may have the cycles to run dedupe, but Day 1095, you may not. Then what? If you've relied heavily on dedupe for space optimization, you may be in for a world of hurt when you have to turn dedupe off, then watch your capacity utilization rise, only to realize that you can't add any more disk to your system because you are at capacity. Welcome to the out-of-band forklift upgrade.

I wish that I could attach a screenshot of a presentation that I did on deduplication, where at the time, I showed CLI snippets of a system holding backups of our Oracle Financials database. During this symposium, which was targeted at US Department of Defense technology managers, I crammed 1TB+ of data into something like 15GB of addressable storage.

"At the end of the day, neither deduplication nor compression does much more than buy some time " Or, it saves organizations money in storage, power, cooling, floor space .... I love the iPhone analogy that someone posed. What I would add, is that if the cost of the app makes sense....

You also should have differentiated between in-line and at-rest deduplication. Different use cases and scenarios where either is a fit.

Honestly, please do some real research before you print this stuff.

Lots of comments here that merit additional response on my part. So, here goes.

First of all, the core statements made in this article bear restating:

1. De-duplication and compression are not a solution to the problem of unmanaged data growth; they provide at best a holding action -- a temporary fix while a real strategy is devised. Like "trash compactors" that produced neat cubes of refuse and were expected to curb the rate at which landfills filled, data reduction technologies do not reduce storage capacity requirements over the long term: even compacted trash fills the landfill. So, simply put, data reduction does not permanently resolve the problem of unmanaged data growth. For that, you need archive -- active and deep.

2. The first mention of data reduction technology was in the context of backups and virtual tape libraries. VTLs had come to be used as a location (usually a disk platform) for storing 30, 60, or 90 days of data in order to facilitate fast restore of individual files that had become corrupted or otherwise lost. Compressing or de-duplicating these backups enabled the same sort of space savings that would accrue to using incremental backups in conjunction with a full backup, discarding data that already existed in the repository. In that context, de-dupe made considerable sense -- provided it did not introduce challenges for data restore timeframes.

3. De-duplication was originally introduced on hardware as a "value-add" feature that was joined at the hip to a proprietary array controller by a vendor. This contributed enormous cost to the platform as indicated in my example that was difficult to justify given the failure of these platforms to deliver their promised data reduction rates or to scale beyond a single frame. HydraStor did provide a scale-out model, but a proprietary one that required all kit to be purchased from a single vendor. This exception probably should have been pointed out in the piece.

4. Given that de-duplication (like compression in years past) is being added to file systems directly, the concern with proprietariness and hardware lock-in raised in the piece might be ameliorated to some extent. However, many firms shy away from hardware or software compression or de-duplication in any case, preferring un-obfuscated access to original data -- think archive, where the techniques for data storage need to be considered within the context of potential data retrieval problems over time and the requirement to un-ingest archived data, then re-ingest it, for every change that occurs in "data container" technology. If de-dupe were ever to become a standard, this concern might diminish. However, as a practical matter, in many financial firms I work with, the technology is not used at all out of concern about the retrieval of data and the admissibility in court of data that has been de-duplicated -- a process not yet evaluated or sanctioned by regulators.

I'm not sure if the above covers everyone's boggles on this piece, but it restates what I hope are key points for further discussion. Personally, I don't have a problem with the technology itself. It is a good tactical solution in some cases to deferring capacity expansion requirements. However, you don't fix the problem of the storage junk drawer just by compressing or de-duplicating the junk you are putting into the drawer: at some point, you have to do the hard work of sorting out the junk drawer itself.