Deduplication is already popular in the backup space, but the data reduction technology can also be useful for primary storage, especially for server virtual machine (VM) images and hosted virtual desktops (HVDs).
Dave Russell, research vice president of storage technologies and strategies at Stamford, Conn.-based Gartner Inc., said he has heard accounts of dramatic levels of data reduction, typically among IT shops that take advantage of built-in deduplication from a major vendor.
In this podcast on primary storage deduplication, Russell discusses the potential to use the same vendor’s technology for primary data and backups, as well as the risk of long-term vendor lock-in.
Lots of IT shops use data deduplication for backups, but far fewer use the technology with primary storage. For what types of data does deduplication make sense with primary storage?
David Russell: The things that make deduplication interesting and effective for primary storage are the same as those in the backup world. Backup really started to embrace this technology first because there was already so much redundancy in what was going on with that kind of an activity -- meaning oftentimes doing weekly full backups and some organizations doing full backups on each application every night. The idea is if there’s a lot of commonality of the data payload, then we can get some real benefit from doing deduplication.
In the primary space, server VM images have been a great target for deduplication. Client virtualization, or hosted virtual desktops, can exhibit strong deduplication ratios. We’ve heard of [ratios of] up to 100-to-1, which is 99% reduction, so dramatic space savings in the client or HVD area. But with server virtualization in particular, we’re seeing a lot of deployment because there's so much commonality in those images. The same goes for network shares and file shares, where you and I may have a project to share or a department disk that we keep and drag a lot of the same kinds of Excel spreadsheets, Word documents and things onto, and where we’ve got a lot of redundancy from boilerplates, modified presentations and just a lot of repetitive material.
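Editor's note: the "100-to-1, which is 99% reduction" figure follows from simple arithmetic. The short Python sketch below shows the conversion; the function name `reduction_percent` is made up for illustration and is not taken from any vendor's tooling.

```python
def reduction_percent(ratio: float) -> float:
    """Percent of space saved for a deduplication ratio of `ratio`-to-1."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1 (1-to-1 means no savings)")
    # Stored size is 1/ratio of the original, so the saving is the remainder.
    return (1 - 1 / ratio) * 100


# A few illustrative ratios, including the 100-to-1 case from the interview.
for ratio in (2, 10, 20, 100):
    print(f"{ratio}-to-1 -> {reduction_percent(ratio):.1f}% reduction")
```

Note how quickly the percentage flattens out: going from 20-to-1 to 100-to-1 only moves the saving from 95% to 99%.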
Do you advise an appliance or built-in technology for reducing data with primary storage?
Russell: There are pros and cons of each approach. On paper or theoretically, the advantage of an appliance or some kind of box that sits in front of your existing storage is that you could put that in front of any number of different kinds of disk technologies you may already have. So, if you really weren’t in a position to go out and buy a new device, or perhaps if you had to license or turn on some capability and that was prohibitive for you, a purpose-built appliance or something that sits in front of your disk could be of interest. But the disadvantage is that’s yet another piece of infrastructure, another wire that’s sitting there in front in the data path of your storage, and most of the market is very risk averse to that. We typically see people favoring built-in technologies -- they ideally want it to come from a major vendor and be a capability embedded in their storage solution. One of the advantages of it being embedded is that it’s typically not a charged item, whereas an external appliance will have a cost of goods associated with it.
What options exist to use the same vendor’s data reduction for both primary storage and backups, and will that approach bring any advantage?
Russell: We’ve seen people that have had deduplication in backup, or are getting engaged in deduplication for backup, announce moves in the primary storage space. A year ago, Dell made an acquisition in this space [when it bought Ocarina]. Hewlett-Packard has some organic technology [StoreOnce] it's talked about bringing into the primary workload as well. Right now, there’s not a great deal of cross-pollination. Even someone like an EMC, which has assets like Avamar and Data Domain for the backup world, is using a different type of technology in terms of primary storage reduction.
The potential advantage is that a single supplier could take the primary data, use the form-factoring technology -- deduplication -- to reduce the footprint, and keep that in its smaller state as that data payload moves up and down the environment. Probably the closest to actually doing that today is NetApp, which can do deduplication/compression on the primary storage and keep the files small or shrunken down as they move that off for disaster recovery or send it to another NetApp-based disk.
Will an IT shop encounter any difficulties or problems if it uses dedupe technology from one vendor for primary storage and a different vendor for backups?
Russell: Not inherently. It certainly can work, and there shouldn’t be any kind of connectivity or support-matrix types of challenges with it. The one issue, though, is that you would have to re-inflate the data. If you’re using Vendor A for primary storage and Vendor B for your backup, Vendor A will re-inflate that as it goes out the backup stream, and gets sent across the wire to Vendor B’s new target. That’s not inherently negative. I’ve even heard some organizations describe that as a positive, where they like the idea potentially of changing the technology or the algorithm that’s used to compress and deduplicate that data. But we can envision where the market might ultimately want to go through the exercise of making the data reduced and keeping it reduced.
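Editor's note: the re-inflation step Russell describes can be pictured with a toy model. The Python sketch below is only an illustration under stated assumptions: it uses fixed-size chunks and SHA-256 fingerprints, which is a generic textbook approach and not the algorithm of any vendor named in this interview. The names `dedupe` and `rehydrate`, the chunk sizes and the sample data are all hypothetical.

```python
import hashlib


def dedupe(data: bytes, chunk_size: int):
    """Split data into fixed-size chunks and keep one copy of each unique chunk."""
    store = {}   # fingerprint -> chunk bytes (the unique chunks actually kept)
    recipe = []  # ordered fingerprints needed to rebuild the original data
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)
        recipe.append(digest)
    return store, recipe


def rehydrate(store, recipe) -> bytes:
    """Re-inflate deduplicated data back to its full, original size."""
    return b"".join(store[digest] for digest in recipe)


def stored_bytes(store) -> int:
    """Total size of the unique chunks held in a store."""
    return sum(len(chunk) for chunk in store.values())


# Hypothetical primary data: 16 chunks of 4 KiB built from 4 repeating patterns.
primary = b"".join(bytes([i % 4]) * 4096 for i in range(16))

# "Vendor A" deduplicates primary storage with 4 KiB chunks (an assumption).
store_a, recipe_a = dedupe(primary, chunk_size=4096)

# On the way out to backup, the data is re-inflated to full size on the wire.
wire = rehydrate(store_a, recipe_a)

# "Vendor B" deduplicates the backup stream again, here with 8 KiB chunks.
store_b, recipe_b = dedupe(wire, chunk_size=8192)

print("logical size:", len(primary))
print("stored by A: ", stored_bytes(store_a))
print("sent on wire:", len(wire))
print("stored by B: ", stored_bytes(store_b))
```

In this model the full logical size crosses the wire even though both ends store far less, and the second vendor is free to chunk the data differently, which is the flexibility some organizations see as a positive.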
How tough is it to switch from one data dedupe product to another? Once you make a choice, are you generally locked in for the long haul?
Russell: What we’ve seen is that if you’re going to make the architectural standard on this, there's a certain amount of lock-in associated with that, and you get the good and the bad. You certainly get the familiarity with the vendor and the technology, and how to tune and configure it. But there could be potential negatives, such as how easily the solution scales, and if you max out capacity and just need a little bit more, you might not be able to buy a little bit more. You might have to buy a very large appliance or another whole new frame if you’ve gone past the expandability. So, that could be viewed sometimes as a bit of a lock-in, but some organizations look at this and say, “I can see this being cost affordable for a certain amount of capacity, but if I do it across the whole enterprise, now it starts to feel expensive at scale.”
In terms of difficulty switching, we know of primary storage vendors’ swap-outs occurring on a frequent basis. Not that it’s an easy activity, but again, Vendor A replaces Vendor B, and most people assume that’s what this would be like. But the other case that’s not really taken into account is this: if you’re getting, say, a 20-to-1 deduplication ratio on your primary storage, and you move that to another vendor, you’ve got to re-inflate all that data. That means it’s not just a matter of taking, say, 10 terabytes of data and moving that 10 terabytes of disk space somewhere else. It re-inflates out, and it’s now 200 terabytes of data going across the wire. So, the problem can be a little bigger than it looks.
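Editor's note: the migration arithmetic in the 20-to-1 example can be worked through in a few lines of Python. This is a rough sketch, not a sizing tool: the 10 Gbit/s link speed is an assumption added for illustration (the interview gives no network figures), terabytes are treated as decimal (10^12 bytes), and protocol overhead is ignored. The function names are hypothetical.

```python
def reinflated_tb(stored_tb: float, ratio: float) -> float:
    """Logical size after re-inflating data stored at a `ratio`-to-1 dedupe ratio."""
    return stored_tb * ratio


def transfer_hours(data_tb: float, link_gbps: float) -> float:
    """Idealized transfer time in hours, ignoring overhead and contention."""
    bits = data_tb * 1e12 * 8          # decimal terabytes to bits
    seconds = bits / (link_gbps * 1e9)  # link speed in bits per second
    return seconds / 3600


stored = 10      # terabytes on disk, from the interview example
ratio = 20       # 20-to-1 deduplication ratio, from the interview example
link_gbps = 10   # assumed link speed, not from the interview

logical = reinflated_tb(stored, ratio)
print(f"on disk: {stored} TB, on the wire: {logical:.0f} TB")
print(f"moving {stored} TB at {link_gbps} Gbit/s: {transfer_hours(stored, link_gbps):.1f} h")
print(f"moving {logical:.0f} TB at {link_gbps} Gbit/s: {transfer_hours(logical, link_gbps):.1f} h")
```

Under these assumptions the transfer time grows by the same factor as the deduplication ratio, so a migration that looks like a few hours of copying on paper becomes a multi-day job.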
This was first published in December 2011