Although the opportunity for data deduplication and compression on primary storage may be less than for backups, it is still very significant.
Dave Russell, a research vice president at Gartner, discusses the current techniques for data reduction on primary storage, from standard compression to file- and sub-file deduplication, to a combination of deduplication and compression. He also outlines the emerging approaches that users may find more prevalent in the future.
You can read the podcast interview below or download the MP3 file.
Primary storage deduplication/compression is worth a try
SearchStorage.com: What are the most common approaches for data deduplication and compression on primary storage?
Russell: There are really four main techniques that are used. The first is standard compression, typically Lempel-Ziv compression, based on algorithms from the late '70s -- 1977, 1978. Another approach goes by one of two names: single-instance store, or its acronym SIS, or file-level deduplication. It really tries to reduce commonality from a complete-file perspective -- for example, if you and I both have the same copy of a PDF. The third approach is sub-file deduplication, and that's the kind of deduplication that most people are aware of, where we look for commonality between little bits of files or potentially databases and email systems as well. And the fourth, which we see being applied more frequently, combines a couple of different techniques, most typically deduplication and compression, whereby we can look for commonality within files, bits and pieces of file data, and then compress the results further from there.
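The fourth technique, deduplication followed by compression, can be sketched roughly as follows. This is an illustrative Python sketch, not any vendor's implementation: the fixed 4 KB chunk size, the SHA-256 fingerprints, and the names `dedupe_then_compress` and `restore` are all assumptions made for the example.

```python
import hashlib
import zlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; real products vary


def dedupe_then_compress(files):
    """Store each unique chunk once, compressed.

    `files` maps a file name to its bytes. Returns (store, recipes):
    store maps a chunk fingerprint to the compressed chunk, and
    recipes maps a file name to its ordered list of fingerprints.
    """
    store = {}
    recipes = {}
    for name, data in files.items():
        fingerprints = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            if fingerprint not in store:
                # First time this chunk is seen: compress and keep it.
                store[fingerprint] = zlib.compress(chunk)
            fingerprints.append(fingerprint)
        recipes[name] = fingerprints
    return store, recipes


def restore(store, recipe):
    """Rebuild a file from its list of chunk fingerprints."""
    return b"".join(zlib.decompress(store[fp]) for fp in recipe)
```

Deduplication removes the chunks that repeat across files, and compression then shrinks whatever unique chunks remain, which is why the two are applied in that order here.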
SearchStorage.com: Can you briefly compare and contrast how each of the different approaches works?
Russell: Compression is really looking at a very specific amount of data. One everyday example might be a sound file. An MP3 is an example. Compression is just looking within that individual object, in this case, one single music file, and doesn't really persist any kind of data reduction across other types of files or data that it's going to process later on.
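To illustrate the point that compression works within one object and carries nothing over to the next, here is a minimal Python sketch using the standard `zlib` module, whose DEFLATE format builds on the 1977 Lempel-Ziv algorithm. The sample document and the helper name `compress_each` are made up for the example.

```python
import zlib


def compress_each(objects):
    """Compress every object on its own, as a per-file compressor would."""
    return [zlib.compress(obj) for obj in objects]


# Hypothetical data: two users saving the same document.
document = b"Quarterly report: revenue was flat. " * 500
copies = [document, document]

sizes = [len(c) for c in compress_each(copies)]
# Each copy shrinks, but the second copy costs exactly as much as the
# first, because nothing learned from the first object is reused.
```

Both entries in `sizes` are equal and well below the original size, so the repetition inside the file is removed while the repetition between the two files is not.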
The next step would be single-instance store, which looks for commonality across many different files and persists a dictionary of what has been seen in the past. Deduplication takes this concept even further: it keeps a dictionary of known bits of data and looks at smaller chunks, a finer granularity, across files for repetitive data that may have shown up in the past. So, whereas compression typically works on whatever it's presented with, oftentimes a single file, single-instance store looks at multiple files over a period of time, and deduplication breaks this down further and looks at elements of objects or files over a longer period of time.
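The difference in granularity between single-instance store and sub-file deduplication can be sketched as below. This is a simplified Python illustration under stated assumptions: chunks are a fixed 4 KB (many products use variable-size chunking instead), fingerprints are SHA-256, and the function names and sample data are hypothetical.

```python
import hashlib


def single_instance_bytes(files):
    """File-level dedup: keep one copy per distinct whole-file digest."""
    seen = {}
    for data in files:
        seen.setdefault(hashlib.sha256(data).digest(), len(data))
    return sum(seen.values())


def chunk_dedup_bytes(files, chunk_size=4096):
    """Sub-file dedup: keep one copy per distinct fixed-size chunk."""
    seen = {}
    for data in files:
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            seen.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    return sum(seen.values())


# Hypothetical data: 64 KB of non-repeating bytes, an exact copy,
# and a copy with only the first byte changed.
original = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(2048))
edited = bytes([original[0] ^ 0xFF]) + original[1:]
files = [original, original, edited]
```

With this data, single-instance store collapses the exact copy but must keep the edited file in full, while chunk-level deduplication keeps only the one chunk that changed.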
And part of the difference is when and how this data reduction is applied. It could be as data is created, or it could be after data lands on disk and is processed later on. So these techniques have different kinds of processing requirements associated with them.
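The timing difference described above, reducing data as it is written versus after it lands on disk, can be sketched as two toy stores. This is a minimal Python illustration, not a description of any product: the class names, the in-memory dictionaries standing in for disk, and the `peak_bytes` counter are all assumptions.

```python
import hashlib


class InlineStore:
    """Reduce data as it is written: duplicates never land on disk."""

    def __init__(self):
        self.blocks = {}      # fingerprint -> block
        self.peak_bytes = 0   # highest space use seen after any write

    def write(self, block):
        # The fingerprint is computed in the write path, which costs CPU.
        self.blocks.setdefault(hashlib.sha256(block).digest(), block)
        self.peak_bytes = max(self.peak_bytes, self.used_bytes())

    def used_bytes(self):
        return sum(len(b) for b in self.blocks.values())


class PostProcessStore:
    """Land data first, then reduce it in a later pass."""

    def __init__(self):
        self.landed = []      # raw blocks awaiting reduction
        self.blocks = {}      # fingerprint -> block, after reduction
        self.peak_bytes = 0   # highest space use seen after any write

    def write(self, block):
        # The write path does no fingerprinting; raw data is kept as-is.
        self.landed.append(block)
        self.peak_bytes = max(self.peak_bytes, self.used_bytes())

    def reduce(self):
        # Later pass: fingerprint the landed blocks and drop duplicates.
        for block in self.landed:
            self.blocks.setdefault(hashlib.sha256(block).digest(), block)
        self.landed = []

    def used_bytes(self):
        return (sum(len(b) for b in self.landed)
                + sum(len(b) for b in self.blocks.values()))
```

Both stores end up holding the same reduced data once `reduce()` has run; the trade-off is that the inline store pays the processing cost on every write, while the post-process store needs enough capacity to hold the unreduced data in the meantime.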
SearchStorage.com: How effective are data deduplication and compression for primary storage vs. backups?
Russell: Certainly backup is one of the most redundant kinds of workloads that we have, meaning that we're capturing the same files on a very frequent basis. Some organizations might be doing full backups every single night, and if they're not doing that, they're probably doing a full backup at least once a week, which with the typical change rate of data means at least 90% to 95% of what they're backing up on each full backup is exactly redundant with what they've captured before.
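The arithmetic behind that redundancy can be worked through with a small Python sketch. It rests on simplifying assumptions that are not from the interview: every full backup is the same size, each one differs from the previous one by a fixed change rate, and the changed data is entirely new. The function name and the figure of 10 retained backups are hypothetical.

```python
def dedup_ratio(full_backups, change_rate):
    """Ratio of data backed up to data stored after deduplication.

    Sizes are in units of one full backup. The first full backup is
    stored whole; each later one adds only its changed fraction.
    """
    logical = full_backups
    stored = 1 + (full_backups - 1) * change_rate
    return logical / stored


# 10 retained full backups at the 5% and 10% change rates implied above.
ratio_at_5_percent = dedup_ratio(10, 0.05)    # 10 / 1.45
ratio_at_10_percent = dedup_ratio(10, 0.10)   # 10 / 1.9
```

Under these assumptions the ratio comes out at roughly 6.9:1 for a 5% change rate and 5.3:1 for a 10% change rate, and it keeps growing as more full backups are retained.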
So, the opportunity on primary storage might be a little less, but it's still very significant, especially for so-called unstructured data or things like word processing documents, spreadsheets, PowerPoints, which tend to have not only a lot of commonality but a lot of situations where even one individual saves a file with very, very similar data multiple times. Maybe they're only changing a little bit of a title page as one example, but a lot of the specifics in, say, a contract might look very, very similar. Another example might be databases, where an organization oftentimes has at least half a dozen, maybe even more than 10, copies of their database. So there's a lot of opportunity for reduction in the primary world as well.
SearchStorage.com: Do you foresee any new approaches emerging for data deduplication and compression on primary storage, and if so, how will they work and what kind of results can users expect to see?
Russell: I think the first thing we see coming down into the marketplace relatively soon is a situation where vendors combine more of these capabilities together, and certainly we have some evidence of that today. But we think that we're going to see many more products and solutions that combine compression and deduplication. Today one vendor may only offer deduplication; in the near future, they may offer compression on top of that; and where potentially today they only offer compression, they may expand into dedupe as well.
The next area that we see is that because this tends to take a certain amount of processing power, particularly CPU, we're going to see more advancement in chip technology, and that is going to be much more affordable. The ability to process data faster, and potentially to do that inline rather than landing everything on disk first, is very likely to come about in numerous products.
The third area really is around scope, meaning how global is the data reduction, how wide can we look across repetitive data and reduce commonality. Today, some products are limited to a single LUN or a volume. Others are limited to certain streams, as examples, and we think that we're going to see increasingly broader, more global capabilities that will really drive home this data reduction even further for primary storage.
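The effect of scope can be illustrated with a short Python sketch that counts how many blocks are kept when the deduplication index covers one volume at a time versus all volumes together. The function name, the SHA-256 fingerprints, and the sample LUN contents are assumptions made for the example.

```python
import hashlib


def stored_blocks(volumes, global_scope):
    """Count blocks kept with a per-volume or a global dedup index.

    `volumes` maps a volume name to its list of blocks (bytes).
    """
    if global_scope:
        # One index shared by every volume.
        index = set()
        for blocks in volumes.values():
            index.update(hashlib.sha256(b).digest() for b in blocks)
        return len(index)
    # A separate index per volume: duplicates across volumes are missed.
    return sum(len({hashlib.sha256(b).digest() for b in blocks})
               for blocks in volumes.values())


# Hypothetical LUNs that share the same operating system blocks.
volumes = {
    "lun0": [b"os-block-1", b"os-block-2", b"database-a"],
    "lun1": [b"os-block-1", b"os-block-2", b"database-b"],
}
per_volume = stored_blocks(volumes, global_scope=False)
across_all = stored_blocks(volumes, global_scope=True)
```

Here the per-volume index keeps the shared blocks once on each LUN, while the global index keeps them once in total, which is the kind of gain a broader scope is meant to deliver.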
This was first published in March 2010