
At the brink of the data deduplication wars

OK folks, the data deduplication war has begun. At the center of the war are vendors such as Data Domain, Diligent Technologies, ExaGrid, FalconStor, Quantum and SEPATON. It is only a question of time before EMC, NetApp and Symantec join the fray. To understand what is happening, let's start with what has happened in the last five years relative to disk-to-disk technologies. The early pioneers included the six vendors mentioned above. Diligent, Quantum and SEPATON brought their VTL products to market approximately three years ago and started marketing the value of secondary disk for backup and restore. By making disk look like tape, they correctly maintained that backup procedures would require no alterations, yet you would see vast improvements in backup and restore speed and reliability. I think all of them have adequately proved that value. Many of you have told me you have seen backup speed improvements of 3x, with 30-50% improvements very common.

None of these vendors said a thing about data deduplication when they entered the market. Data Domain, on the other hand, took a very different tack. They came to market with a disk-based product targeting the same space, but with data deduplication front and center. Their premise from the beginning was that by eliminating duplicate data at a sub-file level, one could keep months of backup data on disk and therefore have fast access not just to what was backed up yesterday but to data that was months, even years, old. When viewed through the data deduplication lens, Data Domain took the lion's share of the market, with 53% of the storage shipped with data deduplication in 2006, according to our estimates. Along with Avamar (now EMC), another data deduplication-centric backup software vendor, they presented an argument for changing the role of tape to that of very long-term retention.

Liftoff for Data Domain took some time, and they focused initially on the SMB market. This was no surprise to me, because all paradigm-shifting ideas take time to sink in. And frankly, you did the right thing in testing the waters before jumping in head first. But the idea made sense. If you could keep months of backup data on disk at prices that came close to tape, why wouldn't you? Once the concept was validated and you built trust in the vendor, you started buying hundreds of terabytes of secondary disk.

While Data Domain was pushing data deduplication, they were also inherently pushing disk as a medium for long-term storage of backups. At the same time, others were presenting their VTL solutions and convincing you of the merits of secondary disk, but without any data deduplication. Behind the scenes, though, they all knew they had to add data deduplication as quickly as possible to compete in this nascent but $1B+ market. Each worked on different ways to squeeze redundancy out of backup data.

At the concept level, they all do the same thing. The way full and incremental backups have been done for years, there is a lot of redundancy built in. Take, for instance, the full backups that you typically do once a week. How much of that data is the same week to week? 90% would not be a bad guess. Why keep copying the same stupid thing again and again? Even with incremental backups, an existing file that has even a single byte changed is backed up again in its entirety. Why? It is best not to get me going on that front. I happen to think that the legacy backup vendors did a miserable job there. But we will leave that aside for now. Back to data deduplication. The idea is to break the file into pieces and keep each unique piece only once, replacing redundant copies of it with small pointers that point back to the original piece. As long as you keep doing full and incremental backups using legacy products from Symantec, EMC (Legato), IBM Tivoli, HP or CA, you will continue to see vast amounts of redundant data that can be eliminated. The value of eliminating this redundant data has been made abundantly clear in the past year by Data Domain customers.
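To make the pieces-and-pointers idea concrete, here is a minimal sketch of sub-file deduplication. It uses fixed-size chunks and SHA-1 hashes purely for illustration; shipping products differ exactly on these points (variable chunking, different hash or delta schemes), and the chunk size and 90%-overlap sample data below are my own assumptions:

```python
import hashlib

CHUNK_SIZE = 4096  # illustrative fixed-size chunks; real products vary

def dedupe(backups):
    """Store each unique chunk only once; represent each backup as a
    list of chunk hashes (the 'pointers' back to the stored pieces)."""
    store = {}       # hash -> chunk bytes, kept once
    manifests = []   # per-backup list of pointers
    for data in backups:
        pointers = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            h = hashlib.sha1(chunk).hexdigest()
            store.setdefault(h, chunk)  # duplicate chunks are not stored again
            pointers.append(h)
        manifests.append(pointers)
    return store, manifests

# Two "weekly fulls" that are mostly identical, like the 90% guess above:
week1 = b"A" * 40_000 + b"B" * 4_000
week2 = b"A" * 40_000 + b"C" * 4_000   # only the tail changed
store, manifests = dedupe([week1, week2])
raw = len(week1) + len(week2)
stored = sum(len(c) for c in store.values())
print(f"raw {raw} bytes, stored {stored} bytes, {len(store)} unique chunks")
```

Both weeks' runs of identical data collapse to the same stored chunks, so the second full costs almost nothing beyond its pointers; restoring a backup is just walking its manifest and concatenating the referenced chunks.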

2006 saw data deduplication offerings from the VTL players: Diligent Technologies, FalconStor and Quantum (via its acquisition of ADIC, which had just acquired Rocksoft, an Australian vendor focused strictly on deduplication technologies). ExaGrid, an SMB player, uses a NAS interface and has had deduplication integral to its product. Each does data deduplication differently. Some use hashing algorithms such as MD5, SHA-1 or SHA-2. Others use content awareness, versions or "similar" data patterns to identify byte-level differences. Each claims data reduction ratios of 20:1 and more over time. Each presents its value proposition and achievable ROI based on its internal testing. Some do inline data deduplication; others perform backups without deduplication first and then reduce the data in a separate process after the backup is finished. Each presents its solution as the best. Are you surprised? I am not.

What is clear to me is the following:

1. The value proposition of using disk for backup and restore is clear. No one can argue with that anymore. The proof points are abundant.

2. The merits of data deduplication are also abundantly clear.

3. However, the merits of the various methods of data deduplication, and the resultant reduction ratios achieved, are in general not yet clear to you.

4. The market for these is huge. (Taneja Group has projected $1,022M in 2010 for the capacity-optimized, i.e., with data deduplication, version of VTL, and $1,615M for all capacity-optimized versions of disk-based products.)

5. Both the VTL and NAS interfaces will prevail. The battlefront is data deduplication.

6. Vendors will do all they can in 2007 to convince you of their solution’s advantages. Hundreds of millions of dollars are at stake here.

7. By the end of this year we will see the separation between winners and losers. Of course, without deduplication I believe a product is dead in any case.

So, be prepared to see a barrage of data coming your way. I am suggesting to the vendor community that they run their products against a common dataset to identify the differences in approaches. I think you should insist on it. Without that, the onus is on you to translate their internal benchmarks into how a product might perform in your environment. You may even need to try the system on your own data. This area of data protection is so important that I think we need some standard approach. We are doing our part to make this happen. You should do yours.

I think we have just seen the beginnings of a war between vendors on this issue alone. To make matters even more interesting, we will see EMC apply the data deduplication algorithms from its Avamar acquisition to other data protection products, maybe even the EMC Disk Library product (OEM'd from FalconStor). I expect NetApp to throw a volley out there soon. Symantec has data deduplication technology acquired from DCT a few years ago, but it is currently only applied to their PureDisk product. IBM and Sun, both OEMs of FalconStor, may use Single Instance Repository (SIR) from FalconStor or something else; no one is sure. I certainly am not. But I am certain that none of the major players in the data protection market dare stay out of this area.

Data deduplication is such a game-changing technology that the smart ones know they have to play. What I can say to you is simple: Evaluate data deduplication technologies carefully before you standardize on one. Three years from now, you will be glad you did. Remember that, for your environment, whether you get a 15:1 or a 25:1 reduction ratio will translate into millions of dollars' worth of disk capacity purchased. I will be writing more about the subtle differences in these technologies. So stay tuned!
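A quick back-of-the-envelope calculation shows why the ratio matters. Every number below is an assumption of mine for illustration (the retained data volume and the per-TB disk price are not figures from any vendor), but the arithmetic is the point:

```python
# Back-of-the-envelope sketch; all inputs are assumed, not vendor figures.
logical_tb = 5_000     # assumed: total logical backup data retained on disk
price_per_tb = 15_000  # assumed: dollars per usable TB of secondary disk

costs = {}
for ratio in (15, 25):
    physical_tb = logical_tb / ratio          # disk you actually purchase
    costs[ratio] = physical_tb * price_per_tb
    print(f"{ratio}:1 -> {physical_tb:,.0f} TB of disk, ${costs[ratio]:,.0f}")

print(f"difference: ${costs[15] - costs[25]:,.0f}")
```

Under these assumptions, the gap between 15:1 and 25:1 is roughly $2M of disk, which is why the achieved ratio in your environment, not the vendor's benchmark, is the number to pin down.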
