With data deduplication in the news today, I recommend checking out the responses to Jon Toigo’s questionnaire for data deduplication vendors. I found his questions about backing up deduped data to tape and the potential legal ramifications of changing data through dedupe especially interesting. The responses from the vendors so far about hardware-based hashing are also interesting, in that they seem to break down according to whether or not their companies offer a hardware- or software-based product.
It would be pretty disappointing if Hifn’s announcement of hardware-based hashing led to a religious war around software- vs. hardware-based dedupe systems. It’s clear (and has been generally accepted, or so I thought) that hardware performs better than software, meaning it’s in users’ best interest to improve the throughput of data deduplication systems by moving processor-intensive calculations to hardware. And the dedupe market is full of enough FUD as it is.
Speaking of which, Data Domain and EMC are getting into a slap-fight over dedupe, thanks to today’s product announcement from Data Domain (and the attendant comparisons to EMC/Avamar) and the fact that EMC is planning to finally roll out deduping tape libraries at EMC World (based on Quantum’s dedupe).
EMC blogger Storagezilla calls Data Domain’s press-release claim that its new product is 17 times faster than Avamar’s RAIN grid “nose gold” (props for the phraseology, at least), and then points out that Avamar’s back end doesn’t actually do any deduping, which is something I still don’t quite get.
“So Data Domain’s box is faster at de-dup than the Avamar back end which doesn’t do any de-dup.
“Since the de-dup is host based and only globally unique data leaves the NIC do I get to count the aggregate de-dup performance of all the hosts being backed up?
“Yes, I do!”
How does Avamar decide what data is ‘globally unique’? If this is determined before data leaves the host, then that processing must be done at the host. ‘Zilla even says he can count the aggregate performance of all the hosts being backed up in the dedupe performance equation ... which brings me back to the first point again: Avamar’s back end doesn’t do de-dupe, but it’s faster at dedupe than Data Domain anyway?
Chris Mellor explored this further:
“According to EMC, Avamar moves data at 10 GB/hr per node (moving unique sub-file data only). Avamar reduces typical file system data by 99.7 percent or more, so only 0.3 percent is moved daily in comparison to the amount that Data Domain has to move in conjunction with traditional backup software. This equals a 333x reduction compared to a traditional full backup (Avamar has customer data indicating as much as 500X, but 333X is a good average).”
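For what it’s worth, the arithmetic behind those figures is easy to check. Here is a quick sketch in Python (the function name is mine, not EMC’s): if 99.7 percent of the data is eliminated, 0.3 percent moves, and 1 / 0.003 comes out to about 333. By the same math, the 500X figure would correspond to roughly a 99.8 percent reduction.

```python
# Back-of-the-envelope check of the quoted reduction figures.
# The 99.7 percent number is from the EMC statement; the rest is plain arithmetic.

def reduction_factor(percent_reduced: float) -> float:
    """Return N for an 'Nx reduction', given the percent of data eliminated."""
    fraction_moved = 1.0 - percent_reduced / 100.0
    return 1.0 / fraction_moved


# 99.7 percent eliminated -> 0.3 percent moved -> about 333x
print(f"{reduction_factor(99.7):.0f}x")

# 99.8 percent eliminated -> 0.2 percent moved -> about 500x
print(f"{reduction_factor(99.8):.0f}x")
```

Note that this only says how much less data is moved. It says nothing about how fast, or where, the work of finding the duplicates gets done.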
‘An EMC spokesperson’ (should we assume it was, or wasn’t, Storagezilla himself?) further stated to Mellor:
“Remember that Data Domain has to move all of the data to the box, so naturally they’re focusing on getting massive amounts of data in quickly. EMC Avamar never has to move all of that data, so instead we focus on de-dupe efficiency, high-availability and ease of restore. Attributes that are more meaningful to the customer concerned with effective backup operations.”
Again I ask, where does the determination that data is ‘globally unique’ take place? It’s got to be taking up processor cycles somewhere. The rate at which it makes those determinations, and where it makes those determinations, would be the apples-to-apples comparison with DD, which is making those calculations as data is fed into its single-box system.
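To make the question concrete, here is a minimal sketch in Python of how host-side (source) dedupe works in general. To be clear, this is not Avamar’s actual implementation: the class and function names, the fixed 4 KB chunks and the choice of SHA-256 are all my assumptions for illustration. The point is only to show where the cycles go. The host does the chunking and hashing, and the back end just answers index lookups and stores whatever it hasn’t seen.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size, chosen only to keep the example short


class DedupeStore:
    """Hypothetical back end: keeps chunks keyed by hash and answers lookups."""

    def __init__(self):
        self.chunks = {}

    def missing(self, digests):
        # Index lookup only; no hashing happens on this side.
        return [d for d in digests if d not in self.chunks]

    def put(self, digest, chunk):
        self.chunks[digest] = chunk


def backup_from_host(data: bytes, store: DedupeStore) -> int:
    """Runs on the host being backed up. Returns bytes actually sent to the store."""
    # The processor-intensive part: chunk and hash everything locally.
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    by_digest = {hashlib.sha256(c).hexdigest(): c for c in chunks}

    # Ask the back end which hashes it has never seen, then send only those chunks.
    sent = 0
    for digest in store.missing(list(by_digest)):
        store.put(digest, by_digest[digest])
        sent += len(by_digest[digest])
    return sent


store = DedupeStore()
host_a = b"A" * 8192 + b"B" * 4096   # two identical "A" chunks plus one "B" chunk
host_b = b"B" * 4096 + b"C" * 4096   # shares the "B" chunk with host_a

first = backup_from_host(host_a, store)    # sends one "A" chunk and the "B" chunk
second = backup_from_host(host_a, store)   # nothing new, so nothing is sent
third = backup_from_host(host_b, store)    # only the "C" chunk is new
print(first, second, third)
```

In a design like this, the hashing rate on each host is the number you would want to line up against Data Domain’s ingest rate, since that is where the dedupe calculation is actually happening.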
All of that is overlooking that the real meat and potatoes when it comes to dedupe is single-stream performance, anyway; total aggregate throughput over groups of nodes (which is really what both vendors are talking about) doesn’t mean as much. For one thing, Data Domain’s aggregate isn’t really aggregate, because it doesn’t have a global namespace yet. For another, I fail to see how EMC can even quote an aggregate TB/hr figure when talking about a group of networked nodes. Doesn’t network speed factor pretty heavily into that equation?
Personally, I don’t think either vendor is really putting it on the line in this discussion (c’mon guys, get MAD out there ;)!). And if Avamar really performs better than Data Domain, why isn’t its dedupe IP being used in EMC’s forthcoming VTLs? (EMC continues to deny this officially, or at least refuses to confirm, but there’s internal documentation floating around at this point that indicates Quantum is the partner.)
Meanwhile, according to EMC via Mellor:
“EMC says Data Domain continues to compare apples and oranges because it wants to avoid the discussion that there are a number of different backup solutions that fit a variety of unique customer use cases.”
I have to admit this made me chuckle. Most of the discussions I’ve had about EMC over the last year or so have involved their numerous backup and replication products and what the heck they’re going to do with them all long-term. Finally, it seems we have an answer: Turn it into a marketing talking point!
I don’t think Data Domain even really wants to avoid that subject, either. They’re well aware that there are a number of different products out there that fit different use cases, given their positioning specifically for SMBs who want to eliminate tape.
At the same time, it’s interesting to watch the EMC marketing machine fire itself up in anticipation of a new major announcement; the scale and coordination are something to behold. This market has already been a contentious one. It’ll be interesting to see what happens now that EMC’s throwing more of its chips on the table.