This article can also be found in the Premium Editorial Download "Storage magazine: Using two midrange backup apps at once."
Grey Healthcare Group, a New York City-based healthcare advertising agency, works with many media files, some exceeding 2GB in size. The company was storing its files on a 13TB EqualLogic (now owned by Dell Inc.) iSCSI SAN, and backing it up to a FalconStor Software Inc. VTL and eventually to LTO-2 tape. Using FalconStor's post-processing deduplication, Grey Healthcare was able to reduce 175TB to 2TB of virtual disk over a period of four weeks, "which we calculate as better than a 75:1 ratio," says Chris Watkis, IT director.
Watkis realizes that the same deduplication process results could be calculated differently using various time frames. "So maybe it was 40:1 or even 20:1. In aggregate, we got 175TB down to 2TB of actual disk," he says.
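Watkis' point is simple arithmetic: the same system yields different headline ratios depending on the window you measure. A quick sketch (the weekly figure below is a hypothetical illustration, not Grey Healthcare's actual data):

```python
# Illustrative only: how a cumulative dedupe ratio shifts with the
# measurement window. The single-week figure is hypothetical.

def dedupe_ratio(logical_tb: float, physical_tb: float) -> float:
    """Ratio of data protected to disk actually consumed."""
    return logical_tb / physical_tb

# 175 TB of cumulative backups on 2 TB of disk over four weeks:
print(dedupe_ratio(175, 2))  # 87.5 -> "better than 75:1"

# But if a single week ingested, say, 44 TB onto that same 2 TB:
print(dedupe_ratio(44, 2))   # 22.0 -> closer to 20:1
```

The data reduction is identical either way; only the baseline changes.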
Proprietary algorithms deliver the best results. Algorithms, whether proprietary or open, fall into two general categories: hash-based, which fingerprints each chunk of data and keeps an index of pointers back to the single stored copy; and content-aware, which compares incoming data against the most recent backup.
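The hash-based approach can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation; the chunk size and data structures are assumptions chosen for clarity:

```python
# Minimal sketch of hash-based deduplication: chunks are fingerprinted,
# and the index holds pointers (fingerprints) to the one stored copy.
import hashlib

CHUNK_SIZE = 4096
store = {}   # fingerprint -> chunk bytes (the original data)
index = []   # per-backup list of pointers into the store

def ingest(data: bytes) -> None:
    pointers = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        fp = hashlib.sha1(chunk).hexdigest()
        if fp not in store:       # new data: store it once
            store[fp] = chunk
        pointers.append(fp)       # duplicates cost only a pointer
    index.append(pointers)

ingest(b"A" * 8192)  # two identical 4KB chunks...
print(len(store))    # ...but only one stored copy: prints 1
```

A content-aware system would instead diff the incoming stream against the prior backup image rather than consult a global fingerprint index.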
"The science of hash-based and content-aware algorithms is widely known," says Neville Yates, CTO at Diligent. "Either way, you'll get about the same performance."
Yates, of course, claims Diligent uses yet a different approach. Its algorithm, he explains, uses small amounts of data that can be kept in memory, even when dealing with a petabyte of data, thereby speeding performance. Magnum Semiconductor's Wunder, a Diligent customer, deals with files that typically run approximately 22KB; he felt Diligent's approach delivered good results and didn't find it necessary to dig any deeper into the algorithms.
"We talked to engineers from both Data Domain and ExaGrid Systems Inc. about their algorithms, but we really were more interested in how they stored data and how they did restores from old data," says Michael Aubry, director of information systems for three central California hospitals in the 19-hospital Adventist Health Network. The specific algorithms each vendor used never came up.
FalconStor opted for public algorithms such as SHA-1 and MD5. "It's a question of slightly better performance [with proprietary algorithms] or more-than-sufficient performance for the job [with public algorithms]," says John Lallier, FalconStor's VP of technology. Even the best algorithms remain at the mercy of the transmission links, which can lose bits, he adds.
Hash collisions increase data bit-error rates as the environment grows. Statistically this appears to be true, but don't lose sleep over it. Concerns about hash collisions apply only to deduplication systems that use a hash to identify redundant data. Vendors that use a secondary check to verify a match, or that don't use hashes at all, don't have to worry about hash collisions.
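The secondary check mentioned above amounts to treating a hash match as a candidate rather than a verdict. A hedged sketch of the idea (the function and storage layout here are illustrative assumptions, not a specific vendor's design):

```python
# Sketch of verify-on-match deduplication: a fingerprint match is
# confirmed with a full byte comparison before the new copy is dropped.
import hashlib

store = {}  # key -> stored chunk

def write_chunk(chunk: bytes) -> bool:
    """Return True if the chunk was newly stored, False if deduped."""
    fp = hashlib.sha1(chunk).hexdigest()
    existing = store.get(fp)
    if existing is not None and existing == chunk:
        return False              # verified duplicate: safe to discard
    # Either genuinely new data, or a hash collision (same fingerprint,
    # different bytes): store under a qualified key so nothing is lost.
    key = fp if existing is None else fp + ":1"
    store[key] = chunk
    return True
```

With the byte comparison in place, a collision costs a little extra I/O rather than silent data loss.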
GlassHouse Technologies' Preston did the math on his blog and found that with 95 exabytes of data there's a 0.00000000000001110223024625156540423631668090820313% chance your system will discard a block it should keep as a result of a hash collision. The chance the corrupted block will actually be needed in a restore is even more remote.
"And if you have something less than 95 exabytes of data, then your odds don't appear in 50 decimal places," says Preston. "I think I'm OK with these odds."
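Preston's back-of-the-envelope math follows the standard birthday bound: the probability that any two of n random b-bit fingerprints collide is roughly n squared divided by 2^(b+1). The block size below is an assumption for illustration, not a figure from his blog:

```python
# Birthday-bound estimate of hash-collision probability. Parameters
# (8KB blocks, 160-bit SHA-1) are illustrative assumptions.

def collision_probability(n_blocks: int, hash_bits: int) -> float:
    """Approximate chance that any two fingerprints collide."""
    return n_blocks ** 2 / 2 ** (hash_bits + 1)

# 95 exabytes of data in 8KB blocks, fingerprinted with SHA-1:
n = 95 * 2**60 // 8192
p = collision_probability(n, 160)
print(f"{p:.2e}")  # a vanishingly small probability
```

Under these assumptions the result lands in the same vanishingly small territory Preston describes, which is the point: at realistic scales the odds are negligible.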
This was first published in September 2008