This article can also be found in the Premium Editorial Download "Storage magazine: Storage Products of the Year 2010."
Data compression vs. data deduplication
The main technology for PSO is compression, while the cornerstone of SCO is data deduplication. Compression technologies examine a data stream and algorithmically eliminate redundant 0s and 1s in such a way that no data is lost when the stream is uncompressed. Data deduplication at the file level deletes a duplicate file and replaces it with a pointer. Dedupe at the sub-file level does the same, except that it uses a number of pointers, one for each sub-file segment, or chunk. Data deduplication doesn't try to crunch the file as compression does; it looks for duplicates within a repository at either the file level or the sub-file level. Compression can also be applied to secondary storage: a lot of backup data is compressed before it's written to tape, and most SCO products add compression on top of deduplicated data.
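To make the distinction concrete, here is a minimal Python sketch of sub-file deduplication next to ordinary compression. It rests on several assumptions that don't come from any particular product: chunks are a fixed 4 KB (many real systems use variable-size chunks), chunks are identified by SHA-256 digests, and the `ChunkStore` class and its method names are hypothetical.

```python
import hashlib
import zlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, chosen only for illustration


class ChunkStore:
    """Toy sub-file dedupe store: each unique chunk is kept once, keyed by its hash."""

    def __init__(self):
        self.chunks = {}  # SHA-256 digest -> chunk bytes (stored once)
        self.files = {}   # file name -> ordered list of digests (the "pointers")

    def put(self, name, data):
        pointers = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # keep only the first copy
            pointers.append(digest)
        self.files[name] = pointers

    def get(self, name):
        # Rebuild the file by following its pointers in order.
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


# Three distinct, deterministic 4 KB blocks.
block_a = bytes(range(256)) * 16
block_b = bytes(reversed(range(256))) * 16
block_c = b"\x00" * CHUNK_SIZE

file1 = block_a + block_b
file2 = block_a + block_c  # shares its first chunk with file1

store = ChunkStore()
store.put("file1", file1)
store.put("file2", file2)

logical = len(file1) + len(file2)
print("logical bytes:", logical)
print("stored bytes: ", store.stored_bytes())
assert store.get("file1") == file1 and store.get("file2") == file2

# Compression, by contrast, shrinks a single stream and is fully reversible.
packed = zlib.compress(file1)
assert zlib.decompress(packed) == file1
print("compressed file1:", len(packed), "bytes")
```

The shared first chunk is stored once and referenced by both files, while `zlib` works on one stream with no knowledge of what else is in the repository. The two techniques are complementary, which is why compression can be layered on top of deduplicated chunks.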
Global vs. local dedupe
Now let's look at global data deduplication. When you install your first data deduplication solution for your backups, the system needs time to extract duplicates at the sub-file level. The capacity reduction may be only 2 to 1 in the first week, with one full and six incrementals, for instance. As more weekly fulls and daily incrementals are done, the ratio will improve, often to about 20 to 1.
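The reason the ratio climbs over time can be shown with simple arithmetic. The Python sketch below is a toy model, and every number in it is an assumption chosen for illustration (backup sizes, and the fraction of each backup that is new to the repository); it isn't meant to reproduce the 2-to-1 or 20-to-1 figures, which depend on the data and the product.

```python
def dedupe_ratio(weeks, full_gb=1000.0, incr_gb=50.0,
                 full_new_fraction=0.02, incr_new_fraction=0.2):
    """Ratio of data sent to the dedupe system vs. data it must actually store.

    Assumed schedule: one full plus six incrementals per week.
    Assumed behavior: the first full is stored whole; later fulls and all
    incrementals add only the given fraction of their size as new chunks.
    """
    logical = weeks * (full_gb + 6 * incr_gb)
    physical = full_gb                                      # first full, stored whole
    physical += (weeks - 1) * full_gb * full_new_fraction   # later fulls
    physical += weeks * 6 * incr_gb * incr_new_fraction     # all incrementals
    return logical / physical


for w in (1, 4, 12, 52):
    print(f"week {w:2d}: {dedupe_ratio(w):.1f} to 1")
```

In this model the ratio starts low because the first full is almost entirely new data, then rises as each subsequent full contributes mostly chunks the repository already holds.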
If the backups were done intelligently and the same data wasn't repeatedly dragged to the backup disk, we wouldn't need data dedupe. Global deduplication comes into play when a single system can squeeze
Current data deduplication systems vary in the way they perform dedupe, such as whether they do it inline or as a post-process, whether they use a virtual tape library (VTL) or network-attached storage (NAS) interface, and so forth. But a major architectural difference is whether the systems are single node or scale-out (sometimes called clustered). Scale-out solutions can perform global data deduplication simply by adding nodes. Even a system of just two nodes enhances reliability, as the configuration can withstand the failure of a disk in any node or the failure of an entire node. The nodes can be managed as a single system to create a global deduplication solution. A single-node system has no visibility into other nodes, so even though there may be chunks or files that are exactly the same on multiple nodes, they'll be viewed as unique data and stored on each node.
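The difference between per-node and global deduplication comes down to whether the nodes share one index of chunk fingerprints. The following Python sketch illustrates that idea only; the function names, the sample chunk contents, and the use of SHA-256 fingerprints are assumptions, not a description of how any vendor's cluster works.

```python
import hashlib


def fingerprint(chunk):
    return hashlib.sha256(chunk).hexdigest()


def stored_chunks_local(node_streams):
    """Each node keeps its own index, so a chunk seen on two nodes is stored twice."""
    total = 0
    for chunks in node_streams:
        node_index = {fingerprint(c) for c in chunks}
        total += len(node_index)
    return total


def stored_chunks_global(node_streams):
    """All nodes consult one shared index, so each unique chunk is stored once."""
    shared_index = set()
    for chunks in node_streams:
        shared_index.update(fingerprint(c) for c in chunks)
    return len(shared_index)


# Hypothetical chunk streams arriving at two backup nodes.
node_a = [b"os-image", b"mail-db", b"os-image"]
node_b = [b"os-image", b"crm-db"]

local_count = stored_chunks_local([node_a, node_b])
global_count = stored_chunks_global([node_a, node_b])
print("chunks stored with per-node dedupe:", local_count)
print("chunks stored with global dedupe:  ", global_count)
```

Here the only saving from the shared index is the one chunk that both nodes happen to hold. How much overlap exists across nodes in practice determines how much global dedupe improves on standalone systems.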
The merits of global deduplication may seem obvious, but in practical terms they're somewhat diminished. Would you want to put all your eggs in one basket with a single solution for the entire enterprise? And if you have multiple subsidiaries, would it be desirable for them to share one backup repository? Probably not. But that's not to suggest that scale-out solutions are bad; I see scale-out as the preferred architecture as it mitigates proliferation and lets you decide whether to create one monolithic system or multiple standalone systems.
But I'm not convinced that global deduplication is necessarily the primary reason for preferring scale-out products. Global deduplication will yield better reduction ratios, but when compared to standalone systems, the difference is often less than dramatic.
This was first published in February 2011