What you should know about global dedupe


This article can also be found in the Premium Editorial Download "Storage magazine: Storage Products of the Year 2010."


When global dedupe truly matters

However, there are places where global deduplication matters immensely. Take Symantec Corp.'s NetBackup PureDisk as an example. You install it in the main data center, with smaller versions at each of your remote locations. All of them are scale-out, but the data center installation is likely to be multinode while the remote ones are single-node (or dual-node) systems. The data is chunked at each remote site and automatically checked against the master unit in the data center to see if it already exists there; it's then either moved or marked with a pointer. Because the data center unit is the reference point, data across all remote sites is deduplicated and the master unit is very efficient in terms of data deduplication. Keep in mind that even single-node solutions, such as EMC Data Domain and others, allow such elimination across remote sites. But having a large, scalable unit at the data center does make a difference in this case. EMC recently added a two-node Data Domain system in which each node is aware of the other and duplication is eliminated across both nodes. We consider this a "quasi" scale-out system that's perhaps a precursor to a full-blown scale-out solution in the future.
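To make the remote-site flow concrete, here is a minimal Python sketch of the general idea: chunks are fingerprinted at the remote site, checked against a central index, and only unseen chunks are sent. It is an illustration only and not how PureDisk or any other product is implemented. The names MasterIndex, backup_from_site and restore are hypothetical, and the fixed 4 KB chunk size and SHA-256 fingerprints are assumptions made for simplicity.

```python
import hashlib

# Assumption: fixed-size chunks. Real products may use variable-size chunking.
CHUNK_SIZE = 4096


class MasterIndex:
    """Hypothetical stand-in for the master unit in the data center."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes

    def has(self, fingerprint):
        return fingerprint in self.chunks

    def put(self, fingerprint, data):
        self.chunks[fingerprint] = data


def backup_from_site(data, master):
    """Chunk data at a remote site and send only chunks the master lacks.

    Returns (recipe, bytes_sent). The recipe is the ordered list of
    fingerprints (the "pointers") needed to rebuild the data.
    """
    recipe = []
    bytes_sent = 0
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if not master.has(fingerprint):
            # New chunk: move it to the master unit.
            master.put(fingerprint, chunk)
            bytes_sent += len(chunk)
        # Known or new, the backup only records a pointer to the chunk.
        recipe.append(fingerprint)
    return recipe, bytes_sent


def restore(recipe, master):
    """Rebuild the original data by following the pointers in order."""
    return b"".join(master.chunks[fp] for fp in recipe)
```

In this sketch, if two remote sites back up overlapping data against the same MasterIndex, the second site sends only the chunks the first site did not already contribute, which is the cross-site effect described above.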

FalconStor Software Inc.'s solution is a bit different architecturally. Its VTL is scale-out but doesn't have integrated data deduplication. Another scale-out FalconStor product, Single Instance Repository (SIR), sits on the same local-area network (LAN) and performs data deduplication on a post-process basis. One would consider this an example of a system capable of doing global deduplication.

NetApp adds compression to dedupe

NetApp's approach is unique in the industry because it offers a dedupe feature that's shipped free of charge (although it has to be licensed). This is the single case in the industry where deduplication, as we define it, is used for optimizing capacity on primary and secondary storage. It uses a post-process method that looks for block-level redundancies to shrink the data. When data is needed by the application, it's presented in the original format, perhaps with a small amount of latency because the file has to be reconstituted. If the storage is being used to store backups, the capacity optimization is done in exactly the same way. Recently, NetApp added compression for both primary and secondary data. This makes NetApp unique in its approach to capacity optimization and blurs the lines we defined earlier, but it currently doesn't offer global deduplication.
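As a rough illustration of post-process, block-level deduplication combined with compression, the Python sketch below scans blocks that have already been written, keeps one compressed copy of each unique block, and reconstitutes a block on read. It is a simplified model under stated assumptions, not NetApp's implementation. The function names post_process and read_block are hypothetical, and SHA-256 fingerprints and zlib compression are stand-ins chosen for the example.

```python
import hashlib
import zlib


def post_process(blocks):
    """Scan already-written blocks and keep one compressed copy of each.

    Returns (store, block_map). The store maps a fingerprint to the
    compressed bytes of a unique block. The block_map lists, in logical
    order, which fingerprint each original block now points to.
    """
    store = {}
    block_map = []
    for block in blocks:
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint not in store:
            # First time this content is seen: compress and keep it.
            store[fingerprint] = zlib.compress(block)
        # Duplicates only add a pointer, not another copy.
        block_map.append(fingerprint)
    return store, block_map


def read_block(store, block_map, index):
    """Return a block in its original format, decompressing on demand."""
    return zlib.decompress(store[block_map[index]])
```

The decompression step in read_block corresponds to the small amount of read latency mentioned above, since the data has to be reconstituted before the application sees it.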

Permabit Technology Corp. is another noteworthy player. Until recently, the firm offered an appliance designed for archival data. It's classic scale-out, and uses data deduplication and compression to optimize capacity utilization. Permabit recently isolated the deduplication engine and made it available to OEMs who lack PSO or SCO technologies. Because Permabit's archival system, Permeon, can be used as a backup target, as an archive or as tier 3 primary storage, the company claims its data reduction engine combines the benefits of all capacity optimization technologies and applies equally to primary or secondary storage. BlueArc, LSI and Xiotech Corp. have signed on as OEMs for this technology. From a global deduplication perspective, Permabit's architecture does indeed meet our definition, as do NEC's HydraStor and Sepaton's VTL with DeltaStor appliances.

Global data deduplication is an important feature, especially as it applies to data management across remote sites. And it's hard to argue against having a smaller number of systems to manage. Being able to scale simply by adding another node with the system automatically redistributing the data on the back end is a great benefit, too, and it's hard to argue against getting a few more percentage points of deduplication across the enterprise. But don't chase global data deduplication at all costs. Choose a scalable architecture, but not just for the global data deduplication it delivers.

BIO: Arun Taneja is founder and president of Taneja Group, an analyst and consulting group focused on storage and storage-centric server technologies. He can be reached at arunt@tanejagroup.com.

This was first published in February 2011
