A few companies are still kicking the tires of deduplication products, but veteran users should be thinking about how they can step up to the next level of backup dedupe.
Anything labeled “2.0” carries the expectation of being the “next big thing” or at least something better than what you have now. This phenomenon is coming to the world of storage in the form of Dedupe 2.0 and it’s a cool development.
Whether you’re a veteran dedupe user ready to step up to a new version or still in the initial evaluation phase, data deduplication is a “when” not an “if.” And frankly, the “when” is now. Everyone understands that disk-based protection is a better frontline defense for most recovery scenarios. But disk-based protection without deduplication can be cost prohibitive, so you need dedupe to make disk-based protection as cost effective as possible. From there, you can proceed to the cloud, retain data on tape or just replicate it to another disk.
But you need disk and, therefore, you need dedupe. New research from Enterprise Strategy Group (ESG) shows that nearly 80% of IT environments already use or plan to deploy deduplication as part of their data protection solution.
Today’s data deduplication conversation shouldn’t focus on “when” or even “how well” the dedupe solution compresses per se. Almost every vendor touts its own offering as having the most efficient compression ratios or methodology. Behind closed doors, these vendors can tell you exactly which data types their products dedupe best, and which data types competing products will choke on.
Of course, since storage vendors often battle in “speeds and feeds,” dedupe has inevitably led to a lot of marketing hype. To legitimately quantify dedupe effectiveness, we would all have to agree on an industry-standard methodology for measuring deduplication efficiency (and a way to publish the findings) before we discuss how well a given product dedupes. Until then, you’ll have to test the products on your short list to determine which ones fit your needs.
Dedupe is still a “where” discussion. In that regard, I offer that deduplication lends itself to a “good, better, best” assessment -- in other words, dedupe 1.0, 1.5 or 2.0. Here’s a guide to help answer the “where” question:
• Deduplicated storage is good (dedupe 1.0). Everyone should incorporate deduplication into their disk-based protection architecture, so simply having deduplicated storage is a good thing.
• Deploying smarter backup servers is better (dedupe 1.5). With legacy deduplication, the backup server is oblivious to the storage being deduplicated. It sends everything it backs up to storage, and then the storage discards most of it because the data already exists in the deduplicated storage pool. Extending deduplication intelligence (or even just awareness) to the backup server solves that problem. The backup server won’t send data the deduplicated storage array already has.
• Client-side deduplication is best (dedupe 2.0). Why send everything from the production server if it will be discarded by the storage array (dedupe 1.0) or the backup server (dedupe 1.5)? Instead, by making the production server dedupe aware, only new data whose fragments or blocks aren’t already in deduplicated storage is sent.
While there are scenarios where deduplication should be driven by the storage (1.0) or media server (1.5), the ideal arrangement for most environments would have the changed data going directly from the production server to storage (2.0), with the backup server acting only as the scheduler and the keeper of the catalog and metadata. This level of deduplication maturity doesn’t necessarily require dedicated hardware, but it does require software.
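To make the “where” question concrete, here is a minimal Python sketch that counts how many payload bytes leave the production server and how many arrive at storage under each of the three models. It is an illustration only, not any vendor’s implementation: the fixed 4 KB chunk size, the SHA-256 fingerprints, the `DedupeStore` class and the `run_backup` function are all assumptions made for the example, and the small cost of fingerprint lookups is ignored.

```python
import hashlib

CHUNK_SIZE = 4096  # assumption: fixed-size chunks; many products use variable-length chunking


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into fixed-size chunks (the last chunk may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(chunk: bytes) -> str:
    """Identify a chunk by its SHA-256 digest."""
    return hashlib.sha256(chunk).hexdigest()


class DedupeStore:
    """Hypothetical deduplicated storage pool: one copy per unique chunk."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}

    def has(self, fp: str) -> bool:
        return fp in self.blocks

    def put(self, chunk: bytes) -> None:
        self.blocks.setdefault(fingerprint(chunk), chunk)


def run_backup(data: bytes, store: DedupeStore, dedupe_at: str) -> dict[str, int]:
    """Count payload bytes leaving the production server and arriving at storage.

    dedupe_at is "storage" (1.0), "backup_server" (1.5) or "client" (2.0).
    Fingerprint lookups are assumed to be negligible and are not counted.
    """
    if dedupe_at not in ("storage", "backup_server", "client"):
        raise ValueError(f"unknown dedupe location: {dedupe_at}")
    sent_by_client = 0
    received_by_storage = 0
    for chunk in split_chunks(data):
        is_new = not store.has(fingerprint(chunk))
        # Only a dedupe-aware client can skip chunks the pool already holds.
        if dedupe_at != "client" or is_new:
            sent_by_client += len(chunk)
        # Only a dedupe-unaware sender pushes duplicate chunks all the way to storage.
        if dedupe_at == "storage" or is_new:
            received_by_storage += len(chunk)
        store.put(chunk)
    return {"sent_by_client": sent_by_client, "received_by_storage": received_by_storage}


# Ten distinct 4 KB chunks, then a second backup in which only the first chunk changed.
first = b"".join(bytes([i]) * CHUNK_SIZE for i in range(10))
second = bytes([99]) * CHUNK_SIZE + first[CHUNK_SIZE:]

results = {}
for mode in ("storage", "backup_server", "client"):
    store = DedupeStore()
    run_backup(first, store, mode)  # initial full backup seeds the pool
    results[mode] = run_backup(second, store, mode)
    print(mode, results[mode])
```

In this toy run the second backup moves 40,960 bytes across both hops under 1.0, 40,960 bytes from the production server but only 4,096 into storage under 1.5, and 4,096 bytes across both under 2.0. All three end up storing the same unique chunks; what changes is how much data travels to get there.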
ESG’s latest research on data protection modernization reveals how dedupe users are delivering or planning to deliver deduplication within their data protection products. Among current dedupe users, 46% use a software-only approach, 28% apply hardware-only methods and 26% employ a combination of the two.
• A software-only approach might involve Symantec NetBackup 7.5, for example, with its client-side deduplication and accelerator features. Symantec Backup Exec, CommVault Simpana and Quest (now part of Dell) NetVault represent similar software-centric approaches.
• A hardware-only approach might involve any deduplication array in which enablement software is turned off (or not available) within the backup server, and the backup server is unaware of the deduplication capabilities of the storage.
• A hardware plus software approach might be something along the lines of the EMC Data Domain products, with Data Domain Boost at work either in the backup server or in the production server via EMC NetWorker or even Oracle Recovery Manager (RMAN). Similar functionality is being touted by Hewlett-Packard (HP) through its recently announced HP StoreOnce and Catalyst enablement APIs.
Interestingly, IT respondents who aren’t yet using deduplication (but plan to deploy it) have a different strategy. In those cases, only 19% plan to use a software-only approach (down from 46% of current users). These respondents have a much higher anticipated use of hardware-centric or hardware plus software products.
If you haven’t committed to data deduplication, do that first. Next, consider where the deduplication will occur. When talking to vendors, get them to pinpoint their published ingest rate for data during backups, as well as their restore rate (which may be very different).
Then, because every vendor calls itself the “most innovative next-generation leader in deduplication,” test each product -- not with a read-through of its whitepapers, but with a representative sampling of your data.
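Before the vendor bake-off, you can get a rough feel for how redundant your own sample data is with a few lines of Python. The sketch below is a back-of-the-envelope estimate, not an industry-standard methodology (none exists yet) and not a prediction of any product’s ratio: the `estimate_dedupe_ratio` function, the fixed 4 KB chunking and the synthetic “nightly backup” data are all assumptions for illustration, and real products add their own chunking and compression on top.

```python
import hashlib
from typing import Iterable


def estimate_dedupe_ratio(samples: Iterable[bytes], chunk_size: int = 4096) -> float:
    """Return logical bytes divided by unique-chunk bytes across all samples."""
    seen = set()
    logical = 0  # bytes the backups contain before deduplication
    stored = 0   # bytes that would remain if each unique chunk were kept once
    for data in samples:
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            logical += len(chunk)
            digest = hashlib.sha256(chunk).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(chunk)
    return logical / stored if stored else 1.0


# Hypothetical sample: three "nightly backups" that differ only in their last chunk.
night1 = b"".join(bytes([i]) * 4096 for i in range(8))
night2 = night1[:-4096] + bytes([200]) * 4096
night3 = night1[:-4096] + bytes([201]) * 4096

ratio = estimate_dedupe_ratio([night1, night2, night3])
print(f"{ratio:.1f}:1")  # prints "2.4:1"
```

To try it on real data, replace the synthetic samples with the contents of files read in binary mode. Treat the number as a sanity check on vendor claims, not a substitute for testing the actual products.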
BIO: Jason Buffington is a senior analyst with Enterprise Strategy Group. He focuses primarily on data protection, Windows Server infrastructure, management and virtualization. He blogs at CentralizedBackup.com and tweets as @JBuff.