You can take your chances and believe the deduplication ratios and performance the vendors say you'll get, or you can do it right and test the systems yourself.
Some data deduplication vendors are lying to you. Although I already knew this, it became even more apparent from the feedback I received on my BackupCentral.com blog post on deduplication performance.
I received public and private comments from vendors and users alike that said, "How can you say vendor ABC does xxx MB/sec? We've never seen more than half that!" or "Have you asked that vendor for a reference? I doubt a single customer is using the configuration you listed in your article!" Suffice it to say that some of those numbers, while openly published by the vendors, are complete and utter fiction. Which ones, you ask? Given the confusion about published statistics and the lack of independent testing of these products, the only way you're going to know is to test the product yourself. The following explains what you should test, and what I believe is the best way to conduct those tests.
Target vs. source dedupe
There are two very different types of dedupe, and they require very different testing methodologies. With target dedupe, deduplication occurs inside an appliance that accepts "regular" backups, that is, backups from a traditional backup application. These appliances usually accept those backups via a virtual tape interface, NFS, CIFS or a proprietary API, such as the Open Storage (OST) API in Symantec Corp.'s Veritas NetBackup. Backups are received in their entirety and are deduped once they arrive. Target dedupe saves disk space on the target device, but does nothing to reduce the network load between backup client and server. This makes target dedupe more appropriate for environments where bandwidth utilization isn't the primary consideration, such as in a centralized data center.
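To make the target-side behavior concrete, here is a minimal Python sketch. It assumes fixed-size chunks and SHA-256 fingerprints purely for illustration; real appliances use their own (often variable-size) chunking, and the class and method names below are hypothetical rather than any vendor's API. The point it shows is that the appliance receives every byte of every backup, yet stores each distinct chunk only once.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; illustrative only


class TargetDedupeStore:
    """Toy stand-in for a target dedupe appliance (hypothetical names)."""

    def __init__(self):
        self.chunks = {}         # fingerprint -> chunk bytes, stored once
        self.bytes_received = 0  # everything the backup application sent

    def ingest(self, backup: bytes) -> list:
        """Receive a full backup stream and keep only chunks not seen before.

        Returns the list of fingerprints needed to rebuild this backup.
        """
        self.bytes_received += len(backup)
        recipe = []
        for start in range(0, len(backup), CHUNK_SIZE):
            chunk = backup[start:start + CHUNK_SIZE]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(fingerprint, chunk)  # no-op if already stored
            recipe.append(fingerprint)
        return recipe

    def bytes_stored(self) -> int:
        return sum(len(chunk) for chunk in self.chunks.values())

    def dedupe_ratio(self) -> float:
        """Bytes received divided by bytes actually kept on disk."""
        return self.bytes_received / self.bytes_stored()


store = TargetDedupeStore()
full_backup = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE
store.ingest(full_backup)  # first full backup
store.ingest(full_backup)  # identical second full backup
```

In this toy run, both full backups cross the network in their entirety, but the second adds nothing to disk. That is the disk saving without the bandwidth saving described above.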
Source deduplication products require custom backup software at the backup client and backup server. The client identifies a chunk of data it hasn't seen before, and then asks the server whether it has ever seen that chunk. If the server has already backed up the same chunk from that (or another) client, it tells the client not to send the chunk over the network and simply indicates that the chunk was found in another location. If the chunk is determined to be truly unique, the client sends it across the network and the server records where it came from. This makes source dedupe most appropriate for environments where bandwidth is the primary consideration, such as remote offices and mobile users.
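The client/server exchange just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular product works: fixed-size chunks, SHA-256 fingerprints, and an in-memory `BackupServer` class with hypothetical method names. It also ignores the small cost of sending the fingerprints themselves.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; illustrative only


class BackupServer:
    """Toy backup server that remembers every chunk it has seen (hypothetical)."""

    def __init__(self):
        self.chunks = {}     # fingerprint -> chunk bytes
        self.locations = {}  # fingerprint -> list of (client_id, offset)

    def has_chunk(self, fingerprint: str) -> bool:
        return fingerprint in self.chunks

    def add_reference(self, fingerprint: str, client_id: str, offset: int) -> None:
        self.locations.setdefault(fingerprint, []).append((client_id, offset))

    def store_chunk(self, fingerprint: str, chunk: bytes,
                    client_id: str, offset: int) -> None:
        self.chunks[fingerprint] = chunk
        self.add_reference(fingerprint, client_id, offset)


def backup(client_id: str, data: bytes, server: BackupServer) -> int:
    """Back up data from one client; return chunk bytes sent over the network."""
    bytes_sent = 0
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if server.has_chunk(fingerprint):
            # Server already holds this chunk: only a reference is recorded.
            server.add_reference(fingerprint, client_id, offset)
        else:
            # Truly new chunk: it must cross the network.
            server.store_chunk(fingerprint, chunk, client_id, offset)
            bytes_sent += len(chunk)
    return bytes_sent


server = BackupServer()
shared = b"A" * CHUNK_SIZE + b"B" * CHUNK_SIZE
sent_first = backup("laptop-1", shared, server)
sent_second = backup("laptop-2", shared + b"C" * CHUNK_SIZE, server)
```

Here the second client shares two of its three chunks with the first, so only the one new chunk is sent. When you test a source dedupe product, bytes on the wire is the number to measure.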
Two types of dedupe
Target deduplication. Data deduplication is done in an appliance that sits inline between the backup server and the backup target. The appliance receives the full backup stream and dedupes the data immediately.
Source deduplication. Backup software performs the deduplication on the backup client and the backup server before sending data to the backup target. This approach has less impact on the available bandwidth.
This was first published in May 2009