Put data dedupe to the test

You can take your chances and believe the deduplication ratios and performance the vendors say you'll get, or you can do it right and test the systems yourself.

Some data deduplication vendors are lying to you. Although I already knew this, it became immediately apparent from the feedback I received on my BackupCentral.com blog post about deduplication performance. All I did in that article was compile and interpret publicly available information, but some people felt I was validating those vendors' claims despite a disclaimer to the contrary.

I received public and private comments from vendors and users alike that said, "How can you say vendor ABC does xxx MB/sec? We've never seen more than half that!" or "Have you asked that vendor for a reference? I doubt a single customer is using the configuration you listed in your article!" Suffice it to say that some of those numbers, while openly published by the vendors, are complete and utter fiction. Which ones, you ask? Given the confusion about published statistics and the lack of independent testing of these products, the only way you're going to know is to test the product yourself. The following explains what you should test, and what I believe is the best way to conduct those tests.

Target vs. source dedupe

There are two very different types of dedupe, and they require very different testing methodologies. With target dedupe, deduplication occurs inside an appliance that accepts "regular" backups, that is, backups from a traditional backup application. These appliances usually accept those backups via a virtual tape interface, NFS, CIFS or a proprietary API, such as the OpenStorage (OST) API in Symantec Corp.'s Veritas NetBackup. Backups are received in their entirety and are deduped once they arrive. Target dedupe saves disk space on the target device, but does nothing to reduce the network load between backup client and server. This makes target dedupe more appropriate for environments where bandwidth utilization isn't the primary consideration, such as a centralized data center.

Source deduplication products require custom backup software at the backup client and backup server. The client identifies a chunk of data it hasn't seen before, then asks the server whether it has ever seen that chunk. If the server has already backed up the same chunk from that (or any other) client, it tells the client not to send the chunk over the network and simply notes that the chunk is already stored in another location. If the chunk is truly unique, the client sends it across the network and the server records where it came from. This makes source dedupe most appropriate for environments where bandwidth is the primary consideration, such as remote offices and mobile users.
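To make that client/server exchange concrete, here's a minimal sketch in Python of how a source dedupe client might negotiate chunks with its server using fingerprints. The chunking scheme and the server methods (has_chunk, store_chunk, add_reference) are made up for illustration; no vendor implements it exactly this way.

import hashlib

CHUNK_SIZE = 64 * 1024  # fixed-size chunking for simplicity; real products vary

def backup_file(path, server):
    """Send only the chunks the server has never seen before."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            fingerprint = hashlib.sha256(chunk).hexdigest()
            if server.has_chunk(fingerprint):        # hypothetical server API
                # Duplicate: record a reference only -- nothing crosses the wire.
                server.add_reference(path, offset, fingerprint)
            else:
                # Unique: ship the chunk and let the server index it.
                server.store_chunk(fingerprint, chunk)
                server.add_reference(path, offset, fingerprint)
            offset += len(chunk)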

Two types of dedupe

Target deduplication. Deduplication is done in an appliance that serves as the backup target behind the backup server. The appliance receives the full backup stream and dedupes the data either as it arrives or shortly afterward.

Source deduplication. Backup software performs the deduplication on the backup client and the backup server before sending data to the backup target. This approach has less impact on the available bandwidth.

Testing target dedupe

There are three things to verify when considering a target dedupe solution: cost, capacity and throughput. When considering the cost of deduplication systems (or any system, for that matter), remember to include both capital expenditures (CAPEX) and operational expenditures (OPEX). Look at what hardware and software you'll need to acquire to meet a given throughput and capacity requirement with a particular appliance. Some dedupe vendors make it very easy to arrive at a CAPEX number: you need to store 30 TB of data and you back up 5 TB/day, so you need model X, which includes all the computing and storage capacity required to meet those requirements. Other vendors provide just a gateway that you connect to your own storage. Finally, some vendors provide only the software, leaving the purchase of all hardware up to you; remember to include the cost of the server hardware in that configuration, and make sure you're specifying a server configuration the vendor has approved. In both the gateway and software-only pricing models, include the cost of the disk in your comparison even if it's "free." The dedupe pricing world is unusual enough that there are scenarios where you can actually save money by not using disk you already have.

One final cost element: Remember to add in (if necessary) any "extra" disks, such as a "landing zone" (found in post-process systems), a "cache" where data is kept in its original format for faster restores or any disks not used to store deduplicated data. All of those disks should be considered in the total cost of purchasing the system.
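As a simple illustration of tallying CAPEX across the three pricing models, a back-of-the-envelope comparison might look like the sketch below. Every line item and price is hypothetical; the point is only that gateway and software-only quotes aren't comparable until the disk, server and "extra" disk costs are added in.

# Hypothetical CAPEX comparison; all figures are made up for illustration.
appliance_model = {"appliance": 250_000}                 # all-in-one: compute and disk included
gateway_model   = {"gateway": 120_000, "disk": 90_000}   # bring your own storage
software_model  = {"software": 60_000,                   # bring your own server and storage
                   "approved_server": 25_000,
                   "disk": 90_000,
                   "landing_zone_disk": 15_000}          # post-process "extra" disk counts too

for name, items in (("appliance", appliance_model),
                    ("gateway", gateway_model),
                    ("software-only", software_model)):
    print(f"{name:14s} CAPEX: ${sum(items.values()):,}")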

You then need to consider OPEX. As you're evaluating each vendor, make note of how you'll need to maintain their systems and how the systems will work with your backup software vendor. Is there a custom interface between the two (e.g., Veritas NetBackup's OST API), or will your system just pretend to be a tape library or a file system? How will that affect your OPEX? What's it like to replace disk drives, disk arrays or systems that are part of this system? How will global dedupe, or the lack of it, affect your ability to scale the product to meet your needs?

There are two ways to test capacity. The first is to send a significant amount of backups to the device and compare the size of those backups with the amount of storage they consume on the target system; that comparison gives you your dedupe ratio. Multiply that ratio by the disk capacity used to store deduped data and you'll get your effective capacity. The second method is to send backups to the device until it fills up and then record how many backups were sent. The latter method takes longer, but it's the only way to know how the system will perform long term. (The performance of some systems decreases as they near capacity.)
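For the first method, the arithmetic is simple; the short Python sketch below shows how the ratio and effective capacity fall out. The numbers are placeholders -- substitute the figures from your own test.

# Placeholder numbers -- substitute the figures from your own test.
backups_sent_tb  = 120.0   # logical size of all backups copied to the device
disk_consumed_tb = 8.5     # physical space those backups occupy after dedupe
usable_dedupe_tb = 30.0    # raw capacity available for storing deduped data

dedupe_ratio = backups_sent_tb / disk_consumed_tb
effective_capacity_tb = dedupe_ratio * usable_dedupe_tb

print(f"Dedupe ratio:       {dedupe_ratio:.1f}:1")
print(f"Effective capacity: {effective_capacity_tb:.0f} TB")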

Finally, there are several things you should test for performance; a short timing sketch for recording the numbers follows this list.

Ingest/Write. The first measure of a disk system (dedupe or not) is its ability to ingest (i.e., write) backups. (While restore performance is technically more important, you can't restore what you didn't back up.) Remember to test both aggregate and single-stream backup performance.

Restore/Copy/Read speed. The second measure of a disk system (dedupe or not) is its ability to restore or copy (i.e., read) backups. I like to point out that the whole reason we started doing disk-to-disk-to-tape (D2D2T) backups was to use disk as a buffer to tape; therefore, if a disk system (dedupe or not) isn't able to stream a modern tape drive when copying backups to tape, then it misses the point. Remember to test the tape copy where you plan to do the tape copy; for example, if you plan to replicate to another system and make the tape there, test that. Finally, don't assume that restore speeds will be fine, and remember to test both single-stream and aggregate restore performance.

Deduplication. Once the data arrives at the device in its native format, it must be deduped. Inline boxes dedupe the data the second it arrives; the original data never hits disk, so an inline vendor's dedupe speed is the same as its ingest speed. Post-process vendors can take anywhere from seconds to hours to dedupe data, so you'll have to investigate how long the dedupe process actually takes.

Replication. Your dedupe ratio comes into play with replication as well: the better your dedupe ratio, the fewer blocks have to be replicated. But the only way to know for sure how replication will work is to actually do the replication. Observe how many blocks of data are replicated and note when the replication starts and stops. You may be able to get these numbers from the dedupe system itself, but to verify them you may need a network monitoring tool. Remember that not all vendors start replicating at the same time. Of course, nothing can be replicated until it's deduped, but don't assume that an inline vendor will replicate backups immediately after they're deduped; many vendors will wait until a given tape is no longer being used or until a file is closed (in the case of NAS).
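To put rough numbers on the ingest and restore items above, it's enough to timestamp each copy job and divide the bytes moved by the elapsed time. Here's a minimal Python sketch under the assumption that you drive the copies with your own scripts; the script names and sizes below are placeholders, not real commands.

import subprocess, time

def timed_copy(cmds, total_bytes):
    """Run one or more copy jobs in parallel and report aggregate MB/sec.
    cmds is a list of shell commands (your own copy scripts); total_bytes is
    the combined logical size of the backup sets being copied."""
    start = time.time()
    procs = [subprocess.Popen(c, shell=True) for c in cmds]
    for p in procs:
        p.wait()
    elapsed = time.time() - start
    mb_per_sec = total_bytes / (1024 * 1024) / elapsed
    print(f"{len(cmds)} stream(s): {elapsed:.0f} s, {mb_per_sec:.0f} MB/sec aggregate")
    return mb_per_sec

# Single-stream ingest test (script name and size are placeholders).
timed_copy(["copy_set_1_to_dedupe.sh"], total_bytes=500 * 1024**3)

# Aggregate ingest test: one stream per tape drive; reuse the same function
# with copies *from* the dedupe device to measure restore/copy speed.
timed_copy([f"copy_set_{n}_to_dedupe.sh" for n in range(1, 5)],
           total_bytes=4 * 500 * 1024**3)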

Test with production data, but…

You must test target dedupe systems by storing your actual production backups on them. However, don't test your dedupe system by backing up production systems directly to it. Vendors would love for you to test that way, as it's hard to give the system back when you're using it to store real backups needed for real restores. It's always a bad idea to use a test system in production.

So how do you test dedupe systems with production data without backing up production systems to them? It's simple. Copy your production backups from tape or disk to the dedupe system. When testing restore/copy speed, copy backups from the deduped device to disk or tape because the "reconstitution" process the dedupe system has to go through for a copy is exactly the same as what it does for a restore.

Determine how long you plan to store your backups in the dedupe system. In my opinion, if you plan to store 90 days of backups in your dedupe system, that's how many days of backups you should store in your test system. (It won't take 90 days to store 90 days' worth of backups.)

If you plan on testing 90 days of backups, pick a 90-day period that starts with a full backup (or an IBM Corp. Tivoli Storage Manager backup set). If you're testing multiple dedupe systems, make sure to use the same set of backups with each test (ceteris paribus -- with all other factors remaining the same). Copy the oldest full backup (or backup set) first, followed by the backups that are 89 days old, then 88 days old and so forth, until you've worked your way up to the most recent day.

Each simulated "backup day" should include a single backup (i.e., one backup set copied until it's complete), simultaneous backups (as many simultaneous copies as you have tape drives), deduplication and replication. If possible, the simultaneous backups should supply enough streams to reach the system's maximum throughput. Once all of those activities have completed, the next day's "backups" can continue by copying the next set of backup tapes into the system.
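Here's a rough Python sketch of what a scripted "backup day" could look like. The copy and status-polling scripts named here are placeholders for whatever your backup software and dedupe device actually provide.

import subprocess, time

def run(cmd):
    """Run a shell command and log how long it took."""
    start = time.time()
    subprocess.run(cmd, shell=True, check=True)
    print(f"{cmd!r} finished in {time.time() - start:.0f} s")

def simulated_backup_day(day, tape_drives=4):
    # 1. One backup set copied by itself, start to finish.
    run(f"copy_backup_set.sh --day {day} --set 0")

    # 2. Simultaneous copies, one per tape drive, to push aggregate throughput.
    procs = [subprocess.Popen(f"copy_backup_set.sh --day {day} --set {n}", shell=True)
             for n in range(1, tape_drives + 1)]
    for p in procs:
        p.wait()

    # 3. Wait for the device to report that deduplication and replication are
    #    finished before the next "day" begins (the polling script is hypothetical).
    while subprocess.run("dedupe_and_replication_idle.sh", shell=True).returncode != 0:
        time.sleep(300)

for day in range(90, 0, -1):   # oldest backups first, working up to the present
    simulated_backup_day(day)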

Dedupe testing tips

Use real data: To get accurate results, all testing should be done using copies of the data you actually back up.

Try restores, too: It's not enough to just test backup performance; you should also test dedupe products by doing typical restores.

Recreate replication: It's likely you'll also replicate backup data to a disaster recovery site or vaulting facility, so you should test how well -- and how quickly -- a dedupe product handles replication.

Tally costs correctly: The actual price of the dedupe product might not reveal the total bill for the solution. Be sure to include the cost of any new disks or disk systems, as well as software upgrades that may be required to implement the product correctly.

The beginning of each simulated "backup week," including the first one, should include a number of simulated restore tests. The best way to test restore speed is to actually copy a predictably sized backup set from the dedupe system to tape. You should do two single restores by themselves (i.e., one backup set copied from the dedupe system to tape until it's complete), followed by two sets of simultaneous restores (as many simultaneous copies from the dedupe system to tape as you have tape drives). The reason you should copy two sets in each test is that you want to copy from the oldest and newest backups in each test cycle. What you're looking for with these tests is a difference in restore time from older backups in relation to newer backups, and from backups when the system is relatively empty to when the system is relatively full.

One key to doing this right is automation. It will allow you to test around the clock and provides the best way of documenting the timing of all activities. Automation is also the key to ceteris paribus, which is absolutely essential when testing multiple systems. If possible, use a completely separate backup server and tape library; that will isolate the test from production backup traffic, both for the sake of the production backups and to ensure they don't skew the test results.
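If you script the weekly restore tests as well, something like the following sketch (again, the copy command is a placeholder) makes it easy to document the oldest-versus-newest comparison over the life of the test.

import csv, subprocess, time

def timed_restore(label, cmd, logfile="restore_times.csv"):
    """Copy one backup set from the dedupe device to tape and log the elapsed time."""
    start = time.time()
    subprocess.run(cmd, shell=True, check=True)
    elapsed = time.time() - start
    with open(logfile, "a", newline="") as f:
        csv.writer(f).writerow([time.ctime(start), label, f"{elapsed:.0f}"])
    return elapsed

# Single restores: one from the oldest backups on the device, one from the newest.
timed_restore("oldest-single", "copy_to_tape.sh --set oldest")
timed_restore("newest-single", "copy_to_tape.sh --set newest")
# Repeat with simultaneous copies (one per tape drive) for the aggregate numbers.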

Testing source dedupe

In most cases, source dedupe is being considered as a replacement for backup software that's already "doing the job" via backups to an inexpensive tape device. While that configuration comes with a lot of drawbacks that source dedupe aims to fix, the fact that you're replacing an existing product places a greater burden of proof on the source dedupe product.

Basic backup functionality. You'll be using this product to perform all backups and restores for supported clients. Make sure you try everything with this product that you currently do with your backup system. Schedule automated backups and see how it reports their success. If any of them fail (and you should force some of them to fail), what happens next? What's it like to rerun failed backups? What are restores like? Can the administrator do them or can users do them? Use the same workflows you're accustomed to using and see if they can be adapted to this new product.

Advanced backup functionality. Do you plan to replicate these backups to a second location? Once you've replicated all backups to a centralized location, do you plan to copy some or all of them to tape? How does that work?

Performance. What kind of backup performance do you get? How fast is the replication? How much data is sent across the wire? (Don't assume that two different deduplication products will send the same amount of data over the wire.) If you're planning on replicating across long distances, how do latency and an unreliable connection affect the overall performance and stability of the product?

As with target dedupe, there's no substitute for real data during testing. But unlike target dedupe, it can be very difficult to reliably test these products without backing up the types of systems you actually plan to back up. You can back up test systems, but the test is only valid if you can simulate user activity, such as email flowing into the Exchange database, and new and updated files on the file system. Without those changes, your source dedupe system will appear to perform very well, but the test will offer no insight into how it's going to perform in the real world.

Most people can't simulate real user activity on a large test environment, so their only alternative is to back up real systems. Once you've verified in a test environment that the software in question can run on the types of systems you'll be testing, you need to begin a proof of concept on "real" systems that will represent the types of systems you'll be backing up. To minimize risk, it's best to start with systems that aren't currently being backed up and don't have a mission-critical uptime requirement, such as laptops. Select a few users to pilot the software, make sure they're aware it's a pilot and ask them to report their experiences to you. Once you've logged a little time with those types of systems, you can expand to file servers at a remote site, followed by application servers (such as Exchange). Just remember that each time you start backing up a new type of system, you risk negatively impacting the stability or performance of that system -- so you must watch for any instabilities during each test.

Any systems already being backed up in the proof-of-concept test should continue being backed up via the previous method until you're in production with the new system. If they're Windows systems, you must verify that the two programs won't interfere with each other via the Windows archive bit. The worst-case scenario is if both products use and reset it: new or modified files would get backed up by whichever product runs next and then skipped by the other. You must verify how the archive bit will affect two products running in parallel.
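One way to spot-check archive bit behavior on a Windows test client is to look at the attribute on a handful of modified files before and after each product's backup runs. A small sketch, assuming Python 3.5 or later on Windows; the file paths are placeholders.

import os, stat

def archive_bit_set(path):
    """Return True if the Windows archive bit is set on the file."""
    attrs = os.stat(path).st_file_attributes
    return bool(attrs & stat.FILE_ATTRIBUTE_ARCHIVE)

# Modify these test files, note the bit, run product A's backup, check again,
# then run product B's backup and check a third time.
test_files = [r"C:\dedupe-test\doc1.txt", r"C:\dedupe-test\doc2.txt"]  # placeholders
for path in test_files:
    print(path, "archive bit set" if archive_bit_set(path) else "archive bit clear")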

Make sure to get answers for all of these questions. Also simulate all of the things that are likely to happen, such as a laptop user suspending their laptop in the middle of a backup, an Ethernet cable being unplugged or an Internet connection timing out. The hardest question to answer may be how many bytes are actually sent across the network, so you may need third-party network monitoring software to get a verifiable number.
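If you need a rough but verifiable byte count rather than a full packet capture, interface counters are often enough. Here's a minimal sketch using the third-party psutil library, under the assumption that the backup is the only significant traffic on that interface during the measurement window; the interface name and backup command are placeholders.

import time
import psutil   # third-party: pip install psutil

def bytes_sent_during(run_backup, interface="eth0"):
    """Return the bytes sent on one interface while run_backup() executes."""
    before = psutil.net_io_counters(pernic=True)[interface].bytes_sent
    run_backup()          # kick off the source dedupe backup however you normally would
    time.sleep(5)         # brief settle time for the counters
    after = psutil.net_io_counters(pernic=True)[interface].bytes_sent
    return after - before

# Example (placeholder command):
# sent = bytes_sent_during(lambda: subprocess.run("start_backup.sh", shell=True))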

Nothing proposed here is easy. However, the potential risks of buying dedupe systems without proper testing are simply too great to consider skipping testing. With some dedupe vendors possibly exaggerating their products' prowess, testing is the only way to separate truth from fiction -- and probably save some money in the process.

BIO: W. Curtis Preston is an executive editor at SearchStorage.com and an independent backup expert.

This was first published in May 2009
