Catching up with deduplication

Deduplication backup products differ in how they recognize and reduce duplicate data. Because vendors implement deduplication differently, the fear, uncertainty and doubt surrounding deduplication products has increased, with questions about when to deploy what product. Here's what you need to know to pick the product that will best fit into your environment.

Backup has seen the future and it's disk. As the backup target gradually, but persistently, changes from tape to disk, data deduplication is becoming a key component of the backup process. Because vendors implement deduplication differently, the fear, uncertainty and doubt surrounding deduplication products has increased as have the questions about when to deploy what product.

Deduplication resides in the backup process in two primary places: backup software and disk libraries. Asigra Inc.'s Televaulting, EMC Corp.'s Avamar and Symantec Corp.'s Veritas NetBackup PureDisk are backup software products that deduplicate data at the host level, minimizing the amount of data that needs to be sent over corporate networks to backup targets or replicated to disaster recovery sites. Disk libraries from Data Domain Inc., Diligent Technologies Corp., Quantum Corp. and Sepaton Inc. deduplicate data at the target, which allows companies to deploy disk libraries without disrupting current backup processes.

With the underlying deduplication algorithms essentially the same across both sets of products, the real issues are how each product implementation impacts performance, and data management in the short and long term. Neither approach is yet ideal for all backup requirements, so a crossover period is emerging in which some storage managers will likely use backup software and disk library methods for specific needs. Hidden issues like undeduplicating data to store it on tape, integration with enterprise backup software products, and the ability to selectively turn off deduplication to accommodate specific compliance requirements and preexisting encryption conditions should be evaluated closely to determine whether those issues outweigh the benefits of deduplication.

Data reduction and compression algorithms

Backup software and disk library products deduplicate data in similar ways, with most using a combination of data-reduction and compression algorithms. Both types of deduplication approaches initially identify whether chunks of data or files are the same by first performing a file-level compare or using a hashing algorithm such as MD5 or SHA-1. Unique files or data chunks are preserved, while duplicate files or data chunks may be optionally rechecked. This recheck is done using a bit-level comparison or secondary hash to ensure the data is truly a duplicate and not a rare hash collision. This first stage in the deduplication process typically reduces data stores by factors of approximately 10 or more over time.
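As a rough illustration of that first stage, the Python sketch below fingerprints each chunk with SHA-1, stores unique chunks once and rechecks apparent duplicates with a byte-level compare. The function names, the in-memory dictionary and the 4KB chunk size are assumptions made for the example; they aren't taken from any vendor's product.

```python
import hashlib

def dedupe(chunks, verify=True):
    """Store each unique chunk once; return (store, recipe)."""
    store = {}    # fingerprint -> chunk bytes (hypothetical in-memory data store)
    recipe = []   # fingerprints in original order, enough to rebuild the stream
    for chunk in chunks:
        fp = hashlib.sha1(chunk).hexdigest()
        if fp in store:
            # Optional recheck: a byte-level compare guards against a rare hash collision.
            if verify and store[fp] != chunk:
                raise ValueError("hash collision on fingerprint " + fp)
        else:
            store[fp] = chunk
        recipe.append(fp)
    return store, recipe

def restore(store, recipe):
    """Rebuild the original stream from the stored chunks and the recipe."""
    return b"".join(store[fp] for fp in recipe)

# Four 4KB chunks, three of them identical: only two are actually stored.
chunks = [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096]
store, recipe = dedupe(chunks)
```

The recipe of fingerprints is what lets a restore reassemble the original data from the reduced store.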

Achieving data-reduction factors of 20 times or greater requires the product to compress the unique deduplicated files or data chunks. To accomplish this, vendors use a lossless data compression algorithm, such as Huffman coding or Lempel-Ziv coding, which executes against the unique file or deduplicated data chunk. Compression squeezes out items like leading zeros or spaces to reduce the data to its smallest possible footprint before it's stored.
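The compression stage can be sketched with Python's standard zlib module, whose DEFLATE format combines LZ77 (a Lempel-Ziv variant) with Huffman coding. This is only a stand-in for whatever algorithm a given vendor uses, and the store contents and helper names are invented for the example.

```python
import zlib

def compress_unique(store, level=6):
    """Losslessly compress every unique chunk left after deduplication."""
    return {fp: zlib.compress(data, level) for fp, data in store.items()}

def reduction_factor(original_bytes, stored_bytes):
    """How many times smaller the stored form is than the original."""
    return original_bytes / stored_bytes

# Hypothetical unique chunks: a run of zeros and some repetitive text.
store = {"fp1": b"\x00" * 8192, "fp2": b"hello world " * 512}
packed = compress_unique(store)
raw = sum(len(v) for v in store.values())
small = sum(len(v) for v in packed.values())
```

Highly repetitive chunks like these shrink dramatically; chunks that are already compressed or encrypted would barely shrink at all.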

However, using deduplication at the source or target introduces performance and management issues. Backup software-based deduplication products introduce a heavy initial processing toll on the host. In addition, users should carefully examine how swapping current backup software with a deduplication backup product, or running two backup software products concurrently, will affect server and application performance, as well as their stability.

Conversely, deduplicating data on a disk library may require users to deploy multiple disk libraries to handle the performance overhead created during peak backup periods. This creates more management overhead because each disk library creates its own unique deduplicated data store; administrators must also manage and direct backup jobs to multiple physical disk libraries as opposed to just one logical one. Which backup software, disk library or combination of the two to select, and under what circumstances, depends on how each product handles these potential bottlenecks.

Breaking the bottlenecks

Asigra Televaulting attempts to break the management bottleneck by taking an agentless approach that expedites deployments while minimizing user involvement. Users initially install the Asigra Televaulting gateway software on a Windows or Linux server. The Televaulting backup software accesses client files over the internal network using CIFS, NFS or SSH (SSH allows for security but is slower) and reads the files. As it reads each file, the Asigra Televaulting server performs a hash on the file. If the file is determined to be unique, the file is chunked up with its unique blocks stored while redundant blocks are indexed and thrown away.

All hash processing takes place on the Asigra Televaulting server, which maintains a database of all of the unique file blocks on the different servers it's assigned to protect. Once the initial backup and index is done, subsequent server backups execute faster because they can use this common repository of unique blocks created from the first server's backup.

This approach still doesn't completely eliminate the performance toll of deduplication. By running the deduplication on a central server, the Televaulting software transfers the performance overhead from the client servers to the Televaulting server. Multiple servers with unusually large daily data change rates (more than 10%) or large numbers of servers (100 or more) needing to run backups at the same time could impact backup times and force the deployment of more Asigra Televaulting servers to manage the overhead.

Click here for a chart showing deduplicating backup software (PDF)

EMC Avamar and Symantec Veritas NetBackup PureDisk take a slightly different approach to address the performance issue. They use agents that utilize computing resources on each client server to do the initial file hash. As part of this process, the agents communicate with the main backup server, which maintains a central database of the unique file hashes. As the Avamar or PureDisk agents on the servers hash the files, they check with the central server to see if the generated hash already exists. If the hash exists, the agent ignores the file; if it doesn't exist, it breaks the file into smaller segments and looks for new unique file segments to store. From that point, EMC Avamar and PureDisk deviate in their product implementation.
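The agent-and-server exchange described above can be sketched in a few lines of Python. The CentralIndex class, the backup_file function and the fixed 4KB segment size are hypothetical names and values chosen for illustration; neither vendor's actual protocol or segment size is shown here.

```python
import hashlib

def sha1(data):
    return hashlib.sha1(data).hexdigest()

class CentralIndex:
    """Hypothetical stand-in for the backup server's database of known hashes."""
    def __init__(self):
        self.file_hashes = set()
        self.segments = {}   # segment hash -> segment bytes

def backup_file(data, index, segment_size=4096):
    """Run on the client agent; return the number of bytes sent to the server."""
    file_hash = sha1(data)
    if file_hash in index.file_hashes:
        return 0                          # whole file already known: ignore it
    sent = 0
    for off in range(0, len(data), segment_size):
        seg = data[off:off + segment_size]
        seg_hash = sha1(seg)
        if seg_hash not in index.segments:
            index.segments[seg_hash] = seg    # only new, unique segments travel
            sent += len(seg)
    index.file_hashes.add(file_hash)
    return sent
```

In this sketch, backing up the same file twice sends nothing the second time, and a file that shares one of its two segments with an earlier file sends only the new segment.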

EMC Avamar allows server storage capacity to grow to approximately 1.5TB in size. Although Symantec Veritas NetBackup PureDisk servers can grow to manage nearly 4TB of PureDisk storage capacity, EMC Avamar uses segment sizes that are about one-fourth the size of PureDisk's. This allows it to better identify redundant data in files, asserts Jed Yueh, EMC Avamar's VP of product management. If users should need to grow in capacity and scale, EMC Avamar uses a redundant array of independent nodes (RAIN) clustering architecture. This allows organizations to add more server nodes into the RAIN cluster to increase server capacity and performance by striping the data across multiple nodes.

In a PureDisk environment, a single server can manage 4TB of PureDisk storage and up to 100 million files, which equates, according to Symantec, to a little more than 80TB of source data. Additional servers can be added to expand PureDisk's storage capacity or to handle larger numbers of files.

PureDisk manages file metadata outside of the file system using MetaBase Server and MetaBase Engines. As an environment grows, a storage manager uses PureDisk to add new instances of MetaBase Engines; because the MetaBase Server controls communication to all MetaBase Engines, expanding the deduplication environment is a relatively simple process. This separation of the file metadata from the file system allows PureDisk to improve search- and maintenance-related activities on the underlying storage system, grow to hundreds of terabytes and billions of files, and retain a single logical instance of deduplicated data across the enterprise.

Click here for a chart showing deduplicating disk libraries (PDF)

Early adopters

Early adopters of EMC Avamar and Symantec Veritas NetBackup PureDisk report minimal issues with installing backup software agents or server performance hits, but there are some specific circumstances that they monitor more carefully: the initial round of backups and the age of the server on which agents are deployed.

Jim Rose, manager of systems administration with the State of Indiana's Office of Technology, recently installed PureDisk at branch state offices as part of a mandate by Indiana Governor Mitch Daniels to centralize certain IT functions. At each of the 80 offices he manages, Rose installed a Microsoft Windows server with PureDisk software, as well as PureDisk agents on each of the servers targeted for backup. Rose found the backup of the initial server took between 24 and 36 hours, while the second backup took about half that time; by the second or third day, backup windows across all of these servers were almost back to normal, he says.

"Symantec PureDisk backed up new servers without any discernible performance hit," says Rose. "[But] servers older than three years took longer to complete the initial scan."

Michael Fair, network administrator, information technology division at St. Peter's Health Care Services in Albany, NY, finds that the performance and management overhead associated with EMC's Avamar is almost nothing vs. what he encountered when backing up his servers with CA BrightStor and Symantec Backup Exec. "I eliminated domain controllers in eight sites and can now run backups during the day if the need arises with no discernible impact to server applications," says Fair.

Introducing PureDisk allowed the State of Indiana's Rose to back up 300 servers across 80 sites in six hours, and he now has a demonstrable, working recovery plan for those sites. However, as the individual responsible for both remote offices and enterprise data centers, he recognizes the limitations of backup software deduplication. Taking 24 hours to 36 hours to complete an initial backup, coupled with high change rates on central databases, precludes Rose from deploying PureDisk in his core data center environment. For these more mission-critical servers, he looks to disk libraries to keep processing off the hosts.

Inline disk libraries

Disk libraries perform deduplication in two general ways: inline and postprocessing. With inline processing, the disk library processes backup streams and deduplicates the data as it enters the disk library. Inline disk libraries use three general deduplication methods to minimize the performance impact: hash-based, inline compare and grid architecture.

Data Domain's DDX disk library uses a hash-based technique. DDX takes an 8KB slice of the incoming backup data and computes a hash or fingerprint value. If the fingerprint value is unique, it stores the data; if the fingerprint already exists in the index, it keeps only a reference to the copy already stored. The main issues with this approach are the performance requirements to compute the hash and keeping the hash index in memory; as the hash index grows, it spills over from memory onto disk. To mitigate the performance overhead associated with retrieving the index from the disk, Data Domain developed a technique called stream-informed segment layout (SISL) that minimizes seeks to disk so the performance is CPU-centric; the faster the disk library CPU, the better the performance.
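A quick back-of-the-envelope calculation shows why a hash index outgrows memory. The sketch below assumes a 20-byte fingerprint (the size of a SHA-1 digest) for every 8KB slice and no other per-entry overhead; Data Domain's actual fingerprint size and index layout aren't described here, so treat the figures as illustrative only.

```python
SLICE_SIZE = 8 * 1024      # 8KB slices, as described above
FINGERPRINT_BYTES = 20     # assumption: a SHA-1-sized fingerprint per slice

def index_size_bytes(unique_data_bytes, entry_overhead=0):
    """Estimate the size of a fingerprint index covering the given unique data."""
    slices = -(-unique_data_bytes // SLICE_SIZE)   # ceiling division
    return slices * (FINGERPRINT_BYTES + entry_overhead)

TIB = 1024 ** 4
GIB = 1024 ** 3

# Index needed to cover 10 TiB of unique data:
print(index_size_bytes(10 * TIB) / GIB)   # prints 25.0
```

Under these assumptions, every 10 TiB of unique data adds about 25 GiB of fingerprints, which is why the index eventually spills to disk.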

Diligent Technologies' inline ProtecTier Data Protection Platform attempts to avoid the performance penalty required by hash lookups by doing a computational compare. Using its proprietary HyperFactor technology, it avoids opening the backup data stream to examine content and instead scans and indexes the data stream, looking for data that's similar to data already stored.

When the ProtecTier Data Protection Platform finds data it considers similar to data already stored in its index, it does a byte-level compare of the two sets of data; where the data matches, it discards the duplicate and stores a reference to the existing copy. Diligent claims this compare-and-compute technique allows its ProtecTier Data Protection Platform to scale to manage hundreds of terabytes. However, this technique still requires some processing power on the part of the disk library to do the computational compare and to compress the data after it has been deduplicated.

NEC Corp. of America's Hydrastor also uses an inline approach, but it employs two different techniques to offset the performance overhead. In the first phase, Hydrastor deduplicates larger, variable-sized chunks of data to eliminate large pieces of redundant data. In the second phase, Hydrastor analyzes smaller, variable-sized chunks of data. In both cases, unique data is compressed.
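To show what a two-pass, variable-sized chunking scheme can look like in principle, here is a Python sketch that cuts data at content-defined boundaries using a simple gear-style rolling hash, first into large chunks and then, for large chunks not seen before, into small ones. It is not NEC's algorithm; the hash, the chunk-size parameters and every name below are assumptions for illustration.

```python
import hashlib
import random

_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]   # fixed per-byte random values

def chunk(data, avg_bits, min_size, max_size):
    """Split data at content-defined boundaries; more avg_bits means larger chunks."""
    mask = ((1 << avg_bits) - 1) << (32 - avg_bits)   # test the top avg_bits bits
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF      # rolling hash of recent bytes
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])                   # trailing partial chunk
    return chunks

def two_phase_dedupe(data, store):
    """Return a recipe of small-chunk fingerprints; add unseen chunks to store."""
    recipe = []
    for big in chunk(data, avg_bits=12, min_size=1024, max_size=16384):
        big_fp = hashlib.sha1(big).hexdigest()
        if big_fp in store["big"]:
            recipe.extend(store["big"][big_fp])       # phase 1: whole large chunk known
            continue
        small_fps = []
        for small in chunk(big, avg_bits=8, min_size=64, max_size=1024):
            fp = hashlib.sha1(small).hexdigest()      # phase 2: finer-grained pass
            store["small"].setdefault(fp, small)
            small_fps.append(fp)
        store["big"][big_fp] = small_fps
        recipe.extend(small_fps)
    return recipe

def restore(recipe, store):
    return b"".join(store["small"][fp] for fp in recipe)
```

Because boundaries depend on the content rather than on fixed offsets, inserting a few bytes near the start of a file tends to disturb only the nearby chunks instead of shifting every chunk that follows.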

To compensate for the performance overhead this multiphased approach creates, Hydrastor uses a grid architecture. This allows users to add additional nodes to the cluster at any time, which are designed to deliver additional performance or capacity. Unlike some other disk libraries, Hydrastor doesn't offer an option to present itself as a virtual tape library. Rather, it presents itself to hosts as a NAS filer using standard NFS and CIFS interfaces and creates one large storage pool on the back end. The Hydrastor architecture may present a problem for those enterprises that need to allocate and reserve certain amounts of storage for specific departments or business units.

Postprocessing disk libraries

With postprocessing, the disk library stores the data in its native format before deduplicating it, which allows the disk library to dedupe the data during nonpeak backup times. Vendors implement postprocessing in a variety of ways.

For example, Quantum's DXi-Series deduplicates data after it's stored, but initiates the deduplication process without waiting for the entire backup job to finish. By starting deduplication and then compressing the data while the backup is still running, it overcomes one of the principal downsides of postprocessing--the requirement for sufficient capacity to house the native backups. However, deduplication requires use of the DXi-Series' cache and processor, which can potentially slow the backup process because the backup job may need to write the data directly to slower responding disk instead of storing it in the DXi-Series' cache.

To avoid that scenario, ExaGrid Systems Inc.'s ExaGrid and Sepaton's S2100-ES2 execute only on backup sets that have completed, so deduplication doesn't impact backup and restore performance. On the first analysis of backed up data, ExaGrid and S2100-ES2 only compress the data and don't deduplicate. When a second backup completes, ExaGrid does byte-level delta differencing while Sepaton uses its ContentAware software to compare objects in the first backup at byte level against similar objects in the second one. Like objects in the first backup are then deleted and replaced with pointers to objects in the second backup, with objects in the second backup then compressed but not deduplicated. This deduplication and compression process repeats as backups occur.
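The pattern described above, in which older backups are replaced by pointers while the newest backup stays whole and compressed, can be sketched as follows. This simplified Python model only matches whole objects that are byte-for-byte identical and share a name; real products work on differences within objects, and the class and method names here are invented for the example.

```python
import zlib

class PostProcessStore:
    """Toy postprocessing store: the newest backup is always kept in full."""
    def __init__(self):
        # Each backup maps name -> ("data", compressed bytes) or ("ref", backup index).
        self.backups = []

    def add_backup(self, objects):
        """Run after a backup job has completed; objects maps name -> bytes."""
        newest = {name: ("data", zlib.compress(data)) for name, data in objects.items()}
        if self.backups:
            prev_index = len(self.backups) - 1
            prev = self.backups[prev_index]
            for name, (_, compressed) in list(prev.items()):
                # Byte-level compare of like objects in the two backups.
                if name in objects and zlib.decompress(compressed) == objects[name]:
                    prev[name] = ("ref", prev_index + 1)   # point at the newer copy
        self.backups.append(newest)

    def read(self, backup_index, name):
        """Follow pointers forward until a stored copy is found."""
        kind, value = self.backups[backup_index][name]
        while kind == "ref":
            backup_index = value
            kind, value = self.backups[backup_index][name]
        return zlib.decompress(value)
```

Since the most recent backup is never replaced by pointers, restoring it (or copying it to tape) requires only decompression, not reassembly.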

The difference between the two determines what size environments they best fit. You can't add more controllers to ExaGrid to allow it to deduplicate the large amounts of data that enterprise backups generate. Sepaton uses a grid architecture in S2100-ES2 so additional controllers for more processing and capacity can be added as deduplication requirements grow.

How to estimate your deduplication ratio

The actual deduplication ratio--what you should expect to get and how soon you can get it--will vary according to many factors, some of which are within a user's control. Here are a few variables that can help you estimate the deduplication ratio you can reasonably expect to achieve.

Redundant data. The more redundant data you have on your servers, the higher the deduplication ratios you can expect to achieve. If you have primarily Windows servers with similar files and/or databases, you can reasonably expect to achieve higher ratios of deduplication. If your servers run multiple operating systems and different files and databases, expect lower deduplication ratios.

Rate of data change. Deduplication ratios are related to the number of changes occurring to the data. Each percentage increase in data change drops the ratio; the commonly cited 20:1 ratio is based on average data change rates of approximately 5%.

Precompressed data. Data compression is a key component in every vendor's data-reduction algorithm. Vendors base their advertised data-reduction ratios on the premise that compression will reduce already deduplicated data by a factor of 2:1. In a case where data deduplication achieves 15 times, compression could take that ratio up as high as 30:1. However, users with large amounts of data stored in compressed formats such as JPEG, MPEG or ZIP aren't likely to realize the extra bump compression provides.

Data-retention period. The length of time data is retained affects the data-reduction rate. For example, to achieve a data-reduction ratio of 10 times to 30 times, you may need to retain and deduplicate a single data set over a period of 20 weeks. If you don't have the capacity to store data for that long, the data-reduction rate will be lower.

Frequency of full backups. Full backups give deduplication software a more comprehensive and granular view into the backup. The more frequently full backups occur, the higher the level of deduplication you'll achieve. Deduplicating backup software products have a slight edge over disk libraries because they run a full server scan every time they execute a server backup, even though they only back up changes to existing files or new files. In between full backups, disk libraries usually only receive the changes sent as part of the backup software's daily incrementals or differentials.
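To see how these variables interact, consider a deliberately simple model, sketched in Python below. It assumes one full backup per week of a data set whose size stays constant, treats the change rate as the fraction of that data set that is new each week, and applies a flat 2:1 compression factor on top. The model and the function name are assumptions made for illustration, not a vendor formula, and real results will differ.

```python
def dedupe_ratio(weeks_retained, change_rate, compression=2.0):
    """Estimated data-reduction ratio under the simple weekly-full model."""
    logical = weeks_retained                            # full backups retained, in data-set units
    physical = 1 + (weeks_retained - 1) * change_rate   # first full plus weekly changes
    return logical / physical * compression

print(round(dedupe_ratio(20, 0.05), 1))                   # 20.5: 20 weeks, 5% change
print(round(dedupe_ratio(4, 0.05), 1))                    # 7.0: short retention
print(round(dedupe_ratio(20, 0.10), 1))                   # 13.8: higher change rate
print(round(dedupe_ratio(20, 0.05, compression=1.0), 1))  # 10.3: precompressed data
```

Even in this crude form, the model lands close to the commonly cited 20:1 figure at a 5% change rate over 20 weeks, and it shows how shorter retention, faster change or precompressed data pull the ratio down.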

Hidden issues

Regardless of the deduplication approach, there are some hidden issues. For postprocessing disk libraries, as the amount of data increases, it may take much longer to deduplicate the data once the backups are complete. If the deduplication takes longer than the time between the end of one backup window and the start of the next, not all of the data from the first backup will be deduplicated, so users will need to ensure they can add more processing power to handle this load.
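Whether a postprocessing library can keep up is a matter of simple arithmetic, as the Python sketch below shows. The function name and the throughput figures are hypothetical; substitute measured numbers from your own environment.

```python
def postprocess_backlog(backup_tb, dedupe_tb_per_hour, idle_hours):
    """Return (hours needed to deduplicate, TB still undeduplicated at next backup)."""
    hours_needed = backup_tb / dedupe_tb_per_hour
    backlog_tb = max(0.0, backup_tb - dedupe_tb_per_hour * idle_hours)
    return hours_needed, backlog_tb

# Example: a 20TB nightly backup, 1.5TB/hour of deduplication, 12 idle hours.
hours_needed, backlog_tb = postprocess_backlog(20, 1.5, 12)
```

In this example the library would need a little more than 13 hours but has only 12, leaving 2TB unprocessed when the next backup starts; unless processing power is added, that backlog grows every night.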

Another potential problem may arise with inline or postprocessing disk libraries that aren't replicating the data to a remote disk library: the need to create tapes. The disk library needs sufficient time to first deduplicate the data and then undeduplicate a copy of the data to be spun off to tape. Both ExaGrid Systems' ExaGrid and Sepaton's S2100-ES2 avoid this undeduplication overhead because the last backup is only compressed, not deduplicated, so users can copy the job directly to tape.

Other postprocessing disk libraries like Spectra Logic Corp.'s nTier appliance allow users to run a local master or media server within their nTier appliance that alleviates some of the pain of this process. The nTier appliance eliminates the need to move data from host to media server to deduplication box to media server to tape, and allows the data to move from host to nTier appliance to tape. This design also eliminates the need to undeduplicate the data before storing it to tape.

Deduplicating backup software products that must operate in conjunction with enterprise backup products like Symantec Veritas NetBackup or EMC NetWorker face a different problem--allowing the enterprise backup software product to recognize and catalog the data it has backed up. While neither Asigra Televaulting nor EMC Avamar has any formal integration in place with any enterprise backup software product yet, Symantec Veritas NetBackup PureDisk 6.1 includes a NetBackup export engine that allows an administrator to copy a backed-up data selection from a PureDisk content router to NetBackup. NetBackup then catalogs the data and copies it to tape or disk and, from the NetBackup administration console, the storage administrator can treat those files as if they were native NetBackup files. Both EMC and Symantec anticipate tighter integration between their enterprise and deduplicating backup software products in the near future.

A final area of concern is the ability to selectively turn off deduplication for specific files or servers. This is important when compliance is an issue because the authenticity of data may come into question if it's deduplicated in any way. Also, if data is encrypted before the disk library receives it, deduplication provides no additional space-saving benefits; users should identify ahead of time which data is encrypted before it's stored or sent to a disk library.

As data stores continue to soar, integrating deduplication into the backup process is rapidly evolving from a nice-to-have capability to a must-have capability for most corporate environments. The good news is that for small and midsized businesses managing 10TB of data or less, using either type of deduplication product, backup software or disk library, will significantly shorten backup windows. The decision then becomes what product best fits your environment.

At the enterprise level, this isn't yet the case. Though promising work is occurring with inline approaches such as Diligent Technologies' ProtecTier Data Protection Platform and NEC's Hydrastor, most enterprises will find a postprocessing disk library such as Sepaton's S2100-ES2 a safer choice for now, until all of the costs, risks and processing overhead associated with inline deduplication are better understood and documented.
