Software-based data deduplication products
EMC Corp.'s Avamar software product performs in-band deduplication at the host server (the source) using the SHA-1 algorithm. Avamar employs a central management scheme to inspect data across the entire environment, but the actual deduplication is performed at each server before the data is sent to the backup storage platform. This saves storage space at the backup target and reduces network congestion. EMC reports plans to incorporate Avamar technology into its own backup software and virtual tape library (VTL) system in the near future.
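The source-side approach described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not Avamar's implementation: the fixed 4 KB chunk size, the `BackupTarget` class and the `backup_from_source` function are all assumptions made for the example. The host hashes each chunk with SHA-1 and sends only the chunks the target does not already hold.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; real products may chunk differently


class BackupTarget:
    """Hypothetical backup store that keeps one copy of each unique chunk."""

    def __init__(self):
        self.chunks = {}  # SHA-1 hex digest -> chunk bytes

    def has(self, digest):
        return digest in self.chunks

    def put(self, digest, chunk):
        self.chunks[digest] = chunk


def backup_from_source(data, target):
    """Hash chunks on the host and send only chunks the target lacks.

    Returns (recipe, bytes_sent): the ordered digests needed to rebuild
    the data, and how many chunk bytes had to cross the network.
    """
    recipe = []
    bytes_sent = 0
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha1(chunk).hexdigest()
        if not target.has(digest):
            target.put(digest, chunk)  # only new chunks are transferred
            bytes_sent += len(chunk)
        recipe.append(digest)
    return recipe, bytes_sent


def restore(recipe, target):
    """Rebuild the original data from its ordered list of digests."""
    return b"".join(target.chunks[digest] for digest in recipe)
```

Backing up the same data a second time through `backup_from_source` transfers no chunk bytes at all, which is where the network savings of source-side deduplication come from.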
Symantec Corp. provides software-based deduplication in its Veritas NetBackup product through a feature called PureDisk, which uses a proprietary hash algorithm to perform deduplication inline at each host server. NetBackup PureDisk 6.2 supports tape targets and the Backup Reporter monitoring tool. NetBackup 6.5 improves integration and support for deduplication, VTLs and third-party appliances.
Sepaton Inc. implements deduplication using DeltaStor software, an option on its S2100-ES2 VTL hardware product. Like PureDisk, DeltaStor uses a proprietary hash algorithm, but the S2100 deduplicates data at the VTL (the storage target). This means backup traffic is sent to the VTL before deduplication is performed, so there is no reduction in network traffic. DeltaStor also handles references differently from other deduplication schemes. Where most schemes write the first iteration of data in full and give later iterations pointers to it, DeltaStor writes the latest version in full and replaces the previous iterations with pointers, a technique called forward referencing, which promises faster restores because the most recent backup is stored whole.
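The difference between conventional (backward) referencing and forward referencing can be illustrated with a small model. This is a hedged sketch, not DeltaStor's actual design: the `ingest` and `restore` functions, the per-generation layout and the use of SHA-1 in place of the proprietary hash are all assumptions made for illustration. Each backup generation is a list of chunks, and every stored slot holds either the real bytes or a pointer to the slot that does.

```python
import hashlib


def ingest(generations, forward):
    """Store successive backups and return one layout per generation.

    Each layout entry is ("data", bytes) or ("ptr", (generation, index)).
    With forward=False the oldest copy keeps the bytes; with forward=True
    the newest copy keeps them and the older slot becomes a pointer.
    """
    layouts = []
    owner = {}  # chunk digest -> (generation, index) holding the real bytes
    for gen, chunks in enumerate(generations):
        layout = []
        layouts.append(layout)
        for idx, chunk in enumerate(chunks):
            digest = hashlib.sha1(chunk).hexdigest()
            if digest not in owner:
                layout.append(("data", chunk))
                owner[digest] = (gen, idx)
            elif forward and owner[digest][0] != gen:
                # Forward referencing: move the bytes to the newest backup
                # and leave a pointer behind in the older one.
                old_gen, old_idx = owner[digest]
                layout.append(("data", chunk))
                layouts[old_gen][old_idx] = ("ptr", (gen, idx))
                owner[digest] = (gen, idx)
            else:
                layout.append(("ptr", owner[digest]))
    return layouts


def restore(layouts, gen):
    """Rebuild one generation; return (bytes, pointer_hops_followed)."""
    out, hops = [], 0
    for kind, value in layouts[gen]:
        while kind == "ptr":
            hops += 1
            ref_gen, ref_idx = value
            kind, value = layouts[ref_gen][ref_idx]
        out.append(value)
    return b"".join(out), hops
```

In this model, restoring the newest generation under forward referencing follows pointers only for chunks repeated within that same backup, while older generations pay the dereferencing cost instead. That trade-off is the reasoning behind the claim of faster restores for recent backups.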
Compression, encryption and data deduplication
One of the stickiest issues with data deduplication is its relationship to compression and encryption. Traditional compression eliminates redundancy within files, deduplication eliminates redundant files, blocks or bits, and encryption turns data into a stream that looks random by design. So if you encrypt data first, it may be impossible to compress or deduplicate it, because the redundancy both techniques depend on is no longer visible. Ideally, data should be compressed and deduplicated first, and then encrypted as needed.
This isn't difficult when compression and deduplication are performed at the host server using backup software, and the resulting data stream is encrypted on the way to the backup target by a dedicated appliance, or at the tape library or LTO-4 drive. However, the ordering can be a problem when deduplicating at the target storage system. For example, if the backup data is encrypted by an inline appliance and then sent to a deduplication-capable storage system like the Sepaton S2100, it may be impossible to further compress or deduplicate the encrypted data.
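The ordering problem is easy to demonstrate. The sketch below is illustrative only: `toy_encrypt` is a deliberately simple stand-in cipher built from SHA-256 (it is not secure and is not what any encryption appliance uses), and the 4 KB chunk size, sample data and helper names are assumptions for the example. It measures how well the same data compresses and deduplicates before and after encryption.

```python
import hashlib
import os
import zlib

CHUNK = 4096  # assumed deduplication chunk size


def toy_encrypt(data, key):
    """Toy stream cipher (SHA-256 in counter mode with a random nonce).

    NOT secure; it only stands in for real encryption to show that
    ciphertext looks random to compression and deduplication.
    """
    nonce = os.urandom(16)
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return nonce + bytes(a ^ b for a, b in zip(data, stream))


def unique_chunks(data):
    """Count distinct fixed-size chunks, i.e. what deduplication must store."""
    return len({hashlib.sha1(data[i:i + CHUNK]).digest()
                for i in range(0, len(data), CHUNK)})


def report(data):
    """Summarize raw size, zlib-compressed size and unique chunk count."""
    return {
        "size": len(data),
        "compressed": len(zlib.compress(data)),
        "unique_chunks": unique_chunks(data),
    }


# Highly redundant sample data: one 4 KB block repeated 16 times.
block = (b"backup payload " * 300)[:CHUNK]
plain = block * 16
cipher = toy_encrypt(plain, b"example key")

print("plaintext: ", report(plain))
print("ciphertext:", report(cipher))
```

Because the toy cipher uses a fresh random nonce on every call, encrypting the same backup twice also produces two different ciphertexts, so a target-side deduplication system would find nothing in common between successive encrypted backups either.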
This was first published in November 2007