This article can also be found in the Premium Editorial Download "Storage magazine: What you need to know about data dedupe tools for backup."
In a multi-node architecture, a product can manage multiple dedupe systems as one. This approach provides linear throughput and capacity scaling, high availability and load balancing, and it reduces administrative overhead. Importantly, global deduplication is typical. ExaGrid EX Series, FalconStor File-interface Deduplication System (FDS), HP’s Virtual Library Systems (VLS), IBM ProtecTIER, NEC Hydrastor, Sepaton DeltaStor and Symantec NetBackup 5000 Series all have multi-node configurations and support global deduplication. The modular architectures of these products deliver impressive aggregate performance and let you grow the systems seamlessly.
Symantec’s appliance is a new entrant in the target deduplication system field through a joint venture with Huawei. Symantec maintains a unique position in the data protection market as the only vendor to offer integrated deduplication in both its own backup software and its own hardware-based products, as well as catalog-level integration with third-party backup target devices via its OpenStorage Technology (OST) interface.
Deduplication in backup software
While originally limited to so-called “next-generation” backup apps like EMC’s Avamar, deduplication in backup software is now pervasive. Backup software products with deduplication include Arkeia Network Backup, Asigra Cloud Backup, Atempo Time Navigator, CA ARCserve, Cofio Software AIMstor, CommVault Simpana, Druva InSync and Phoenix, EMC Avamar, i365 EVault, IBM Tivoli Storage Manager (TSM), Quest Software NetVault Backup, Symantec Backup Exec and NetBackup, and Veeam Backup & Replication.
In some backup software, client agents running on application servers identify unique data and transfer only that data to the backup media server and target storage device, reducing network traffic. Other products deduplicate the backup stream at the backup server, removing any potential performance burden from production application servers. In either case, the deduplication domain is limited to data protected by that backup application; running multiple backup applications in the same environment creates deduplication silos.
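The client-side exchange and the silo effect can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the BackupServer class, the client_backup function and the 4-byte block size are hypothetical names and values chosen for readability, not any vendor's protocol.

```python
import hashlib

class BackupServer:
    """Hypothetical media server holding one backup application's dedupe index."""

    def __init__(self) -> None:
        self.store: dict[str, bytes] = {}

    def missing(self, digests: list[str]) -> set[str]:
        # Tell the client which block hashes the server has not seen yet.
        return {d for d in digests if d not in self.store}

    def upload(self, blocks: dict[str, bytes]) -> None:
        self.store.update(blocks)

def client_backup(data: bytes, server: BackupServer, size: int = 4) -> int:
    """Back up `data`; return the number of payload bytes sent over the network."""
    blocks = [data[i:i + size] for i in range(0, len(data), size)]
    by_digest = {hashlib.sha256(b).hexdigest(): b for b in blocks}
    needed = server.missing(list(by_digest))          # hashes only, no payload
    server.upload({d: by_digest[d] for d in needed})  # unique blocks only
    return sum(len(by_digest[d]) for d in needed)

app_a = BackupServer()  # first backup application
app_b = BackupServer()  # second backup application: a separate dedupe silo

first = client_backup(b"AAAABBBBAAAA", app_a)   # sends AAAA and BBBB once
second = client_backup(b"AAAABBBBCCCC", app_a)  # sends only CCCC
other = client_backup(b"AAAABBBBCCCC", app_b)   # other silo: sends everything
print(first, second, other)  # 8 4 12
```

The second backup to the same application sends a single block, while the identical data sent to a different application is transferred and stored in full, which is the silo effect described above.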
Global deduplication isn’t a given with software approaches either. First of all, not all vendors employ the same techniques for identifying duplicates. Some (e.g., Asigra) deduplicate by employing delta differencing, which compares data segments for the same backup set over time. The delta identifies the blocks in the current set that differ from the previous backup of that set, and only those blocks are transferred. Delta differencing doesn’t make comparisons across different sets (i.e., no global deduplication).
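As a rough illustration of delta differencing, the sketch below compares a backup set block by block against the previous backup of the same set only. The function names and the 4-byte block size are assumptions made for the example; real products work on much larger blocks and differ in how they track changes.

```python
import hashlib

BLOCK_SIZE = 4  # tiny on purpose so the example is easy to follow (assumption)

def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]

def delta_blocks(previous: bytes, current: bytes) -> dict[int, bytes]:
    """Return {block_index: block} for blocks that differ from the
    previous backup of the SAME set. No other set is consulted."""
    prev_digests = [hashlib.sha256(b).digest() for b in split_blocks(previous)]
    changed: dict[int, bytes] = {}
    for idx, block in enumerate(split_blocks(current)):
        digest = hashlib.sha256(block).digest()
        if idx >= len(prev_digests) or prev_digests[idx] != digest:
            changed[idx] = block
    return changed

monday = b"AAAABBBBCCCC"
tuesday = b"AAAAXXXXCCCC"

delta = delta_blocks(monday, tuesday)
print(delta)  # {1: b'XXXX'}

# A different backup set with identical content has no previous backup of
# its own, so every block is transferred: there is no cross-set comparison.
other_set = delta_blocks(b"", monday)
print(len(other_set))  # 3
```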
Another approach is to use a hash algorithm. Some vendors segment the backup stream into fixed blocks (anywhere from 8 KB to 256 KB), generate a hash value for each block and compare it to a central index of hashes calculated for previously seen blocks. A hash that isn’t in the index indicates unique data that should be stored. A repeated hash signals redundant data, so a pointer to the already stored copy is kept instead. Other vendors rely on variable block sizes, which increase the odds that a common segment will be detected even after a file is modified. This approach finds natural patterns or break points in a file and segments the data accordingly, so even if data shifts when a file is changed, repeated segments are more likely to be found. The trade-off? A variable-length approach may require a vendor to track and compare more than just one unique ID for a segment, which could affect index size and computational time.
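To make the difference between fixed and variable blocks concrete, here is a minimal sketch. The HashIndex class and both chunkers are hypothetical, and the break-point rule (cut where the last three bytes sum to a multiple of 8) is a toy stand-in for the rolling fingerprints production systems typically use; block sizes are a few bytes rather than kilobytes.

```python
import hashlib

def fixed_chunks(data: bytes, size: int = 8) -> list[bytes]:
    """Cut the stream every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data: bytes, window: int = 3, divisor: int = 8) -> list[bytes]:
    """Cut wherever the last `window` bytes sum to a multiple of `divisor`.

    Toy break-point rule (assumption): boundaries depend only on nearby
    content, so they move along with the data when bytes are inserted.
    """
    chunks, start = [], 0
    for i in range(window - 1, len(data)):
        if sum(data[i - window + 1:i + 1]) % divisor == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

class HashIndex:
    """Central index: one stored copy per unique block hash."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}

    def store(self, chunks: list[bytes]) -> list[str]:
        recipe = []  # ordered hashes (pointers) needed to rebuild the stream
        for chunk in chunks:
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.blocks:  # new hash: unique data, store it
                self.blocks[digest] = chunk
            recipe.append(digest)          # repeated hash: pointer only
        return recipe

original = bytes(range(64))
modified = b"\xff" + original  # the same file with one byte inserted in front

for chunker in (fixed_chunks, variable_chunks):
    index = HashIndex()
    index.store(chunker(original))
    before = len(index.blocks)
    index.store(chunker(modified))
    print(chunker.__name__, "new blocks for modified file:", len(index.blocks) - before)
# fixed_chunks new blocks for modified file: 9
# variable_chunks new blocks for modified file: 2
```

With fixed blocks, the single inserted byte shifts every boundary and nothing in the modified file matches. With content-defined boundaries, only the blocks around the insertion are new.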
Arkeia Software uses another approach it calls progressive deduplication. This method optimizes deduplication with a sliding-window block size and a two-phase progressive-matching deduplication technique. Files are divided into fixed blocks, but the blocks can overlap so that when a file is changed, the block boundaries accommodate the insertion of bytes. Arkeia adds another level of optimization by automatically assigning fixed block sizes (from 1 KB to 32 KB) based on file type. The technique also uses a sliding window to determine duplicate blocks at every byte location in a file. Progressive deduplication is designed to achieve high reduction ratios and to minimize false positives while accelerating processing.
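The article doesn't detail Arkeia's algorithm, so the following is not its implementation. It is a generic sketch of the idea of testing a fixed-size window at every byte offset with a two-phase check: a cheap rolling checksum first, then a strong hash to rule out false positives (similar in spirit to rsync). All names and the 4-byte block size are assumptions.

```python
import hashlib

BLOCK = 4  # hypothetical block size; the article cites 1 KB to 32 KB

def weak(block: bytes) -> int:
    return sum(block)  # phase 1: cheap checksum, collisions are expected

def strong(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()  # phase 2: confirms a match

def build_index(old: bytes) -> dict[int, dict[str, int]]:
    """Index the known blocks: weak checksum -> {strong hash: offset}."""
    index: dict[int, dict[str, int]] = {}
    for off in range(0, len(old) - BLOCK + 1, BLOCK):
        block = old[off:off + BLOCK]
        index.setdefault(weak(block), {})[strong(block)] = off
    return index

def match(new: bytes, index: dict[int, dict[str, int]]) -> list[tuple]:
    """Return ("ref", offset_in_old) and ("raw", bytes) operations."""
    ops: list[tuple] = []
    literal = bytearray()
    pos = 0
    rolling = sum(new[:BLOCK])
    while pos + BLOCK <= len(new):
        candidates = index.get(rolling)
        hit = candidates.get(strong(new[pos:pos + BLOCK])) if candidates else None
        if hit is not None:
            if literal:
                ops.append(("raw", bytes(literal)))
                literal.clear()
            ops.append(("ref", hit))
            pos += BLOCK                       # skip past the matched block
            rolling = sum(new[pos:pos + BLOCK])
        else:
            literal.append(new[pos])
            if pos + BLOCK < len(new):         # slide the window by one byte
                rolling += new[pos + BLOCK] - new[pos]
            pos += 1
    literal.extend(new[pos:])                  # tail shorter than one block
    if literal:
        ops.append(("raw", bytes(literal)))
    return ops

old = b"AAAABBBBCCCC"
new = b"AAAAxBBBBCCCC"  # one byte inserted after the first block
ops = match(new, build_index(old))
print(ops)  # [('ref', 0), ('raw', b'x'), ('ref', 4), ('ref', 8)]
```

Because the window is tested at every byte position, the blocks after the inserted byte are still recognized. The strong hash is computed only when the weak checksum already matches, which keeps the per-byte cost low.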
This was first published in August 2011