NetApp: Post-process deduplication limits performance hit in primary storage data deduplication

NetApp's post-process approach to primary storage data deduplication limits the performance penalty to about 20%, and the company claims its deduplication on VMDK files delivers 70% space savings.

NetApp Inc. offers data deduplication as a feature of its Data Ontap operating system with its FAS and V-series systems. The company cites post-process deduplication as a major reason it's able to limit the deduplication performance penalty to 10% to 20% for average workloads. Writes are stored first, before any deduplication takes place, to minimize interference with application throughput. Deduplication runs later, either on a schedule (typically during off-peak hours) or automatically, based on the growth of the storage volume.

"It's always done in the background, and it's always done after the write occurs," said Larry Freeman, senior marketing manager for storage efficiency at NetApp. "If you run it more frequently, it's going to run faster because we're going to catch the duplicate blocks before there's too many of them."
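The two trigger modes described above can be illustrated with a minimal Python sketch. It is not NetApp's scheduler: the function name, the off-peak window and the 20% growth threshold are all hypothetical values chosen for illustration.

```python
from datetime import datetime, time

# Hypothetical settings for illustration only; not NetApp defaults.
OFF_PEAK_START = time(1, 0)   # assumed start of the off-peak window
OFF_PEAK_END = time(5, 0)     # assumed end of the off-peak window
GROWTH_THRESHOLD = 0.20       # assumed volume growth that triggers "auto" mode


def should_run_dedupe(now: datetime, size_at_last_run: int,
                      current_size: int, mode: str = "scheduled") -> bool:
    """Decide whether a background deduplication pass should start."""
    if mode == "scheduled":
        # Scheduled mode: run only inside the off-peak window.
        return OFF_PEAK_START <= now.time() < OFF_PEAK_END
    if mode == "auto":
        # Auto mode: run once the volume has grown enough since the last pass.
        if size_at_last_run <= 0:
            return current_size > 0
        growth = (current_size - size_at_last_run) / size_at_last_run
        return growth >= GROWTH_THRESHOLD
    raise ValueError(f"unknown mode: {mode}")
```

In either mode the decision is made outside the write path, which is the property Freeman emphasizes: the write has already completed by the time deduplication is considered.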

Inline vs. post-process deduplication

NetApp's post-process deduplication approach contrasts with the inline, or real-time, method used by some of the popular backup dedupe products such as EMC Corp.'s Data Domain. (Some of the other backup systems use post-process deduplication.) Inline dedupe removes the duplicates as they appear and wastes little space. But Freeman claimed the performance impact on CPU resources is too high for primary storage.

"They're intercepting [the data] at the storage controller, and they have to make an immediate real-time decision: Do I store this or do I reference it?" Freeman said of inline dedupe products. "You have to compare that data object to every other object that's been stored previously. They do this with some sophisticated look-up tables and hash comparisons, but the more data is in the system, the more extensive the look-up has to be, and the slower the system becomes."

Freeman said the vendor originally expected its dedupe to be used for backup and archiving, but customers found it especially valuable for reducing VMware virtual machine disk (VMDK) files. "We promoted that and it really just took off," he said. "There was no turning back. Deduplication became the focus of primary storage."

NetApp's post-process deduplication system uses a fingerprint catalog to identify candidates for data deduplication. Each 32-byte, algorithm-created fingerprint, which is also referred to as a digital signature or hash, references a larger 4 KB data block. When the system finds two fingerprints that match, it pulls the blocks into memory and does a byte-level validation to guard against false positives caused by hash collisions.

Multiple-block referencing technology then kicks in. Each of the data blocks has a pointer going to it. If two blocks validate as identical, the system moves one of the data pointers to point to the same block as the first pointer and releases the duplicate block back to the free pool on the storage system.

Freeman said NetApp's Data Ontap operating system is especially conducive to data deduplication because it includes a file system with data pointers to facilitate the multiple-block referencing. "All we needed to do to add deduplication was create a catalog of fingerprints to identify duplicate data," he said.
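The fingerprint catalog, byte-level validation and pointer redirection described above can be sketched in a few lines of Python. This is a toy model, not NetApp's implementation: the `Volume` class and its fields are hypothetical, and SHA-256 is used only because it happens to produce a 32-byte digest; the article does not name the fingerprint algorithm.

```python
import hashlib

BLOCK_SIZE = 4096  # 4 KB blocks, as described in the article


class Volume:
    """Toy volume: physical blocks plus logical pointers that reference them."""

    def __init__(self):
        self.blocks = {}      # physical block id -> block bytes
        self.pointers = []    # logical block index -> physical block id
        self.free_pool = []   # physical block ids released by deduplication
        self._next_id = 0

    def write(self, data: bytes) -> None:
        """Store every block as written; nothing is deduplicated on this path."""
        for off in range(0, len(data), BLOCK_SIZE):
            chunk = data[off:off + BLOCK_SIZE].ljust(BLOCK_SIZE, b"\0")
            self.blocks[self._next_id] = chunk
            self.pointers.append(self._next_id)
            self._next_id += 1

    def deduplicate(self) -> int:
        """Post-process pass; returns the number of blocks released."""
        catalog = {}  # 32-byte fingerprint -> id of the block that is kept
        remap = {}    # duplicate block id -> id of the block that is kept
        for block_id in sorted(self.blocks):
            data = self.blocks[block_id]
            fingerprint = hashlib.sha256(data).digest()  # assumed algorithm
            kept = catalog.get(fingerprint)
            if kept is None:
                catalog[fingerprint] = block_id
            elif self.blocks[kept] == data:
                # Byte-level validation passed: a true duplicate.
                remap[block_id] = kept
            # Otherwise the match was a hash collision; leave the block alone.
        # Redirect pointers to the kept block, then release the duplicates.
        self.pointers = [remap.get(p, p) for p in self.pointers]
        for dup in remap:
            del self.blocks[dup]
            self.free_pool.append(dup)
        return len(remap)


vol = Volume()
vol.write(b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE)
released = vol.deduplicate()
print(released, vol.pointers, vol.free_pool)  # 2 [0, 0, 0, 3] [1, 2]
```

In this example, three identical blocks collapse to one physical block referenced by three pointers, and the two redundant blocks return to the free pool. The fingerprint only nominates candidates; the byte comparison makes the final decision.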

NetApp deduplicates any raw data on the system, whether storage-area network (SAN) or network-attached storage (NAS). The system supports deduplication on a per-volume basis, with a volume limit of 16 TB. Future plans include addressing customer requests for increased volume sizes as well as deduplication across volumes.

Space savings average out at 30% across all storage tiers, performance workloads and applications, according to Freeman. He said the company doesn't break down the storage savings by tier. But with its leading use case, VMware Inc. VMDK files, space savings are in the range of 70%, he said.

The American Association of Airport Executives claimed initial space savings of approximately 30% on 1 TB of CIFS-based shared drives and 22% on 600 GB of NFS-based data using deduplication with the NetApp FAS 3140 it rolled out in February.

"If I don't have to keep growing that volume out but I can put more on it because of dedupe, I can not only store more locally but I can replicate more and have a better disaster recovery plan. And it doesn't take up any more bandwidth," said Patrick Osborne, senior vice president of IT at the Alexandria, Va.-based association.

But Osborne wasn't comfortable performing deduplication on all of his data. The association elected not to deduplicate its training videos and highly sensitive biometric files out of fear of corrupting the data, he said.

"I brought it to my users and said, 'Hey, we can do this [on the NetApp FAS 3140]. We might save space, but we don't know how it's going to work.' They said no," Osborne said. "Since I was saving space in those other areas where I was really looking to save space, I was OK."
