EMC Celerra: Primary storage data reduction through deduplication, compression

EMC's Celerra takes a hybrid approach to primary storage data reduction, using both compression and data deduplication to achieve about 50% to 60% space savings.

Celerra is currently the only primary storage subsystem in the EMC Corp. product family to provide primary storage data reduction. Celerra's data deduplication/compression service integrates a number of technologies that EMC acquired, including an extensible policy engine from Avamar and the compression algorithms of RecoverPoint.

Celerra Data Deduplication, a free operating system feature, works at the file level with CIFS and NFS data, and only on a per-file-system basis (file-level deduplication is also referred to as single-instance storage). That means if the same file is located in multiple file systems, the dedupe technology can't reduce it to a single copy. Compression also works only on a per-file-system basis.

Using the default settings, the policy engine scans production files once per week to look for data that hasn't been accessed in 30 days. The system compresses whichever files it can and creates a unique hash for each file. It then compares the hashes to see which complete files are redundant and removes the duplicate copies. Stubs point to the files in a hidden deduplication store.
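The scan described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not EMC's implementation: the names `FileRecord` and `reduce_file_system`, the choice of SHA-256 and zlib, and the decision to hash the uncompressed content are all assumptions made for the example.

```python
import hashlib
import zlib
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileRecord:
    """Hypothetical stand-in for one file in a single file system."""
    path: str
    last_access: datetime
    data: bytes


def reduce_file_system(files, now, min_idle_days=30):
    """Compress and single-instance files idle for at least min_idle_days.

    Returns (store, stubs): store maps a content hash to compressed bytes
    (a stand-in for the hidden deduplication store), and stubs maps each
    processed path to the hash it now points at.
    """
    cutoff = now - timedelta(days=min_idle_days)
    store = {}
    stubs = {}
    for f in files:
        if f.last_access > cutoff:
            continue  # accessed too recently, so the policy skips it
        # Assumption: the hash is taken over the original file content.
        digest = hashlib.sha256(f.data).hexdigest()
        if digest not in store:
            # First copy of this content: keep one compressed instance.
            store[digest] = zlib.compress(f.data)
        # Every matching file, duplicate or not, becomes a stub.
        stubs[f.path] = digest
    return store, stubs
```

Because the function is called once per file system, identical files in two different file systems would end up in two separate stores, which mirrors the per-file-system limitation noted earlier.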

Brad Bunce, director of unified storage marketing at EMC, said the most typical and beneficial use case is general-purpose Microsoft Corp. Office shares/files and home directories. Compression generally brings 40% to 50% space savings, and its file-level deduplication produces approximately 10%, he said.
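Those two figures line up with the overall 50% to 60% claim if both percentages are measured against the original capacity and simply added. That reading is an assumption, as is the 10 TB share used below; the helper `combined_savings_tb` is purely illustrative.

```python
def combined_savings_tb(capacity_tb, compression_pct, dedupe_pct):
    """Space saved, assuming both percentages apply to original capacity."""
    return capacity_tb * (compression_pct + dedupe_pct) / 100


# Hypothetical 10 TB share at the low and high ends of the quoted range.
low = combined_savings_tb(10, 40, 10)
high = combined_savings_tb(10, 50, 10)
print(low, high)
```

On these assumptions, a 10 TB share would shed between 5 TB and 6 TB.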

NetApp's more granular fixed-block deduplication produces at least twice the space savings. But NetApp doesn't offer compression, choosing to leave that to partners such as Storwize Inc.

"If you want to look at block-based deduplication, or deduplication of virtual machine files, for example, that's an area today that we don't compete with them at," Bunce acknowledged.

EMC's lower file-level deduplication rate is somewhat mitigated by the fact that it uses fewer system resources than fixed-block and variable-block deduplication. The resource impact of compression lies somewhere between that of file-level and block-level deduplication, Bunce added.

Bunce said future plans for primary storage data reduction call for greater efficiency for all types of storage, whether file or block, and more granular controls for end users to selectively deduplicate and compress their own data.
