Dedupe and compression cut storage down to size

Data reduction technologies like data deduplication and compression have been well integrated into backup systems with impressive results. Now those benefits are available for primary storage data systems.

Use less disk, save more electricity. What's not to like? If you buy the right products, you can pare down the disk capacity your data needs and maybe even cut your electric bills by as much as 50%. That's the promise of primary storage data reduction, and while slashing utility costs is appealing, there's still plenty of skepticism about the claimed benefits of the technology. While there's little dispute that this new class of products can reduce the amount of disk your primary storage uses, uncertainty remains about whether the gains outweigh the challenges of primary storage data reduction.

The key questions about primary storage data reduction include the following:

  • Why is it called "data reduction" rather than data deduplication?
  • Disk is cheap. Why bother adding new technologies to reduce the size of the data it holds?
  • What are the different types of data reduction for primary storage?
  • How much disk space can actually be saved?

Data reduction defined

In backup environments, data deduplication is a recognized and appropriate term for the technologies that eliminate redundancies in backup sets, but for primary storage, data reduction is a more accurate term because not all data reduction technologies use deduplication techniques. As an umbrella term, data reduction includes any technology that reduces the footprint of your data on disk. There are three main types of data reduction in use today: compression, file-level deduplication and sub-file-level deduplication.

Before we examine these different technologies -- all of which were used for backups before they were applied to primary storage -- let's look at how very different primary data storage is from backup data storage. The main difference between primary storage and backups is the expectation of the entity that's storing or accessing the data. Backups are typically written in large batches by automated processes that are very patient. These processes are accustomed to occasional slowdowns and unavailability of resources, and even have built-in technologies to accommodate such things. Backups are rarely read, and when they are, performance expectations are modest: Someone calls and requests a file or database to be restored, and an administrator initiates the restore request. Unless the restore takes an abnormally long time, no one truly notices how long it took. Most people have adjusted their expectations so that they're happy if the restore worked at all. (This is sad, but unfortunately true.) This typical usage pattern of a disk-based backup system means you could slow down backups quite a bit without a lot of people noticing.

Primary storage is very different. Data is written to primary storage throughout the day and it's typically written directly by real people who are entering numbers into spreadsheets, updating databases, storing documents or editing multimedia files. These activities could occur dozens, hundreds or even thousands of times a day, and the users know how long it takes when they click "Save." They also know how long it takes to access their documents, databases and websites. Inject something into the process that increases save time or access time from one or two seconds to three or four seconds, and watch your help desk light up like a Christmas tree.

This means that the No. 1 rule to keep in mind when introducing a change in your primary data storage system is primum non nocere, or "First, do no harm." Data reduction techniques can definitely help save money in disk systems, and power and cooling costs, but if by introducing these technologies you negatively impact the user experience, the benefits of data reduction may seem far less attractive.

The next challenge for data reduction in primary data storage is the expectation that space-saving ratios will be comparable to those achieved with data deduplication for backups. They won't. Most backup software creates enormous amounts of duplicate data, with multiple copies stored in multiple places. Although there are exceptions, that's not typically the case in primary storage. Many people feel that any reduction beyond 50% (a 2:1 reduction ratio) should be considered gravy. This is why most vendors of primary data reduction systems don't talk much about ratios; rather, they're more likely to cite reduction percentages. (A 75% reduction in storage sounds a whole lot better than a 4:1 reduction ratio.)
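
Because vendors switch between ratios and percentages, it helps to be able to convert one into the other. The arithmetic is simple: an N:1 ratio means the data occupies 1/N of its original space. The following is a minimal sketch; the function names are ours, chosen for illustration.

```python
def ratio_to_percent(ratio: float) -> float:
    """Convert an N:1 reduction ratio to the percentage of space saved."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return (1 - 1 / ratio) * 100


def percent_to_ratio(percent: float) -> float:
    """Convert a percentage of space saved to an N:1 reduction ratio."""
    if not 0 <= percent < 100:
        raise ValueError("percent must be in the range [0, 100)")
    return 100 / (100 - percent)


# Typical primary storage ratios next to typical backup ratios.
for ratio in (2, 5, 10, 20):
    print(f"{ratio}:1 -> {ratio_to_percent(ratio):.1f}% saved")
```

Note how quickly the percentages flatten out: going from 10:1 to 20:1 doubles the ratio but adds only five percentage points of savings.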

If you're considering implementing data reduction in primary data storage, the bottom line is this: compared to deploying deduplication in a backup environment, the job is harder and the rewards are fewer. That's not to suggest you shouldn't consider primary storage data reduction technologies, but rather to help you properly set expectations before making the commitment.

Primary storage data reduction technologies

Compression. Compression technologies have been around for decades, but compression is typically used for data that's not accessed very much. That's because the act of compressing and uncompressing data can be a very CPU-intensive process that tends to slow down access to the data (remember: primum non nocere).

There's one area of the data center, however, where compression is widely used: backup. Every modern tape drive is able to dynamically compress data during backups and uncompress data during restores. Not only does compression not slow down backups, it actually speeds them up. How is that possible? The secret is that the drives use a chip that can compress and uncompress at line speeds. By compressing the data by approximately 50%, it essentially halves the amount of data the tape drive has to write. Because the tape head is the bottleneck, compression actually increases the effective speed of the drive.

Compression systems for primary data storage use the same concept. Products such as Ocarina Networks' ECOsystem appliances and Storwize Inc.'s STN-2100 and STN-6000 appliances compress data as it's being stored and then uncompress it as it's being read. If they can do this at line speed, it shouldn't slow down write or read performance. They should also be able to reduce the amount of disk necessary to store files by between 30% and 75%, depending on the algorithms they use and the type of data they're compressing. The advantage of compression is that it's a very mature and well-understood technology. The disadvantage is that it only finds patterns within a file and doesn't find patterns between files, which limits its ability to reduce the size of data.

File-level deduplication. A system employing file-level deduplication examines the file system to see if any two files are identical. If it finds two identical files, one of them is replaced with a link to the other file. The advantage of this technique is that there should be no change in access times, as the file doesn't need to be decompressed or reassembled prior to being presented to the requester; it's simply two different links to the same data. The disadvantage of this approach is that it won't achieve the same reduction rates as compression or sub-file-level deduplication, because a single changed byte is enough to make two files count as different.
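
The idea can be sketched in a few lines of Python. This is an illustration under our own assumptions, not any vendor's implementation: it treats files with the same SHA-256 digest as identical and uses ordinary hard links as the "link," so it only works within a single file system. Real products also have to handle what happens when one of the linked files is later modified, which this sketch ignores.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def file_digest(path: Path, block_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in blocks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(block_size):
            h.update(block)
    return h.hexdigest()


def dedupe_files(root: Path) -> int:
    """Replace identical files under root with hard links; return bytes saved."""
    seen: dict[str, Path] = {}  # digest -> first file seen with that content
    saved = 0
    candidates = (p for p in root.rglob("*") if p.is_file() and not p.is_symlink())
    for path in sorted(candidates):
        digest = file_digest(path)
        original = seen.get(digest)
        if original is None:
            seen[digest] = path
            continue
        if path.samefile(original):
            continue  # already a link to the same data
        saved += path.stat().st_size
        path.unlink()
        os.link(original, path)  # the second name now points at the first file's data
    return saved


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    content = b"quarterly numbers" * 1000
    (root / "report.doc").write_bytes(content)
    (root / "report_copy.doc").write_bytes(content)
    (root / "notes.txt").write_bytes(b"something else")
    bytes_saved = dedupe_files(root)
    still_readable = (root / "report_copy.doc").read_bytes() == content

print(f"Saved {bytes_saved} bytes")
```

Reading the deduplicated copy requires no extra work, which is why access times don't change.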

Sub-file-level deduplication. This approach is very similar to the technology used in hash-based data deduplication systems for backup. It breaks all files down into segments or chunks, and then runs those chunks through a cryptographic hashing algorithm to create a numeric value that's then compared to the numeric value of every other chunk that has ever been seen by the deduplication system. If the hashes from two different chunks are the same, one of the chunks is discarded and replaced with a pointer to the other identical chunk.
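
A toy version of that process might look like the following. It's a sketch under stated assumptions: fixed 8 KB chunks, SHA-256 as the hash and an in-memory dictionary as the chunk index. The `ChunkStore` class and its method names are hypothetical, and commercial products may chunk, hash and index quite differently.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # fixed 8 KB chunks; an assumption for illustration


class ChunkStore:
    """Stores each unique chunk once and keeps a list of pointers per file."""

    def __init__(self) -> None:
        self.chunks: dict[bytes, bytes] = {}      # digest -> chunk data
        self.files: dict[str, list[bytes]] = {}   # file name -> ordered digests

    def write(self, name: str, data: bytes) -> None:
        pointers = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).digest()
            self.chunks.setdefault(digest, chunk)  # a chunk seen before is not stored again
            pointers.append(digest)
        self.files[name] = pointers

    def read(self, name: str) -> bytes:
        # Reassemble the file by following its pointers in order.
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.chunks.values())


# Two versions of a 64 KB file that differ only in the last 8 KB.
v1 = b"".join(bytes([i]) * CHUNK_SIZE for i in range(8))
v2 = v1[:-CHUNK_SIZE] + b"\xff" * CHUNK_SIZE

store = ChunkStore()
store.write("budget_v1.xls", v1)
store.write("budget_v2.xls", v2)

logical = len(v1) + len(v2)
print(f"logical: {logical} bytes, stored: {store.stored_bytes()} bytes")
```

The second version costs only one new chunk, because its first seven chunks are already in the store. Unlike file-level deduplication, reads now involve reassembling the file from its chunks.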

Depending on the type of data, a sub-file-level deduplication system can reduce the size of data quite a bit. The most dramatic results using this technique are achieved with virtual system images, and especially virtual desktop images. It's not uncommon to achieve reductions of 75% to 90% in such environments. In other environments, the amount of reduction will be based on the degree to which users create duplicates of their own data. Some users, for example, save multiple versions of their files on their home directories. They get to a "good point" and save the file, and then save it a second time with a new name. This way, they know that no matter what they do, they can always revert to the previous version. But this practice can result in many versions of an individual file -- and users rarely go back and remove older file versions. In addition, many users download the same file as their coworkers and store it on their home directory. These activities are why sub-file-level deduplication works even within a typical user home directory.

The advantage of sub-file-level deduplication is that it will find duplicate patterns all over the place, no matter how the data has been saved. The disadvantage is that it works at the macro level, whereas compression works at the micro level. Deduplication might identify a redundant 8 KB segment and store it only once, for example, but it does nothing to shrink the one copy it keeps; a good compression algorithm might reduce that segment to 4 KB. That's why some data reduction systems use compression in conjunction with some type of deduplication.
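
To see why the two techniques complement each other, consider this rough sketch. It deduplicates 8 KB chunks first and then compresses each unique chunk with zlib. The chunk size, the choice of zlib and the sample data are all our assumptions, and the actual savings depend entirely on the data.

```python
import hashlib
import zlib

CHUNK_SIZE = 8 * 1024  # assumed chunk size


def unique_chunks(files: list[bytes]) -> dict[bytes, bytes]:
    """Macro level: keep one copy of each distinct chunk across all files."""
    unique: dict[bytes, bytes] = {}
    for data in files:
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            unique.setdefault(hashlib.sha256(chunk).digest(), chunk)
    return unique


# Made-up sample data: a text export and a second copy with one record appended.
records = b"".join(
    f"user{i:06d},status=active,quota=100GB\n".encode() for i in range(20000)
)
files = [records, records + b"user999999,status=new,quota=5GB\n"]

raw_size = sum(len(f) for f in files)
chunks = unique_chunks(files)
deduped_size = sum(len(c) for c in chunks.values())
# Micro level: squeeze the patterns inside each chunk that survived deduplication.
both_size = sum(len(zlib.compress(c)) for c in chunks.values())

print(f"raw: {raw_size}  deduplicated: {deduped_size}  deduplicated + compressed: {both_size}")
```

Deduplication removes the second copy of nearly every chunk; compression then shrinks the copies that remain.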

Is archiving data reduction?

Some vendors consider archiving and hierarchical storage management (HSM) to be data reduction technologies. Both archiving and HSM systems can reduce the amount of disk you need to store your primary data, but they do so by moving data from one storage system to another. While they may save you money, they're not truly reducing the size of the data -- they're just moving it to less-expensive storage. Therefore, while these are good technologies that companies with a lot of data should explore, they're not data reduction per se.

A sampler of primary storage data reduction products
The following vendors currently offer primary storage data reduction products (listed in alphabetical order):

EMC Corp. EMC introduced file-level deduplication and compression of inactive files in its Celerra filer. Administrators can configure various settings, such as how old a file must be before it's a candidate for this process, and what file sizes the process should look for. While deduplication and compression of older data obviously won't generate as much data reduction as compressing or deduplicating everything, EMC customers have reported significant savings using this data reduction implementation.

Exar Corp. Exar gained data deduplication technology with its April 2009 acquisition of Hifn Inc. End users may be unfamiliar with Exar, but they may already be using its products. Many high-end virtual tape libraries (VTLs) and data deduplication systems for backups use Exar hardware compression cards for data compression. Exar has now released a card, designed to be placed into a Windows or Linux server, that will deduplicate data as it's being written to any hard drive. Exar's Hifn BitWackr B1605R is a hardware and software product that offloads data deduplication and compression from a server's CPU and makes adding data reduction to a Windows or Linux server a relatively easy process.

GreenBytes Inc. GreenBytes is in something of a unique position, as it's the first vendor attempting to make a single product to address the data reduction needs of both backup and primary data storage in its GB-X Series of network-attached storage (NAS) and storage-area network (SAN) storage devices. The firm uses a hash-based data deduplication technology, but the hash algorithm is different from that used by all other vendors: Instead of the widely used SHA-1, GreenBytes uses Tiger, which it says is more suited to general-purpose processors than SHA-1 and, therefore, offers significant performance advantages while not decreasing data integrity. Tiger's key space (192 bits) is significantly larger than that of SHA-1 (160 bits), which further reduces the chances of a hash collision. GreenBytes is also making extensive use of solid-state disk as a cache in front of SATA disk so that it can better meet the performance needs of primary data storage users.
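
To put the key-space difference in perspective, the standard birthday-bound approximation estimates the chance that any two of n distinct chunks share a hash value as roughly n(n-1)/2^(b+1) for a b-bit hash. The sketch below applies it to both hash lengths. The chunk count is a hypothetical figure of ours, and the approximation assumes hash values are uniformly distributed.

```python
def collision_probability(num_chunks: int, hash_bits: int) -> float:
    """Birthday-bound approximation; only meaningful while the result is small."""
    return num_chunks * (num_chunks - 1) / 2 ** (hash_bits + 1)


# Hypothetical system: 2^40 unique chunks (about 8 PB of unique data at 8 KB per chunk).
chunks = 2 ** 40
p_160 = collision_probability(chunks, 160)
p_192 = collision_probability(chunks, 192)

print(f"160-bit hash: {p_160:.1e}")
print(f"192-bit hash: {p_192:.1e}")
```

Both probabilities are vanishingly small; the extra 32 bits make the larger key space's figure smaller by a factor of about 2^32.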

Microsoft Corp. With its Windows Storage Server 2008, Microsoft offers file-level single-instance deduplication built into the operating system. A number of storage systems vendors are taking advantage of the built-in SIS, including Hewlett-Packard's StorageWorks X-series Network Storage Systems and Compellent's Storage Center with NAS. File-level deduplication alone will provide modest space savings for users of these systems.

NetApp Inc. NetApp was the first primary data storage vendor to offer deduplication, which leverages the company's existing write anywhere file layout (WAFL) file system technology. The WAFL file system already computes a CRC checksum for each block of data it stores, and has block-based pointers integrated into the file system. (It's the secret behind NetApp's ability to have hundreds of snapshots without any performance degradation.) An optional process that runs during times of low activity examines all checksums; if two checksums match, the filer does a block-level comparison of those blocks. If the comparison shows a complete match, one of the blocks is replaced with a WAFL pointer. The result is sub-file-level deduplication without a significant impact on performance. NetApp's deduplication system has been tested by many users against multiple data types, including home directories, databases and virtual images, and most users have reported positive results in both reduction percentages and performance. As of this writing, NetApp uses only deduplication and doesn't do compression.
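
The two-step check described above (a cheap checksum match first, then a full comparison) can be illustrated as follows. This is our own simplified sketch, not NetApp's code: it uses zlib's CRC-32 as a stand-in checksum, 4 KB blocks, and a plain list of indexes in place of file system pointers.

```python
import zlib
from collections import defaultdict


def dedupe_blocks(blocks: list[bytes]) -> list[int]:
    """For each block, return the index of the block that holds its data."""
    by_checksum: dict[int, list[int]] = defaultdict(list)
    pointers = list(range(len(blocks)))  # initially every block points at itself
    for idx, block in enumerate(blocks):
        checksum = zlib.crc32(block)
        for candidate in by_checksum[checksum]:
            # A checksum match is only a hint; the full comparison guards
            # against two different blocks that happen to share a checksum.
            if blocks[candidate] == block:
                pointers[idx] = candidate
                break
        else:
            by_checksum[checksum].append(idx)  # first block with this content
    return pointers


blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"C" * 4096, b"B" * 4096]
pointers = dedupe_blocks(blocks)
unique_blocks = len(set(pointers))
print(pointers, unique_blocks)
```

Because every match is verified byte for byte, a short checksum is good enough to nominate candidates; a collision costs only a wasted comparison, never data integrity.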

Nexenta Systems Inc. Nexenta uses the Oracle Solaris ZFS file system in its NexentaStor family of storage system software products, which are based on the open source OpenSolaris platform; however, the firm has added more than 30 features to its ZFS-based offering that are only available from Nexenta. Examples of these features include an integrated management console, LDAP integration, continuous data protection (CDP) and synchronous replication. The recently announced NexentaStor 3.0 offers deduplicated storage that's fully integrated with Citrix Systems Inc. XenServer, Microsoft Corp. Hyper-V and VMware Inc. vSphere.

Ocarina Networks. Ocarina takes a very different approach to data reduction than many other vendors. Where most vendors apply compression and deduplication without any knowledge of the data, Ocarina has hundreds of different compression and deduplication algorithms that it uses depending on the specific type of data. For example, the company uses completely different techniques to compress images and Word documents. It also understands encapsulation systems such as the Digital Imaging and Communications in Medicine (DICOM) system. Ocarina will actually disassemble a DICOM container, examine and deduplicate the various components, and then reassemble the container. As a result, Ocarina can often achieve much greater compression and deduplication rates than other vendors can realize with the same data types.

Ocarina isn't a storage vendor; it works with existing data storage system vendors that will allow Ocarina to interface with their systems. Ocarina is currently partnering with BlueArc Corp., EMC, Hewlett-Packard, Hitachi Data Systems and Isilon Systems Inc.

Oracle-Sun. Oracle's Solaris ZFS file system also has sub-file-level data deduplication built into it. As of this writing, there's not much available information about how well it deduplicates data or how it performs in user production environments. However, the ZFS website does state that there shouldn't be a significant difference in performance between deduplicated and native data, as long as the hash table used for deduplication can fit into memory.

New and growing fast

A little over a year ago, there were virtually no viable options for reducing data in primary storage. Now there are half a dozen or so, with more on the way. Given the runaway growth in file storage that most companies are experiencing, it shouldn't take long for data reduction technologies to find their way into many of the products offered by data storage systems vendors.

BIO: W. Curtis Preston is an executive editor in TechTarget's Storage Media Group and an independent backup expert. Curtis has worked extensively with data deduplication and other data reduction systems.