This article can also be found in the Premium Editorial Download "Storage magazine: Primary storage dishes up dedupe."
Download it now to read this article plus other related content.
This means that the No. 1 rule to keep in mind when introducing a change in your primary data storage system is primum non nocere, or "First, do no harm." Data reduction techniques can definitely help save money in disk systems, and power and cooling costs, but if by introducing these technologies you negatively impact the user experience, the benefits of data reduction may seem far less attractive.
The next challenge for data reduction in primary data storage is the expectation that space-saving ratios will be comparable to those achieved with data deduplication for backups. They won't. Most backup software creates enormous amounts of duplicate data, with multiple copies stored in multiple places. Although there are exceptions, that's not typically the case in primary storage. Many people feel that any reduction beyond 50% (a 2:1 reduction ratio) should be considered gravy. This is why most vendors of primary data reduction systems don't talk much about ratios; rather, they're more likely to cite reduction percentages. (A 75% reduction in storage sounds a whole lot better than a 3:1 reduction ratio.)
If you're considering implementing data reduction in primary data storage, the bottom line is this: compared to deploying deduplication in a backup environment, the job is harder and the rewards are fewer. That's not to suggest you shouldn't consider primary storage data reduction technologies, but rather to help you properly set expectations before making
Primary storage data reduction technologies
Compression. Compression technologies have been around for decades, but compression is typically used for data that's not accessed very much. That's because the act of compressing and uncompressing data can be a very CPU-intensive process that tends to slow down access to the data (remember: primum non nocere).
There's one area of the data center, however, where compression is widely used: backup. Every modern tape drive is able to dynamically compress data during backups and uncompress data during restores. Not only does compression not slow down backups, it actually speeds them up. How is that possible? The secret is that the drives use a chip that can compress and uncompress at line speeds. By compressing the data by approximately 50%, it essentially halves the amount of data the tape drive has to write. Because the tape head is the bottleneck, compression actually increases the effective speed of the drive.
Compression systems for primary data storage use the same concept. Products such as Ocarina Networks' ECOsystem appliances and Storewize Inc.'s STN-2100 and STN-6000 appliances compress data as it's being stored and then uncompress it as it's being read. If they can do this at line speed, it shouldn't slow down write or read performance. They should also be able to reduce the amount of disk necessary to store files by between 30% and 75%, depending on the algorithms they use and the type of data they're compressing. The advantage of compression is that it's a very mature and well understood technology. The disadvantage is that it only finds patterns within a file and doesn't find patterns between files, therefore limiting its ability to reduce the size of data.
File-level deduplication. A system employing file-level deduplication examines the file system to see if two files are exactly identical. If it finds two identical files, one of them is replaced with a link to the other file. The advantage of this technique is that there should be no change in access times, as the file doesn't need to be decompressed or reassembled prior to being presented to the requester; it's simply two different links to the same data. The disadvantage of this approach is that it will obviously not achieve the same reduction rates as compression or sub-file-level deduplication.
Sub-file-level deduplication. This approach is very similar to the technology used in hash-based data deduplication systems for backup. It breaks all files down into segments or chunks, and then runs those chunks through a cryptographic hashing algorithm to create a numeric value that's then compared to the numeric value of every other chunk that has ever been seen by the deduplication system. If the hashes from two different chunks are the same, one of the chunks is discarded and replaced with a pointer to the other identical chunk.
This was first published in April 2010