Cut data down to size

With today's extreme data growth rates, adding disk-based protection is no longer an option but a requisite. Data reduction can help ease growth pains by paring down the data that goes to disk. There are many products with data-reduction capabilities available, but the technologies they use vary widely.

This article can also be found in the Premium Editorial Download: Storage magazine: Lessons learned from creating and managing a scalable SAN.

Data-reduction technologies are emerging as key components of data protection products. By reducing the amount of data stored, you can cut storage costs and gain greater backup efficiency.

Whether your company is big or small, you've likely seen your digital data grow at an alarming rate. Cheaper storage arrays and the emergence of cost- and capacity-efficient SATA drives have helped relieve the pressure, but just throwing disk at the problem isn't a long-term solution. Storage management costs are generally proportional to the amount of data managed. Reducing the amount of data that needs to be stored and managed is one big way to reduce storage costs.

Squeezing the air out of data
Data compression has been around since the 1970s, when Abraham Lempel and Jacob Ziv developed the Lempel-Ziv (LZ) algorithm. But with the advent of other technologies in the past five years, "compression" has taken on a more generic--and sometimes confusing--definition.

Broadly speaking, compression means modifying data in such a way that the information remains intact, but occupies a smaller amount of storage. Technologies such as data deduplication, single instancing, commonality factoring and data coalescence can also achieve this goal.


Lempel and Ziv created an algorithm that compares each incoming piece of data with the most recently processed data to determine whether it repeats something already seen. If it does, the repeated data is replaced with a short token that points back to the earlier occurrence. Comparing each new piece of data with all the data that came before is computationally prohibitive, so the LZ algorithm searches only a small window of recent data. Compression ratios are therefore limited, with typical backup data yielding about 2:1 compression. Most tape drives and tape libraries have this compression capability built in.
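The windowing tradeoff described above can be illustrated with a toy encoder. This is a minimal sketch in the spirit of LZ, not the algorithm as implemented in any tape drive or product; the function names, the token format and the `window` and `min_match` parameters are all assumptions made for illustration.

```python
def lz_compress(data, window=32, min_match=3):
    """Toy LZ-style encoder (illustrative only).

    Emits either a literal byte value or a (distance, length) token that
    points back to a repeat found within the last `window` bytes.
    """
    out = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Only the most recent `window` bytes are searched for a repeat.
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            out.append((best_dist, best_len))
            i += best_len
        else:
            out.append(data[i])
            i += 1
    return out


def lz_decompress(tokens):
    """Rebuild the original bytes from literals and back-reference tokens."""
    buf = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            for _ in range(length):
                buf.append(buf[-dist])  # byte-by-byte so overlapping copies work
        else:
            buf.append(tok)
    return bytes(buf)


# The same 8-byte string appears twice, separated by 40 unrelated bytes.
sample = b"ABCDEFGH" + bytes(range(100, 140)) + b"ABCDEFGH"
small_window = lz_compress(sample, window=16)  # repeat is out of reach
large_window = lz_compress(sample, window=64)  # repeat is found
print(len(small_window), len(large_window))
```

With the 16-byte window the second copy of the string is too far back to be found, so every byte is emitted as a literal; the 64-byte window replaces it with a single token, at the cost of a longer search for every position.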

Beyond LZ
Many of the traditional data-reduction technologies have significant limitations, including:

  • LZ and its variants offer limited compression because they compare incoming data only to a small amount of existing data.
  • Compression ratios vary depending on data type.
  • Current systems look only at commonality within one system. Global enterprises often have hundreds, and sometimes thousands, of servers with vast amounts of data duplication across systems.
Several new approaches on the market use four fundamental technologies. They are:
  • Hash-based commonality factoring
  • Pattern recognition
  • A hybrid approach combining the two methods
  • Byte-level delta differencing using versioning
All of these methodologies can be used in conjunction with compression technologies such as LZ (see "Comparison of different data-reduction approaches," PDF file).

Hash-based commonality factoring
Hash-based commonality factoring is the dominant capacity-optimization technology today. It's used in products such as Avamar Technologies Inc.'s Axion, Data Domain Inc.'s DD400 Enterprise Series (which uses the company's Global Compression Technology), EMC Corp.'s Centera and Hewlett-Packard (HP) Co.'s Reference Information Storage System (RISS). A hash is the result of applying an algorithm to some data to derive a fixed-length number that acts as a fingerprint of that data. It's extremely unlikely that any two different files would produce the same hash result.

Hashes were originally created to ensure file authenticity. If a file and its hash were transmitted to another location, one could ensure authenticity by recalculating the received file's hash on the remote side and matching it to the transmitted hash. Today, hashes are used for capacity optimization in a slightly different way. Each file (or subset of a file, called a chunk) is converted into a hash. When a new file or chunk arrives, its hash is compared with the hashes already stored; if there's a match, the data isn't stored a second time. Because complete files (or chunks) are much larger than their hashes, comparing hashes is much easier computationally than comparing complete files for duplication. The hash approach can work company-wide because a specific file will create the same hash, regardless of where it resides. A file can always be addressed by its hash, no matter where it resides geographically or on what system. There are no path names or file names.
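The store-once behavior described above can be sketched in a few lines. This is a minimal illustration, not any vendor's design: the `HashStore` class, its method names and the 4,096-byte fixed chunk size are assumptions, and SHA-1 from Python's standard `hashlib` stands in for whatever hash a given product uses.

```python
import hashlib


class HashStore:
    """Hypothetical content-addressed store: each unique chunk is kept once."""

    def __init__(self):
        self.chunks = {}  # hex digest -> chunk bytes

    def put(self, chunk):
        """Store a chunk only if its hash is new; return the hash either way."""
        digest = hashlib.sha1(chunk).hexdigest()
        self.chunks.setdefault(digest, chunk)
        return digest

    def put_file(self, data, chunk_size=4096):
        """Split a file into fixed-size chunks; return its list of hashes."""
        return [self.put(data[i:i + chunk_size])
                for i in range(0, len(data), chunk_size)]

    def get_file(self, recipe):
        """Rebuild a file from its list of chunk hashes."""
        return b"".join(self.chunks[d] for d in recipe)


store = HashStore()
file_a = b"x" * 4096 + b"y" * 4096
file_b = b"x" * 4096 + b"z" * 4096      # shares its first chunk with file_a

recipe_1 = store.put_file(file_a)       # e.g., backed up from one server
recipe_2 = store.put_file(file_a)       # same file arriving from another server
recipe_3 = store.put_file(file_b)

print(len(store.chunks))                # unique chunks actually stored
```

Note that the file is addressed purely by its list of hashes; no path or file name is involved, so the same content arriving from a different system maps to the same stored chunks.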

EMC's Centera uses the MD5 hash standard to handle data archiving, whereas Avamar's backup and restore product applies 160-bit SHA-1 (SHA160) hashing to variable-sized file chunks (and to fixed-sized chunks for databases). Data Domain uses hash technology for its backup/restore product, but the firm is tight-lipped about which method it uses. HP's RISS, an archival platform, also uses SHA160 and allows users to choose between complete files and fixed- or variable-sized chunks.

All hashing solutions involve design tradeoffs, such as the choice of chunk size. Smaller chunks expose more commonality, but require larger indexes and more compute power. EMC chose the whole file rather than a chunk as Centera's basic element for hashing; it therefore eliminates duplication only at the file level. This keeps the system simple and the searches fast, but it doesn't achieve capacity-optimization levels as high as the Avamar system, for example.

Another consideration is whether to use fixed- or variable-sized chunks. Fixed-sized chunks are easier to handle, but suffer from the "slide" syndrome: when a byte is inserted, every chunk after the insertion point shifts and requires a new hash, even though that data is unchanged. Variable-sized chunking sets chunk boundaries based on the content itself, so it can absorb the slide effect and create only one new chunk to reflect the change. Databases have more structured formats with well-defined and often fixed-length fields, so many data-reduction products use a fixed-chunk approach for them.
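The slide effect is easy to reproduce. The sketch below is illustrative only: the chunk sizes, the 8-byte fingerprint window, the use of CRC-32 as the fingerprint and all function names are assumptions, and no vendor's actual boundary-selection method is implied. It inserts one byte into 4,096 bytes of pseudo-random data and counts how many chunks would have to be stored as new under each scheme.

```python
import hashlib
import random
import zlib


def fixed_chunks(data, size=64):
    """Cut at fixed offsets: 0, size, 2*size, ..."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data, window=8, divisor=32):
    """Cut wherever a fingerprint of the preceding `window` bytes hits a
    chosen value, so boundaries depend on content rather than on offsets."""
    chunks, start = [], 0
    for i in range(window, len(data) + 1):
        # A production system would use a true rolling hash; recomputing
        # CRC-32 at every position just keeps this sketch short.
        if zlib.crc32(data[i - window:i]) % divisor == 0:
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def new_chunk_count(chunker, old, new):
    """Count chunks of `new` whose hash was not produced by `old`."""
    seen = {hashlib.sha1(c).digest() for c in chunker(old)}
    return sum(hashlib.sha1(c).digest() not in seen for c in chunker(new))


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
edited = original[:100] + b"!" + original[100:]  # insert a single byte

fixed_new = new_chunk_count(fixed_chunks, original, edited)
variable_new = new_chunk_count(variable_chunks, original, edited)
print("new chunks, fixed:", fixed_new, "variable:", variable_new)
```

With fixed chunks, everything after the insertion point is re-stored. With content-defined boundaries, only the chunks touching the few bytes around the insertion can change; in this sketch that is at most `window + 1` chunks, and the boundaries realign immediately afterward.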

Pros and cons of various data-reduction technologies

Pattern recognition
A new data-reduction method based on pattern recognition was introduced by Diligent Technologies Corp., a Framingham, MA-based virtual tape library (VTL) vendor. With pattern recognition, the incoming data is reviewed to see if it resembles data received in the past. If the new data is similar, the precise difference is identified and only the unique bytes are stored. The algorithms used are sophisticated, and the main payoff is in indexing: the index, even for large repositories, is so small that inexpensive servers can be used as the data-reduction engine. For instance, a 1 petabyte (PB) repository requires only a 4GB index, which can easily be held in the memory of a small server. By comparison, a chunk-based hashing methodology would require a 20GB index for a 10TB repository. Because an index that small can stay in memory, lookups avoid disk access, which should improve performance, all else being equal.
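To see where a figure like 20GB for 10TB can come from, here is a back-of-the-envelope calculation. The 10KB average chunk size and the 20-byte index entry (the size of one 160-bit hash) are assumptions picked for illustration because they happen to reproduce the number quoted above; they are not published vendor parameters, and a real index would also carry location data and other overhead.

```python
TB = 10**12
GB = 10**9


def hash_index_bytes(repo_bytes, avg_chunk_bytes, entry_bytes):
    """Rough index size: one fixed-size entry per stored chunk."""
    return (repo_bytes // avg_chunk_bytes) * entry_bytes


# Assumed: 10KB average chunks, 20 bytes per entry (one 160-bit hash).
index_10kb = hash_index_bytes(10 * TB, 10_000, 20)

# Assumed: smaller 4KB chunks, same entry size.
index_4kb = hash_index_bytes(10 * TB, 4_000, 20)

print(index_10kb / GB, "GB with 10KB chunks")
print(index_4kb / GB, "GB with 4KB chunks")
```

The same arithmetic shows why chunk size is such a sensitive design choice: under these assumptions, shrinking the average chunk from 10KB to 4KB grows the index by a factor of 2.5.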

Hybrid solutions
Sepaton Inc., Marlborough, MA, recently added a data-reduction option called ContentAware DeltaStor to its VTL products; it's software that uses a mix of several technologies. Before it does any capacity optimization, the system studies the incoming data within the context of all meta data associated with it (file name, file type, owner, backup software that produced it, new or old file, etc.). Based on this and previously received meta data, it intelligently separates what is likely new data from existing data. It then categorizes the data into two primary compressor streams: Data Comparator and Data Discrimination. The Data Comparator stream applies a light hash algorithm to confirm equality, while Data Discrimination conducts a detailed byte-level comparison and stores only unique bytes. The ContentAware database is at the core of this approach and demonstrates how content-awareness can reduce the amount of computation required to isolate and store only unique pieces of data. With this system, data reduction is performed after the backup has completed.

Byte-level delta differencing
ExaGrid Systems Inc., a Westborough, MA-based vendor of grid-based data protection products, uses a well-established versioning method to reduce the amount of backup data. It recognizes that a backup stream is simply a modified version of what was received before, does a byte-level comparison and stores only unique bytes. Because most recoveries require the latest version of the data, the most recent version is kept intact and deltas are applied to it to re-create older versions. New data arrives with new meta data (file name, author, etc.) and is kept as a complete unit; if it's later modified, delta differencing comes into play.

It's important to note that this technique won't achieve any data reduction for reference information. This type of data comprises objects that are fixed and will undergo no further change, such as satellite images, seismic datasets and radiological images.
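The keep-the-latest, delta-the-rest idea can be sketched with Python's standard `difflib`. This is only an illustration of reverse delta differencing in general, not ExaGrid's implementation; the function names and the delta format are assumptions, and `SequenceMatcher` is far too slow for real backup streams.

```python
from difflib import SequenceMatcher


def make_reverse_delta(new, old):
    """Describe `old` in terms of `new`: ranges to copy plus literal bytes."""
    delta = []
    matcher = SequenceMatcher(None, new, old, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))       # bytes shared with `new`
        elif j2 > j1:
            delta.append(("data", old[j1:j2]))   # bytes found only in `old`
        # A pure "delete" (bytes only in `new`) needs no entry at all.
    return delta


def apply_delta(new, delta):
    """Re-create the older version from the intact newest version."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += new[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


v1 = b"quarterly report: revenue 100, costs 80"
v2 = b"quarterly report: revenue 120, costs 80, headcount 12"

# v2 (the latest) is kept whole; v1 is kept only as a delta against v2.
delta_v1 = make_reverse_delta(v2, v1)
literal_bytes = sum(len(op[1]) for op in delta_v1 if op[0] == "data")
print(literal_bytes, "literal bytes stored for a", len(v1), "byte version")
```

Restoring the latest version needs no reconstruction at all, while older versions cost one `apply_delta` pass each. The same sketch also shows why fixed reference data gains nothing: an object that is never modified never produces a second version to difference against.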

Key considerations
There are many other questions to consider when deciding whether a data-reduction product is right for your organization. These include:

Where's data reduction performed? Data reduction can occur in three places: on the application server, inline in front of the backup server, or after uncompressed backup data is stored on disk. Data reduction at the application server takes compute cycles from application processing and may impact application performance, but the backups are highly efficient and network traffic is drastically reduced. Overall throughput may be impacted and you can't use your existing backup software--your backup and restore procedures have to change significantly--but the overall payoff can be excellent. Avamar's Axion and HP's RISS are the only two products on the market that fall into this category.

Data Domain and Diligent offer inline solutions with an appliance that sits in front of the backup server, intercepts the data stream and performs data reduction. These vendors use different data-reduction methods, but both place their appliances in-band; because data is reduced before it's written, neither requires disk space beyond the size of the reduced data.

Sepaton's S2100-ES2 with DeltaStor performs backups at full speed, unimpeded by the data-reduction engine. In the background, the capacity-optimization engine takes over and creates new "virtual cartridges" that replace those containing the original data. Disk space is released and the process repeats itself. This approach requires more disk capacity than other techniques, but delivers the best backup speed. ExaGrid's InfiniteFiler is also in this category.

When is data reduction performed? Avamar, Data Domain and Diligent do their data reduction whenever a backup happens. Backup isn't complete until the data reduction is finished; overall throughput is largely determined by the efficiency of the data-reduction engine. With ExaGrid and Sepaton, the backup speed is limited only by the normal efficiencies of the backup infrastructure. All capacity optimization happens after the fact, and the speed of data reduction can be throttled via policy to ensure backups always enjoy a higher priority.

How is data reconstructed? Because files are often broken down into blocks or chunks and stored only once, the integrity of the software that constructs an object from these smaller elements is critical. So it's imperative that the meta data required to construct a file is highly available. The design of the file system that reconstructs files or databases is just as important as the data-reduction algorithm.

What level of data integrity is achieved? The possibility of two files producing the same hash value is extremely remote, but if that's unacceptable, you need to look past hash-based solutions. Also ask vendors to explain if data grooming occurs in the background to ensure data integrity and recoverability.
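To put "extremely remote" into numbers, the standard birthday-bound approximation can be used. This is a rough estimate under stated assumptions: hash outputs are treated as uniformly random, only accidental collisions are considered, and the figure of one trillion stored chunks is hypothetical.

```python
def collision_probability(num_chunks, hash_bits):
    """Birthday-bound approximation, valid while the result is small:
    p is roughly n * (n - 1) / 2 ** (bits + 1)."""
    return num_chunks * (num_chunks - 1) / 2 ** (hash_bits + 1)


chunks = 10**12  # hypothetical repository holding a trillion unique chunks

p_128 = collision_probability(chunks, 128)  # 128-bit hash, e.g., MD5
p_160 = collision_probability(chunks, 160)  # 160-bit hash, e.g., SHA-1

print("128-bit:", p_128)
print("160-bit:", p_160)
```

Under these assumptions the chance of any accidental collision anywhere in the repository is on the order of 10^-15 for a 128-bit hash and 10^-25 for a 160-bit hash. Whether that's acceptable is a policy question, not a mathematical one.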

What about performance? If backup speed is a critical issue, you should examine the throughput speeds of inline products to ensure they're adequate. You may be better off with a product that performs data reduction after backups are complete.

How scalable is the product, and what happens if a single appliance maxes out? The scalability of data-reduction products varies considerably. Avamar uses the redundant array of independent nodes (RAIN) architecture to scale; Diligent uses clustering; ExaGrid, HP RISS and Sepaton use grid principles to grow their appliances to larger capacities; and Data Domain uses a single-appliance concept. Management may be a concern as well, as the system grows to multiple appliances.

How big is the index? If data commonality checks are done in memory, the size of the index matters. With a small index like Diligent's, all searches can be done on a single server, which improves performance. If a product requires the index to be large or distributed, how it coordinates the parts may be an issue.

Is the degree of data reduction acceptable? In general, you should expect data reductions of 10:1 to 25:1. Most data-reduction products also offer hardware compression, which could add an extra 1.6:1 to 3:1, depending on the type of data. All together, the effective data reductions can easily be in the 20:1 to 30:1 range, assuming at least a few months' worth of data is kept on disk. Backup procedures also have an impact. If you do daily fulls, expect huge data reductions; for weekly fulls/daily incrementals, the rate is more modest.
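The effect of backup procedures on the reduction ratio can be estimated with a simple model. Everything here is an assumption made for illustration: a data set of constant size, a fixed fraction of it changing each day, every changed byte being unique, and compression multiplying cleanly on top of deduplication. Real results depend heavily on the data.

```python
def daily_full_ratio(days, change_rate):
    """Daily fulls: `days` full copies arrive, but only the first full plus
    each later day's changed fraction is unique. Units: one full backup."""
    received = days
    stored = 1 + (days - 1) * change_rate
    return received / stored


def weekly_full_ratio(weeks, change_rate):
    """Weekly fulls plus six daily incrementals per week. The incrementals
    already carry only changed data, so far less redundant data arrives."""
    received = weeks * (1 + 6 * change_rate)
    stored = 1 + (weeks * 7 - 1) * change_rate
    return received / stored


def effective_ratio(dedup_ratio, compression_ratio):
    """Assumes the two reductions are independent and simply multiply."""
    return dedup_ratio * compression_ratio


# Roughly three months on disk, with an assumed 2% daily change rate.
daily = daily_full_ratio(days=90, change_rate=0.02)
weekly = weekly_full_ratio(weeks=13, change_rate=0.02)
combined = effective_ratio(10, 2)  # e.g., 10:1 reduction with 2:1 compression

print(round(daily, 1), round(weekly, 1), combined)
```

Under these assumptions, daily fulls yield a reduction above 30:1 while the weekly full/daily incremental schedule yields only about 5:1. The stored capacity is the same in both cases; the difference is how much redundant data the backup schedule sends in the first place.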

An evolving technology
Data protection is undergoing a sea change, with the number and type of products hitting the market at an unprecedented level. Given the extreme data growth rates, adding disk-based data protection is no longer an option but a requisite. Yet picking the right technology has never been more difficult. The fundamentals of data reduction should provide enough knowledge to ask the right questions and seek straight answers from vendors.

This was first published in July 2006
