This article can also be found in the Premium Editorial Download "Storage magazine: Lessons learned from creating and managing a scalable SAN."
Pros and cons of various data-reduction technologies
A new data-reduction method based on pattern recognition was introduced by Diligent Technologies Corp., a Framingham, MA-based virtual tape library (VTL) vendor. With pattern recognition, incoming data is examined to see whether it resembles data received in the past. If it does, the precise difference is identified and only the unique bytes are stored. The algorithms are sophisticated, and the approach is superior at least in terms of indexing: the index is so small, even for large repositories, that inexpensive servers can be used as the data-reduction engine. For instance, a 1 petabyte (PB) repository requires only a 4GB index, which can easily be held in the cache of a small server. By comparison, a chunk-based hashing methodology would require a 20GB cache for a 10TB repository. At least conceptually, and provided the index resides in cache, this index efficiency translates into better performance, all else being equal.
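To put the two index figures quoted above on a common footing, the following sketch normalizes each to index bytes per terabyte of repository. It assumes decimal units (1GB = 10^9 bytes) and, for the projection to 1PB, that a chunk-based index grows linearly with repository size. Both are assumptions made for illustration, not vendor statements.

```python
# Decimal storage units (an assumption; the article does not say which it uses).
GB, TB, PB = 10**9, 10**12, 10**15


def index_bytes_per_tb(index_bytes: int, repository_bytes: int) -> int:
    """Index size needed for each terabyte of repository, in bytes."""
    return index_bytes * TB // repository_bytes


# Figures quoted in the article.
pattern_recognition = index_bytes_per_tb(4 * GB, 1 * PB)   # 4GB index, 1PB repository
chunk_hashing = index_bytes_per_tb(20 * GB, 10 * TB)       # 20GB index, 10TB repository

# How much larger the chunk-based index is per terabyte stored.
ratio = chunk_hashing // pattern_recognition

# Projection only: assumes the chunk-based index scales linearly to 1PB.
chunk_index_for_1pb = chunk_hashing * (PB // TB)

print(f"pattern recognition: {pattern_recognition / 10**6:.0f} MB of index per TB")
print(f"chunk-based hashing: {chunk_hashing / GB:.0f} GB of index per TB")
print(f"ratio: {ratio}x")
print(f"chunk-based index at 1PB (linear projection): {chunk_index_for_1pb / TB:.0f} TB")
```

On these figures, the chunk-based index is roughly 500 times larger per terabyte stored, which suggests why it would be hard to hold in a small server's cache at petabyte scale.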
Sepaton Inc., Marlborough, MA, recently added a data-reduction option called ContentAware Delta-Stor to its VTL products. The software uses a mix of several technologies. Before it does any capacity optimization, the system examines the incoming data in the context of all the meta data associated with it (file name, file type, owner, backup software that produced it, new or old file, etc.). Based on this and previously received meta data, it separates what is likely new data from existing data. It then categorizes the data into two primary compressor streams: Data Comparator and Data Discrimination. The Data Comparator stream applies a light hash algorithm to confirm equality, while Data Discrimination conducts a detailed byte-level comparison and stores only unique bytes. The ContentAware database is at the core of this approach and demonstrates how content awareness can reduce the amount of computation required to isolate and store only unique pieces of data. With this system, data reduction is performed after the backup has completed.
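One way to picture the split between a cheap equality check and a detailed byte-level comparison is a small routing function. This is a hypothetical sketch, not Sepaton's implementation: the function `route`, the `catalog` dictionary keyed by file name, and the use of CRC-32 as the "light hash" are all assumptions chosen for illustration.

```python
import zlib


def route(file_name: str, data: bytes, catalog: dict) -> str:
    """Decide how to handle an incoming object using previously seen meta data.

    `catalog` maps a file name to the checksum of the last version received.
    Returns "new", "duplicate", or "byte-level compare".
    """
    checksum = zlib.crc32(data)          # cheap checksum standing in for a light hash
    previous = catalog.get(file_name)    # what do we already know about this name?
    catalog[file_name] = checksum        # remember the latest version

    if previous is None:
        return "new"                     # never seen: store as a complete unit
    if previous == checksum:
        return "duplicate"               # cheap check suggests nothing changed
    return "byte-level compare"          # changed: find and store only unique bytes


catalog = {}
print(route("budget.xls", b"Q1 numbers", catalog))          # first sighting
print(route("budget.xls", b"Q1 numbers", catalog))          # unchanged copy
print(route("budget.xls", b"Q1 numbers, revised", catalog)) # modified copy
```

The point of the sketch is only that meta data lets the system skip the expensive comparison for objects that are clearly new or clearly unchanged. A production system would need a stronger equality check than a 32-bit checksum.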
Byte-level delta differencing
ExaGrid Systems Inc., a Westborough, MA-based vendor of grid-based data protection products, relies on versioning, a reliable method, to reduce the amount of backup data. It treats a backup stream as a modified version of what was received before, performs a byte-level comparison and stores only the unique bytes. Because most recoveries require the latest version of the data, the most recent file is kept intact and delta differencing is applied to reconstruct older versions. New data arrives with new meta data (file name, author, etc.) and is kept as a complete unit. If it's later modified, delta differencing comes into play.
It's important to note that this technique won't achieve any data reduction for reference information. This type of data comprises objects that are fixed and will undergo no further change, such as satellite images, seismic datasets and radiological images.
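The idea of keeping the latest version intact and rebuilding older versions from stored differences can be sketched as follows. This is a minimal illustration under stated assumptions, not ExaGrid's algorithm: the function names and the delta format are hypothetical, and Python's standard `difflib.SequenceMatcher` stands in for a real byte-level differencing engine.

```python
from difflib import SequenceMatcher


def make_reverse_delta(latest: bytes, older: bytes) -> list:
    """Describe `older` in terms of `latest`: copy ranges plus literal bytes."""
    ops = []
    matcher = SequenceMatcher(None, latest, older, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))          # bytes already held in `latest`
        elif tag in ("replace", "insert"):
            ops.append(("data", older[j1:j2]))    # unique bytes that must be stored
        # "delete": bytes present only in `latest`; not needed to rebuild `older`
    return ops


def apply_reverse_delta(latest: bytes, ops: list) -> bytes:
    """Rebuild the older version from the intact latest version and its delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += latest[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


def stored_bytes(ops: list) -> int:
    """Count the literal bytes the delta has to keep."""
    return sum(len(op[1]) for op in ops if op[0] == "data")


latest = b"quarterly report v2: revenue up 7%"
older = b"quarterly report v1: revenue up 5%"

delta = make_reverse_delta(latest, older)
assert apply_reverse_delta(latest, delta) == older
print(f"older version: {len(older)} bytes, literal bytes in delta: {stored_bytes(delta)}")
```

Restoring the latest version needs no reconstruction at all, while each older version costs one delta application. For fixed reference data that never changes, there is no earlier version to difference against, so no delta is created and no reduction results.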
This was first published in July 2006