This article can also be found in the Premium Editorial Download "Storage magazine: Lessons learned from creating and managing a scalable SAN."

Hash-based commonality factoring
Hash-based commonality factoring is the dominant capacity-optimization technology today. It's used in products such as Avamar Technologies Inc.'s Axion, Data Domain Inc.'s DD400 Enterprise Series (which uses the company's Global Compression Technology), EMC Corp.'s Centera and Hewlett-Packard (HP) Co.'s Reference Information Storage System (RISS). A hash is the result of applying an algorithm to some data to derive a fixed-length number that serves as a fingerprint of that data. Identical data always produces the same hash, and it's extremely unlikely that any two different files would produce the same hash result.
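To make the idea of a hash concrete, here is a minimal sketch using Python's standard hashlib module. The choice of SHA-1 and the sample strings are assumptions for illustration only; they aren't taken from any of the products named above.

```python
import hashlib

def digest(data: bytes) -> str:
    """Return the SHA-1 digest of data as a 40-character hex string."""
    return hashlib.sha1(data).hexdigest()

# Sample content is made up for this illustration.
a = digest(b"quarterly report, final version")
b = digest(b"quarterly report, final version")
c = digest(b"quarterly report, final version!")

print(a == b)   # identical content yields an identical digest
print(a == c)   # a single added byte yields a different digest
print(len(a))   # 40 hex characters, i.e. 160 bits, whatever the input size
```

The digest has the same length no matter how large the input is, which is what makes it practical to use as a compact stand-in for the data.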

Hashes were originally created to ensure file authenticity. If a file and its hash were transmitted to another location, the recipient could verify the file by recalculating its hash and matching the result to the transmitted hash. Today, hashes are used for capacity optimization in a slightly different way. Each file (or subset of a file, called a chunk) is converted into a hash. If a hash comparison shows that the file or chunk is already stored, it's never stored again. Because complete files (or chunks) are much larger than their hashes, comparing hashes is computationally much cheaper than comparing complete files for duplication. The hash approach can work company-wide because a specific file will create the same hash regardless of where it resides. A file can always be addressed by its hash, no matter where it resides geographically or on which system. There are no path names or file names.
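The store-once behavior described above can be sketched in a few lines of Python. The ContentStore class, its method names and the sample data are hypothetical, and SHA-1 is an assumed choice; this is a toy in-memory illustration, not how any of the products mentioned here is implemented.

```python
import hashlib

class ContentStore:
    """Toy in-memory store that keeps one copy of each unique blob (hypothetical)."""

    def __init__(self):
        self.blobs = {}  # hex digest -> bytes

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        # Only digests are compared; the payload is stored the first time only.
        if key not in self.blobs:
            self.blobs[key] = data
        return key  # the digest is the address; no path or file name is involved

    def get(self, key: str) -> bytes:
        data = self.blobs[key]
        # Recomputing the digest on read doubles as an authenticity check.
        if hashlib.sha1(data).hexdigest() != key:
            raise ValueError("stored data does not match its address")
        return data

store = ContentStore()
k1 = store.put(b"same attachment")   # saved at one site
k2 = store.put(b"same attachment")   # saved again somewhere else
k3 = store.put(b"different file")

print(k1 == k2)          # both copies resolve to the same address
print(len(store.blobs))  # only two blobs are actually held
```

Because the address is derived from the content alone, two sites that save the same data arrive at the same key without coordinating with each other.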

EMC's Centera uses the MD5 hash standard to handle data archiving, whereas Avamar applies 160-bit SHA-1 hashing (SHA160) to variable-sized file chunks (and fixed-sized chunks for databases) in its backup and restore product. Data Domain uses hash technology in its backup/restore product, but the firm is tight-lipped about which method it uses. HP's RISS, an archival platform, also uses SHA160 and allows users to choose among complete files, fixed-sized chunks or variable-sized chunks.

Some design tradeoffs must be made with all hashing solutions, and chunk size is one of them. Smaller chunk sizes expose more commonality, but require larger indexes and more compute power. EMC chose the whole file rather than a chunk as Centera's basic element for hashing; therefore, it eliminates duplication only at the file level. This keeps the system simple and the searches fast, but it doesn't achieve capacity-optimization levels as high as the Avamar system, for example.
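The chunk-size tradeoff can be illustrated with a small experiment. The sketch below is built entirely on assumptions: the two synthetic "files," the chunk sizes and the helper names are invented for the illustration, and real data will behave differently.

```python
import hashlib
import random

rng = random.Random(0)  # fixed seed so the synthetic data is repeatable

def rand_bytes(n: int) -> bytes:
    return bytes(rng.getrandbits(8) for _ in range(n))

def dedup_stats(data: bytes, chunk_size: int):
    """Return (index entries, bytes actually stored) for fixed-size chunking."""
    index = {}  # digest -> chunk length
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        index.setdefault(hashlib.sha1(chunk).digest(), len(chunk))
    return len(index), sum(index.values())

# Two 8,192-byte files that share their first 4,096 and last 512 bytes.
shared, tail = rand_bytes(4096), rand_bytes(512)
file_a = shared + rand_bytes(3584) + tail
file_b = shared + rand_bytes(3584) + tail
data = file_a + file_b

# 8,192 bytes is the whole-file case; the smaller sizes are chunked.
results = {size: dedup_stats(data, size) for size in (8192, 4096, 512)}
for size, (entries, stored) in results.items():
    print(f"chunk={size:5d}  index entries={entries:3d}  bytes stored={stored}")
```

On this synthetic input, whole-file hashing finds nothing to share, while the 512-byte chunks store the least data but need the most index entries, which is the tradeoff described above.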

Another consideration is whether to use fixed- or variable-sized chunks. Fixed-sized chunks are easier to handle, but suffer from the "slide" syndrome: when a byte is inserted, every chunk after that location shifts and requires a new hash, even though the data itself is unchanged. Variable-sized chunking can absorb the slide effect. Chunk boundaries typically follow the content rather than fixed offsets, so only the chunk containing the change needs to be created anew. Databases have more structured formats with well-defined and often fixed-length fields, so many data-reduction products use a fixed-chunk approach for them.
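The slide effect is easy to reproduce. The following sketch compares fixed-size chunking with a simple content-defined scheme that cuts wherever a fingerprint of the last few bytes hits a chosen value. Everything here is an assumption made for illustration: the window size, the divisor, the use of CRC-32 as the fingerprint and the function names. The fingerprint is also recomputed at every position for clarity, where a real system would more likely use a rolling hash. It isn't a description of any vendor's algorithm.

```python
import hashlib
import random
import zlib

WINDOW = 16     # bytes examined when deciding on a boundary (assumed value)
DIVISOR = 256   # gives roughly one boundary per 256 bytes (assumed value)

def fixed_chunks(data: bytes, size: int = 512):
    return [data[i:i + size] for i in range(0, len(data), size)]

def variable_chunks(data: bytes):
    """Cut after any byte whose trailing WINDOW bytes fingerprint to 0 mod DIVISOR."""
    chunks, start = [], 0
    for i in range(WINDOW - 1, len(data)):
        window = data[i - WINDOW + 1:i + 1]
        if zlib.crc32(window) % DIVISOR == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # whatever follows the last boundary
    return chunks

def new_chunk_count(before, after):
    """Count chunks in `after` whose hash was not already known from `before`."""
    known = {hashlib.sha1(c).digest() for c in before}
    return sum(1 for c in after if hashlib.sha1(c).digest() not in known)

rng = random.Random(42)  # fixed seed so the synthetic data is repeatable
original = bytes(rng.getrandbits(8) for _ in range(20000))
edited = original[:1000] + b"X" + original[1000:]  # insert a single byte

new_fixed = new_chunk_count(fixed_chunks(original), fixed_chunks(edited))
new_variable = new_chunk_count(variable_chunks(original), variable_chunks(edited))
print("new chunks with fixed-size chunking:   ", new_fixed)
print("new chunks with variable-size chunking:", new_variable)
```

With fixed-size chunks, every chunk after the insertion point gets a new hash. With content-defined boundaries, the cut points move along with the data, so only the chunk or chunks around the inserted byte are new.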

This was first published in July 2006
