Squeezing the air out of data
Data compression has been around since the 1970s, when computer scientists Abraham Lempel and Jacob Ziv developed the Lempel-Ziv (LZ) algorithm. But with the advent of other data-reduction technologies in the past five years, "compression" has taken on a more generic, and sometimes confusing, definition.

Broadly speaking, compression means modifying data in such a way that the information remains intact, but occupies a smaller amount of storage. Technologies such as data deduplication, single instancing, commonality factoring and data coalescence can also achieve this goal.

A comprehensive comparison of different data-reduction approaches is available as a PDF sidebar.

Lempel and Ziv created an algorithm that compares each incoming piece of data with the most recently stored data to determine whether the new data is unique or repeats something seen shortly before. If it's a repeat, the new data is replaced with a short token that points back to the earlier occurrence. Comparing each new piece of data with all the data that came before would be computationally prohibitive, so the LZ algorithm uses only a small amount of historical data. Compression ratios are therefore limited, with typical backup data yielding about 2:1 compression. Most tape drives and tape libraries have this compression capability built in.
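
The idea is easier to see in code. What follows is a minimal, unoptimized Python sketch of an LZ77-style sliding-window compressor; the window size, match limit and token format are illustrative assumptions, and real implementations (including the hardware compression built into tape drives) are considerably more sophisticated.

# A deliberately simple LZ77-style sketch: scan a small window of recent
# history for the longest match, and emit (offset, length, next_byte) tokens.
# The window and match limits below are illustrative, not tuned.

WINDOW_SIZE = 4096   # how much history the algorithm is willing to search
MAX_MATCH = 32       # cap on match length for this sketch

def lz_compress(data: bytes):
    tokens = []
    i = 0
    while i < len(data):
        best_offset, best_length = 0, 0
        window_start = max(0, i - WINDOW_SIZE)
        # Search the window for the longest run that matches the lookahead.
        for j in range(window_start, i):
            length = 0
            while (length < MAX_MATCH and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_length:
                best_offset, best_length = i - j, length
        next_byte = data[i + best_length] if i + best_length < len(data) else None
        tokens.append((best_offset, best_length, next_byte))
        i += best_length + 1
    return tokens

def lz_decompress(tokens) -> bytes:
    out = bytearray()
    for offset, length, next_byte in tokens:
        start = len(out) - offset
        for k in range(length):       # copy byte by byte so overlapping
            out.append(out[start + k])  # matches expand correctly
        if next_byte is not None:
            out.append(next_byte)
    return bytes(out)

if __name__ == "__main__":
    sample = b"backup backup backup data data data " * 10
    tokens = lz_compress(sample)
    assert lz_decompress(tokens) == sample
    print(f"{len(sample)} input bytes -> {len(tokens)} tokens")

The key point is the fixed-size window: anything that repeats outside that window is never recognized as a duplicate, which is the limitation the newer approaches described below try to address.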

Beyond LZ
Many of the traditional data-reduction technologies have significant limitations, including:

  • LZ and its variants offer limited compression because they compare incoming data to only a small amount of existing data.
  • Compression ratios vary depending on data type.
  • Current systems look only at commonality within one system. Global enterprises often have hundreds, and sometimes thousands, of servers with vast amounts of data duplication across systems.
Several new approaches on the market use four fundamental technologies. They are:
  • Hash-based commonality factoring (a code sketch of this approach follows below)
  • Pattern recognition
  • A hybrid approach combining the two methods
  • Byte-level delta differencing using versioning
All of these methodologies can be used in conjunction with compression technologies such as LZ (see "Comparison of different data-reduction approaches," PDF file).
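
As a concrete illustration of the first approach, below is a minimal Python sketch of hash-based commonality factoring. It assumes fixed-size 4 KB chunks and SHA-256 fingerprints, which are simplifying assumptions; commercial products typically use variable-size, content-defined chunking, a persistent index and safeguards against hash collisions.

import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking keeps the sketch simple; many
                   # products use variable-size, content-defined chunking

class DedupStore:
    """Toy illustration of hash-based commonality factoring: identical
    chunks are stored once and referenced by their hash thereafter."""

    def __init__(self):
        self.chunks = {}   # hash -> chunk bytes (the single stored instance)

    def write(self, data: bytes):
        """Split data into chunks and return the list of chunk hashes
        (the 'recipe' needed to reconstruct the original stream)."""
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:      # store new chunks only
                self.chunks[digest] = chunk
            recipe.append(digest)
        return recipe

    def read(self, recipe) -> bytes:
        return b"".join(self.chunks[d] for d in recipe)

if __name__ == "__main__":
    store = DedupStore()
    backup_monday = b"A" * 8192 + b"B" * 8192
    backup_tuesday = b"A" * 8192 + b"C" * 8192   # half the data is unchanged
    r1 = store.write(backup_monday)
    r2 = store.write(backup_tuesday)
    assert store.read(r1) == backup_monday and store.read(r2) == backup_tuesday
    stored = sum(len(c) for c in store.chunks.values())
    print(f"logical bytes: {len(backup_monday) + len(backup_tuesday)}, stored: {stored}")

Because chunks are identified by their content hash rather than by which server or backup stream they came from, the same index can in principle recognize duplicates across many systems, which is what distinguishes these approaches from per-stream LZ compression.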

This was first published in July 2006
