This article can also be found in the Premium Editorial Download "Storage magazine: Lessons learned from creating and managing a scalable SAN."
Squeezing the air out of data
Data compression has been around since the 1970s, when computer scientists Abraham Lempel and Jacob Ziv developed the Lempel-Ziv (LZ) algorithm. But with the advent of other technologies in the past five years, "compression" has taken on a more generic--and sometimes confusing--definition.
Broadly speaking, compression means modifying data in such a way that the information remains intact, but occupies a smaller amount of storage. Technologies such as data deduplication, single instancing, commonality factoring and data coalescence can also achieve this goal.
Comparison of different data-reduction approaches: a comprehensive comparison is available as a PDF.
Lempel and Ziv created an algorithm that compares each incoming piece of data with the most recently stored data to determine whether the new data is unique or repeats data seen earlier. If it repeats, the new data is replaced with a short token that points back to the earlier copy. Comparing each new piece of data with all the data that came in before would be computationally prohibitive, so the LZ algorithm uses only a small amount of historical data. Compression ratios are therefore limited, with typical backup data yielding about 2:1 compression. Most tape drives and tape libraries have this compression capability built in.
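To make the window limitation concrete, here is a minimal sketch of a sliding-window LZ-style compressor in Python. It is an illustration only, not the code used in any tape drive or library: the function names, the token format and the `window` and `min_match` parameters are all assumptions made for this example.

```python
def lz_compress(data: bytes, window: int = 32, min_match: int = 3) -> list:
    """Return tokens: a literal byte (int) or an (offset, length) back-reference."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_off = 0, 0
        start = max(0, i - window)  # only the most recent `window` bytes are searched
        for j in range(start, i):
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= min_match:
            tokens.append((best_off, best_len))  # token pointing back at earlier data
            i += best_len
        else:
            tokens.append(data[i])  # no useful match: emit the byte unchanged
            i += 1
    return tokens


def lz_decompress(tokens: list) -> bytes:
    """Rebuild the original bytes from the token list."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            off, length = tok
            for _ in range(length):
                out.append(out[-off])  # copy byte by byte so overlapping matches work
        else:
            out.append(tok)
    return bytes(out)


# The same 12-byte block appears twice, 112 bytes apart.
sample = b"backup-block" + bytes(range(128, 228)) + b"backup-block"
print(len(lz_compress(sample, window=16)))   # 124: the repeat is outside the window
print(len(lz_compress(sample, window=256)))  # 113: the repeat becomes a single token
```

With the small window, the second copy of the block is too far back to be found, so nothing is saved. Widening the window finds it, at the cost of more comparisons per byte.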
Beyond LZ
Many of the traditional data-reduction technologies have significant limitations, including:
- LZ and its variants offer limited compression because they compare incoming data only to a small amount of existing data.
- Compression ratios vary depending on data type.
- Current systems look only at commonality within one system. Global enterprises often have hundreds, and sometimes thousands, of servers with vast amounts of data duplication across systems.
Newer data-reduction approaches include:
- Hash-based commonality factoring
- Pattern recognition
- A hybrid approach combining the two methods
- Byte-level delta differencing using versioning
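As a rough illustration of hash-based commonality factoring from the list above, the following Python sketch stores each unique chunk of data once and keeps a list of hashes for rebuilding it. The article does not describe any product's internals, so everything here is an assumption made for the example: fixed-size chunks, SHA-256 as the hash, an in-memory dictionary as the chunk store, and the hypothetical function names `dedupe` and `restore`.

```python
import hashlib


def dedupe(data: bytes, store: dict, chunk_size: int = 4096) -> list:
    """Split data into fixed-size chunks, keep each unique chunk once,
    and return the list of hashes ("recipe") needed to rebuild the data."""
    recipe = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # stored only if this hash is new
        recipe.append(digest)
    return recipe


def restore(recipe: list, store: dict) -> bytes:
    """Reassemble the original data from its recipe."""
    return b"".join(store[d] for d in recipe)


# Two servers back up the same file into one shared store (hypothetical data).
shared_store = {}
file_on_server_a = b"A" * 8192 + b"B" * 4096
file_on_server_b = file_on_server_a

recipe_a = dedupe(file_on_server_a, shared_store)
recipe_b = dedupe(file_on_server_b, shared_store)

print(len(recipe_a) + len(recipe_b))  # 6 chunks referenced in total
print(len(shared_store))              # 2 unique chunks actually stored
```

Because the store is shared, the second server's copy adds no new chunks, which is the kind of cross-system commonality that a window-limited compressor cannot see.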
This was first published in July 2006