Compression is a reduction in the number of bits needed to represent data. Compressing data can save storage capacity, speed file transfer, and decrease costs for storage hardware and network bandwidth.
Compression is performed by a program that uses a formula or algorithm to determine how to shrink the size of the data. For instance, an algorithm may represent a string of bits, or 0s and 1s, with a smaller string by using a dictionary for the conversion between them, or it may insert a reference or pointer to a string of 0s and 1s that the program has already seen.
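As a minimal sketch of the dictionary idea, the following toy example (the two-entry dictionary and the single-letter codes are hypothetical, and it assumes the codes never appear in the input) swaps long bit patterns for shorter codes and back:

```python
# Toy dictionary coder: longer bit patterns map to shorter codes.
# Assumes the code characters "A" and "B" never occur in the raw input.
DICTIONARY = {"1111": "A", "0000": "B"}

def dict_compress(bits: str) -> str:
    # Replace each dictionary pattern with its shorter code.
    for pattern, code in DICTIONARY.items():
        bits = bits.replace(pattern, code)
    return bits

def dict_decompress(encoded: str) -> str:
    # Reverse the substitution to restore the original bit string.
    for pattern, code in DICTIONARY.items():
        encoded = encoded.replace(code, pattern)
    return encoded

original = "111100001111"
compressed = dict_compress(original)      # 12 characters become 3
assert dict_decompress(compressed) == original
```

Real dictionary coders such as the LZ family build the dictionary adaptively from the data itself rather than fixing it in advance.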
Text compression can be as simple as removing all unneeded characters, replacing a run of repeated characters with a single character and a count, or substituting a smaller bit string for a frequently occurring one. Compression can reduce a text file to 50% of its original size, and often to considerably less.
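The repeated-character technique is known as run-length encoding. A toy sketch, assuming every run is shorter than ten characters so each count fits in a single digit:

```python
from itertools import groupby

def rle_encode(text: str) -> str:
    # Replace each run of repeated characters with the character plus a count.
    return "".join(f"{ch}{len(list(run))}" for ch, run in groupby(text))

def rle_decode(encoded: str) -> str:
    # Inverse: expand each (character, single-digit count) pair back into a run.
    return "".join(ch * int(n) for ch, n in zip(encoded[::2], encoded[1::2]))

encoded = rle_encode("AAAABBBCCD")   # "A4B3C2D1"
assert rle_decode(encoded) == "AAAABBBCCD"
```

Note the trade-off visible even in this toy: text with few repeated runs, such as "ABCD", actually grows to "A1B1C1D1" under this scheme.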
For data transmission, compression can be performed on the data content or on the entire transmission unit, including header data. When information is sent or received via the Internet, larger files, either singly or with others as part of an archive file, may be transmitted in a .ZIP, gzip or other compressed format.
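For example, Python's standard gzip module can shrink a payload before transmission and restore it exactly on the receiving side (the sample payload below is illustrative):

```python
import gzip

# Illustrative payload: a header line plus highly repetitive body text.
payload = b"Content-Type: text/plain\r\n" + b"The quick brown fox jumps over the lazy dog. " * 100

compressed = gzip.compress(payload)      # what would actually travel over the network
restored = gzip.decompress(compressed)   # the receiver reverses the compression

assert restored == payload               # transmission is bit-exact
assert len(compressed) < len(payload)    # and uses far fewer bytes
```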
Lossless and lossy compression
Compressing data can be a lossless or lossy process. Lossless compression enables the restoration of a file to its original state, without the loss of a single bit of data, when the file is uncompressed. Lossless compression is the typical approach with executables, as well as text and spreadsheet files, where the loss of words or numbers would change the information.
Lossy compression permanently eliminates bits of data that are redundant, unimportant or imperceptible. Lossy compression is useful with graphics, audio, video and images, where the removal of some data bits has little or no discernible effect on the representation of the content.
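The difference can be illustrated with a toy contrast: a lossless round trip through zlib restores every bit, while a deliberately lossy step (here, zeroing the two low-order bits of each byte, a crude stand-in for the quantization real lossy codecs perform) discards detail that no decoder can recover:

```python
import zlib

data = bytes(range(256)) * 4

# Lossless: decompressing returns the original, bit for bit.
assert zlib.decompress(zlib.compress(data)) == data

# Lossy (illustrative only): zero the two low-order bits of each byte.
# The small differences are discarded permanently and cannot be restored.
lossy = bytes(b & 0b11111100 for b in data)
assert lossy != data
```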
Graphics image compression can be lossy or lossless. Graphic image file formats are typically designed to compress information since the files tend to be large. JPEG is an image file format that supports lossy image compression. Formats such as GIF and PNG use lossless compression.
Compression vs. data deduplication
Compression is often compared to data deduplication, but the two techniques operate differently. Deduplication is a type of compression that looks for redundant chunks of data across a storage system or a file system and replaces each duplicate chunk with a pointer to the original. Compression works at a far smaller scope, shrinking the bit strings within a single data stream, and most compression algorithms remember no more than roughly the last megabyte of data.
File-level deduplication eliminates redundant files and replaces them with stubs pointing to the original file. Block-level deduplication identifies duplicate data at the sub-file level: the system runs each block through a hash algorithm to generate a unique identifier, checks that identifier against an index, and stores only the unique instances. Deduplication typically looks for larger chunks of duplicate data than compression does, and systems can deduplicate using fixed- or variable-sized chunks.
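A block-level scheme along these lines can be sketched as follows. The fixed 8-byte chunk size and SHA-256 identifiers are illustrative choices; production systems use far larger, often variable-sized chunks:

```python
import hashlib

CHUNK_SIZE = 8  # illustrative; real systems use kilobyte-scale or variable chunks

def deduplicate(data: bytes):
    """Split data into fixed-size chunks, store each unique chunk once
    under its hash, and keep an ordered list of hashes as pointers."""
    store = {}       # identifier -> unique chunk instance
    pointers = []    # ordered identifiers that reconstruct the stream
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()  # unique identifier for the index
        store.setdefault(digest, chunk)             # keep only the first instance
        pointers.append(digest)
    return store, pointers

def rehydrate(store, pointers) -> bytes:
    # Follow each pointer back to its stored chunk to rebuild the data.
    return b"".join(store[d] for d in pointers)

data = b"ABCDEFGH" * 3 + b"12345678"   # three duplicate chunks, one unique
store, pointers = deduplicate(data)
assert rehydrate(store, pointers) == data
assert len(store) == 2                 # only two unique chunks are stored
```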
Deduplication is most effective in environments that have a high degree of redundant data, such as virtual desktop infrastructure or storage backup systems. Compression tends to be more effective than deduplication in reducing the size of unique information such as image, audio, video, database and executable files. Many storage systems support both compression and deduplication.
Pros and cons of compression
The main advantages of compression are a reduction in storage hardware, data transmission time and communication bandwidth, and the resulting cost savings. A compressed file requires less storage capacity than an uncompressed file, and the use of compression can lead to a significant decrease in expenses for disk and/or solid-state drives. A compressed file also requires less time for transfer, and it consumes less network bandwidth than an uncompressed file.
The main disadvantage of compression is the performance impact resulting from the use of CPU and memory resources to compress and decompress the data. Many vendors design their systems to minimize the impact of the processor-intensive calculations associated with compression. If compression runs inline, before the data is written to disk, the system may offload the work to dedicated hardware to preserve system resources. For instance, IBM uses a separate hardware acceleration card to handle compression with some of its enterprise storage systems.
If data is compressed after it is written to disk, or post-process, the compression may run in the background to reduce the performance impact. Although post-process compression avoids adding latency to each individual input/output (I/O) operation, it still consumes memory and processor cycles and can affect the overall number of I/Os a storage system can handle. Also, because the data must initially be written to disk or flash drives in uncompressed form, the physical storage savings are not as great as they are with inline compression.
Tools/technologies that use compression
Compression is built into a wide range of technologies, including storage systems, databases, operating systems and software applications used by businesses and enterprise organizations. Compressing data is also common in consumer devices such as laptops, PCs and mobile phones.
Many systems and devices perform compression transparently, but some give users the option to turn compression on or off. Compression can be performed more than once on the same file or piece of data, but subsequent passes yield little or no additional reduction and, depending on the algorithm, may even slightly increase the file's size.
WinZip is a popular Windows program that compresses files when it packages them in an archive. Archive file formats that support compression include ZIP and RAR. The bzip2 and gzip formats see widespread use for compressing individual files.