This article can also be found in the Premium Editorial Download "Storage magazine: Primary storage dishes up dedupe."
Data reduction technologies like data deduplication and compression have been well integrated into backup systems with impressive results. Now those benefits are available for primary storage systems.
Use less disk, save more electricity. What's not to like? If you buy the right products, you can pare down the disk capacity your data needs and maybe even cut your electric bills by as much as 50%. That's the promise of primary storage data reduction. Slashing utility costs is appealing, but there's still plenty of skepticism about the technology's claimed benefits. There's little dispute that this new class of products can reduce the amount of disk your primary storage uses; the uncertainty is whether those gains outweigh the challenges of reducing data on primary storage.
The key questions about primary storage data reduction include the following:
- Why is it called "data reduction" rather than data deduplication?
- Disk is cheap. Why bother adding new technologies to reduce the size of the data it holds?
- What are the different types of data reduction for primary storage?
- How much disk space can actually be saved?
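For the last question, it helps to keep the arithmetic behind reduction ratios in mind. The short Python sketch below converts a ratio such as 2:1 into the fraction of disk it frees up. The function names and the sample ratios are illustrative assumptions, not figures from any product or from this article.

```python
def space_saved_fraction(reduction_ratio: float) -> float:
    """Fraction of capacity freed at a given ratio (2.0 means 2:1)."""
    if reduction_ratio < 1.0:
        raise ValueError("ratio must be at least 1.0 (1:1 means no reduction)")
    return 1.0 - 1.0 / reduction_ratio


def disk_needed_tb(logical_tb: float, reduction_ratio: float) -> float:
    """Physical terabytes needed to hold logical_tb of data after reduction."""
    if reduction_ratio < 1.0:
        raise ValueError("ratio must be at least 1.0 (1:1 means no reduction)")
    return logical_tb / reduction_ratio


if __name__ == "__main__":
    # Hypothetical ratios, chosen only to show the shape of the curve.
    for ratio in (1.5, 2.0, 4.0, 10.0):
        saved = space_saved_fraction(ratio)
        needed = disk_needed_tb(100, ratio)
        print(f"{ratio}:1 saves {saved:.0%}; 100 TB of data needs {needed:.1f} TB of disk")
```

The curve flattens quickly: the first 2:1 frees half the disk, while doubling the ratio again to 4:1 frees only another quarter. Modest ratios on primary storage can therefore still translate into meaningful savings.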
Data reduction defined
In backup environments, data deduplication is a recognized and appropriate term for the technologies that eliminate redundancies in backup sets, but for primary storage, data reduction is a more accurate term because not all
Before we examine these different technologies -- all of which were used for backups before they were applied to primary storage -- let's look at how different primary storage is from backup storage. The main difference is the expectations of whoever (or whatever) is storing or accessing the data. Backups are typically written in large batches by automated processes that are very patient. These processes are accustomed to occasional slowdowns and unavailable resources, and even have built-in mechanisms to accommodate them. Backups are rarely read, and when they are, performance expectations are modest: Someone calls and requests that a file or database be restored, and an administrator initiates the restore. Unless the restore takes an abnormally long time, no one truly notices how long it took. Most people have adjusted their expectations so that they're happy if the restore worked at all. (This is sad, but unfortunately true.) Given this typical usage pattern, you could slow down a disk-based backup system quite a bit without many people noticing.
Primary storage is very different. Data is written to primary storage throughout the day and it's typically written directly by real people who are entering numbers into spreadsheets, updating databases, storing documents or editing multimedia files. These activities could occur dozens, hundreds or even thousands of times a day, and the users know how long it takes when they click "Save." They also know how long it takes to access their documents, databases and websites. Inject something into the process that increases save time or access time from one or two seconds to three or four seconds, and watch your help desk light up like a Christmas tree.
This was first published in April 2010