This article can also be found in the Premium Editorial Download "Storage magazine: Slimmer storage: How data reduction systems work."
Pros and cons of reduction technologies
Thin provisioning is a good technology for reducing the size of initial primary allocations, as most applications don’t use their full space allocation at creation time and end-user storage capacity is typically overspecified to accommodate future growth. While savings from thin provisioning can be as high as 30%, realizing those benefits over time requires maintenance and monitoring to ensure storage “stays thin.” Vendor implementations take radically different approaches to achieving this, but all thin provisioning deployments require host-based support. In addition, as discussed earlier, thin provisioning tackles overallocation of resources, so it won’t realize any savings where the logical storage capacity is already fully utilized physically.
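The allocate-on-write idea behind thin provisioning can be illustrated with a minimal Python sketch. Everything here is hypothetical and for illustration only: the `ThinVolume` class, its method names and the 4 KB block size are assumptions, not any vendor's implementation. The `unmap` call stands in for the host-based support that lets an array reclaim freed space so the volume "stays thin."

```python
class ThinVolume:
    """Toy thin-provisioned volume (hypothetical): physical blocks are
    allocated only when a logical block is first written."""

    def __init__(self, logical_blocks, block_size=4096):
        self.logical_blocks = logical_blocks  # capacity promised to the host
        self.block_size = block_size
        self._blocks = {}                     # logical block number -> data

    def _check(self, lbn):
        if not 0 <= lbn < self.logical_blocks:
            raise IndexError("block is outside the logical volume")

    def write(self, lbn, data):
        self._check(lbn)
        if len(data) > self.block_size:
            raise ValueError("data is larger than one block")
        # Physical space is consumed here, not when the volume is created.
        self._blocks[lbn] = data.ljust(self.block_size, b"\0")

    def read(self, lbn):
        self._check(lbn)
        # Blocks that were never written read back as zeros.
        return self._blocks.get(lbn, bytes(self.block_size))

    def unmap(self, lbn):
        # The host reports that the block is free, so the space is reclaimed.
        self._check(lbn)
        self._blocks.pop(lbn, None)

    @property
    def allocated_blocks(self):
        return len(self._blocks)


vol = ThinVolume(logical_blocks=1000)
for lbn in (0, 1, 2):
    vol.write(lbn, b"data")
vol.unmap(1)
print(vol.allocated_blocks, "of", vol.logical_blocks, "blocks physically allocated")
```

Without the `unmap` step, a block that the host has deleted would keep consuming physical capacity, which is why ongoing monitoring matters.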
Compression is a simple technology to deploy, requiring no user intervention in normal operation, but there are two factors to consider when using it. First, compressed data needs to be “rehydrated” before a user can access it. Rehydration can add latency to data read time, as the data is uncompressed in memory before the I/O request is served. Second, compression algorithms introduce latency into the write I/O cycle. During a write operation, as data is changed, the new compressed data size can increase, making it impossible to re-save the data in its original location. This introduces additional computations, especially when RAID parity calculations
Data deduplication is also a simple technology for users to implement, requiring no additional management overhead. Savings are realized by identifying repeated blocks of identical data, removing the duplicates and placing logical pointers to the single-instance physical copy.
There are two ways in which duplicates are identified: inline or via post processing. Inline dedupe identifies duplicate copies of data as they’re written to the storage array, usually by computing a hash of each block that serves as its unique identifier and looking that identifier up in a hash table. The inline technique requires more processing overhead and can introduce additional latency into the I/O operation; however, it’s more space efficient and can result in less back-end I/O when data doesn’t need to be physically written to disk.
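A rough Python sketch of the inline approach follows. The `InlineDedupeStore` class and its fields are hypothetical, SHA-256 is an assumed choice of hash, and the sketch ignores details a real array must handle, such as reference counting, reclaiming overwritten blocks and hash collisions.

```python
import hashlib


class InlineDedupeStore:
    """Toy inline dedupe (hypothetical): duplicates are detected in the
    write path, before anything is stored."""

    def __init__(self):
        self.blocks = {}          # fingerprint -> single-instance physical copy
        self.pointers = {}        # logical block address -> fingerprint
        self.physical_writes = 0  # back-end writes actually performed

    def write(self, lba, data):
        # Hashing in the write path is the extra processing overhead.
        fingerprint = hashlib.sha256(data).hexdigest()
        if fingerprint not in self.blocks:
            # New content: store it once.
            self.blocks[fingerprint] = data
            self.physical_writes += 1
        # Duplicate or not, the logical address only stores a pointer.
        self.pointers[lba] = fingerprint

    def read(self, lba):
        return self.blocks[self.pointers[lba]]


store = InlineDedupeStore()
store.write(0, b"A" * 4096)
store.write(1, b"B" * 4096)
store.write(2, b"A" * 4096)  # duplicate of block 0, no back-end write
print(store.physical_writes, "physical writes for", len(store.pointers), "logical blocks")
```

The third write never reaches the back end, which is where the reduced back-end I/O comes from.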
Post-processing dedupe scans for duplicate blocks of data asynchronously, as a background task that runs independently of normal I/O operations. This method requires additional storage to accommodate the newly written data before it’s deduplicated, so it isn’t as space efficient as inline processing. However, it does have less impact on host latency. The dedupe process still needs to be scheduled carefully so that it doesn’t impact host I/O performance.
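For comparison, a post-processing pass might look something like the following Python sketch. The `post_process_dedupe` function and the `staged` dictionary are hypothetical names, and SHA-256 is again an assumed choice of hash. The point is that every block is first stored in full, and the duplicates are collapsed only when the background scan runs.

```python
import hashlib


def post_process_dedupe(staged):
    """Collapse duplicates in blocks that were already written in full.

    staged maps logical block address -> block data.
    Returns (store, pointers): single-instance copies keyed by fingerprint,
    and a logical-address-to-fingerprint pointer table.
    """
    store = {}
    pointers = {}
    for lba, data in staged.items():
        fingerprint = hashlib.sha256(data).hexdigest()
        store.setdefault(fingerprint, data)  # keep only the first copy
        pointers[lba] = fingerprint
    return store, pointers


# Host writes land in full first, so space is needed for every block.
staged = {0: b"A" * 4096, 1: b"B" * 4096, 2: b"A" * 4096, 3: b"A" * 4096}
blocks_before = len(staged)

# Later, as a background task, the scan removes the duplicates.
store, pointers = post_process_dedupe(staged)
blocks_after = len(store)
print(blocks_before, "blocks before the scan,", blocks_after, "after")
```

In this toy run, four blocks of space are consumed until the scan completes, compared with two afterward.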
Savings from deduplication vary and can range from 2:1 to 10:1, depending on industry segment and the data itself. For example, virtual desktop infrastructure (VDI) and virtual server deployments see good benefits from deduplication where virtual machines and desktops have been cloned from a single gold master.
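To put those ratios in capacity terms, a reduction ratio of N:1 means the data occupies 1/N of its original space, so the fraction saved is 1 - 1/N. The short Python sketch below works through that arithmetic; `savings_fraction` is a hypothetical helper name.

```python
def savings_fraction(ratio):
    """Fraction of capacity saved for a reduction ratio of ratio:1."""
    if ratio < 1:
        raise ValueError("a reduction ratio cannot be below 1:1")
    return 1 - 1 / ratio


for ratio in (2, 10):
    print(f"{ratio}:1 saves {savings_fraction(ratio):.0%} of capacity")
```

A 2:1 ratio therefore halves the capacity required, while 10:1 removes 90% of it; note that each further step up in ratio yields a smaller additional saving.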
This was first published in October 2012