Published: 05 Oct 2012
Although it’s become a staple of backup systems, data reduction is still just beginning to appear in primary storage systems. Here’s how it works and who’s doing it.
The demand for more data storage capacity at most companies continues to grow at eye-watering rates, in some cases as high as 100% each year. And even though the cost per terabyte of storage is dropping, the net effect is one of increasing costs.
Within the secondary storage market we’ve seen vendors tackling growth by implementing technologies such as data deduplication and compression. For those data types, the use of space-reduction technologies dramatically shrinks the amount of real storage required and can significantly decrease costs. However, in the primary storage market we haven’t seen a widespread deployment of similar space-saving techniques. Data reduction in primary storage (DRIPS) is in its infancy, but we’re starting to see these features being added to products from the big storage vendors. Let’s take a look at the state of the DRIPS marketplace: why space reduction can be useful, how much money you can expect to save and the potential downsides of any implementations.
Data reduction techniques
Vendors have deployed a number of space-reduction techniques, some of which have been available for a while, and others that are new to primary storage platforms. The key data reduction technologies include thin provisioning, compression and data deduplication.
Thin provisioning is already widely implemented by all the major storage array vendors and is pushing down into midmarket and even small office/home office (SOHO) devices. The technology works by eliminating the reserve on unwritten blocks of storage, allowing overprovisioning of storage resources and enabling more logical capacity to be created than is physically available. However, thin provisioning only ensures physical capacity is used more efficiently and doesn’t optimize actual written data.
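To make the idea of overprovisioning concrete, here is a minimal Python sketch of a thin-provisioned pool. The class names (ThinPool, ThinVolume) and the 4 KB block size are assumptions for illustration and don't reflect any vendor's implementation; the point is only that physical blocks are consumed on first write rather than at volume creation.

```python
BLOCK_SIZE = 4096  # bytes per block; an assumed value for illustration


class ThinPool:
    """Hypothetical pool of physical blocks shared by thin volumes."""

    def __init__(self, physical_blocks):
        self.physical_blocks = physical_blocks
        self.used = 0
        self.volumes = []

    def create_volume(self, logical_blocks):
        # No physical space is reserved at creation time.
        volume = ThinVolume(self, logical_blocks)
        self.volumes.append(volume)
        return volume

    def provisioned_blocks(self):
        return sum(v.logical_blocks for v in self.volumes)


class ThinVolume:
    def __init__(self, pool, logical_blocks):
        self.pool = pool
        self.logical_blocks = logical_blocks
        self.blocks = {}  # block number -> data, filled on first write

    def write(self, block_no, data):
        if not 0 <= block_no < self.logical_blocks:
            raise IndexError("block is outside the logical volume")
        if block_no not in self.blocks:
            if self.pool.used >= self.pool.physical_blocks:
                raise RuntimeError("pool is out of physical capacity")
            self.pool.used += 1  # allocate on first write only
        self.blocks[block_no] = data

    def read(self, block_no):
        # Unwritten blocks read back as zeros and consume no physical space.
        return self.blocks.get(block_no, b"\x00" * BLOCK_SIZE)


pool = ThinPool(physical_blocks=100)
vol_a = pool.create_volume(logical_blocks=80)
vol_b = pool.create_volume(logical_blocks=80)
vol_a.write(0, b"x" * BLOCK_SIZE)
print(pool.provisioned_blocks(), pool.physical_blocks, pool.used)  # 160 100 1
```

In this toy pool, 160 logical blocks are provisioned against 100 physical blocks, and only one physical block is in use. The RuntimeError path is the risk that comes with overprovisioning: the pool can run out of real capacity while the volumes still appear to have free space.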
Compression is a space-reduction technique that looks to optimize the data stream by finding repeated patterns of similar information that can be reduced and replaced with an optimized data structure. The compression process is usually performed in-flight as the host writes data. The technology has been around for some time and was in use as long as 25 years ago in IBM tape drives.
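As a rough illustration of how repeated patterns drive compression, the following Python sketch uses the standard zlib module (a general-purpose DEFLATE compressor, not the algorithm of any particular array or tape drive). The sample data and the helper name compression_ratio are assumptions made for the example.

```python
import os
import zlib


def compression_ratio(data):
    """Return original size divided by compressed size."""
    compressed = zlib.compress(data)
    # Lossless: decompressing must return exactly what was written.
    assert zlib.decompress(compressed) == data
    return len(data) / len(compressed)


repetitive = b"customer_record;" * 4096        # highly repetitive 64 KB
random_like = os.urandom(len(repetitive))      # no patterns to exploit

print(f"repetitive data:  {compression_ratio(repetitive):.1f}:1")
print(f"random-like data: {compression_ratio(random_like):.2f}:1")
```

The repetitive buffer shrinks dramatically, while the random buffer doesn't shrink at all, which is why already-compressed or encrypted data tends to see little benefit.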
Data deduplication (or dedupe) looks for repeated patterns of data, usually based on a fixed block size, and reduces them to a single physical instance of the pattern. All references to that block of data then point to the single physical copy. As data is changed, the resulting updates have to be stored elsewhere in the array as a new copy of that data.
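The mechanism can be sketched in a few lines of Python. This is a simplified model under stated assumptions: a fixed 4 KB block size, SHA-256 as the block fingerprint, and in-memory dictionaries standing in for the array's metadata. The function names are hypothetical, and the sketch doesn't reclaim blocks that are no longer referenced.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size


def fingerprint(block):
    return hashlib.sha256(block).hexdigest()


def dedupe(data, block_size=BLOCK_SIZE):
    store = {}     # fingerprint -> the single physical copy
    pointers = []  # one fingerprint per logical block, in order
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fp = fingerprint(block)
        store.setdefault(fp, block)  # keep only the first copy of each pattern
        pointers.append(fp)
    return store, pointers


def overwrite(store, pointers, index, new_block):
    # Changed data is stored as a new copy; other references to the
    # old block are left pointing at the original.
    fp = fingerprint(new_block)
    store.setdefault(fp, new_block)
    pointers[index] = fp


def rehydrate(store, pointers):
    return b"".join(store[fp] for fp in pointers)


data = b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE
store, pointers = dedupe(data)
print(len(pointers), len(store))  # 4 logical blocks, 2 physical copies
overwrite(store, pointers, 0, b"C" * BLOCK_SIZE)
print(len(pointers), len(store))  # 4 logical blocks, 3 physical copies
```

After the overwrite, the first logical block points at a new physical copy, while the other two references to the "A" block are unaffected.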
Pros and cons of reduction technologies
Thin provisioning is a good technology for reducing the size of initial primary allocations, as most applications don’t use their full space allocation at creation time and end-user storage capacity is typically overspecified to accommodate future growth. While savings from thin provisioning can be as high as 30%, the ongoing benefits of thin provisioning require maintenance and monitoring to ensure storage “stays thin.” Vendor implementations take radically different approaches to achieve this and, regardless, all thin provisioning deployments require host-based support. In addition, as discussed earlier, thin provisioning tackles overallocation of resources, so it won’t realize any savings where logical storage capacity is fully physically utilized.
Compression is a simple technology to deploy, requiring no user intervention in normal operation, but there are two factors to consider when using the technology. First, compression adds latency in both directions: the compression algorithm adds work to the write I/O cycle, and compressed data needs to be “rehydrated” before a user can access it, which can add to read time because the data is uncompressed in memory prior to delivering the I/O request. Second, as data is changed during a write operation, the new compressed data size can increase, making it impossible to re-save the data in its original location. This introduces additional computations, especially when RAID parity calculations are involved. However, as processing power has increased (especially with today’s Intel Xeon processors), the computing overhead of compression is becoming less of a problem. The savings from compression are highly dependent on the type of data in use, but reductions can be significant with pre-formatted data such as databases.
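The problem of changed data no longer fitting in its original location can be shown with a small Python sketch. It uses the standard zlib module as a stand-in compressor and assumes a 4 KB block; the helper names are hypothetical, and real arrays manage compressed extents in their own ways.

```python
import os
import zlib

BLOCK_SIZE = 4096  # assumed block size


def compressed_size(block):
    return len(zlib.compress(block))


def fits_in_place(old_block, new_block):
    """True if the new compressed data fits the space the old data occupied."""
    return compressed_size(new_block) <= compressed_size(old_block)


original = b"\x00" * BLOCK_SIZE                 # compresses to a few bytes
# The host overwrites the first 1 KB with data that barely compresses.
modified = os.urandom(1024) + original[1024:]

print(compressed_size(original), compressed_size(modified))
print(fits_in_place(original, modified))        # False: must be written elsewhere
```

When the check fails, the array has to find a new location for the block and update its metadata, which is where the extra computation comes from.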
Data deduplication is also a simple technology for users to implement, requiring no additional management overhead. Savings are realized by identifying repeated blocks of identical data, removing the duplicates and placing logical pointers to the single-instance physical copy.
There are two ways in which duplicates are identified: inline or via post processing. Inline dedupe identifies duplicate copies of data as they’re written to the storage array, usually by computing a hash that acts as a unique identifier for each different block of data and looking it up in a hash table. The inline technique requires more processing overhead and can introduce additional latency into the I/O operation; however, it’s more space efficient and can result in less back-end I/O when data doesn’t need to be physically written to disk.
Post-processing dedupe scans for duplicate blocks of data asynchronously as a background task that occurs independently of normal I/O operations. This method requires additional storage to accommodate the newly written data before it’s deduplicated, so it isn’t as space efficient as inline processing. However, it does have less impact on host latency. Care needs to be taken in scheduling the dedupe process so that host I/O performance isn’t impacted.
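The space trade-off between the two approaches can be illustrated with a short Python sketch. It models both methods on the same stream of blocks and reports the peak and final number of physical blocks held. The function names, the use of SHA-256 as the fingerprint and the simple loop standing in for the background scan are all assumptions for the example.

```python
import hashlib


def fingerprint(block):
    return hashlib.sha256(block).hexdigest()


def inline_write(blocks):
    """Dedupe on the write path; return (peak, final) physical block counts."""
    store = {}
    peak = 0
    for block in blocks:
        # A duplicate fingerprint means no back-end write is needed.
        store.setdefault(fingerprint(block), block)
        peak = max(peak, len(store))
    return peak, len(store)


def post_process_write(blocks):
    """Land every block first, then dedupe in a later background pass."""
    landed = list(blocks)  # every write reaches disk as-is
    peak = len(landed)     # space needed before the scan runs
    store = {}
    for block in landed:   # the asynchronous scan, modeled as a simple loop
        store.setdefault(fingerprint(block), block)
    return peak, len(store)


blocks = [b"A" * 4096] * 8 + [b"B" * 4096] * 2
print(inline_write(blocks))        # (2, 2)
print(post_process_write(blocks))  # (10, 2)
```

Both methods end up with two physical blocks, but the post-processing model needs room for all 10 until the scan completes, while the inline model pays for its efficiency with a hash calculation on every write.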
Savings from deduplication vary and can range from 2:1 to 10:1, depending on industry segment and the data itself. For example, virtual desktop infrastructure (VDI) and virtual server deployments see good benefits from deduplication where virtual machines and desktops have been cloned from a single gold master.
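It helps to convert these ratios into the fraction of capacity actually saved, which is 1 - 1/ratio. The short Python sketch below does the arithmetic; the function name is just for illustration.

```python
def space_saved(ratio):
    """Fraction of capacity saved for a reduction ratio of ratio:1."""
    if ratio < 1:
        raise ValueError("a reduction ratio can't be less than 1:1")
    return 1 - 1 / ratio


for ratio in (2, 5, 10):
    print(f"{ratio}:1 saves {space_saved(ratio):.0%}")
# 2:1 saves 50%
# 5:1 saves 80%
# 10:1 saves 90%
```

Note the diminishing returns: moving from 2:1 to 10:1 is a fivefold improvement in ratio, but the saving rises only from 50% to 90%.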
DRIPS and SSD
One impact of using space-reduction techniques is the increase in I/O density, specifically the random I/O it creates. I/O density increases with thin provisioning as the unused space is eliminated. Deduplication creates more random I/O, as the locations of the duplicate and single-instance blocks are unpredictable -- and it becomes more random over time.
Solid-state and dedupe: A good match
Data dedupe is enabling solid-state storage array vendors to compete at a $/GB ratio comparable with today’s high-end hard disk drive storage arrays. Once flash technology is trusted and widely adopted, data reduction techniques will increase their appeal and make today’s high-end disk-based arrays a much tougher sell for many firms.
Solid-state drives (SSDs) are a great fit for random I/O profiles, making them suitable for deployment in storage arrays implementing deduplication. There’s no latency penalty in handling random versus sequential I/O, and therefore no reduction in performance in managing deduplicated data. Dedupe also changes the effective $/GB ratio in terms of storage costs. Vendors are using deduplication ratios to reduce the $/GB cost of their storage and boost the appeal of their arrays to a wider audience. Prospective customers should be wary of accepting pricing based on deduplication ratios without an understanding of potential savings from their data, as savings may not match vendor claims.
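The effect of a deduplication ratio on pricing is simple division: the effective $/GB is the raw $/GB divided by the reduction ratio. The Python sketch below uses entirely hypothetical figures (a raw price of $10/GB, a claimed 5:1 ratio and a measured 2:1 ratio) to show why the ratio you achieve on your own data matters.

```python
def effective_cost_per_gb(raw_cost_per_gb, reduction_ratio):
    """Effective $/GB once data reduction is taken into account."""
    if reduction_ratio < 1:
        raise ValueError("a reduction ratio can't be less than 1:1")
    return raw_cost_per_gb / reduction_ratio


RAW_COST = 10.00      # hypothetical raw $/GB, for illustration only
CLAIMED_RATIO = 5     # hypothetical vendor claim
MEASURED_RATIO = 2    # hypothetical result on the customer's own data

print(f"quoted:   ${effective_cost_per_gb(RAW_COST, CLAIMED_RATIO):.2f}/GB")
print(f"achieved: ${effective_cost_per_gb(RAW_COST, MEASURED_RATIO):.2f}/GB")
# quoted:   $2.00/GB
# achieved: $5.00/GB
```

In this made-up case the achieved cost is two and a half times the quoted figure, even though the hardware and its price are identical.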
NetApp Inc. was the first vendor to implement dedupe of primary data in its arrays, starting way back in May 2007. The feature was originally known as A-SIS (Advanced Single-Instance Storage) and performed post-processing dedupe of data at the 4 KB block level. Initially, A-SIS was restricted by platform and to smaller volume sizes than the filers would support without A-SIS installed. This was to ensure performance remained consistent; it was well known that performance could degrade as A-SIS-enabled volumes reached capacity. These restrictions have been eased as more powerful hardware has become available. NetApp also supports thin provisioning, a feature that was significantly expanded with the introduction of aggregates to Data ONTAP 7.
In 2010, Dell Inc. acquired Ocarina Networks, which had developed a standalone deduplication appliance that could be placed in front of traditional storage to provide inline deduplication functionality. Since the acquisition, Dell has integrated the Ocarina technology into a number of product lines, including the Dell DR4000 for disk-to-disk backup and the Dell DX6000G Object Storage Platform, a storage compression node for object data. Dell has stated its intention to add primary data deduplication functionality to its EqualLogic and Compellent lines of storage arrays, which already support thin provisioning.
EMC Corp. has had data deduplication in its backup products for some time; however, only the VNX platform offers data deduplication in primary storage and it’s limited to file-based deduplication from the part of VNX that came from the now-defunct Celerra hardware. Although EMC has discussed its intention to implement deduplication, no firm announcements or details have emerged.
Oracle Corp. has had the ability to use deduplication in its storage products since 2009, when it acquired the ZFS file system as part of the Sun Microsystems takeover. The Sun ZFS Storage Series 7000 appliances support inline deduplication, compression and thin provisioning. The ability to deduplicate using ZFS is also available to storage vendors using the technology within their storage products. This includes Nexenta Systems Inc., which released deduplication in NexentaStor 3.0 in 2010. GreenBytes Inc. is another startup using SSDs within its storage arrays in combination with ZFS to deliver deduplication functionality.
As previously mentioned, SSDs are a great fit for space-reduction technologies, in particular data deduplication. Product offerings from Nimbus Data Systems Inc., Pure Storage Inc., SolidFire Inc. and Whiptail Technologies Inc. all use dedupe to increase the effective capacity (and so reduce the effective $/GB cost) of their storage arrays. Savings vary by vendor and industry segment -- a space reduction of 5:1 is typical; however, Pure Storage claims savings of up to 10:1 based on real-world customer examples.
Finally, it’s important to note several startup vendors that are producing products specifically for virtual environments. NexGen Storage Inc. and Tintri Inc. have developed hardware platforms specifically optimized for virtual environments. Both support dedupe, which can result in significant savings where virtual machines are cloned from master images.
Not widely available … yet
Thin provisioning, compression and deduplication can result in significant savings in primary data storage utilization. However, as we’ve seen, to date only thin provisioning seems to have been widely adopted by the industry. Of the top vendors, only NetApp stands out as having embraced deduplication as a mainstream array feature. Perhaps there’s a reluctance to introduce space-reduction technologies because they reduce the ability to maximize storage sales. But the latest wave of storage startups, with its concentration on using solid-state storage efficiently, is managing to produce products that have attractive ROIs when used in general shared-storage environments, and those startups will be looking to displace storage from the big six vendors. DRIPS is here to stay, though in some cases it may arrive as a result of the deployment of new technology.
BIO: Chris Evans is a U.K.-based storage consultant. He maintains The Storage Architect blog.