There is no shortage of new technologies to make the storage of data more efficient, but the vast majority of these storage advances have focused on backup and archive, not primary storage. However, with companies starting to perform data reduction on primary storage, it's important for them to understand the requirements that primary storage optimization demands.
Primary storage, often referred to as Tier 1 storage, is characterized as storage for active data -- data that is frequently accessed and that requires high performance, low latency and high availability. Primary storage is typically used to host mission-critical applications, such as databases, email and transaction processing. Most key applications have random data access patterns and varying access requirements, but all of them generate vast amounts of data that organizations use to run their business. As a result, organizations make many copies of their data, replicate the data for distributed use, warehouse the data, and then back up and archive the data for safekeeping.
The overwhelming majority of data originates as primary data. As the data ages, it is typically moved to secondary and tertiary storage. Therefore, if an organization can reduce its primary storage footprint, it would be able to leverage the capacity and cost savings throughout the data lifecycle. In other words, a smaller primary storage footprint translates into less data to replicate, warehouse, archive and back up.
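To make the multiplier effect concrete, here is a minimal Python sketch. The number of copies at each lifecycle stage and the 2:1 reduction ratio are hypothetical values chosen purely for illustration, and the sketch assumes the reduced format is preserved in every downstream copy.

```python
# Hypothetical illustration of how a primary storage reduction propagates
# through downstream copies. Every figure here is an assumption.

def lifecycle_footprint_tb(primary_tb, copies_per_stage):
    """Capacity consumed by the primary data plus every downstream copy."""
    return primary_tb * (1 + sum(copies_per_stage.values()))

# Assumed number of full copies held at each stage of the data lifecycle.
COPIES = {"replica": 1, "warehouse": 1, "backup": 3, "archive": 1}

before = lifecycle_footprint_tb(10.0, COPIES)      # 10 TB of primary data
after = lifecycle_footprint_tb(10.0 / 2, COPIES)   # same data at an assumed 2:1 reduction

print(f"before: {before:.1f} TB, after: {after:.1f} TB, saved: {before - after:.1f} TB")
```

Under these assumptions, every terabyte removed from primary storage removes seven terabytes from the total footprint, because six downstream copies shrink along with it.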
Compression and data deduplication
Storage administrators trying to reduce the footprint of their primary storage are likely considering two data reduction methods: real-time compression and data deduplication.
Until recently, data compression was not widely adopted in primary storage applications because of performance concerns. However, vendors such as IBM, with its Real-time Compression Appliances (based on technology acquired with Storwize in July 2010), now offer products that provide up to a 15:1 footprint reduction using real-time, random-access compression/decompression technology. Higher compression ratios and real-time performance are making compression a real consideration for primary storage data reduction.
Data deduplication technology, so popular in backup applications, is also being applied to primary storage. The challenge is that deduplication has thus far been an offline process. Identifying redundant blocks of data across potentially millions of files is time-consuming and highly storage-processor intensive, and the performance of very active data could be affected. As a result, very active data is not processed until it has aged to some extent. Vendors in this space include NetApp, Data Domain and Dell (based on its July 2010 acquisition of Ocarina Networks).
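The cost of finding redundant blocks is easier to see in code. The following Python sketch assumes fixed 4 KB blocks and SHA-256 fingerprints. Both are illustrative choices, not a description of how any of the vendors named above implement deduplication.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size; real products vary

def dedup_stats(files, block_size=BLOCK_SIZE):
    """Return (total_blocks, unique_blocks) across a collection of byte strings."""
    seen = set()
    total = 0
    for data in files:
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            # Every block must be read, hashed and looked up in the index.
            seen.add(hashlib.sha256(block).digest())
            total += 1
    return total, len(seen)

base = bytes(range(256)) * 16            # one 4 KB block of sample data
file_a = base * 4                        # four identical blocks
file_b = base * 2 + b"\x00" * 4096       # two shared blocks plus one new block

total, unique = dedup_stats([file_a, file_b])
print(f"{total} blocks scanned, {unique} unique blocks stored")
```

Note that every block in every file is read and fingerprinted even when nothing is saved, which is why the work is usually deferred until the data is no longer busy.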
There are six requirements for deploying a primary storage optimization solution.
Requirement No. 1: Zero impact on performance
Unlike backup or archive storage, performance of the active data set is more critical than the capacity that might be saved by some form of data reduction. As a result, the data reduction technology chosen must cause no impact on performance. It has to just work and do so simply; it must be the equivalent of "flip a switch and you consume less storage."
Deduplication solutions for active storage today deduplicate data only once it has reached a state of inactivity. In other words, deduplication is essentially performed only on files that are no longer being accessed but still reside in the active storage pool -- a near-active tier of storage.
Vendors of this deduplication technology avoid creating a performance bottleneck by recommending against deduplicating anything more than light I/O workloads. As a result, critical components of the IT infrastructure are not optimized on storage. At the top of this list of critical components are databases. Since these are extremely active components of Tier 1 storage and are almost always classified as more than a light workload, the deduplication process never analyzes them. Hence, the space that they occupy on primary storage is not optimized.
On the other hand, real-time compression systems compress all data that flows through the appliance in real time. This leads to a surprising benefit beyond mere savings in storage capacity: an increase in storage performance. When everything is compressed, the amount of data delivered on each I/O request is effectively increased, the disk cache space is increased and each write and read is more efficient.
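The inactivity rule described above can be sketched as a simple filter. The FileInfo type, the file paths and the 30-day idle threshold in this Python example are all hypothetical; actual products define and tune inactivity in their own ways.

```python
from dataclasses import dataclass

@dataclass
class FileInfo:
    path: str                  # hypothetical file path
    days_since_access: float   # time since the file was last read or written

def dedup_candidates(files, min_idle_days=30.0):
    """Return paths idle for at least min_idle_days; active files are skipped."""
    return [f.path for f in files if f.days_since_access >= min_idle_days]

pool = [
    FileInfo("/vol1/projects/2009_report.doc", 210.0),
    FileInfo("/vol1/home/notes.txt", 45.0),
    FileInfo("/vol1/oracle/users01.dbf", 0.0),   # busy database file
]

selected = dedup_candidates(pool)
print(selected)
```

The database file never qualifies under a rule like this, however much capacity it consumes, because it is never idle.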
The net effect is a reduction in disk capacity, with a measurable increase in overall storage performance.
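A rough way to reason about this effect is to count physical I/O requests. The Python sketch below assumes a 64 KB I/O size, a 16 GB cache and a 2:1 compression ratio, all hypothetical, and it deliberately ignores the processing cost of compressing and decompressing.

```python
import math

def physical_ios(logical_kb, io_size_kb, compression_ratio=1.0):
    """Physical I/O requests needed to move logical_kb of application data."""
    return math.ceil(logical_kb / (io_size_kb * compression_ratio))

def effective_cache_gb(cache_gb, compression_ratio=1.0):
    """Logical data that fits in a cache holding compressed blocks."""
    return cache_gb * compression_ratio

uncompressed = physical_ios(1024, 64)                       # no compression
compressed = physical_ios(1024, 64, compression_ratio=2.0)  # assumed 2:1 ratio

print(f"I/O requests for 1 MB: {uncompressed} vs {compressed}")
print(f"effective cache: {effective_cache_gb(16, 2.0):.0f} GB from 16 GB")
```

Whether a real system sees a net gain depends on how much time the compression step itself adds, which this sketch does not model.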
The second benefit of real-time compression is that all data is reduced, which spreads the capacity savings across all data, including databases. While real-time data compression of Oracle environments may raise some performance concerns, testing thus far has revealed a performance increase.
Another area of concern is the performance impact on the storage controller itself. Today's storage controller is asked to do many things beyond just serving up disk, including managing different protocols, performing replication and managing snapshots. Adding another function to the stack may be more than the controller can bear. Even if it can handle the additional workload, it is one more process that the storage administrator has to recognize as a potential I/O bottleneck. Offloading the compression to an external appliance removes a variable from performance concerns and does not impact the storage controller at all.
Requirement No. 2: High availability
Many data reduction solutions that focus on secondary storage are not highly available. This is because the need to instantly recover backup or archive data is not as acute as it is in Tier 1 storage. If a backup system goes down, the primary data most likely still exists. However, even in the secondary tier that concept is fading, and high availability is being added as an option to many secondary storage systems.
But high availability is not optional in primary storage. The ability to read data out of its data-reduced format (deduplicated or compressed) must exist. In a data deduplication solution where the deduplication is integrated into the storage array, redundancy follows the storage array, which is almost always highly available.
In aftermarket deduplication systems, a component of the solution delivers the un-deduplicated data to the client in its original format. This component is called the reader. The reader also needs to be highly available, and its failover must be seamless. Some solutions have the ability to load the reader on a standard server should an outage occur. These types of solutions are often used on near-active or, even more appropriately, archive data; they are not well-suited for the very active data set.
Most real-time compression systems are inserted inline on the network, placed (logically enough) between the switch and the storage. As a result, they can achieve redundancy through the high availability that is almost always designed in at the network infrastructure level. Inserting an inline appliance along each of the redundant network paths achieves a seamless failover that requires no extra effort on the part of the IT administrator; it leverages the work already done on the network.
Requirement No. 3: Space savings
There has to be appreciable capacity savings as a result of implementing one of these solutions. If capacity-reduced primary storage results in sub-standard user performance, it has no value.
Primary data does not have the highly redundant storage patterns usually found in backup data. This has a direct impact on the overall capacity savings. Again, there are two approaches to primary storage data reduction: data deduplication and compression.
Data deduplication looks for redundancy among near-active files, and the level of data reduction that can be achieved depends on the environment. In environments with high levels of redundancy, there could be a significant ROI, while others might see only a 10% to 20% reduction.
Compression works on all the available data. While it may achieve lower capacity savings for highly redundant data, it provides consistently higher savings for the more random data patterns typical of primary storage applications.
In essence, the more redundant the data patterns, the higher the space savings with deduplication. The more random the data patterns, the higher the space savings with compression.
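The trade-off can be demonstrated with two small synthetic data sets. This Python sketch makes several assumptions: blocks are a fixed 4 KB, each block is compressed independently with zlib, and "random" is read as data with little block-level duplication that is still internally compressible. Truly random bytes, used here only as the payload of the redundant set, do not compress at all.

```python
import hashlib
import random
import zlib

BLOCK = 4096  # assumed block size

def dedup_ratio(blocks):
    """Blocks presented divided by unique blocks that must be stored."""
    fingerprints = {hashlib.sha256(b).digest() for b in blocks}
    return len(blocks) / len(fingerprints)

def compression_ratio(blocks):
    """Raw bytes divided by bytes after compressing each block on its own."""
    raw = sum(len(b) for b in blocks)
    packed = sum(len(zlib.compress(b)) for b in blocks)
    return raw / packed

# Redundant set: one incompressible block stored 50 times (backup-like).
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(BLOCK))
redundant = [noise] * 50

def record_block(i):
    """A block of repetitive text that differs from every other block."""
    line = f"record {i:06d} status=OK value={i * 7}\n".encode()
    return (line * (BLOCK // len(line) + 1))[:BLOCK]

# Distinct set: 50 different but compressible blocks (primary-like).
distinct = [record_block(i) for i in range(50)]

for name, blocks in (("redundant", redundant), ("distinct", distinct)):
    print(f"{name}: dedup {dedup_ratio(blocks):.1f}:1, "
          f"compression {compression_ratio(blocks):.1f}:1")
```

Deduplication collapses the redundant set to a single block but finds nothing to remove in the distinct set, while per-block compression behaves the opposite way. Real workloads fall somewhere between these two extremes.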
Requirement No. 4: Application-agnostic
Real benefits can be derived from data reduction across all data types, regardless of the application that generates that data or how active that data is. While the actual reduction rate will vary based on the level of duplicate data or the compressibility of that data, all data must qualify.
When it comes to archive and backup, there is definite value in application-specific data reduction, and there is time available to customize the reduction process for that data set. But for the active data set, application specificity would cause performance bottlenecks and would not deliver appreciable gains in capacity reduction.
Requirement No. 5: Storage-agnostic
In a mixed-vendor IT infrastructure, the ability to use the same data reduction tool across all platforms not only increases the ROI of data reduction but also simplifies implementation and administration. Having a different data reduction approach for each storage platform requires significant training and causes confusion at the administration level.
Requirement No. 6: Complement backup optimization solutions
With all this work being done to optimize Tier 1 storage, when the time comes to back that storage up, it would be ideal to leave the data in its optimized (compressed or deduped) format. Expanding the data back to its native format before backing it up would be a waste of resources.
To expand the data set for backup will require:
- that the resources of the storage processor or external reader be used to inflate the data;
- that the network resources be expanded in order to send that data to the backup target; and
- that extra resources be assigned to the backup storage device to store that data.
Even if the backup storage device also performs data reduction such as data deduplication, sending the data to this device in an optimized format makes the deduplication system more efficient.
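The network portion of that cost is simple to estimate. The Python sketch below assumes a 10 TB logical data set, a 2:1 reduction ratio and a dedicated 10 Gbps link running at full line rate. All three figures are hypothetical, and real transfers carry protocol overhead that is not modeled here.

```python
def transfer_hours(data_tb, link_gbps):
    """Hours to move data_tb (decimal terabytes) over a link of link_gbps."""
    bits = data_tb * 8e12                  # terabytes to bits
    seconds = bits / (link_gbps * 1e9)     # gigabits per second to bits per second
    return seconds / 3600

logical_tb = 10.0       # assumed size of the expanded data set
reduction_ratio = 2.0   # assumed primary storage reduction ratio

expanded = transfer_hours(logical_tb, 10)
optimized = transfer_hours(logical_tb / reduction_ratio, 10)

print(f"expanded: {expanded:.2f} h, optimized: {optimized:.2f} h")
```

Under these assumptions the backup window for the network transfer shrinks in direct proportion to the reduction ratio, before counting the processor time saved by not inflating the data.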
It's important that any primary storage optimization be complementary with backup optimization solutions.
About the author: George Crump is founder of Storage Switzerland, an analyst firm focused on the virtualization and storage marketplaces. Storage Switzerland provides strategic consulting and analysis to storage users, suppliers and integrators. An industry veteran of more than 25 years, Crump has held engineering and executive management positions at various IT industry manufacturers and integrators. Prior to Storage Switzerland, he was chief technology officer at one of the nation's largest integrators.