Whether clay pots, wooden barrels or storage arrays, vendors have always touted how much their wares can reliably store. And invariably, the bigger the vessel, the more impressive and costly it is, both to acquire and manage. The preoccupation with size as a measure of success implies that we should judge and compare offerings on sheer volume. But today, the relationship between physical storage media capacity and the effective value of the data "services" it delivers has become much more virtual and cloudy. No longer does a megabyte of effective storage mean a megabyte of real storage.
Most array vendors now incorporate capacity-optimizing features such as thin provisioning, compression and data deduplication. But now it looks like those vendors might just be selling you megabytes of data that aren't really there. I agree that it's the effective storage and resulting cost efficiency that counts, not what goes on under the hood or whether the actual on-media bits are virtual, compacted or shared. The type of engine and the gallons in the tank are interesting, but it's the speed and distance you can go that matter.
Duped by dedupe?
Corporate data as varied as customer behavior logs, virtual machine images and corporate email might, once globally deduped and compressed, deflate to a twentieth or less of its former glory. So when a newfangled flash array has only 10 TB of actual solid-state drives, but is sold as a much larger effective 100+ TB based on an expected minimum dedupe ratio, are we still impressed with the bigger number? We know our raw data is inherently "inflated" with too many copies and too little sharing. It should always have been stored more efficiently.
But can we believe that bigger number? What's hard to know, although perhaps it's what we should be focusing on, is the reduction ratio we'll get with our particular data set, as deflation depends highly on both the dedupe algorithm and the content.
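To make the point concrete, here is a minimal Python sketch of how you might estimate a reduction ratio on a sample of your own data before trusting a quoted effective capacity. It assumes simple fixed-size 4 KiB blocks and SHA-256 fingerprints; real arrays use their own (often variable-length) chunking and add compression, so treat the function names and parameters as illustrative assumptions rather than any vendor's method.

```python
import hashlib

def dedupe_ratio(data: bytes, block_size: int = 4096) -> float:
    """Logical block count divided by unique block count (fixed-size chunking)."""
    if not data:
        return 1.0
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    # Identical blocks produce identical fingerprints, so the set keeps one of each.
    unique = {hashlib.sha256(b).digest() for b in blocks}
    return len(blocks) / len(unique)

def effective_capacity_tb(raw_tb: float, ratio: float) -> float:
    """Effective capacity implied by a raw capacity and a reduction ratio."""
    return raw_tb * ratio

# Twenty identical blocks (think cloned VM images) vs. twenty distinct blocks.
repetitive = b"A" * 4096 * 20
distinct = b"".join(bytes([i]) * 4096 for i in range(20))

print(dedupe_ratio(repetitive))           # 20.0
print(dedupe_ratio(distinct))             # 1.0
print(effective_capacity_tb(10.0, 10.0))  # 100.0
```

The same 10 TB of flash is "worth" 200 TB or 10 TB depending entirely on which of those two samples looks more like your data.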
An exabyte by any other name
We all know data is growing, as is the amount of storage we have to deploy and manage. Structured databases are growing to terabytes, less structured bigger data to petabytes, and multi-tenant clouds are aggregating to exabytes.
But I feel that in this era of big data, raw capacity just isn't that interesting a number anymore. Of course, there's going to be more data and, therefore, more data storage. We're making and keeping data at a pace set by the balance between what it costs to keep and the value of keeping it. Storage capacities per dollar are inevitably increasing. As storage capacity gets cheaper and big data analytics show how to extract business value out of massive amounts of data, we'll keep even more data around. So storage capacities will keep getting bigger.
Storing it all, once
High-capacity storage devices like HGST's 6 TB helium drives are available today, with holographic optical storage coming. Denser flash and more advanced types of non-volatile memory are also on the way. Combined with better dedupe and compression, made possible by the excess CPU capacity in modern arrays, this will lead to some massive leaps in the number of terabytes under management.
Frontline storage is getting deduped these days, and often compressed. Vendors with existing storage platforms like EMC Isilon are adding post-processing dedupe that squishes storage offline so it doesn't put a drag on performance. Some newer architecture vendors, however, are leveraging innovative flash designs to successfully dedupe inline, like SimpliVity with its high-performance ASIC.
One of the great things about inline dedupe is that it can speed performance while shrinking capacity. By eliminating back-end media I/O for duplicate blocks, downstream client reads can get a faster total response. And if replication is built on top of dedupe, then only new blocks need to be replicated. We expect data will be deduped on the storage side once, and be kept in that format throughout its lifecycle in storage -- through archive, metadata-level operations (e.g., VAAI), backup and restores.
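A toy model shows why dedupe and replication fit together so well. The sketch below is an assumption-laden illustration, not how any particular array works: the `DedupeStore` class and `replicate` function are hypothetical names, blocks are fingerprinted with SHA-256, and everything lives in memory.

```python
import hashlib

class DedupeStore:
    """Toy content-addressed store: each unique block is kept once."""

    def __init__(self):
        self.blocks = {}       # fingerprint -> block bytes
        self.volumes = {}      # volume name -> ordered list of fingerprints
        self.media_writes = 0  # back-end writes actually performed

    def write(self, volume: str, block: bytes) -> str:
        fp = hashlib.sha256(block).hexdigest()
        if fp not in self.blocks:
            # Only previously unseen content reaches the media.
            self.blocks[fp] = block
            self.media_writes += 1
        self.volumes.setdefault(volume, []).append(fp)
        return fp

    def read(self, volume: str) -> bytes:
        return b"".join(self.blocks[fp] for fp in self.volumes[volume])

def replicate(source: DedupeStore, target: DedupeStore) -> int:
    """Ship only blocks the target lacks, plus volume maps; return blocks shipped."""
    missing = [fp for fp in source.blocks if fp not in target.blocks]
    for fp in missing:
        target.blocks[fp] = source.blocks[fp]
    for name, fps in source.volumes.items():
        target.volumes[name] = list(fps)
    return len(missing)

primary, replica = DedupeStore(), DedupeStore()
for vol, block in [("vm1", b"os"), ("vm1", b"app"), ("vm2", b"os"), ("vm2", b"data")]:
    primary.write(vol, block)

first_sync = replicate(primary, replica)   # 3 unique blocks shipped for 4 logical writes
primary.write("vm3", b"os")                # a clone: no new media write
second_sync = replicate(primary, replica)  # 0 blocks shipped, only the volume map
```

Four logical writes cost three media writes, and cloning a volume that reuses existing blocks costs nothing on the back end or on the replication link beyond metadata.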
It's what you do with it
As the trend toward deflating data in storage continues, we expect external apps to get in on the action. Oracle's Hybrid Columnar Compression for its structured database data is one example. In Oracle ZFS, database data blocks are compressed incrementally and in such a way that as data becomes more static it becomes faster for the client to query. The compressed blocks aren't only archived and backed up in compressed form, but are also read back into database memory in that form when accessed, which means less I/O overhead and the acceleration of a columnar, analytical format. RainStor does something similar for big data processing of structured data, with query performance rising as storage space is optimized.
Tarmin GridBank is a scalable storage grid that globally dedupes files upon ingestion and parses file content for desired metadata that can then be globally filtered and searched. Since the storage system automatically indexes its content for immediate use by storage clients, it's delivering higher-level services that would otherwise have to be built on top at greater cost or with less capability.
These kinds of application-aware storage capabilities go beyond simply storing more bits; they deliver tangible value to storage clients. It's becoming clear that if you want to take competitive advantage of greater data volumes, increasing storage capacity is critical, but it's only part of the solution. IT storage organizations will have to evolve from "just" reliably persisting bits on media to offering sophisticated data services at a higher, business-focused level.
Apples to oranges
There have always been data protection factors that impact how much physical storage a megabyte of data actually requires. We've long specified classes of storage in our catalogs based on well-known RAID and replication schemes that consume differing amounts of physical storage. But now there are complicating alternatives in the form of erasure and fountain coding schemes, flash-specific data protection approaches, advances in automated tiered storage, and tape solutions that appear more like slow disk than offline media.
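As a rough illustration of how much the protection scheme alone changes the raw-to-usable math, the sketch below compares a few common layouts. The group widths (7+1, 6+2, 10+4) are assumptions picked for the example, the helper name is hypothetical, and the model ignores spares, metadata and rebuild reserve; rateless fountain codes don't reduce to a fixed data-plus-redundancy split, so they're left out.

```python
def raw_needed_tb(logical_tb: float, data_units: int, redundancy_units: int) -> float:
    """Raw capacity needed to hold logical_tb under a data+redundancy layout."""
    return logical_tb * (data_units + redundancy_units) / data_units

# (data units, redundancy units) per protection group; widths are illustrative.
schemes = {
    "RAID 1 mirror (1+1)": (1, 1),
    "RAID 5 (7+1)": (7, 1),
    "RAID 6 (6+2)": (6, 2),
    "3-way replication (1+2)": (1, 2),
    "Erasure coding (10+4)": (10, 4),
}

raw_for_100_tb = {name: raw_needed_tb(100.0, d, r) for name, (d, r) in schemes.items()}

for name, raw in raw_for_100_tb.items():
    print(f"{name}: {raw:.1f} TB raw for 100 TB of data")
```

Under these assumptions the same 100 TB of data needs anywhere from about 114 TB to 300 TB of physical media, before any dedupe or compression is counted.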
Today, "raw" capacity is no longer a reliable indicator of big storage. It's no longer a question of how much you're able to store, but what value you can get out of the data you do keep. It's a difficult transition, but I'd like to see more vendor metrics and licensing schemes that focus on the value of data services provided instead of the size of raw storage. I expect that within just a couple of years, simply measuring storage by the byte will become relatively ineffective, while posting big numbers focused on the business value added by aligned data services will become critically important.
About the author:
Mike Matchett is a senior analyst and consultant at Taneja Group.