Small World Big Data
Published: 08 Mar 2018
Whether you're growing on-premises storage or your cloud storage footprint this year, it's likely you're increasing total storage faster than ever. Where we used to see capacity upgrade requests for proposals in terms of tens of terabytes growth, we now regularly see RFPs for half a petabyte or more. When it comes to storage size, huge is in.
Do we really need that much more data to stay competitive? Yes, probably. Can we afford extremely deep storage repositories? It seems that we can. However, these questions raise a more basic chicken-and-egg question: Are we storing more data because we're making more data or because constantly evolving storage technology lets us?
Data storage economics
Looked at from a pricing perspective, the question becomes what's driving price -- more demand for data storage or more storage supply? I've heard economics professors say they can tell which students really understand basic supply-and-demand price curves by whether, when asked this kind of question, they consider a supply-side answer first. People tend to reach for demand-side explanations as the most straightforward way to account for price changes. I guess it's easier to assume supply is a remote constant while envisioning all the possible changes in demand for data storage.
But if storage supply is constant, given our massive data growth, then it should be really expensive. The massive squirreling away of data would instead be constrained by that high storage price (low availability). This was how it was years ago. Remember when traditional IT application environments struggled to fit into limited storage infrastructure that was already stretched thin to meet ever-growing demand?
Today, data capacities are growing fast, yet the price per unit of storage capacity keeps dropping. There's no doubt supply is rising faster than demand for data storage. Supply-side technologies -- the inherent efficiencies of shared cloud storage, Moore's law and clustered open source file systems such as Hadoop Distributed File System -- have made bulk capacity so affordable that the price of storage continues to fall despite massive growth in demand.
Endless data storage
When we think of hot new storage technologies, we tend to focus on primary storage advances such as flash and nonvolatile memory express. All so-called secondary storage comes, well, second. It's true the relative value of a gigabyte of primary storage has greatly increased. Just compare the ROI of buying a whole bunch of dedicated, short-stroked HDDs as we did in the past to investing in a modicum of today's fully deduped, automatically tiered and workload-shared flash.
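As a back-of-the-envelope sketch of that ROI comparison -- every figure below is hypothetical and chosen only to make the arithmetic visible, not drawn from any real array:

```python
# Hypothetical comparison of usable capacity per raw terabyte.
# Short-stroking uses only the fast outer tracks of each HDD,
# sacrificing most raw capacity to cut latency.
hdd_raw_tb = 100.0
short_stroke_fraction = 0.2      # ~20% of platters used (illustrative)
hdd_usable_tb = hdd_raw_tb * short_stroke_fraction   # 20 TB usable

# Modern flash is bought in smaller quantities but deduped and shared.
flash_raw_tb = 20.0
dedupe_ratio = 4.0               # 4:1 data reduction (illustrative)
flash_effective_tb = flash_raw_tb * dedupe_ratio     # 80 TB effective

print(hdd_usable_tb, flash_effective_tb)
```

With these made-up numbers, a fifth of the raw flash delivers four times the effective capacity of the short-stroked disk farm -- the direction of the gap, not the exact ratio, is the point.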
It's also worth thinking about flash storage in terms of impact on capacity, not just performance. If flash storage can serve a workload in one-tenth the time, it can also serve 10 similar workloads in the same time, providing an effective 10-times capacity boost.
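The capacity argument above is simple arithmetic; a minimal sketch, assuming purely illustrative service times:

```python
# If flash serves an I/O in one-tenth the time of disk, one flash
# device can time-share ten similar workloads in the window where a
# disk served one -- an effective 10x capacity boost.
disk_service_time_ms = 10.0   # per I/O on HDD (illustrative)
flash_service_time_ms = 1.0   # per I/O on flash (illustrative)

speedup = disk_service_time_ms / flash_service_time_ms
workloads_served = int(speedup)   # workloads one flash device can absorb

print(speedup, workloads_served)
```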
Don't discount the major changes that have happened in secondary storage, however. Offline archives have come online, delivering big data streams on demand and keeping all our aging data accessible and useful. You can use hybrid object stores to version, back up and restore a whole company's file systems. And unlike yesterday's data protection targets, these can actively serve all of our precious files directly to a global audience, using global namespaces with per-object security policy enforcement.
We also see analytics converging with storage. IT architects are trying to capitalize on the advantages gained by moving storage closer to the compute stack. For storage folks, it's also worth watching the convergence trends running the other way -- those that bring compute capabilities closer to the storage stack.
Storage approaches are emerging that will support and process compute functions closer to -- and even inside -- where data is stored, rather than shipping data out of storage to some remote processing unit. As data sets grow, we'll see more processing occur local to storage; this is a fundamental principle of big data processing. Some storage products can host local virtual machines and containerized applications and even process remotely submitted "lambda" functions, much like passing anonymous functions in functional programming.
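To make the "ship the function, not the data" idea concrete, here is a toy sketch -- all names are hypothetical and don't correspond to any vendor's API:

```python
# Toy model of storage-side "lambda" processing: the client submits a
# small function that runs next to the data, so only the (small)
# result crosses the wire instead of the (large) object.

class StorageNode:
    def __init__(self):
        self.objects = {}   # object name -> bytes, held locally

    def put(self, name, data):
        self.objects[name] = data

    def get(self, name):
        # Traditional path: ship the whole object back to the caller.
        return self.objects[name]

    def submit(self, name, fn):
        # Compute-near-data path: run the caller's function locally
        # and return only its result.
        return fn(self.objects[name])

node = StorageNode()
node.put("sensor.log", b"17 23 5 42 8")

# Shipping data out: the full object crosses the network.
total_remote = sum(int(x) for x in node.get("sensor.log").split())

# Shipping the function in: only a single integer comes back.
total_local = node.submit(
    "sensor.log", lambda data: sum(int(x) for x in data.split()))

print(total_remote, total_local)  # identical results, very different traffic
```

The result is the same either way; what changes is how many bytes move, which is exactly why locality matters as data sets grow.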
The point of having processing occur local to storage may at first be mostly one of performance. But by building in ways to apply analytics effectively at scale, computing storage opens up new avenues for increasingly real-time applications to take advantage of even more data, collected and processed in higher volumes. For example, all the "things" in our future internet of things (IoT) will be generating useful data. If we can exploit IoT data where it's first recorded, we'll end up storing vastly more data again.
Do eggs store chickens?
An insider answer to my chicken-and-egg question is that as we learn to wring more value out of our data, we want to both make and store more data. So what comes first is our increasing ability to apply more effective analytics at larger scale, at faster speed and on finer-grained data.
Savvy storage vendors have been looking at not just helping us manage larger data volumes but also at building facilities into storage to help exploit it all. Unstructured data search, big data analytics, online active archives, streaming data services and global namespaces are only a few data-exploiting capabilities in today's advanced storage products.
This means storage experts still have a challenging job ahead. They'll need to ensure that enough of the right kind of storage services are available to meet demand for data storage. In addition, as storage becomes less a passive store and more of an active, converged platform, they'll have to provide and align storage that can fully exploit the data it contains.