By Steve Duplessie
The last five years have been about IT efficiency -- from an operations standpoint, and also in terms of making our "stuff" more efficient.
We've made storage, networks and servers more efficient by virtualizing them. But now it's time to stop concerning ourselves with making gear more efficient and instead focus on data efficiency.
After all, who cares about gear, other than gear makers? It's all about the data. It is time to further leverage our commodity hardware by systematically adding smart services on top of it so we can more readily focus on extracting as much value from our data as is possible.
Data efficiency is just as it sounds -- making the data we need more efficient to access, use and manage. That lets us drive more value out of it -- which, I would argue, is the entire raison d'être for IT.
In the storage world, data deduplication has been a hot efficiency "enabler," along with thin provisioning, snapshots, virtualization, multi-tenancy and data compression. Some of them are new, and some have been around forever, it seems. All of them are important.
But when it comes to making data more efficient, it's important to consider the "why" and not just the "how." For example, most data deduplication solutions are designed for backup, not primary data environments. I'm all for making data backups more efficient, but that only represents a small fraction of the value potential for IT.
We have done a decent job, over the last five or so years, at making the systems that store and manage data far more efficient. We can thin provision (virtualize) physical storage assets so that we get the most use out of them. We virtualize data (from a presentation perspective) with the use of snapshots. With multi-tenancy, we can optimize the utilization of our physical assets across multiple constituents. All that is good, but new technologies exist that will allow us to take this much further.
Data compression comes of age
Data compression has been around for a long time, but it is currently enjoying a renaissance. Primary data compression is going to change the fundamental efficiency and overall value that users derive from their storage. That's because you get more value when you create efficiencies closer to the point of data "creation." Think of it this way: If you start with 100 GB of primary data, over time you will back it up x times, so you'll end up with 100 GB of primary data and 100x GB of backup, or secondary, data. Backup deduplication players such as EMC Data Domain spend their time on the 100x problem -- and this is a good problem to spend time on. But there are probably a lot of other uses and duplicates of the originating data throughout the organization between creation and backup -- in test/development, data warehouses, etc.
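To put rough numbers on that argument, here is a minimal sketch in Python. The function name, the 10 backup copies and the 50% reduction are illustrative assumptions, not measurements, and the model assumes every backup is a full, unoptimized copy of whatever the primary data looks like at the time.

```python
def total_footprint_gb(primary_gb, backup_copies, primary_reduction=0.0):
    """Return (primary, backup, total) footprint in GB.

    primary_reduction is the fraction removed from the primary data before
    any backup is taken (0.5 means the data shrinks by half). Each backup is
    assumed to be a full copy of the (possibly reduced) primary data.
    """
    if not 0.0 <= primary_reduction < 1.0:
        raise ValueError("primary_reduction must be in [0, 1)")
    primary = primary_gb * (1.0 - primary_reduction)
    backup = primary * backup_copies
    return primary, backup, primary + backup


# 100 GB of primary data backed up 10 times, with no optimization at the source
print(total_footprint_gb(100, 10))                         # (100.0, 1000.0, 1100.0)

# The same data, shrunk by half before the first backup is ever taken
print(total_footprint_gb(100, 10, primary_reduction=0.5))  # (50.0, 500.0, 550.0)
```

In this toy model, halving the data at the source halves every downstream copy as well, which is the whole point of optimizing early.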
Optimizing data as early as possible is the key. From that point on, all the downstream benefits are magnified. There's less to move, less to manage, less to back up, less to copy, less to replicate, less to store, and less to break. Less is the new more.
I'm not a genius, but it seems to me the most efficient way to do this is to leverage all of the tools at hand. First, compress the data as much as you can. We've proven that you can squish 50% or more out of the primary footprint of almost any kind of data, including databases. Second, deduplicate it. You can dedupe anything left over after you complete your compression work. People don't want all of their data deduped, but that's OK. Eliminate what you don't need, and start clean. That gives you the perfect baseline.
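A toy version of "compress first, then dedupe" might look something like the sketch below. It is only an illustration under stated assumptions, not how any particular product works: the fixed 4 KB block size, the use of zlib for compression and SHA-256 for fingerprinting, and the names `optimize`, `restore` and `stored_bytes` are all hypothetical choices made for the example.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed fixed block size, chosen only for illustration


def optimize(data, block_size=BLOCK_SIZE):
    """Compress each block, then keep only one copy of each distinct result.

    Returns (store, recipe): store maps a fingerprint to a compressed block,
    and recipe lists the fingerprints in the order needed to rebuild the data.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        compressed = zlib.compress(block)                # step 1: compress
        digest = hashlib.sha256(compressed).hexdigest()  # fingerprint the result
        store.setdefault(digest, compressed)             # step 2: dedupe
        recipe.append(digest)
    return store, recipe


def restore(store, recipe):
    """Rebuild the original bytes from the store and the recipe."""
    return b"".join(zlib.decompress(store[digest]) for digest in recipe)


def stored_bytes(store):
    """Total size of the unique compressed blocks actually kept."""
    return sum(len(compressed) for compressed in store.values())


# Eight identical blocks followed by one different block
data = (b"A" * BLOCK_SIZE) * 8 + bytes(range(256)) * 16
store, recipe = optimize(data)

assert restore(store, recipe) == data  # nothing was lost along the way
print(len(recipe), "blocks referenced,", len(store), "unique blocks stored")
print(len(data), "bytes in,", stored_bytes(store), "bytes kept")
```

Because each block is compressed on its own and the compression is deterministic, identical blocks still produce identical output, so the dedupe step can work on whatever is left over after compression.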
From there, snap it, thin provision it, copy it (virtually) -- and do whatever else you would like to do with it. At least you begin the journey with an optimized footprint, which makes everything else you do with that data far more efficient.
The trick is to compress in real time, without suffering the performance penalties that you remember from 20 years ago. It can be done today. We do some amazing things at wire speed that weren't possible even a few years ago. There will be a lot of money and R&D spent in this area, I predict, as it seems clear to me that it simply has to happen.
Moving the value needle of data storage optimization from the "dead" end of the wire to the "live" end (where data is created) is inevitably the best way to drive value all the way through the entire lifecycle of that data. Everywhere that data sits, every time that data is manipulated or used, there is value that can be optimized.