Vendors seem to be all over the place when it comes to explaining the value of data. What data is worth is an old, important and simple question that few organizations have answered satisfactorily. Now, as data amasses at a multi-zettabyte-per-year rate, our failure to address this issue adequately will almost certainly come back to haunt us.
How do we value data so we can create more effective strategies for its management? Here are some data management best practices to keep in mind.
Mapping the DNA of your data
Data is really just a bunch of anonymous ones and zeroes. It has no inherent value whatsoever if you don't know the application and business process associated with it. Data inherits its importance, like DNA, from the process it serves.
This important truism should shape many of the data management best practices we use to host, protect and preserve data throughout its useful life, and to delete it when that life ends. Unfortunately, we often don't know what we need to know in order to be effective data stewards.
Why? The answer is simple: We usually don't bother to find out.
- Think about it: Your hypervisor vendor probably advises that you create and deploy active-active or active-passive clusters, the latter with data mirroring, to ensure that your data is always available. While this kind of thinking is flawed at its foundation (clustered or mirrored nodes situated close to each other or instantaneously replicating the same bits can both be destroyed by the same logical or physical event), it also ignores that most of us lack the budget to double up on every server or storage node in our data centers.
- More to the point: We shouldn't even try to do this, as only some of our applications actually need to be highly available or always on. Only by understanding our data can we apply the most appropriate protection strategy to it -- the right protection service at the right cost.
- Another point: We do not manage data more correctly or appropriately by compressing or deduplicating it, regardless of what vendors say. Those technologies are about storage capacity management, not data management. From a data management perspective, compression and dedupe are only tactical measures designed to slow the rate at which the junk drawer is filling and, perhaps, defer the expense of buying more capacity.
These technologies do nothing for sorting out the data junk drawer.
Organizing the data junk drawer
Sorting our data requires good data hygiene (purging duplicates and dreck) and real data archiving (moving infrequently accessed or modified data that must nonetheless be retained onto less expensive and more durable media) -- with all data movement performed under the auspices of sound data management best practices. To do a good job of data management policymaking, you should understand not only what business process the data supports, and what the legal and regulatory mandates for managing that data require, but also how that data is used.
This involves more than just consulting the date-last-accessed and date-last-modified metadata on every file or object stored. That metadata only gives you an idea of file activity, not the whole story.
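To make the point concrete, here is a minimal Python sketch of a disposition rule that treats access and modification dates as only one input alongside business process and retention mandate. The `FileRecord` fields, the 180-day idle threshold and the four disposition labels are all assumptions made for illustration, not an established standard or any product's behavior.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class FileRecord:
    path: str
    last_accessed: datetime
    last_modified: datetime
    business_process: Optional[str]  # None: nobody has mapped this file yet
    must_retain: bool                # hypothetical flag set by legal/regulatory review


def disposition(record: FileRecord, now: datetime, idle_days: int = 180) -> str:
    """Return 'review', 'keep', 'archive' or 'delete-candidate' for one file."""
    # Without a known business process, activity dates alone cannot
    # justify any decision, so the file goes back to a human.
    if record.business_process is None:
        return "review"
    # Activity metadata: the most recent of access and modification.
    last_touch = max(record.last_accessed, record.last_modified)
    if now - last_touch < timedelta(days=idle_days):
        return "keep"
    # Idle data is archived only if a mandate says it must be retained.
    return "archive" if record.must_retain else "delete-candidate"
```

Note that an idle file with no known business process is never archived or deleted automatically in this sketch. That is the design choice the text argues for: file activity on its own is not the whole story.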
Databases and the shelf life of data
Take the case of databases: Can you get by with a summary and checkpoint reference to data or do you need all of the gory details of every transaction?
To maintain an in-memory database (IMDB), you must routinely de-stage, summarize and checkpoint older transactions and then restage the condensed data tables. This must be done to ensure that the data doesn't outgrow the DRAM used to store the IMDB itself.
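The summarize-and-checkpoint step can be sketched in a few lines of Python. This is an illustration only, not how any particular IMDB product works: the `Transaction` fields, the per-account totals used as the summary and the `checkpoint` function name are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Transaction:
    account: str
    amount_cents: int
    timestamp: datetime


def checkpoint(transactions, cutoff):
    """Split transactions at `cutoff`.

    Returns (summary, recent, destaged):
      summary  - per-account totals of everything older than the cutoff
      recent   - transactions that stay in memory in full detail
      destaged - the older detail rows, removed from memory
    """
    summary = defaultdict(int)
    recent, destaged = [], []
    for tx in transactions:
        if tx.timestamp < cutoff:
            summary[tx.account] += tx.amount_cents
            destaged.append(tx)
        else:
            recent.append(tx)
    return dict(summary), recent, destaged
```

The function deliberately hands back the `destaged` rows rather than discarding them. Whether those rows are archived or deleted is a policy decision that the size-containment step cannot make on its own.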
So summarizing and checkpointing older data may be a good strategy for database size containment, but is it the right data management policy? Can you delete older transactions?
This also goes to the issue of the useful life of data.
It is said that the majority of data amassing today has a useful life of four to seven minutes. Analysts making this observation are usually referring to mobile commerce transactions or Internet of Things inputs -- that is, data that drives analytical processes and forecasts.
This raises the question, however, of what is to be done with the data after it has enjoyed its few minutes of fame. Do we discard it?
Maybe, maybe not.
Vendors seem to be suggesting that we should quiesce storage nodes once they are crammed full of original files, objects or transactions that are no longer referenced. We are then supposed to spin the drives down and shelter the data in place (archive) so that we do not need to cope with the friction in our servers or networks created by actually copying data anywhere for safekeeping. Such a concept sells a lot of nodal gear, but I am not certain it provides adequate protection for the data or in any way resolves the question of useful life or sound data management.
The data management dilemma
All of the above points to a quandary in data management. We don't understand the data we have because we are hosting, staging and servicing the data as we are told to by hypervisor, hardware and cloud storage vendors who are really only interested in selling more nodal hardware, software licenses or capacity. We also lack policies for data lifecycle management because we don't understand the useful life of our data or the difference between capacity management and data management.
Unfortunately, there are no magic bullets (no vendor with a promising technology) to resolve this issue. Ultimately, the burden of data management best practices -- as well as data protection, security and preservation -- rests with IT.
Attack this problem by first determining what data serves which business processes. Only then can you begin to apply even the most rudimentary classification and criticality metrics to help guide your decisions about data hosting, archiving, protection, security and deletion.
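As a rough illustration of that first step, the Python sketch below maps datasets to business processes and lets criticality drive the protection choice. Every process name, criticality label and protection service in it is a hypothetical placeholder, and a real classification scheme would need input from the business owners of each process.

```python
from typing import Optional

# Hypothetical criticality ratings, assigned per business process.
PROCESS_CRITICALITY = {
    "order-entry": "always-on",
    "payroll": "important",
    "marketing-archive": "deferrable",
}

# Hypothetical protection services, one per criticality level.
PROTECTION_BY_CRITICALITY = {
    "always-on": "failover with a replica kept at a distance",
    "important": "daily backup with an off-site copy",
    "deferrable": "archive to low-cost media",
}


def protection_for(business_process: Optional[str]) -> str:
    """Pick a protection service from the process a dataset serves."""
    criticality = PROCESS_CRITICALITY.get(business_process)
    if criticality is None:
        # Data with no known process cannot be classified yet.
        return "unclassified: identify the business process first"
    return PROTECTION_BY_CRITICALITY[criticality]
```

Only the processes rated always-on get the expensive service here, which reflects the argument that not every server or storage node needs to be doubled up.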
About the author:
Jon William Toigo is a 30-year IT veteran, CEO and managing principal of Toigo Partners International, and chairman of the Data Management Institute.