In late 2011, the Data Management Institute looked at survey data from more than 3,000 large, medium and small firms...
and determined that on average, companies were wasting up to 70% of their disk storage capacity by storing data that did not need to be retained on expensive disk infrastructure. Approximately 40% of the data was inert, based on low rates of re-reference and even lower rates of modification, and likely suitable for archiving. Another 30% of disk capacity hosted orphan data, contraband data, or duplicates and dreck that, with a bit of data hygiene, could be eliminated from storage altogether.
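To make those percentages concrete, here is a minimal back-of-the-envelope sketch. The function name and the 500 TB figure are hypothetical, and the 40% and 30% shares are the survey's averages (the 70% total is an "up to" figure), so treat the result as a rough ceiling rather than a prediction for any particular shop.

```python
def reclaimable_capacity(total_tb, inert_share=0.40, removable_share=0.30):
    """Split disk capacity using the survey's average shares (rough estimates)."""
    archivable = total_tb * inert_share       # inert data, candidate for archive
    removable = total_tb * removable_share    # orphans, contraband, duplicates, dreck
    return {
        "archivable_tb": archivable,
        "removable_tb": removable,
        "reclaimable_tb": archivable + removable,
        "still_needed_tb": total_tb - archivable - removable,
    }


if __name__ == "__main__":
    # 500 TB is an illustrative capacity, not a number from the survey.
    for name, tb in reclaimable_capacity(500).items():
        print(f"{name}: {tb:.0f}")
```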
Why, then, is there so little guidance for creating a storage reclamation strategy -- steps for eliminating data that doesn't need to be there from disk so that the enormous space consumed by that data could be returned to productive use? Part of the explanation may be that vendors prefer consumers to adhere to what I call the Doritos model (remember that old tag line, "Crunch all you want. We'll make more"?).
However, the simple truth is that data classification might be outside the storage team's authority due to corporate politics; maybe they can't specify how much space the high-dollar sales guy can use or how long he can park files. Or it may be something they believe is beyond their skill set. Or it may be something they simply lack the resources -- staff, hardware, time or budget -- to do.
At any rate, to address the problem of wasted disk storage capacity, you need to have at least one of the following:
- A strategy for identifying data assets that have low re-reference rates (and therefore could be safely or nondisruptively moved from expensive storage to less expensive, higher-capacity storage).
- A strategy for migrating older data assets to capacity storage (or, in the case of duplicates and dreck, off the storage infrastructure entirely).
I think the choice comes down to doing a granular analysis of data assets (the first strategy, one I think is much more effective) or alternatively, using simple metadata to push older, less frequently referenced data to less expensive storage media.
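One small slice of that granular analysis, finding duplicate files, can be sketched in a few lines. This is an illustration under my own assumptions, not a method the survey or any product prescribes: it treats two files as duplicates when their sizes and SHA-256 digests match, and the function names are hypothetical. It only reports candidates; deciding which copy to keep is a policy question.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root whose contents are identical."""
    # Cheap first pass: only files of equal size can be duplicates.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Second pass: hash only the files that share a size with another file.
    by_digest = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            by_digest[(size, file_digest(path))].append(path)

    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Grouping by size first avoids hashing files that cannot possibly match, which matters on a large share where reading every byte would be slow.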
Given the prohibitions in many companies regarding the deletion of any data, the second strategy might be the more advantageous to pursue. One approach is to implement simple hierarchical storage management (HSM), which migrates less frequently accessed data assets out of production storage and onto archival disk storage capacity or tape storage, thereby preserving expensive production disk capacity for use by new and active data. HSM is usually provided as a software function, and numerous vendors offer it, either as part of larger storage-management software suites or as standalone utility software. IBM Tivoli Storage Manager and EverStor's Hiarc HSM are two examples of the suite component approach, while FileStor-HSM from Crossroads Systems is an example of an excellent utility. Using hardware-agnostic software -- as opposed to the on-hardware, value-added HSM features delivered with some arrays -- is preferred, in order to avoid the expensive lock-ins that limit archive platform choice.
Most software products enable you to set policies for when data should be moved and where it should go. These policies are typically triggered by file metadata. If the time elapsed since DATE LAST ACCESSED and/or DATE LAST MODIFIED exceeds a set limit (e.g., 30, 60 or 90 days), the file associated with the metadata is automatically moved to its destination.
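The age-based policy described above can be sketched as follows. This is a minimal illustration, not how any named product works: `MigrationPolicy`, `select_candidates` and `migrate` are hypothetical names, the timestamps come from the operating system's `stat` data, and the destination is assumed to be a directory outside the tree being scanned. Many HSM products also leave a stub or link behind so the file still appears in place; this sketch does not. Note too that some file systems are mounted so that access times are updated rarely or not at all, in which case last-modified time is the safer signal.

```python
import os
import shutil
import time
from dataclasses import dataclass

SECONDS_PER_DAY = 86400


@dataclass
class MigrationPolicy:
    max_idle_days: int                              # e.g. 30, 60 or 90
    destination: str                                # root of the capacity tier
    timestamps: tuple = ("st_atime", "st_mtime")    # must not be empty


def idle_days(path, policy, now):
    """Days since the most recent of the timestamps the policy watches."""
    st = os.stat(path)
    latest = max(getattr(st, field) for field in policy.timestamps)
    return (now - latest) / SECONDS_PER_DAY


def select_candidates(root, policy, now=None):
    """Yield files under root that have been idle longer than the limit."""
    now = time.time() if now is None else now
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if idle_days(path, policy, now) > policy.max_idle_days:
                yield path


def migrate(root, policy, now=None):
    """Move idle files to the destination, preserving relative paths."""
    moved = []
    for path in list(select_candidates(root, policy, now)):
        target = os.path.join(policy.destination, os.path.relpath(path, root))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.move(path, target)
        moved.append(target)
    return moved
```

Passing `timestamps=("st_mtime",)` restricts the policy to last-modified time only; with both fields, a file qualifies only when it has been neither read nor changed within the limit.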
With the arrival of the Linear Tape File System and partitioned tape media (IBM and Oracle tape and LTO version 5 or higher), another alternative is to write files both to disk and to tape using LTFS. Then, when the data re-reference rate drops below a set limit, simply delete the copy on disk and let the files continue to live out their archival life on LTFS tape.
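Because LTFS presents a tape as a mounted file system, the write-to-both-then-release approach can be sketched with ordinary file operations. The function names and the idea of passing in a tape mount directory are my own assumptions for illustration. The sketch verifies the tape copy by size and SHA-256 before deleting the disk copy, which means reading the file back from tape; a real workflow might instead rely on a checksum recorded at write time.

```python
import hashlib
import os
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def copy_to_tape(disk_path, disk_root, tape_root):
    """Copy a file to the (assumed) LTFS mount, keeping its relative path."""
    tape_path = os.path.join(tape_root, os.path.relpath(disk_path, disk_root))
    os.makedirs(os.path.dirname(tape_path), exist_ok=True)
    shutil.copyfile(disk_path, tape_path)   # data only, no metadata copy
    return tape_path


def release_disk_copy(disk_path, tape_path):
    """Delete the disk copy only if the tape copy exists and matches it."""
    if not os.path.isfile(tape_path):
        return False
    if os.path.getsize(disk_path) != os.path.getsize(tape_path):
        return False
    if sha256_of(disk_path) != sha256_of(tape_path):
        return False
    os.remove(disk_path)
    return True
```

In use, `copy_to_tape` would run when the file is first written, and `release_disk_copy` would run later, once whatever re-reference measure you track has dropped below your limit.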
Strategies like this focus on the data that is consuming capacity and provide the means to alleviate primary storage congestion without deleting data. Hierarchical storage management is generally superior to tactical capacity management technologies, such as compression and deduplication, which are sometimes used to "squeeze" more data into the same amount of disk turf. HSM is superior because it doesn't materially alter data (a legal issue with respect to some kinds of data), works with all data (dedupe gains little from encrypted or already compressed data), and does not place data at risk of loss due to problems with compression or dedupe software.
Keeping your production storage clear of inert and contraband data can also breathe new life into data protection processes ranging from mirroring and replication to backup, since only production data will be exposed to those data protection services. Data that has been moved to capacity storage and that doesn't frequently change can usually be replicated for protection less often, and that replication doesn't impact production workload.