Are there any tools that you can recommend that will allow a company to, first, assess the kinds of data that are being stored on disk so that the older files can be deleted or migrated to secondary storage, and, second, to establish policies to keep data storage optimized once it has been groomed?

First, you have nailed an important issue: the need to understand the kinds of data that are actually being stored. Most companies have no idea what data is actually filling up the hard disk storage in their shops. They haven't needed to be too concerned about it, given the low cost to deploy more storage whenever it was needed.
Today, high rates of data growth, plus the unmanageability of ad hoc storage acquisitions and the high administrative and labor costs that come with them, are forcing companies to reconsider their storage acquisition strategies. Companies want to plan storage acquisitions strategically to ensure that storage can scale as needed and in a non-disruptive way, and so that additional storage management personnel are not needed with every few gigabytes that are added.
The problem is, how do you predict capacity requirements if you don't know what's stored on the disk you have? Answer: you can't.
You seem to have reached this conclusion already. You are getting ready to assess your current storage to see how much data is replicated unnecessarily and how much may be stale -- unused for a sufficiently lengthy period of time -- so that you can be assured that backing it off to tape (or deleting it entirely) will not cause too much disruption.
What tools can you use to help? I am aware of none. UniTree or some other hierarchical storage management (HSM) program may be able to scan current files and report on duplicates and on files with older creation dates, but these are imperfect indicators at best. A 10 GB email file, for example, may contain 9.9 GB of old mail that could be deleted. However, the date on the email file may be today's date, because it reflects the last time the file was changed (the last time an email was actually received and stored).
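For a rough first pass at the kind of scan described above, a short script can at least list duplicate files and files that have not been modified in a long time. The following is a minimal sketch in Python, not a product recommendation: the function names (`file_digest`, `scan_tree`) and the one-year `stale_days` default are assumptions made for illustration. It relies on each file's modification time, so it has exactly the blind spot noted above: a large mail file that received one message today will look fresh.

```python
import hashlib
import os
import time
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def scan_tree(root, stale_days=365, now=None):
    """Walk `root` and return (duplicates, stale).

    duplicates: list of sorted path lists whose files have identical content.
    stale: list of (path, size_bytes, age_days) for files whose modification
           time is older than `stale_days` (an assumed threshold).
    """
    now = time.time() if now is None else now
    cutoff = now - stale_days * 86400
    by_size = defaultdict(list)
    stale = []

    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symbolic links so a file is not counted twice
            try:
                st = os.stat(path)
            except OSError:
                continue  # unreadable or vanished file: skip it
            by_size[st.st_size].append(path)
            if st.st_mtime < cutoff:
                stale.append((path, st.st_size, (now - st.st_mtime) / 86400))

    # Only files of equal size can be identical, so hash just those groups.
    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for p in paths:
            try:
                by_hash[file_digest(p)].append(p)
            except OSError:
                continue
        duplicates.extend(sorted(g) for g in by_hash.values() if len(g) > 1)

    return duplicates, stale
```

Calling `scan_tree` on a share (for example a hypothetical `/srv/share`) returns two lists that can be sorted by size to find the biggest candidates. Treat the output as a list of questions to take to the data's owners, not as a deletion list.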
To make a long story short, your initial assessment will be laborious and time-consuming. There are no shortcuts. You may even need to get very up close and personal with -- dare I say it? -- users. They may be the only ones who can tell you which files can go and which can stay, and they may be the only ones who can explain why they have backed up all of their files -- including software and operating systems -- into that network-based directory you intended for use as an on-line backup of critical data. (It's all critical, isn't it?)
Once you have assessed the data and optimized disk storage to better monitor usage trends, you'll probably need to set policies -- the human kind, not the software kind -- to ensure stored data doesn't slip back into the unkempt condition from which you have just labored to rescue it.
One organization I know had an effective solution: they told the heads of each business unit/workgroup/department that they would be charged money on a monthly basis for all data stored on disk by their business unit with a creation date older than one month. There would be no charge for migrating older data to tape. Funny how that sort of thing made all of the business unit managers start laying down the law to users regarding the type of data to store and how long to store it.
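The arithmetic behind a chargeback policy like this one is simple enough to sketch. The Python example below is illustrative only: it assumes a file inventory is already available as `(department, size_bytes, created)` records, and the function name `monthly_charges`, the per-gigabyte rate and the 30-day grace period are hypothetical values, not figures from the organization described above.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def monthly_charges(inventory, as_of, rate_per_gb, grace_days=30):
    """Return {department: charge} for data created before the grace window.

    inventory:   iterable of (department, size_bytes, created) tuples, where
                 `created` is a datetime (assumed to come from an existing
                 file inventory; not gathered here).
    as_of:       billing date as a datetime.
    rate_per_gb: assumed monthly price per gigabyte (10**9 bytes).
    grace_days:  files newer than this many days are free of charge.
    """
    cutoff = as_of - timedelta(days=grace_days)
    billable_bytes = defaultdict(int)
    for department, size_bytes, created in inventory:
        if created < cutoff:
            billable_bytes[department] += size_bytes
    return {
        dept: round(total / 1e9 * rate_per_gb, 2)
        for dept, total in billable_bytes.items()
    }
```

A department with nothing older than the grace period simply does not appear in the result, which matches the incentive described: move old data to tape and the charge disappears.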
Hierarchical storage management would be the ultimate solution for many organizations. However, HSM has not enjoyed very much success in the ruggedly individualistic world of distributed computing. Perhaps now, driven by the economic slowdown, the need to manage storage technology expenses, and the desire to make fewer tech support personnel more efficient through better storage management, a workable HSM solution will appear.
I have looked at several popular HSM products, and they don't strike me as ready for prime time, particularly in a heterogeneous server and storage environment. I also have not seen an HSM product that can do an effective job when the targets (the storage devices themselves) are a mixture of SAN, NAS, server-attached and internal disk.
Roll up your sleeves and get to work!