Data classification is "like eating an elephant," according to Michael Masterson, IT manager at a Fortune 500 life sciences company that's in the middle of a data classification project. "Don't get discouraged," he says. "You can't do it all at once."
Masterson's office has 60 Windows servers and a handful of Unix machines, plus the latest EMC Clariion CX3 array for primary storage and a Nexsan Technologies system for nearline, noncritical data. He's using EMC Documentum for document management, and has 9TB of unstructured data floating around unmanaged.
About a year ago, Masterson's company decided it needed to better understand the files it was storing before throwing any more disk into its data center. Unfortunately, says Masterson, this information isn't available in the metadata provided by Windows systems.
"People have dumped stuff on me like I'm a landfill, but I'm not in the storage business," he notes. He is, however, responsible for ensuring that the company's scientists can find files months or even years after they've created them--and with a recovery rate of minutes or hours, not days. Drug discovery is a competitive field, and it's heavily regulated by the Federal Drug Administration and Sarbanes-Oxley. "The risk of not managing these files is huge," says Masterson.
Masterson uses what he calls a "folksonomic" approach to data classification. Folksonomy is Internet parlance for tagging Web content on the fly to make it easily discoverable to users of that content. "People will not adapt consistently to one system ... it's human nature to be constantly reorganizing," he says, "and files are no different."
He's been piloting Abrevity's FileData Classifier software for approximately one year and is impressed with its ability to work with legacy files and file systems, and to provide custom file classification and tagging. "It uses tags [that] users have already provided and words within the file system that they already understand," he says.
Aside from email and the usual Microsoft Office files, fluorescence-activated cell-sorting (FACS) files--more commonly called instrument files--make up much of the company's unstructured data. These are text files produced by flow cytometers, instruments used to measure microscopic particles in fluids. As the instruments become smarter, they increasingly crank out data, all of which must be stored and managed. Analysts report that more dollars were spent last year for these types of instruments than for IT storage systems, and an order of magnitude more files were generated by them than by Microsoft Office or email users in most of these life sciences facilities.
Masterson also looked at data classification products from Arkivio and Kazeon Systems. He notes that while those tools are designed to extract known values from a single document and don't create indexes for multidocument searching, Abrevity's FileData Classifier can search and parse FACS headers, extract target data, tag files with new metadata for classification and then allow for policy-based management.
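To make the parse-extract-tag sequence concrete, here is a minimal Python sketch. It is not Abrevity's implementation, which the article does not describe. It assumes a header laid out as delimiter-separated keyword/value pairs, in the style of the TEXT segment of the FCS format commonly used for flow cytometry data, and it ignores escaped delimiters. The sample header, the project name and the retention mapping are all made up for illustration.

```python
def parse_header(segment: str) -> dict[str, str]:
    """Parse a keyword/value header whose first character is the delimiter.

    Assumption: pairs look like '/KEY/value/KEY/value/'. Escaped
    delimiters inside values are not handled in this sketch.
    """
    if not segment:
        return {}
    delimiter = segment[0]
    fields = segment[1:].split(delimiter)
    if fields and fields[-1] == "":
        fields.pop()  # drop the empty piece after the trailing delimiter
    # Pair fields up as keyword, value, keyword, value, ...
    return {
        fields[i].strip().upper(): fields[i + 1].strip()
        for i in range(0, len(fields) - 1, 2)
    }


def tag_file(header: dict[str, str],
             retention_by_project: dict[str, str],
             default_retention: str = "review") -> dict[str, str]:
    """Turn extracted header values into classification tags.

    The tag names and the project-to-retention mapping are hypothetical.
    """
    project = header.get("$PROJ", "")
    return {
        "instrument": header.get("$CYT", "unknown"),
        "acquired": header.get("$DATE", "unknown"),
        "retention": retention_by_project.get(project, default_retention),
    }


# Made-up sample header and mapping, for illustration only.
sample = "/$CYT/ExampleCytometer/$DATE/01-JAN-2007/$PROJ/assay-42/"
header = parse_header(sample)
tags = tag_file(header, {"assay-42": "regulatory"})
```

Once the tags sit in an index alongside each file path, files can be found or acted on without reopening them, which is the point Masterson makes about nested folders.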
"Engineers nest folders within folders, so it's important to be able to search across these without having to open each file, which can take hours or days," he says.
More significantly, FileData Classifier offers context-based discovery rather than plain text searching. Instead of a relational database, it relies on a proprietary database technology the vendor calls SLICEbase. This "speed[s] up searches tremendously," claims Masterson. "They've got the right approach [to] preserving context."
Still, Masterson says that showing users how to tag files with a business value is an arduous task. To that end, he built a survey and created interview questions to find out which files are important given business and regulatory requirements. The secret is to keep classifications simple. "We have security and retention tags only. Don't get too complex with it and create slices that people will forget are even there," he advises. He also recommends creating a short list of the most important data--files for a legal discovery case or HR files, for example--rather than trying to tag everything.
So far, Masterson has indexed about one-third of his office's unstructured content. His next step is to turn on policy automation to force the back end to move files to the right location.
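Policy automation of this kind can be pictured as a lookup from a file's tags to a storage tier. The Python sketch below is an assumption-laden illustration, not the product's behavior: the tag values, tier paths and the plan_moves helper are hypothetical, and it only plans moves rather than touching any files.

```python
from pathlib import PurePosixPath


def plan_moves(tagged_files: dict[str, dict[str, str]],
               policy: dict[str, str],
               default_root: str) -> list[tuple[str, str]]:
    """Return (source, destination) pairs based on each file's retention tag.

    Nothing is moved here; the result is a plan that could be reviewed
    before any back-end migration runs.
    """
    moves = []
    for path, tags in tagged_files.items():
        # Unknown or missing tags fall back to the default tier.
        root = policy.get(tags.get("retention", ""), default_root)
        dest = PurePosixPath(root) / PurePosixPath(path).name
        moves.append((path, str(dest)))
    return moves


# Hypothetical policy: regulated files stay on primary storage,
# everything else goes to the nearline tier.
policy = {"regulatory": "/primary"}
tagged = {
    "/shares/lab/run1.fcs": {"retention": "regulatory"},
    "/shares/lab/notes.txt": {"retention": "review"},
}
plan = plan_moves(tagged, policy, default_root="/nearline")
```

Keeping the policy to a small table reflects Masterson's advice to keep classifications simple: with only security and retention tags, the rules stay short enough to audit by eye.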
"It will [take] a while for us to achieve nirvana," says Masterson. The dream is for users to tag files with the appropriate values when they save them. Ideally, this functionality will be built into the operating system, but for now the Abrevity tool is a good start, he says.