The latest wave of data classification products is focused on helping organizations with specific initiatives such as e-discovery and storage tiering.
The information lifecycle management (ILM) buzz of the last few years spawned a rash of data classification products that aimed to locate and identify files and documents, categorize them with greater precision based on policies and business value, and, in some cases, search or index the information and assist in migrating lower priority data to less expensive storage.
But as the initial noise died down, some of those vendors dissolved or were acquired, and many of those left standing recognized a need to focus their attention on the markets where they apply their technology.
"What we discovered over time is that customers need to be able to take some action on the data, not just find it," said Karthik Kannan, vice president of marketing and business development at Kazeon Systems Inc. "Nobody wants to do data classification just for the sake of it. It has to be coupled with a strong business reason."
Kazeon Systems invested engineering resources to build applications on top of its core enterprise search and indexing technology and concentrated its messaging on e-discovery, as did vendors such as Autonomy Corp., Guidance Software Inc. and StoredIQ Inc., to name a few.
"That's where the market demand is right now. E-discovery is usually what is bringing us into accounts," said Ursula Talley, vice president of marketing at StoredIQ. Talley added that once some customers complete their litigation and records management work, they realize the product can also help to classify data for storage optimization purposes.
Getting deeper into data classification
Many of the e-discovery-focused products offer a deeper level of data classification than the more traditional metadata attributes of file owner, date created or modified, file type, file size or application.
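For contrast, the traditional metadata-only approach can be sketched in a few lines of Python. This is a minimal illustration rather than any vendor's method: the function name `count_by_type` is hypothetical, and it looks only at the file extension, ignoring owner, dates and size.

```python
from collections import Counter
from pathlib import PurePath


def count_by_type(paths):
    """Tally files by lowercase extension; files with none count as '(none)'."""
    return Counter(PurePath(p).suffix.lower() or "(none)" for p in paths)


if __name__ == "__main__":
    # Hypothetical file names, invented for the example.
    sample = ["q1_review.ppt", "roadmap.PPT", "contract.doc", "README"]
    for file_type, total in count_by_type(sample).most_common():
        print(file_type, total)
```

A report like this says what kinds of files exist and how many, but nothing about what is inside them.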
Some of the sophisticated search and indexing engines targeting e-discovery for litigation or compliance purposes look within a file or document for particular words, phrases, concepts, context and relationships between ideas. They can classify content based on multiple parameters, including business value or even the risk a document poses to the organization.
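A heavily simplified sketch of content-based classification follows. It assumes plain keyword matching, which is far cruder than the concept and context analysis these engines perform; the category names and trigger terms are hypothetical, invented only to show how one document can land in several categories at once.

```python
import re

# Hypothetical categories and trigger terms, invented for illustration only.
CATEGORY_TERMS = {
    "financial": {"invoice", "revenue", "audit"},
    "legal": {"contract", "litigation", "subpoena"},
}


def classify_content(text, category_terms=CATEGORY_TERMS):
    """Return the sorted list of categories whose terms appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        category
        for category, terms in category_terms.items()
        if words & terms
    )


if __name__ == "__main__":
    print(classify_content("The audit found a contract dispute."))
```

A document that matches no terms comes back with an empty list, which a real deployment would probably route to manual review or a default category.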
"Data classification really has two different viewpoints," said Greg Schulz, founder and analyst at StorageIO Group in Stillwater, Minn. "One is simply discovering what data you have: 'Hey, I've got a thousand files, and half are PowerPoint and half are Word documents.' That's basic data classification. That's what some would call storage resource management."
The e-discovery-focused vendors that sprang up over the last few years tried to convince users that basic classification was inadequate and organizations needed to look into the documents for recurring words/themes and context, Schulz said.
"The hype has subsided," he said. "Some of those companies aren't around. Others have become more specialized, and they're finding where their actual opportunities are."
Data storage considerations
When improving storage utilization is the main objective, data classification is defined by workload and performance as well as business relevance and value, noted Noemi Greyzdorf, a research manager in storage software at Framingham, Mass.-based IDC. Once users figure that out, they can make decisions about the appropriate storage platform to achieve greater cost efficiencies and align with business needs.
"In order to really realize and get the benefit of data and storage classification, you have to start with a business process," said Greyzdorf. "And it has to start from conversations with the business units and understanding the needs and requirements of the business. Only at the end, once you actually have everything in place, should you be looking at technology because then you'll have a better set of requirements for that technology."
After the IT and business stakeholders have those discussions and draw up categories and criteria, they'll find a wide range of products that offer at least some form of basic data classification capability. The list includes databases, backup and archiving tools, storage resource management software, data-loss prevention products, desktop data management suites and even storage arrays.
One of the most elementary examples is email filtering software. Users set up the criteria for what defines spam and then the product takes action by delivering, holding back or deleting the messages.
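The classify-then-act pattern in email filtering can be sketched as follows. This is an illustration under stated assumptions, not how any particular filtering product works: the `Message` type, the blocked-sender list and the suspect words are all hypothetical stand-ins for criteria a user would define.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    subject: str


# Hypothetical criteria; a real product would let users define these.
BLOCKED_SENDERS = {"spam@example.com"}
SUSPECT_WORDS = {"lottery", "winner"}


def filter_action(msg):
    """Classify a message, then return the action: 'delete', 'hold' or 'deliver'."""
    if msg.sender.lower() in BLOCKED_SENDERS:
        return "delete"
    if SUSPECT_WORDS & set(msg.subject.lower().split()):
        return "hold"
    return "deliver"


if __name__ == "__main__":
    print(filter_action(Message("a@example.com", "You are a lottery winner")))
```

The classification itself is trivial here; the value comes from the action attached to each outcome.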
"Data classification was never a product or really a market," said Brian Babineau, a senior analyst at Enterprise Strategy Group in Milford, Mass. "It was a capability that allowed people to take actions, and most of the actions were the market. Archiving was a market; electronic discovery was a process that led to a market; tiered storage, multiple markets in that. Data-loss prevention is an action enabled by classification capabilities."
Storage vendors that build some level of data classification into their arrays, to help users manage storage or data retention schedules more efficiently, tend to offer the deepest capabilities with their own products, said Christine Taylor, an analyst at Taneja Group in Hopkinton, Mass.
"The tier ones," she said, "aren't that interested in being able to classify someone else's data."
The challenge arises when trying to automate the process of migrating or moving data from one storage device to another, because even small companies might have more than one location for storage.
"It's very, very difficult to run a program that goes out to all these storage arrays, locates and classifies data and moves it, say, to a big, fat central secondary archive," Taylor said. "This would be wonderful, but it's very difficult to do."
Independent vendors that can classify and move data on different vendors' storage arrays are an alternative, Taylor said, adding that the list includes Abrevity Inc., Digital Reef Inc., Kazeon Systems, Mimosa Systems Inc., Renew Data Corp. and StoredIQ.
"Data classification is all well and good, and it's a fundamental technology for managing storage and finding data. But in the past," Taylor said, "it hasn't been enormously useful because there wasn't much of a way to act on its information or its results.
"Nowadays," she continued, "automation is getting much more common. And that's really what makes data classification so useful -- the ability to automate the action that follows from classifying the data."
CVR Energy turns to Autonomy for data classification
CVR Energy Inc. in Sugar Land, Texas, sought out Autonomy's Intelligent Data Operating Layer (IDOL) technology for document retention to categorize and index documents regardless of age. IT staff and business stakeholders established the business categories, such as financial and environmental, in Autonomy's policy engine.
The categories and other criteria that the CVR Energy team set up teach the Autonomy software how to classify data based on concepts, through a semantic and mathematical analysis. Some 400 connectors enable the system to find the data in just about any application or system an organization might use.
"Basically, the classification engine works by looking at a document, reading its content or location, and determining what it believes the document to be -- and that can be as many categories as you want," said Michael Brooks, vice president and CIO at CVR Energy. "Autonomy built a single index of all of our unstructured data."
In the future, CVR hopes to enforce retention policies on the automatically classified documents. Disposing of data that's no longer needed or required will help to lessen the costs of analysis and review for e-discovery purposes, mitigate legal risk and reduce storage needs.
"This is the only way you can handle the volume of data in the next five years. You can't manually classify it," Brooks said. "If you have 20 million documents, exactly how are you ever going to catch up? You won't."
RiskMetrics' traditional approach to data classification
But an e-discovery initiative is no small undertaking, and not every organization has a pressing need or the wherewithal to take on a project or acquire the tools. Some companies stick with the more traditional data classification approaches, as they simply try to manage their storage more efficiently and cost-effectively.
New York City-based RiskMetrics Group Inc., for instance, has EMC Corp. Clariion 15K rpm Fibre Channel (FC) drives in a RAID 10 configuration for its tier 1 storage. Tier 2 is the same, except the drives are 10K rpm. Tier 3 is 10K rpm in a RAID 5 configuration, while tier 4 has 7,200 rpm SATA drives in a RAID 5 configuration. Tier 1 backup is disk, and tier 4 backup is tape.
The decision about which data goes on what tier is a discussion point for IT and the appropriate business groups. The storage team fields the requests and requirements, sets up the applications in a non-production user acceptance testing environment on tier 3 and studies the performance numbers. The business clients can log in and test out their applications themselves.
Ed Delgado, storage architect at the financial services firm, said performance numbers generally determine the tier, although the specific product or application is also taken into account. Tier 1 is generally reserved for the databases that are hit the most often. Microsoft Corp.'s Exchange Server is tier 2, while a SQL Server database instance might be tier 3, and a file server goes on tier 4, Delgado said.
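The tier layout and the typical placements Delgado describes can be written down as a small lookup table. The sketch below is only an illustration: the field names and workload labels are hypothetical, and in practice the measured performance numbers from testing, not a fixed table, decide where an application lands.

```python
# Tier layout as described in the article; field names are illustrative.
TIERS = {
    1: {"rpm": 15000, "raid": "RAID 10"},
    2: {"rpm": 10000, "raid": "RAID 10"},
    3: {"rpm": 10000, "raid": "RAID 5"},
    4: {"rpm": 7200, "raid": "RAID 5"},
}

# Typical placements Delgado mentions; the workload keys are hypothetical labels.
TYPICAL_TIER = {
    "busiest_database": 1,
    "exchange_server": 2,
    "sql_server_instance": 3,
    "file_server": 4,
}


def describe_placement(workload):
    """Return a one-line description of the typical tier for a workload."""
    tier = TYPICAL_TIER[workload]
    spec = TIERS[tier]
    return f"{workload}: tier {tier} ({spec['rpm']} rpm, {spec['raid']})"


if __name__ == "__main__":
    for name in TYPICAL_TIER:
        print(describe_placement(name))
```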
One tool that RiskMetrics uses to help with the data classification is Tek-Tools Inc.'s Profiler. With it, the storage team is able to classify file systems by business group and observe the patterns associated with each group.
"I get some base classification from the Profiler," Delgado said.
The company's EMC Clariion array also has tools to help visualize where the data resides. Delgado said he uses two reports: one showing the RAID groups, color-coded; and a second one providing information about what logical unit numbers (LUNs) are in each RAID group. The array's "migrate LUN" feature helps to shift data from one tier to another with no downtime.
"You don't have to touch the server. You don't have to notify a DBA. It lets you do data classification yourself as the admin," said Delgado. "To me, that's huge."