Data classification: The vendors

Corporate information resources regularly stretch into the terabytes (thousands of gigabytes) with many thousands (even millions) of individual files to contend with -- but the problem is now far more complex than simply "finding files". Finding specific files among this confusing hodgepodge of information is inefficient and frequently incomplete. Companies often fail to recognize the importance of their data and its impact on everyday business operations. The process of "data classification" attempts to fill this void by helping businesses understand what data is actually available, its location in the enterprise, how that data is being accessed and how it must be protected to meet legal and regulatory requirements.

The process of finding data, identifying data types, pinpointing storage locations and extracting meaningful file information demands support from powerful software tools. Major vendors in the data classification market include -- in no particular order -- Kazeon Systems Inc., Archivas Inc., AppIQ Inc. (providing storage software to Hewlett-Packard Co.), Network Appliance Inc., Abrevity Inc., EMC Corp., Arkivio Inc., Application Matrix (a service provider) and StoredIQ Corp. Each vendor (see sidebar) focuses its efforts on specific aspects of enterprise data management.

Managing long-term archives

Numerous industries require long-term archival storage. Unlike a typical backup, archival records must be readily available for access, though the data is not accessed frequently enough to justify high-performance storage. ArC software from Archivas runs on clusters of low-cost Intel/Linux servers to enable long-term online archiving of fixed content -- information that cannot change over time but must remain available for access, such as medical records (e.g., patient X-rays or MRI data) or corporate e-mail. According to Asim Zaheer, vice president of marketing at Archivas, an Archivas implementation appears to enterprise applications as a single (very large) device accessed through a file system interface.

"You can set policies as well for the content coming in," Zaheer says. "So you can set a retention policy that says 'this content cannot be deleted for ten years.' We also store metadata associated with the content that is coming in so that you can identify content by its metadata, or search for content by its metadata." Zaheer notes that metadata can include file creation dates, file originator and even the deletion date. The metadata, file and associated policies are all stored together as an entity that Zaheer terms an "archive object." In addition, Archivas takes steps to prevent duplicate information. "We also authenticate files as they come in," he says. "As a file is stored, we run a cryptographic hash key against it and give it a unique digital signature so that we ensure there's only one copy of that file on the archive."

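The "archive object" idea that Zaheer describes can be illustrated with a minimal Python sketch. It is not Archivas code: the class names, the `store`/`delete`/`search` methods and the choice of SHA-256 are all assumptions made for illustration, since the article does not name the hash function or any interface. The sketch bundles content, metadata and a retention date into one object, uses the content hash as a signature so that identical content is kept only once, and refuses deletion while the retention period is in force.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class ArchiveObject:
    """Hypothetical bundle of file content, metadata and retention policy."""
    content: bytes
    metadata: dict
    retain_until: date
    digest: str = field(init=False)

    def __post_init__(self):
        # SHA-256 is an assumption; the article only says "cryptographic hash".
        self.digest = hashlib.sha256(self.content).hexdigest()


class Archive:
    """Toy in-memory archive keyed by content digest."""

    def __init__(self):
        self._objects = {}  # digest -> ArchiveObject

    def store(self, content, metadata, retention_days, today):
        obj = ArchiveObject(content, metadata,
                            today + timedelta(days=retention_days))
        # Identical content yields the same digest, so a second copy is not
        # stored; the object that was archived first is returned instead.
        return self._objects.setdefault(obj.digest, obj)

    def delete(self, digest, today):
        obj = self._objects[digest]
        if today < obj.retain_until:
            raise PermissionError("retention policy still in force")
        del self._objects[digest]

    def search(self, **criteria):
        # Return every object whose metadata matches all given key/value pairs.
        return [o for o in self._objects.values()
                if all(o.metadata.get(k) == v for k, v in criteria.items())]

    def __len__(self):
        return len(self._objects)
```

In this sketch, storing the same bytes twice leaves a single object in the archive, and a call to `delete` before the retention date raises an error rather than removing the content.
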
Data Classification Vendors

  • Network Appliance (NetApp)
  • Application Matrix

Extracting business intelligence

When discussing data classification, the notion of information as a "business asset" recurs time and again. Many companies realize that simply "finding" data has relatively little benefit to the enterprise. Management must be able to process and interpret its information in order to make solid business decisions. Abrevity emphasizes intelligence in its SearchBASE data classification product. "One of the largest needs … is in the area of delivering business intelligence -- classification of data to empower business intelligence," says Bill Reed, consultant at Abrevity.

"Maybe I'm a sales or marketing executive and I'm trying to extract specific information from last quarter related to my sales," Reed says. "I've got hundreds of thousands or millions of files and I need to find 'the' ten files that have exactly what I'm looking for in them. And then, I need to extract that information out of them so I can use the data with my analysis tools or create reports from it." SearchBASE software extracts data from files (e.g., names, companies, keywords, e-mail addresses and so on), and helps to classify information based on a user-defined compliance or business taxonomy, including retention and deletion policies. After classification, information can be migrated to other storage solutions for backup, cost savings (tiering) or performance purposes. Duplicate or unwanted content can also be managed.

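To make the extract-then-classify idea concrete, here is a minimal Python sketch. It does not reflect how SearchBASE works internally; the `TAXONOMY` table, the function names and the simple keyword matching are assumptions chosen only to show the general pattern of pulling values such as e-mail addresses out of file text and assigning categories from a user-defined taxonomy.

```python
import re

# Simplified e-mail pattern, good enough for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Hypothetical user-defined taxonomy: category -> trigger keywords.
TAXONOMY = {
    "sales": {"invoice", "quota", "revenue"},
    "legal": {"contract", "subpoena"},
}


def extract(text):
    """Pull e-mail addresses and the set of lowercase words out of text."""
    return {
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "words": set(re.findall(r"[a-z]+", text.lower())),
    }


def classify(text, taxonomy=TAXONOMY):
    """Return every category whose keywords appear in the text."""
    words = extract(text)["words"]
    return sorted(cat for cat, keywords in taxonomy.items() if words & keywords)
```

A real product would work from parsed file formats and far richer rules, but the shape is the same: extracted values feed a taxonomy lookup, and the resulting categories can then drive retention, deletion or migration decisions.
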
Pressing for integrated management

It's impossible to make informed decisions unless management can see the content and its location in the enterprise. In many cases, customers employ an assortment of unrelated storage resource management and hierarchical storage management tools to help them identify and understand their data assets. Companies like Arkivio are pushing hard to develop tools that help automate key classification tasks such as discovery, grouping, policy assignment and data migration. Each feature is integrated into a single product. "We do have a classification system built into the product," says Buzz Walker, vice president of marketing for Arkivio. "There's one place -- one management console -- you can manage a billion files behind 'one pane of glass.'"

Walker notes that data identification is often the most difficult part of any classification process. "If you don't know what you have, how do you even start making up processes for it? Our product goes out and discovers what data and what storage you have and what's on there and what users/groups are using that data." Arkivio's agentless approach can perform detailed discovery across the enterprise. "Without installing any agents on any computers in the environment, we'll go out and tell them what they have." Once content is understood, the classification process can begin. According to Walker, Arkivio software allows users to create groups, apply the necessary policies to groups and then move data to the best storage solution available in the enterprise. "We don't require them to classify everything," Walker says. "Just create classifications around groups that are important to you." He cites an example of one customer who found a large number of media files on its network during the discovery process (a violation of corporate policy). They created a group for those unwanted files, used the Arkivio software to locate the unwanted files and then moved them off the network into secondary storage. Another policy was defined to delete the unwanted data after 30 days.
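
The media-file example can be sketched in a few lines of Python. This is not Arkivio's software or API; the `FileRecord` type, the extension list and the `apply_policy` function are hypothetical, and the sketch works on an in-memory list rather than a real file system. It shows the general pattern: define a group, move its members to secondary storage, then delete them once a 30-day grace period has passed.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from pathlib import PurePosixPath
from typing import Optional

# Assumed definition of the "unwanted media" group.
MEDIA_EXTENSIONS = {".mp3", ".avi", ".mpg", ".wmv"}


@dataclass
class FileRecord:
    """Hypothetical discovery result for one file."""
    path: str
    tier: str = "primary"
    moved_on: Optional[date] = None


def in_media_group(rec):
    return PurePosixPath(rec.path).suffix.lower() in MEDIA_EXTENSIONS


def apply_policy(records, today, grace_days=30):
    """Move grouped files to secondary storage, then drop them after the grace period."""
    kept = []
    for rec in records:
        if not in_media_group(rec):
            kept.append(rec)                      # not in the group: untouched
        elif rec.tier == "primary":
            rec.tier, rec.moved_on = "secondary", today
            kept.append(rec)                      # first pass: migrate
        elif today - rec.moved_on < timedelta(days=grace_days):
            kept.append(rec)                      # still inside the grace period
        # otherwise the grace period has expired and the record is deleted
    return kept
```

Running `apply_policy` once migrates the media files; running it again 30 or more days later removes them, while files outside the group are never touched.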

Combining hardware and software solutions

Since data classification practices readily support tiered storage, some companies opt to update or revamp their storage systems as part of the classification process. Large companies like EMC are recognizing the value in services, bringing more holistic solutions, including hardware, software and service offerings, to the data classification process. EMC hopes that a closer relationship between hardware and software can help to bridge the gap that has traditionally existed between IT and management. EMC suggests that such an "all-encompassing" approach plays directly into information lifecycle management (ILM), which embraces all information across an enterprise regardless of format, storage mechanism or creation tools. "It's an ability to drive a single policy management, single access, single security policies and so on," says Whitney Tidmarsh, vice president of solutions marketing for EMC Software. "Our real value is an understanding, a practice and the right set of holistic tools to help companies manage information of all types across their organization."

Tidmarsh explains that EMC's hardware, software and service approach to data classification and management reaches across several important IT layers. First, the company offers hardware products like Symmetrix (for high-availability tasks) and Centera (for long-term archiving tasks), as well as other storage products like Celerra or Clariion -- each targeted to different availability and service level requirements. Next, Tidmarsh points to the storage management layer, citing EMC's ControlCenter software for storage setup and monitoring. At the application layer, Tidmarsh looks to EMC's Documentum family of software to capture, classify and manage content. EMC's Database Extender is an alternative software product allowing for control of databases and other structured content.
