Apps to classify and find data

Data classification may seem like an arcane art, but a growing set of information classification and management products makes sorting through your company's data easier than ever. These tools provide the foundation for litigation discovery, cost reduction, record management and retention, archiving, deduplication and usage control.

This article can also be found in the Premium Editorial Download "Storage magazine: Salary survey reveals storage skills are in demand."


The problem of controlling and managing the ever-growing volume of data living on a company's file shares is unrelenting. Companies need to find ways to reduce the overall cost of storing data, but must often keep it for longer periods of time to adhere to new regulations and requirements. Adding to the problem are more stringent privacy laws and court-ordered mandates to quickly find and produce specific files adrift in a vast sea of unstructured data.

New data classification tools, which work below the file-system attribute level, crack open files and extract their content to allow complex searches, reporting, lifecycle management and legal retention based on policies. All of the products described on the following pages provide data classification and reporting, although their functionality differs widely (see "Data classification product sampler," PDF file). Most are relatively new products that are still in early release cycles with expansive roadmaps.

Unstructured data
Unlike information stored in well-defined application databases, or in semistructured e-mail servers and document management systems, the file shares in most companies are a dumping ground for files in more than 400 formats. As corporations control the data in ERP apps and e-mail servers, users are increasingly using Microsoft Office apps to store their personal productivity files, which are often critical to the business' day-to-day operation. This leads companies to provide higher levels of service for this data at an ever-increasing cost. Unfortunately, this data may be anything from pictures of the grandkids to highly confidential customer documents containing private information.

The problem is that this data is stored in a user-defined fashion that's rarely controlled, searchable or organized in any meaningful way. A company must therefore find a way to separate mission-critical data from data that requires less costly service levels without adversely affecting productivity. Equally critical is the need to identify older data, duplicate copies of data and orphaned data that's no longer needed and can be deleted. Finally, when the need to find, protect or destroy specific pieces of information arises because of litigation or new regulations, how do you quickly respond to these demands? The short answer is to classify data, ideally with automated, user-friendly tools.
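Finding duplicate copies is one of the simpler classification tasks to picture. The following is a minimal Python sketch, not how any product in this article works: it groups files by size first and then by a SHA-256 hash of their contents. The function names `file_digest` and `find_duplicates` are hypothetical, chosen for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group files under root whose contents are byte-for-byte identical."""
    by_size = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in same_size:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Comparing sizes before hashing avoids reading most files at all, which matters when a share holds millions of documents.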

The primary purpose of information classification and management (ICM) tools is to provide intelligence about the files residing in file shares or shared drives. These files may reside on individual Unix/Linux or Windows servers connected to SAN or DAS, in NAS filers or serviced by NAS blades inside a SAN chassis. Historically, management tools for file systems focused on the file attributes from the file systems themselves. ICM tools discover the file attributes of a file system, but their power and functionality come from their ability to actually read the contents of a file and search for specific patterns (like Social Security or credit card numbers) or, in some cases, to create a complete index of all text in the document (including numeric information).
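To make the idea of a content pattern search concrete, here is a small Python sketch. It assumes Social Security numbers are written as 123-45-6789 and card numbers as 13 to 16 digits with optional spaces or hyphens; the Luhn checksum is applied to card candidates to cut down on false positives. The names `SSN_RE`, `CARD_RE`, `luhn_valid` and `scan_text` are illustrative and don't come from any of the products discussed.

```python
import re

# Assumed formats: SSNs as 123-45-6789; card numbers as 13 to 16 digits,
# optionally separated by single spaces or hyphens.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for position, char in enumerate(reversed(digits)):
        value = int(char)
        if position % 2 == 1:  # double every second digit from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def scan_text(text: str) -> dict[str, list[str]]:
    """Return candidate SSNs and Luhn-valid card numbers found in text."""
    cards = []
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            cards.append(digits)
    return {"ssn": SSN_RE.findall(text), "card": cards}
```

A regular expression alone will flag any long run of digits, such as an order number; the checksum step is one simple way to reduce that noise. Commercial tools may rely on richer context than this sketch does.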

They create a repository of metadata by "crawling" a file system or reading a data stream and capturing the file attributes and/or the content of each file. They don't actually store the file in the repository, but instead store the data as individual attributes or entire full-text indexes. The initial process for all but two of the tools described in this article is fairly slow (they may have to read millions of documents); however, after the initial crawl, they all have the ability to do incremental crawls on a periodic basis, which run faster with less impact on the file system. The repositories are then searched to produce reports or take actions on the file. In general, the size of a repository will run from 3% to 15% of the total amount of storage being classified, depending on the type of files and the amount of data kept in the repository (file attributes, specific patterns or full text).
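The crawl-then-incremental-crawl pattern can be sketched in a few lines of Python. This is an assumption-laden illustration, not any vendor's design: the repository is a SQLite table named `files` (a hypothetical schema), only file attributes are captured, and a file counts as changed when its size or modification time differs from what was recorded.

```python
import os
import sqlite3


def open_repository(db_path: str = ":memory:") -> sqlite3.Connection:
    """Create or open the metadata repository; file contents are never stored."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS files (
               path  TEXT PRIMARY KEY,
               size  INTEGER NOT NULL,
               mtime REAL NOT NULL,
               owner INTEGER NOT NULL
           )"""
    )
    return conn


def crawl(conn: sqlite3.Connection, root: str) -> int:
    """Walk root and record attributes of new or changed files.

    Returns the number of rows inserted or updated, so a second crawl of an
    unchanged tree returns 0.
    """
    known = {
        path: (size, mtime)
        for path, size, mtime in conn.execute("SELECT path, size, mtime FROM files")
    }
    changed = 0
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if known.get(path) == (info.st_size, info.st_mtime):
                continue  # unchanged since the last crawl
            conn.execute(
                "INSERT OR REPLACE INTO files (path, size, mtime, owner) "
                "VALUES (?, ?, ?, ?)",
                (path, info.st_size, info.st_mtime, info.st_uid),
            )
            changed += 1
    conn.commit()
    return changed
```

The saving in a real product would come from skipping the expensive step, reading and indexing file contents, for files whose attributes haven't changed. This sketch also ignores deleted files, which a production crawler would need to purge from the repository.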

Refining reports
While all the products described in this article provide canned reports, data classification exercises often produce reports that have thousands, or even millions, of lines that may need further refinement and data manipulation to be useful for your data management tasks. To better understand and analyze a report's trends, you'll probably need to consolidate, filter and manipulate the report's data.

In most cases, using a relational database is the best method for analyzing and reporting the data. Even if your information classification and management (ICM) product creates reports, the size of the data set or limitations of the report generator will often require summarizing or filtering the data for decision making. Most of the tools don't allow much flexibility in output (graphs, tables, etc.). Spreadsheets can be appropriate for small data sets or for creating specific reports, but databases are generally the best repositories because of their ability to filter and sort data based on specific queries and extracts. All of the ICM products described in this article can output their data in either comma-separated values (CSV) or XML formats, which can easily be imported into databases or spreadsheets.
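As a rough illustration of that workflow, the Python sketch below loads a CSV export into an in-memory SQLite database and summarizes it by owner. The column names (`path`, `owner`, `extension`, `size_bytes`) are assumptions made for the example; an actual product's export will have its own layout.

```python
import csv
import sqlite3
from typing import Iterable


def load_report(lines: Iterable[str]) -> sqlite3.Connection:
    """Load CSV report lines (header row first) into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE report "
        "(path TEXT, owner TEXT, extension TEXT, size_bytes INTEGER)"
    )
    rows = (
        (r["path"], r["owner"], r["extension"], int(r["size_bytes"]))
        for r in csv.DictReader(lines)
    )
    conn.executemany("INSERT INTO report VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def usage_by_owner(conn: sqlite3.Connection) -> list[tuple[str, int, int]]:
    """Return (owner, file count, total bytes) rows, largest consumers first."""
    return conn.execute(
        "SELECT owner, COUNT(*), SUM(size_bytes) FROM report "
        "GROUP BY owner ORDER BY SUM(size_bytes) DESC"
    ).fetchall()
```

Passing an open file object to `load_report` works the same way as passing a list of lines. Once the data is in a table, further filters (by extension, by path prefix and so on) are one query away.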

Reporting on classification
Some offerings have canned reports for file aging, file type, file ownership and so on, with detailed and summary levels, and the ability to chart and graph data (see "Refining reports"). Most have graphical report engines that allow you to design custom reports by clicking on search patterns or entering regular/Boolean expressions as you would with an Internet search engine. Some ICM tools go beyond basic classification and search capabilities and can manage, archive, retain, deduplicate or delete data; the long-term roadmaps of all ICM products include plans to introduce new features in this area. Most of the tools come from small independent vendors, although a few have established strong OEM partnerships or are looking to do so.
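A file aging report is essentially a tally of capacity by age bracket. The Python sketch below shows the idea under stated assumptions: the bucket boundaries are hypothetical, age is measured from last-modified time, and the input is a list of (modified time, size) pairs such as a metadata repository might supply.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable

# Hypothetical age buckets: (label, upper bound in days).
BUCKETS = [("0-90 days", 90), ("91-365 days", 365), ("1-3 years", 3 * 365)]
OLDEST = "over 3 years"  # open-ended bucket for everything older


def age_bucket(modified: datetime, now: datetime) -> str:
    """Return the label of the bucket that a file's age falls into."""
    age_days = (now - modified).days
    for label, limit in BUCKETS:
        if age_days <= limit:
            return label
    return OLDEST


def aging_report(
    files: Iterable[tuple[datetime, int]], now: datetime
) -> dict[str, int]:
    """Return total bytes per age bucket for (modified, size_bytes) pairs."""
    totals = Counter()
    for modified, size in files:
        totals[age_bucket(modified, now)] += size
    return dict(totals)
```

Passing `now` explicitly keeps the report reproducible, which helps when the same repository snapshot is analyzed more than once.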

The list of data classification tools in this article isn't exhaustive; the products discussed were chosen because they're recognized as market leaders by users and analysts, or because their functionality is unique and compelling. Many new vendors have entered this space in the last six months, and some with existing file-attribute classification tools are adding content capability.

Abrevity Inc.
Abrevity offers two data classification products, FileData Classifier and FileData Manager. They generally share the same functionality, but FileData Classifier, targeted at the small- to medium-sized business (SMB), has a 3TB limit and can't be combined with Abrevity's database module. The products have broad capabilities, including tagging, migration, discovery and copy. Abrevity has done a good job of finding other vendors to partner with to offer greater functionality. This includes a partnership with Intellisophic Inc. to provide vertical industry regulatory lexicons and taxonomic content. The firm recently announced a fully integrated SMB-focused archiving appliance with disk and tape capability using QStar Technologies Inc.'s hierarchical storage management and Breece Hill LLC's disk and tape storage appliance, which gives it an integrated data management solution that includes tape. In addition, the products can perform specialized database and laboratory classifications that are offered as advanced upgrade modules.

While it hasn't made the kinds of inroads with larger storage OEMs that some of the other vendors have, Abrevity's focus on partnerships for technology, as well as sales and integration, allows it to offer a strong solution for the SMB market and department/group deployments in enterprises. It has the lowest entry price point by virtue of offering its product as software in small increments with modular extensions, rather than as a fully integrated offering or appliance. In that respect, it may offer a good way to get your feet wet with data classification. However, its lack of larger OEM partnerships and its SMB focus may not make it the right choice for larger enterprises.

Index Engines Inc.
Index Engines differentiates itself from all the other vendors cited here through its focus on performance and scalability. It's the only vendor that can process data at "wire speed," and its appliances for SANs, LANs and tape drives can reside inline between a SAN target and a backup server. The products scale to handle a very large environment of multiple servers and geographical locations managed through a single interface. Products can also classify data on tapes and provide retrieval from tape media. Index Engines doesn't provide canned actions for migration, retention, litigation hold and so on, but it does have a published open API that allows storage vendors and end users to integrate other products with its search and reporting engine.

If speed and flexibility are your primary requirements, and you can create your own integrated data management routines, Index Engines may be a good choice. Its offerings are designed for users with a technical focus. The products can best be described as toolboxes; to use them you'll need to write your own programs to their APIs to perform actions on your data, or wait for some of its OEM relationships to bear fruit. Its products outperform those of the other vendors when installed inline, but if you need a full-featured data management suite that queries data at rest, this may not be the product for you.

Kazeon Systems Inc.
While not the oldest of the surveyed vendors, Kazeon is the best known of the ICM tools vendors. It has established a strong OEM relationship with Network Appliance (NetApp) Inc., which is reselling Kazeon products under its brand name. Kazeon has the most units in production environments, and is aggressively working on new features and functionality. Its IS1200 platform, which runs on a modified Linux kernel on Dell servers, is available as a rack-mountable appliance that can be clustered for greater speed and capacity. Kazeon offers a broad cross-section of functionality, including predefined reports and searches, active data migration capability, Google desktop integration and support for EMC Corp. Centera retention storage.

IS1200 is a general-purpose offering designed to appeal to a broad range of requirements and vertical markets. Its integration with primary storage vendors and Google, and its partnership with NetApp, mean it will most likely remain a viable choice. Even though it was the first data classification tool to market, its functionality in some key areas is just now catching up with the competition; its report generators and lexicons are far behind some other ICM tools in terms of capability and ease of use. The appliance model also means that its entry-level price is high, and some users have expressed concern about not being able to choose their own Linux versions or hardware platforms.

Scentric Inc.
Scentric's Destiny is a relatively new offering that's primarily integrated with Microsoft's Windows OS to provide native access to CIFS file systems and Microsoft Exchange. The reporting engine has extended graphical and charting functionality, which makes for a near dashboard-like experience out of the box. Destiny runs on Windows hosts, and can be installed on existing servers in the storage environment.

Scentric has separated the classification and catalog (repository) components so they can be hosted separately. Destiny includes a fairly extensive predefined library of classification rules for storage management and compliance. The user interface for policy action based on classification rules is well thought out and is a point-and-click experience.

Scentric has integrated with several archive and content-addressed storage platforms, making copying and retaining data for specific periods a quick and seamless exercise. Packaging its functionality as a set of software components allows for greater flexibility in deployment. At this time, Scentric's product is a Microsoft-focused offering; if you have a lot of Unix-based NFS data in addition to your Microsoft CIFS data, it may not be the right vendor for your needs. Scentric has some customers evaluating its product, but hasn't sold any licenses yet; the lack of reference sites, as well as an untested support model, is something potential buyers should consider.

StoredIQ Inc.
StoredIQ is the veteran player of this group. Originally known as Deepfile, it began as an SRM offering for unstructured data in 2001. It's since been transformed into a full-featured ICM product. Originally targeting the healthcare compliance space, it has an extensive offering of predefined classification rules for verticals like healthcare and financial services, as well as more generic functions like human resources and Sarbanes-Oxley compliance.

StoredIQ's ICM 5000 product ships as an appliance in a rack-mounted configuration on blades with integrated storage. It runs on Fedora Linux with a PostgreSQL database for the repository. It has a strong partner relationship with EMC and is fully integrated into the Centera platform. This gives it well-integrated data-retention policy management with advanced features like single-instance storage. It has recently added support for NetApp data movement, which offers even greater flexibility. Of all the vendors mentioned here, its canned libraries are the most extensive, including complex linguistics to avoid false positives and find data in context.

While not marketed as vigorously as some of its competitors, StoredIQ offers a mature and ready-to-use product for a variety of data management tasks. Its platform is delivered as a fully integrated appliance with disks. This constrains the amount of data the appliance can handle and doesn't allow you to store metadata on a secure file-system target. Currently, its product doesn't allow multiple devices to be managed from a common interface, although that capability is on its roadmap. Its integrated model, while high on ease of use and functionality, doesn't allow the flexibility of a software solution and keeps entry-level costs somewhat higher.

Considerations for choosing an information and classification management vendor
  • Integration with other tools
  • Expanded functionality beyond classification and discovery
  • OEM relationships
  • Reference customers
  • Predefined pattern searches for vertical markets
  • Enterprise-ready (scalability and support)
  • Product stability and maturity
  • Professional services or third-party partners for business processes and data-retention policies

Attaching policies to classified data
Cost pressures and demand for longer, more controlled retention and access to file-system information are forcing companies to look at data classification tools. Products vary in features and performance, but all extract file attributes and content from file systems and store data in repositories for reporting, policy enforcement and data archiving. These tools can be the foundation of many activities, including litigation discovery, cost reduction, record management and retention, archiving and data deduplication. While the tools provide base functionality, an effective solution will include business processes, user education and well-defined policies.

The ICM market is still emerging, and most of the products are in their second release cycle. In selecting products to evaluate, pay close attention to the relationships and integration ICM vendors have established with the vendors in your environment. This market is currently hot: more than a dozen companies have offerings, many of them in the early stages of development. Spend some time gathering your requirements, choose the vendor that best fits those requirements, and carefully investigate its plans for product enhancements and directions (see "Considerations for choosing an information and classification management vendor"). The right choice, combined with good policies and business processes, will provide the basis for you to classify, control and manage your unstructured data.

This was first published in November 2006
