Buyer's Guide: Data classification tools
By Stephen J. Bigelow, Features Writer
08 May 2006 | SearchStorage.com
As the volume of corporate data continues to grow, storage administrators are faced with two distinct problems. First, not all data can be treated the same, so administrators must match the cost and performance of each tier to the value of each data type across the enterprise. Second, growing data volumes make it increasingly difficult to find specific files once they're stored. Just imagine searching through millions of files to locate a missing memo or important e-mail. The practice of data classification allows a corporation to organize its data according to its relative value so that it can be stored on the appropriate tier and more easily retrieved in the future.
Automated tools play an important role in the data classification process. Many tools allow administrators to discover the data resources that are available, apply uniform classification rules to data across the entire enterprise, create and manage a searchable index of detailed metadata, move data to the corresponding tier and later comb the metadata to conduct detailed searches. This article explains the essential concepts of data classification tools and their role in the enterprise, highlights the leading vendors in the marketplace and offers some advice to help ease purchasing and implementation issues.
Understanding data classification tools
Data classification tools generally cover four main areas to some extent: discovery, classification, search and migration. The discovery process identifies the files and data types available in your infrastructure -- it tells you what you have. Classification works on the discovered data, applying metadata to each file and file type based on a defined set of rules. Metadata is then stored in a database that can be searched and referenced later. The rules themselves may be developed internally within the enterprise or may be imported into the tool from a third-party source. Once implemented, rules can be changed and tweaked over time as business and technical needs dictate.
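As a rough illustration of the discovery and classification steps, the following Python sketch walks a directory tree, applies a small set of rules to each file and collects the resulting metadata in a list. The rule labels, keywords and function names are hypothetical and do not come from any product; a commercial tool would use far richer rules and a real database rather than an in-memory list.

```python
import os
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """A hypothetical classification rule: a label plus a test on path and text."""
    label: str
    matches: Callable[[str, str], bool]


# Illustrative rules only; real rule sets are far more detailed.
RULES = [
    Rule("financial", lambda path, text: "invoice" in text.lower()),
    Rule("email", lambda path, text: path.lower().endswith((".eml", ".pst"))),
]


def discover(root: str) -> List[str]:
    """Discovery: list every file found under root."""
    found = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            found.append(os.path.join(folder, name))
    return sorted(found)


def classify(path: str, rules: List[Rule] = RULES) -> Dict:
    """Classification: apply every rule and return one metadata record."""
    with open(path, "r", errors="ignore") as handle:
        text = handle.read(4096)  # sample only the start of the file
    labels = [rule.label for rule in rules if rule.matches(path, text)]
    return {
        "path": path,
        "size": os.path.getsize(path),
        "labels": labels or ["unclassified"],
    }


def build_index(root: str) -> List[Dict]:
    """Run discovery, then classify each discovered file."""
    return [classify(path) for path in discover(root)]
```

Calling build_index on a share or directory returns one record per file, which is the kind of metadata a tool would store for later searches. Changing the rules over time amounts to editing the RULES list and re-running classification.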
Search capabilities are a natural extension to classification, utilizing the metadata created in the classification process to locate files based on criteria that go well beyond conventional metadata, like filenames or creation dates. Search features are particularly important when data is being classified for archival or compliance purposes. Since data classification is usually coupled with a tiered storage strategy, data migration features (sometimes called policy management) can help to move files across the storage infrastructure. For example, non-critical files can be moved from Fibre Channel (FC) disk to a SATA storage array, or infrequently accessed data can be moved off to a content-addressed storage (CAS) platform. It's important to note that not all data classification tools provide search and migration capability.
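To make the search and migration ideas concrete, here is a minimal Python sketch that searches a small metadata index by label and picks a target tier for each record. The record fields, the "critical" label and the 180-day threshold are assumptions made for illustration; they are not taken from any vendor's policy engine.

```python
from typing import Dict, List


def search(index: List[Dict], label: str) -> List[str]:
    """Return the paths of all records that carry the given label."""
    return [rec["path"] for rec in index if label in rec["labels"]]


def choose_tier(rec: Dict, stale_after_days: int = 180) -> str:
    """Pick a target tier for one record (thresholds and tiers are assumed)."""
    if rec["days_since_access"] > stale_after_days:
        return "CAS"   # infrequently accessed data moves off to CAS
    if "critical" in rec["labels"]:
        return "FC"    # critical, active data stays on Fibre Channel disk
    return "SATA"      # everything else goes to lower-cost SATA


# A hand-made index standing in for the tool's metadata database.
index = [
    {"path": "/fin/q1.xls", "labels": ["financial", "critical"], "days_since_access": 3},
    {"path": "/tmp/notes.txt", "labels": ["scratch"], "days_since_access": 10},
    {"path": "/fin/2001.xls", "labels": ["financial"], "days_since_access": 900},
]

financial_paths = search(index, "financial")
placement = {rec["path"]: choose_tier(rec) for rec in index}
```

In this sketch the policy only decides where a file should live; actually moving the data would be handled by the tool's own migration feature or by an external data mover.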
Analysts are quick to note that data classification tools are becoming more robust and thorough, often able to examine files and documents for keyword sequences and make contextual decisions about the data. "Now it's more about getting in and looking at the data," says Greg Schulz, founder and senior analyst at Storage IO Group. The trick is in giving the tool enough information so it can draw inferences and make intelligent decisions about the data it is examining.
Hardware vs. software
Data classification tools may be implemented as hardware or software. Software-based tools are installed on at least one server in the enterprise, though multiple servers may be aggregated together to improve discovery and classification performance. In fact, multiple servers may be essential for larger organizations managing hundreds of millions (even billions) of files, or requiring hefty classification rates (e.g., 1,000 files per second).
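A quick back-of-the-envelope calculation shows why multiple servers can become necessary. The Python sketch below estimates how long a classification pass takes at a given rate and how many servers would be needed to finish within a time window. It assumes that throughput scales linearly with server count, which real deployments may not achieve; the function names are illustrative.

```python
import math


def hours_to_classify(file_count: int, files_per_second: float) -> float:
    """Hours needed to classify file_count files at a steady rate."""
    return file_count / files_per_second / 3600


def servers_needed(file_count: int, files_per_second_per_server: float,
                   window_hours: float) -> int:
    """Servers required to finish within window_hours (linear scaling assumed)."""
    required_rate = file_count / (window_hours * 3600)
    return math.ceil(required_rate / files_per_second_per_server)


single_server_hours = hours_to_classify(100_000_000, 1000)
servers_for_8_hours = servers_needed(100_000_000, 1000, 8)
```

Under these assumptions, 100 million files at 1,000 files per second take roughly 27.8 hours on a single server, and an eight-hour window would call for four servers.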
Tools may also be implemented as hardware appliances -- essentially dedicated servers running data classification software. Although more expensive than software-based tools, hardware appliances are generally easier to integrate and configure, especially when clustering appliances together, and support a wider range of enterprise operating systems.
Vendors and product selection
The data classification arena is broad -- most vendors have a unique take on the scope, utility and scalability of their own data classification tools. Recognized vendors like Kazeon Systems Inc. take an all-encompassing approach. Kazeon's Information Server IS1200 appliance promises to catalog and classify all files on the network, providing detailed reports intended to help improve storage efficiency. StoredIQ takes an even broader view, building on discovery, classification and migration features to include retention policies and maintaining an audit trail of classified data activity. This general-purpose approach is consistent with the general definition of data classification.
Abrevity Inc. also takes an all-in-one approach with its FileData Classifier software, intended to offer discovery, classification, policy management, security, backup and archiving features for small and midsized businesses (SMB). Comprehensive search capability is provided by Abrevity's separate FileData Manager utility. Even emerging products like Destiny, from startup Scentric, tout a universal approach to data classification, allowing cataloging, classification and control of structured and unstructured data. Arkivio Inc., Network Appliance Inc. and Hewlett-Packard Co. also offer general data classification/information lifecycle management (ILM) products.
But some companies take a narrower view of data classification, catering to specific applications within the enterprise, such as Exchange. One example is the Exchange E-mail Indexing Appliance from Index Engines Inc. The appliance interfaces with the SAN, indexing e-mail and documents during the backup process. Intradyn Inc. focuses on the SMB market, offering the ComplianceVault06 appliance designed for e-mail archiving and retrieval with applications like Exchange and Lotus Notes. NearPoint from Mimosa Systems Inc. also deals specifically with Exchange, providing archive, discovery, recovery and storage management features through a software-based product.
And of course, EMC Corp. touts numerous hardware and software products designed to address the various aspects of ILM technology.
Selecting the right product
Selecting a data classification product can present some unique challenges for an organization. Each tool is different -- often focusing on a particular strength, such as data migration or searching. As a rule, determine what functionality you need from a data classification tool in advance, and then weed out tools that do not provide the desired feature set. Once you narrow the field, a few potential candidates can be thoroughly tested in-house. Analysts suggest the following points that can help you identify the best product for your own production environment.
Consider the product's versatility. Any data classification tool must be compatible with the types of data that you work with. Since the majority of company data is unstructured, global data classification initiatives should use tools that fully support structured and unstructured data. Tools that handle only structured or unstructured data, or are only intended for certain applications, may not meet your objectives.
Consider the product's scalability. Data classification products generally have a practical limit to the number of files that they support. Make sure to select a product that can accommodate your current and anticipated future data volume. Understand the upgrade path so that you can estimate the cost and effort needed to expand the data classification platform later on.
Evaluate the support for external rules. All data classification products rely on a set of rules that drive the classification engine. Early data classification tools relied almost entirely on rule sets created in-house, but many of today's tools can import established rule sets -- often to support medical or legal industries. Also determine if imported rule sets can be modified or adapted to your specific needs.
Consider the impact of hold capabilities. If your primary concern is locating and protecting specific data involved with litigation, consider a data classification tool with litigation-hold (or file-hold) support. That is, when a search is conducted, the data involved in the search can be frozen to prevent modification or deletion -- even if deletion had been previously approved.
Evaluate compatibility with outside tools. Although some data classification tools can manage policies or move data to the appropriate tier natively, many tools look to outside policy managers and data movers to handle those tasks. See that your tool can interface with any external policy managers, migration applications or storage platforms currently in your environment. For example, a data classification tool might identify financial or Health Insurance Portability and Accountability Act data, and then move that data to an existing EMC Centera or another CAS device.
Evaluate the performance characteristics. Understand the time required to discover and work with enterprise data, and determine the maximum amount of data that the data classification tool can support. Also understand how the data classification platform handles data in terms of files and size. "If a vendor tells me they can classify 1 gigabyte [GB] per hour, that might be interesting," Schulz says. "But how many files is that [per hour]?" For example, an organization with a large number of small files may opt for a data classification tool that favors such behavior. An organization with a lower number of large files may do better to select a product that focuses on overall throughput.
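The point about files versus gigabytes can be illustrated with simple arithmetic. The Python sketch below converts a quoted GB-per-hour figure into files per hour for a given average file size. It assumes decimal units (1 GB = 1,000,000,000 bytes) and a uniform average file size, and the sample sizes are made up for illustration.

```python
BYTES_PER_GB = 1_000_000_000  # decimal gigabyte, an assumption of this sketch


def files_per_hour(gb_per_hour: float, avg_file_bytes: float) -> float:
    """Convert a GB-per-hour rate into files per hour for an average file size."""
    return gb_per_hour * BYTES_PER_GB / avg_file_bytes


small_files_rate = files_per_hour(1, 50_000)        # 50 KB average file
large_files_rate = files_per_hour(1, 100_000_000)   # 100 MB average file
```

With these assumed sizes, the same 1 GB per hour works out to about 20,000 files per hour for 50 KB files but only 10 files per hour for 100 MB files, which is why a vendor's throughput figure should be read alongside your own file-size profile.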
Best practices for implementation
Regardless of how you implement a data classification tool, analysts suggest keeping a close eye on performance figures during the classification or search process (such as files per hour or GB per hour) and verifying that you are receiving acceptable performance. Make sure that the product does not become bogged down under significant classification processing tasks. Poor figures, or performance that falters when the environment scales, may suggest a need to reconfigure the data classification infrastructure. Some other general policies can help you get the most from any data classification tool.
No substitute for human intervention. No tool can determine the value of your corporate data, so corporate leaders must be involved in any data classification initiative. Tools are improving, and prefabricated rule sets are increasingly common. This often eliminates the need to develop classification guidelines from scratch, but even the most comprehensive rules must be tweaked and refined for your specific business.
Avoid the urge to overclassify. When properly implemented, data classification can enable efficient and cost-effective storage, but it's sometimes hard to know when to stop. Many organizations only support up to three storage tiers or service levels: usually high-performance FC SAN, some form of low-cost, high-volume SATA storage and a tertiary tier that is often tape. As a rule, classification schemes typically reflect these tiers. Applying finer levels of classification than the tiers allow yields little benefit.
Don't be afraid to get help. If there isn't enough in-house expertise to address your data classification initiative, consider contracting the services of a consultant who specializes in your industry -- particularly legal and financial industries. An outside consultant can sometimes help mitigate the effects of internal politics and bring a focus to the classification process that might not otherwise be possible.
Start small and build out. Companies can find data classification to be a daunting exercise, so analysts suggest focusing their efforts on a specific objective to start and then expanding the initiative in phases over time. "Do a pilot (a prototype) to address a particular business need or pain point," Schulz says. "Use it to build up support."