Data classification tools can help organizations locate their data, then organize the data based on user-defined rules. Once the data is classified, many of these tools can migrate data to the appropriate storage subsystem or tier.
The product snapshots below include key specifications for a cross-section of data classification tools. The products were selected based on input from industry analysts and SearchStorage.com editors, and the specifications, which were provided by vendors, are current as of January 2008. The specs are periodically updated, and vendors are welcome to submit their updates and/or new product specs to Matt Perkins.
Product: FileData Classifier and FileData Manager from Abrevity Inc.
Data types supported: Hundreds of file types (MS Office, Adobe PDF, media files, etc.), email (Exchange pst, msg), MS Access DB, ODBC Databases (SQL, Oracle).
Metadata: Each product creates an attribute-based metadata repository including parsing out the file path words and file name words. Query engine enables the user to create attribute-based queries using Boolean logic (and/or/not and ranges).
Other rules criteria: Extraction engine allows users to search and index the content of files, including keywords/phrases, patterns (such as social security numbers, credit card numbers, account numbers, etc.) or create custom extraction goals.
Policy engine: The policy engine can take "action" on query results including tagging, file movement/migration (copy, move, archive, delete) and extraction.
Automated classification: All classifications (tagging, movement, extraction) can be automated through the Saved Query feature.
Manual classification: Files can be tagged, moved and content extracted from them.
Third-party integration: Supports any third-party data movers that use XML, CSV or text-based input.
Search features: No search.
Tiering support: Users can define one or more volumes as a tier of storage and use our policy engine to move the data between tiers. We can integrate with all common VTL and archive systems. The File Genealogy feature tracks data across tiers regardless of how many times it has been moved.
Management Features: Saved queries, automated tagging, role-based actions.
Operating system: Both tools run on MS Windows but we can support any file system including Unix file systems through NFS.
System requirements: 2GHz processor, 1GB RAM, Gigabit Ethernet and two internal hard drives; Windows 2000, 2003, XP or Vista.
Availability: FileData Classifier and FileData Manager 3.0 currently available.
Base cost: $30,000 for the first 5TB and $3,000 per extra managed TB.
Detailed specs: http://www.abrevity.com/products.php
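The pattern-based extraction described under "Other rules criteria" above (Social Security numbers, credit card numbers and similar) can be illustrated with a minimal Python sketch. This is not Abrevity's extraction engine; the regular expressions, the function names `luhn_valid` and `find_sensitive`, and the use of the Luhn checksum to filter out random digit runs are all assumptions made for illustration.

```python
import re

# Hypothetical patterns for illustration only; real products ship their own rules.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_sensitive(text: str) -> dict:
    """Return SSN-like strings and Luhn-valid card-like numbers found in text."""
    ssns = SSN_RE.findall(text)
    cards = []
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"\D", "", match.group())  # drop spaces and dashes
        if luhn_valid(digits):
            cards.append(digits)
    return {"ssn": ssns, "card": cards}
```

A classifier built this way would typically tag any file whose extracted text yields a non-empty `ssn` or `card` list, and a policy engine could then act on that tag.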
Product: Arkivio Inc.; ARKIVIO auto-stor
Data Types Supported: File system data on all Windows, UNIX and Linux systems that support the CIFS or NFS protocols, including file system data on NAS devices from NetApp, EMC and others.
Metadata: File metadata collected includes file type, size, creation date, last modified date and last access time.
Other Rules Criteria: ARKIVIO auto-stor also collects contextual information for each file: file ownership from the directory service and file location from the directory structure.
Policy Engine: Based on the files discovered and their classification, the policy engine can simulate data movement and deletion as well as actually move, copy, delete, restore or migrate data (migration moves data to the repository and leaves behind a stub file or link so end users can still access it). Data is moved across heterogeneous servers and file systems without using server agents, and the product is out of the data path by design. The policy engine also uses its contextual classification system to select the appropriate subset of data to send to a content indexing system for later search and retrieval using index-based tools.
Automated Classification: Yes; a standard set of classes is established out of the box, including obvious categories like Office files. The list is easily customized for each company through a four-step wizard.
Manual Classification: A four-step wizard enables classification of the file system into any groups based on the needs of the company. These groups can be created using any combination of the metadata and contextual information collected.
Third-party Integration: The ARKIVIO auto-stor solution works with all devices that support a CIFS or NFS interface across SAN, NAS, and DAS network architectures. In addition it is integrated into API sets for the EMC Centera, EMC Celerra, and all NetApp filers. It is also interoperable with the HDS HCAP CAS solution.
Search Features: Yes; search and retrieval from an archive repository is built in, based on the classification criteria and date ranges of interest. In addition, data can be sent to a content indexing tool, and that tool's search engine can be used to select data.
Tiering Support: ARKIVIO auto-stor supports all storage systems that support a CIFS or NFS interface.
Management Features: Agentless discovery is the key step required before classification, and the ARKIVIO solution includes it as an integrated part. Data that is discovered is available to the classification system's four-step wizard, the policy engine, the simulator, and a detailed reporting and monitoring system for a complete ILM data management solution.
Operating System: All Windows, UNIX and Linux operating system versions that support CIFS or NFS
System Requirements: The central management server runs on a Windows 2003 server with dual Pentium processors and 1GB of memory.
Vendor Comment: Arkivio, Inc. provides leading Information Lifecycle Management (ILM) software solutions for file-system archiving, regulatory compliance and retention, data classification, consolidation to NAS and SAN, and backup and disaster recovery optimization. ARKIVIO auto-stor enables enterprises to agentlessly profile their data and automate its discovery, classification and placement across heterogeneous storage environments.
Availability: Currently available
Base Cost: Starting at $4,000/TB and $10,000 for central console
Detailed Specs: http://www.arkivio.com/2/products/autostor.asp
Vendor URL: http://www.arkivio.com/
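The policy behavior described in the Policy Engine entry above (select files by metadata, simulate the movement, then migrate and leave a link behind) can be sketched in Python. This is a simplified illustration, not ARKIVIO's implementation: the function names `select_stale` and `migrate` are hypothetical, last-access age is used as the only criterion, a symbolic link stands in for a vendor stub file, and the repository is assumed to live outside the scanned tree on a platform that supports symlinks.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional, Tuple


def select_stale(root: Path, max_idle_days: float,
                 now: Optional[float] = None) -> List[Path]:
    """Return regular files under root not accessed in the last max_idle_days."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and not p.is_symlink() and p.stat().st_atime < cutoff
    )


def migrate(root: Path, repository: Path, max_idle_days: float,
            simulate: bool = True) -> List[Tuple[Path, Path]]:
    """Build the move plan; when simulate is False, also carry it out."""
    plan = []
    for src in select_stale(root, max_idle_days):
        dst = repository / src.relative_to(root)
        plan.append((src, dst))
        if not simulate:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dst))
            src.symlink_to(dst)  # assumption: a symlink stands in for the stub
    return plan
```

Calling `migrate(root, repository, 365)` returns the list of planned moves without touching any data, which mirrors the simulate-first workflow; passing `simulate=False` performs the same plan.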
Product: Brocade Communications Systems Inc.; StorageX software
Data Types Supported: StorageX supports any type of unstructured file data. This includes project data, group data, departmental data, home directories etc.
Metadata: For migrated data, the global namespace is used to re-point the users to the new location. Classification of data is based on any of the following parameters: last accessed, created or modified times, directory size, number of files, etc. The key to a tiered storage solution is transparency for the users.
Other Rules Criteria: The classification engine is flexible enough to satisfy any admin-defined criteria. StorageX contains a scriptable engine for the admin to specify the criteria for selection of files or data movement. The storage balancing policy can be used to balance the storage across several machines based on the capacity utilization of each volume. Migrations are transparent to the user because they are done behind the veil of the namespace.
Policy Engine: The policy engine moves data based on administrator-selectable criteria. The global namespace is updated to transparently point the users to the new location without having the users re-map drive letters to the new location. Global namespace is the enabling technology to future proof the storage and optimize the usage.
Automated Classification: Yes; it is a policy-driven engine that allows an admin to migrate data based on a schedule. For example, the scan can be done nightly and the migration can be done over the weekend. The policy-driven engine is rule-based.
Manual Classification: Yes; manual classification is possible. The policy can inform the admin of the data that is eligible to be moved, and the admin can decide which data movements make sense. The admin can then select the data to be moved and the schedule for the migration.
Third-party Integration: The data classification engine can run any script (batch file) and can thus integrate with any scriptable mover.
Search Features: Standard file movement is based on file metadata. Other functionality can be achieved through the scriptable interface of StorageX.
Tiering Support: StorageX supports disk-to-disk solutions.
Management Features: The global namespace provides transparency to the users by shielding the users from knowing where the data resides. An admin can run reports based on the completion of a migration to know what got moved and when.
Operating System: Windows 2000 and above, Linux, Solaris, NetApp, EMC
System Requirements: Windows or Unix environment with one or more file servers hosting shares or exports. Enterprise solution that can scale to 100,000+ users
Vendor Comment: StorageX provides transparent policy-based data movement between tiers of storage by moving infrequently accessed data to a lower tier and re-pointing the users to the new location via the global namespace.
Availability: Currently available
Base Cost: Not provided
Detailed Specs: http://www.brocade.com/products/fan/storagex.jsp
Vendor URL: www.brocade.com
Product: Brocade Communications Systems Inc.; File Lifecycle Manager (FLM) software
Data Types Supported: Unstructured file data on primary NetApp storage systems. This includes project data, home directories, group data etc.
Metadata: A file stub representing the complete file is used to re-point the users to the lower tier. Classification of data is based on any of the following parameters: last accessed, created, or modified times, size, file attributes etc.
Other Rules Criteria: An external data source can be used to specify the files to be moved. For example, this could be an output generated by a content classification engine.
Policy Engine: The policy engine runs on a schedule to identify the files to be moved. An admin can run a simulation or actually migrate the file to a lower tier. When a file is migrated, a stub is left behind. To the user, the stub looks and acts like the original file.
Automated Classification: Yes; an admin can configure the policy to automatically classify the files and run on a schedule to move the files.
Manual Classification: Yes; an admin can manually classify the data and decide what data needs to be moved; can run simulations to see if the policy is suitable or adjust it as needed.
Third-party Integration: FLM uses its own data movement engine to migrate data.
Search Features: FLM migrates data based on file metadata (size, dates, attributes etc) and properties.
Tiering Support: FLM supports disk-to-disk data movement.
Management Features: FLM migrates data based on the policies set by the admin. FLM is transparent to users and applications.
Operating System: NetApp Storage Systems for primary. NetApp or Windows for secondary tier.
System Requirements: One or more Windows servers are required to run FLM software based on the size of the primary tier. A SQL Server database is used to hold the list of migrated files.
Vendor Comment: FLM frees up disk space on primary NetApp storage system by transparently migrating data to a lower tier of storage. FLM reduces the cost of backup of expensive primary storage by off-loading unused files to a lower tier.
Availability: Currently available
Base Cost: Not provided
Detailed Specs: http://www.brocade.com/products/fan/flm.jsp
Vendor URL: www.brocade.com
Product: EMC Corp.; EMC Infoscape
Data Types Supported: Unstructured data; files in Windows and Solaris file shares. Infoscape uses CIFS and NFS protocols to access the files.
Metadata: Both file attributes and file content are used for classification. File level attributes such as file owner, last accessed/modified date, file type, file size, file path, are used.
Other Rules Criteria: The content of files is analyzed for keywords and patterns. Files can be classified based on simple keywords or patterns of numbers (SSNs, credit card numbers, etc.) or text (email address, etc.). Files can also be classified based on user groups.
Automated Classification: Infoscape in many respects is a policy creation and execution engine. It provides a platform to create policies that embody the rules for identifying and classifying information as well as for defining what actions are to be taken for different categories of files. The three main functional areas of EMC Infoscape - discovery, classification and actions - are automated and executed through the policy analysis and execution platform.
Manual Classification: The tool does not support manual classification. Classification is based on user defined rules.
Third-party Integration: No integration with third-party data movers at this point.
Search Features: Both basic and advanced search are available. Files can be searched using simple keywords, wildcard queries, Boolean expressions and file attributes. The product will also enable search based on Infoscape-specific metadata, such as what services a file is receiving or what stage in a lifecycle a file is in.
Tiering Support: Tiering is supported through policy-based data migration functionality in Infoscape. The data migration is focused on EMC storage in the current version. Primary/source: EMC Celerra (NAS platform). Secondary/target: EMC Celerra (NAS platform), EMC Centera (CAS platform).
Management Features: Infoscape administrators can define services, such as security, as well as service levels such as low, medium and high within each service area. Once defined within Infoscape, these services are assigned to categories of files based on policy. Infoscape can compare current service levels of a file with the intended service levels defined in a policy. This analysis generates a service gap report that can be used, along with service modeling analysis to help administrators apply appropriate services and service levels to files. When the service level objectives change over the lifecycle of a file, the Infoscape life cycle management feature permits service levels to be based on the age of the file. This capability extends scalability and automated execution of information management policies over the lifecycle of millions of files. Infoscape also has powerful reporting and analysis functionality that allows customers to better classify and manage information.
Operating System: Windows 2003 standard edition
System Requirements: Four CPU or two dual-core CPU server with 4 GB RAM; SQL 2005
Vendor Comment: EMC Infoscape is an enterprise information risk management solution that classifies unstructured data and takes actions based on user defined policies. Infoscape is the only solution in the market that leverages both file attributes and content to classify files and execute IT actions such as copy, move, archive, secure and full text index.
Availability: Currently available
Base Cost: $125,000 for base license plus $9,000 per terabyte of data managed
Detailed Specs: http://www.emc.com/products/software/infoscape.jsp
Vendor URL: http://www.emc.com/
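The service gap analysis described in the Management Features entry above (comparing a file's current service levels with the levels a policy intends for its category) can be sketched as follows. This is only an illustration of the idea, not Infoscape's data model: the `LEVELS` ordering, the dictionary layout for files and policies, the category names and the function name `service_gaps` are all assumptions.

```python
# Hypothetical ordering of service levels; "none" means the service is absent.
LEVELS = {"none": 0, "low": 1, "medium": 2, "high": 3}


def service_gaps(files, policy):
    """Return (path, service, current, intended) for every shortfall.

    files:  list of dicts with "path", "category" and "services" keys
    policy: maps a category name to {service name: intended level}
    """
    gaps = []
    for f in files:
        intended = policy.get(f["category"], {})
        for service, want in intended.items():
            have = f["services"].get(service, "none")
            if LEVELS[have] < LEVELS[want]:
                gaps.append((f["path"], service, have, want))
    return gaps


# Example data (made up for illustration).
policy = {
    "confidential": {"security": "high", "backup": "medium"},
    "public": {"security": "low"},
}
files = [
    {"path": "/hr/payroll.xls", "category": "confidential",
     "services": {"security": "low", "backup": "high"}},
    {"path": "/pub/menu.doc", "category": "public",
     "services": {"security": "low"}},
]
print(service_gaps(files, policy))
```

Only shortfalls are reported; a file that receives a higher level than the policy requires (the backup service in the example) is not treated as a gap in this sketch.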
Product: IBM; IBM Classification Module for OmniFind Discovery Edition
Data Types Supported: Any content in text form including email, business documents or database content.
Metadata: The IBM Classification Module can be configured to accept any metadata associated with a document or piece of content for use in classification, and in turn the IBM Classification Module can create any type of metadata required.
Other Rules Criteria: Primarily, IBM Classification Module automatically categorizes content via semantic understanding of free form text, in combination with rules-based classification.
Policy Engine: Via semantic understanding, the IBM Classification Module executes statistical analysis of the content of the free form text. This algorithmically based analysis of the text's content can in turn be combined with rule-based classification to categorize content.
Automated Classification: Classification of content is typically executed in an automated manner.
Manual Classification: The IBM Classification Module Workbench tool allows the administrator to manually classify content.
Third-party Integration: No packaged data mover integration exists
Search Features: IBM Classification Module is integrated with multiple search products within IBM's Content Discovery portfolio, including IBM OEE and IBM ODE.
Tiering Support: IBM Classification Module has a service-oriented architecture, focused on handling classification requests independent of storage technology. Classifications made by the IBM Classification Module can be defined by the customer for their particular business need.
Management Features: Classification Workbench for defining and maintaining a taxonomy; Classification Console allows the administrator to monitor all installations of the software
Operating System: Consult documentation for detailed OS support
System Requirements: No predefined network requirements
Vendor Comment: IBM Classification Module for OmniFind Discovery Edition automatically classifies long form requests such as emails, case management notes, discussion group comments, and documents. This module is used primarily in solutions such as our Case Resolution, Contact Center, and Self Service offerings that must effectively process end user requests that go beyond the keywords, phrases, and questions typically expressed in a conventional search application. The Classification Module is used by hundreds of organizations to address online problem resolution and has helped organizations in many cases automatically resolve up to 40% of their online service requests without the need for live assistance.
Availability: Version 8.3 currently available
Base Cost: Pricing based on configuration
Detailed Specs: http://www-306.ibm.com/software/data/enterprise-search/omnifind-discovery/class.html
Vendor URL: www.ibm.com
Product: Index Engines Inc.; ILM and Data Classification appliance
Data Types Supported: Supports all common unstructured file types as well as Exchange email and PST files.
Metadata: For documents: Title, Author, File name, File type, File category, Size, Age, Create date, Modified date, Accessed date, Location and Security. For email: Subject, From, To, CC, BCC, Size, Age, Date and Location.
Other Rules Criteria: Full text content is available for classification. We index all text content on the first pass of indexing rather than on a second pass after the data set has been narrowed down. Additional criteria include pattern matching, such as Social Security and credit card numbers.
Policy Engine: Index Engines delivers a comprehensive software development kit (SDK) that allows custom development of policies and rules based on information content.
Automated Classification: No automated policy engines. The SDK is used to build custom automation that best fits the end user's environment.
Manual Classification: Index Engines supports tagging, allowing for manual addition of classification tags to documents and email. Once a document or set of documents is tagged, it can be queried and organized using the product interface or via APIs.
Third-party Integration: The Index Engines SDK consists of a SOAP-based XML development environment. This allows for third-party integration using standard development tools such as Perl and .NET.
Search Features: Index Engines delivers full metadata and content search. Search includes Boolean search capabilities such as AND, OR, NOT and NEAR. All search results are presented in sub-second response time for even complex queries.
Tiering Support: Our platform is independent from the storage tiers and can support any and all environments.
Management Features: We provide roll-up reports on data based on the query. For example, a query can be submitted for all documents owned by a user, or for files not accessed in over 5 years. The results of the query will be presented, and the user can select the report option to show high-level roll-up reports and graphs. Canned reports include reports by file type, size, location, owner and age, as well as by risk (files containing Social Security and credit card numbers).
Operating System: Our product is delivered as an appliance and is independent of the operating system.
System Requirements: Our product can integrate into a LAN or a SAN in order to ingest data. We do not crawl the network in order to ingest data; we integrate into the current infrastructure in order to index data. The LAN product connects using a network card and ingests data via NDMP. The SAN product connects to the network via a fibre channel connection and transparently ingests data as it flows to archive.
Vendor Comment: We provide the most scalable, efficient discovery and classification solution on the market, allowing for indexing of hundreds of millions of files across the enterprise. No other solution can integrate into the storage network in order to transparently discover and classify data enterprise-wide.
Availability: Currently available
Base Cost: List price starts at $50,000
Detailed Specs: http://www.indexengines.com/solutions_storage_management.htm
Vendor URL: www.indexengines.com
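The Boolean operators named in the Search Features entry above (AND, OR, NOT and NEAR) map naturally onto a positional inverted index. The following Python sketch shows the general technique only; it says nothing about how the Index Engines appliance is built, and the class name `Index` and its methods are hypothetical.

```python
import re
from collections import defaultdict


class Index:
    """Tiny positional inverted index: term -> document -> word positions."""

    def __init__(self):
        self.positions = defaultdict(lambda: defaultdict(list))
        self.docs = set()

    def add(self, doc_id, text):
        """Index the words of text under doc_id."""
        self.docs.add(doc_id)
        for pos, word in enumerate(re.findall(r"\w+", text.lower())):
            self.positions[word][doc_id].append(pos)

    def term(self, word):
        """Return the set of documents containing word."""
        return set(self.positions.get(word.lower(), {}))

    def not_(self, doc_ids):
        """Return all indexed documents that are not in doc_ids."""
        return self.docs - doc_ids

    def near(self, a, b, distance):
        """Return documents where a and b occur within distance words."""
        hits = set()
        for doc in self.term(a) & self.term(b):
            pos_a = self.positions[a.lower()][doc]
            pos_b = self.positions[b.lower()][doc]
            if any(abs(x - y) <= distance for x in pos_a for y in pos_b):
                hits.add(doc)
        return hits
```

With this structure, AND and OR are plain set intersection and union, for example `idx.term("budget") & idx.term("report")`, while NOT subtracts from the set of all indexed documents and NEAR additionally compares word positions.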
Product: Kazeon Systems Inc.; Information Server
Data Types Supported: The Information Server supports 390+ different file formats. This includes word processing formats, spreadsheet formats, presentation formats, graphics formats, compressed formats, database formats, email formats, healthcare-specific formats (DICOM), Microsoft Project, and MP3 files. The Information Server also supports hundreds of non-standard file formats through advanced parsers which can extract text from non-standard file formats.
Metadata: The classification engine can create and use just the file and file system metadata (file name, owner, file type, access time, etc.) or all the file system, application metadata as well as the content. The Information Server provides default levels with the ability for the customer to customize the classification dial to make it more specific to their deployment. Some examples of metadata include: file and file system metadata, application metadata, and content metadata. Classification Groups are groups of files based on a combination of the metadata. Administrators or end-users can assign their own metadata to files in a manual or automated fashion.
Other Rules Criteria: Any of the metadata fields listed (content or non-content) above can be used to classify data. The rules have the ability to search against the extracted full-text to identify phrases or patterns or can be as simple as checking the values of the file, application or extracted content metadata. Complex rules can be built using BOOLEAN operators.
Policy Engine: Information Server has a comprehensive policy engine that helps define rules for how a file is classified; which files are migrated, moved, copied or deleted; which users are allowed to view files from search results; and which files need to be locked down and for how long.
Automated Classification: Information Server can perform either manual or automated classification. If automated classification is performed, it can be scheduled to run on a periodic, recurring basis. In most scenarios, automated classification does not need access to the live production data but can be based on the existing metadata.
Manual Classification: Information Server does support manual classification of data. Manual classification can be performed by the end-user or administrator on a set of files (defined by a search query or a report).
Third-party Integration: Information Server uses its own highly optimized data movers.
Search Features: Information Server does index and provide comprehensive search capabilities for any of the metadata. All common search queries such as keyword search, BOOLEAN searches, phrases and date range searches, field searches, proximity searches, etc. are supported. Additionally, search results are secure -- the user performing the search is only presented with the files that they have access to. Overlay policies can be created that restrict or open up the security level. Finally, all search queries are actionable. Administrators and end-users can perform certain actions on these files (copy, move, delete, lock down with retention date, tag, re-classify) based on the output of the search query results.
Tiering Support: Information Server supports all NFS or CIFS accessible storage tiers. This includes FC or SATA based storage, compliance storage, fixed-content or archive storage.
Management Features: Depending on their role (end user, auditor or administrator), users can manage their data by performing the actions described above or organize their data by assigning tags to it.
Operating System: Information Server is operating system agnostic -- it uses standard file system protocols such as NFS and CIFS to access and classify data. As a result, any operating system that supports either NFS or CIFS is supported by the Information Server.
System Requirements: The Information Server is a cluster of appliances that come pre-packaged with hardware and software. The appliances need Gigabit Ethernet connectivity to the corporate network and access to the data via NFS or CIFS. If authenticated access is required, connectivity with either Active Directory or NIS is required.
Vendor Comment: The Information Server is designed for classifying, indexing and taking actions on unstructured data and can scale from a few million files to hundreds of millions of files and from terabytes to petabytes of information.
Availability: Available now in three models: IS1200-FRM for file reporting and migration, IS1200-SA for enterprise search and IS1200-ECS for enterprise content services
Base Cost: $40,000 per appliance
Detailed Specs: http://www.kazeon.com/products/
Vendor URL: http://www.kazeon.com/
Product: Scentric; Destiny software
Data Types Supported: File data, E-mail data, SQL data
Metadata: Destiny can utilize file system and application metadata such as owner, access date, modify date, group ownership, file type, size, email to/from/cc/bcc, subject, date and attachment.
Other Rules Criteria: File content, email content
Policy Engine: The policy server allows complete automation of all functions including rule enforcement.
Automated Classification: The Destiny policy server allows the automation of rules and policies as well as content and metadata indexing. Approximately 80 rules are included with Destiny.
Manual Classification: Rules and policies are editable and administrator configurable.
Third-party Integration: Yes; Scentric Destiny integrates with the Dynamic Information Services platform from Permabit, EMC Centera and HDS HCAP. It does not integrate with third-party data movers.
Search Features: Yes; Scentric supports major search engines, including Google and Microsoft. Destiny can utilize the search indexes created by Microsoft and Google. Currently, Destiny includes search capabilities for administrators.
Tiering Support: Destiny supports EMC, HDS and Permabit archives, and user definable tier 1, 2, 3 etc.
Management Features: After indexing and classification, data can be migrated between tiers, compressed, archived, copied and deleted.
Operating System: Destiny runs on Windows Server and can index and manage CIFS, NFS and NetWare shares.
System Requirements: Requirements depend on deployment
Vendor Comment: Destiny is the first universal data classification solution for the enterprise that is scalable, easy to use and WAN friendly.
Availability: Version R2 currently available
Base Cost: Pricing varies by deployment -- contact the vendor for detailed pricing
Detailed Specs: http://www.scentric.com/products/scentric_destiny.jsp
Vendor URL: http://www.scentric.com/
This was first published in January 2008