New tools to classify data

Putting data on storage systems appropriate to its value requires the ability to classify data. An emerging category of applications, Information Classification and Management apps, can index enterprise information and execute precise actions based on its content.

This article can also be found in the Premium Editorial Download: Storage magazine: The best high-end storage arrays of 2005.

One of these days, your CEO will ask the big question: "How is our data storage infrastructure driven by the business value of our company's information?" Aside from pulling out a few strands of your hair, how would you respond?

Answering accurately would take time and plenty of clarification. However, if your organization is among the few that have installed one of the new Information Classification and Management (ICM) applications, your answer could be as succinct as the question itself: "We have ICM deployed above the storage infrastructure to manage information based on content values. Those content values are determined by business objectives established by our compliance, security and internal IT services teams."

ICM isn't just another disposable acronym; it's a concept and product category that unites business process with storage. Companies in this segment hope to bridge the gap between high-profile processes like compliance, corporate security and IT consolidation, and the storage infrastructure those processes depend on. If ICM succeeds, storage management and information management will become blissfully blurred in a world of automation with full transparency about what information goes where, when and why. In short, ICM could be the missing link between your day-to-day storage reality and the information lifecycle management (ILM) marketing dreams conjured by storage vendors. Startup vendors in the ICM market all share the same key insight: You have to classify your information if you want to control how you store it.

Evaluating ICM

The following criteria should be applied when evaluating Information Classification and Management (ICM) vendors:

Business-solutions focus: A key differentiator among ICM players is their relative focus on business-level applications. Products may focus on compliance, infrastructure controls or security.

Integration capabilities: An ICM offering's ability to integrate with back-end storage software is an important consideration. Given the wide range of potential data-movement software and solutions available, finding the right player requires careful evaluation.

Classification power: Consider a product's breadth and flexibility of classification coverage. Does it classify only unstructured content, or does it integrate with e-mail and database offerings? How easy is it to create new lexicons? Is it a completely extensible environment?

Deployment method: There's some variance in how ICM is deployed. Most products are out-of-band, but others support in-band operations. Additionally, some products use the LAN for indexing on hosts, while others use mass streams like backup jobs for data collection.

Performance and scale: For ICM, performance is measured in the number of files that can be indexed per hour. With regard to scalability, ICM products should be compared by the maximum number of files they support and by the ease with which they scale. A relatively small number of ICM devices (no more than four to five) should be capable of covering the unstructured content of a typical Fortune 1000 data center.
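To make the files-per-hour metric concrete, here is a rough back-of-the-envelope sketch in Python. The function names and the sample figures (50 million files, 1 million files per hour per device) are illustrative assumptions, not vendor numbers, and the sketch assumes indexing throughput scales linearly across devices, which a real product may not achieve.

```python
import math


def baseline_index_hours(total_files: int, files_per_hour_per_device: int, devices: int = 1) -> float:
    """Hours for an initial crawl, assuming throughput scales linearly with device count."""
    if total_files < 0 or files_per_hour_per_device <= 0 or devices <= 0:
        raise ValueError("file count must be non-negative; rate and device count must be positive")
    return total_files / (files_per_hour_per_device * devices)


def devices_needed(total_files: int, files_per_hour_per_device: int, window_hours: float) -> int:
    """Smallest number of devices that finishes the crawl inside the given time window."""
    if files_per_hour_per_device <= 0 or window_hours <= 0:
        raise ValueError("rate and window must be positive")
    return math.ceil(total_files / (files_per_hour_per_device * window_hours))


# Hypothetical inputs: 50 million files, 1 million files/hour per device.
hours = baseline_index_hours(50_000_000, 1_000_000, devices=4)
count = devices_needed(50_000_000, 1_000_000, window_hours=24)
print(f"{hours} hours with 4 devices; {count} devices for a 24-hour window")
```

Under those assumed inputs, four devices would finish a baseline crawl in 12.5 hours, and three devices would fit a 24-hour window. Plugging in a vendor's measured rate gives a quick sanity check on the "four to five devices" rule of thumb.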

What is ICM?
ICM software indexes enterprise information and executes a range of precise actions on that content. Based on policies, ICM can determine access rights to an object, as well as its residency, movement and final disposition within the storage infrastructure.

It's deployed as a standalone technology that interoperates with existing data movement and storage technologies. ICM isn't dependent on top-level applications, and doesn't need to have a dedicated interface to a proprietary data movement technology (such as a volume manager, snapshot tool or backup application). Upon initial installation, ICM software establishes an index by proactively scanning or "crawling" the file environment. After establishing a baseline, the software conducts ongoing, nondisruptive indexing in the background or at specified time intervals. When deployed in front of an enterprise storage solution, ICM can achieve the following goals:

  • Ongoing classification of file information based on a range of programmable meta data attributes such as business owners, history, creation, directory and so on. This classification process can be automated based on predetermined policies created by an administrator, which may be applied to production application environments and to secondary or archival environments.
  • Ongoing classification of information based on content-related attributes that are extracted from file-level inspection, such as social security numbers, customer names, employees--literally any programmable keyword. As with meta data classification, content-based classification can be automated according to preset policies. The policies can also be applied to production environments and to secondary or archival repositories.
  • Creation of classification templates called "lexicons," which become the basis of policies carried out by the ICM app. These can range from simple directives ("Always restrict access to documents authored by Jane Doe wherever they originate") to complex templates filled with specialized jargon and nested logical operations. Typically, the more complex lexicons are architected to address specific business processes such as HIPAA or SEC 17a regulatory compliance.
  • Granular file-level controls--searching, retrieving and acting--against any content that has been indexed by the ICM deployment. For example, this may include an administrator searching for a particular document using a Google-like interface, and then restricting access to a given user or user group. But it could also include encryption, deletion or migration of that single file. As with all other functions in the ICM category, these controls may be applied to primary application environments and to secondary or archival environments.
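The classification goals above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the `Lexicon` class, its fields, the action names and the SSN-like pattern are all assumptions made for the example. A lexicon here combines a meta data rule (author), keywords and regular expressions, and maps a match to a single action.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Lexicon:
    """Hypothetical lexicon: meta data and content rules tied to one action."""
    name: str
    action: str = "restrict"                                   # e.g. restrict, encrypt, delete, migrate
    authors: set[str] = field(default_factory=set)             # meta data rule
    keywords: set[str] = field(default_factory=set)            # content rule (case-insensitive)
    patterns: list[re.Pattern] = field(default_factory=list)   # content rule (regex)

    def matches(self, metadata: dict, text: str) -> bool:
        # A file matches if ANY rule in the lexicon fires.
        if metadata.get("author") in self.authors:
            return True
        lowered = text.lower()
        if any(keyword.lower() in lowered for keyword in self.keywords):
            return True
        return any(pattern.search(text) for pattern in self.patterns)


# Rough pattern for strings shaped like U.S. social security numbers (illustrative only).
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def classify(metadata: dict, text: str, lexicons: list[Lexicon]) -> list[tuple[str, str]]:
    """Return (lexicon name, action) for every lexicon the file matches."""
    return [(lx.name, lx.action) for lx in lexicons if lx.matches(metadata, text)]


lexicons = [
    Lexicon("personal-data", action="encrypt", patterns=[SSN_LIKE]),
    Lexicon("jane-doe-docs", action="restrict", authors={"Jane Doe"}),
]
print(classify({"author": "Jane Doe"}, "Quarterly notes", lexicons))
print(classify({"author": "Bob"}, "Patient SSN 123-45-6789", lexicons))
```

A production lexicon for something like HIPAA would nest logical operations rather than use the simple "any rule matches" test shown here.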

Compliance, security and ILM
Over the last three years, all of the ICM vendors independently came to the same conclusion: Despite the advances of storage resource management (SRM) and backup products, those products didn't expand our understanding of what was being stored and its relation to various business processes. "We realized that many of our storage challenges were only solvable if we could get knowledge about the content itself," says Michael Masterson, an information systems architect at a leading life sciences company now evaluating several ICM products. "We needed to know information about each piece of data if we wanted to link it up with business policies for compliance and corporate security."

To address compliance and security requirements, technologists like Masterson can use an ICM product to create detailed policy templates, or lexicons, that use meta data and content attributes to precisely automate what content is accessed by whom, when content migrates and its retention period. Masterson believes this kind of granular control is not only desirable, but inevitable. "Getting semantic control of our data and turning it into information is what this is all about," he says.

The other major area of interest driving ICM is adding long-promised granular controls to the ILM process for unstructured content. Specifically, this means more intelligent content movement capabilities. The poor visibility that most enterprises have regarding their unstructured file information and its usefulness to the organization is nothing short of staggering.

Taneja Group conversations with storage admins routinely reveal that low utilization rates for file storage are often directly related to a lack of knowledge about what's being stored, which leads to inaction based on a fear of deleting important files. This translates into inefficient backup as static content is redundantly protected on tape and disk libraries. By using ICM solutions, enterprises can become much savvier about what they store where, for how long and how it will be migrated to an appropriate storage tier over time.
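As a simple illustration of policy-driven tiering, the sketch below assigns a file to a storage tier based on how long ago it was last accessed. The tier names and the 30-day and 365-day thresholds are hypothetical; a real ICM policy would likely weigh content and business attributes as well as age.

```python
from datetime import datetime, timedelta

# Hypothetical tiers, ordered from most to least recently used.
TIERS = [
    (timedelta(days=30), "primary"),
    (timedelta(days=365), "nearline"),
]


def choose_tier(last_accessed: datetime, now: datetime) -> str:
    """Pick the first tier whose age limit covers the file; otherwise archive it."""
    age = now - last_accessed
    for limit, tier in TIERS:
        if age <= limit:
            return tier
    return "archive"


now = datetime(2005, 8, 1)
for accessed in (datetime(2005, 7, 20), datetime(2005, 1, 1), datetime(2003, 1, 1)):
    print(accessed.date(), "->", choose_tier(accessed, now))
```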

Meet the players
To date, the ICM category has four announced players: Arkivio Inc., Kazeon Systems Inc., Njini Inc. and StoredIQ Corp. Scentric Inc., another ICM company, is still in stealth mode.

Mountain View, CA-based Arkivio was the first to release an ICM product, and offers a range of meta data management capabilities to classify content and drive its policy engines. Its auto-stor appliance has been focused on the ILM usage model described earlier, although the company also positions it as a compliance solution. Arkivio helps a user to intelligently move content off production environments. The company recently completed integration with EMC Corp.'s Centera as a front end to that content-addressed storage (CAS) archive. Arkivio doesn't focus on content-based indexing; instead, it delivers its functions based on the files' meta data attributes.

Kazeon Systems, also in Mountain View, CA, is a new entrant. The Kazeon Information Server (KIS) was developed over two years by search, database and storage experts. KIS delivers full meta data and content classification, and a totally programmable policy engine. A unique differentiator for Kazeon is its focus on integrating very easy-to-use, Google-like search capabilities. With KIS, users will be able to search for any piece of indexed information in the infrastructure and then view an entire range of allowed actions for that stored object through a single user interface.

Njini, based in Surrey, England, offers an in-band ICM product. It consists of modules, such as njiniENCOUNT, which prevent unnecessary duplication of unstructured data objects. Hierarchical storage management (HSM) and compliance tools are expected in the next six to nine months; all work with the njiniENGINE. Because its product sits in the data path, Njini believes it has an advantage over competing products: it can take policy-based actions on data before the data reaches the storage devices.

StoredIQ 3.0, from Austin, TX-based StoredIQ, enables meta data and content-based indexing and full controls over content, as well as search and query functionality. The firm is focused on compliance and corporate security, and has developed regulation-specific lexicons for HIPAA, SEC 17a and Sarbanes-Oxley that automate compliance controls for unstructured content. Based on the positive feedback from StoredIQ's early customers in the healthcare industry, a preset lexicon for compliance apps could become a common approach across the ICM category.

ICM: What it does

Proactive data inventories: Information Classification and Management (ICM) inventories existing data sets by indexing or "crawling" the environment to collect data. This may take place as an activity on the LAN or as a batch process to inspect content during its movement for migrations or backup operations.

Meta data attributes: ICM applications depend on meta data indices to achieve their goals. The ICM product collects all available information about files residing in the data pool.

Content attributes: ICM "cracks" and inspects file-level content, enabling content-based classification and control of the data based on its own attributes.

Lexicon creation: ICM supports the ability to create any manner of business-value templates of keywords and logical operations, known as a lexicon. This constitutes the brains for managing a complex business process such as compliance or corporate information security, as well as disk archival management.

Execution and initiation of controls: ICM can act upon content to execute certain controls (e.g., encrypt, restrict, delete and migrate) and to initiate a chain of operations that might be executed by an associated data movement technology such as a volume manager, snapshot application or backup application.
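The inventory and meta data steps above can be sketched with nothing more than the Python standard library. This is an assumption-laden illustration rather than how any ICM product works: the `crawl` and `changed_since` helpers and the index layout are invented for the example, and a real crawler would work across network shares and keep its index in persistent storage.

```python
import os


def crawl(root: str) -> dict[str, dict]:
    """Walk a directory tree and record basic meta data for every file found."""
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            index[path] = {
                "size": st.st_size,
                "mtime": st.st_mtime,
                "owner_uid": st.st_uid,
                "directory": dirpath,
            }
    return index


def changed_since(baseline: dict[str, dict], current: dict[str, dict]) -> list[str]:
    """Paths that are new, or whose size or modification time differs from the baseline."""
    return sorted(
        path
        for path, meta in current.items()
        if path not in baseline
        or baseline[path]["mtime"] != meta["mtime"]
        or baseline[path]["size"] != meta["size"]
    )
```

A baseline would come from one call to `crawl`. Later background passes call it again and feed only the paths returned by `changed_since` to content inspection, which keeps ongoing indexing inexpensive.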

ICM deployments
For the foreseeable future, ICM implementations will likely be deployed almost exclusively as network-mounted devices. To date, most vendors have chosen to deploy "out of band," but there's no architectural requirement for this, and other vendors will assuredly emerge as in-band providers later this year. There's no need for server-side agents to be permanently deployed on the servers under management by ICM. That said, some vendors may choose to deploy agents in the future to increase application-specific functionality, a move analogous to the evolution of CAS archiving, where both API and CIFS/NFS interfaces are now available.

The network device hosting ICM software directly accesses all assigned servers and then communicates with a centralized storage movement or management app, as required, to hand off data-movement activities. All ICM players also provide the means for file-level data movement, but they seem to realize that users expect ICM software to integrate with existing data movement software.

ICM devices consume storage, but not much because only meta data is stored. Early indications suggest that storage requirements will be between 10% and 25% of the production data capacity being indexed and managed. It's worth noting that ICM storage doesn't have to be high-performance disk; Serial ATA is perfectly acceptable because ICM is a relatively low IOPS application. To scale, the ICM devices can be clustered. Based on scaling metrics from ICM vendors, no more than a few physical devices would be needed to meet the scaling requirements of most large enterprises.
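The 10% to 25% figure translates into a simple sizing calculation. The sketch below is only arithmetic on that range; the function name and the 200 TB sample capacity are assumptions for illustration.

```python
def index_storage_range_tb(production_tb: float, low: float = 0.10, high: float = 0.25) -> tuple[float, float]:
    """Estimated ICM index storage (TB) as a fraction of indexed production capacity."""
    if production_tb < 0 or not 0 <= low <= high:
        raise ValueError("capacity must be non-negative and 0 <= low <= high")
    return production_tb * low, production_tb * high


# Hypothetical environment: 200 TB of production data under ICM management.
low_tb, high_tb = index_storage_range_tb(200)
print(f"Plan for roughly {low_tb:.0f} to {high_tb:.0f} TB of index storage")
```

For 200 TB of managed production data, that works out to roughly 20 TB to 50 TB of Serial ATA-class capacity.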

Alternatives to ICM?
Are there any current alternatives to ICM? The short answer is no. SRM products can't do what ICM does: proactive indexing, content-aware inspection, and provisions for detailed policies and lexicons. Current SRM products can only scan file-level environments and collect information on file meta data. It's possible that ICM could eventually be integrated into SRM products.

At this early stage, some may claim that cluster or distributed file system-based namespaces (e.g., Ibrix Inc., Isilon Systems, PolyServe Inc.) or network file management (e.g., Acopia Networks, NeoPath Networks, NuView Systems Inc., Rainfinity Inc.) can do all or some of what ICM does. This misconception likely stems from confusion about the benefits of a global namespace. With a global namespace, file-level content is abstracted from physical device relationships and can be classified and moved according to business-level decisions: simple virtualization. That isn't proactive indexing, it isn't content-aware and it isn't applicable to a business process like compliance. Further, global namespaces cover only the servers in question, not an entire infrastructure.

There are also no manual approaches to content-aware file inspection or policy creation. Those who ask whether ICM can be done manually tend to confuse it with file auditing, the cataloguing of file meta data that has been done by hand for many years.

ICM is another step in the evolution of storage. Controlling content based on its own attributes and associated meta data amounts to a transition from a world of opaque data management to one of transparent information management. The implications of this trajectory are clear: Data storage will increasingly become a matter of the architecture of information policies and data values as much as it is about storage topologies and device management.

It will take another one or two years of development before ICM finds its way into the hands of Tier-1 storage vendors. Because ICM enables enterprise IT to establish such direct control over high-profile processes like compliance and corporate security, ICM is a technology class that appears destined for wide deployment atop enterprise storage infrastructures.

This was first published in August 2005