Data classification: Getting started

Classifying data and knowing how its value changes over time will improve service levels, create a better working relationship with business units and reduce costs.

Classifying data and knowing how its value changes over time can lead to good things: higher service levels, a better working relationship with the business units that create and own the data, and the ability to reduce costs by storing data on an appropriate class of storage.

A data classification project doesn't have to be complex or difficult to accomplish, but it can easily escalate in complexity depending on how granular the classification effort becomes. Like it or not, data classification will be the cornerstone for a much larger information lifecycle management project.

Data classification provides the following main benefits:

  • Reduced storage costs through lower consumption and cost per unit of storage.
  • Higher service levels for storage consumers.
  • Reduced risk of unprotected or underprotected data.
  • Shared accountability between the service provider and user.

The best way to begin is to use a minimal methodology and a high-level approach to classifying data. This keeps your level of effort in balance with the return on investment. Classifying your data is an interesting exercise, but it yields no benefit unless you act on the results. The way data is stored will need to change; to do this without creating havoc, the organization has to agree that the current method for allocating storage isn't as effective as it could be. Selling this idea is easier said than done, however.

Ask your business units what they need. The answers may illuminate inadequacies in your storage department, which will lead to infrastructure changes that reflect real business needs. With everyone's buy-in, there will be funds to pay for these changes. The following describes the data classification process, key elements, common pitfalls and new products that promise to make the effort less manual and more granular.

From requirements to classification
Data classification simply means mapping business requirements to your infrastructure. It begins with a structured interview with the user, typically an application or project owner. A structured storage management organization, standard procedures and a standardized infrastructure are essential prerequisites to the long-term success of data classification. Don't worry if "firefight" and "chaos" are the two words that best describe day-to-day operations. Go ahead and use data classification as a way to reach out to the user community and to get a handle on the business requirements behind the service requests.

Data classification efforts typically lack structure and rely on informal meetings between the storage staff and business units, interaction during application or server rollout processes, or just e-mail correspondence to obtain user requirements for storage services. Sometimes it boils down to hallway conversations or phone calls to your pals in operations to get the right service setup.

Often, a knowledge gap exists between the user and the infrastructure team, so requirements end up mapping to the "high end" of the scale. The discussion becomes a "What can you do for me?" conversation instead of a "What are your requirements?" conversation. When this is the case, everyone's data is the most important and requires the most expensive, high-performance storage.

Approaching users with a structured set of questions (such as "How would you rate the performance of this application?" or "How mission-critical do you consider this application?") with specific ranges for answers provides consistency and allows business requirements to be mapped to various aspects of storage (see Mapping business requirements to storage).
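A structured interview of this kind can be captured as data so that every user answers the same questions on the same scale. The following is only a minimal sketch: the 1-to-5 scale, the question keys, the tier names and the scoring thresholds are all assumptions made for illustration, not a prescribed method.

```python
from dataclasses import dataclass

# Hypothetical answer scale: 1 = lowest need, 5 = highest need.
SCALE = range(1, 6)

# The two sample questions from the text, keyed by illustrative names.
QUESTIONS = {
    "performance": "How would you rate the performance of this application?",
    "criticality": "How mission-critical do you consider this application?",
}


@dataclass
class InterviewResponse:
    application: str
    answers: dict  # question key -> score within SCALE

    def validate(self) -> None:
        """Reject missing answers or answers outside the agreed range."""
        for key in QUESTIONS:
            if self.answers.get(key) not in SCALE:
                raise ValueError(
                    f"{self.application}: '{key}' must be answered 1-5"
                )


def suggest_tier(response: InterviewResponse) -> str:
    """Map the highest-scoring answer to an illustrative service tier."""
    response.validate()
    top = max(response.answers[key] for key in QUESTIONS)
    if top >= 5:
        return "tier 1"
    if top >= 3:
        return "tier 2"
    return "tier 3"


payroll = InterviewResponse("payroll", {"performance": 2, "criticality": 5})
print(suggest_tier(payroll))
```

Because the answers are restricted to a fixed range, responses from different business units can be compared directly, which is the consistency the structured interview is meant to provide.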

Business requirements map to a storage service through key performance metrics. For instance, production recovery time metrics for an enterprise storage environment might range from two hours to a day. A complete inventory of business requirements facilitates the delivery of multiple storage service types. You should obtain or create a logical mapping of the user environment to your infrastructure. This usually means mapping projects or apps to your infrastructure (hosts, arrays, file servers, etc.). Once you have a complete collection of requirements and infrastructure meta data, the requirements gathering phase is done.

Next, align requirements to service offerings by developing a service catalogue, which is the menu of storage services. This living document describes the service provided and offers technical details for each standard offering within the storage service type. A catalogue item might be twice-weekly data replication to a remote site 200 miles away with the data stored on a tape library. The service catalogue provides a reference point for users, and is referenced in subsequent service level agreements (SLAs). As technology infrastructures change, so does the service catalogue.
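One way to keep the catalogue a living document is to hold it as structured data rather than free text. The sketch below records the example item from the paragraph above; the field names, the item name and its placement in the disaster recovery domain are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CatalogueItem:
    domain: str       # storage service type, e.g. "disaster recovery"
    name: str         # hypothetical short name used in SLAs
    description: str  # what the user receives
    details: dict     # technical details of the standard offering


CATALOGUE = [
    CatalogueItem(
        domain="disaster recovery",  # assumed domain for this example
        name="DR-remote-tape",
        description="Twice-weekly data replication to a remote site",
        details={
            "copies_per_week": 2,
            "distance_miles": 200,
            "media": "tape library",
        },
    ),
]


def items_for_domain(catalogue, domain):
    """Return the catalogue items offered within one service domain."""
    return [item for item in catalogue if item.domain == domain]


for item in items_for_domain(CATALOGUE, "disaster recovery"):
    print(item.name, item.details)
```

An SLA can then refer to an item by name, and a change in infrastructure becomes an edit to one catalogue entry.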

Unique business requirements map to specific types of storage services. To provide a manageable and flexible storage service, the service must accept changes in business requirements and infrastructure. To accomplish this, storage services must be segregated into discrete storage service domains. Typical storage service domains include primary storage, disaster recovery, backup/recovery and archive. Within each storage service type, tiers of service are often developed once a representative sampling of business requirements is available from the data classification interview process.

Service catalogue development requires an iterative approach. Business requirements must be aligned to service offerings, and that alignment takes vision, work and refinement. The first pass will get you off the ground, but subsequent iterations and improvements to the service offerings will be required before a service catalogue is enterprise ready.

Mapping business requirements to storage

Closing the deal
A mature data classification model includes the SLA and a cost model; however, they're not core to the discussion of data classification. The SLA is the key point of interaction between IT and the user. It represents a contract, outlining and quantifying how and when the service will be provided. A cost model facilitates service offering development and the occasional "realignment" of user requirements should a service cost exceed the organization's ability to pay the bill. In reality, chargeback isn't a realistic goal for all organizations; however, user accountability can still be achieved through a combination of cost modeling, measurement and reporting.
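Where chargeback isn't practical, a simple usage report can still show each business unit what its storage costs. The sketch below is one possible form of such a report; the tier names, the rates and the usage figures are invented for illustration.

```python
# Hypothetical rates, in cost units per GB per month.
RATES = {"tier 1": 3.0, "tier 2": 2.0, "tier 3": 1.0}


def showback(usage):
    """Sum the monthly cost per business unit.

    usage is an iterable of (business_unit, tier, gigabytes) tuples.
    """
    report = {}
    for unit, tier, gigabytes in usage:
        report[unit] = report.get(unit, 0.0) + gigabytes * RATES[tier]
    return report


measured = [
    ("finance", "tier 1", 100),
    ("finance", "tier 3", 500),
    ("hr", "tier 2", 50),
]
for unit, cost in showback(measured).items():
    print(f"{unit}: {cost:.2f}")
```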

Data classification creates a dialogue and a process between users and IT. Activity is likely to span IT groups, business units and disparate user groups, risking political mayhem. Often, data classification creates a first-time dialogue or crosses burned bridges where relationships went wrong long ago. Even with high-level executive sponsorship, internal politics are likely to surface during the process. Identify an executive sponsor with enough clout to demand accountability and cooperation within the organization.

Chicken or egg?
The industry often views the building of tiered services as a "chicken or egg" process. It's difficult to standardize on a finite number of tiers when starting with disparate business requirements. Conversely, it's also difficult to present finite tiers without first gathering data. To sidestep this potential problem, work iteratively to define service tiers. Build an initial set of service offerings that reflect the requirements and then classify sets of user data against these offerings. The initial classification can then be used to conduct follow-up dialogues with the business units to confirm the mapping of data requirements to storage offerings.
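The first classification pass can be sketched as a matching step: each requirement is compared with the initial offerings, and anything that doesn't fit is flagged for the follow-up dialogue. In this minimal sketch the three tiers, their recovery times and their relative costs are assumptions, with recovery times chosen to span the two-hours-to-a-day range mentioned earlier.

```python
from typing import Optional

# Hypothetical initial offerings: (name, recovery time in hours, relative cost).
OFFERINGS = [
    ("tier 1", 2, 3.0),
    ("tier 2", 8, 2.0),
    ("tier 3", 24, 1.0),
]


def classify(required_recovery_hours: float) -> Optional[str]:
    """Return the cheapest offering that recovers within the required time."""
    candidates = [o for o in OFFERINGS if o[1] <= required_recovery_hours]
    if not candidates:
        return None  # no offering fits; raise it in the follow-up dialogue
    return min(candidates, key=lambda o: o[2])[0]


def classify_all(requirements: dict) -> dict:
    """Classify every application's stated recovery requirement."""
    return {app: classify(hours) for app, hours in requirements.items()}


first_pass = classify_all({"billing": 4, "intranet": 24, "trading": 1})
for app, tier in first_pass.items():
    print(app, tier or "needs follow-up")
```

Requirements that come back unmatched show where a tier is missing or where the stated requirement needs to be revisited with the business unit.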

Introducing vast amounts of procedural change creates more pain than benefit. For data classification, start with a pilot effort that includes a representative sampling of your environment from a business and technology point of view. Don't try to tackle all storage services at one time. Pick one domain, such as primary storage, and focus the service offering development there. Trying to accomplish everything at once will open a Pandora's box of unforeseen problems.

In the early stages of the data classification project, don't collect a lot of meta data about a file. There's only so much meta data an organization can manage, and the benefits of going to deeper levels drop off precipitously after a certain point. However, for those with a vision of more granular meta data management and the business plan to prove its value, there are several new products that automate some of the above processes.

These new products are designed to help storage admins struggling to apply highly granular policy requirements to the Wild West of unstructured data. This is especially true for large file server farms supporting businesses with rigorous compliance requirements. Data classification vendors are taking file-level storage resource management (SRM) concepts to another level by adding context-based meta data to apply policy-based actions on how the file is stored. For example, if a file contains a social security number, the file will be moved to a highly secured storage device.
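The social security number example can be sketched as a content rule that selects a storage target. This is an illustration of the idea only, not how any particular product works: the pattern is deliberately simplistic, and the target names are hypothetical.

```python
import re

# Simplistic pattern for a U.S. social security number written as NNN-NN-NNNN.
# Real products would need far more robust detection.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def placement_for(text: str) -> str:
    """Choose a hypothetical storage target based on file content."""
    if SSN_PATTERN.search(text):
        return "secure-storage"
    return "standard-storage"


print(placement_for("Employee SSN: 123-45-6789"))
print(placement_for("Quarterly sales summary"))
```

A policy engine would pass the returned target to whatever data mover the environment uses; that step is left out here.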

As these products evolve and take on more sophisticated ways to determine and classify a file's contents, more rule-based actions such as copy, movement and security can be applied to enterprise unstructured data. Applying security changes, such as narrowing the permissions list for a file, can be achieved natively through most file systems, while encryption requires the integration of third-party tools. Data copy can be facilitated via APIs to various products, ranging from backup applications to write once, read many (WORM) disk storage devices.

Data classification products
A variety of tools now tout data classification features. Many vendors offer point solutions for e-mail archiving, compliance and file system management. Most products are focused on unstructured data. The tools typically take a bottom-up approach to collecting vast amounts of meta data and address individual issues like compliance searches of data or hierarchical storage management (HSM)-style file movement. Here are some of the companies focused on data classification:

  • Abrevity Inc., San Jose, CA, provides point solutions for compliance and service-level policy enforcement by providing bottom-up meta data. Its FileBase server and client software generate meta data similar to an SRM tool, but with more depth. The company claims FileBase is compatible with any data mover technology and uses tagging techniques to track classified and migrated data.
  • Kazeon Systems Inc., Mountain View, CA, will shortly release a full beta of its file-based searching and reporting software. It claims its tools make storage more "content aware." The software catalogs assets and tags them via a meta data repository that allows basic policies to be set and run on an ad hoc or scheduled basis. Kazeon plans to extend this rudimentary data movement capability and build on its content- and pattern-based searching in later releases throughout the year.
  • Scentric Inc., a startup in Duluth, GA, plans to address the three major categories of data--files, messages and databases--with equal aplomb. At the heart of its strategy is making applications become "self-describing" and abstracting low levels of complexity into meaningful information. Scentric believes its toolset will be usable for policy makers and storage engineers.
  • StoredIQ Corp., an Austin, TX-based startup formerly known as Deepfile, has retooled its suite of HSM and SRM software to help users address risk issues through data classification and policy automation. Focused mainly on compliance and security for storage, it's still in super-secret stealth mode. StoredIQ claims to leverage a complex searching capability to illuminate file-based meta data using interactive dialogs with the user, as well as a sophisticated "lexicon" that captures multiple parameters for more robust classification of information.

The linchpins of a successful data classification project are detailed planning and meaningful dialogue with users about business requirements. The idea is to match different levels of storage with users' requirements. Make no mistake: Defining policies to map requirements to service tiers is arduous and time-consuming work, but it can be achieved through a sound methodology, an iterative development approach and a rapidly evolving set of tools in the marketplace. The long-term benefits of data classification include cost reduction, risk mitigation and QoS improvements.

This was first published in July 2005
