A step-by-step approach to data classification

The most common shortcoming of a data classification project is the perception that it can be completed through technical analysis at the storage layer without engaging business users. While discovering and analyzing storage is part of the process, good classification requires engaging business users or their IT representatives.

Data classification is the foundation for storage strategies that significantly lower costs, increase service levels, reduce risks and keep business customers happy.

If you ask 100 storage professionals to define data classification, you'll probably get 100 definitions, all of which sound reasonable, if a little vague. Ask the same 100 professionals whether they've ever completed a successful data classification project and almost all of them will say "No." But if implemented successfully, data classification is the foundation for a wide variety of long-term storage initiatives: tiered storage, information lifecycle management (ILM), data privacy and security, regulatory compliance, data cleanup, service-level catalog definition and cost reduction.

At a high level, data classification is the process of collecting the business requirements of data and apps, and using those requirements to store, protect and manage data at the appropriate service levels. A data classification project must begin with a definition of what's being classified and what metrics are appropriate for the level of classification desired. Data should first be classified in terms of application data sets. This is the level of classification needed to successfully align business apps with the storage infrastructure (see "Create a manageable process").

As many companies have discovered, a data classification project can be difficult to complete successfully. These projects involve bridging the divide between business requirements and storage infrastructure, and usually require engaging business units, legal departments, compliance officers and other non-IT organizations. Many data classification projects get stuck or even abandoned before achieving the desired results. Some companies skip data classification entirely, instead focusing directly on technical solutions for storage consolidation and virtualization that address cost or complexity, but don't necessarily serve the needs of the business. Other data classification projects may start well, but come to a grinding halt when resistance from business or IT staffs is encountered.

What's data classification?

Any data classification project should begin with a comprehensive inventory of your company's applications and their associated data sets, followed by classification into groups with common requirements. These requirements may include traditional IT metrics such as recovery time objectives, recovery point objectives, backup schedules, maintenance windows, etc. Successful data classification will also involve more business-centric metrics such as business criticality, revenue and productivity impact over time, business-continuity objectives, application performance, application criticality, data retention periods and security requirements.
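As a minimal sketch of what "classification into groups with common requirements" can look like in practice, the Python below groups an application inventory by a shared requirement profile. The record fields, application names and values are hypothetical illustrations, not a prescribed schema; a real inventory would carry whichever metrics your project selects.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSet:
    """One application data set and a few of its requirements (hypothetical fields)."""
    name: str
    rto_hours: int        # recovery time objective
    rpo_hours: int        # recovery point objective
    criticality: str      # business criticality: "high", "medium" or "low"
    retention_years: int  # data retention period


def group_by_requirements(data_sets):
    """Group data set names that share an identical requirement profile."""
    groups = defaultdict(list)
    for ds in data_sets:
        profile = (ds.rto_hours, ds.rpo_hours, ds.criticality, ds.retention_years)
        groups[profile].append(ds.name)
    return dict(groups)


# Illustrative inventory; names and numbers are made up for the example.
inventory = [
    DataSet("erp-prod", 1, 0, "high", 7),
    DataSet("billing-prod", 1, 0, "high", 7),
    DataSet("erp-test", 72, 24, "low", 1),
]
groups = group_by_requirements(inventory)
```

Here the two production systems fall into one group and the test system into another. In a real project, profiles would rarely match exactly, so requirements are usually bucketed into ranges before grouping.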

While this may seem like an academic exercise or an overly theoretical approach, this process is critical for successful storage projects. Most organizations have some type of classification scheme in place, and there are often multiple, conflicting ones--application "tiers," disaster recovery (DR) levels, business-continuity tiers, etc.--each with its own unique purpose. Typically, these schemes were defined some time ago and haven't been kept up to date. These "slices" of classification are rarely sufficient for making storage decisions, or aren't complete enough to set true service-level objectives.

Different data sets have different business requirements that, taken together, define a service level. For example, data residing in enterprise resource planning (ERP) systems will usually require the highest level of service to ensure that it can be accessed quickly, restored in the event of disaster, protected from theft and available in more than one location. Common sense tells us that ERP test data doesn't require the same level of protection and recoverability. Why protect and manage these two different types of data at the same level when their needs are different? Data classification is the process of creating formally defined service levels for different apps, and sorting the application data sets into these defined service levels.
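The sorting step described above can be sketched as a lookup against a small service-level catalog. The level names ("gold," "silver," "bronze"), the thresholds and the function below are assumptions made for illustration; your own catalog would come out of discussions with the business.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevel:
    name: str
    max_rto_hours: float      # longest recovery time this level guarantees
    max_rpo_hours: float      # largest data-loss window this level guarantees
    replicated_offsite: bool  # whether data is kept in more than one location


# Hypothetical catalog, ordered from most to least protective.
CATALOG = [
    ServiceLevel("gold", 1, 0, True),
    ServiceLevel("silver", 8, 4, True),
    ServiceLevel("bronze", 72, 24, False),
]


def assign_service_level(required_rto, required_rpo, needs_offsite):
    """Return the least protective level that still meets every requirement."""
    # Walk from least to most protective so we never pay for more than needed.
    for level in reversed(CATALOG):
        meets_rto = level.max_rto_hours <= required_rto
        meets_rpo = level.max_rpo_hours <= required_rpo
        meets_site = level.replicated_offsite or not needs_offsite
        if meets_rto and meets_rpo and meets_site:
            return level.name
    raise ValueError("no service level in the catalog meets these requirements")


erp_prod_level = assign_service_level(1, 0, True)     # production ERP
erp_test_level = assign_service_level(72, 24, False)  # ERP test data
```

With these assumed thresholds, production ERP data lands in the top level and ERP test data in the lowest, which mirrors the example in the text.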

We must also consider the time dimension. Over time, business requirements for data can change, with data assigned to different service levels and migrated to different storage tiers to reduce costs. But the first and most important step is to classify the active data sets into the right service levels, and to place them on appropriate storage tiers when they're created. For most organizations, this initial data classification--along with correct placement on tiered storage--will deliver the biggest and fastest returns on investment and effort (see "Will my storage environment become more complicated?"). Once these basic elements are in place, an ILM strategy can provide significant incremental gains over the longer term.

Create a manageable process

Ironically, most data classification projects fail because they're too large or overly complicated. While it's always a good idea to keep a holistic "big picture" in mind, trying to create a complete data classification scheme across the enterprise--and getting buy-in from a large number of business units--often turns out to be an overly ambitious endeavor. If this is your first project, or if you're new to this type of exercise, it may be appropriate to select a subset to classify: either a subset of applications in a large data center (a single business unit, for example); just the applications in a single, small data center; or a single filer or e-mail server. This smaller set of data is often easier to classify initially. Once the project has been completed, a "cookie cutter" process is created that can be applied in pieces across the organization. This is as true of a project aimed at storage tiering as it is for a project focused on file-level classification for archiving or compliance retention for e-mail.

Inappropriate protection levels

By taking a comprehensive approach across the enterprise at the application level, a data classification project can gather the information necessary to make informed decisions about the service levels needed for data kept on spinning disk. Why is this important? While the cost of physical disk per gigabyte may be decreasing, the incremental costs associated with providing the highest level of service continue to rise. Increasing regulation, litigation discovery requirements and user expectations are stretching budgets and staff capabilities.

As your application catalog and data have grown--or been acquired or migrated between different types of systems--much of your data is now most likely maintained at an inappropriate service level. Typically, 10% to 20% of a company's data is underprotected. This means the data isn't managed at the service level required by the business. More often than not, underprotected data can't be recovered quickly enough in the event of a partial or complete disaster. This represents a real risk to the business and its customers.

Perhaps a bigger problem, at least from an operating cost perspective, is overprotection; typically, 40% to 60% of an organization's data is overprotected. In most cases, this data is overreplicated remotely and locally, or backed up in a costly manner. While users aren't likely to complain about too much protection, it represents considerable overspending as data continues its explosive growth.
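Once each data set has both a required and a current service level, measuring under- and overprotection is simple arithmetic. The sketch below is one hedged way to do it, weighting by capacity; the level names, ranking and sample figures are hypothetical and chosen only to make the calculation visible.

```python
# Hypothetical ranking of service levels; higher means more protection.
LEVEL_RANK = {"bronze": 1, "silver": 2, "gold": 3}


def protection_audit(data_sets):
    """Return the percentage of capacity that is under-, over- or correctly protected.

    data_sets: iterable of (name, size_gb, required_level, current_level).
    """
    totals = {"underprotected": 0.0, "overprotected": 0.0, "aligned": 0.0}
    for _name, size_gb, required, current in data_sets:
        gap = LEVEL_RANK[current] - LEVEL_RANK[required]
        if gap < 0:
            totals["underprotected"] += size_gb
        elif gap > 0:
            totals["overprotected"] += size_gb
        else:
            totals["aligned"] += size_gb
    total = sum(totals.values())
    if total == 0:
        return {status: 0.0 for status in totals}
    return {status: round(100 * size / total, 1) for status, size in totals.items()}


# Made-up sample: 1,000 GB across four data sets.
sample = [
    ("erp-prod", 200, "gold", "gold"),
    ("erp-test", 500, "bronze", "gold"),    # more protection than required
    ("hr-db", 100, "silver", "bronze"),     # less protection than required
    ("file-share", 200, "bronze", "bronze"),
]
audit = protection_audit(sample)
```

For this sample, 10% of capacity is underprotected, 50% is overprotected and 40% is aligned. The overprotected share is where the cost savings are; the underprotected share is where the business risk sits.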

Without proper data classification, these expenditures will continue to grow, crowding out other IT initiatives. Classifying data, creating agreed-upon service-level agreements (SLAs) for data and changing the storage strategy will result in lower storage costs in the near and long term.

Four steps to data classification

While not all data classification projects examine the same data types, have the same scope or are used to create the same strategy, a successful approach will involve four common steps. Breaking the project into these steps--and completing each one--will ensure your success and lower your stress levels.

Step 1: Choose your target

  • Define your goal. Are you trying to resolve a specific pain point, reduce costs, create service levels or ensure compliance? Having a desired outcome will ensure that you collect the right information.
  • Determine the project scope. Should you classify a subset of your enterprise or is the whole environment achievable? Do you focus on a single app or file system, a department or the entire data center?
  • Set the level of classification: application data sets, file systems, files, business objects or messages.

Step 2: Map an approach and appropriate toolset

  • Determine the metrics you'll collect. This is the data that will drive your strategy and enable you to think outside the technology stack all the way to the end users in your business groups (see "Engage business users").
  • Define your data sources. These include existing classification and tiering studies, spreadsheets, resource management reports, organization charts, legal policies, DR plans, etc.
  • Determine which new or existing tools you need. This includes storage vendor tools, discovery engines, ad hoc scripts and database queries. (An upcoming article in this series on data classification will describe some of the tools that can make the process easier and more automated.)
  • Determine the percentage of completion needed to set a strategy. In general, the last 20% of information won't be worth the effort. Can you live with a 60% sampling and still accomplish your goals? Check your progress periodically and decide whether it makes sense to continue collecting data.
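The periodic progress check in the last bullet can be reduced to a coverage figure. The following is a minimal sketch that measures coverage by capacity; the field names and the 60% default are assumptions taken from the example in the text, not a recommended threshold.

```python
def coverage_percent(inventory):
    """Percentage of total capacity that has been classified so far.

    inventory: list of dicts with 'size_gb' (number) and 'classified' (bool).
    """
    total = sum(item["size_gb"] for item in inventory)
    done = sum(item["size_gb"] for item in inventory if item["classified"])
    return 100.0 * done / total if total else 0.0


def enough_to_set_strategy(inventory, target_percent=60.0):
    """True once classified capacity reaches the agreed target."""
    return coverage_percent(inventory) >= target_percent


# Made-up progress snapshot.
progress = [
    {"name": "erp-prod", "size_gb": 300, "classified": True},
    {"name": "mail", "size_gb": 200, "classified": True},
    {"name": "file-share", "size_gb": 500, "classified": False},
]
current_coverage = coverage_percent(progress)
```

Measuring by capacity rather than by application count is a design choice: a handful of large data sets often dominates storage cost, so classifying them first moves the figure fastest.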

Step 3: Gather your data and validate it

  • The heart of a classification project is organizing and collating the business requirements with the infrastructure. This is true regardless of the type of data classification project you're undertaking.
  • Determine your data container--spreadsheet, document, database or reporting tool. While a spreadsheet works well for small projects, a database is more appropriate for projects with a larger scope. Ideally, you should plan on keeping the information up to date with either a continual process or a periodic refresh.
  • Follow up your data collection with interviews with key stakeholders. Often, the "statistical data" you collect won't look the same to the group it was collected from. They may know of data points that are changing, or provide interpretations that will change the way the data affects your strategy.
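For projects that outgrow a spreadsheet, the data container can be a small database. The sketch below uses Python's built-in sqlite3 module; the table layout, column names and sample rows are hypothetical, and the stale-entry query is one possible way to support a periodic refresh.

```python
import sqlite3


def create_store(path=":memory:"):
    """Open (or create) the classification store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS data_sets (
            name          TEXT PRIMARY KEY,
            owner         TEXT NOT NULL,  -- business contact who validated the entry
            service_level TEXT NOT NULL,
            last_reviewed TEXT NOT NULL   -- ISO 8601 date, e.g. 2024-01-31
        )
        """
    )
    return conn


def record(conn, name, owner, service_level, last_reviewed):
    """Insert a data set, or overwrite it if it was classified before."""
    conn.execute(
        "INSERT OR REPLACE INTO data_sets VALUES (?, ?, ?, ?)",
        (name, owner, service_level, last_reviewed),
    )
    conn.commit()


def stale_entries(conn, cutoff_date):
    """Names of data sets last reviewed before cutoff_date (ISO dates sort as text)."""
    rows = conn.execute(
        "SELECT name FROM data_sets WHERE last_reviewed < ? ORDER BY name",
        (cutoff_date,),
    )
    return [row[0] for row in rows]


# Made-up entries for illustration.
store = create_store()
record(store, "erp-prod", "finance", "gold", "2024-03-01")
record(store, "erp-test", "finance", "bronze", "2022-06-15")
needs_review = stale_entries(store, "2023-01-01")
```

Keeping an owner and a review date on every row makes the stakeholder follow-up repeatable: the stale list tells you whom to interview next.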

Step 4: Organize and communicate the data in a form that will lead to positive change/action

  • It's all in the final report: No amount of quantitative data will lead to success on its own. You'll need detailed information for the IT teams, summary information for the executives and financial data for the budget process. Don't let your project miss its mark because of a failure to communicate.
  • Revisit your goals from Step 1. Did the data help deliver the solution you were looking for? Do you need to revise your strategy to meet the true requirements? A data classification project will often change the perception of the people involved, so be flexible in adjusting your follow-on projects and resources to meet needs uncovered during the process. For instance, a storage-tiering exercise for primary data will identify an obvious archiving project. Let the information lead you to service-level communications with your business units, a design strategy or tactical projects.

Will my storage environment become more complicated?

Yes and no. The process of classifying data can get complicated, although a structured approach will help. For storage-tiering projects, organizing applications into groups and building a consensus around service levels takes time. Once the process is completed, good classification actually simplifies ongoing management of storage systems. It often decreases the total number of service-level agreements, reduces finger-pointing and creates standard storage configurations. It clears a path for implementing information lifecycle management and moving older data out of primary storage systems. It also reduces the total amount of data requiring active management.

Data classification is a key foundation for many storage projects, and can ensure that the technologies deployed and dollars spent are used to their greatest benefit. There are different levels of classification, and picking the right level and the appropriate metrics for your project are keys to gathering the requirements you need to be successful (see "Sample data classification metrics by data type"). Not all projects need to be enterprise-wide to assist in crafting strategy or making technology decisions. Sometimes choosing a subset of the data can be an easy way to get started.

Finally, be sure to engage the business in your project. The value of data classification is that it ties the infrastructure you love and care for with the needs and requirements of the business that pays for it. It also allows for open communication and collaboration between IT and business units, which will result in solutions that are cost-justified and help the company pursue its core business.
