
Data classification: An overview

Corporate information resources regularly stretch into the terabytes (thousands of gigabytes) with many thousands (even millions) of individual files to contend with -- but the problem is now far more complex than simply "finding files". Finding specific files among this confusing hodgepodge of information is inefficient and frequently incomplete. Companies often fail to recognize the importance of their data and its impact on everyday business operations. The process of "data classification" attempts to fill this void by helping businesses understand what data is actually available, its location in the enterprise, how that data is being accessed and how it must be protected to meet legal and regulatory requirements.

An enterprise must grapple with many different kinds of data, including financial reports, personnel files, research results and customer records. The problem is that most companies do not correlate information with their business processes, which often results in poor utilization of storage resources: companies waste valuable storage by retaining all data on expensive, high-performance systems. Even worse, the business may not be positioned to address compliance audits or legal discovery challenges simply because it cannot find the data in question, placing the company in an extremely vulnerable position. The practice of data classification seeks to overcome this weakness by aligning information with business needs, categorizing the data based on these needs and then using the resulting classifications as a roadmap for retaining and storing information. This is a fundamental underpinning of information lifecycle management. "Data classification is a methodology to align business requirements to infrastructure, so that infrastructure service delivery properly supports data storage and management," says John Merryman, senior consultant with GlassHouse Technologies Inc. Greg Schulz, senior analyst at Evaluator Group, puts it even more simply. "Data classification organizes data so that IT can manage it."

Understanding is a prerequisite to success

A successful implementation requires a solid understanding of why data classification is needed in the first place. Data classification is a serious endeavor, so companies must first address the question of "why do it?" The driving factors usually involve risk mitigation. For example, some companies may be concerned about passing compliance audits under regulations such as the Health Insurance Portability and Accountability Act, the Sarbanes-Oxley Act or other forms of corporate governance (e.g., a life science company may be concerned with particular test data for the Food and Drug Administration). Other companies may wish to ensure adequate response times in the face of legal discovery challenges. Still other companies may focus on more tangible objectives such as increasing the availability of important information for end users, improving the responsiveness of storage resources or saving money by shifting less critical information to secondary storage solutions (a.k.a. tiered storage).

Once an organization understands "why" data classification needs to be implemented, the real work of classification can begin. The classification process can be long and involved (depending on the size and scope of each organization), and it is largely a manual process. While there is software available to help discover and evaluate information assets, there is no practical tool that can tell you what your information is worth to your business. Each company must derive that answer itself. "It's a very high-tech concept that starts in the very lowest tech imaginable," says Steve Duplessie, senior analyst at the Enterprise Strategy Group. "It starts with a piece of paper and a whiteboard, and two people having a discussion." Eventually, this collaborative effort extends to every key area of the company. Michael Peterson, program director of the Storage Networking Industry Association's Data Management Forum, says that it's really a team effort. "The team usually consists of IT, information management, information security, finance, business and legal," he says. Additional corporate departments may also become involved at some point during the classification process. Peterson notes that data is often classified by application, by company group -- such as finance or manufacturing -- by metadata or by type, though the actual categorizations depend on the specific needs of each particular business.
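To make the idea of classifying by application, company group or file type concrete, here is a minimal Python sketch. It is an illustration only, not a description of any vendor's product: the `FileRecord` fields, the rule list and the labels are all hypothetical. The rules themselves stand in for decisions that people would have reached in the whiteboard discussions described above; the code merely applies them.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical metadata a discovery tool might report for one file."""
    path: str
    application: str  # e.g. "email" or "erp"
    department: str   # company group, e.g. "finance"
    extension: str    # file type, e.g. ".xls"


# A rule pairs a test on the metadata with the label to assign.
Rule = Tuple[Callable[[FileRecord], bool], str]

# Hypothetical rules; a real list would come from the classification team.
RULES: List[Rule] = [
    (lambda f: f.department == "finance", "financial-record"),
    (lambda f: f.application == "email", "correspondence"),
    (lambda f: f.extension in {".bak", ".tar"}, "backup"),
]


def classify(record: FileRecord, rules: List[Rule] = RULES) -> Optional[str]:
    """Return the label of the first matching rule, or None if no rule matches."""
    for predicate, label in rules:
        if predicate(record):
            return label
    return None  # unmatched files are left for a person to review


report = FileRecord("/fin/q3.xls", "erp", "finance", ".xls")
print(classify(report))  # financial-record
```

Rules are checked in order and the first match wins, so the ordering of the list is itself a policy decision. A file that matches nothing returns `None` rather than a guessed label, which keeps unclassified data visible to the team.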

Ultimately, the trick is to establish a manageable set of classifications that can suit your entire organization. Perhaps the most tangible advice shared among analysts and vendors is to approach data classification in small pieces. Peterson suggests starting with specific data types, such as backups or e-mail. "Start building out little islands. Get practice making the system work and getting policies in place," Peterson says. The next hurdle is to build interest within other areas of the organization. "You need to show that you're succeeding. You need to have some important wins along the way, so the very first place that you can see some wins is by removing cost," he says. He cites the move to tiered storage as one major cost saving that companies can easily measure. If the concepts of data classification are simply overwhelming your organization, turn to consultants and professional services to help jump-start your process. Again, they cannot determine your data's true value, but they can ask the meaningful questions that will get your effort started.

No substitute for the human touch

Unlike many emerging IT developments, data classification is almost entirely a human decision-making exercise. "Tools can help in the automation and enforcement of classification policies, but 'they' don't classify -- you can't eliminate the human thought process," Duplessie says. While there are certainly software and hardware products available to help discover your data within the enterprise, determine its location, set policies on that data and measure the adherence to those policies, no product has the intelligence needed to determine the "value" of your corporate data. No software can possibly know that losing a certain file may result in an indictment of the chief financial officer.

Still, tools are evolving to supplement and automate data classification and related tasks, such as data migration or retention, though such tools perform very limited and specific functions. "Those [tools] are typically server-based applications that have an interface and pulls meta data from client environments -- not unlike a backup or storage resource management product, only with more detail about the data," Merryman says. But such products can only work with data based on definitions derived during classification discussions. "It's all about understanding the organization and its requirements, and the only way to really get that is through 'people' and through 'process'," he says. See "The vendors" later in this article for more details on emerging software products. No specific hardware devices are needed to support data classification at this time, but storage subsystems can add an indirect benefit. As one example, a tiered storage system may indirectly support data classification by influencing the cost of storage -- allowing less valuable data to reside on less expensive disks.
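The cost effect of tiered storage can be sketched with a little arithmetic in Python. Everything in this example is assumed for illustration: the tier names, the per-gigabyte prices, the class-to-tier policy and the data volumes are hypothetical and do not come from any vendor or from the analysts quoted here.

```python
from typing import Dict, List, Tuple

# Hypothetical monthly cost per gigabyte for each storage tier.
TIER_COST_PER_GB: Dict[str, float] = {"tier1": 1.00, "tier2": 0.25}

# Hypothetical policy mapping a classification label to a tier.
CLASS_TO_TIER: Dict[str, str] = {"critical": "tier1", "routine": "tier2"}


def monthly_cost(datasets: List[Tuple[str, int]], tiered: bool) -> float:
    """Sum storage cost for (label, size_in_gb) pairs.

    With tiered=False, every dataset is kept on the most expensive tier,
    which models retaining all data on high-performance systems.
    """
    total = 0.0
    for label, size_gb in datasets:
        tier = CLASS_TO_TIER[label] if tiered else "tier1"
        total += size_gb * TIER_COST_PER_GB[tier]
    return total


# Made-up volumes: 100 GB of critical data, 400 GB of routine data.
datasets = [("critical", 100), ("routine", 400)]
print(monthly_cost(datasets, tiered=False))  # 500.0
print(monthly_cost(datasets, tiered=True))   # 200.0
```

The saving comes entirely from the classification labels: without them, there is no basis for deciding which data can move to the cheaper tier.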
