Data retrieval goes well beyond the scope of email and database systems to embrace all types of records. While databases contain structured data and email works with semistructured data, general data retrieval must address the huge majority of unstructured data in the enterprise, including documents, presentations and all types of media files. Consequently, finding documents or unstructured records in the data center has been likened to "finding a needle in a haystack," and this poses a serious dilemma for storage administrators responsible for managing data, handling regulatory compliance audits and meeting legal discovery requests. Document management software provides many of the features found with email and database archive tools, but several key features are essential for unstructured data. This chapter explains indexing, reporting, data analysis and policy management.
Understand the business needs
According to a study conducted by Xiotech Corp., about 90% of U.S. corporations are involved in some amount of litigation -- often juggling multiple lawsuits at any one time. In addition, noncompliance with government or industry regulations can carry penalties, including hefty fines, sanctions and even jail time. With such serious issues to contend with, it's easy to make a business case for document/records management or discovery tools. However, these tools can vary dramatically in their complexity and features. It's crucial for corporate stakeholders to understand their exposures and liabilities, and then evaluate the features that are most relevant for their industry or specific needs.
Index and search
Large enterprises can easily possess hundreds of millions of unstructured files, making it practically impossible to locate specific data using traditional filename or creation date information. Any document/records management tool should have a very strong indexing and search capability. As you saw in the overview for chapter 2, retrieving data from archives, indexing typically adds specific pieces of metadata to each file. Metadata goes beyond the basic file system details and can include a wide array of descriptive information that can easily be searched -- it's all about preparing a file to be found at a later date.
Comprehensive indexing is usually matched with an equally powerful search capability. In most cases, searching will sort through files based on previously created metadata. For example, a typical search might look for files created by "M. Smith" on "03 April" with "IPO" in the name or description. However, search capabilities are increasingly contextual, looking inside files to locate important keywords. For example, a storage administrator dealing with a Securities and Exchange Commission (SEC) investigation into a brokerage firm might search all documents from "M. Smith" during "2005" containing the words "promise," "guaranty" or "returns."
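The two search styles described above can be sketched in a few lines of Python. This is only an illustration, not how any particular product works: the Record fields, the search function and the sample documents are all hypothetical, and a real tool would query a purpose-built index rather than scan a list.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    """Hypothetical index entry: descriptive metadata plus extracted text."""
    name: str
    author: str
    created: date
    text: str


def search(records, author=None, year=None, keywords=()):
    """Filter on metadata first, then look inside the text for any keyword."""
    wanted = [k.lower() for k in keywords]
    hits = []
    for rec in records:
        if author is not None and rec.author != author:
            continue  # metadata filter: author
        if year is not None and rec.created.year != year:
            continue  # metadata filter: creation year
        body = rec.text.lower()
        if wanted and not any(k in body for k in wanted):
            continue  # contextual filter: none of the keywords appear
        hits.append(rec)
    return hits


# Invented sample data, for illustration only.
index = [
    Record("ipo_memo.doc", "M. Smith", date(2005, 4, 3), "We promise strong returns."),
    Record("lunch.doc", "M. Smith", date(2005, 6, 1), "Menu for Friday."),
    Record("ipo_plan.doc", "J. Doe", date(2005, 4, 3), "Guaranty of returns."),
]

matches = search(index, author="M. Smith", year=2005,
                 keywords=("promise", "guaranty", "returns"))
print([rec.name for rec in matches])  # ['ipo_memo.doc']
```

Checking the cheap metadata filters before the text scan reflects the performance concern that follows: looking inside documents is the expensive part, so it pays to narrow the candidate set first.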
With such a potentially huge volume of records to examine, storage system performance is also an important consideration. Performance matters for regular metadata searches, but it is a particular concern for context searches within documents. Lab testing and evaluation are strongly encouraged to gauge performance and allow for performance tuning within the storage infrastructure.
Many types of document management software will present results in a search engine-style format, similar to Google's, but legal discovery software may also capture and deliver documents to litigation-oriented software, such as CaseCentral, Concordance, DB Textworks, Documatrix, Etech, Introspect, JFS Litigator's Notebook, Lextranet, Nmatrix, Ringtail, Summation and Virtual Partner. Other tools specialize in organizing data specifically for Department of Justice (DOJ), Federal Trade Commission (FTC), NASD and SEC investigations.
Reporting and data analysis
But searching isn't enough -- consider the reporting and analytical capabilities of your document management software. You need to have an overview of the data that is available, including age, type and value details. This helps storage administrators get a view of the data they're storing and its adherence to retention policies.
Workflow analysis and auditing capabilities should be able to document file access and identify users who are interacting with the organization's data. This can help to protect sensitive data against unauthorized access and identify users who are operating outside of corporate workflow policies. For example, auditing can document which users delete files. Unauthorized deletions can then be traced to the responsible users and the problem corrected.
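As a rough sketch of the deletion audit described above, the following Python scans an access log for delete events by users who are not on an approved list. The AuditEvent layout, the user names and the AUTHORIZED_DELETERS set are assumptions made for the example; actual products define their own log formats and permission models.

```python
from collections import namedtuple

# Hypothetical audit log entry: who did what to which file.
AuditEvent = namedtuple("AuditEvent", "user action path")

# Assumed policy: only these accounts may delete records.
AUTHORIZED_DELETERS = {"records_admin"}


def unauthorized_deletions(events, allowed=AUTHORIZED_DELETERS):
    """Return delete events performed by users outside the allowed set."""
    return [e for e in events if e.action == "delete" and e.user not in allowed]


# Invented sample log, for illustration only.
log = [
    AuditEvent("records_admin", "delete", "/finance/2001/q1.xls"),
    AuditEvent("jdoe", "read", "/hr/handbook.doc"),
    AuditEvent("jdoe", "delete", "/legal/contract.doc"),
]

flagged = unauthorized_deletions(log)
for event in flagged:
    print(f"{event.user} deleted {event.path}")
```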
Policy management and enforcement
Today's glut of unstructured data is also subject to corporate data retention and deletion policies, so document management software must allow administrators to manage and enforce those policies so that each data type is retained for the appropriate period. Retention periods will vary depending on the data type and the industry. For example, patient records will be retained far longer than a common corporate memo. Documents are then securely deleted once the retention period has expired. Any deletion should be properly documented to avoid accusations of spoliation (destruction or alteration of evidence). If litigation is a key concern, software should also provide a "litigation hold" feature where relevant data is exempted from deletion.
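The retention logic above can be illustrated with a minimal Python sketch. Everything here is hypothetical: the data types, the retention periods in RETENTION_DAYS (which are invented and not regulatory guidance) and the function names. It shows only the order of the checks: a litigation hold overrides expiry, and every deletion decision is logged.

```python
from datetime import date, timedelta

# Invented retention periods in days; real values come from regulation and policy.
RETENTION_DAYS = {"memo": 365, "patient_record": 365 * 10}


def disposition(doc_type, created, on_hold, today):
    """Decide what to do with one document under the assumed policy."""
    if on_hold:
        return "hold"  # litigation hold exempts the document from deletion
    expires = created + timedelta(days=RETENTION_DAYS[doc_type])
    return "delete" if today >= expires else "retain"


def apply_policy(docs, today):
    """Return a deletion log so that every removal is documented."""
    deletion_log = []
    for name, doc_type, created, on_hold in docs:
        if disposition(doc_type, created, on_hold, today) == "delete":
            deletion_log.append((name, doc_type, today.isoformat()))
    return deletion_log


# Invented sample documents: (name, type, created, on litigation hold?)
docs = [
    ("old_memo.doc", "memo", date(2022, 1, 1), False),
    ("held_memo.doc", "memo", date(2022, 1, 1), True),
    ("chart.pdf", "patient_record", date(2022, 1, 1), False),
]

deletion_log = apply_policy(docs, today=date(2024, 1, 1))
print(deletion_log)  # [('old_memo.doc', 'memo', '2024-01-01')]
```

Checking the hold flag before the expiry date means a held document can never be deleted by the policy run, whatever its age.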
Finally, it's important to note that retention policies do not come from software. Instead, policies are set through a comprehensive understanding of government and industry regulations, along with a thorough knowledge of business objectives and risk factors. No two businesses will necessarily have the same retention policies for a given data type. Experts suggest that it's easier to integrate document management software in the enterprise when there is a well-defined and established "paper" retention policy already in place.