Archiving unstructured data

The meta data database
The data used to populate databases like CommVault's CTE is the meta data--the data about the unstructured data--that's generated during the analysis of each e-mail and file. The type of meta data will depend on the type of underlying content analysis tool used by the product and the policies currently in place.

The meta data includes common attributes such as file owner, creation date, last modified date, as well as the sender, receiver and subject line for e-mail messages. Content analysis also occurs during this stage, as a text-mining tool analyzes the content and context of the documents. For example, Zantaz's EAS uses AltaVista's indexing engine to open, examine, and summarize the content of files and e-mails.
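As a rough illustration of the attributes listed above, the following Python sketch builds one meta data record per file or e-mail. The `MetadataRecord` class and both helper functions are hypothetical names chosen for this example; they do not reflect how CommVault, Zantaz or any other vendor actually stores its meta data.

```python
import os
from dataclasses import dataclass
from datetime import datetime, timezone
from email import message_from_string
from typing import Optional


@dataclass
class MetadataRecord:
    """One row in a hypothetical meta data database."""
    source: str                          # "file" or "email"
    owner: Optional[str] = None          # file owner (numeric uid on Unix)
    created: Optional[datetime] = None   # creation date, when the OS exposes it
    modified: Optional[datetime] = None  # last modified date
    sender: Optional[str] = None
    receiver: Optional[str] = None
    subject: Optional[str] = None


def record_from_file(path: str) -> MetadataRecord:
    """Collect file-system attributes for a single file."""
    st = os.stat(path)
    # st_birthtime exists only on some platforms, so fall back to None.
    birth = getattr(st, "st_birthtime", None)
    return MetadataRecord(
        source="file",
        owner=str(st.st_uid),
        created=datetime.fromtimestamp(birth, tz=timezone.utc) if birth else None,
        modified=datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
    )


def record_from_email(raw: str) -> MetadataRecord:
    """Pull the sender, receiver and subject line out of a raw message."""
    msg = message_from_string(raw)
    return MetadataRecord(
        source="email",
        sender=msg["From"],
        receiver=msg["To"],
        subject=msg["Subject"],
    )
```

A real product would add the output of its content analysis engine (summaries, extracted terms) to each record; this sketch covers only the common attributes.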

Changing compliance regulations

While indexing and archiving all unstructured data may satisfy regulators for now, it's--at best--a short-term fix. Auditors' demands are becoming more and more specific when examining the relationships among e-mails and files. For example:

Relationships between documents. Users may be asked to retrieve all documents that are germane to a specific topic, accompanied by summaries that reflect both their content and the context in which the words are used. This may even mean being able to find relevant documents where a specific name or number isn't mentioned explicitly, but rather alluded to or implied. Doing so requires taxonomies found in enterprise content management (ECM) software and in advanced text-mining algorithms that employ techniques like lexical analysis, neural network-based intelligence and content scanning.

Policy-setting capabilities. Searching and managing a data archive requires the ability to set and change policies. Policy capabilities should allow administrators to encrypt documents, quarantine documents for supervisory review and track who accessed a document, when and how. There should also be accommodation for some type of information lifecycle management mechanism that can either remove documents from the archive at the end of their regulatory life or recognize requirements to retain them for longer periods.
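The lifecycle rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor's policy engine: the class names, the `extended_hold` flag and the `disposition` function are all hypothetical, and the retention period is expressed simply as a number of days per category.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict


@dataclass
class RetentionPolicy:
    category: str      # e.g. "email" (hypothetical category name)
    retain_days: int   # regulatory life of documents in this category


@dataclass
class ArchivedDocument:
    doc_id: str
    category: str
    archived_on: date
    extended_hold: bool = False  # set when a document must be kept longer


def disposition(doc: ArchivedDocument,
                policies: Dict[str, RetentionPolicy],
                today: date) -> str:
    """Return "remove" once the regulatory life has ended, else "retain"."""
    policy = policies.get(doc.category)
    # Without a matching policy, or with a hold in place, err on keeping it.
    if policy is None or doc.extended_hold:
        return "retain"
    expires = doc.archived_on + timedelta(days=policy.retain_days)
    return "remove" if today >= expires else "retain"
```

Defaulting to "retain" when no policy matches is a deliberately conservative choice for this sketch; an organization could just as reasonably flag such documents for review.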

Creating an indexing policy
Creating policies so that unstructured data can be properly indexed, archived and retrieved is no longer an onerous task. Lubor Ptacek, EMC's director of marketing, found that many companies that purchased Documentum would create a task force to identify needed policies and categories. Often, the companies would get bogged down debating how to categorize and classify their data before ever using Documentum.

To aid implementers, Documentum now comes with a set of default categories and policies. EMC recommends companies first address only a subset of their unstructured data. Policies can then be tuned over time to accommodate trends that emerge in the usage of the data or specific statutory requirements.

Many ECM and archiving products offer a data classification taxonomy in addition to a policy engine. The taxonomy provides other ways to classify data, such as by department, purpose or client, and data can be classified in multiple categories. The taxonomy makes it possible to find all documents pertinent to a specific subject without requiring multiple database queries.

A taxonomy doesn't lessen the importance of policy engines. For instance, a policy can be set to index all occurrences of a word. During content analysis, the contents of an e-mail or file are evaluated based on existing policies, with the results stored and indexed in the meta data database to enable fast searches on the words or phrases defined in the policies.
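To make the indexing step concrete, here is a small Python sketch of a policy that indexes every occurrence of a given word. It assumes the simplest possible design, an in-memory inverted index keyed by word and document ID; the function names are hypothetical, and commercial products use their own engines and storage.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List


def new_index() -> Dict[str, Dict[str, List[int]]]:
    """Create an empty index of the form word -> document ID -> positions."""
    return defaultdict(lambda: defaultdict(list))


def index_document(doc_id: str, text: str,
                   policy_terms: Iterable[str], index) -> None:
    """Record each position at which a policy-defined word occurs."""
    wanted = {term.lower() for term in policy_terms}
    # Positions are counted in words, starting from zero.
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9']+", text)):
        word = match.group().lower()
        if word in wanted:
            index[word][doc_id].append(position)


def search(index, term: str) -> List[str]:
    """Return the IDs of all documents containing the term."""
    return sorted(index.get(term.lower(), {}))
```

Because only words named in a policy are stored, a search is a single dictionary lookup rather than a scan of every archived document.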

But looking only at specific words or phrases has its limitations. For instance, the word "football" may mean one thing to Americans and something else to the rest of the world. To actually understand the use of the word "football" in the context of an e-mail or file, natural language processing (NLP) algorithms are used to approximate what humans do--analyze and interpret words within their context. Thus, the content analysis process would recognize that the word "football" could mean "soccer" as well as American football.
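The idea of reading a word in context can be shown with a deliberately simple Python sketch. It is not an NLP algorithm of the kind these products use; it only counts hand-picked cue words near "football" to guess which sense is meant. The cue lists and the `guess_sense` function are assumptions made up for this example.

```python
import re
from typing import Dict, Optional, Set

# Hypothetical cue words for each sense of "football".
SENSE_CUES: Dict[str, Set[str]] = {
    "american football": {"touchdown", "quarterback", "nfl", "yards"},
    "soccer": {"goalkeeper", "fifa", "striker", "pitch"},
}


def guess_sense(text: str,
                cues: Dict[str, Set[str]] = SENSE_CUES) -> Optional[str]:
    """Pick the sense whose cue words overlap most with the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sense: len(words & vocab) for sense, vocab in cues.items()}
    best = max(scores.values())
    if best == 0:
        return None  # no contextual evidence either way
    winners = [sense for sense, score in scores.items() if score == best]
    # A tie is treated as "undecided" rather than guessed.
    return winners[0] if len(winners) == 1 else None
```

Even this toy version shows why context matters: the same keyword lands in different categories depending on the words around it, which a plain keyword policy cannot capture.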

The ability to search for a word or phrase across all managed unstructured data sources from a single point is one clear advantage ECM products have over their archiving counterparts. ECM apps make API calls into other unstructured data repositories; archiving software doesn't. Zantaz, for example, can do this only for the unstructured data types it integrates with, at this point mainly through NAS heads. Documentum creates a single repository for all types of unstructured data and provides users with a common set of services to manage them. It converts these different data types into objects and creates meta data associated with each object. Converting the data into objects also allows relationships among objects to be defined and retained in the meta data repository. The downside of the ECM approach is that it can take a lot of time up front to define and build these relationships, and it may require businesses to reengineer their existing processes. In some cases, the cost to set up and manage the unstructured data relationships may outstrip the gains.

Implementation considerations
Licensing also varies widely among these archiving products. For example, the ability to discover and index messages on e-mail servers requires a user to license Open Text's Livelink for E-mail Monitoring module, while another license for the company's E-mail Archiving component is required to manage and archive e-mail. Open Text's E-mail Management module license delivers all of this functionality in a single package.

Apps like Open Text's Livelink and Veritas Software Corp.'s Enterprise Vault communicate with e-mail servers through TCP/IP ports using standard APIs: MAPI for Exchange and Lotus APIs for Lotus Notes. IBM Corp.'s DB2 CommonStore, however, uses Notes-specific protocols (Notes RPC and Domino Internet Inter-ORB Protocol) to extract Notes database information, and uses WebDAV, a set of HTTP extensions, to access public folders on Exchange.

Because these products use common network protocols to communicate with e-mail and file servers, administrators should check their internal network to ensure that the appropriate IP ports on the firewall are open and that the necessary routes exist within the routing tables.
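A quick way to run such a check from the archiving server is to attempt a TCP connection to each port. The Python sketch below uses only the standard `socket` module; the function names are hypothetical, and the port numbers in the usage note are commonly cited defaults that should be confirmed against each product's documentation.

```python
import socket
from typing import Dict, Iterable


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unroutable hosts alike.
        return False


def check_ports(host: str, ports: Iterable[int],
                timeout: float = 3.0) -> Dict[int, bool]:
    """Check several ports on one host and report each result."""
    return {port: port_is_reachable(host, port, timeout) for port in ports}
```

For example, `check_ports("mail.example.com", [1352, 80, 135])` would test a placeholder host on the usual defaults for Notes RPC (1352), HTTP, which carries WebDAV (80), and the Windows RPC endpoint mapper often involved in MAPI connections (135). A False result does not say whether a firewall rule or a missing route is to blame, only that the connection failed.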

Unstructured data management tools provide a means of managing e-mail and file archives. Current products, however, lack a simple, comprehensive way to deliver an enterprise-level archiving solution. For now, organizations will have to rely on archiving point solutions. Companies ready to tackle longer-term compliance issues and bring some data mining functionality to their unstructured data should look at ECM tools that provide content searching and analysis tied to preset policies and customized taxonomies. But ECM isn't easy to implement or manage, and users will still need archiving software for their e-mail and files. For data that's of indefinite value now, most users will find it easier and cheaper to implement an archiving application and deal with the content management aspect later.

This was first published in August 2005
