Finding Data

Archiving applications are increasingly being used to minimize online data stores and to meet compliance requirements. Most of those archivers include search features, but the capabilities vary widely. Understanding how these search tools work will help you find the best fit for your company.

This article can also be found in the Premium Editorial Download: Storage magazine: Big 3 backup apps adapt to disk.

Like searching for the proverbial needle in a haystack, finding a single e-mail, file or record buried in terabytes of archived data is often futile. While many companies have successfully implemented automated data archiving, most are just beginning to grapple with the issue of how to recover specific portions of that archive.

Nearly all archiving applications provide some type of search capability to retrieve data items, but not all provide the flexibility and level of sophistication needed to meet the often-demanding requirements of litigation and regulatory compliance activities. Search-only products that bolster the search capabilities of archiving apps are appearing in greater numbers. Still, some companies may not be 100% confident that the search tools in their archiving arsenal will discover all relevant materials and may seek additional assurance by using an outside discovery service.

Advanced search concepts
Hyperbolic tree: A graphic representation of found instances of data that shows the relationships among the data objects.

Fuzzy search: A search engine with this capability can find variations in the spelling of a search word or even misspellings.

Phonic search: A phonic search will find instances where two words have different spellings but sound alike (e.g., Smythe and Smith).

Stemming search: A stemming search will find variations of the search word that share its root (e.g., a search on "index" also finds "indexed" and "indexing").

Natural language processing: Natural language processing uses the context of a file to determine if the search result matches the intent of a search; for example, by distinguishing between "sue" and "Sue."
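To make the fuzzy, phonic and stemming definitions concrete, here is a minimal Python sketch. It is an illustration only, not how any product named in this article works: the fuzzy match leans on the standard library's difflib, the phonic match uses the classic Soundex code, and the stemmer is a deliberately naive suffix stripper with a hypothetical suffix list.

```python
import difflib

def fuzzy_match(term, vocabulary, cutoff=0.8):
    """Return vocabulary words whose spelling is close to `term`."""
    return difflib.get_close_matches(term.lower(), vocabulary, n=10, cutoff=cutoff)

# Soundex digit for each consonant; vowels, h, w and y have no digit.
_SOUNDEX_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit

def soundex(word):
    """Return the four-character Soundex code for a word."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    code = word[0].upper()
    prev = _SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        digit = _SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = digit
    return (code + "000")[:4]

# Hypothetical suffix list, longest first; real stemmers use far richer rules.
_SUFFIXES = ("ings", "ing", "ers", "er", "ed", "es", "s")

def naive_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in _SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# A misspelled query still finds the intended word.
fuzzy_hits = fuzzy_match("retension", ["retention", "retrieval", "contract"])
# Names that sound alike share a Soundex code.
same_sound = soundex("Smythe") == soundex("Smith")
# Variations of a word reduce to the same root.
roots = {naive_stem(w) for w in ("index", "indexed", "indexing", "indexes")}
```

A search engine would apply the same transformation to both the query and the indexed terms, so that a match on the code or root stands in for a match on the exact spelling.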

Brian Erdelyi, information security officer at Toronto-based Blackmont Capital Inc., learned this lesson while working at another company, where a high-profile legal case required restoring all e-mail for 12 specific users. "Because this company was so large, everybody's mailbox was on a different server," recounts Erdelyi, "so we had to restore 12 servers 12 times for each month in the past year." Because they had to restore the data from tape, it took them approximately six months to complete the chore. With his current setup at Blackmont using Fortiva Inc.'s archiving service, Erdelyi says a similar request was completed in "a day or two."

John Hegner, vice president of technology services at Liberty Medical Supply Inc., Port St. Lucie, FL, implemented Assentor Discovery from iLumin Software Services Inc. (now owned by Computer Associates International Inc.) after he had to restore e-mail messages from backup tapes. He sent tapes to a data restore service that charged "over $100,000 for the work," says Hegner. Intent on avoiding a similar experience and expense, he installed iLumin's product so that he could keep restore efforts in-house.

Incidents such as these are quite common. They're a stark reminder that new data retention and retrieval requirements are beyond the scope of traditional backup apps and procedures, and generally mean adding new archiving/search tools to the storage environment.

The trick is to know how deeply a search will need to delve into archives. In many cases, companies implemented archiving specifically to address storage and application performance concerns. By paring down the amount of application data stored on pricey, higher performing storage, companies could forestall buying additional primary disk. Searching through those data archives may have been a secondary consideration initially but, as the archives grow, the demand for search capabilities also grows.

Regulatory compliance and litigation aside, companies have found that search tools can help identify potential legal problems and, in doing so, may avoid legal issues altogether. Jim McGann, vice president of marketing at Index Engines, Holmdel, NJ, sees this application of search technology as a growing area of interest, citing an investment bank customer that routinely searches e-mail for certain words and then saves those search results in encrypted .PST files.

Erdelyi has been archiving Blackmont Capital's e-mail with the Fortiva Suite of outsourced services for approximately a year and anticipates using the search function proactively in some cases. "There's a feature within Fortiva where I can set up policies that are effectively keywords," says Erdelyi. "If these keywords are detected, [the e-mail] can be flagged for review." It's an intriguing prospect: An HR group could detect an instance of harassment or other improper behavior long before it became a serious legal matter. Besides detecting HR-related indiscretions, proactive searches can keep a company compliant. "There are certain code-of-conduct rules that our traders have to follow," notes Erdelyi, "so we set up keywords that will trigger [certain documents] and our compliance department will review those on a daily basis."
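The keyword-policy idea can be sketched in a few lines of Python. This is a hypothetical illustration and does not represent Fortiva's implementation: the Policy class, the policy names and the keyword lists are all assumptions made up for the example.

```python
import re
from dataclasses import dataclass

@dataclass
class Policy:
    """A named set of keywords that should trigger a review (hypothetical)."""
    name: str
    keywords: tuple

    def pattern(self):
        # Match any keyword as a whole word or phrase, ignoring case.
        words = "|".join(re.escape(k) for k in self.keywords)
        return re.compile(r"\b(?:" + words + r")\b", re.IGNORECASE)

def flag_messages(messages, policies):
    """Return (message_id, policy_name, matched_terms) for every policy hit."""
    hits = []
    for msg_id, body in messages.items():
        for policy in policies:
            found = sorted({m.lower() for m in policy.pattern().findall(body)})
            if found:
                hits.append((msg_id, policy.name, found))
    return hits

# Illustrative data only.
policies = [
    Policy("trading-conduct", ("guaranteed return", "inside information")),
    Policy("hr-review", ("harass", "threat")),
]
messages = {
    "msg-1": "This fund offers a guaranteed return, trust me.",
    "msg-2": "Lunch at noon?",
    "msg-3": "I consider that e-mail a threat.",
}
flagged = flag_messages(messages, policies)
```

In practice the flagged list would feed a review queue for the compliance or HR group rather than trigger any automatic action.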

Search criteria
The granularity of a search depends on the elements the search application stores and examines. Basic search engines catalog the meta data associated with a file--typically information like file name, last access date, and the ID of the user who created or modified the file. For e-mail, sender and recipient, subject, date sent and other basic message information can be searched. Searching on such rudimentary meta data elements might work fine with a small pool of data, but will likely yield a results set that's impracticably large when searching voluminous data stores.

Only a handful of products still rely on such a simple, restricted meta data search. Some products allow users to customize the meta data by adding identifiers such as keywords or tags. Keywords help to narrow searches, but adding them is generally a manual chore, and a uniform set of keywords must be maintained to ensure any degree of search consistency.

Rather than rely solely on meta data and keywords, most search app providers now index the content of the file or e-mail message and its attachments. Because the full content of the file is compared to the user's search criteria, searches can be much more focused and can take in more unstructured data. "We're finding an increased need to go out and look at the content of the data," says Mark Diamond, president and CEO at Contoural Inc., a Mountain View, CA-based consulting firm. "So a tool that can only look at file attributes has limited value."
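The difference between a meta data search and a full-text search can be sketched with a small inverted index in Python. The document layout, field names and functions below are assumptions made for illustration; commercial products use far more compact and elaborate index structures.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each term to the set of document IDs whose body contains it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for term in tokenize(doc["body"]):
            index[term].add(doc_id)
    return index

def search_content(index, query):
    """Return IDs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

def search_metadata(documents, field, value):
    """Return IDs of documents whose meta data field equals the value."""
    return {doc_id for doc_id, doc in documents.items() if doc.get(field) == value}

# Illustrative data only.
documents = {
    "msg-1": {"sender": "trader@example.com", "subject": "Lunch",
              "body": "The merger terms are confidential until Friday."},
    "msg-2": {"sender": "hr@example.com", "subject": "Benefits",
              "body": "Open enrollment starts Friday."},
}
index = build_index(documents)
by_subject = search_metadata(documents, "subject", "Lunch")
by_content = search_content(index, "merger confidential")
```

Note that the subject line of msg-1 gives no hint of its content; only the content search would surface it in a query about the merger.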

There are, however, some penalties associated with full-text indexing. First, it takes time and processing cycles to create the index, although most archivers manage the indexing process behind the scenes to limit any impact on application performance. The second issue is storage space. A full-text index of hundreds of thousands--or even millions--of data objects can result in an extremely large index that may use significant disk space and slow searches. Paring the size of the index is the Holy Grail of archiving, and vendors employ proprietary technologies to keep their indexed output as compact as possible. For example, Index Engines claims that its full-text index requires only 8% of the disk space required by the original files.

Reliability is another consideration. Vendors make the indexing process as transparent to users as possible, so little is revealed about its inner workings. That's generally good, but if the indexing process fails to complete properly, there may not be any indication that the source data wasn't fully indexed. Searches could then appear to be comprehensive but, because files were missed, produce results that aren't "correct and complete" in the eyes of the court, leaving a company vulnerable to considerable penalties.

"When indexing engines are performing indexing operations, they routinely fail just like any other software fails," says Peter Mojica, vice president of product management at AXS-One Inc., a records compliance management company in Rutherford, NJ. The failure could be extremely difficult to detect because it may affect only a portion of a single document attachment or e-mail. Mojica says AXS-One's Rapid-AXS Search & Retrieval system traps errors and notifies administrators that re-indexing may be required.
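One general way to guard against silent indexing gaps is to record every failure and then reconcile the index against the archive. The Python sketch below shows that idea under stated assumptions; it is not a description of AXS-One's product, and index_one is a hypothetical stand-in for whatever routine indexes a single item.

```python
def index_archive(documents, index_one):
    """Index every document, recording failures instead of hiding them."""
    indexed, failed = set(), {}
    for doc_id, content in documents.items():
        try:
            index_one(doc_id, content)
            indexed.add(doc_id)
        except Exception as exc:  # sketch only; real code would catch narrower errors
            failed[doc_id] = str(exc)
    return indexed, failed

def reconcile(source_ids, indexed_ids):
    """Return IDs present in the archive but missing from the index."""
    return set(source_ids) - set(indexed_ids)

# Hypothetical indexer that cannot handle a binary attachment.
def index_one(doc_id, content):
    if not isinstance(content, str):
        raise ValueError("unsupported attachment format")

documents = {
    "msg-1": "Quarterly results attached.",
    "msg-1/attachment": b"\x00\x01binary",
    "msg-2": "See you Monday.",
}
indexed, failed = index_archive(documents, index_one)
needs_reindex = reconcile(documents.keys(), indexed)
```

A non-empty needs_reindex set is the signal an administrator would want before anyone certifies a search as complete.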

Data classification
A more promising alternative to manual keywording or tagging is to use some type of data classification system, especially when dealing with unstructured data such as word processing and spreadsheet files. Products from companies such as Index Engines, Kazeon Systems Inc. and Njini Inc. allow users to create custom policies that are applied to help categorize files, generally at the time of file creation or during the backup process.

Data classification can make searches more comprehensive in ways that go beyond adding custom attributes to the data. For example, out of the box, some apps will apply standard classifications to files to tag specific elements like Social Security numbers. Rather than requiring users to devise search criteria that look for a specific pattern or numerical sequence, the app, in effect, pre-screens the data for the attribute of containing a Social Security number. iLumin, for example, includes this capability in its Assentor Discovery product. iLumin calls this classification technique "smart indexing," as it allows the application to segregate files that include the Social Security numerical pattern so that subsequent searches will only have to plow through a subset of the data. Other patterns that may be specific to a particular business, such as part numbers, can also be included.
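Pattern-based classification of this kind can be approximated with regular expressions. The following Python sketch is an assumption-laden illustration, not iLumin's "smart indexing": the tag names are invented, the part number format (PN- plus five digits) is hypothetical, and the Social Security pattern checks only the familiar 3-2-4 digit layout.

```python
import re
from collections import defaultdict

# Tag name -> pattern. Both tag names and the part number format are hypothetical.
CLASSIFIERS = {
    "contains_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "contains_part_number": re.compile(r"\bPN-\d{5}\b"),
}

def classify(text):
    """Return the set of tags whose pattern appears in the text."""
    return {tag for tag, pattern in CLASSIFIERS.items() if pattern.search(text)}

def build_tag_index(files):
    """Pre-screen files once so later searches only visit tagged subsets."""
    tag_index = defaultdict(set)
    for file_id, text in files.items():
        for tag in classify(text):
            tag_index[tag].add(file_id)
    return tag_index

# Illustrative data; 078-05-1120 is a well-known invalid specimen number.
files = {
    "hr-001.doc": "Employee SSN: 078-05-1120, start date June 1.",
    "order-17.xls": "Ship 40 units of PN-20417 to the Holmdel office.",
    "memo.txt": "Quarterly review moved to Thursday.",
}
tag_index = build_tag_index(files)
ssn_files = tag_index["contains_ssn"]
```

A later search for files containing Social Security numbers reads ssn_files directly instead of scanning every file again.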

Positioned as an information lifecycle management tool, CommVault Inc.'s Data Classification Enabler module, part of its QiNetix 6.1 suite, classifies archival data based on file-system meta data and content. An upcoming release of the product will allow users to add custom meta data to the classification criteria. Kazeon's Information Server IS1200 appliance also provides capabilities that go beyond data classification for search and retrieval, such as file copying and moving data to WORM media. Kazeon recently announced a partnership with Google to integrate its Information Server with Google's enterprise search offerings.

Beyond the basics
Meta data and index-based searches may suffice for many organizations, but litigation issues are likely to require more advanced search capabilities. Not surprisingly, the push for more sophisticated search functionality is being spearheaded by companies that have considerable experience with the discovery process, such as "highly litigious corporations," notes Michael Clark, managing director at EDDix LLC, a Washington, DC-based electronic data discovery research firm. He cites the tobacco, financial services, insurance, energy and telecommunications industries as examples.

For users, the goal is simple: Ensure that all material relevant to a litigation or regulatory case is found quickly. "Ultimately, you need to get beyond keyword search and Boolean operators," says Clark. Some of the newer, more advanced search tools address this issue, and "reduce the overall cost of a project," he adds.

A keyword search will turn up all occurrences of a word or phrase, but more advanced search engines work more like the human mind. Some search engines use a technique called latent semantic indexing (LSI), which is based on a statistical system that reveals associations among words or phrases within files. For example, an LSI-enabled engine might discover during a search on the word "contract" that the phrase "binding agreement" appears with enough consistency that a logical association can be assumed. So the "contract" search may return files that don't even contain that word but are linked logically.
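The "contract" example can be imitated with a much simpler stand-in than LSI. The Python sketch below is not LSI, which factors a term-by-document matrix; it merely measures how often words share documents (a Jaccard score) and uses strongly associated words to widen the query. The threshold and sample documents are assumptions chosen for the illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Return the set of lowercase words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def build_postings(documents):
    """Map each word to the set of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            postings[word].add(doc_id)
    return postings

def associated_terms(postings, term, threshold=0.6):
    """Return words whose document sets overlap strongly with the term's."""
    base = postings.get(term, set())
    related = {}
    for word, docs in postings.items():
        if word == term:
            continue
        score = len(base & docs) / len(base | docs)  # Jaccard similarity
        if score >= threshold:
            related[word] = score
    return related

def expanded_search(postings, term, threshold=0.6):
    """Return documents containing the term or any strongly associated word."""
    hits = set(postings.get(term, set()))
    for word in associated_terms(postings, term, threshold):
        hits |= postings[word]
    return hits

# Illustrative data only.
documents = {
    "d1": "The contract is a binding agreement between both parties.",
    "d2": "Signed contract attached; this binding agreement takes effect Monday.",
    "d3": "Please review the binding agreement before the call.",
    "d4": "Lunch menu for Friday.",
}
postings = build_postings(documents)
related = associated_terms(postings, "contract")
results = expanded_search(postings, "contract")
```

Here d3 is returned for a "contract" search even though it never uses the word, which is the behavior the article attributes to concept-based engines.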

Shopping for search
Here are some tips to keep in mind when evaluating the capabilities of archive and search products.
  • Ask the vendor how much space its full-text index requires; this is usually expressed as a percent of the size of the source data. And find out if the indexing process will considerably slow down the application's performance.
  • Know your search needs--consult with legal, compliance and human resource departments to determine what types of searches they're likely to require.
  • Ask the vendor about its roadmap for product development. For archive vendors, ask about new or more sophisticated search features that they plan to add. For search application vendors, find out what additional archive applications they'll support.
  • Test the user interface to determine if it's intuitive enough so that users in your company's business units will be comfortable using it. A Web-based interface is the easiest to implement and provides universal access.

Some search providers have already incorporated LSI into their applications. For example, San Francisco-based Recommind Inc. provides conceptual search capabilities in its MindServer Retrieval products. iLumin doesn't do conceptual searches per se, but includes a number of advanced search techniques such as natural language processing, which can recognize the usage differences that distinguish words with the same spelling--such as the name "Sue" and the verb "sue." Zantaz Inc.'s EAS Search currently provides proximity searches that return results for two or more words or phrases that appear near each other within a document. The firm says it will soon include conceptual searching as well as relevancy scoring of found data objects.
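A proximity search of the kind described above comes down to comparing word positions. This Python sketch shows the general idea for two single-word terms; it is not based on Zantaz's EAS Search, and the function names and the default distance are assumptions.

```python
import re

def positions(tokens, term):
    """Return every position at which the term occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def within_proximity(text, term_a, term_b, max_distance=5):
    """True if the two terms occur within max_distance words of each other."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    pos_a = positions(tokens, term_a.lower())
    pos_b = positions(tokens, term_b.lower())
    return any(abs(a - b) <= max_distance for a in pos_a for b in pos_b)

text = "Do not sue the supplier until the contract review is complete."
near = within_proximity(text, "sue", "supplier", max_distance=3)
far = within_proximity(text, "sue", "contract", max_distance=3)
```

In the sample sentence, "sue" and "supplier" are two words apart, so near is true, while "sue" and "contract" are five words apart, so far is false.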

Other search techniques appearing in document management, archiving and search applications include fuzzy, phonic and stemming searches (see "Advanced search concepts"). Many of these have been used for some time by Internet search sites.

The key to enhancing search capabilities with these complex, compute-intensive algorithms is incorporating them without sacrificing the performance of the search process. To this end, companies like AXS-One suggest using more general search techniques on a dataset first to create a more manageable subset that can be used with the advanced search functions.
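The two-stage approach can be sketched generically in Python: a cheap filter trims the dataset, and the compute-intensive technique runs only on what remains. The function and field names are hypothetical, and the "expensive" step here is a trivial placeholder for a fuzzy or conceptual match.

```python
def staged_search(documents, cheap_filter, expensive_match):
    """Run the costly matcher only on documents that pass the cheap filter."""
    candidates = [doc_id for doc_id, doc in documents.items() if cheap_filter(doc)]
    hits = [doc_id for doc_id in candidates if expensive_match(documents[doc_id])]
    return candidates, hits

# Illustrative data only.
documents = {
    "msg-1": {"sender": "trader@example.com", "body": "The merger closes Friday."},
    "msg-2": {"sender": "hr@example.com", "body": "Merger FAQ for employees."},
    "msg-3": {"sender": "trader@example.com", "body": "Lunch on Friday?"},
}

expensive_calls = []

def from_trader(doc):
    """Cheap stage: a simple meta data comparison."""
    return doc["sender"] == "trader@example.com"

def mentions_merger(doc):
    """Costly stage (placeholder): records each call to show how few are made."""
    expensive_calls.append(doc["body"])
    return "merger" in doc["body"].lower()

candidates, hits = staged_search(documents, from_trader, mentions_merger)
```

The costly matcher runs twice rather than three times in this toy case; across millions of archived items, the same trimming is what keeps advanced searches practical.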

Regardless of the specific search functionality employed by each vendor, it's clear that the state of the art in searching is steadily advancing. "The tools have reached the point where they're as or more reliable than human beings," says Andy Cohen, senior counsel and director of global solutions practice lead for compliance at EMC Corp. Cohen is also a member of the Sedona Conference, a group of lawyers, jurists and other experts that offers publications on electronic document retention and management, among other topics.

End-user tools
Right now, most searches are performed by storage administrators or other IT personnel, a natural development as storage managers are typically those who acquire and implement archiving applications. "These tools are still too new to be in the hands of the users," says Greg Forest, Contoural's vice president of services delivery. With e-mail archivers, however, users are increasingly running the search queries themselves.

But the trend is toward more user-friendly interfaces so that legal, HR and compliance personnel can do the searches themselves, and view and analyze the results immediately. In a recent Osterman Research Inc. survey commissioned by Mimosa Systems Inc., the maker of NearPoint, a Microsoft Exchange archiving product, more than 77% of the respondents indicated that they want their archive/search applications to have end-user tools to reduce the reliance on IT.

Blackmont Capital's Erdelyi agrees, and is currently moving the responsibility for searching the Fortiva archives into his user community. "We've trained our compliance and our legal departments so they're able to perform the searches themselves," he says. Turning searching over to his users will also help Erdelyi manage the company's e-mail system better, rather than enforcing mailbox quotas. Quotas tend to result in users dumping mail into .PST files that consume disk capacity, slow backups and create the potential for legal exposure. By letting e-mail users use the search functions to find individual archived messages, Erdelyi hopes to discourage the use of .PSTs. "They'll always have access to [their messages] using Fortiva," he says.

Search federation
Two undeniable factors will affect the direction that search tools take: The amount of data that companies have to retain will continue to grow and finding specific pieces of that data will become more difficult. Most archiving tools in use today are for e-mail, file systems and databases. For the most part, these are discrete tools designed to integrate with specific apps. If a company uses several different archivers, it's also likely to have several associated search tools.

Ultimately, the proliferation of search tools will become unwieldy and the likelihood of missing something critical will increase. One solution may be an overall archiver/searcher that can work with data from multiple apps. But Contoural's Diamond warns users to "be careful about one-size-fits-all systems, because what's unacceptable is to have your archiving systems slow down the performance of the applications from which they're archiving data."

EDDix's Clark agrees, saying there will be "multiple vendors [and] special-purpose archives." If Clark and others are correct, that scenario opens the door for a search-only product that doesn't do any archiving, but can access multiple vendors' archives. Clark says vendors that develop those solutions will "position themselves as the hub in a hub-and-spoke model where they will be able to provide data mining, analysis and other tools independent of who the [archiving] vendor is." A number of products are evolving toward this model, such as MetaLincs Corp.'s MetaLincs 2.0, an e-discovery product that supports many of the popular e-mail archives and has a roadmap that calls for extending support to other applications.
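The hub-and-spoke model Clark describes can be sketched as a search hub that fans a query out to one adapter per archive. Everything in this Python example is hypothetical, including the ArchiveAdapter interface and the class names; it does not reflect MetaLincs or any vendor API.

```python
from typing import Iterable, Protocol

class ArchiveAdapter(Protocol):
    """What the hub expects from each archive (hypothetical interface)."""
    name: str
    def search(self, query: str) -> Iterable[str]: ...

class InMemoryArchive:
    """Stand-in for a vendor archive; a real adapter would call the vendor's API."""
    def __init__(self, name, items):
        self.name = name
        self._items = items

    def search(self, query):
        q = query.lower()
        return [item_id for item_id, text in self._items.items() if q in text.lower()]

class OfflineArchive:
    """Stand-in for an archive that cannot be reached."""
    name = "database-archive"

    def search(self, query):
        raise ConnectionError("archive unreachable")

class SearchHub:
    """Send one query to every archive and keep results and errors separate."""
    def __init__(self, adapters):
        self._adapters = list(adapters)

    def search(self, query):
        results, errors = {}, {}
        for adapter in self._adapters:
            try:
                results[adapter.name] = list(adapter.search(query))
            except Exception as exc:  # report the gap rather than hide it
                errors[adapter.name] = str(exc)
        return results, errors

hub = SearchHub([
    InMemoryArchive("email-archive", {"msg-1": "Merger terms attached"}),
    InMemoryArchive("file-archive", {"plan.doc": "Merger integration plan",
                                     "menu.txt": "Lunch menu"}),
    OfflineArchive(),
])
results, errors = hub.search("merger")
```

Reporting errors alongside results matters in a discovery setting: an archive that could not be searched is a gap that has to be disclosed or retried, not quietly dropped.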

Standards would facilitate the development of federated search products, but e-discovery experts say there's little, if any, activity in that area at this time. It's more likely that some of the archive players will cede search development to the search-only vendors and provide those vendors with the appropriate APIs so that their products will be interoperable. That, of course, shifts the responsibility to users, who must ensure that the archive and search applications are, indeed, compatible.

This was first published in April 2006