Apps to classify and find data


This article can also be found in the Premium Editorial Download "Storage magazine: Salary survey reveals storage skills are in demand."


Unstructured data
Unlike information stored in well-defined application databases, or in semistructured e-mail servers and document management systems, the file shares in most companies are a dumping ground for more than 400 file formats. As corporations control the data in ERP apps and e-mail servers, users are increasingly using Microsoft Office apps to store their personal productivity files, which are often critical to the business' day-to-day operation. This leads companies to provide higher levels of service for this data at an ever-increasing cost. Unfortunately, this data may be anything from pictures of the grandkids to highly confidential customer documents containing private information.

The problem is that this data is stored in a user-defined fashion that's rarely controlled, searchable or organized in any meaningful way. A company must therefore find a way to separate mission-critical data from data that requires less costly service levels without adversely affecting productivity. Equally critical is the need to identify older data, duplicate copies of data and orphaned data that's no longer needed and can be deleted. Finally, when the need to find, protect or destroy specific pieces of information arises because of litigation or new regulations, how do you quickly respond to these demands? The short answer is to classify data, ideally with automated, user-friendly tools.
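To make the idea of finding older and duplicate data concrete, here is a minimal sketch in Python. It is not how any particular classification product works; the function names, the one-year age threshold and the use of SHA-256 content hashes to spot duplicates are all assumptions chosen for illustration. Orphaned data (files whose owner no longer exists) isn't covered, because detecting it depends on the directory service in use.

```python
import hashlib
import os
import time
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_stale_and_duplicates(root, max_age_days=365, now=None):
    """Walk `root` and return (stale_paths, duplicate_groups).

    A file is "stale" if its last-modified time is older than
    `max_age_days` (an illustrative threshold, not a recommendation).
    Files are "duplicates" if their contents hash to the same digest.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
                digest = file_digest(path)
            except OSError:
                continue  # unreadable or vanished file: skip it
            if st.st_mtime < cutoff:
                stale.append(path)
            by_digest[digest].append(path)
    duplicates = [sorted(p) for p in by_digest.values() if len(p) > 1]
    return sorted(stale), duplicates
```

Calling `find_stale_and_duplicates("/some/share")` (a hypothetical path) returns the list of old files and the groups of identical files. Hashing every file means reading every file, which is one reason a first pass over a large share is slow.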

The primary purpose of information classification and management (ICM) tools is to provide intelligence about the files residing in file shares or shared drives. These files may reside on individual Unix/Linux or Windows servers connected to SAN or DAS, reside in NAS filers or be serviced by NAS blades inside a SAN chassis. Historically, management tools for file systems focused on the file attributes from the file systems themselves. ICM tools also discover the file attributes of a file system, but their power and functionality come from their ability to actually read the contents of a file and search for specific patterns (like Social Security or credit card numbers) or, in some cases, to create a complete index of all text in the document (including numeric information).
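Content-based pattern searching of this kind can be sketched in a few lines of Python. The sketch below is an illustration only: the regular expressions, the function names and the use of the Luhn checksum to weed out random digit strings are assumptions, not the method of any tool discussed here. It also works on plain text, whereas a real tool would first have to extract text from Office documents and other formats.

```python
import re

# Assumed formats: SSNs written as 123-45-6789; card numbers of 13 to 19
# digits, optionally separated by single spaces or hyphens.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_text(text):
    """Return SSN-like strings and Luhn-valid card-like numbers in `text`."""
    cards = []
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_ok(digits):
            cards.append(digits)
    return {"ssn": SSN_RE.findall(text), "card": cards}
```

For example, scanning a string that contains the placeholder SSN `123-45-6789` and the widely used test card number `4111 1111 1111 1111` reports both, while a 16-digit string that fails the checksum is ignored. A pattern match is only a hint that a file may hold private information; it doesn't prove it.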

ICM tools create a repository of metadata by "crawling" a file system or reading a data stream and capturing the file attributes and/or the content of each file. They don't actually store the file in the repository, but instead store the data as individual attributes or entire full-text indexes. The initial process for all but two of the tools described in this article is fairly slow (they may have to read millions of documents); however, after the initial crawl, they all have the ability to do incremental crawls on a periodic basis, which run faster with less impact on the file system. The repositories are then searched to produce reports or take actions on the files. In general, the size of a repository will run from 3% to 15% of the total amount of storage being classified, depending on the type of files and the amount of data kept in the repository (file attributes, specific patterns or full text).
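The difference between an initial crawl and an incremental one can be shown with a small Python sketch. It assumes a SQLite table as the metadata repository and treats a file as changed when its size or modification time differs from what was recorded; the table layout and function names are hypothetical, and commercial tools may detect changes quite differently.

```python
import os
import sqlite3


def open_repository(db_path=":memory:"):
    """Open (or create) a tiny metadata repository. Only attributes are
    stored, never the file contents themselves."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, size INTEGER, mtime_ns INTEGER)"
    )
    return conn


def crawl(conn, root):
    """Record attributes for new or changed files under `root`.

    Returns the sorted list of paths that were (re)indexed. On the first
    run that is every file; on later runs only files whose size or
    modification time changed.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable: skip it
            row = conn.execute(
                "SELECT size, mtime_ns FROM files WHERE path = ?", (path,)
            ).fetchone()
            if row == (st.st_size, st.st_mtime_ns):
                continue  # unchanged since the last crawl
            conn.execute(
                "INSERT OR REPLACE INTO files (path, size, mtime_ns) "
                "VALUES (?, ?, ?)",
                (path, st.st_size, st.st_mtime_ns),
            )
            changed.append(path)
    conn.commit()
    return sorted(changed)
```

Running `crawl` twice against an unchanged directory indexes everything the first time and nothing the second, which is why incremental crawls put less load on the file system. This sketch doesn't remove repository entries for deleted files, and it stores attributes only. As a rough guide to the sizing range quoted above, classifying 10TB of storage would imply a repository of roughly 300GB to 1.5TB.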

This was first published in November 2006
