A long-term, or "deep," archive is an application with very specific requirements. Building one requires that you address some key design parameters. In this tip, I’ll focus on two considerations: how you will decide what to archive, and how you will ingest data into the archive.
Deciding what to archive is a subject too often given short shrift. In many firms, little is done to classify data in the first place. One reason for this is the difficulty many IT planners say they have with classifying user files.
More than 50% of the data that is being created and stored in contemporary firms is in the form of user files, according to many industry reports. We increasingly confront the daunting task of sifting through a growing pile of user files in order to decide which files are important to retain. Examples of important files include those that are proof of compliance with regulations or laws, those that have historical value to the business, or those that contain intellectual property. The problem is that users themselves rarely know which of the files they commit to storage have the properties that would necessitate (or at least recommend) their archival retention. So, the response to this challenge has been simply to retain everything -- a bad idea from the standpoint of both production storage and archive efficiency and cost.
We know data classification schemes generally don't work when delegated to users. They are more likely to succeed if the process for selecting and migrating -- or ingesting -- files into an archive occurs in a manner that is transparent to the user. This can be done in numerous ways. A popular approach uses file metadata to identify files that haven’t been accessed or modified in a specified period of time, and then mark them for migration into an archive repository. Since users may eventually need to find these files again, either advise them of the address of the repository where the relocated files can be found, or leave a “stub” that automatically forwards a file request to the archive repository.
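The metadata-based selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name find_stale_files and the max_idle_days parameter are hypothetical, and the sketch relies on the last-access and last-modified timestamps the file system reports. Many systems are configured to update access times lazily or not at all, so check how your platform records them before trusting the results.

```python
import os
import time
from pathlib import Path


def find_stale_files(root, max_idle_days, now=None):
    """Return files under root not accessed or modified in max_idle_days.

    Hypothetical helper for illustration; it only reads timestamps and
    never moves or deletes anything.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400  # seconds in a day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                info = path.stat()
            except OSError:
                continue  # file vanished or is unreadable; skip it
            # A file is a candidate only if BOTH timestamps are old.
            if max(info.st_atime, info.st_mtime) < cutoff:
                stale.append(path)
    return sorted(stale)
```

Calling find_stale_files("/shares/users", max_idle_days=365), with a share path of your own, would return the candidates for migration. What happens next (notifying users or leaving a stub) is a separate step.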
This approach is non-granular, since it does not consider the business context of the file. Additional granularity is needed if you are applying retention policies to specific data, specifying, for example, how long the file is to be retained. One basic contextual reference that is usually valuable to include with the file is some reference to the business process that the file supports. There are numerous products, including Trusted Edge from SGI (part of its recent acquisition of FileTek), that will let you classify data files by the role of the user who created them.
Alternatively, if you are a Microsoft shop, you can avail yourself of the File Classification Infrastructure (FCI) that is already part of the operating system and that Redmond has recently opened up for use. FCI provides the check boxes that are seen when you show the properties of a file in Microsoft’s NT file system (right click, select Properties and you see boxes marked “Read Only,” “Hidden,” etc.). A couple of years ago, Microsoft opened up FCI in order to allow users to program their own check boxes. Doing so in the operating systems of all the workstations deployed in, say, the accounting department -- adding a box called Accounting that is permanently ticked for all users in the department -- will add more metadata to the file that can be used to guide more exacting policies about its care and handling, both in the production environment and in the archive.
These tricks can help define more exactly what is going to be included in a deep archive. Next, you need a way to move the selected data into the archive repository itself. This can be done in numerous ways: copying files manually to the archive, establishing a batch process for greater automation, or intervening in a well-defined workflow.
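As a rough illustration of the batch option, the Python sketch below copies a list of already-selected files into an archive directory and verifies each copy with a checksum. The names ingest_batch and sha256_of are hypothetical, and the sketch assumes the archive is reachable as an ordinary file system path, which will not be true of every archive platform. It deliberately leaves the source files in place; removing or stubbing them is a policy decision best made after the copy has been verified.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute a SHA-256 digest without loading the whole file in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ingest_batch(files, source_root, archive_root):
    """Copy files into archive_root, preserving paths relative to source_root.

    Returns a manifest mapping each relative path to its verified digest.
    Hypothetical helper for illustration only.
    """
    source_root = Path(source_root)
    archive_root = Path(archive_root)
    manifest = {}
    for src in map(Path, files):
        rel = src.relative_to(source_root)
        dest = archive_root / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy2 also carries over timestamps
        copied = sha256_of(dest)
        if sha256_of(src) != copied:
            dest.unlink()  # do not leave a corrupt copy in the archive
            raise IOError(f"checksum mismatch for {rel}")
        manifest[rel.as_posix()] = copied
    return manifest
```

The returned manifest can be stored alongside the archive so that later integrity checks have something to compare against.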
Manual copying is likely to be the first technique you use when establishing an archive, especially if you view archiving simply as placing the files from the production system into a similar file system in the archive platform. However, while using a native file system to hold files created and saved with “time-bound” office automation software may seem like the simplest method to ingest data into a deep archive, it may not be the best idea.
Chances are that files stored for many years will no longer be accessible, either because the software used to create them or the platform required to operate that software no longer exists. This risk leads many archivists to look to software “containers” or “wrappers” that will enable files to be read even if their original read/write tools disappear. In a growing number of cases, files are converted into commercial formats like Adobe PDF or into standardized XML wrappers prior to their ingestion into the archive platform, improving the likelihood that they can be read a decade or more into the future.
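To make the wrapper idea concrete, here is a minimal Python sketch that places a file's bytes inside an XML envelope together with descriptive metadata. The element names (archivedFile, businessProcess and so on) are invented for this example and do not follow any published standard; a real archive would adopt an established preservation schema. Note also that a wrapper like this keeps the original bytes and their context together, but it does not by itself convert the content into a more durable format.

```python
import base64
import hashlib
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from pathlib import Path


def wrap_file(path, business_process):
    """Return an XML string holding the file's bytes plus basic metadata.

    The schema here is hypothetical and for illustration only.
    """
    path = Path(path)
    payload = path.read_bytes()
    root = ET.Element("archivedFile")
    meta = ET.SubElement(root, "metadata")
    ET.SubElement(meta, "originalName").text = path.name
    ET.SubElement(meta, "businessProcess").text = business_process
    ET.SubElement(meta, "sizeBytes").text = str(len(payload))
    ET.SubElement(meta, "sha256").text = hashlib.sha256(payload).hexdigest()
    ET.SubElement(meta, "wrappedAt").text = datetime.now(timezone.utc).isoformat()
    # Base64 keeps arbitrary binary content safe inside a text document.
    content = ET.SubElement(root, "content", encoding="base64")
    content.text = base64.b64encode(payload).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def unwrap(xml_text):
    """Recover (original name, bytes) and verify the recorded checksum."""
    root = ET.fromstring(xml_text)
    payload = base64.b64decode(root.findtext("content"))
    if hashlib.sha256(payload).hexdigest() != root.findtext("metadata/sha256"):
        raise ValueError("payload does not match recorded checksum")
    return root.findtext("metadata/originalName"), payload
```

Because the envelope is plain text with self-describing tags, a future reader needs only an XML parser and a Base64 decoder to recover the original bytes and the business context recorded with them.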
There are notable exceptions to this format-obsolescence problem. In some industries, like media and broadcast, the tools used to create and edit video files are developed using industry standards designed to facilitate preservation and archiving. In fact, the workflows used in production and post-production file creation can be leveraged to facilitate automated migration into an archive. An example of such a workflow-centric data-archiving approach is Spectra Logic’s Deep Storage product, which takes object-oriented content from video post-production workflows and migrates the objects, via a proprietary protocol and gateway server called BlackPearl, to a back-end archive repository based on tape, disk or cloud technology.
Hopefully, as workflows in other industries become better defined and standardized, more automated methods will emerge for selecting, containerizing and ingesting data into archives.