File formats that stand the test of time

Storage professionals are beginning to grapple with long-term data archival. At the same time, archivists and records management professionals have poked their noses out of their musty books and are examining the problem of digital document preservation. By starting at the application layer rather than at the physical device layer, they seem to be making progress.

This fall, the International Organization for Standardization (ISO) approved PDF/A, a version of Adobe Systems Inc.'s Portable Document Format (PDF) suitable for long-term archiving. PDF/A is a subset of PDF version 1.4, and its goal is to make documents recoverable and viewable for much longer than if they were saved as regular PDF files.

But how much longer they can be saved is unclear. PDF 1.0 was released in 1993, and files from that era can theoretically still be read with a contemporary PDF reader. In practice, that's not always true--anecdotally, users report being unable to open PDFs created with Acrobat 2 using Acrobat 6.

"The intent of the ISO working group was not to get fixated on any period of time, but to do things to extend the time," says Stephen Abrams, digital library program manager in the Office for Information Systems at Cambridge, MA-based Harvard University Library and project leader of the ISO PDF/A project. Whatever the lifespan of a PDF document, "PDF/A will have a lifespan that will exceed that," he notes.

PDF/A promises longer document life by ensuring that documents meet several criteria: device independence, self-documentation and "transparency," which Abrams describes as "how amenable an object is to direct human analysis with basic tools." In other words, if a document is transparent, you should be able to determine its content using a text editor and a reference manual, says Abrams.

In practice, PDF/A takes the PDF command set and requires, recommends, restricts or prohibits its use. For example, PDF/A requires that documents embed any fonts they use directly in the document because 20 years from now, it's not safe to assume that the system rendering the document will have those fonts loaded. Similarly, PDF/A requires the use of a device-independent "color space" or color-rendering algorithm such as RGB. At the same time, PDF/A prohibits the use of encryption because it's inherently "inimical to transparency," says Abrams.
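Two of those rules, embedded fonts and no encryption, leave traces that can be spotted in the raw bytes of a file. The Python sketch below is a rough illustration under stated assumptions, not a PDF/A validator: the function name naive_pdfa_hints is hypothetical, and the scan relies only on standard PDF dictionary keys (/Encrypt in the trailer, /FontDescriptor and /FontFile, /FontFile2 or /FontFile3 for embedded font programs). It will miss anything stored inside compressed streams, so a clean result proves nothing on its own.

```python
import re


def naive_pdfa_hints(pdf_bytes: bytes) -> dict:
    """Heuristic byte scan for two PDF/A-related signals.

    This is NOT conformance checking. It only looks for well-known
    PDF dictionary keys in the uncompressed parts of the file.
    """
    return {
        # Every PDF file begins with a "%PDF-" version header.
        "has_pdf_header": pdf_bytes.startswith(b"%PDF-"),
        # An encrypted PDF carries an /Encrypt entry in its trailer.
        "mentions_encrypt": b"/Encrypt" in pdf_bytes,
        # Each font with a descriptor declares /Type /FontDescriptor.
        "font_descriptors": len(
            re.findall(rb"/Type\s*/FontDescriptor", pdf_bytes)
        ),
        # Embedded font programs are referenced through /FontFile,
        # /FontFile2 or /FontFile3 in the font descriptor.
        "embedded_font_streams": len(
            re.findall(rb"/FontFile[23]?\b", pdf_bytes)
        ),
    }


# Hand-written fragment for illustration only; not a complete PDF.
SAMPLE = (
    b"%PDF-1.4\n"
    b"1 0 obj\n"
    b"<< /Type /FontDescriptor /FontName /ABCDEF+Helvetica "
    b"/FontFile2 5 0 R >>\n"
    b"endobj\n"
    b"trailer\n<< /Root 2 0 R >>\n%%EOF"
)

if __name__ == "__main__":
    print(naive_pdfa_hints(SAMPLE))
```

A file that mentions /Encrypt, or that has more font descriptors than embedded font streams, would be worth a closer look with a real validation tool. That such a crude scan works at all is a small demonstration of the "transparency" Abrams describes.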

The ISO has yet to tackle the creation of archivable forms of other common document types, such as Microsoft's .doc and .ppt formats. "Microsoft Office documents are proprietary and have a closed specification," says Abrams, and "that complicates long-term archiving of objects in those formats." However, Microsoft will support the PDF 1.4 format in Office 12, the next version of its productivity suite.

There are other standard formats, including XML-based document formats such as the Open Document Format (ODF), a standard developed by the Organization for the Advancement of Structured Information Standards (OASIS) industry consortium. ODF applies to text documents, spreadsheets, charts and presentations, and is the default document type for Sun Microsystems' StarOffice and OpenOffice productivity suites. This fall, the Commonwealth of Massachusetts finalized plans to make ODF its standard for all office documents by 2007. Microsoft has stated it won't support ODF.

In library circles, another XML schema called the Metadata Encoding and Transmission Standard (METS) is common. Beyond text, the METS schema captures information about the logical structure of the document; for example, page numbers and chapters. Typically used alongside a scanned TIFF image of a document, METS has become the default digital archive format for many libraries, says Abrams.
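To show what that logical structure looks like, here is a minimal Python sketch that reads a hand-written METS fragment and lists the page images under each chapter. The sample document, its file names and the helper pages_by_chapter are invented for illustration; the element and attribute names (fileSec, file, FLocat, structMap, div, fptr, TYPE, LABEL, ORDERLABEL, FILEID) come from the METS schema, but real library records are far richer and vary by institution.

```python
import xml.etree.ElementTree as ET

METS_NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical record: three scanned TIFF pages grouped into two chapters.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                            xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="archive">
      <mets:file ID="F1" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="page0001.tif"/>
      </mets:file>
      <mets:file ID="F2" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="page0002.tif"/>
      </mets:file>
      <mets:file ID="F3" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="page0003.tif"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="logical">
    <mets:div TYPE="book" LABEL="Sample Book">
      <mets:div TYPE="chapter" LABEL="Chapter 1">
        <mets:div TYPE="page" ORDER="1" ORDERLABEL="1">
          <mets:fptr FILEID="F1"/>
        </mets:div>
        <mets:div TYPE="page" ORDER="2" ORDERLABEL="2">
          <mets:fptr FILEID="F2"/>
        </mets:div>
      </mets:div>
      <mets:div TYPE="chapter" LABEL="Chapter 2">
        <mets:div TYPE="page" ORDER="3" ORDERLABEL="3">
          <mets:fptr FILEID="F3"/>
        </mets:div>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def pages_by_chapter(mets_xml: str) -> dict:
    """Map each chapter label to a list of (page label, image file) pairs."""
    root = ET.fromstring(mets_xml)

    # Index the file section: file ID -> location of the scanned image.
    files = {}
    for file_el in root.iterfind(".//mets:file", METS_NS):
        locator = file_el.find("mets:FLocat", METS_NS)
        files[file_el.get("ID")] = locator.get(XLINK_HREF)

    # Walk the structural map: chapters contain pages, pages point to files.
    result = {}
    for chapter in root.iterfind(".//mets:div[@TYPE='chapter']", METS_NS):
        pages = []
        for page in chapter.iterfind("mets:div[@TYPE='page']", METS_NS):
            pointer = page.find("mets:fptr", METS_NS)
            pages.append((page.get("ORDERLABEL"), files[pointer.get("FILEID")]))
        result[chapter.get("LABEL")] = pages
    return result


if __name__ == "__main__":
    for chapter, pages in pages_by_chapter(SAMPLE_METS).items():
        print(chapter, pages)
```

The point of the split is that the TIFF files hold only pixels, while the METS record holds everything a reader needs to navigate them as a book.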

Has all of this thinking about digital document archival trickled down to the commercial sector? "There's a newfound awareness that preservation is an important and difficult function," says Abrams. "For a long time, the commercial sector equated preservation with backup--and that's not preservation in the sense that we in the library community talk about it. There's a lot more to preservation than just having the same bits; you also have to be able to render those bits in an understandable form."

Bill Tolson, principal analyst at Contoural Inc., a Los Altos, CA-based data archiving consulting firm, agrees that the commercial sector is increasingly aware of the problems presented by long-term archival. "When we talk to people about archiving, it's clear that they're not talking about a couple of years," he says. "They're talking about much longer time periods--10 or 15 years, sometimes more." However, there's little, if any, awareness of possible solutions. While almost everyone uses PDF in some form or another, Tolson has never heard customers ask about PDF/A or ODF. "I think it will depend on whether the ECM [enterprise content management] vendors start recognizing and utilizing these file formats," he notes.

This was first published in November 2005

