In 2004, cryptographic researchers announced they had discovered a flaw in the popular MD5 hashing algorithm that allowed it to be "cracked" in a matter of hours. Questions were immediately raised about Centera, EMC's archiving platform that relied on MD5 to create a "content address" of an object to ensure its authenticity.
Whether or not the MD5 crack means anything to Centera customers is a matter of opinion. First of all, Centera no longer uses MD5, but a proprietary version of SHA-256. Even so, it would be exceedingly difficult for someone to take advantage of the MD5 vulnerability and, say, remove incriminating evidence stored on a Centera.
But from a compliance standpoint, the algorithm used to create a content address must be unassailable, says Paul Carpentier, CTO at Caringo, an Austin, TX-based content-addressed storage (CAS) startup. "As it stands, an expert today cannot say 'Beyond a shadow of a doubt, this is the original document.'"
It should come as no surprise, therefore, that newcomers to the CAS space have been careful to point out that they don't have Centera's alleged shortcomings and that, with their products, users can upgrade the hashing algorithm.
Strictly speaking, Hitachi Data Systems' Content Archive Platform (CAP) isn't a CAS system because it doesn't actually use a hashing algorithm to rename files that are stored on the system. The files are stored with their original file names, and are accessed using either the NFS or CIFS protocol. The content hash used to ensure data integrity is stored as part of the metadata associated with the file. According to Hitachi, this means that if the hashing algorithm is cracked, the hash could be upgraded in place without rewriting the file.
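The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration, not Hitachi's or EMC's implementation: the `content_address` function, the `ArchivedFile` class and its field names are hypothetical, and the standard `hashlib` module stands in for whatever each vendor actually uses. In the CAS style the object's name is its hash, so a new algorithm means a new name; in the metadata style the file keeps its name and only the stored digest changes.

```python
import hashlib
from dataclasses import dataclass, field


def content_address(data: bytes, algorithm: str = "sha256") -> str:
    """CAS style: the object's address is derived from a hash of its content."""
    return hashlib.new(algorithm, data).hexdigest()


@dataclass
class ArchivedFile:
    """Metadata style (hypothetical): the file keeps its name, and the
    integrity hash is stored beside it as metadata."""
    name: str
    data: bytes
    hash_algorithm: str = "md5"
    digest: str = field(init=False, default="")

    def __post_init__(self) -> None:
        # Record the digest once, at ingest time.
        self.digest = hashlib.new(self.hash_algorithm, self.data).hexdigest()

    def verify(self) -> bool:
        """Recompute the hash and compare it with the stored digest."""
        current = hashlib.new(self.hash_algorithm, self.data).hexdigest()
        return current == self.digest

    def upgrade_hash(self, new_algorithm: str) -> None:
        """Swap in a stronger hash. The data is read but never rewritten,
        and the file name does not change."""
        if not self.verify():
            raise ValueError("content no longer matches the stored digest")
        self.digest = hashlib.new(new_algorithm, self.data).hexdigest()
        self.hash_algorithm = new_algorithm


archived = ArchivedFile("report.pdf", b"quarterly results")
archived.upgrade_hash("sha256")  # name and data untouched; only metadata changes
```

Note that the upgrade is only as trustworthy as the old hash was at the moment of the check: the sketch verifies the content against the old digest before computing the new one.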
In other respects, Hitachi's CAP is similar to Centera. It has multiple front-end nodes storing data on a back-end array, except the front-end nodes are white-box, Intel-based servers running software from Archivas, which is resold by Hitachi. Centera, in contrast, is an all-EMC offering.
Caringo also claims its CAStor product allows in-place upgrading of hash algorithms for existing files. Caringo won't discuss details of the technology, which it claims is patent-pending, except to say that it relies on a system of "split tags and seals" to produce the hash via software, rather than via hardware as other CAS systems do. That way, CAStor can allow the hash to be changed not only for new data, but also for existing files without having to rewrite them.
Architecturally, CAStor is based on a cluster of many small Intel-based nodes with a few internal disk drives. Nodes are created in approximately 60 seconds with the help of a USB startup key that contains the system software. They automatically discover the existing CAStor cluster and add themselves to it transparently. Redundancy is achieved across the cluster as a whole, with drive parity distributed across all the drives in the cluster rather than across a single RAID set. According to Caringo, this means CAStor actually recovers from a failure faster as the cluster grows in size.
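Caringo's recovery claim is consistent with a simple back-of-the-envelope model, sketched below. The model is an idealized assumption, not a description of CAStor's internals: it supposes that every participating drive contributes the same rebuild bandwidth and that the work divides evenly. The function name, drive size and throughput figures are all hypothetical.

```python
def rebuild_hours(failed_drive_gb: float,
                  per_drive_mb_s: float,
                  participating_drives: int) -> float:
    """Idealized rebuild time: the data lost with one drive, divided by the
    combined bandwidth of the drives sharing the rebuild work."""
    total_mb_s = per_drive_mb_s * participating_drives
    seconds = failed_drive_gb * 1000 / total_mb_s  # 1 GB taken as 1,000 MB
    return seconds / 3600


# Hypothetical numbers: a 500 GB drive and 50 MB/s of rebuild bandwidth per drive.
single_spare = rebuild_hours(500, 50, participating_drives=1)     # one spare in a RAID set
small_cluster = rebuild_hours(500, 50, participating_drives=10)   # work spread over 10 drives
large_cluster = rebuild_hours(500, 50, participating_drives=100)  # work spread over 100 drives
```

Under these assumptions, rebuild time shrinks in proportion to the number of drives that share the work, which is why a larger cluster would recover faster. Real systems fall short of this ideal because of network limits and competing production I/O.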
This architecture will appeal to some storage managers, while others will prefer the tried-and-true EMC approach. The jury is still out, however, on upgradeable hashes. In coming years, hackers may devise ways to alter files in spite of a hash.
Furthermore, says Mark Avery, EMC's senior director for Centera marketing, the lack of a concrete association of a hash with a given file could just as easily be called into question as a compromised hashing algorithm. And given that there may be no particular relationship between the technical facts and legal decisions, it's conceivable that a case could be made either way.
--Logan G. Harbaugh