There are several different ways to archive data, but each tactic has its limitations. For example, tape is a common archival medium, but it doesn't offer timely access (tapes are also vulnerable to loss or theft). Disk arrays frequently see duty as archives but users must often contend with multiple copies of a file. This makes it difficult to know which copy is the latest or official version. Traditional metadata entries, such as filename, creation date and so on, include very little direct information about the file itself, making it difficult to search for files in the future.
Content-addressed storage (CAS), sometimes called content-aware storage or content-addressable storage, seeks to overcome these limitations. CAS is a disk storage technology designed to offer efficient access to fixed or archival data that should not change over time. Rather than treating data as a file and allowing a file system to handle data storage, data is annotated with metadata and treated as an "object," which is then assigned a unique designator (a content address) and sent to a permanent location on hard disk. Since the content address is derived from the data itself, identical content always receives the same address, so the system does not store a second copy of the same file; duplicate data is eliminated and the total storage requirement is reduced.
By attaching a comprehensive suite of metadata to the object, data can be indexed or searched without knowing specific filenames, dates or other traditional file designations. Quality metadata can also include contextual information that can help a user to understand or employ the data when it is accessed in the future. For example, including a doctor's diagnosis along with an MRI record can help other doctors quickly come up to speed on a patient's condition, track changes to their condition, find other patients with similar conditions and so on.
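As a minimal sketch of searching by metadata rather than by filename, the following assumes a hypothetical in-memory index. The `MetadataIndex` class, the placeholder addresses and field names such as `modality` and `diagnosis` are illustrative only and do not represent any vendor's API.

```python
from collections import defaultdict


class MetadataIndex:
    """Hypothetical in-memory index from metadata values to content addresses."""

    def __init__(self):
        # (field, value) -> set of content addresses carrying that metadata
        self._index = defaultdict(set)

    def add(self, address, metadata):
        """Register every metadata field of an object under its content address."""
        for field, value in metadata.items():
            self._index[(field, value)].add(address)

    def search(self, **criteria):
        """Return addresses of objects that match all field=value pairs given."""
        matches = [self._index.get((f, v), set()) for f, v in criteria.items()]
        return set.intersection(*matches) if matches else set()


index = MetadataIndex()
# Placeholder addresses; a real system would use hash-derived content addresses.
index.add("addr-001", {"modality": "MRI", "diagnosis": "torn meniscus", "patient": "A"})
index.add("addr-002", {"modality": "MRI", "diagnosis": "torn meniscus", "patient": "B"})
index.add("addr-003", {"modality": "X-ray", "diagnosis": "fracture", "patient": "A"})

# Find other patients with a similar condition without knowing any filename.
similar = index.search(modality="MRI", diagnosis="torn meniscus")
```

Here `similar` holds the addresses of both MRI records with the same diagnosis, which mirrors the "find other patients with similar conditions" case described above.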
Saving space by reducing storage
Analysts agree that data reduction, sometimes called commonality factoring, is a key attribute of CAS, saving additional cost by reducing the total storage space needed for all of a company's data. When data is stored on a CAS system, a hashing algorithm is applied to the file or more granular file elements like individual blocks. The algorithm produces a value that is effectively unique to that content: identical data always yields the same hash value, while different data yields a different one. The CAS appliance compares those values against its index of saved objects. If a hash value is new, that portion of data and metadata will be added to disk. If the hash value already exists, it means that portion of data has already been stored, so only metadata and a pointer to that existing portion will be saved. "If the data already exists, there's no reason to save it again," says Jim Damoulakis, chief technical officer at GlassHouse Technologies Inc. "All you need is a reference to point to that data."
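The hash-and-compare step can be sketched roughly as follows. This is a simplified illustration, not how any particular appliance is implemented: the `DedupStore` class is hypothetical, and SHA-256 is an assumed choice of hashing algorithm, since the article does not name one.

```python
import hashlib


class DedupStore:
    """Hypothetical store: data saved once per hash, metadata saved every time."""

    def __init__(self):
        self.data_by_hash = {}  # hash value -> data, written only once
        self.records = []       # (metadata, hash value) pointer for every save

    def save(self, data, metadata):
        # Assumption: SHA-256 stands in for whatever algorithm a product uses.
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.data_by_hash
        if is_new:
            # New hash value: store the data itself.
            self.data_by_hash[digest] = data
        # In both cases, store the metadata plus a pointer to the data.
        self.records.append((metadata, digest))
        return digest, is_new


store = DedupStore()
first = store.save(b"annual report", {"owner": "finance"})
second = store.save(b"annual report", {"owner": "legal"})
```

After both calls, the store holds one copy of the data and two metadata records that point to it; the second call reports that the hash value already existed.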
For example, suppose that CAS is being employed to archive e-mails and 30 e-mails exist with the same attachment. In a traditional backup or replication scheme, that same attachment would be saved 30 times along with the e-mails. With data reduction techniques, the CAS appliance would save the actual attachment only once -- subsequent iterations of the attachment would only save pointers and different metadata. Since there are often many versions and copies of files scattered across corporate servers, the potential for storage savings can be significant.
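How much is saved across many versions of a file depends on whether hashing is applied to whole files or to smaller elements such as blocks. The sketch below compares the two for a pair of file versions that differ in a single block. The fixed 4,096-byte block size, the SHA-256 algorithm and the `unique_bytes` helper are assumptions made for illustration; products may divide data differently.

```python
import hashlib


def unique_bytes(files, chunk_size=None):
    """Bytes kept after deduplication; chunk_size=None hashes each whole file."""
    seen = {}  # hash value -> length of the chunk it identifies
    for data in files:
        size = chunk_size or max(len(data), 1)
        for start in range(0, len(data), size):
            chunk = data[start:start + size]
            seen[hashlib.sha256(chunk).hexdigest()] = len(chunk)
    return sum(seen.values())


# Two versions of a three-block file; only the last block differs.
v1 = b"A" * 4096 + b"B" * 4096 + b"C" * 4096
v2 = b"A" * 4096 + b"B" * 4096 + b"D" * 4096

whole_file = unique_bytes([v1, v2])                    # both versions kept in full
block_level = unique_bytes([v1, v2], chunk_size=4096)  # shared blocks kept once
```

With whole-file hashing the two versions hash differently, so all 24,576 bytes are kept. With block-level hashing only four distinct blocks remain, or 16,384 bytes.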
"Commonality factoring is really the reason CAS exists," Damoulakis says. "The idea is that you want to store information in as efficient a manner as possible." He notes that more granularity in file storage can offer improved data reduction and should weigh heavily in any CAS product evaluation.
Ensuring data integrity
CAS is finding an important role in issues of corporate governance, risk mitigation and legal liability. Since all CAS data is uniquely identified through hash algorithm results, is stored only once, cannot be modified and can be destroyed only in accordance with established retention policies, companies are increasingly evaluating CAS technologies to meet their compliance needs. The inclusion of detailed metadata also enables superior indexing and searching, allowing relevant files to be located long after their filename has been forgotten.
"CAS is also being used for corporate governance and to meet compliance mandates," says Tony Asaro, senior analyst at the Enterprise Strategy Group. "Because CAS provides WORM [write once, read many] and retention periods, it can be used as part of a discovery process and offer immutable digital evidence."
Damoulakis suggests that any evaluation of data integrity should also include a serious consideration of robustness. Whether robustness is implemented through software or hardware, it's important to understand clearly how data is protected and distributed throughout the appliance to guard against failure. For example, replication within the appliance is one means to prevent data loss.
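The write-once and retention behavior discussed in this section can be illustrated with a small sketch. Everything here is an assumption for illustration: the `WormStore` class, its method names and the use of SHA-256 are hypothetical, and a real appliance enforces these rules in its own way.

```python
import hashlib
from datetime import datetime, timedelta, timezone


class WormStore:
    """Hypothetical write-once store with a per-object retention period."""

    def __init__(self):
        self._objects = {}  # content address -> (data, retain_until)

    def put(self, data, retention_days, now=None):
        now = now or datetime.now(timezone.utc)
        address = hashlib.sha256(data).hexdigest()
        # Write once: an object that already exists is never overwritten.
        self._objects.setdefault(address, (data, now + timedelta(days=retention_days)))
        return address

    def get(self, address):
        data, _ = self._objects[address]
        # Integrity check: the data must still hash to its own address.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("stored data no longer matches its content address")
        return data

    def delete(self, address, now=None):
        now = now or datetime.now(timezone.utc)
        _, retain_until = self._objects[address]
        if now < retain_until:
            raise PermissionError("retention period has not expired")
        del self._objects[address]


store = WormStore()
t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)  # arbitrary example date
addr = store.put(b"signed contract", retention_days=365, now=t0)

try:
    store.delete(addr, now=t0 + timedelta(days=30))
    deleted_early = True
except PermissionError:
    deleted_early = False

retrieved = store.get(addr)
```

In this sketch the early delete is refused, the object can still be read and verified against its address, and deletion succeeds only once the retention date has passed.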
Strong software integration is essential
The first and most obvious consideration in any CAS storage system is the CAS appliance (the actual storage system) itself. There are numerous CAS products available from leading manufacturers like EMC, Sun, Avamar, Nexsan, Permabit and others [see Content-addressed storage: The vendors]. But analysts are quick to note that software is often a more important issue, and the interface between the storage application and the storage system plays a critical role in any successful CAS deployment.
"One of the key pieces in making 'object-based' [technology] work is the software that runs on the host -- software that interfaces to the application -- referred to as the APIs [application programming interface]," says Greg Schulz, founder and senior analyst, StorageIO. He notes that the APIs form the "glue" that connects the storage system to the storage application(s). While many application vendors have already integrated their applications with the appropriate APIs, analysts insist that it is the overall quality of that integration or how well the hardware and software pieces work together to meet your CAS storage objectives, that really differentiates a CAS platform from an ordinary storage array.
Ultimately, a successful implementation of CAS requires a holistic view of the entire solution. "CAS is not really about hardware or software," Schulz says. "CAS is a solution that encompasses hardware, software and networks, as well as the integration of those and of the application(s)."
Go to the next part of this article: Content-addressed storage: Strengths and weaknesses