Content-addressed storage (CAS) explained
By Carol Sliwa
EMC Corp. cornered the market on content-addressed storage (CAS) with the introduction of its Centera line in 2002, but it didn't have the last word on the acronym. Some companies use the term content-addressable storage; others favor content-aware storage.
No matter which term you choose, CAS technology continues to be particularly useful in addressing two problems: the long-term retention of content for compliance and/or regulatory purposes, and the archiving of massive amounts of records, images or other information that rarely (if ever) change.
One reason CAS is so effective is its use of a hashing algorithm to assign a unique identifier, or digital fingerprint, to each stored object. That process, coupled with storage best practices, ensures that whatever goes into the system is exactly what comes out. If a data element changes, it receives a new unique identifier, aka content address. The stored object's physical location doesn't matter.
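To make the fingerprint idea concrete, here is a minimal sketch of a content-addressed store in Python. It is an illustration only, not how Centera or any other product is implemented: the `ContentStore` class, its method names and the choice of SHA-256 as the hashing algorithm are all assumptions made for the example.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressed store (hypothetical, for illustration)."""

    def __init__(self):
        self._objects = {}  # content address -> stored bytes

    @staticmethod
    def address_of(data: bytes) -> str:
        # The content address is derived only from the bytes themselves.
        # SHA-256 is an assumption; real products may use other algorithms.
        return hashlib.sha256(data).hexdigest()

    def put(self, data: bytes) -> str:
        address = self.address_of(data)
        self._objects[address] = data  # storing identical content twice is a no-op
        return address

    def get(self, address: str) -> bytes:
        data = self._objects[address]
        # Re-hash on read: what comes out must match what went in.
        if self.address_of(data) != address:
            raise ValueError("stored object does not match its content address")
        return data


store = ContentStore()
v1 = store.put(b"quarterly report, version 1")
v2 = store.put(b"quarterly report, version 2")

assert v1 != v2  # changed content receives a new address
assert store.get(v1) == b"quarterly report, version 1"
```

The caller never supplies a location or a file name. The address returned by `put` is the only handle needed to retrieve the object, which is why the physical location can change without the application noticing.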
"CAS is not necessarily a category of storage, like SAN or NAS. It is a mechanism which allows you to do a number of things much more efficiently than would be possible using traditional techniques like file systems," says Paul Carpentier, chief technology officer at Caringo, Inc., a provider of content storage software. Carpentier developed CAS technology for Belgian software company FilePool BV, before the company was acquired by EMC in 2001 and its data archiving software became the forerunner for Centera.
A classic case for CAS is e-mail archiving. For instance, East Carolina University chose its CAS system over regular storage array disks after tests showed its IT department would need 60 man-hours to recover a year's worth of messages for any given employee with its existing backup system. Making matters worse, the existing backup process didn't ensure full recovery, since e-mails might have been deleted before the backups were performed.
"If we entered into litigation, we didn't have any proof that somebody didn't go in there and delete some of the files," says Brent Zimmer, assistant director of IT services at the Greenville, N.C., school. Potential fines for being out of compliance exceeded the cost of the solutions under consideration, he adds. Now the university runs Symantec Corp.'s Enterprise Vault with Centera in governance mode, which guarantees that records are kept for whatever retention period is designated.
CAS managing storage clusters
Johns Hopkins University's Center for Inherited Disease Research (CIDR) turned to CAS for a different reason: storing an enormous volume of important data that had become unwieldy to manage. The Baltimore-based CIDR studies the DNA of patients and healthy individuals in hopes of finding new treatments or cures for complex diseases. Its genome scanners sometimes pump 2 TB to 3 TB per day of difficult-to-reproduce images into the system.
After reaching almost 130 TB in its 40 high-capacity PetaBox systems from Capricorn Technologies Inc., CIDR installed a double-density storage array from Rackable Systems Inc. with nine nodes, each holding a dozen 1 TB disks. Caringo's CAStor software now manages the storage clusters.
"You can set the replication for how much redundancy you want for the data, and it's so simple to use and to manage," says Lee Watkins Jr., director of bioinformatics at the CIDR. "When you need additional capacity, you add another node. You bring it up. It's part of the cluster. You're done. Honest, it's that simple."
CAS users often employ a redundant array of independent nodes (RAIN) architecture, allowing data to be copied to one or more servers in the cluster, instead of storing it on different disks in the same server.
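The node-level copying described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the design of CAStor or any other product: the `Node` and `Cluster` classes are hypothetical, and the placement rule (rendezvous hashing over node names) is simply one reasonable way to pick which servers hold each copy.

```python
import hashlib


class Node:
    """Hypothetical storage server holding objects by content address."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.objects = {}  # content address -> bytes


class Cluster:
    """Toy RAIN-style cluster: each object is copied to several nodes."""

    def __init__(self, nodes, replicas=2):
        self.nodes = list(nodes)
        self.replicas = replicas  # how many nodes hold each object

    def _ranked(self, address):
        # Rendezvous hashing (an assumption): order nodes by a hash of
        # node name plus content address, so placement is deterministic.
        def score(node):
            return hashlib.sha256((node.name + address).encode()).hexdigest()
        return sorted(self.nodes, key=score, reverse=True)

    def put(self, data):
        address = hashlib.sha256(data).hexdigest()
        targets = [n for n in self._ranked(address) if n.online][: self.replicas]
        if len(targets) < self.replicas:
            raise RuntimeError("not enough online nodes for requested redundancy")
        for node in targets:
            node.objects[address] = data  # one copy per server, not per disk
        return address

    def get(self, address):
        # Any online node that holds a copy can serve the read.
        for node in self._ranked(address):
            if node.online and address in node.objects:
                return node.objects[address]
        raise KeyError(address)

    def add_node(self, node):
        # Growing capacity is just adding another server to the cluster.
        self.nodes.append(node)


cluster = Cluster([Node("n1"), Node("n2"), Node("n3")], replicas=2)
addr = cluster.put(b"genome scan image")

holders = [n for n in cluster.nodes if addr in n.objects]
holders[0].online = False  # simulate losing one server
assert cluster.get(addr) == b"genome scan image"  # surviving copy still readable
```

The sketch leaves out what a real cluster must also do, such as re-creating lost copies on the remaining nodes after a failure and rebalancing when a node joins.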
"[RAIN] enables larger, more cost-effective scalability from a capacity standpoint," says Brian Garrett, technical director of the ESG Lab for storage research firm Enterprise Strategy Group. "I don't want to be encumbered by traditional RAID 5 rebuild penalties, so mirroring is better than parity. And if I'm going to mirror and use commodity servers, instead of mirroring within the servers, why don't I mirror the data between the servers over a commodity Ethernet network? What I get is cost-effective scalability."
CAS eliminates traditional file system
Another benefit for a CAS user with a massive amount of data is the elimination of the traditional file system, with its capacity limits and management challenges. Instead, users have one pile of storage kept in a big flat namespace, and they needn't trouble themselves with directory structures or file names – although many organizations do opt for a file system gateway to do translations.
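A file system gateway of the kind mentioned above is essentially a translation table in front of the flat namespace. The Python sketch below shows the idea under simplifying assumptions: `flat_store`, `path_index`, `write_file` and `read_file` are hypothetical names, SHA-256 is an assumed hashing algorithm, and a real gateway would speak a protocol such as NFS or CIFS rather than expose Python functions.

```python
import hashlib

flat_store = {}  # content address -> bytes (the flat namespace)
path_index = {}  # file path -> content address (the gateway's translation table)


def write_file(path: str, data: bytes) -> str:
    """Store data by content address and remember which path points to it."""
    address = hashlib.sha256(data).hexdigest()
    flat_store[address] = data
    path_index[path] = address
    return address


def read_file(path: str) -> bytes:
    """Translate a familiar path into a content address, then fetch the object."""
    return flat_store[path_index[path]]


# Two users save the same attachment under different names.
a1 = write_file("/mail/alice/budget.xls", b"spreadsheet bytes")
a2 = write_file("/mail/bob/budget-copy.xls", b"spreadsheet bytes")

assert a1 == a2              # identical content, identical address
assert len(flat_store) == 1  # only one instance is actually stored
assert read_file("/mail/bob/budget-copy.xls") == b"spreadsheet bytes"
```

Directory structure lives only in the translation table. The store itself sees nothing but addresses, and identical content saved under different paths collapses to a single stored object.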
The knock on CAS has been its performance. Running every bit of data through a hashing algorithm is processor-intensive, which makes CAS impractical for anything but infrequently accessed content.
"If you're looking for very, very, very fast storage, you might want to rethink going with CAS," says Greg Schulz, founder and analyst at The StorageIO Group. "With CAS, you're trading performance for intelligence, for information, for optimization."
CAS performance improving
CAS performance is improving, whether it's from Caringo or EMC, which is universally acknowledged as the market leader in CAS. Each new version of Centera has offered higher performance than the previous version. But few vendors take the same technology approach as EMC.
Gartner Inc. finally scrapped the narrow CAS category to take a broader market view, comparing Centera to other products aiming to solve the same user problems, even if the technology isn't strict CAS, notes analyst Pushan Rinnen. She uses the example of Hitachi Data Systems' Content Archive Platform (HCAP), which was acquired from Archivas. HCAP competes against Centera but isn't considered CAS because it makes use of a NAS file system on the front end, she says.
Other vendors with CAS offerings that compete with Centera include Hewlett-Packard with its Information Access Platform (formerly RISS), IBM with the DR550, NEC America Corp. with its HydraStor and Permabit Technology Corp.
"Legacy technology with scalability limitations"
Permabit was one of the earliest vendors associated with CAS, yet the company no longer wants to be viewed in the CAS space. Mike Ivanov, vice president of marketing, positions the Permabit product line as "disk-based enterprise archiving" using standard CIFS, NFS and WebDAV to allow other applications to write to it. He dismisses CAS as "legacy technology" that has "scalability limitations and traditionally required proprietary APIs to be able to write to those systems."
The eXtensible Access Method (XAM) standard that EMC and other vendors worked on, through the Storage Networking Industry Association, aims to address the proprietary tag for connecting applications to object-based storage systems. But XAM wasn't ratified until July, and XAM-supporting products have yet to make an impact.
"We gave a huge amount of intellectual property as the starting point for this open API, so an application that writes to XAM can store information in Centera or someplace else," says Steve Spataro, director of Centera marketing at EMC. He countered the proprietary accusation, saying that Centera's API was always available to anyone via the Web and EMC's intention was never "to keep a customer locked into Centera."
It remains to be seen how the new XAM standard will affect the CAS space, but the sands had already been shifting for quite some time.
"CAS used to have a lot of benefits in terms of single-instance store and some self-healing properties," says Rinnen, "but some of these [features] are getting less distinctive because other vendors have come up with deduplication technologies, which are even more superior than single-instance store."
The sweet spot for CAS is still secondary storage, although Caringo's technologists are trying to push the envelope. "Centera is really positioned as an archiving type product. Our ambition is much further. We are going after the active volume storage market as well as archiving," says Carpentier, conceding that it will be an uphill battle.
Garrett says he understands that argument, especially as processing power becomes more affordable. But for now, he says, content-addressable storage is still for secondary storage.
14 Oct 2008