Under normal circumstances, a backup is simply a copy of data that is kept aside to protect against data loss. When a file is lost due to user error, or data is corrupted because of system problems, the affected data can be restored from the backup. An archive is different: the data may not be used for months or even years, but it must be accessible quickly when needed. The problem is compounded by archive sizes that are growing at annual rates of up to 90%, and sometimes more. There is simply no time to search through burgeoning volumes of tape or optical media to locate important files. Traditional backup platforms are poorly suited to archival storage, so users are turning to disk storage systems for their mix of performance and reliability. Files can be archived to any disk storage system, but content-addressed storage (CAS) technology has appeared specifically to support archiving efforts [see the SearchStorage.com Tech Closeup on CAS here].
At the simplest level, CAS is a specialized disk storage system. Since archival data is not accessed frequently, high-performance disks are not essential. In fact, most CAS platforms employ ordinary SATA hard disks for their low cost per gigabyte, though SAS disks may be used when added performance is needed to accommodate many simultaneous users. However, CAS technology incorporates a unique feature set designed to optimize storage space and improve long-term data management.
CAS data cannot be changed once it is archived. This ensures data integrity and prevents tampering or spoliation, so a corporate regulatory audit or litigation discovery can proceed with high confidence that the data being examined is original and unaltered. Tamper-proofing is generally accomplished by treating files as objects with unique designators and locations. Since most archival data has a finite lifecycle, CAS also manages data retention and disposal in accordance with regulatory or compliance requirements. Data that reaches its retention limit is systematically deleted.
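The write-once and retention behavior described above can be sketched in a few lines of Python. This is an illustrative toy, not any vendor's implementation: the class name WriteOnceStore, its methods, and the choice of SHA-256 to derive each object's designator from its content are all assumptions made for the example.

```python
import hashlib


class WriteOnceStore:
    """Toy in-memory archive (hypothetical): objects are addressed by a digest of their content."""

    def __init__(self):
        # address -> (content bytes, expiry timestamp in seconds)
        self._objects = {}

    def put(self, content: bytes, now: float, retention_seconds: float) -> str:
        # The designator is derived from the content itself (assumed: SHA-256).
        address = hashlib.sha256(content).hexdigest()
        # An existing object is never overwritten; storing it again is a no-op.
        if address not in self._objects:
            self._objects[address] = (content, now + retention_seconds)
        return address

    def get(self, address: str) -> bytes:
        content, _ = self._objects[address]
        # Re-derive the address on read; a mismatch means the bytes were altered.
        if hashlib.sha256(content).hexdigest() != address:
            raise ValueError("stored object does not match its address")
        return content

    def purge_expired(self, now: float) -> list:
        # Delete every object whose retention period has ended.
        expired = [a for a, (_, expiry) in self._objects.items() if expiry <= now]
        for address in expired:
            del self._objects[address]
        return expired


store = WriteOnceStore()
address = store.put(b"Q3 audit report", now=0.0, retention_seconds=100.0)
print(store.get(address) == b"Q3 audit report")    # True
print(store.purge_expired(now=100.0) == [address])  # True
```

Because the address depends on the content, any edit to a file yields a different address rather than a silent overwrite, which is one common way to get the tamper evidence the text describes.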
One persistent problem with traditional file copies is the inevitable duplication of files. If there are 100 copies of the same e-mail attachment, all 100 copies are saved in the backup. For long-term archival storage, this kind of inefficiency can quickly exhaust available storage space. Another real strength of CAS technology is data deduplication (a.k.a. single-instance storage or intelligent compression), which eliminates duplicated data. Only one instance of the data is saved, and subsequent copies are simply referenced back to the one saved copy. Consider a file-level example. If there are 100 attachments and each is 2 MB in size, archiving to CAS would take only about 2 MB (one stored copy plus 100 small references), instead of 200 MB with an ordinary disk system. Experts note that data deduplication can reduce storage demands by ratios of up to 50-to-1. Conventional compression techniques may also be employed to reduce disk space even further.
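The file-level example above can be reproduced with a minimal sketch of single-instance storage. The archive function, the file names, and the use of SHA-256 digests as keys are assumptions for illustration; real products may deduplicate at the block level and use different fingerprints.

```python
import hashlib

MB = 1024 * 1024


def archive(files):
    """Store each distinct content once; return (store, references)."""
    store = {}        # digest -> content, one entry per unique file body
    references = {}   # file name -> digest of the saved copy
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        store.setdefault(digest, content)  # keep only the first copy seen
        references[name] = digest
    return store, references


# 100 mailboxes that all hold the same 2 MB attachment (hypothetical names).
attachment = bytes(2 * MB)
mailboxes = {f"user{i:03d}/report.pdf": attachment for i in range(100)}

store, references = archive(mailboxes)
raw_mb = sum(len(c) for c in mailboxes.values()) / MB
stored_mb = sum(len(c) for c in store.values()) / MB
print(raw_mb, stored_mb)  # 200.0 2.0
```

The 100 references still occupy a little space of their own (one digest per file name here), so the real saving is slightly under 100-to-1 for this case.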
Power consumption is an important consideration. As CAS systems scale up to hundreds of spinning disks, the power cost becomes substantial. Some archive systems are employing creative solutions to reduce power demands such as idling drives or powering idle drives down completely. Low-power drives and emerging drive technologies like "hybrid drives" can also help lower overall power demands.
Major vendors in the CAS market include -- in no particular order -- EMC Corp., Nexsan Technologies, Sun Microsystems Inc., StorageTek, Permabit Inc., Hewlett-Packard Co., Bycast Inc., IBM and Avamar Technologies Inc. Most CAS vendors possess a remarkably similar view of CAS, though each vendor puts its own unique stamp on the technology.
Once data is passed to CAS, it cannot change and must be protected against theft, so some CAS products emphasize the immutability and security of archival data. Nexsan's Assureon product incorporates AES 256-bit encryption to protect files relegated to the archive. The Assureon also adds serialization to track the existence of each CAS location and prevent file tampering. Serialized locations can be scanned periodically to verify the integrity of each file, and any files that are damaged or incomplete can be dealt with promptly.
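A periodic integrity scan over serialized locations might look like the sketch below. It is not a description of how Assureon works internally: the manifest layout, the scan function, and the SHA-256 fingerprints are assumptions chosen to illustrate the general idea of checking that every serial number is still present and still holds the bytes that were archived.

```python
import hashlib


def digest(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


def scan(manifest, storage):
    """Compare current storage against the manifest written at archive time.

    manifest: serial number -> digest recorded when the file was archived
    storage:  serial number -> bytes currently held at that location
    Returns (missing, damaged) lists of serial numbers.
    """
    missing, damaged = [], []
    for serial in sorted(manifest):
        content = storage.get(serial)
        if content is None:
            missing.append(serial)      # location no longer exists
        elif digest(content) != manifest[serial]:
            damaged.append(serial)      # bytes differ from the archived original
    return missing, damaged


files = {1: b"contract", 2: b"invoice", 3: b"x-ray"}
manifest = {serial: digest(content) for serial, content in files.items()}

storage = dict(files)
storage[2] = b"invoice (edited)"  # simulate tampering
del storage[3]                    # simulate a lost file

print(scan(manifest, storage))  # ([3], [2])
```

Gaps in the serial sequence reveal deleted files, and digest mismatches reveal altered ones, so both kinds of damage can be flagged for repair.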
Still other CAS platforms embrace search and scalability features. Search capability relies on sophisticated metadata to help users locate relevant file content long after the original file creator may have forgotten about it. Scalability is important to accommodate archival growth and to manage huge numbers of CAS objects (into the billions) over the long term. EMC and Sun products both favor these areas.
Applications of CAS
CAS products are deployed in a wide variety of roles that extend well beyond archival storage to embrace backup/restoration, improve storage performance, meet regulatory requirements and save significant costs.
The data deduplication features of CAS platforms are sometimes used to reduce storage requirements. If corporate data can be concentrated into some fraction of its original space, backups and restorations (e.g. to tape or optical media) can be accomplished faster – simply because there is less data to transfer. Lower data volumes can also speed backup and replication tasks across WAN links to remote sites. Lost or damaged files can be restored directly from disk without the time or trouble of locating those files on other media.
CAS is sometimes selected over other long-term storage options to provide a superior user experience. For example, check images or X-ray data stored to tape or disc must often be retrieved manually once the corresponding media is located and loaded. It can take hours (even longer) for end users to obtain data stored on traditional media. A disk-based CAS system can keep that data nearline and supply files on demand without any manual intervention.