Lock up data with fixed-content storage

This article can also be found in the Premium Editorial Download: Storage magazine: What to do when storage capacity keeps growing.

Content-addressed storage safeguards retention data and prevents its alteration.

For most companies, fixed-content storage requirements are simple: Store the data securely, do it cheaply and provide fast access. With more data subject to external and internal audits, content-addressed storage (CAS) products are becoming the preferred storage medium for the long-term protection of fixed content.

CAS products come in four different architectures:

  1. The redundant array of independent nodes (RAIN) architecture is the predominant way vendors offer CAS hardware. Inexpensive servers or nodes with high-capacity disk drives are clustered together; software locks the data stored on each node. As growth occurs, more nodes are added to the RAIN cluster.

  2. The file-system architecture, offered by Network Appliance (NetApp) Inc., presents a network file system over an Ethernet connection on the front end while using WORM technology to lock the data down and data deduplication to optimize its capacity. The system accommodates growth by adding more disk capacity to NAS head configurations or allowing upgrades to higher capacity NAS filers. There's no way to move data from the NetApp disk to tape or optical media.

  3. The hierarchical storage management (HSM) architecture offered by IBM Corp. allows applications to archive and retrieve data from the CAS system using APIs provided by the CAS software. IBM requires users to deploy its Tivoli Storage Manager (TSM) 5.3 for Data Retention software, which comes with its TotalStorage DR550. IBM's CAS product differs from the other CAS architectures because it doesn't use data deduplication or single-instance storage (SIS) by default, although users can deploy these technologies and use TSM to manage the data.

  4. Nexsan Technologies Ltd. offers a networked storage array architecture that includes CAS software as part of the array to manage data retention, along with ways to move data between disk, tape and optical media.

There are trade-offs with each of these designs. Each one requires some type of software to classify and then move the data to and from the CAS device. Products using RAIN architectures--including Archivas Inc.'s Archiving Cluster (ArC), EMC Corp.'s Centera, Hewlett-Packard (HP) Co.'s StorageWorks Reference Information Storage System (RISS) and Permabit Inc.'s Permeon Compliance Store--store files as objects. This introduces a new format for storing files, one whose long-term management costs and liabilities aren't yet well understood. File-system approaches don't support SIS and require third-party products to manage file meta data. HSM architectures are based on a model that may not perform well with large data stores, while the networked storage array approach has limitations on total disk capacity and the types of disk it will support.

CAS features
All CAS products meet the following fixed-content storage requirements:

  • Data is accessible by content, not storage location
  • Scales economically
  • Manages large amounts of data
  • Guarantees data authenticity and security
  • Manages data-retention periods
  • Facilitates rapid data recall

To deliver these basic requirements, many CAS vendors use the RAIN grid architecture. A RAIN device is usually a commodity server (called a node) with internal SATA hard drives and vendor-supplied CAS software. The nodes in EMC's Centera and Permabit's Permeon RAIN architecture support two personalities: an access or portal node and a storage node. The access nodes are clustered and connected to the Ethernet network to receive and process either incoming files or requests for data using a number of network protocols. The access node identifies where the object is to be stored or where it resides, and then stores or retrieves the object from the storage node.

Vendors can protect the data on storage nodes by deploying each storage node with a partner node and mirroring the objects between the two. HP deploys each of its SmartCell storage nodes with a partner so every node keeps a mirror of its partner's data. If a partner node fails, one or several nodes in the larger cluster are identified as replacement nodes to host the data until the partner node is replaced.

You can also preserve the integrity of data or objects by balancing data across multiple storage nodes. Archivas' ArC and Permabit's Permeon Compliance Store use this approach to enable users to deploy nodes one at a time; this allows data to be distributed evenly across storage nodes. Each object is copied and stored on at least two different nodes to prevent data loss due to a node hardware failure.
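The idea of keeping each object on at least two different nodes can be sketched in a few lines. This is only an illustration: the vendors' actual placement logic isn't described here, so the sketch swaps in rendezvous hashing as an assumed placement scheme, and the node names and the `place_replicas` helper are hypothetical.

```python
import hashlib


def place_replicas(object_id: str, nodes: list[str], copies: int = 2) -> list[str]:
    """Pick `copies` distinct nodes for an object (rendezvous hashing, an assumption)."""
    if copies > len(nodes):
        raise ValueError("not enough nodes for the requested number of copies")

    # Score every node against the object ID; the highest scores hold the copies.
    def score(node: str) -> str:
        return hashlib.sha256(f"{node}:{object_id}".encode()).hexdigest()

    return sorted(nodes, key=score, reverse=True)[:copies]


# Hypothetical four-node cluster.
nodes = ["node-a", "node-b", "node-c", "node-d"]
original = place_replicas("invoice-0001", nodes)

# Simulate losing the first replica's node and re-running placement
# on the survivors to find where the replacement copy should go.
survivors = [n for n in nodes if n != original[0]]
recovered = place_replicas("invoice-0001", survivors)
```

With this scheme the surviving copy stays where it is and only one new copy has to be created on a replacement node, which mirrors the failure handling described above.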

Protecting CAS data
Protecting data stored on a content-addressed storage (CAS) device presents an interesting dilemma for users because one of the primary purposes of a CAS product is to serve as a final resting place for data in an unchangeable format. Hewlett-Packard Co. deploys each of its SmartCell storage nodes in a mirrored configuration--each node keeps a mirror of its partner's data. In the event that one-half of the mirror fails, a free node in the grid is allocated as a replacement partner to mirror the data; the failed mirror can then be repaired and re-inserted in the grid.

EMC Corp. recommends replicating data to a second Centera at another site, but this is a costly alternative. To avoid the cost of a high-speed link between the two sites, the Centera Backup and Recovery Module (CBRM) allows existing backup software to connect to the Centera and back up its data to tape using the NDMP protocol. CAS products from Bycast Inc., IBM Corp. and Nexsan Technologies Ltd. allow you to copy each file or object to tape. However, this process re-introduces concerns about data integrity because someone needs to be responsible for moving the tape offsite and managing it in the long term.

Another consideration when evaluating a vendor's RAIN implementation is the type of hardware to be used. Although each vendor uses off-the-shelf Intel server hardware to host its software, Archivas' ArC and Permabit's Permeon Compliance Store allow users to choose any vendor's brand of server, while CAS vendors such as EMC and HP require users to purchase server and storage hardware from them. HP only sells and certifies its ProLiant DL380 servers as nodes to support its StorageWorks RISS software.

Users with existing server hardware or server agreements may opt for Archivas' ArC or Permabit's Permeon Compliance Store because they run on any server vendor's hardware. For firms more concerned with deploying an end-to-end configuration sold and supported by a single vendor, choosing EMC or HP for the hardware and software in a preconfigured CAS product may be a better option.

The hashing algorithm a CAS product uses to create digital identifiers for each object is also important. Some hashing algorithms may be cracked or hacked over time, so the ability to upgrade the digital signature may become more important. Caringo Inc.'s CAStor, a new CAS software product, lets users upgrade the hashing algorithm and digital signature as new ones become available.
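One way to picture an upgradeable digital signature is to record the algorithm name alongside the digest, so an object can be re-signed later with a stronger algorithm. The sketch below is a generic illustration using Python's standard `hashlib`; the `algorithm:digest` address format and the helper names are assumptions, not a description of how CAStor or any other product works.

```python
import hashlib


def content_address(data: bytes, algorithm: str = "sha256") -> str:
    """Return an address that records which algorithm produced the digest."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return f"{algorithm}:{digest}"


def verify(data: bytes, address: str) -> bool:
    """Recompute the digest with the algorithm named in the address."""
    algorithm, _, expected = address.partition(":")
    return hashlib.new(algorithm, data).hexdigest() == expected


def upgrade_address(data: bytes, old_address: str, new_algorithm: str) -> str:
    """Re-sign an object with a newer algorithm after checking the old signature."""
    if not verify(data, old_address):
        raise ValueError("object no longer matches its stored address")
    return content_address(data, new_algorithm)


record = b"scanned contract, page 1"
old = content_address(record, "sha1")         # signature from an older algorithm
new = upgrade_address(record, old, "sha256")  # re-signed with a stronger one
```

Verifying against the old address before re-signing matters: it confirms the object is still authentic at the moment the stronger signature is issued.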

Most RAIN architectures support only nodes with internal disk drives. Only Bycast's StorageGrid and Caringo's CAStor let users deploy nodes that support different types of external storage and manage the placement of data on these different tiers of storage based on policies set by users.

A final concern is the protocols used to access the RAIN nodes, because apps may otherwise have to be written to a vendor's API to store and retrieve data. One way RAIN vendors circumvent this API problem is by presenting a mountable file system to the OS and allowing apps to use the more common NFS and CIFS protocols to store and retrieve data. Most RAIN vendors, including Archivas, Bycast, HP and Permabit, support this configuration, and even EMC is jumping on the bandwagon.

NetApp's NAS products use file systems, but they support CAS in a slightly different manner. By using SnapLock (an optional WORM feature) with the Data ONTAP OS that comes standard with all NetApp filers, and its new Advanced Single Instance Storage (ASIS) feature, users can lock down data and optimize storage capacity on filers. The main drawback of file-system architectures is that they require either a separate appliance or third-party software such as Open Text or FileNet to classify each file, create and store meta data, and manage the file's data-retention periods and user access permissions.

IBM prepackages its TotalStorage DR550 with TSM for Data Retention software to enable apps to classify and manage data. (Shops already using TSM can host the Data Retention component on an existing TSM server.) For small- to medium-sized firms, IBM offers the DR550 Express, which also ships with the TSM software but supports only internal disk drives with an option for tape. The full DR550, by contrast, supports external disk and tape and is available in clustered configurations. TSM is required to manage data placement, retention and security policies; all host apps will need to support TSM's APIs to store and retrieve data.

Capacity management
There are three primary ways CAS products manage and reduce the amount of data they store: object-based storage, SIS and data deduplication.

CAS vendors that support RAIN and networked storage array architectures store files as objects. When a file is submitted for storage, a hashing algorithm scans its contents and creates a unique identifier, which is recorded in the CAS product's meta data database and used to reference and access the object in the future. Because the identifier is derived from the file's contents, the algorithm will always create the same identifier for the same file, even if some of the file's attributes are different. This technique, called SIS, lets users save storage space because they're not storing multiple instances of the same file.
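A minimal sketch of single-instance storage makes the mechanism concrete. It assumes SHA-256 as the hashing algorithm and an in-memory dictionary standing in for the meta data database; the `SingleInstanceStore` class and its method names are hypothetical and don't correspond to any vendor's API.

```python
import hashlib


class SingleInstanceStore:
    """Toy content-addressed store: identical content is kept only once."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}        # identifier -> stored content
        self.metadata: dict[str, list[str]] = {}   # identifier -> submitted file names

    def put(self, name: str, data: bytes) -> str:
        # The identifier depends only on the content, not on the file name.
        identifier = hashlib.sha256(data).hexdigest()
        self.objects.setdefault(identifier, data)  # store the content only if it's new
        self.metadata.setdefault(identifier, []).append(name)
        return identifier

    def get(self, identifier: str) -> bytes:
        return self.objects[identifier]


store = SingleInstanceStore()
first = store.put("q2-report.pdf", b"quarterly figures")
second = store.put("copy-of-q2-report.pdf", b"quarterly figures")  # same content, new name
```

Both submissions return the same identifier, and the store holds a single copy of the content while the meta data records both file names.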

Before implementing SIS, users need to consider the time it takes the CAS product to generate the unique identifier and check its meta data database to see if that identifier already exists. Searching for a unique identifier may be done quickly during initial deployment, but searches take longer as the meta data database grows.

For the fastest file storage and recovery possible, users should use the latest version of the RAIN OS. EMC, for example, claims that under certain conditions the latest version of Centera's CentraStar OS performs four to five times faster than earlier releases. Another option is to upgrade hardware nodes with faster CPUs and Gigabit Ethernet ports rather than the 100Mbps ports common to first-generation nodes. Upgrading shouldn't be that painful because RAIN nodes can be taken offline and replaced nondisruptively, and different generations of nodes can operate in the same cluster.

Another factor to consider before turning on SIS is the type of file being archived. For certain types of files, such as check images, nothing will be gained by turning on SIS. Conversely, users will see significant savings using SIS when storing e-mail attachments, for example.

Data deduplication and classification
Some CAS products use data deduplication, which breaks files apart, analyzes them at the block level and only stores identical blocks once to minimize the amount of data stored. HP's StorageWorks RISS and Permabit's Permeon Compliance Store include this as part of their software, but users need to turn it on.

NetApp introduced ASIS last March and EMC has announced a partnership with Avamar Technologies Inc. to provide similar functionality for Centera. HP says users will experience a three- to five-fold reduction in total storage using deduplication, but the technology will introduce some performance overhead. NetApp estimates that its filers will experience a 1% to 3% performance hit when ASIS is turned on.
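Block-level deduplication as described above can be sketched as follows. The example assumes fixed-size 4KB blocks and SHA-256 digests, which are illustrative choices rather than a description of how the HP, Permabit or NetApp products split and identify blocks; the `BlockStore` class is hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for the sketch


class BlockStore:
    """Toy deduplicating store: each distinct block is kept once."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}  # block digest -> block content

    def write(self, data: bytes) -> list[str]:
        """Split data into blocks and return the list of digests (the file's recipe)."""
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # identical blocks are stored once
            recipe.append(digest)
        return recipe

    def read(self, recipe: list[str]) -> bytes:
        """Rebuild a file by concatenating its blocks in order."""
        return b"".join(self.blocks[digest] for digest in recipe)


dedup = BlockStore()
file_a = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE
file_b = b"A" * BLOCK_SIZE + b"C" * BLOCK_SIZE  # shares its first block with file_a
recipe_a = dedup.write(file_a)
recipe_b = dedup.write(file_b)
```

The two files total four blocks, but the store keeps only three because the shared block is stored once. The extra hashing and lookup on every write is where the performance overhead comes from.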

CAS products classify data in several ways, using mostly meta data databases. As files are stored in RAIN architectures, meta data is extracted based on policies provided by the vendor and user. NetApp's filers index files after they're stored, although users can use any data classification engine to index, classify and tag data. NetApp's IS1200 appliance uses Kazeon Systems Inc.'s algorithms to deliver this functionality.

IBM's DR550 classifies data based on policies set previously with its TSM software. TSM then places the data on the correct tier of storage, moves the data to other tiers of storage when appropriate, and deletes the file at the end of its retention period. For this scenario to work, TSM APIs must be on each server.
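Policy-driven retention of this kind can be illustrated with a small sketch. It isn't the TSM API: the `RetentionStore` class, its methods and the 365-day policy are assumptions chosen only to show write-once storage with deletion blocked until the retention period ends.

```python
from datetime import date, timedelta


class RetentionError(Exception):
    """Raised when an operation would violate the retention policy."""


class RetentionStore:
    """Toy write-once store that refuses deletes before the retention date."""

    def __init__(self) -> None:
        self._objects: dict[str, tuple[bytes, date]] = {}

    def store(self, key: str, data: bytes, stored_on: date, retention_days: int) -> date:
        if key in self._objects:
            raise RetentionError(f"{key} is write-once and already stored")
        expires_on = stored_on + timedelta(days=retention_days)
        self._objects[key] = (data, expires_on)
        return expires_on

    def delete(self, key: str, today: date) -> None:
        _, expires_on = self._objects[key]
        if today < expires_on:
            raise RetentionError(f"{key} is retained until {expires_on}")
        del self._objects[key]


archive = RetentionStore()
expires = archive.store("statement-2006-06", b"account statement",
                        date(2006, 6, 1), retention_days=365)
```

A delete attempted before `expires` raises `RetentionError`; once the date is reached, the same call succeeds and the object is removed.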

A problem with all data classification approaches is the need to re-index data if requirements change. Depending on the size of the data store, re-indexing can be a performance-intensive exercise.

CAS cost considerations
Upfront and ongoing costs associated with each CAS product should be considered; as data grows, the hidden costs/savings of these architectures will become more apparent. Purchasers of RAIN products from Archivas, Bycast, EMC, HP and Permabit should expect to pay as little as $7,500 for a 1.5TB configuration to as much as $350,000 or more for a 50TB setup.
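To compare the two ends of that price range, it helps to work out the cost per terabyte. The short calculation below uses only the figures quoted above; the endpoints likely come from different vendors and configurations, and the upper figure is a floor ("or more"), so the result is a rough guide rather than a like-for-like comparison.

```python
# Price points quoted for RAIN products (June 2006 figures from the text above).
low_price, low_capacity_tb = 7_500, 1.5
high_price, high_capacity_tb = 350_000, 50

low_cost_per_tb = low_price / low_capacity_tb     # 5000.0 dollars per TB
high_cost_per_tb = high_price / high_capacity_tb  # 7000.0 dollars per TB

print(f"1.5TB configuration: ${low_cost_per_tb:,.0f} per TB")
print(f"50TB configuration:  ${high_cost_per_tb:,.0f} per TB")
```

On these figures the larger setup costs at least $7,000 per terabyte against $5,000 for the entry configuration, so buyers shouldn't assume the per-terabyte price falls as capacity grows.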

Permabit is the only vendor with two pricing models: per node or by capacity. The per-node option, also offered by Bycast and EMC, will probably be the less-expensive option because capacity on CAS products could scale into the petabytes. Per-node licensing is also more favorable for products such as Bycast's StorageGrid that support portable media such as tape, because they require fewer nodes to house data.

There are exceptions, of course. Users who anticipate keeping data from different departments or customers on separate nodes may find licensing by total capacity to be cheaper. Users also need to examine what storage optimization features, if any, they'll turn on and if that will control data growth.

In the future, it's likely CAS products will emerge as the preferred means of managing structured and unstructured retention data. The four CAS options--RAIN, file system, HSM and networked storage array architectures--all allow users to start small, scale economically and satisfy the data-retention requirements of specific apps. But integrating CAS with various apps isn't always as easy as advertised, and reclassifying data after it's stored is a major headache.

This was first published in June 2006
