Published: 07 Jul 2005
Object-based content management products
Click here for a comparison table about object-based content management products (PDF).
Reducing the volume of backup data and managing it better is an increasingly pressing problem. The advent and growing acceptance of technologies such as virtual tape libraries (VTLs), network virtualization, ATA drives and iSCSI storage networks simplify the infrastructure and reduce the cost of housing data, but do little to solve the problem of managing data more intelligently.
By storing files as objects, object-based storage (OBS) products manage and protect data more efficiently than traditional backup products. OBS products use Ethernet interfaces and NAS protocols (such as NFS and CIFS) for ease of connectivity. They also minimize the amount of data stored and the time required to retrieve it, although the techniques for doing so vary among products.
Instead of storing entire files, OBS products create data objects and meta data from incoming files. By storing only the object, a company can use its storage capacity more efficiently and provide fast, secure, online access to archived files. All OBS products:
- Create a unique identifier for the file
- Store identical files together as one object
- Create and store meta data associated with the file
The OBS products' algorithms analyze the incoming file and then, depending on the product's underlying architecture, compress the file's blocks or store the entire file. Products such as Axion from Irvine, CA-based Avamar Technologies Inc. and DD200 Restorer from Palo Alto, CA-based Data Domain Inc. analyze the blocks that make up each file and store them together.
Products like Archivas Cluster (ArC) from Waltham, MA-based Archivas Inc. or EMC Corp.'s Centera analyze the incoming files and create a digital signature by using a hashing algorithm (see Hashing algorithms) against the file. They then store exact copies of the entire file as one image and create an index that matches the digital signature with the stored file object.
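The whole-file approach described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the class name `WholeFileStore`, its methods and the in-memory dictionaries are all hypothetical, and SHA1 is chosen only because it is one of the algorithms named in this article.

```python
import hashlib


class WholeFileStore:
    """Toy whole-file object store keyed by a digital signature (hypothetical)."""

    def __init__(self, algorithm="sha1"):
        self.algorithm = algorithm  # hash used to create the digital signature
        self.objects = {}           # digital signature -> stored file contents
        self.index = {}             # file name -> digital signature

    def put(self, name, data):
        signature = hashlib.new(self.algorithm, data).hexdigest()
        # Keep the bytes only if no identical file is already stored.
        if signature not in self.objects:
            self.objects[signature] = data
        self.index[name] = signature
        return signature

    def get(self, name):
        return self.objects[self.index[name]]


store = WholeFileStore()
store.put("xray-1.img", b"scan data")
store.put("copy-of-xray-1.img", b"scan data")  # identical: stored once
store.put("xray-2.img", b"scan datb")          # one byte differs: new object
print(len(store.objects))
```

Because the signature covers the entire file, a single changed byte yields a new object, which matches the behavior Luter describes below for X-rays.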
Mike Luter, CTO at the Cancer Therapy & Research Center (CTRC) in San Antonio, uses ArC to store patients' X-rays to meet HIPAA requirements. If one byte in a patient's X-ray changes, a new file is created and stored. Because each X-ray is different, Luter sees almost no capacity optimization benefits from the Archivas product; however, he expects to see capacity gains as he begins to use it for e-mail archiving in the next few months.
When any OBS product examines an incoming file for its uniqueness, it also creates meta data that's used to manage the newly created data objects. OBS products that analyze files at the sub-block level create only a few meta data attributes, such as logical associations between blocks, because the main objective of this class of products is capacity optimization. OBS products that manage access to files create a much larger set of associated meta data, which includes file ownership, security permissions, retention and expiration dates.
For companies grappling with backup data growth, products like Axion and DD200 Restorer analyze the backup data files at the sub-block level and store them at that level. Products from companies such as Archivas, EMC, Hewlett-Packard Co. and Cambridge, MA-based Permabit Inc. address issues such as e-mail, medical record archiving and legal compliance. Their products analyze and store the file based upon the content of the entire file, not just certain blocks of the file.
Object-based backup products offer the following benefits:
- Reduced backup and restore times
- Smaller data stores
- Reduced bandwidth for offsite replication
- Elimination of tape
DD200 Restorer and Axion are the only two object-based products aimed squarely at backup. DD200 Restorer stores the incoming backup data as a file and then, in the background, breaks the file into individual 4K or 8K segments and creates a unique data object for each segment. The product then checks whether that object already exists. If it doesn't, DD200 Restorer stores the new object and records its existence in its database. If it does, DD200 Restorer records the object's occurrence and its logical associations with other objects in its database, but doesn't store the object again.
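A rough sketch of this kind of segment-level deduplication follows. It is an assumption-laden illustration rather than Data Domain's actual design: the article does not say how segments are identified, so the sketch assumes a SHA1 hash of each fixed-size segment serves as its ID, and the names `SegmentStore`, `ingest` and `restore` are hypothetical.

```python
import hashlib

SEGMENT_SIZE = 4096  # the article mentions 4K or 8K segments


class SegmentStore:
    """Toy segment-level store (hypothetical; assumes hash-based segment IDs)."""

    def __init__(self, segment_size=SEGMENT_SIZE):
        self.segment_size = segment_size
        self.segments = {}  # segment ID -> segment bytes, stored once
        self.files = {}     # file name -> ordered list of segment IDs

    def ingest(self, name, data):
        """Split data into segments; return how many segments were new."""
        recipe = []
        new_segments = 0
        for offset in range(0, len(data), self.segment_size):
            segment = data[offset:offset + self.segment_size]
            seg_id = hashlib.sha1(segment).hexdigest()
            if seg_id not in self.segments:
                self.segments[seg_id] = segment  # first occurrence: store it
                new_segments += 1
            recipe.append(seg_id)  # always record the logical association
        self.files[name] = recipe
        return new_segments

    def restore(self, name):
        return b"".join(self.segments[s] for s in self.files[name])
```

Ingesting two backups that differ in only one segment stores that one extra segment; the ordered list of segment IDs kept per file is what lets `restore` rebuild either backup in full.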
The Axion system consists of an industry-standard server running Avamar's software and a set of intelligent agents deployed on each backup client. Users also have the option of purchasing Avamar's software and deploying it on their own compliant platforms. Avamar's approach differs from that of DD200 Restorer because its agent on the host server breaks the file into blocks and creates a digital signature by running a hash against each block of data. The agent then sends the digital signature to the backup repository, which checks whether the block is already stored. If the repository determines this is a new ID, it signals the agent to send the entire block of data from the server to the backup repository. The primary advantage this approach has over DD200 Restorer is that it minimizes the amount of data moved across the network from the server to the repository.
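The exchange between agent and repository might look something like the sketch below. Everything here is illustrative: the `Repository` class, the `backup` function, the fixed 4,096-byte block size and the use of SHA1 are assumptions, and the network is simulated with ordinary method calls rather than the SSL socket Avamar uses.

```python
import hashlib


class Repository:
    """Stand-in for the central backup repository (hypothetical)."""

    def __init__(self):
        self.blocks = {}         # digital signature -> block bytes
        self.bytes_received = 0  # payload that actually crossed the "network"

    def has(self, signature):
        return signature in self.blocks

    def store(self, signature, block):
        self.blocks[signature] = block
        self.bytes_received += len(block)


def backup(data, repo, block_size=4096):
    """Agent side: hash each block, send only blocks the repository lacks."""
    recipe = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        signature = hashlib.sha1(block).hexdigest()
        if not repo.has(signature):       # only the short signature is sent
            repo.store(signature, block)  # full block sent only when new
        recipe.append(signature)
    return recipe


repo = Repository()
backup(b"x" * 8192 + b"y" * 4096, repo)  # first backup: two unique blocks
backup(b"x" * 8192 + b"z" * 4096, repo)  # second backup: one new block
print(repo.bytes_received)
```

In this toy run the two backups total 24,576 bytes, but only the three unique blocks (12,288 bytes) are transferred, which is the network saving the article attributes to hashing at the host.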
Because neither of these products needs to manage large amounts of meta data, they can focus on and optimize data reduction and backup times. Frank Slootman, president and CEO of Data Domain, finds optimizing data reduction secondary to lowering backup times. "Users find data reduction interesting, but the product needs to be able to reduce the backup and restore times first," says Slootman.
Object-based storage backup products
Click here for a comparison table about object-based storage backup products (PDF).
Steve Degner, an IT manager at Power Integrations Inc. in San Jose, CA, was one of the early adopters of this technology. Degner found that DD200 Restorer reduced the time for his full backups from three days to one-and-a-half days and decreased his restore window from 14 hours to four hours.
From a network connectivity standpoint, Axion and DD200 Restorer use existing Ethernet IP LANs for communications. However, Avamar uses a secure SSL TCP/IP socket to send the backup data, while DD200 Restorer presents a standard NFS/CIFS interface as a target to the backup software.
The major difference--and it's a big one--between Axion and DD200 Restorer is how they interact with existing backup software. Avamar requires users to either replace their existing backup software agent with the Axion agent or to deploy Avamar's agent in addition to their existing backup software agent to work with Axion. The rationale behind this architecture is to compress the backup traffic at the host level, reducing the amount of network traffic.
"Avamar wants customers to abandon use of their backup software and tape libraries--solutions that took time, money and resources to put into place for operational backup and recovery," wrote Tony Asaro, a senior analyst at the Milford, MA-based Enterprise Strategy Group (ESG), in a recent report on Avamar. He praises the Avamar technology, but advises users to implement it in small doses, at the department level or at a remote office, before rolling it out to an entire storage environment.
Users who wish to maintain their existing backup software and use Axion will need to create two backup copies, one using their existing backup software and one using Axion. However, this approach increases network traffic, introduces more complexity into the environment and should only be considered as a short-term, stop-gap measure until the company standardizes its backup processes on Avamar.
Users with existing backup software will find DD200 Restorer a more palatable alternative. Data Domain's DD200 Restorer appliance works with backup applications such as CommVault Inc.'s Galaxy, EMC's Legato NetWorker and Veritas Software Corp.'s NetBackup and only requires backup administrators to redirect the backup output to the DD200 appliance.
Avamar and Data Domain can replicate data offsite once it's stored and compressed in a central repository, cutting down on the required bandwidth and time to transport the data. Data Domain reports that it has seen instances where bandwidth requirements for offsite replication are reduced to one-tenth that of a standard tape drive.
Object-based storage products that store the entire file as an object may give users a choice of hashing algorithms to create each file's digital signature. To choose the correct hashing algorithm, users should understand what a "hash" is, how it works and why one might be better for a particular environment.
A hash is a cryptographic function that takes an input of any length and produces an output of a fixed length. For instance, a common hashing algorithm used in object-based storage products is MD5, which produces a fixed-size digital signature of 128 bits. Hashing algorithms are particularly appealing for creating digital signatures because the output is, for practical purposes, unique to the input; it's also considered computationally infeasible to reconstruct the input from the output.
The primary differences between the types of algorithms used are:
- Security of the hash
- Possibility of "collisions"
- Speed in generating the digital signature
SHA1, which generates a 160-bit digital signature, is another hashing algorithm used frequently by object-based products. Of the two, SHA1 is considered the more secure because its longer, 160-bit digital signature makes it a much harder hash to break. The longer signature also makes it far less likely that two different files will generate the same digital signature, although no fixed-length hash can rule out such a collision entirely. The trade-off is speed: SHA1 does more work per file and generally runs slower than the MD5 algorithm.
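The signature lengths are easy to confirm with Python's standard `hashlib` module. The sketch below also includes a rough estimate of accidental collisions using the standard birthday approximation, n(n-1)/2^(b+1) for n objects and a b-bit signature. The helper names `signature_bits` and `collision_estimate` and the figure of one billion objects are illustrative assumptions, and the estimate says nothing about deliberate attacks on either algorithm.

```python
import hashlib


def signature_bits(algorithm, data):
    """Length in bits of the digital signature produced for the given input."""
    return len(hashlib.new(algorithm, data).digest()) * 8


def collision_estimate(num_objects, bits):
    """Birthday approximation of the chance that any two of num_objects
    distinct inputs accidentally share a signature of the given length."""
    return num_objects * (num_objects - 1) / 2 ** (bits + 1)


for name in ("md5", "sha1"):
    # The signature length is fixed regardless of the input length.
    short_input = signature_bits(name, b"a")
    long_input = signature_bits(name, b"a" * 1_000_000)
    print(name, short_input, long_input, collision_estimate(10**9, short_input))
```

Each extra bit of signature length halves the estimate, so the 32 additional bits in SHA1 shrink the chance of an accidental collision by a factor of about four billion.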
Object-based content-management products offer the following benefits:
- Data preservation and consolidation
- Capacity optimization
- Regulatory compliance
- Fast, random access to data
- Constant data availability
OBS products for content management differ architecturally from OBS products focused on backup. Content-management products preserve the user's original data for a long period of time, make sure it's accessible when needed and ensure that organizations remain compliant. While storage administrators can set policies for individual objects, vendors say that most organizations set up a default policy for all files stored in a specific directory. For example, Archivas suggests admins go through the following preparatory steps for a new application:
- Create a directory on the ArC server for the application's files.
- Within the directory, create policies that get assigned to all files stored in that directory, such as retention period or what hashing algorithm is used to create the digital signature.
- Mount the directory and present it to the app.
Unlike products intended for backup, content-management products don't change or break apart the incoming file to store it in smaller blocks. They store the file as the object--either in its native form or encrypted/compressed as products like Permabit allow--and then use hashing algorithms to analyze the file for uniqueness vs. other files already in its repository. During this analysis stage, the product's algorithm also creates the meta data associated with the file object.
The meta data includes traditional file attributes such as file ownership, creation, modification and access date, user and group access. It can also include additional attributes such as which hashing algorithm should be used to create the object's digital signature, retention period, backup requirements and last successful replication or backup.
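One way to picture this meta data, together with a directory-level default policy of the kind described earlier, is the sketch below. The field names, the `DirectoryPolicy` and `ObjectMetadata` classes and the `metadata_for` helper are hypothetical and do not reflect any product's actual schema.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class DirectoryPolicy:
    """Defaults applied to every file stored in a directory (hypothetical)."""
    retention_days: int
    hash_algorithm: str = "sha1"


@dataclass
class ObjectMetadata:
    """Illustrative subset of the attributes kept for each file object."""
    owner: str
    group: str
    created: date
    hash_algorithm: str
    retain_until: date
    last_replicated: Optional[date] = None

    def may_delete(self, today):
        # The object can be expired only once its retention period has passed.
        return today >= self.retain_until


def metadata_for(owner, group, created, policy):
    """Build an object's meta data from its directory's default policy."""
    return ObjectMetadata(
        owner=owner,
        group=group,
        created=created,
        hash_algorithm=policy.hash_algorithm,
        retain_until=created + timedelta(days=policy.retention_days),
    )


policy = DirectoryPolicy(retention_days=365)
meta = metadata_for("jdoe", "radiology", date(2005, 5, 1), policy)
print(meta.retain_until, meta.may_delete(date(2005, 12, 31)))
```

Deriving the retention date from the directory policy at ingest time mirrors the vendors' observation that most organizations set one default policy per directory rather than per object.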
EMC's Centera Seek software allows storage admins to search and retrieve files from all of the applications on their EMC Centera. For example, an administrator can retrieve all documents from John Doe that were created between May 1 and May 31 with keywords such as "change," "alteration" or "conversation," regardless of which app was used to create the specific file.
Once files are secured, benefits like data consolidation and capacity optimization emerge. Users will see the most noticeable improvement with e-mail apps such as Exchange and Notes because they allow a single instance store of the same attachment sent to multiple users. This reduces the amount of storage and overhead on the e-mail server while allowing the organization to meet compliance regulations.
While products like EMC's Centera, HP's Reference Information Storage System (RISS) and Permabit's Permeon present a standard NFS or CIFS mount point to the e-mail server, they add a new NAS device to the storage environment. With organizations moving toward global name spaces and standardized NAS interfaces, the last thing the storage or network group may want to see is another specialized NAS product added to the environment. There may also be other considerations. For example, with Centera, users will need to ensure their e-mail software has the necessary APIs to communicate with Centera; they'll also likely need to purchase and maintain that interface as part of their ongoing e-mail management.
Most of these OBS content-management products need to provide availability 24x7 and deliver acceptable performance. To achieve these requirements, vendors are primarily using off-the-shelf Intel servers running a Linux operating system in some type of highly available configuration--clustered or N+1--with RAIDed ATA drives in the background. They generally have their own software running on each server that constantly monitors the integrity of the data, and will either repair or copy the data to another node if an error is detected.
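The monitor-and-repair behavior can be sketched as a simple scrub pass. This is a toy model under stated assumptions: nodes are plain dictionaries, each object's signature is assumed to have been recorded at ingest, SHA1 is used for the check, and the names `signature` and `scrub` are hypothetical. Real products run such checks continuously and handle many more failure cases.

```python
import hashlib


def signature(data):
    return hashlib.sha1(data).hexdigest()


def scrub(nodes, expected):
    """Check every stored copy against its recorded signature and repair
    corrupted copies from an intact one.

    nodes:    list of dicts, each mapping object ID -> stored bytes
    expected: dict mapping object ID -> signature recorded at ingest
    Returns the number of copies repaired.
    """
    repaired = 0
    for obj_id, good_sig in expected.items():
        # Find any node that still holds an intact copy of this object.
        healthy = next(
            (node[obj_id] for node in nodes
             if obj_id in node and signature(node[obj_id]) == good_sig),
            None,
        )
        if healthy is None:
            continue  # no intact copy left; a real system would raise an alert
        for node in nodes:
            if obj_id in node and signature(node[obj_id]) != good_sig:
                node[obj_id] = healthy  # overwrite the damaged copy
                repaired += 1
    return repaired


xray = b"patient x-ray"
expected = {"obj1": signature(xray)}
nodes = [{"obj1": xray}, {"obj1": b"patient x-rby"}]  # second copy is damaged
print(scrub(nodes, expected))
```

The same digital signature that identifies an object doubles as its integrity check, which is why these products can detect silent corruption without comparing copies byte by byte.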
Joshua Freeman, IT director at the New York Botanical Garden, uses Archivas ArC because it's hardware-agnostic and built on open-source code. He also found that it gave him so much additional low-cost capacity that he was able to use it as both an archive server and a file server. This allowed Freeman's users to store and stage items such as field notes or images of plant specimens prior to their eventual placement in the Botanical Garden's main object database. Likewise, CTRC's Luter hopes to use his ArC configuration to automate the flow of X-rays from Tier-1 storage to lower-cost storage, something his staff currently does manually.
Freeman, Luter and Power Integrations' Degner reflect the growing interest users have in better managing their storage and data. OBS products that minimize and eliminate duplicate data while taking advantage of low-cost storage technologies in highly available configurations are becoming more popular. These products will become even more useful when features such as replication and automated workflow are added.