Imagine if each day you could parse every tiny segment of data in the enterprise and only back up or archive those parts that have truly changed, rather than backing up entire files, databases and objects. Object-based backup--an emerging field of storage technology that only a handful of companies are focusing on--introduces a new medium for data protection and retention. It presents a software infrastructure that reinvents the way we think about and visualize production data backup and archive activities. And it just may be the foundation for a strategy to use inexpensive commodity servers, disk arrays and IP networking far more effectively.
An object-based system can determine if any changes to a file or its attributes have occurred since it was last backed up. If modifications are detected, only the changes are backed up--not the entire file. This can eliminate the unnecessary copying of large amounts of data, thus significantly speeding up backups and reducing the amount of storage space required.
Hashing and storage
Hashing algorithms--one of the key components of object-based storage--were developed in academic computing circles decades ago and are widely used in computer security, encryption and authentication technologies. With hashing, a string of data is analyzed to produce a fixed-length value, or signature, that for practical purposes uniquely identifies the original segment of data. In security systems, hashing algorithms are commonly used alongside public-key encryption protocols, where a simple string of password data may be run through a 128-bit hashing algorithm to produce a signature--distinctive enough, that is, to require as many as 2^128 guesses for an interloper to find an input that matches it.
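A few lines of Python make the idea of a signature concrete. This is only a sketch: MD5 is used because its digest happens to be 128 bits long, not because any vendor is known to use it (and MD5 is no longer considered collision-resistant). The function name `signature` and the sample strings are assumptions made for illustration.

```python
import hashlib

def signature(segment: bytes) -> str:
    """Return the 128-bit MD5 digest of a data segment as 32 hex characters."""
    return hashlib.md5(segment).hexdigest()

# Identical input always yields the identical signature.
a = signature(b"quarterly report, draft 1")
b = signature(b"quarterly report, draft 1")

# Changing a single character yields a different signature.
c = signature(b"quarterly report, draft 2")

same_input_same_signature = (a == b)
different_input_different_signature = (a != c)

# The digest is 16 bytes, so there are 2**128 possible signature values.
digest_bits = hashlib.md5().digest_size * 8
possible_values = 2 ** digest_bits
```

The signature is always the same length no matter how large the input segment is, which is what makes it practical to keep in an index.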
In object-based storage, hashing algorithms are used similarly to uniquely identify segments of data that are stored in file systems. Incoming data files are parsed into uniformly sized objects and a hash value is calculated for each one. The results are then compared to those in an existing hash index, which is essentially a database that holds the hash values--or meta data--for each data segment. If an identical hash value already exists in the index, then the piece of data it represents isn't copied (backed up). If the new hash value doesn't match any in the hash index, it's added to the index and the associated data object is copied into the object-based environment.
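The parse, hash and compare cycle described above can be sketched in a few lines. This is a minimal illustration rather than any vendor's implementation: the four-byte object size, the use of SHA-256, the in-memory dictionary standing in for the hash index, and the names `backup` and `restore` are all assumptions.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical object size, kept tiny so the example is readable

def backup(data: bytes, index: dict) -> list:
    """Parse data into uniformly sized objects and store only the new ones.

    Returns the ordered list of hash values needed to rebuild the data.
    """
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in index:
            # Unknown hash value: add it to the index and copy the object.
            index[digest] = chunk
        recipe.append(digest)
    return recipe

def restore(recipe: list, index: dict) -> bytes:
    """Reassemble the original data from its list of hash values."""
    return b"".join(index[digest] for digest in recipe)

index = {}
monday = backup(b"AAAABBBBCCCCDDDD", index)
stored_after_monday = len(index)    # four objects copied

# Only the third object differs on Tuesday.
tuesday = backup(b"AAAABBBBXXXXDDDD", index)
stored_after_tuesday = len(index)   # one additional object copied
```

Both versions of the file remain fully restorable, yet the second backup added one object to the store instead of four.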
|Traditional backup is storage-intensive|
Why is object-based backup so important?
What makes object-based backup particularly appealing is that it drastically reduces the amount of data that needs to be backed up. Object-based backup vendors clearly differentiate themselves from vendors offering disk-only backup appliance solutions.
Because object-based storage is relatively immature, products are only available from a handful of vendors. But given the technology's promise, more storage vendors are sure to follow (see "Object-based storage vendors and products"). Note that vendors often use different terminology to describe object-based backup, including reference data, commonality factoring, data coalescence, single-instance storage or content-addressable storage.
How it works
Vendor-specific software implementations commonly leverage the following hardware components in an object-based backup architecture:
- Client interface. This could be a standalone client module, a networked file system interface (such as NFS or CIFS) or existing backup software clients.
- Portal node. The portal node is typically a rack-mounted Intel server running a stripped-down version of Linux and specialized software that manages data and object-based processing (parsing, hashing, indexing, etc.).
- Storage nodes. These are often rack-mounted servers containing high-density ATA or Serial ATA (SATA) disk drives. Usually, five or six storage nodes will be clustered with a portal node. Storage nodes usually provide redundant storage services for data storage, and in some cases, redundancy for meta data stores.
- Gigabit Ethernet. These architectures use Gigabit Ethernet to connect clients to portals and storage nodes. Typically, switches populate the cluster environment to enable high-performance switched network traffic among clustered storage and portal nodes.
Architecturally, the combined hardware components form a clustered storage environment where meta data (hash-derived index files) and data objects are stored and managed across multiple storage nodes. Some implementations use a checksum routine, plus mirroring, to ensure data integrity, while others use RAID technologies. To avoid data loss that could occur at a single storage server or disk array, a RAID-type algorithm propagates data across the storage nodes in the cluster, ensuring that data parity is established for all data in the cluster. Portal servers are also clustered to enable fault tolerance and load balancing for the front-end data processing function. In some deployments, data parsing, hashing and index processing are distributed across the portal and storage node clusters in a grid-computing style of workload distribution. The result is a highly intelligent storage cluster, built with low-cost, high-density commodity hardware components.
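To see how parity spread across storage nodes protects against the loss of a single node, consider the simplest RAID-style scheme, XOR parity. The sketch below is a generic illustration and not a description of any particular product; the three data blocks and the helper name `xor_blocks` are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

# One block of data held on each of three hypothetical storage nodes.
data_blocks = [b"node", b"data", b"here"]

# The parity block is kept on a fourth node.
parity = xor_blocks(data_blocks)

# Simulate losing the second node: XOR the survivors with the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)   # recovers b"data"
```

Because XOR is its own inverse, any single missing block can be recomputed from the remaining blocks plus parity, which is why the failure of one storage server need not mean lost data.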
Client access to the subsystems occurs either through standard NFS- or CIFS-mounted file systems or through a custom client software interface, depending on which vendor and implementation strategy best fits the client infrastructure. A common theme--not unlike traditional backup systems--is that data is copied into the backup environment and transferred across IP networks into the object-based storage environment. Some object-based backup systems will work with existing backup software tools, while others completely replace installed backup systems and associated client applications. (See "Object-based storage vendors and products")
Object-based backup has the potential to substantially reduce the cost and time it takes to transfer data to a remote location for disaster recovery applications. Object-based backup's low bandwidth requirements may also enable distributed implementations, where smaller satellite installations can replicate data to other locations for disaster recovery purposes.
|Object-based storage vendors and products|
Data currency--to a backup software system--determines whether or not data has changed since the last incremental backup. If a single file attribute has changed since the last backup, that file will inevitably be backed up again. This applies to backup systems using the incremental forever methodology. The problem is compounded in environments where a full backup is run daily, weekly or monthly, in addition to differential or incremental backups. The result is that environments store many redundant versions of files, although actual subfile level changes to the data may be insignificant.
The volume of stored data can be extraordinary. Based on the backup metrics in place, the ratio of active to backup data may be as high as 1:25. "Traditional backup is storage-intensive" on this page shows ratios that are likely to occur with common backup scenarios. Specific ratios vary, depending on site-specific data change rates, backup policies, archival retentions and the underlying technologies being used for backup.
The redundancy of data exists because of the practice of evaluating data currency exclusively at the file level. Vendors aim to improve the ratio of primary (active) to secondary (backup) storage to 1:2. This means that a 20TB object-based backup disk pool would be sufficient to manage data backups in an environment with 10TB of active data. Because the ratio is low, a large pool of inexpensive disk for backup works in some environments.
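The arithmetic behind these ratios is easy to check. The sketch below applies the two ratios quoted in the text, the 1:25 worst case and the 1:2 vendor target, to the 10TB example; the function name `backup_pool_tb` is made up for illustration, and real ratios will vary by site.

```python
def backup_pool_tb(active_tb: float, backup_ratio: float) -> float:
    """Backup capacity implied by a 1:backup_ratio active-to-backup ratio."""
    return active_tb * backup_ratio

active_tb = 10

# Worst case cited for traditional file-level backup (1:25).
traditional_tb = backup_pool_tb(active_tb, 25)

# Ratio that object-based backup vendors are aiming for (1:2).
object_based_tb = backup_pool_tb(active_tb, 2)

savings_tb = traditional_tb - object_based_tb
```

For 10TB of active data, that is the difference between a 250TB backup footprint and a 20TB one.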
Where it can work best
Two common enterprise file types--e-mail file stores and databases--illustrate how the benefits of object-based backup can be realized. These tend to be large files that are backed up entirely if they are changed--even when the changes are not extensive. It's such a concern that it has spawned a software cottage industry.
The ubiquitous Microsoft Outlook personal folder files (.pst) residing on vast numbers of corporate file servers are a case in point. Each time Outlook is run, a personal folder file is opened and updated, regardless of whether any e-mail messages were added or deleted from the file store. Because the .pst was updated by the application, the file will be backed up during the next backup session because it appears to have changed. For a company with hundreds or thousands of Outlook users, the volume of unnecessary backup activity can be staggering.
Database applications pose a similar problem. In ideal circumstances, backup applications will interface directly with the database backup API to extract only incremental changes to a database. These are effective solutions, but more common practices for database backup are typically less efficient. It's common to export the database to file for backup, quiesce the database for a full backup or copy the database to a split mirror volume for backup. Like the mail store file, every moment a database application runs, the database file attributes are modified, resulting in the entire file object being backed up.
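The difference between the two approaches shows up clearly when a file's attributes change but its contents don't. The following sketch contrasts a file-level check with an object-level one. It is a simplified model, assuming an in-memory stand-in for a file (`FileVersion`), an eight-byte object size and SHA-256 hashing, none of which come from a specific product.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class FileVersion:
    content: bytes
    mtime: int  # last-modified timestamp, as a file system would record it

def changed_by_attributes(old: FileVersion, new: FileVersion) -> bool:
    """File-level test: any attribute difference marks the whole file as changed."""
    return old.mtime != new.mtime or len(old.content) != len(new.content)

def chunk_hashes(content: bytes, size: int = 8) -> set:
    """Hash values of the fixed-size objects that make up the content."""
    return {hashlib.sha256(content[i:i + size]).hexdigest()
            for i in range(0, len(content), size)}

def new_objects(old: FileVersion, new: FileVersion) -> int:
    """Object-level test: count the objects that were not already stored."""
    return len(chunk_hashes(new.content) - chunk_hashes(old.content))

# A mail file that was opened and closed, but not actually edited.
yesterday = FileVersion(b"inbox: 3 messages, nothing new..", mtime=1000)
today = FileVersion(yesterday.content, mtime=2000)

file_level_says_changed = changed_by_attributes(yesterday, today)  # True
objects_to_back_up = new_objects(yesterday, today)                 # 0
```

The file-level test sends the entire file to backup again, while the object-level test finds nothing new to store.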
Object-based disk technologies are also being adapted to serve a primary storage role. EMC Corp.'s Centera offers a similar concept to provide content-addressed storage (CAS) functionality for content management applications, where large volumes of redundant files can be managed as a single instance in a disk-based appliance. EMC's Centera offers a way for applications to write data into a content-addressable primary storage environment, using an API designed for accessibility by a wide array of content management applications. The object-based market is entirely different, however, due to the ability of those products to manage data objects at a subfile level.
Object-based backup vendors offer a different technology breed, yet with the same basic principles and goals: to help organizations manage--in an extremely efficient manner--large volumes of infrequently changing data in a disk environment. Whether or not object-based backup will function as primary or secondary storage, or both, is still to be determined, so expect to see multiple product marketing strategies from the vendor community.
Real-world object-based backup
Early adopters of object-based backup will be organizations experiencing significant problems with backup, for which a cutting-edge solution may be less risky than attempting to retool their current backup infrastructure. Companies facing increased demand to keep archival data accessible as a result of regulatory compliance requirements may also benefit from object-based backup and archive solutions.
In addition, enterprises with branch offices or remote operations that need a low-overhead, small-footprint backup solution may be prime candidates for object-based backup. In large companies where significantly altering backup/archive operations might stir up budgeting, cultural or technical issues, their major investments in backup software and tape are likely to deter any radical changes. But even in those cases, a hybrid approach may prove to be a beneficial blending of architectures.
The two primary object-based backup implementation scenarios are:
- Standalone implementations of object-based backup that are intended to completely replace existing backup infrastructure (see "Three ways to use object-based backup")
- Integrated implementations that place the object-based backup into existing backup infrastructures
Backing up files in use--also known as fuzzy backup--is also an issue with object-based backup. To successfully back up this data, the responsibility lies with applications and intelligent API-type backup modules to extract data out of the application for backup. The classic example is database backup, where the database application must be halted or the database must have an export function to create static files for backup.
Object-based backup doesn't solve this dilemma because active data backup issues at the file level still exist at the subfile level. Object-based backup vendors with standalone client software must develop their own interfaces to applications and databases to deal with so-called fuzzy backups.
Another scenario involves using dual infrastructures for multiple-site disaster recovery, where a primary object-based backup environment replicates data changes to an alternative mirror environment. Object-based backup provides a different way of replicating data between sites. Because only subfile level changes to the production backup environment would be replicated, the network bandwidth and latency requirements demanded by modern replication tools would be significantly lowered.
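A rough sketch shows why replicating only changed objects keeps bandwidth demands low. Here the primary and mirror sites are modeled as dictionaries keyed by hash value; the names `object_id` and `replication_delta`, the sample blocks and the choice of SHA-256 are assumptions for illustration only.

```python
import hashlib

def object_id(obj: bytes) -> str:
    """Hash value that identifies a stored object."""
    return hashlib.sha256(obj).hexdigest()

def replication_delta(primary: dict, mirror: dict) -> dict:
    """Objects held at the primary site that the mirror site doesn't yet have."""
    return {oid: obj for oid, obj in primary.items() if oid not in mirror}

# Both sites start out in sync with three stored objects.
objects = [b"block-1", b"block-2", b"block-3"]
primary = {object_id(obj): obj for obj in objects}
mirror = dict(primary)

# A subfile-level change at the primary site adds one new object.
changed = b"block-2-edited"
primary[object_id(changed)] = changed

delta = replication_delta(primary, mirror)
bytes_to_send = sum(len(obj) for obj in delta.values())      # the new object only
bytes_full_copy = sum(len(obj) for obj in primary.values())  # everything stored

# Applying the delta brings the mirror back in sync.
mirror.update(delta)
```

Only the single new object crosses the wire; the objects the mirror already holds are never resent.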
Reliability is a common concern for those considering object-based backup. Protecting an object-based environment against meta data failure and corruption is a valid concern. When object-based backup is used to complement existing backup software and tape environments, traditional tape-based disaster recovery is a legitimate fall-back strategy.
Object-based backup has the potential to dramatically change the way disk is used in production backup. The technologies are immature, but their potential to streamline disk-based backup should not be underestimated.
|Three ways to use object-based backup|