Managing data with an object storage system
Object storage isn't a new concept in the NAS world, but some new products are bypassing traditional file system interfaces as industry debate emerges about the best way to cope with unstructured data.
In the age of Web 2.0, the cloud and the digital content explosion, enterprise data storage managers are reevaluating how they store unstructured data as vendors roll out new object-based storage systems designed to offer simplified management and more scalable metadata schemes.
Unstructured data is expected to far outpace the growth of structured data over the next three years. According to the "IDC Enterprise Disk Storage Consumption Model" report released last fall, transactional data is projected to grow at a compound annual growth rate (CAGR) of 21.8%, while unstructured data is predicted to grow at a 61.7% CAGR.
"There are going to be extreme amounts of data as things like digital video and mobile networks grow; in five years, pretty much every phone will be 'smart,'" said Robin Harris, senior analyst at StorageMojo. "All of us storage geeks agree on that, and different people are beginning to visualize what that kind of growth needs in terms of storage infrastructure."
Think APIs, not files
Traditional hierarchical file systems organize data into "trees" consisting of directories, folders, subfolders and files. Files are a logical representation of blocks of data associated with an application and are the most familiar means of working with data. Network file system interfaces like NFS and CIFS are well-understood, standardized methods of conveying the logical groups of blocks from a storage repository to an application.
A problem arises, however, when a traditional file system, which has a theoretically limited number of files it can address in a single directory and tracks only simple metadata, runs into massive repositories of similar files.
"File systems make less sense over time as the amount of data grows," StorageMojo's Harris said. "Architecturally, it makes more sense for each file to have a unique 128-bit ID and use an Internet-like system for locating that file; a URL points to an address and there are files at that address, and object-based storage interfaces are essentially operating on the same principle."
With an object ID replacing a file name, more extensive data can accompany an object than the simple "created," "modified" or "saved on" fields available in traditional file systems. Thus, detailed policies can be applied to objects for more efficient and automated management.
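To make the idea concrete, here is a minimal Python sketch of a flat namespace in which each object is found by a 128-bit ID rather than a path and carries arbitrary metadata. The class names, the metadata fields and the use of a random UUID as the ID are illustrative assumptions, not the design of any product mentioned in this article.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    data: bytes
    metadata: dict = field(default_factory=dict)


class FlatObjectStore:
    """Toy flat namespace: no directories, only IDs mapped to objects."""

    def __init__(self):
        self._objects = {}

    def put(self, data, **metadata):
        object_id = uuid.uuid4()  # random 128-bit identifier (an assumption)
        self._objects[object_id] = StoredObject(data, dict(metadata))
        return object_id

    def get(self, object_id):
        return self._objects[object_id]

    def find(self, **criteria):
        """Return IDs of objects whose metadata matches every criterion."""
        return [
            oid for oid, obj in self._objects.items()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        ]


store = FlatObjectStore()
# Hypothetical metadata fields, richer than created/modified timestamps.
scan_id = store.put(b"scan pixels", content_type="medical-image", retention_years=7)
memo_id = store.put(b"hello", content_type="memo", retention_years=1)
```

Because the metadata travels with the object, a management policy can select objects by attribute, for example store.find(content_type="medical-image"), instead of by where they sit in a directory tree.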
Without NFS or CIFS to serve up files to applications, object-based storage systems need to replace that layer of abstraction between raw blocks of data on disk and files that applications can recognize. Today's object-based systems use standard APIs such as Representational State Transfer (REST) and Simple Object Access Protocol (SOAP), or proprietary APIs to tell applications how to store and retrieve object IDs.
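As a rough illustration of what a REST-style object interface looks like from the application side, the Python sketch below builds the two HTTP requests involved: one to store an object along with its metadata, one to retrieve it by ID. The endpoint URL, the "X-Meta-" header convention and the sample ID are hypothetical, and the requests are only constructed, never sent; real products define their own URLs, headers and authentication.

```python
import urllib.request

BASE_URL = "http://objectstore.example/objects"  # hypothetical endpoint


def build_store_request(data, metadata):
    """Build (but do not send) a POST that stores a new object."""
    # Metadata rides along as HTTP headers; the prefix is an assumption.
    headers = {f"X-Meta-{key}": str(value) for key, value in metadata.items()}
    return urllib.request.Request(BASE_URL, data=data, headers=headers, method="POST")


def build_retrieve_request(object_id):
    """Build (but do not send) a GET that fetches an object by its ID."""
    return urllib.request.Request(f"{BASE_URL}/{object_id}", method="GET")


store_req = build_store_request(b"scan pixels", {"retention-years": 7})
fetch_req = build_retrieve_request("0f8fad5bd9cb469fa16570867728950e")
```

The application never names a volume, directory or file; it only needs the ID it was handed when the object was stored.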
New object-based storage products target the cloud
For companies like Amazon, Flickr, Google or YouTube, whose intellectual property and differentiation come from offering Web-based applications, programming their own interfaces isn't such a big deal. But for companies with dozens or hundreds of applications, cobbling together code to make each app work with object-based storage is likely to be an onerous and uneconomical task. There are, however, some storage vendors that offer pre-built but flexible architectures that do the job.
Caringo Inc. was first to position a content-addressed storage (CAS) system for nearline rather than archival storage, where CAS systems like EMC Corp.'s Centera (designed by the same engineers who later founded Caringo) historically played. In May 2008, the company claimed that its CAStor product can take the place of a file system or global namespace in traditional clustered storage products. CAStor runs CIFS or NFS using a file system gateway that can also be clustered (although no global namespace is available on the gateway), as well as HTTP access natively. According to the company, CAStor can be installed on nearly any x86 hardware with direct-attached storage (DAS).
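For readers unfamiliar with content addressing, the general idea can be sketched in a few lines of Python: the address of an object is derived from its content, so identical content always lands at the same address and can be verified on the way out. This is a generic illustration that assumes SHA-256 as the digest; it does not describe how CAStor or Centera actually compute addresses.

```python
import hashlib


class ContentAddressedStore:
    """Toy content-addressed store: the address is a digest of the data."""

    def __init__(self):
        self._blobs = {}

    def put(self, data):
        address = hashlib.sha256(data).hexdigest()
        self._blobs[address] = data  # identical content maps to one entry
        return address

    def get(self, address):
        data = self._blobs[address]
        # Recompute the digest to confirm the content is unchanged.
        if hashlib.sha256(data).hexdigest() != address:
            raise ValueError("stored content no longer matches its address")
        return data


cas = ContentAddressedStore()
first = cas.put(b"quarterly report")
second = cas.put(b"quarterly report")  # same bytes, so the same address
```

Storing the same bytes twice yields a single entry, which is one reason the approach has traditionally suited fixed reference and archival content.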
EMC entered the market in November 2008 with its Atmos system, which it dubbed cloud-optimized storage (COS). Atmos uses object-based metadata to allow users to set policies that determine where to store data, which services to apply to it, and how many copies should be created and where they should be stored. REST and SOAP Web services are built in, as are capabilities such as replication, versioning, compression, data dedupe and disk spin-down. Users don't have to set up file systems or assign logical unit numbers (LUNs); during setup, they simply answer a few questions to set policies.
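Policy-driven placement of this kind can be pictured with a short Python sketch: rules inspect an object's metadata and return how many copies to keep, where to keep them and which services to apply. The rule format, site names and metadata keys here are invented for illustration and are not Atmos' actual policy language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    copies: int
    sites: tuple
    compress: bool = False


# Hypothetical rules: the first predicate that matches the metadata wins.
RULES = [
    (lambda meta: meta.get("tier") == "critical",
     Policy(3, ("boston", "london", "tokyo"))),
    (lambda meta: meta.get("content_type") == "log",
     Policy(1, ("boston",), compress=True)),
]
DEFAULT_POLICY = Policy(2, ("boston", "london"))


def select_policy(metadata):
    """Return the first matching policy, or the default if none match."""
    for matches, policy in RULES:
        if matches(metadata):
            return policy
    return DEFAULT_POLICY


def plan_placement(metadata):
    """List the sites that should each hold one copy of the object."""
    policy = select_policy(metadata)
    return list(policy.sites[:policy.copies])
```

With these made-up rules, plan_placement({"tier": "critical"}) returns three sites, while an object with no matching metadata falls back to two copies.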
DataDirect Networks Inc. announced Web Object Scaler (WOS) in June 2009, and was expected to ship the system before the end of 2009. EMC said Atmos can scale to multiple petabytes and billions of files, but DataDirect Networks said WOS can handle more than 200 billion files and 6 petabytes (PB); the company also claims a performance advantage over Atmos because its system holds object metadata in memory on its server nodes. Atmos metadata is partitioned and stored in a collection of databases spread across many disks in the system.
Cleversafe Inc. brought its dsNet Object Store out of the beta testing phase in September 2009. Cleversafe's SliceStor storage nodes can break a single file into as many as 11 pieces for redundancy, creating a hash that's appended to each slice for reconstruction. Cleversafe provides built-in encryption and previously offered the product with a block-level iSCSI or WebDAV interface. It now also offers APIs for object-based access to the dsNet, based on the Java software development kit (SDK) or on REST.
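The slice-plus-hash idea can be sketched in Python as follows. This is a deliberately simplified assumption: it cuts the data into 11 plain slices and appends a SHA-256 digest to each so that damage is detected at reassembly. It adds no redundancy, so unlike a real dispersal scheme it cannot rebuild the data when a slice is lost, and it is not Cleversafe's algorithm.

```python
import hashlib

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes


def slice_with_hashes(data, pieces=11):
    """Split data into `pieces` slices, each followed by its SHA-256 digest."""
    size = -(-len(data) // pieces)  # ceiling division
    slices = []
    for i in range(pieces):
        chunk = data[i * size:(i + 1) * size]
        slices.append(chunk + hashlib.sha256(chunk).digest())
    return slices


def reassemble(slices):
    """Verify each slice against its appended digest, then join the chunks."""
    chunks = []
    for index, item in enumerate(slices):
        chunk, digest = item[:-DIGEST_SIZE], item[-DIGEST_SIZE:]
        if hashlib.sha256(chunk).digest() != digest:
            raise ValueError(f"slice {index} failed its integrity check")
        chunks.append(chunk)
    return b"".join(chunks)
```

In a production system each slice would go to a different storage node, and the encoding would be designed so that the original can be recovered from a subset of the slices.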
More recently, NetApp Inc. cloud czar Val Bercovici revealed in a blog post that the company best known for network-attached storage (NAS) will also be offering a native object storage interface "in the not too distant future."
The object debate
Paul Carpentier, Caringo's CTO and co-founder, invented CAS as founder of FilePool, which became Centera after it was sold to EMC in 2001. Carpentier has become perhaps the most outspoken proponent of object-based storage systems as a replacement for file systems altogether. "It's a heated debate," Carpentier said. "Personally, I'm very convinced we've stretched the hierarchical thing way too long."
Carpentier argues that file systems were originally built to allow concurrent access to smaller groups of objects shared among a few users. But now, he said, there's a "mismatch between prevailing use cases [for unstructured data] and how those systems work. Ninety percent to 95% of us don't need a storage system with concurrent locking for reference information."
Carpentier noted that the management of file systems is too meticulous to be practical at petabyte scale. "Some products create a virtualization layer that presents a global namespace, but there might be 20 underlying file systems you have to manage individually, and sooner or later the Web 2.0 business model bumps into an impossibility," he said. Furthermore, at scale, "backup just doesn't cut it anymore, you need live replication."
Object interfaces decouple data from the underlying disk hardware in a way file systems can't keep up with, said Cleversafe CEO Chris Gladwin. "With objects, there isn't a size limit or a concept of drive size; there's just a single namespace that can theoretically encompass all the hard drives on the planet."
One EMC and NetApp user said he agrees with this point of view. "I feel really strongly that the file systems we have today are not all that great. In the mainframe days, you could include attributes with a file to help manage them," said Tom Becchetti, a veteran storage professional who asked that his company not be named because of organizational policy. "With file systems, if you need to manage some files differently from others, you do it in separate server buckets today."
That runs counter to the consolidation going on with server virtualization, and Becchetti said object-based storage "could be a key enabler to grow the virtual [server] world, where an object isn't a file but a VMDK [virtual machine disk file]. It could mean I could share a VMDK between more physical servers than is possible with today's file systems, and protect it on a grander scale with policy-based management, where I could say anything with 'P' in the VMDK name should be protected this way vs. anything with 'D' in the name."
Still, even in some of the most demanding environments, users said file systems can get the job done. Speaking on a recent Wikibon.org conference call, Eugene Hacopians, senior system engineer at the California Institute of Technology (the academic home of NASA's Jet Propulsion Laboratory), said the 2 PB of storage in his environment, comprising billions of 5 KB to 25 KB files, still runs mostly on traditional storage systems from Nexsan Technologies Inc.
But that's been a matter of timing, project lifecycles and budget rather than technical preference. "We have looked at [object-based storage] and are considering it for newer projects," Hacopians said. "It's difficult to convert to new technology and fork out additional money when you're in the middle of trying to deliver on a project."
Different products for different use cases
Another viewpoint maintains that file vs. object doesn't have to be an either-or proposition. NetApp and EMC, for example, have both expressed this point of view.
"If there are limits to traditional file systems, we're not running into them today," said Peter Thayer, director of marketing, midrange products at EMC. "It's more a matter of application-centric use cases in Web 2.0 requiring additional metadata than running out of gas in the traditional file system space today."
John Hayden, EMC's CTO of NAS engineering, added that if users require shared read/write access to the same files, "you'll get more horsepower out of traditional file systems today in terms of performance."
NetApp's Bercovici echoed that outlook. NetApp continues to roll out file system-based products, most recently its Ontap 8 operating system, which will support scale-out. However, "if you need to support millions, hundreds of millions or billions of similar objects, like medical images, storage interfaces are just overhead," he said. "You don't want to create LUNs, folders and permissions; you just want a single scalable directory."
Some users find a combination of products works best for different needs within the same environment. At the Johns Hopkins University Bayview Research Campus Center for Inherited Disease Research, data processing for genetic research is done using clients attached to a 72 TB Isilon Systems Inc. clustered NAS system, but once data passes from being actively shared among researchers to being kept as reference information, it's moved to Caringo's CAStor object-based system.
"Isilon provides a large shared file system to support desktop data analysis for the computers that drive instruments in our lab," said Lee Watkins Jr., the center's director of bioinformatics. It's important to have file-locking capabilities and the ability to manage permissions across both Windows and Linux OSes in this environment, though Watkins said this can often carry management headaches. "We have very large files people need access to from Linux, Mac OS X and Windows desktops, some reading, some writing, and we have to decide how to balance throughput to the different [Isilon] nodes -- which file system is going to mount to each node," he said.
Once data passes into the archive stage, Watkins said it's more important to be able to access the data and metadata quickly when it's needed. "We also produce a tremendous amount of data. It can be between a terabyte and 3 TB per day," he said. For Johns Hopkins, writing an application to access the Caringo storage through an API "was pretty simple," according to Watkins. "We can move files around on the back end and not worry about addressing and where it is, and it doesn't matter what operating system is requesting the file."
Combining file protocols with object stores
File and object aren't necessarily mutually exclusive ideas even within the same system. In fact, several existing scale-out NAS systems already have object stores underlying a file interface, including BlueArc Corp.'s Titan, Panasas Inc.'s ActiveStore and ParaScale Inc.'s Hyper-scale Storage Cloud.
"Objects are kind of an overloaded term," said Brent Welch, Panasas' director of software architecture. "Different people define it differently, but it's essentially a container for data that serves as a building block for higher-level storage systems." The Panasas distributed file system knits together NFS with an underlying object store to meet the scalability demands of high-performance computing.
Systems like CAStor and Atmos essentially peel back the network protocol layer and let the application interface directly with the object store. Some products, like BlueArc's Titan, also allow administrators to search using more detailed object-based metadata schemes, though end users in the environment access the system through NFS.
James Rainey, BlueArc's executive director of strategic technology, said BlueArc has allowed some partners to integrate applications directly into the object store using a proprietary API, and they're considering opening up that API for more general use.
Some enterprise users are looking to ease object-based systems for archival data into their environments by putting together standard file-based access with one of the newer object storage systems built on commodity hardware. By contrast, BlueArc stores file system and object metadata in proprietary field-programmable gate arrays (FPGAs), and Panasas uses a proprietary NFS client (see "Peaceful coexistence: Object meets file" below).
Peaceful coexistence: Object meets file
"We have a lot of legacy stuff -- we want to use objects for scalability of medical image archives long term, but we're not a Web 2.0 company that can start fresh with a database and objects. Meanwhile, almost any computer system on the planet can connect through CIFS and NFS," said Michael Passe, storage architect at Boston-based Beth Israel Deaconess Medical Center.
Passe is working with EMC Corp. engineers to get file access into Atmos. "They're helping us push forward the file protocol side, but there's significant work to do with Samba to connect to Windows systems via CIFS," he said.
While managing objects will become a necessity down the road, Passe said Atmos' commodity hardware and scale-out architecture have appeal right now. "We went from Centera, at $8 per raw gigabyte of data, to Atmos, at less than a dollar per raw gigabyte," he said. "Even if it makes four copies for data protection, it's still only $2.80 per raw GB."
Connecting Windows systems via CIFS and Samba to an object-based system is fairly esoteric. However, Brent Welch, Panasas Inc.'s director of software architecture, said that Version 4.1 of the NFS standard will include support for connecting via the pNFS client to file, block or object-based storage systems, potentially easing integration of object-based storage into enterprise environments with legacy data like Passe's.
Despite the efforts to meld object and file systems, StorageMojo's Harris predicts the debate over files and objects will continue. "There has been a low-level religious war going on for quite some time," he said. "File systems have been a key technology for decades, but we're rapidly reaching the point … where it doesn't make sense to tie data to a specific disk drive attached to a specific path name anymore."
BIO: Beth Pariseau is senior news writer for SearchStorage.com.