For the past 20-plus years, block and file have been the two main external shared storage system protocols. Both protocols have been successful due to the ubiquity of the networking interfaces that drive them. In the case of block devices, that has been Fibre Channel and Ethernet (iSCSI); for file, it has been Ethernet (CIFS/SMB and NFS).
However, block and file are not well-suited for building large-scale data repositories, mainly due to issues with data protection, indexing and addressing. RAID doesn't scale adequately, and file-based protocols start to run into metadata management issues at petabyte-sized volumes of data and billions of files.
Object storage has emerged as an answer to storing data at the multi-petabyte level. However, many applications expect traditional SAN and NAS interfaces, so integrating an object store is not as straightforward as using block- and file-based systems. But it's being done today and there are a variety of options available for making object storage work with your organization's key applications.
Why object storage?
Object storage has unique attributes that overcome the issues of scalability and metadata management seen in traditional storage platforms. These features include:
Dispersed and geo-dispersed data protection. Protection mechanisms are typically implemented with erasure coding, a form of forward error correction, which splits content into encoded fragments and allows lost or corrupted data to be recreated from a subset of those fragments. The exact ratio of redundant to primary data is determined by the service level that needs to be applied to that content. As a protection mechanism, erasure coding is much more scalable and capacity-efficient than RAID (albeit at the cost of additional CPU overhead). Erasure coding also offers business continuity/disaster recovery (BC/DR) benefits by allowing subsets of erasure-coded data to be placed in geographically distant locations. This can protect against the failure of one (or more) of the systems in these locations. The specific configuration of an object storage system depends on an organization's data protection requirements.
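To make the idea of rebuilding data from surviving fragments concrete, here is a minimal Python sketch. It assumes the simplest possible scheme, a single XOR parity fragment that can repair one loss; production object stores typically use more capable codes (Reed-Solomon, for example) that tolerate several simultaneous failures. The function names are illustrative and not taken from any product.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal fragments and append one XOR parity fragment."""
    frag_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(frag_len * k, b"\0")  # zero-pad the final fragment
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    return frags + [reduce(xor_bytes, frags)]


def recover(frags: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild at most one missing fragment (marked None) from the survivors."""
    missing = [i for i, f in enumerate(frags) if f is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost fragment")
    repaired = list(frags)
    if missing:
        survivors = [f for f in frags if f is not None]
        repaired[missing[0]] = reduce(xor_bytes, survivors)
    return repaired


def storage_overhead(k: int, m: int) -> float:
    """Raw capacity consumed per unit of user data with k data + m parity fragments."""
    return (k + m) / k
```

As a point of comparison, a hypothetical layout of 10 data and 6 parity fragments gives storage_overhead(10, 6) == 1.6, whereas keeping three full replicas consumes 3.0 times the user data.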
Improved data management. With any storage system, there is always the risk of data loss or corruption. Today's disk and solid-state storage media are reliable, but not totally error-free. That can be a problem with very large-scale data repositories. Storage media does fail and may be subject to silent corruption or unrecoverable read errors (UREs) that put data at risk. Object stores mitigate these problems using data scrubbing techniques that validate and rebuild potentially corrupt or missing data. The use of erasure coding and the typical write-once nature of object store data allows failed data to be recreated as a background task with little or no impact to production operations.
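The scrubbing pass described above can be sketched in a few lines of Python. This is an illustration only: it assumes each fragment's SHA-256 digest was recorded at write time, and the names checksum and scrub are hypothetical rather than any vendor's API. A real system would hand the damaged IDs to its erasure-coding layer for background repair.

```python
import hashlib
from typing import Dict, List


def checksum(fragment: bytes) -> str:
    """Digest recorded when a fragment is first written."""
    return hashlib.sha256(fragment).hexdigest()


def scrub(fragments: Dict[str, bytes], expected: Dict[str, str]) -> List[str]:
    """Return IDs of fragments that are missing or no longer match their digest."""
    damaged = []
    for frag_id, digest in expected.items():
        data = fragments.get(frag_id)
        if data is None or checksum(data) != digest:
            damaged.append(frag_id)  # candidate for background rebuild
    return damaged
```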
The ability of object stores to manage device failure at scale (and the fact that object stores do not have high I/O performance requirements) means that systems can use lower-cost, higher-capacity drives than their block or file counterparts. At scale, the ability to maximize capacity and reduce the cost per TB becomes a design imperative.
Detailed and extensible metadata. Block-based storage systems collect very little information about the contents of the data being stored in the system. The metadata that does exist is used to map logical concepts such as LUNs or volumes to the physical location of that data on disk. Modern block storage systems use metadata to track the application of space-saving features such as thin provisioning and data deduplication, which are infrastructure rather than content-focused. File-based storage makes use of slightly more metadata, as the nature of storing files requires keeping track of permissions (ACLs), access dates/times and file owners.
Object stores offer much richer metadata capabilities, typically providing extensibility to the metadata model itself by allowing arbitrary key-value pairs (keywords and values) to be stored with each object. Object storage systems have the ability to search metadata quickly and efficiently to locate and retrieve objects from the store.
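As a rough illustration of searchable, extensible metadata, the following Python sketch keeps an inverted index from each key-value pair to the objects that carry it. The class and method names are hypothetical, and a real object store would hold this index in a distributed database rather than in memory.

```python
from typing import Dict, List, Set, Tuple


class MetadataIndex:
    """Toy in-memory index of user-defined key-value metadata."""

    def __init__(self) -> None:
        self._objects: Dict[str, Dict[str, str]] = {}        # object ID -> metadata
        self._index: Dict[Tuple[str, str], Set[str]] = {}    # (key, value) -> object IDs

    def put(self, object_id: str, metadata: Dict[str, str]) -> None:
        """Register a new object; objects are write-once, so IDs cannot be reused."""
        if object_id in self._objects:
            raise KeyError(f"object {object_id!r} already exists")
        self._objects[object_id] = dict(metadata)
        for item in metadata.items():
            self._index.setdefault(item, set()).add(object_id)

    def search(self, **criteria: str) -> List[str]:
        """Return sorted IDs of objects matching every key-value pair given."""
        matches = [self._index.get(item, set()) for item in criteria.items()]
        if not matches:
            return []
        return sorted(set.intersection(*matches))
```

A call such as index.search(project="apollo", type="scan") then returns only the objects tagged with both pairs, without touching the object data itself.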
Simplified data access. At the heart of the technology, object stores use vastly simplified access methods to store and retrieve data. REST APIs based on Web protocols like HTTP allow objects to be accessed through a unique URL. A request combines an HTTP method (such as GET or PUT) with a URL that contains a unique reference code assigned to each object -- the object ID.
In terms of standards, the de facto object API is Amazon Web Services' S3 (Simple Storage Service). The S3 API format has become so ubiquitous that object-based platforms must support it to compete in today's market. The accuracy of S3 support is a key success factor for vendors and their products. Many applications use S3 support as the standard method of writing and reading application data.
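The shape of such REST requests can be sketched with Python's standard library. The endpoint, bucket name and path-style URL layout below are assumptions made for illustration; a real S3-compatible service also requires signed authentication headers, which are omitted here, and the sketch only constructs the requests without sending them.

```python
from urllib.parse import quote
from urllib.request import Request

ENDPOINT = "https://objects.example.com"  # hypothetical endpoint, not a real service


def object_url(bucket: str, object_id: str) -> str:
    """Build a path-style URL that addresses one object (assumed layout)."""
    return f"{ENDPOINT}/{quote(bucket)}/{quote(object_id)}"


def put_request(bucket: str, object_id: str, body: bytes) -> Request:
    """PUT stores the body under the given object ID."""
    return Request(object_url(bucket, object_id), data=body, method="PUT")


def get_request(bucket: str, object_id: str) -> Request:
    """GET retrieves the object with the given ID."""
    return Request(object_url(bucket, object_id), method="GET")
```

In practice most applications would use an S3 client library rather than hand-built requests, but the underlying exchange is the same: an HTTP method applied to the object's URL.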
Versioning. REST-based APIs provide a much easier way to interact with an object store, as almost all commands operate at the object level. In addition, on most object stores, an individual object is immutable, meaning that once created, it cannot be changed. Updates to the data within an object require the user to retrieve the data, change it and store it again in the object store. The result is a new object ID or a new version of the same object. This ability to version data provides an audit trail and archive log that allows previous versions of objects to be retrieved. On systems that provide data deduplication, the overhead of object versioning is restricted to the change in data itself.
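The retrieve, change and store-again cycle can be illustrated with a small in-memory Python sketch. The VersionedStore class is hypothetical and ignores deduplication; it only shows that an update never alters stored data and instead appends a new version that sits alongside the old one.

```python
from typing import Dict, List, Optional


class VersionedStore:
    """Toy store in which every write creates a new immutable version."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[bytes]] = {}  # key -> payloads, oldest first

    def put(self, key: str, data: bytes) -> int:
        """Store data as a new version of key and return its 1-based version number."""
        history = self._versions.setdefault(key, [])
        history.append(bytes(data))
        return len(history)

    def get(self, key: str, version: Optional[int] = None) -> bytes:
        """Return the latest version, or a specific earlier one if requested."""
        history = self._versions[key]
        if version is None:
            return history[-1]
        if not 1 <= version <= len(history):
            raise KeyError(f"{key!r} has no version {version}")
        return history[version - 1]
```

An update is then written as store.put("doc", store.get("doc") + b" edited"), after which both the original and the edited versions remain retrievable.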
Cloud-based object storage integration
All of the major cloud vendors offer some kind of object storage technology and, in many cases, it was the first storage platform they offered. AWS offers S3 and Glacier, Google Cloud Platform offers Cloud Storage and Microsoft Azure offers Blob Storage. Consuming these services is a case of writing for the APIs, all of which are, of course, subtly different. That means vendors must develop software using the API to allow object resources to be used. The alternative is to use a cloud gateway that presents a well-known interface to the end user, such as block or file. The cloud providers already offer these features to some degree. Amazon provides the AWS Storage Gateway, a locally installed software feature that emulates iSCSI storage while storing the content on S3. Microsoft acquired StorSimple in 2012 and now offers a hardware appliance based on the technology as an on-ramp to storing data in Microsoft Azure while using the standard iSCSI interface.
There is also a range of vendors that offer hardware and software products that integrate with the cloud providers. These include CTERA, FXT from Avere Systems, Nasuni, NetApp's AltaVault (based on technology acquired from Riverbed), Panzura and TwinStrata (now part of EMC). However, because these products expose cloud-based object stores through familiar block and file protocols, they don't deliver the full benefits of object storage, per se. So what about IT departments looking to deploy in-house object stores? How does the landscape change when you want to deploy hardware on-site, rather than consuming it as a service from a cloud vendor?
In-house object storage
There are both proprietary and open-source platforms for building object stores. The leaders in the market (according to IDC) are Cleversafe, Scality, DataDirect Networks (DDN), Amplidata and EMC. NetApp, Caringo, Cloudian and HDS are also major players. Some are deployed as appliances and some as software-only, where the customer chooses hardware of their own specification. Looking to the open-source platforms, there are options such as Ceph and OpenStack Swift. Ceph is now part of Red Hat, which offers commercial support, and SwiftStack provides support for Swift.
These products offer the common characteristics of cloud storage (protection, data and metadata management, APIs and versioning). There is widespread (usually native) protocol support among these systems. Scality supports NFS, SMB, Linux FS, REST, CDMI, S3, OpenStack Swift, Cinder and Glance, making the platform readily integrated into OpenStack environments as the persistent storage layer for all requirements.
Of course, object stores must have a good metadata engine and rich application support. Cloudian, for example, uses a modified version of the open-source Cassandra database for both metadata and transaction logging. The database can be shared and distributed across multiple nodes, providing scalability for the metadata function as object volumes increase. Hitachi's HCP is a good example of a platform with strong application support. HDS offers integrated search capabilities (Hitachi Data Discovery Suite), data ingest (with Hitachi Data Ingestor) and integration with secure file sharing through HCP Anywhere.
Most vendors have optimized their technology for performance through tiering, supporting both solid-state and traditional spinning media. DDN's WOS, for example, is capable of supporting up to three million IOPS in a single cluster, with optimization both for small and large file performance. Cleversafe, Cloudian and DDN all use techniques to measure the latency of each node in a cluster, retrieving data from the nodes with the lowest latency scores. This feature is particularly important in geo-dispersed configurations.
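The vendors' actual selection algorithms are not described here, but the general idea of latency-aware reads can be sketched as follows. The sketch assumes that any `needed` fragments are enough to rebuild an object (as with erasure coding) and that a recent latency measurement per node is available; the function name and node names are hypothetical.

```python
from typing import Dict, List


def pick_nodes(latency_ms: Dict[str, float], needed: int) -> List[str]:
    """Return the `needed` nodes with the lowest measured latency, fastest first."""
    if needed > len(latency_ms):
        raise ValueError("not enough nodes to satisfy the read")
    return sorted(latency_ms, key=latency_ms.get)[:needed]
```

In a geo-dispersed cluster this naturally favors fragments held in nearby sites and falls back to distant ones only when local nodes are slow or unavailable.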
Many of these systems offer features that allow nodes to be added to and removed from a storage cluster without interruption to meet availability requirements. When hardware has to be physically relocated, for example, nodes can simply be powered off, moved and re-added to the cluster, with background data management features updating and correcting any changed or corrupted data identified during the move. Most also provide data encryption at rest for added protection in large environments where drive replacements can be frequent. Multi-tenancy is common as well, making them suitable for service provider environments or private clouds.
Finally, some object storage platforms offer integration between private and public object stores. Cloudian and HCP, for example, offer this functionality. This allows organizations to take advantage of the cost effectiveness of the public cloud for certain types of data (such as inactive or rarely accessed content) while retaining the ability to search across on-site and cloud data.