
Object storage architecture removes file crawl issues

When dealing with large data sets, you don't want to use system resources to examine all files to get your information. Object storage, with its unique identifiers, eases the process.

At its lowest level, all data is stored as blocks. Object storage is a layer above block storage that combines data; metadata, the details and descriptions of the stored data; and a unique identifier, and packages them as a discrete object. Because object storage is a layer above block storage, it uses the same hardware, including x86 processors, memory, HDDs and flash solid-state drives. There is no need for proprietary or unique object storage hardware. Most object storage runs on commodity, off-the-shelf white-box servers with embedded HDDs and SSDs.
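
To make that packaging concrete, here is a minimal sketch in Python of what an object might look like conceptually. The StorageObject class, its fields and the dictionary standing in for the storage pool are hypothetical illustrations, not any vendor's actual format.

```python
import uuid
from dataclasses import dataclass, field

# A hypothetical object: data, descriptive metadata and a unique
# identifier, packaged together as a single discrete unit.
@dataclass
class StorageObject:
    data: bytes
    metadata: dict  # e.g., access policy, encryption flag, retention rule
    object_id: str = field(default_factory=lambda: str(uuid.uuid4()))

# A flat address space: no directories, just identifier -> object.
flat_store: dict[str, StorageObject] = {}

obj = StorageObject(
    data=b"quarterly-report-contents",
    metadata={"owner": "finance", "encrypted": True, "retention_days": 2555},
)
flat_store[obj.object_id] = obj

# Retrieval uses only the unique identifier -- no path, no directory walk.
print(flat_store[obj.object_id].metadata["owner"])
```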

An object storage architecture generally contains extensive amounts of metadata. Common examples include security policies, such as who has access to an object and whether it is encrypted; data protection policies; and management policies.

Objects are not organized in an index like files in a file store or NAS; instead, they are stored in a flat address space. Locating and manipulating an object is done via its unique identifier and metadata. That's profoundly different from traditional block storage, where data is located by where it is physically stored in the storage system structure, or a file location is pinpointed via a centralized file directory.

Objects work best for large data sets

The flat namespace of an object storage architecture makes it better suited to large amounts of data than traditional NAS or SAN storage systems. Searching a file store or NAS involves a detailed examination -- commonly referred to as a file crawl -- of the complete index to find a single file. That process consumes file system resources that affect all reads and writes, and the time it takes grows rapidly as the file system does. This makes the file index a choke point when system access demand is high and file counts are extensive.


Object storage searches are measurably faster because they only need to search on the unique identifier and metadata. With no file system index to crawl, object storage is highly scalable with little to no impact on performance.
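
The difference is easy to see in a sketch. Below is a hypothetical comparison in Python: a file crawl must walk every directory entry, while a flat namespace resolves an object in a single lookup. Both functions are illustrative only.

```python
import os

# File-store approach: a crawl walks the entire directory tree to find
# one file -- cost grows with the total number of files stored.
def find_file(root: str, name: str):
    for dirpath, _dirnames, filenames in os.walk(root):
        if name in filenames:
            return os.path.join(dirpath, name)
    return None  # examined every directory along the way

# Object-store approach: the flat namespace resolves the unique
# identifier directly -- an O(1) lookup no matter how many objects exist.
def find_object(flat_store: dict, object_id: str):
    return flat_store.get(object_id)
```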

Most object architectures have file interfaces, such as NFS, SMB and Hadoop Distributed File System (HDFS), in addition to standard RESTful APIs (application programming interfaces). This enables object storage to write and read data in a fashion similar to NAS while keeping its advantages. The HDFS interface allows object storage to serve as more cost-effective storage for Hadoop projects.
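
As a sketch of the RESTful access pattern, here is how an object write and read might look against an S3-compatible endpoint using Python's boto3 SDK. The endpoint URL, bucket name, keys and credentials are all placeholders; any object store that exposes the S3 API could stand in.

```python
import boto3

# Hypothetical S3-compatible endpoint; credentials are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.com",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# PUT: write the object with user-defined metadata attached.
s3.put_object(
    Bucket="archive",
    Key="reports/q3-report.pdf",
    Body=b"...report contents...",
    Metadata={"owner": "finance", "retention-days": "2555"},
)

# GET: read it back by its key alone -- no directory crawl involved.
response = s3.get_object(Bucket="archive", Key="reports/q3-report.pdf")
print(response["Metadata"])
```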

These differences make an object storage architecture a much more efficient and economical fit for several types of applications, including:

  • Active or cold archiving
  • Search
  • Analytics
  • Backups
  • Compliance
  • Social media
  • File sharing
  • Cloud storage

It takes little imagination to understand why object storage has become the primary mass data storage for most cloud storage providers, such as Amazon Web Services, Google, IBM SoftLayer, Microsoft Azure and many others.

Object storage architecture steps up data protection

The extensive metadata and flat storage pool structure of an object storage system make it an ideal candidate for erasure codes. Erasure codes require quite a bit of metadata, but they are a significantly more economical and resilient form of data protection than traditional RAID in the event of a disk or hardware failure. Erasure coding breaks the data to be stored into a number of unique objects; the total count is known as the width. Reading the data back requires only a subset of the full width, called the breadth. Once the entire breadth is read, the original data is available; the entire width does not have to be read to reconstruct the complete data.

A failure to read all the unique objects only means a failure occurred somewhere in the reading process. The data itself is unaffected. New objects are then created to replace the ones that failed or did not come back. Erasure coding is much more efficient than RAID or multi-copy mirroring in the amount of over-provisioned storage required.

This becomes increasingly noticeable as the number of concurrent hardware failures requiring protection increases. Take the example of data that must remain resilient against six concurrent hardware failures. Multi-copy mirroring would require seven times the baseline storage, or 600% overprovisioning. RAID cannot provide six levels of parity, so the closest option would be mirrored triple-parity RAID. That configuration would require approximately 2.5 times the baseline storage, or 150% overprovisioning, and it would significantly reduce storage performance, especially during rebuilds. An object storage architecture using erasure codes would require a width of 26 with a breadth of 20 or, for more performance, a width of 16 with a breadth of 10. That works out to 1.3 to 1.6 times the baseline storage, which translates into 30% to 60% overprovisioning. That's a huge difference in cost for the same level of hardware protection.
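
The overprovisioning arithmetic above can be checked with a few lines of Python. The helper simply divides raw capacity consumed by usable capacity; the geometries are the ones from the paragraph.

```python
# Overprovisioning = (raw capacity consumed / usable capacity) - 1.
def overprovisioning(raw_multiple: float) -> str:
    return f"{(raw_multiple - 1) * 100:.0f}%"

# Seven copies survive six concurrent failures.
print("7-copy mirroring:", overprovisioning(7.0))   # 600%

# Erasure coding consumes width/breadth times the baseline data.
print("EC 26/20:", overprovisioning(26 / 20))       # 30%
print("EC 16/10:", overprovisioning(16 / 10))       # 60%
```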

Next Steps

A complete guide to object systems

The benefits of object storage

Should you buy or build your own object storage system?


Join the conversation



What specific features do you believe make an object architecture preferable to block storage?
Wish there could be a picture of the erasure coding. Maybe it's just because I have a head cold but I'm having a hard time following the written description.

Have a look at
where the leading object storage vendor will show you how you can benefit from object storage and how erasure coding works.

Regards, Roger
Thanks, but I'm not seeing any reference to erasure coding on that page.
Well, I generally like reading or listening to Mr. Staimer expound on data storage, but he needlessly complicated his explanation of how erasure coding protects data objects.

Erasure coding is usually contrasted with replication.  Replication does what it sounds like. It makes copies of data objects in order to protect them and disperses them throughout the object-based storage cluster.

Most erasure coding is based on Reed-Solomon error correction codes, which go back to the days of X.25 packet-switched data networks.

When applied to data protection in object-based storage clusters, erasure coding creates data fragments and generates parity fragments for every object being stored.  There are numerous ways to do this.

A simple example would be to create 4 data fragments and generate 2 parity fragments for each object being stored.  Using a 4+2 erasure code means that any 4 of the fragments (data or parity) are needed to read the object.

Each fragment of the object is dispersed in the object-based storage cluster.  In this 4+2 erasure code example, you would need 6 storage servers in the cluster.  Each storage server would hold 1 fragment (data or parity).  The cluster could survive the loss of 2 storage servers and still be able to read the erasure coded object.
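
To make the fragment-and-parity idea concrete without pulling in a full Reed-Solomon library, here is a simplified single-parity analogue in Python: a 4+1 scheme where any 4 of the 5 fragments recover the object. A real 4+2 code like the one described above would use Reed-Solomon arithmetic so that any two losses are survivable; this sketch tolerates only one.

```python
# Simplified 4+1 erasure-coding analogue: 4 data fragments plus 1 XOR
# parity fragment. Any single lost fragment can be rebuilt from the
# other four. (A real 4+2 scheme uses Reed-Solomon codes to survive
# two concurrent losses.)

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(obj: bytes, k: int = 4) -> list:
    size = -(-len(obj) // k)                  # fragment size, rounded up
    padded = obj.ljust(k * size, b"\x00")
    data = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = data[0]
    for frag in data[1:]:
        parity = xor_bytes(parity, frag)
    return data + [parity]                    # 5 fragments to disperse

def decode(fragments: list, length: int) -> bytes:
    # fragments is the 5-slot list with at most one entry set to None.
    missing = [i for i, f in enumerate(fragments) if f is None]
    if missing:
        i = missing[0]
        rebuilt = None
        for j, f in enumerate(fragments):
            if j != i:
                rebuilt = f if rebuilt is None else xor_bytes(rebuilt, f)
        fragments[i] = rebuilt                # XOR of the survivors
    return b"".join(fragments[:4])[:length]

obj = b"object payload to protect"
frags = encode(obj)
frags[2] = None                               # simulate a lost storage server
assert decode(frags, len(obj)) == obj
```

Swapping the XOR parity for Reed-Solomon arithmetic over the same fragment layout is what gives the 4+2 example its ability to lose any two of the six storage servers.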

Although erasure coding is more storage efficient than replication, it does require additional computation by the storage servers, which increases latency in reading objects, and it isn't typically deployed over multiple data centers unless something like hierarchical erasure coding is implemented.

Some object-based storage software vendors like IBM (Cleversafe) and HGST (Amplidata) only use erasure coding and they support the use of multiple data centers.  Other object-based storage software vendors, like Cloudian and Caringo, support the use of replication and erasure coding in the same storage cluster.

It is possible to initially store data objects using replication and later erasure code the same data objects to reduce the storage space overhead.  This might be a policy-driven event based on how frequently an object is read.

Erasure coding seems similar to hardware- or software-based RAID, but they are different.  Erasure coding operates at the object level, and RAID operates at the block level.

Interesting. Will have to look into this further as I have only worked with DB2 and SQL Server for data storage.
@ToddN2000, replication and erasure coding are typically used to protect unstructured data.

The data contained in DB2 and SQL Server databases is typically referred to as structured data.  It is estimated that only 20 percent or less of the data generated in an organization is structured data. Database applications have their own methods of protecting the data held within their database structures.

Unstructured data, which is estimated to account for 80 percent or more of the data created in an organization, has no data protection unless it is stored in a RAID architecture and backed up to disk or tape.

Unstructured data is growing 10x to 50x faster than structured data.  Object-based storage clusters provide protection for unstructured data through replication or erasure coding, and they do so at a scale that cannot be accomplished using RAID architectures.