This content is part of the Essential Guide: Unstructured data storage showdown: Object storage vs. scale-out NAS

Object storage architecture removes file crawl issues

When dealing with large data sets, you don't want to use system resources to examine all files to get your information. Object storage, and its unique identifiers, eases the process.

At its lowest level, all data is stored as blocks. Object storage is a layer above block storage that combines three things: the data itself; metadata, which is the details and descriptions of the stored data; and a unique identifier. It packages them as a discrete object. Because object storage is a layer above block storage, it uses the same hardware, including x86 processors, memory, HDDs and flash solid-state drives (SSDs). There is no need for proprietary or unique object storage hardware. Most object storage runs on commodity, off-the-shelf white-box servers with embedded HDDs and SSDs.

An object storage architecture generally contains extensive amounts of metadata. Common examples of metadata include security policies, such as who has access to objects and whether the object is encrypted; data protection policies; or management policies.

Objects are not organized in a hierarchical index like files in a file store or NAS; instead, they are stored in a flat address space. Locating and manipulating an object is done via its unique identifier and metadata. That's profoundly different from traditional block storage, where data is located by where it physically resides in the storage system, and from file storage, where a file's location is pinpointed via a centralized file directory.
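As a rough illustration of the flat address space, here is a minimal in-memory sketch in Python. The class names, the use of random UUIDs as identifiers and the `find` helper are assumptions made for this example; they do not describe any particular product.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """One object: payload, descriptive metadata and a unique identifier."""
    data: bytes
    metadata: dict
    # Assumption: a random UUID stands in for the system-generated identifier.
    object_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class FlatObjectStore:
    """Toy store with a single flat namespace and no directories."""

    def __init__(self):
        self._objects = {}  # object_id -> StoredObject

    def put(self, data, **metadata):
        obj = StoredObject(data=data, metadata=metadata)
        self._objects[obj.object_id] = obj
        return obj.object_id

    def get(self, object_id):
        # Retrieval needs only the identifier, never a path.
        return self._objects[object_id]

    def find(self, **criteria):
        """Return IDs of objects whose metadata matches every criterion."""
        return [
            oid for oid, obj in self._objects.items()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        ]


store = FlatObjectStore()
report_id = store.put(b"quarterly numbers", owner="finance", encrypted=True)
photo_id = store.put(b"\x89PNG...", owner="marketing", encrypted=False)

print(store.get(report_id).data)
print(store.find(owner="finance") == [report_id])
```

The `find` method here scans every object for simplicity; a real system would presumably keep its metadata in a searchable form rather than scanning.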

Objects work best for large data sets

The flat namespace of an object storage architecture makes it better suited for large amounts of data than traditional NAS or SAN storage systems. Searching a file store or NAS involves a detailed examination -- commonly referred to as a file crawl -- of the complete index to find a single file. That process consumes file system resources that affect all reads and writes. The time consumed grows rapidly as the file system grows. This makes the file index a choke point when system access demand is high and file counts are extensive.


Object storage searches are measurably faster because they only need to search on the unique identifier and metadata. With no file system hierarchy or index to crawl, object storage can scale to very large object counts with little to no impact on performance.
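The contrast between a crawl and an identifier lookup can be sketched in a few lines of Python. This is a toy model under stated assumptions: the directory tree is a nested dictionary, the layout of 100 directories with 100 files each is hypothetical, and the crawl represents the worst case of searching by name without knowing the path.

```python
def crawl(tree, target, visited=0):
    """Depth-first search of a nested directory dict.

    Returns (found, entries_visited).
    """
    for name, child in tree.items():
        visited += 1
        if name == target:
            return True, visited
        if isinstance(child, dict):
            found, visited = crawl(child, target, visited)
            if found:
                return True, visited
    return False, visited


def build_tree(dirs, files_per_dir):
    """Hypothetical layout: `dirs` directories, each with `files_per_dir` files."""
    return {
        f"dir{d}": {f"file{d}_{f}": b"" for f in range(files_per_dir)}
        for d in range(dirs)
    }


tree = build_tree(dirs=100, files_per_dir=100)

# Hierarchical search: every entry may need to be examined.
found, visited = crawl(tree, "file99_99")
print(found, visited)

# Flat namespace: the same items keyed directly by identifier.
flat = {name: data for files in tree.values() for name, data in files.items()}
print("file99_99" in flat)  # a single hash lookup, independent of item count
```

In this model the crawl touches every directory and file before it reaches the last one, while the flat lookup cost stays the same no matter how many items the store holds.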

Most object architectures have file interfaces, such as NFS, SMB and Hadoop Distributed File System (HDFS), in addition to standard RESTful APIs (application programming interfaces). This enables object storage to write and read data in a fashion similar to NAS, while keeping its advantages. The HDFS interface allows object storage to be a more cost-effective storage option for Hadoop projects.

These differences make an object storage architecture a much more efficient and economical fit for several types of applications, including:

  • Active or cold archiving
  • Search
  • Analytics
  • Backups
  • Compliance
  • Social media
  • File sharing
  • Cloud storage

It takes little imagination to understand why object storage has become the primary mass data storage for most cloud storage providers, such as Amazon Web Services, Google, IBM SoftLayer, Microsoft Azure and so many others.

Object storage architecture steps up data protection

The extensive metadata and flat storage pool structure of an object storage system make it an ideal candidate for erasure codes. Erasure codes require quite a bit of metadata, but are a significantly more economical and resilient form of data protection than traditional RAID in the event of a disk or hardware failure. Loosely speaking, erasure coding breaks the data to be stored into a number of unique objects; that number is known as the width. Reading the data back requires only a subset of the full width, called the breadth. Once a full breadth of objects has been read, the original data is available. The entire width does not have to be read to provide the complete data.

A failure to read all the unique objects only means a failure occurred somewhere in the reading process. The data itself is unaffected. New objects are then created to replace the ones that failed or did not come back. Erasure coding is much more efficient than RAID or multi-copy mirroring in the amount of over-provisioned storage required.
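To make the width and breadth idea concrete, here is a deliberately simplified Python sketch. It uses a single XOR parity fragment, so the width is the breadth plus one and only one lost fragment can be recovered. That is an assumption chosen for brevity; production erasure codes tolerate more losses and use different mathematics. The function names are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data, breadth):
    """Split data into `breadth` fragments and append one XOR parity fragment.

    Returns (fragments, original_length); the width is breadth + 1.
    """
    frag_len = -(-len(data) // breadth)  # ceiling division
    padded = data.ljust(frag_len * breadth, b"\0")
    fragments = [padded[i * frag_len:(i + 1) * frag_len] for i in range(breadth)]
    fragments.append(reduce(xor_bytes, fragments))
    return fragments, len(data)


def decode(fragments, original_length):
    """Rebuild the data when at most one fragment is None (lost)."""
    missing = [i for i, f in enumerate(fragments) if f is None]
    if len(missing) > 1:
        raise ValueError("single parity can only recover one lost fragment")
    if missing:
        survivors = [f for f in fragments if f is not None]
        fragments = list(fragments)
        # The XOR of all survivors equals the missing fragment.
        fragments[missing[0]] = reduce(xor_bytes, survivors)
    return b"".join(fragments[:-1])[:original_length]


fragments, length = encode(b"object storage payload", breadth=4)
fragments[2] = None  # simulate a fragment that did not come back
print(decode(fragments, length))
```

Any four of the five fragments are enough to return the payload, and the same XOR step that fills the gap during a read is what would be used to create a replacement fragment.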

This becomes increasingly noticeable as the number of concurrent hardware failures requiring protection increases. Take the example of data that's required to be resilient against six concurrent hardware failures. Multi-copy mirroring would require seven times the baseline storage or 600% overprovisioning. RAID does not have the ability to provide six levels of parity, so your best bet would be triple-parity RAID, mirrored. That configuration would require approximately 2.5 times the amount of baseline storage or 150% overprovisioning. It would also significantly reduce storage performance, especially during rebuilds. An object storage architecture using erasure codes would require a width of 26 by a breadth of 20 or, for more performance, a width of 16 by a breadth of 10. That would require 1.3 to 1.6 times the amount of baseline storage, which translates into 30% to 60% overprovisioning. That's a huge difference in costs for the same level of hardware protection.
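The overprovisioning figures above follow from simple ratios, which this short Python sketch reproduces. The 12-drive data group used for the mirrored triple-parity case is an assumption picked to match the article's approximate 2.5 times figure; other group sizes give different results.

```python
def overhead(raw_units, usable_units):
    """Return (multiple of baseline storage, overprovisioning in percent)."""
    multiple = raw_units / usable_units
    return multiple, (multiple - 1) * 100


schemes = {
    # Seven full copies survive the loss of any six.
    "7-copy mirroring": (7, 1),
    # Assumed group: 12 data + 3 parity drives, then mirrored.
    "mirrored triple parity": (2 * (12 + 3), 12),
    # Erasure codes: width / breadth; width - breadth = 6 tolerated failures.
    "erasure code 26/20": (26, 20),
    "erasure code 16/10": (16, 10),
}

for name, (raw, usable) in schemes.items():
    multiple, percent = overhead(raw, usable)
    print(f"{name}: {multiple:.2f}x baseline, {percent:.0f}% overprovisioning")
```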
