The growth of unstructured data continues unabated, with no end in sight. Over the last decade, email, videos, tweets, photos and more -- all created by human hands -- have been key generators of all this unstructured information. As smartphone adoption reaches maturity worldwide, you'd think unstructured data growth would start to moderate. Think again.
As the internet of things kicks in, machine-generated data will soon dwarf data generated by humans. Whether you are talking about video surveillance, smart cars, elevators, refrigerators, traffic meters or other smart devices, it seems they already (or will soon) have the power to measure and monitor something. Combine this capability with ubiquitous and inexpensive internet access and you will find that these machines are sending information to data centers worldwide at an alarming rate.
At the turn of the century, scale-out NAS easily handled this onslaught of predominately human-generated unstructured data. But as global public cloud services exploded, a new type of storage was required, hence the industry's invention of object storage.
To help you choose the right option for your needs, I explain how scale-out NAS and object storage work, their relative strengths and weaknesses, how easy or difficult they are to manage, and how well they integrate with existing infrastructure and other architectural considerations.
How scale-out NAS works
If you use a personal computer, you are familiar with a file system. It is a mechanism for easily storing and organizing different types of data by structuring information in a human-readable manner. A scale-out NAS device is like your C: drive on steroids: a single, massive system that can store millions, or even billions, of files. Scale-out means it operates as one very large global namespace across multiple individual servers or nodes. The data is stored hierarchically, meaning the location and name of a file depend on where it sits within a system of folders and subfolders.
First-generation, multinode scale-out NAS devices improved the scalability of single file systems, but typically at the expense of small file performance. Performance, especially for metadata searches, would also slow significantly as the total number of files increased. Recent generations have greatly improved small file performance by using flash-first designs for metadata management and ever-faster inter-nodal network speeds. So, today, some modern scale-out NAS devices can easily store billions of files without compromising the speed of metadata searches.
Key strengths of scale-out NAS
Hierarchical and human-readable metadata. Through a simple file system browser, you can see descriptive file names and easily organize those files into descriptive folders. For example, you can copy a photograph named "IMG_XXX" into a folder where you can effortlessly rename it "Aunt Alice at Disneyland - May of 2016." You can also store that file in a folder called "2016 Pictures." One popular use case for scale-out NAS is consolidating departmental file shares and home directories for thousands of employees. That way, IT administrators can centrally manage, secure and back up all this data.
POSIX-compliant file access. This feature is critical for the thousands of legacy applications written to the Portable Operating System Interface (POSIX) standard. It lets companies centrally manage security and file access, and it ensures that multiple applications can share a scale-out NAS device without one application overwriting a file another is using.
Another key POSIX feature, the fsync call, forces a file's buffered writes out to stable storage. Once the call returns successfully, the written data should survive a power loss.
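As a rough illustration of what fsync means to an application, here is a minimal Python sketch. The helper name `durable_write` and the file name are assumptions made up for this example; `os.fsync` is the standard library's wrapper around the POSIX call.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data and ask the OS to flush it to stable storage before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # push Python's userspace buffer down to the OS
        os.fsync(f.fileno())  # ask the OS to push its cache down to the device


# Hypothetical usage: the file name is illustrative only.
demo_path = os.path.join(tempfile.mkdtemp(), "report.txt")
durable_write(demo_path, b"quarterly numbers")
```

Without the `os.fsync` step, the write may sit in an operating system cache for a while, which is the window a power loss can exploit.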
Performance with real-time data consistency. A real-time metadata locking system prevents multiple applications or users from writing to the same data at the same time. This allows files to be updated without fear of simultaneous changes by another application, which would leave files corrupt or out of sync.
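The article does not describe how any particular NAS product implements its locking, so the following is only a loose analogy: a sketch using advisory file locks through Python's `fcntl.flock` on a local Linux or macOS file system. The helper `try_exclusive_lock` and the file name are hypothetical.

```python
import fcntl
import os
import tempfile


def try_exclusive_lock(f) -> bool:
    """Try to take an exclusive advisory lock on an open file without blocking."""
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False


path = os.path.join(tempfile.mkdtemp(), "shared.dat")
open(path, "wb").close()

# Two independent opens stand in for two applications sharing one file.
writer_a = open(path, "r+b")
writer_b = open(path, "r+b")

a_locked = try_exclusive_lock(writer_a)  # first writer asks for the lock
b_locked = try_exclusive_lock(writer_b)  # second writer asks while it is held

fcntl.flock(writer_a, fcntl.LOCK_UN)     # first writer releases the lock
b_retry = try_exclusive_lock(writer_b)   # second writer asks again

writer_a.close()
writer_b.close()
```

On a local file system, the second request is expected to be refused until the first writer releases the lock. A scale-out NAS has to provide equivalent guarantees across many nodes, which is part of why its metadata layer is more demanding.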
Vendors of scale-out NAS products include: EMC Isilon, IBM Spectrum Scale, Qumulo and Scality.
How object storage works
When scale-out NAS could not keep pace with emerging web-scale requirements, a new access method called object storage emerged, which added global scalability but relinquished easy file access. Object storage lacks the hierarchical (and more complex) metadata attributes controlled through the POSIX standard, and it features very few commands (e.g., Put, Get, Delete) to keep the interface extremely simple to use.
Command simplicity also means that objects now exist in a single flat address space, each with a unique identifier. This simplifies management of high-level metadata as the same unique identifier is used to retrieve the object data. The power of this simple concept is in retrieving data without the need to know how or where the underlying data is stored.
Each object also contains an extended set of metadata (much more metadata than a traditional file system) that can richly describe object contents. Objects can now be self-describing and can include application-specific details required to interpret the data. Because the metadata is not restricted to a fixed set of file system attributes (though individual products may cap its size), programmers can add a rich set of information about the object itself.
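To make the flat address space and self-describing metadata concrete, here is a toy in-memory sketch. The class `FlatObjectStore`, its method names and the metadata keys are all invented for illustration and do not correspond to any vendor's API.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class StoredObject:
    data: bytes
    metadata: Dict[str, str] = field(default_factory=dict)


class FlatObjectStore:
    """Toy store: one flat dictionary keyed by object ID, with no folders."""

    def __init__(self) -> None:
        self._objects: Dict[str, StoredObject] = {}

    def put(self, data: bytes, metadata: Optional[Dict[str, str]] = None) -> str:
        object_id = uuid.uuid4().hex  # the unique identifier is the only address
        self._objects[object_id] = StoredObject(data, dict(metadata or {}))
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id].data

    def get_metadata(self, object_id: str) -> Dict[str, str]:
        return dict(self._objects[object_id].metadata)

    def delete(self, object_id: str) -> None:
        del self._objects[object_id]


store = FlatObjectStore()
photo_id = store.put(
    b"...jpeg bytes...",
    {"subject": "Aunt Alice at Disneyland", "taken": "2016-05"},
)
```

The caller keeps only `photo_id`. It never learns, or needs to learn, where the bytes live, and the descriptive information travels with the object rather than being encoded in a folder path.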
Extended commands allow for the retrieval of metadata, which assists in cataloging and maintaining what is stored in an object. Object location lookup, meanwhile, is simple to process and knowledge of every object's location isn't required or maintained by every node in the system. This allows object storage to scale out to an almost unlimited size.
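The article does not say which lookup scheme object stores use. One common technique that avoids a central location table is rendezvous (highest-random-weight) hashing, sketched below under that assumption. The node names and the function `locate` are hypothetical.

```python
import hashlib
from typing import List


def locate(object_id: str, nodes: List[str], replicas: int = 2) -> List[str]:
    """Pick the nodes that hold an object by rendezvous hashing."""

    def weight(node: str) -> int:
        # Every client computes the same score for a given (node, object) pair.
        digest = hashlib.sha256(f"{node}:{object_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # The highest-scoring nodes own the object; no lookup table is consulted.
    return sorted(nodes, key=weight, reverse=True)[:replicas]


nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = locate("photo-123", nodes)
```

Any client that knows the node list can compute `placement` on its own, and removing a node that does not hold the object leaves the placement unchanged. That property is one reason schemes like this scale well.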
Key strengths of object storage devices
Global distribution and scale. Object storage's simple indexing scheme means you can build a multi-nodal system that scales massively with global distribution. Additionally, object storage devices assume that copies of data in different geographical locations will eventually become consistent. This eventual consistency means you forgo real-time file locking, which in turn enables massive scaling with good performance. However, it also means concurrent writes may collide, essentially creating multiple versions of the data. Users of applications like Dropbox sometimes see this issue when multiple copies of a file are created because of synchronization conflicts with another user.
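How collisions are detected varies by product, and the article does not specify a mechanism. The sketch below assumes vector clocks, one well-known way to tell whether two writes were concurrent. The site names and the functions `descends` and `reconcile` are illustrative.

```python
from typing import Dict, List, Tuple

Clock = Dict[str, int]          # site name -> number of writes accepted there
Version = Tuple[Clock, bytes]   # a clock plus the data written under it


def descends(a: Clock, b: Clock) -> bool:
    """True if clock a has seen every write that clock b has seen."""
    return all(a.get(site, 0) >= count for site, count in b.items())


def reconcile(a: Version, b: Version) -> List[Version]:
    """Return the surviving version(s) once two sites exchange updates."""
    if descends(a[0], b[0]):
        return [a]      # a already includes b's history
    if descends(b[0], a[0]):
        return [b]      # b already includes a's history
    return [a, b]       # concurrent writes: keep both as conflicting copies


base: Clock = {"london": 1}
london: Version = ({**base, "london": 2}, b"edited in London")
tokyo: Version = ({**base, "tokyo": 1}, b"edited in Tokyo")
survivors = reconcile(london, tokyo)
```

Because neither edit had seen the other, `reconcile` keeps both, which is the "multiple versions" outcome described above. An edit made on top of an already synchronized copy would simply replace the older one.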
Self-healing with high data durability. Data protection is handled by keeping multiple copies of data across global systems, and erasure coding improves data reliability and durability beyond the individual error rate of the devices themselves. Using these technologies enables companies like Amazon to design for up to 11 9s (or a 99.999999999% level) of durability. If a node fails or an error is detected, the affected data is automatically rebuilt across other available nodes.
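The simplest possible erasure code is a single XOR parity fragment, which can rebuild any one lost fragment. The sketch below uses it purely to show the rebuild idea; production systems generally use stronger codes (such as Reed-Solomon) that tolerate several simultaneous failures. The fragment contents are made up.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Three equal-length data fragments, each imagined to live on its own node.
data_blocks = [b"node", b"fail", b"test"]
parity = xor_blocks(data_blocks)  # parity fragment stored on a fourth node

# Simulate losing the second fragment and rebuilding it from the survivors.
rebuilt = xor_blocks([data_blocks[0], data_blocks[2], parity])
```

Here `rebuilt` equals the lost fragment, at the cost of one extra fragment of storage rather than a full second copy.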
Low-cost bulk storage. Object storage architecture inherently tends toward a lower overall storage cost, primarily because it uses commodity hyper-scale computers and has simpler metadata management requirements. Generally, maintaining a high-performance metadata system costs more to implement; an example is a scale-out NAS device, where more flash-based storage and higher-speed inter-nodal network infrastructure are needed to guarantee a consistent quality of service. While some object storage devices employ flash to boost performance, in general, these devices work well without flash-based storage.
Vendors of object storage products include: Caringo, Cloudian, EMC Elastic Cloud Storage (ECS), IBM Cleversafe and Scality (Scality claims both scale-out NAS and object).
The features of scale-out NAS and object storage devices overlap more and more. For example, object storage devices are adding multiple interface-access methods such as file and block. They are also adding features such as versioning and storage analytics.
Scale-out NAS products are rapidly changing as well. Vendors are modernizing their products with flash-first architectures running on the same commodity servers that object storage systems operate on, for example, and new metadata techniques enable scalability orders of magnitude larger than previous clustered file systems. NAS systems are also adding various object storage protocols and APIs in conjunction with their single namespace file systems.
As these two storage architectures approach feature equivalence, vendors of both types of systems are staking claims to the same unstructured data workloads, causing confusion as to which type of storage product to choose. The eventual winners will be those vendors that can service the widest variety of unstructured workloads at a total cost of ownership lower than competitive offerings.
I highly recommend that you thoroughly research the numerous alternatives available and, if possible, validate vendor claims through on-premises proof-of-concept testing before committing to one technology approach over the other. The following are general recommendations for which architectural approach would best fit your environment:
Lead with scale-out NAS if:
- You have many applications that require POSIX-compliant shared access
- You have a requirement for mixed workloads of small and large files
- You require a more consistent quality of service
- Data consistency for your critical applications is paramount
Lead with object storage if:
- Most of your applications have been written to work natively with the object storage protocol/API
- You require a global distribution of the data (multiple copies in multiple geographies)
- You need to scale to billions or even trillions of objects inside one cluster domain
- Cost per GB is a higher consideration than performance (e.g., for archive and backup use cases)
As the onslaught of unstructured data continues, it is comforting to know innovation is still alive and well, and there are numerous scale-out-capable storage products to choose from. It remains to be seen whether file-based or object-based architectures will eventually prevail, however.
About the author:
Jeff Kato is a senior storage analyst at Taneja Group with a focus on converged and hyper-converged infrastructure and primary storage.