
Yesterday's unified storage is today's enterprise data lake

How unified storage needs to develop to support new applications and new data center environments.

Unified storage has been a data storage industry buzzword for years. When NetApp coined the term years ago, the intention was to describe a single approach for storage operation and management, no matter the size of the platform or the protocols required (which at the time were pretty much NFS, CIFS and Fibre Channel). NetApp used the term to contrast its products with EMC's multiplatform, multi-operating system, multisiloed approach.

Over time, the definition has evolved. Now, unified storage means supporting both file and block data within the same array, eliminating the need to create a separate silo for each one. But with new technologies like object storage and Hadoop steadily making their way into the data center, expect a new definition for unified storage to come into play. We'll call it unified storage 2.0 for now, though some vendors have already adopted the term enterprise data lake. And while that sounds like marketing fluff and vendor posturing, the concept itself is solid and can benefit users in a number of ways.

Market challenges

Managing unstructured data continues to be a challenge for enterprise IT. When Enterprise Strategy Group surveys IT managers about their biggest overall storage challenges, the growth and management of unstructured data often come out at or near the top of the list.

And that challenge isn't going away. Data growth is accelerating, driven by a number of factors, such as:

  • Bigger, richer files. Those super-slow motion videos we enjoy during sporting events are shot at 1,000 frames per second with 2 MB frames. That means 2 GB of capacity is required for every second of super-slow motion video captured. And it's not all about media and entertainment; think about industry-specific use cases that leverage some type of imaging, such as healthcare, insurance, construction, gaming and anyone using video surveillance.
  • More data capture devices. More people are generating more data than ever before. The original Samsung Galaxy S smartphone had a 5 megapixel camera, so each image consumed 1.5 MB of space compressed (JPEG) or 15 MB raw. The latest Samsung smartphone takes 16 megapixel images, consuming 4.8 MB compressed or 48 MB raw -- a more than 3x increase in only four years.
  • The Internet of Things. We now have to deal with sensor data generated by everything. Farmers are putting health sensors on livestock so they can detect issues early on, get treatment and stop illness from spreading. They're putting sensors in their fields to understand how much fertilizer or water to use, and where. Everything from your refrigerator to your thermostat will be generating actionable data in the not too distant future.
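
The arithmetic behind the first two examples can be checked with a short Python sketch. The figures come straight from the text; the function names are illustrative, and decimal units (1 GB = 1,000 MB) are assumed.

```python
# Figures quoted in the article; decimal units (1 GB = 1,000 MB) are an assumption.
FRAMES_PER_SECOND = 1_000
MB_PER_FRAME = 2


def video_gb_per_second(fps: int, mb_per_frame: float) -> float:
    """Capacity consumed by one second of captured video, in GB."""
    return fps * mb_per_frame / 1_000


def growth_factor(old: float, new: float) -> float:
    """How many times larger the new value is than the old one."""
    return new / old


slow_motion_gb = video_gb_per_second(FRAMES_PER_SECOND, MB_PER_FRAME)
megapixel_growth = growth_factor(5, 16)    # 5 MP camera vs. 16 MP camera
jpeg_growth = growth_factor(1.5, 4.8)      # compressed image size, in MB

print(f"Super-slow motion video: {slow_motion_gb:.1f} GB per second")
print(f"Camera resolution growth: {megapixel_growth:.1f}x")
print(f"Compressed image growth: {jpeg_growth:.1f}x")
```

Both the resolution and the per-image size grow by the same factor, which is why the storage footprint tracks the sensor upgrade so closely.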

These are just a few examples, but you get the point. Data was growing fast before, and it's growing faster now. So now that we have all that data, what do we do with it?

Here comes Hadoop

Analytics has long provided valuable business insight, but it has been too time-consuming and expensive for mass adoption. The process of extracting data, transforming it into something that fits operational needs, loading it into a data mart and conducting analysis can take weeks. Data is growing too fast, the potential insights are too valuable and competition is too fierce for IT to have multi-week wait times to perform analytics. That's driving the development of new solutions, such as near real-time analytics based on Hadoop. However, we still have the data problem. How do we get data into these systems in a timely manner?

Traditional analytics data-loading processes don't just take too long; they also create lots of new, redundant and expensive data silos that significantly increase the amount of unstructured data you need to manage.

What about object storage?

Object storage operates differently from standard file-system storage. With a standard storage infrastructure, content is managed through a hierarchical file system using an index table that points to the physical storage location of each file and tracks only simple metadata. This approach limits the number of files that can be managed in a single directory. Files are accessed via standard access protocols such as NFS and CIFS.

Object storage data is organized into containers of flexible sizes (objects). Each object has a unique ID instead of a filename, with metadata that can include detailed attributes. This metadata can be used to set up automatic storage policies, such as migrating aging data from high-performance disk to more cost-efficient, capacity-based disk, or deleting data when it expires. Object storage offers a simpler design and greater scalability, easily managing billions of individual objects. The scalability and manageability of object stores make them a natural back end for cloud deployments, and indeed much of the early adoption of the technology has been by cloud service providers.
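
To make the model concrete, here is a minimal in-memory sketch in Python. It is a toy under stated assumptions, not any vendor's API: the class names, the "performance" and "capacity" tier labels, and the "created" and "expires" metadata keys are all hypothetical.

```python
import uuid
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class StoredObject:
    data: bytes
    metadata: dict              # free-form attributes attached to the object
    tier: str = "performance"   # hypothetical tier label


class ToyObjectStore:
    """Flat namespace: objects are addressed by unique ID, not by path."""

    def __init__(self):
        self._objects = {}      # object ID -> StoredObject

    def put(self, data: bytes, created: date, **metadata) -> str:
        object_id = str(uuid.uuid4())
        self._objects[object_id] = StoredObject(data, {"created": created, **metadata})
        return object_id

    def get(self, object_id: str) -> StoredObject:
        return self._objects[object_id]

    def __contains__(self, object_id: str) -> bool:
        return object_id in self._objects

    def apply_policies(self, today: date, max_hot_age: timedelta) -> None:
        """Metadata-driven policies: delete expired objects, demote aging ones."""
        for object_id, obj in list(self._objects.items()):
            expires = obj.metadata.get("expires")
            if expires is not None and expires <= today:
                del self._objects[object_id]
            elif today - obj.metadata["created"] > max_hot_age:
                obj.tier = "capacity"


store = ToyObjectStore()
old_id = store.put(b"scan", created=date(2014, 1, 1), department="radiology")
new_id = store.put(b"report", created=date(2015, 5, 20))
expired_id = store.put(b"temp", created=date(2015, 5, 1), expires=date(2015, 5, 31))

store.apply_policies(today=date(2015, 6, 1), max_hot_age=timedelta(days=365))
```

After the policy run, the year-old scan sits on the capacity tier, the recent report stays on the performance tier and the expired temporary object is gone. No directory tree is involved; the metadata alone drives placement.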

The challenge for enterprises is that objects are addressed via proprietary APIs using RESTful interfaces, and each vendor has its own API set. As a result, object storage has been slow to be adopted by enterprise IT because of the disruption and lock-in created by writing to proprietary APIs. The good news on that front is that some standardization is emerging, and many vendors are adopting the API sets from Amazon (S3) or OpenStack Swift.

A number of object storage vendors are beginning to support standards-based interfaces like CIFS and NFS, providing a "best of both worlds" approach that allows IT to scale and manage storage more easily while still plugging into existing workflows and applications. However, this multiprotocol approach is more like unified storage 1.5 than unified storage 2.0. Getting to 2.0 takes a bit more.

Unified storage 2.0

Object storage is off to a great start, but unified storage 2.0 means sharing not just the storage pool, but the files or objects themselves between applications. How do we get there? It's all about sharing nicely. We need:

  • A shared data store that's accessible to all your applications. That means instead of creating silos for your new file, mobile, cloud and Hadoop workflows, and copies of data in support of analytics operations or to feed other business processes, you would have just one, single pool of data shareable across everything.
  • A shared-access model so that each bit of data would be simultaneously accessible in multiple formats: as a CIFS or NFS file, a RESTful object, a Hadoop object or whatever comes next. This would eliminate the extract, transform and load process and allow for things like data-in-place analytics and accelerated workflow support between disparate applications.
  • Access from any device -- a tablet, smartphone, laptop, desktop or even a phablet -- to support today's mobile workforce.
  • Some level of quality of service. This would provide some way to securely isolate consolidated workflows in their own zones within the system for safeguarding or performance. Sharing is good, but limits need to be put in place to ensure organizations don't hurt important applications if there's a spike in non-critical application activity.
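
The shared-access idea in the list above can be sketched as a single stored copy reachable under more than one name. The Python below is a conceptual toy, not a real protocol implementation: the class, its method names and the rule that derives an object key from a file path are all assumptions made for illustration.

```python
class SharedDataStore:
    """One stored copy per item, reachable as a file path or an object key."""

    def __init__(self):
        self._blobs = {}     # blob ID -> bytes (the only copy of the data)
        self._paths = {}     # file-style path -> blob ID
        self._keys = {}      # object-style key -> blob ID
        self._next_id = 0

    def write_file(self, path: str, data: bytes) -> None:
        """File-style write; also registers an object key for the same blob."""
        blob_id = self._paths.get(path)
        if blob_id is None:
            blob_id = self._next_id
            self._next_id += 1
            self._paths[path] = blob_id
            # Hypothetical mapping rule: the object key is the path minus its leading slash.
            self._keys[path.lstrip("/")] = blob_id
        self._blobs[blob_id] = bytes(data)

    def read_file(self, path: str) -> bytes:
        return self._blobs[self._paths[path]]

    def get_object(self, key: str) -> bytes:
        return self._blobs[self._keys[key]]

    def copies_stored(self) -> int:
        return len(self._blobs)


shared = SharedDataStore()
shared.write_file("/logs/sales.csv", b"region,total\nwest,100\n")

# An analytics job reads the same bytes through the object interface.
first_read = shared.get_object("logs/sales.csv")

# The file-based application updates the data in place.
shared.write_file("/logs/sales.csv", b"region,total\nwest,150\n")
second_read = shared.get_object("logs/sales.csv")
```

Because both names resolve to the same blob, the analytics side sees the update immediately and only one copy is ever stored, which is the point of skipping the extract, transform and load step.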

There are other things that would be required, such as a scale-out architecture that grows with the data; some form of tiering so stale data is moved to slower, less expensive media; high availability so multiple applications don't go down when a storage component fails; and efficiency features such as erasure coding, compression and deduplication.

IT cannot afford to keep throwing hardware at the unstructured data problem. We're at a breaking point. The first step is rationalization and consolidation. But we can't consolidate on traditional platforms. We need something that scales and grows, is easy to manage and fully shareable. Only then, once we get our data under management and control, can we truly begin to harness the power of the information we have at hand.

About the author:
Terri McClure is a senior storage analyst at Enterprise Strategy Group, Milford, Mass.
