As a practical matter, big data analysis -- deriving near real-time intelligence from massive volumes of data that emanate from different applications and take the form of a mixture of block and file formatted bits -- requires an appropriate storage infrastructure. Not surprisingly, most big data reference storage architectures have focused on clustering storage nodes to facilitate scaling, I/O balancing mechanisms to ensure computing or analytical processes don't become I/O bound, and high-performance throughput capabilities to get data ingested into the infrastructure efficiently. An early approach to meeting these requirements was to leverage high-performance, scale-out clustering hardware stacks in which disk or silicon storage was installed in (or directly attached to) a commodity compute server. The software component of this type of environment runs Hadoop or a NoSQL database such as Cassandra, which enables analytic queries to be distributed across the nodes of the cluster.
Apple, Facebook, Google and other early big data pioneers made huge implementations of this type of infrastructure, described as a hyper-scale computing environment. Leading storage vendors subsequently began engineering products that could deliver hyper-scale-like infrastructures in the footprint of more traditional pre-fabricated, pre-integrated arrays. IBM's reference architecture looked like this: a rack of multiple storage arrays called data nodes, a server called an edge node that was responsible for routing data into and out of a big data cluster, one or more management nodes controlling the system and tracking data placement and workloads, and a top layer of rack switches to control the movement of data into and out of the rack. Similar configurations have been built by other vendors, such as HP, and are touted as pre-assembled nodes that can be customized to specific needs and that provide reliable, pre-integrated and pre-tested solutions to specific big data analytics challenges.
A second approach is advocated by purveyors of NAS appliances, which are thin servers joined to a direct-attached storage array. Various scale-out technologies have been introduced by leading NAS vendors -- including NetApp -- that are intended to facilitate the expansion of capacity behind an individual thin server "head," or to enable the clustering of multiple NAS heads and their storage so parallel file systems can be used with a growing number of NAS appliances. BlueArc and Isilon are two scale-out NAS systems that were acquired by brand-name storage vendors in recent years, in part to facilitate big data hosting and high-speed access (via parallel file systems) to hosted data.
But big data analysis often involves a combination of file- and block-based data, which can limit the effectiveness of a NAS solution optimized for file storage. That brings us to the third approach to big data storage infrastructures: object storage. Object storage systems impose a database-like structure when storing data objects, which can be block data output from databases or files. They use much of the same physical storage infrastructure as file system-based storage, but apply a unique identifier to each data object, index the object and its location, and use different methods to access and retrieve objects individually and in groups. At present, products in the object storage space are less mainstream than scale-out NAS or hyper-scale storage products, but their ability to scale to billions of objects and deliver high-speed access to data makes this a technology area to watch.
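To make the object storage idea concrete, here is a minimal in-memory sketch of the behavior described above: each stored object gets a unique identifier, an index records the object's location and metadata, and objects can be fetched one at a time or as a group. The class name `ObjectStore`, its methods and the `kind` tag are hypothetical names chosen for illustration; they do not correspond to any vendor's product or API.

```python
import uuid


class ObjectStore:
    """Toy in-memory object store with a flat namespace (no directory tree)."""

    def __init__(self):
        self._payloads = {}  # object ID -> raw bytes
        self._index = {}     # object ID -> metadata, including location

    def put(self, payload: bytes, location: str, **tags) -> str:
        """Store a payload, index it, and return the unique ID assigned to it."""
        object_id = uuid.uuid4().hex
        self._payloads[object_id] = payload
        self._index[object_id] = {"location": location, "size": len(payload), **tags}
        return object_id

    def get(self, object_id: str) -> bytes:
        """Retrieve a single object by its identifier."""
        return self._payloads[object_id]

    def locate(self, object_id: str) -> str:
        """Look up where an object lives without reading its payload."""
        return self._index[object_id]["location"]

    def find(self, **criteria) -> list:
        """Return the IDs of all objects whose metadata matches every criterion."""
        return [oid for oid, meta in self._index.items()
                if all(meta.get(key) == value for key, value in criteria.items())]


store = ObjectStore()
# Block data exported from a database and an ordinary file sit side by side.
table_id = store.put(b"id,amount\n1,9.99\n", location="node-1", kind="block")
log_id = store.put(b"GET /index.html 200\n", location="node-2", kind="file")

print(store.locate(table_id))   # where the database extract is stored
print(store.find(kind="file"))  # group retrieval by metadata
```

Because the index, rather than a directory hierarchy, is what locates each object, block output and files can share one namespace, which is the property that makes the approach attractive for mixed big data workloads.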
Big data storage infrastructure: Factors to consider
There are two factors that need to be considered when architecting the storage infrastructure for a big data operation: the frequency with which it will be used, and the speed at which analytical processes must deliver their results.
There is no one-size-fits-most infrastructure architecture for big data as an application any more than there is for enterprise resource planning, email or other database/productivity applications.
In recognition of this, technologies like EMC's ViPR -- a scale-out object storage "overlay" intended to bring disparate legacy hardware products into a kind of big data infrastructure -- are appearing on the market so that occasional big data users can host their workloads on the storage they already have when needed.
In contrast, Nutanix and a number of other flash memory storage startups are offering pricey solid-state drive (SSD) or PCI Express (PCIe) flash storage-based "clusters in a box" arrays that provide "isolated islands" of big data storage infrastructure (with their own price tags). This strategy is somewhat reminiscent of high-cost data warehousing appliances that have appeared and disappeared in the market over the past 10 years. Whether "silicon island" products succeed or not, many vendors insist hybrid storage arrays combining SSD and hard disk, or on-server PCIe flash with internal or direct-attached disk, will be required to deliver the speeds needed for data analysis and the delivery of those results.
The need for speed
How fast does big data storage need to be? It's a valid question, but there's a lack of agreement in the big data community about the meaning of fast data. Real-time databases have existed for years, especially in financial markets. In-memory data stores have been the hosting solution for real-time databases for nearly two decades, but experts say that the once rarefied and expensive-to-scale infrastructure is becoming more available and affordable with the advent of cheap flash memory that enables entire databases to be placed into silicon rather than just cached input and output waiting to be processed.
Most big data evangelists smile when you ask the best way to host big data and answer: "On flash, of course!" It's not surprising that database makers are reinventing their products today following the lead of Oracle and SAP, whose big data appliances feature a full suite of flash memory and dynamic RAM to host their in-memory databases.
According to a study released by MIT researchers in January 2014, accessing big data across a networked disk-based infrastructure usually takes approximately 4 milliseconds to 12 milliseconds. Using flash storage systems can reduce this to microseconds.
However, it's not clear whether this analysis includes accesses made across WANs over distance to clouds or geographically dispersed repositories, which would most likely add a lot of time delay to analytic processing. Still, at this point, flash seems to rule the day when it comes to big data.
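A rough back-of-the-envelope calculation shows why these latency figures matter and why the WAN question is not a minor one. The sketch below uses the 4 millisecond to 12 millisecond disk range quoted above. Everything else is an assumption made purely for illustration: the job size of one million serial accesses, a flash latency of 100 microseconds (the study is cited only as saying "microseconds"), and a 50 millisecond WAN round trip.

```python
# All latencies are in microseconds so the arithmetic stays exact.
ACCESSES = 1_000_000   # assumption: one million serial accesses
DISK_LOW_US = 4_000    # 4 ms, low end of the range quoted in the text
DISK_HIGH_US = 12_000  # 12 ms, high end of the range quoted in the text
FLASH_US = 100         # assumption: 100 microseconds per flash access
WAN_RTT_US = 50_000    # assumption: 50 ms round trip to a remote repository


def total_seconds(accesses: int, latency_us: int) -> float:
    """Total wait time in seconds if every access pays the full latency."""
    return accesses * latency_us / 1_000_000


disk_low = total_seconds(ACCESSES, DISK_LOW_US)               # 4000.0 s
disk_high = total_seconds(ACCESSES, DISK_HIGH_US)             # 12000.0 s
flash = total_seconds(ACCESSES, FLASH_US)                     # 100.0 s
flash_wan = total_seconds(ACCESSES, FLASH_US + WAN_RTT_US)    # 50100.0 s

print(f"disk:        {disk_low:.0f} to {disk_high:.0f} s")
print(f"flash:       {flash:.0f} s")
print(f"flash + WAN: {flash_wan:.0f} s")
```

Under these assumptions, local flash cuts the wait from more than an hour to under two minutes, but the same flash reached across a WAN is slower than local disk. Real workloads overlap and batch their I/O, so the absolute numbers will differ; the point is only that network distance can erase the media advantage.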
Capacity still king of the hill
Capacity planning is a big factor in any big data project. Multiple, constantly growing compilations of data must be made available to analytics and workload distribution processes in a coherent way. As noted earlier, words like scale out and hyper-scale connote the ongoing demand for greater physical storage capacity.
Scaling a disk volume with traditional operating systems (OSes) and file systems can be problematic. The traditional server OS typically sees a disk as a resource of fixed size, incapable of scaling over time. In contrast, storage virtualization enables virtually unlimited scaling of virtual disks composed of physical storage from the back-end infrastructure, but not limited in size to any one physical disk or flash card. So, virtualizing the storage infrastructure can help meet the capacity requirements of big data, per the advice of DataCore Software and IBM (two purveyors of storage virtualization technology): To grow a volume to accommodate more data, add more disk or silicon storage to the pool, spread its I/O over more nodes, and auto-tier it based on the speed/capacity tradeoffs that make the most sense.
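The pooling idea can be sketched in a few lines. This is a simplified model under stated assumptions, not a description of how DataCore's or IBM's products work: the names `Device`, `StoragePool` and `VirtualVolume` are hypothetical, capacities are arbitrary, and I/O spreading and auto-tiering are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """A physical disk or flash card contributed to the pool (hypothetical)."""
    name: str
    capacity_gb: int


@dataclass
class StoragePool:
    """Back-end capacity aggregated from any number of physical devices."""
    devices: list = field(default_factory=list)

    def add(self, device: Device) -> None:
        self.devices.append(device)

    @property
    def capacity_gb(self) -> int:
        return sum(d.capacity_gb for d in self.devices)


@dataclass
class VirtualVolume:
    """A virtual disk whose size is bounded by the pool, not by one device."""
    pool: StoragePool
    size_gb: int = 0

    def grow(self, extra_gb: int) -> None:
        if self.size_gb + extra_gb > self.pool.capacity_gb:
            raise ValueError("pool exhausted: add more disk or flash first")
        self.size_gb += extra_gb


pool = StoragePool()
pool.add(Device("disk-1", 1000))
pool.add(Device("disk-2", 1000))

volume = VirtualVolume(pool)
volume.grow(1500)                 # already larger than any single device

pool.add(Device("flash-1", 500))  # expand the pool, then the volume
volume.grow(1000)
print(volume.size_gb, "GB of", pool.capacity_gb, "GB pooled")
```

The server sees one volume throughout; only the pool behind it changes, which is the contrast with the fixed-size disk that a traditional OS expects.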
The alternative to storage virtualization is to cobble together a large number of independent, but clustered, nodes with their own associated storage. Other methodologies grow volume size by increasing the set of cluster members via a layer of clustering and nodal management software technology, the essence of current hyper-scale and NAS scale-out approaches.
The choice of which approach is best is a personal one. Sometimes the restrictions of the big data analytics software will determine the right strategy, especially since software vendors are currently certifying specific storage solutions for use with their wares.
Perhaps the most capacious medium, tape, is missing from most discussions of big data analytics, and I think it should be included. Some observers say the linear read/write characteristics of tape make it unsuitable for big data. However, a big data infrastructure is just as vulnerable to unplanned interruptions and failures as a traditional storage infrastructure. Given the high cost of most big data infrastructures, the investment being made to ingest data into the storage infrastructure, and the challenge of recovering parts of the big data environment in the wake of an outage event, tape backup is still, from my perspective, the best investment you can make to protect your big data storage infrastructure and its data assets.