Big data tutorial: Everything you need to know
A comprehensive collection of articles, videos and more, hand-picked by our editors
Third installment of a four-part series
We concluded the last article in this series on managing "big data" with a question: Does Hadoop storage have to be direct-attached storage (DAS)? This article introduces some alternatives to the strict use of embedded DAS for hardware-based storage within the Hadoop MapReduce framework. To do so, we'll examine the alternatives in terms of a three-stage model:
Stage one: DAS in the form of small amounts of disk (JBOD/RAID) embedded within each cluster node is replaced with larger, high-performance arrays that are external to the cluster nodes that are still directly attached to provide data locality. In a way, we’re rephrasing our original question: Does Hadoop data storage have to be relatively small groupings of DAS embedded within each cluster node? No, but the larger external storage arrays that replace embedded DAS still function as DAS.
Stage two: The node-based DAS layer used as primary storage by the cluster is augmented with the addition of a second storage layer consisting of network-attached storage (NAS) or a storage-area network (SAN).
We'll now look at each of these three stages in more detail.
The first thing to note about the use of external disk arrays as primary storage for Hadoop clusters is that doing so preserves the shared-nothing and data locality characteristics of DAS. Typical storage arrays can be configured as multiple volumes of DAS, all housed within the same array. Each node has its own non-shared set of disks. Once this is done, however, all the features and functions now common to high-performance, data center class RAID arrays can be applied to Hadoop NameNodes and DataNodes. The beneficial effects, as seen from the standpoint of data storage administrators, differ for NameNodes vs. DataNodes:
At the NameNode level: Storage for the Hadoop NameNode holds cluster-wide metadata. The NameNode is a well-know single point of failure that can shut down the cluster when not functioning. A data center class array can act as a unified repository of cluster metadata that supports faster recovery from failure. It also serves as a repository for other cluster software, including scripts; as such, it can be used to simplify cluster deployment, updates and ongoing maintenance.
At the DataNode level: Standard Hadoop clusters typically use DataNode-based software to provide data protection and system resilience. Hadoop storage uses a distributed, host software-based multiple data mirroring scheme that functions across all DataNodes in a cluster. Upon data ingest, users typically specify that two additional copies of the original data be written to two other DataNodes in the cluster, resulting in three data copies contained within the cluster. This provides a degree of resilience in case of a failure, as well as balanced access (load balancing) to data across the DataNodes in the cluster.
However, by using a replication count of three, every TB of data ingested yields three TBs stored. In addition, the copy process consumes cluster processing resources and internal communications bandwidth that detracts from making those same resources available to analytic processes.
Using external arrays to support DataNodes allows storage administrators to use array-resident data protection functions, including RAID, snapshots, continuous data protection (CDP), cloning and off-site replication. That moves data protection processes, and the creation of data replicas needed for adverse event recovery purposes, off the Hadoop cluster and on to storage arrays designed to accomplish these tasks far more efficiently. Data security and preservation processes can also be applied.
Cluster-wide performance can also be improved. As noted above, triple mirroring within the cluster consumes server and network bandwidth. Moving the data protection function off the cluster and on to the arrays returns the cluster resources consumed back to the cluster. An example of a stage one implementation can be seen in NetApp’s Open Solution for Hadoop.
We mentioned that Hadoop storage administrators typically maintain three copies of data within the cluster for data protection and disaster recovery. Some commercial versions of Hadoop support using externally shared storage as a target for Hadoop’s internal mirroring process, i.e. one of three data copies lives externally. Node-based DAS remains untouched.
John Webster's Big Data Tip Series
Read more about Hadoop management (tip 1)
Hadoop cluster nodes explained (tip 2)
Key Apache Hadoop shortcomings to look out for (tip 4)
As with stage one, this implementation of external storage in the context of Hadoop also has the net effect of preserving the shared-nothing and data locality mandates (DAS as a primary storage layer is preserved), while allowing storage administrators the ability to apply their data protection, security and preservation processes. In addition, because SAN and/or NAS can now act as a secondary storage tier supporting a Hadoop cluster, external storage becomes a very scalable data repository. EMC Greenplum HD’s use of DataDomain and VMAX is an example of a stage two implementation.
In stage three, rules start getting broken. Shared storage – scale-out NAS, for example -- becomes Hadoop’s primary storage layer. DAS is gone. Shared-nothing and data locality are gone. However, most if not all the beneficial attributes common to modern storage platforms (automated tiering, internal and external replication, and so on.) are applied to Hadoop data.
This implementation of shared storage in the context of Hadoop will likely limit the size of the cluster. Therefore, it would seem to be a viable option in situations where data residing on a scale-out NAS system supporting normal business applications could for example, be copied and presented to a small Hadoop cluster running a BI application and attached to the same NAS system. EMC Greenplum’s support of Isilon is an example.
Is stage four on the horizon?
The potential now exists for a fourth stage to emerge later this year. What is stage four? We are already aware of scale-out storage architectures using distributed computing (aka grid) as a foundation. This category of storage platforms includes but isn't limited to EMC Isilon, IBM SONAS and Sepaton DeltaScale. As mentioned in an earlier article on Hadoop clusters, Hadoop MapReduce aims to move the data close to the computing elements to reduce cluster latency. But suppose you did it the other way around and moved the computing close to the data. Scale-out storage processing nodes generally have enough compute power and internal network bandwidth to support Hadoop. So storage administrators, get ready to manage your very own Hadoop cluster.
Evaluating Hadoop storage
In walking through these first three stages, we’ve been running up against the implication that Hadoop storage has issues, and that these issues can be addressed by using more robust and scalable storage platforms to support Hadoop clusters. In the fourth article in this series we’ll discuss these issues in more detail by evaluating Hadoop as an enterprise data center quality storage device. We’ll look at how Hadoop storage maintains system availability, manages data protection and other issues, in much the same way you would as a storage administrator.
To get started, it’s worth asking a set-up question: Are there issues with Hadoop that can be addressed in Hadoop’s storage layer? Those of you who are aware of Hadoop’s shortcomings would answer with a decided “yes.”
BIO: John Webster is a senior partner at Evaluator Group Inc., where he contributes to the firm's ongoing research into data storage technologies, including hardware, software and services management.