Big data tutorial: Everything you need to know
A comprehensive collection of articles, videos and more, hand-picked by our editors
Last installment of a four-part series
So far, we’ve said that Hadoop is a framework for building distributed computing clusters that can address the processing of large and disparate data sets. However, another way to think of Hadoop is as a storage device or storage environment. Indeed, it's a platform upon which applications can be built that's capable of storing petabytes of data. In addition, it can process and analyze data; delivering the results to a growing list of "big data" applications. (Admittedly, this is a very storage-centric view of a Hadoop architecture.)
We’ve also said that each node offers its local computational and storage resources to the cluster and that these nodes are based on commodity server hardware. The term sometimes used to describe the resource provisioning philosophy is “cheap and deep,” meaning that clusters composed of commodity servers (cheap) can be scaled to hundreds of nodes (deep) -- all based on Apache Hadoop, which is free (and you can’t get much cheaper than free).
Hadoop: Like RAID?
Given the cheap and deep mindset, component failures of one sort or another are expected over time. So Hadoop is designed to detect and handle failures. This aspect of Hadoop architecture is somewhat analogous to the early days of RAID which originally stood for Redundant Arrays of Inexpensive Disks. It was assumed that, in a storage array built from a pile of PC-class disks, drive failures would occur. The trick was to allow for drive failures without losing data. The differing RAID levels (0, 1, 3, 5, 6 and so on) offered various array configurations and drive failure recover modes.
John Webster's Big Data Tip Series
Managing Hadoop storage for Big Data (tip 1)
The Hadoop compute cluster: Key criteria (tip 2)
Architecture alternatives for Hadoop data storage (tip 3)
Indeed, Hadoop can be thought of as a redundant array of inexpensive servers (RAIS). Hadoop also assumes that hardware failures across redundant servers will be a normal operational occurrence and therefore has recovery processes built-in. Many of these are implemented in the Hadoop Distributed File System (HDFS). For example, when data is ingested, it's broken down into data blocks (the default is 64 MB blocks). Blocks are copied multiple times and then distributed -- both the originals and the copies -- across DataNodes. HDFS makes two copies by default, and one is typically written to a different server rack. This copy and distribution process is managed by a NameNode. If a DataNode server fails for any reason, including internal disk failure, the NameNode will find missing data residing elsewhere in the cluster, allowing processing to continue while the failed node is restarted or replaced.
But not like a modern RAID array
Even so, there are some glaring omissions. Recovering from the loss of a DataNode is relatively easy compared to that of a NameNode outage. In current versions of Apache Hadoop, there's no automated recovery provision made for a non-functional NameNode. The Hadoop NameNode is a notorious single point of failure (SPOF) -- a situation not unlike that of a RAID array where a single controller is a SPOF. Loss of a NameNode halts the cluster and can result in data loss if corruption occurs and data can’t be recovered. In addition, restarting the NameNode can take hours for larger clusters (assuming the data is recoverable).
Addressing Apache Hadoop issues
The lack of an automated NameNode failover mode and other Apache Hadoop shortcomings (JobTracker is another SPOF) create an opening for commercial vendors eager to sell “enterprise-ready” alternatives. One of the ways these vendors generally do this is to essentially support Apache Hadoop through APIs to the core Hadoop components like HDFS along with their own modifications, some open and others proprietary. The list of vendors that fall into this category includes (but isn't limited to):
- MapR (also offered by EMC Greenplum)
- Red Hat
A first order of business for these vendors (and other vendors wishing to preserve Hadoop’s MapReduce framework while fixing problems) is to address the NameNode and JobTracker SPOF issues. For example, MapR’s distribution for Apache Hadoop implements a distributed NameNode function (Distributed NameNode HA) across servers in the cluster. Red Hat’s GlusterFS uses its built-in metadata awareness to eliminate the NameNode altogether as a metadata server.
We’ve also mentioned that Hadoop creates multiple copies of data that are distributed across the cluster for use under differing recovery scenarios. However, using snapshots instead could be used to roll-back the cluster to a known good state while reducing the overhead required by full data copies. A number of vendors support snapshot copy in their Hadoop architectures.
And referring back to our discussion about scale-out network-attached storage (NAS) as primary storage for Hadoop, EMC Isilon can be used to address these issues as well. Isilon’s OneFS global namespace file system can be used to support Greenplum Hadoop (HD) clusters. Isilon treats HDFS as an “over the wire” protocol and, as such, is the first SoNAS platform to be natively integrated with HDFS. It, too, addresses the Hadoop NameNode and JobTracker functions as single points of failure.
Apache Hadoop to respond
To be fair, we have to point out that the Apache community isn't ignorant of Hadoop’s present shortcomings with regard to the NameNode and other issues. In fact, a major new version now available as a beta from Cloudera (CDH 4.0) specifically addresses the NameNode SPOF issue. It includes a high-availability (HA) version of HDFS. In the HA version is a “hot stand-by” NameNode that, under administrative control, takes over when the active NameNode either faults or is taken off-line by an administrator for routine maintenance and upgrades -- normally the more likely scenario. In short, HDFS HA includes two NameNodes in an active/passive configuration. In future releases, automated NameNode failover will be supported.
We started this series by noting that big data storage is another way of saying petabyte-scale storage, and that big data analytics is a new way to do business intelligence (BI). We’ve looked more closely at Hadoop as an example of big data analytics. But we’ve also seen how big data storage can be used in conjunction with Hadoop -- bringing big data storage and analytics together -- and how Hadoop can be evaluated as a petabyte-scale storage device.
However, we’ve yet to really explore one final but critical consideration: cost. Aside from the latency problem with adding networked storage to a shared-nothing cluster we described earlier, a storage-area network (SAN) and NAS are also seen by “traditionalists” as being way too expensive. Remember the mantra: cheap and deep. Ditto for solid-state drives (SSDs) as a replacement for direct-attached storage (DAS) at the cluster node level. Even a type of storage that could turbocharge the cluster is typically seen as too expensive at scale and appropriate only for those who are willing to pay up for performance.
The real question is whether or not the cheap-and-deep mindset will prevail within the enterprise data center. Where it does, node-level DAS will likely prevail as the one and only storage layer for Hadoop until someone realizes that continually adding servers to the cluster to accommodate data growth also has cost implications in terms of increasing support issues and management overhead. Where it doesn’t, SAN and/or NAS will be used as either a primary or secondary storage layer for business continuity and data preservation purposes, and the storage administrator’s skill set will be once again appreciated.
BIO: John Webster is a senior partner at Evaluator Group Inc., where he contributes to the firm's ongoing research into data storage technologies, including hardware, software and services management.