Complete guide to Hadoop technology and storage
Why use private cloud storage with Hadoop? Isn't Hadoop designed to use inexpensive commodity servers and storage? Will integrating private cloud storage solve Hadoop issues? Answering these questions requires a little background on Hadoop.
Hadoop is the Apache open source software project for big data analytics, aimed primarily at unstructured data. It's designed to extract useful or actionable information from large, primarily unstructured data sets, although it can provide similar value with large structured data sets and with combinations of structured and unstructured data. What makes Hadoop so valuable is its ability to derive useful information from data that traditionally has not been easy to mine, and to do so across petabytes of data. But what makes Hadoop truly stand out is that users do not have to know in advance the outcome they are seeking: its analysis can surface relationships that no one knew existed. It is a powerful business and IT tool today.
The key design principle behind Hadoop's handling, processing and analysis of very large data sets (petabytes of data) is to automatically distribute data storage and batch jobs across clusters of commodity servers. Hadoop scales from a single server to thousands of machines with a degree of built-in fault tolerance. Hadoop's failure detection and automation provide a high degree of resilience.
The two significant technologies behind Hadoop are MapReduce and Hadoop Distributed File System (HDFS). MapReduce is the framework that recognizes and assigns batch jobs to Hadoop cluster nodes. MapReduce processes those jobs in parallel, enabling large data set batch processing and analysis in short timeframes. HDFS spans and links all the nodes in the Hadoop cluster into one big file system. Because nodes will fail, HDFS enables reliability by replicating data across multiple nodes.
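To make the MapReduce model concrete, here is a minimal single-machine sketch in Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical, and a thread pool merely stands in for cluster nodes working on separate pieces of data in parallel.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(split):
    """Map task: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(mapped_outputs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce task: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(splits, workers=3):
    # Each split stands in for a chunk of data held on a different node.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, splits))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: three small splits instead of petabytes of data.
splits = ["big data big cluster", "data node fails", "big node"]
counts = word_count(splits)
print(counts["big"], counts["data"], counts["node"])  # prints "3 2 2"
```

The point of the structure is that each map task needs only its own split, so the work can be pushed out to whichever node already holds that data; only the much smaller grouped results have to move between nodes.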
What, then, does private cloud storage provide for Hadoop? Hadoop is an evolving project. There are currently three significant Hadoop issues mitigated or eliminated by private cloud storage providers:
- HDFS provides a well-documented, highly resilient file system. Unfortunately, its single NameNode, which coordinates file system data access, is a well-known single point of failure that reduces Hadoop's availability. In Hadoop clusters running interactive workloads (HBase); real-time extract, transform and load; or batch-processing workflows, an HDFS NameNode outage is a serious problem: the cluster is down, users are unhappy and productivity suffers. The Hadoop community and Apache are working hard on NameNode high availability, which is expected sometime later this year with the Hadoop 2.0 release. In the meantime, several private cloud storage products, such as NetApp FAS and V-Series, EMC Isilon, and Cleversafe Dispersed Storage, provide a fix for the NameNode issue via their storage.
- The second Hadoop issue addressed by private cloud storage is arguably worse than the first. To meet its designed resilience, HDFS by default keeps two additional copies of the data, or three in total. That means it consumes three times the amount of storage: for every petabyte of actual data, 3 PB of storage is consumed. Even with cheaper commodity server storage, that's a lot of storage, and it takes up rack space, floor space, and especially power and cooling. Cleversafe has addressed this issue by providing an HDFS interface that eliminates the multiple data copies through dispersed storage erasure coding. That dispersed storage provides an order of magnitude higher level of resilience than standard HDFS while consuming 60% less storage.
- Then, there is the issue of moving data into a Hadoop cluster. Data has to be migrated to the Hadoop cluster before it can be processed. That is a nontrivial, ongoing task, and the time it takes grows with the amount of data to be processed and analyzed. EMC Isilon has come up with a shortcut. It can present NFS or CIFS (SMB1 or SMB2) data residing in an Isilon storage cluster as HDFS, eliminating the data migration. It can also present HDFS data as NFS or CIFS for use outside of the Hadoop cluster.
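The storage figures in the second issue are easy to check with a little arithmetic. The Python sketch below works through them; the function names are hypothetical, and the 20-of-24 erasure-coding layout is an assumption chosen only because it happens to give the same 1.2x expansion, not a description of Cleversafe's actual configuration.

```python
def replicated_raw_pb(data_pb, copies=3):
    """Raw capacity consumed when every byte is stored `copies` times."""
    return data_pb * copies


def reduced_raw_pb(raw_pb, savings=0.60):
    """Raw capacity left after a fractional saving (0.60 means 60% less)."""
    return raw_pb * (1 - savings)


def erasure_expansion(total_slices, threshold):
    """Expansion factor of an erasure code that stores `total_slices` slices
    and can rebuild the data from any `threshold` of them."""
    return total_slices / threshold


data_pb = 1.0                                  # 1 PB of actual data
replicated = replicated_raw_pb(data_pb)        # three full copies
dispersed = reduced_raw_pb(replicated)         # 60% less than replication

print(f"3x replication: {replicated:.1f} PB raw")     # 3.0 PB raw
print(f"60% less:       {dispersed:.1f} PB raw")      # 1.2 PB raw

# Hypothetical layout: 24 slices, any 20 of which rebuild the data.
print(f"20-of-24 expansion: {erasure_expansion(24, 20):.1f}x")  # 1.2x
```

In other words, a 60% saving over triple replication means storing roughly 1.2 PB of raw capacity per petabyte of data instead of 3 PB.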
About the author
Marc Staimer is the founder, senior analyst and CDS of Dragon Slayer Consulting in Beaverton, Ore. The consulting practice of 15 years has focused on strategic planning, product development and market development. With more than 33 years of marketing, sales and business experience in infrastructure, storage, server, software, database, big data and virtualization, he's considered one of the industry's leading experts. Marc can be reached at firstname.lastname@example.org.