How is Hadoop infrastructure improving?

John Webster describes how changes to HDFS and the NameNode can help to improve Hadoop infrastructure.

We've seen a lot of Hadoop services pop up lately that add management and analytics software on top of Hadoop infrastructure. Is there any innovation around the architecture of the storage clusters, or is software the only place that we'll see any change or improvement in the future?

The two major architectural components of Hadoop are the MapReduce framework and the Hadoop Distributed File System (HDFS). There's been a lot of work on improving HDFS. MapR's distribution, for example, replaces HDFS with its own file system, which supports Hadoop infrastructure and eliminates the single point of failure that the NameNode represents in the Hadoop framework. So that's one approach that's being taken.

Another approach some vendors are taking is to propose an alternative file system to HDFS. Symantec, for example, has a version of its Cluster File System (CFS) that, again, addresses some of the shortcomings in HDFS. Red Hat has created an enterprise Hadoop version of its Gluster file system and proposed it as an alternative to HDFS.

At one point, IBM wanted to push GPFS as an alternative, but they've backed away from that. I think the reason is interesting: the Hadoop community wants to keep the Hadoop code 100% open source, and some of these offshoots are regarded as forks of the code base. There are a number of purists out there who want to maintain HDFS as the file system for Hadoop infrastructure, and if there are shortcomings there, those will be addressed by the Hadoop community.

That's happening as we speak. It's been felt for a long time that HDFS ought to have snapshot capability, so that is on the roadmap. The NameNode failover issue -- the inability of the NameNode to fail over to a secondary in some automated way -- is also being addressed. There are a number of other issues that will be addressed in Hadoop 2.0.
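To make the NameNode failover issue concrete, here is a minimal Python sketch of the idea. It is a toy model, not Hadoop code: `ToyNameNode`, `FailoverClient`, `make_pair` and the path and node names are all hypothetical, and the standby is simply assumed to hold an identical copy of the namespace.

```python
class ToyNameNode:
    """Hypothetical stand-in for a NameNode: maps file paths to block locations."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.namespace = {}

    def add_file(self, path, blocks):
        self.namespace[path] = list(blocks)

    def lookup(self, path):
        # A dead metadata server cannot answer, even though the data
        # blocks themselves may still be intact on the data nodes.
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return self.namespace[path]


class FailoverClient:
    """Tries each configured NameNode in order and uses the first that answers."""

    def __init__(self, namenodes):
        self.namenodes = list(namenodes)

    def lookup(self, path):
        for namenode in self.namenodes:
            try:
                return namenode.name, namenode.lookup(path)
            except ConnectionError:
                continue  # automated failover: move on to the next candidate
        raise ConnectionError("no NameNode available")


def make_pair():
    """Build an active and a standby node with the same (assumed synced) namespace."""
    active = ToyNameNode("nn-active")
    standby = ToyNameNode("nn-standby")
    for namenode in (active, standby):
        namenode.add_file("/logs/day1", ["datanode-1", "datanode-2", "datanode-3"])
    return active, standby
```

With only `active` in its list, a client loses access to every file as soon as that node goes down, which is the single point of failure. With both nodes listed, the same lookup is answered by the standby without manual intervention.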

In the meantime, that hasn't stopped some vendors from coming out with proprietary extensions or, in some cases, full-scale replacements for parts of the Hadoop framework.

About the author:
John Webster is a senior partner at Evaluator Group Inc., where he contributes to the firm's ongoing research into data storage technologies, including hardware, software and services management.
