How essential is a Hadoop infrastructure to a big data environment?

Expert Jon Toigo explains why Hadoop and big data are so often mentioned together, but argues that Hadoop has a number of drawbacks.

Jon Toigo

How essential is a Hadoop infrastructure in a big data environment?

Because Hadoop came into vogue at the same time big data did, they became synonymous. [But] they're two different things. Hadoop is a parallel programming framework that is implemented on a cluster of low-cost servers, and it's intended to support data-intensive distributed applications. That's what Hadoop is all about. It existed prior to the fascination with big data that we're hearing about today. But since Hadoop was out there, it was seized on as sort of an architecture for building big data infrastructure. It rests on the MapReduce model that Google popularized, which is a way to distribute an application's work across the nodes of a cluster. Hadoop's MapReduce engine and the Hadoop Distributed File System [HDFS], which is modeled on Google's file system, are mostly built on Java, which introduces its own set of problems. Hadoop also claims to provide resiliency through internodal failover. With most clusters, if one node fails, its work is supposed to fail over to another node.
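To make the MapReduce idea mentioned above concrete, here is a minimal single-process sketch in Python. It is not Hadoop code and uses no Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and the "nodes" are simulated with an ordinary loop. It only shows the shape of the model: map each chunk of input independently, group the intermediate results by key, then reduce each group.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_outputs):
    """Shuffle step: group all emitted values by key across every chunk."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: combine each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(chunks):
    # In a real cluster each chunk would be mapped on a different node;
    # here the nodes are simulated by a plain loop in one process.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))

counts = word_count(["big data big cluster", "data node data"])
print(counts)
```

Because each call to `map_phase` touches only its own chunk, the map work can in principle run on as many machines as there are chunks, which is the property that makes the model attractive on low-cost clusters.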

I'm not sure I'm entirely comfortable with Hadoop going forward. And there's general agreement that there are several aspects of a Hadoop infrastructure that really need work if it's ever going to be enterprise-ready. For one thing, core to Hadoop is something called the NameNode, which stores the metadata for the cluster's file system: which files exist, how they are split into blocks and which nodes in the cluster hold each block. In the original Hadoop design, that metadata is served from only one place, with no automatic failover. So it's a single point of failure in a Hadoop infrastructure. And that needs to be addressed if you're going to be doing serious processing on a Hadoop cluster. Another one is JobTracker. JobTracker is a component that manages MapReduce tasks and assigns workloads to different servers, preferably those that are closest to the data being analyzed by that particular process. And again, JobTracker is a single point of failure. It only exists on one server in the cluster. These are just the obvious things that are problematic about the Hadoop architecture today.
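The scheduling behavior described above, placing each task on a server close to its data, can be sketched in a few lines of Python. This is an illustrative toy, not JobTracker's actual algorithm: the function `assign_tasks`, the block and node names, and the slot counts are all hypothetical. Note that the whole plan is computed by one function in one process, which mirrors the single-point-of-failure concern: if that one scheduler goes away, nothing assigns work.

```python
def assign_tasks(block_locations, free_slots):
    """Assign each block's task to a node, preferring a node that stores the block.

    block_locations: dict mapping block id -> list of nodes holding a replica
    free_slots: dict mapping node -> number of task slots still available
    Returns a dict mapping block id -> (chosen node, True if data-local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is left untouched
    assignments = {}
    for block, holders in block_locations.items():
        # First choice: a node that already stores the block (data-local).
        local = [n for n in holders if slots.get(n, 0) > 0]
        if local:
            node = local[0]
        else:
            # Fallback: any node with a free slot; the data must cross the network.
            remote = [n for n, s in slots.items() if s > 0]
            if not remote:
                raise RuntimeError("no free task slots in the cluster")
            node = remote[0]
        slots[node] -= 1
        assignments[block] = (node, node in holders)
    return assignments

# Hypothetical three-node cluster: n1 holds every block but has only two slots.
plan = assign_tasks(
    {"b1": ["n1", "n2"], "b2": ["n1"], "b3": ["n1"]},
    {"n1": 2, "n2": 1, "n3": 1},
)
print(plan)
```

In this example the third block cannot run where its data lives because `n1` is full, so it falls back to a remote node; that kind of non-local placement is one place where network traffic and performance problems can creep in.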

The technology of Hadoop itself is not simple. If you're going to deploy it, you're going to need some competent programmers, and they have to understand a variety of things that you wouldn't normally expect a single programmer to have in one kit. They have to know Pig, whose scripting language is called Pig Latin, and which is associated with the runtime environment of Hadoop. And they also have to understand Hive, which is a data warehouse layer that lets you query data stored in the Hadoop infrastructure with a SQL-like language. And, of course, they're going to have to understand Java, and possibly Jaql, which is a query language for JavaScript Object Notation [JSON] data. It's hard enough these days to find programmers who are competent enough to do PHP, and now you're asking for somebody with a mix of fairly exotic languages under their belt.

So the first thing I said is you have some single points of failure. Second, Hadoop requires some specialized skills that may not be available in the skills market that's out there. Third, you're going to have problems with performance. Every company that's deployed Hadoop has had problems with the performance of the operations that Hadoop performs -- the big data analytics that are going on above it. Some of the problem has to do with badly written application code, but some of it has to do with the infrastructure itself. A lot of companies are throwing more money at additional server clusters, direct-attached storage and additional software tools, all with the intention of improving the speeds and feeds of the Hadoop infrastructure.

And, of course, management of the infrastructure is a bear. Hadoop infrastructure management is something the Apache folks are trying to address with a technology they're calling ZooKeeper, and something a number of other vendors are trying to address with the custom-built, build-around products they're offering. The problem is that there isn't a really good management paradigm right now for Hadoop, and to keep it all up and running is a pain in the butt.

Forbes magazine did an article a little while ago that expressed another big concern I'll share with you: Hadoop is all about the infrastructure that may underpin a big data project. It basically concerns itself with how data is processed. Now business folks don't understand processing, and they couldn't give a hoot about how you process big data. They just want the business entitlement, and they want it fast. The writer of this article correctly observed that Hadoop may be great for processing data at scale, which is its claim; however, it is in no way optimized to give you quick ad-hoc analysis or real-time analytics. So it doesn't serve the business process; all it does is perform a certain valuable function underneath, and it's just one way to host all the data.

That gets to the heart of the matter, and the real question ends up being: What are we going to use big data for? And I'm not sure anybody knows that yet, except for folks in marketing who just want to use it to target their products or services more specifically to a particular customer.
