While Hadoop remains a popular topic in discussions about big data environments, the technology also garners a decent amount of criticism. Hadoop is a complicated technology that requires a specific set of skills, and IT pros must also be aware of downsides such as single points of failure and increased capacity requirements. But, according to John Webster, a senior partner at Boulder, Colo.-based Evaluator Group Inc., the benefits of Hadoop are well worth the speed bumps. In the second part of this two-part podcast, Webster describes Hadoop benefits and why this topic is so widespread among big data administrators. Listen to the podcast or read the transcript below.
What are some of the main problems with Hadoop?
John Webster: Aside from the lack of understanding of what it is and how to use it -- which I think will change fairly quickly -- people point to single points of failure in Hadoop. There are two different types of nodes: NameNodes and DataNodes. If the NameNode goes down, the cluster essentially goes down, and that's been identified as a single point of failure. But the Apache Software Foundation, which maintains the open source version of Hadoop, is addressing that problem. We've got a failover mechanism now in the latest version of Hadoop that's in beta. And then there are some other alternative distributions out there that can do active-active failover for NameNodes. So that's coming along and [the single point-of-failure] problem is getting solved.
One of the things enterprise IT people look at and shake their heads about in disbelief is the fact that Hadoop by default makes three full copies of data on ingest. So you take a file, you write it to disk and then it's replicated twice: once within the rack that serves a number of nodes in the cluster, and once in a copy that can sit on a different rack. So you have three full copies of data, one primary and two fallback, and that adds up to a lot of data in Hadoop. Just think: Every time you load a file you're multiplying its capacity requirement by three. There is no concept of RAID, for example, in Hadoop. So the copies [with Hadoop] are there for failure mode, so you can fail over to one of those copies. That's one of the things that enterprise IT has difficulty understanding.
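To make the capacity arithmetic concrete, here is a minimal Python sketch of the threefold multiplication Webster describes. The function names and sample sizes are hypothetical and chosen only for illustration; the default of three copies comes from the interview and corresponds to the HDFS `dfs.replication` setting. The sketch ignores compression, temporary job output and file system overhead.

```python
# Sketch only: raw disk needed when every file is stored as N full copies.
# Function names and example sizes are illustrative assumptions.

def raw_capacity_tb(logical_tb: float, replication_factor: int = 3) -> float:
    """Raw disk (TB) needed to hold logical_tb of data at a given replication factor."""
    if logical_tb < 0:
        raise ValueError("logical_tb must be non-negative")
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return logical_tb * replication_factor


def usable_capacity_tb(raw_tb: float, replication_factor: int = 3) -> float:
    """Logical data (TB) that fits on raw_tb of disk at a given replication factor."""
    if raw_tb < 0:
        raise ValueError("raw_tb must be non-negative")
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return raw_tb / replication_factor


if __name__ == "__main__":
    for logical in (1, 10, 100):
        raw = raw_capacity_tb(logical)
        print(f"{logical} TB of data needs {raw} TB of raw disk at 3 copies")
```

Read the other way around, a cluster with 300 TB of raw disk holds roughly 100 TB of data at the default setting, which is the gap that tends to surprise enterprise storage teams used to RAID overheads.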
The other is that you add new storage capacity to a Hadoop cluster by adding DataNodes. DataNodes also have a certain amount of CPU, but you're really adding the DataNodes for capacity as opposed to CPU power. So if you scale up to a 500- or maybe 1,000-node cluster, utilization of CPU actually goes down to single digits. The CPU in any given DataNode may be only 4% or 5% utilized in some of these very large clusters. So the enterprise sees that as server sprawl and wonders if there's a way to disaggregate the addition of storage capacity from CPU capacity so you can scale the two independently. And there are ways you can do that as well.
IT pros are aware of these difficulties, so why do we still hear so much about Hadoop? Has it improved since it first came about?
Webster: It's absolutely improved, and it continues to be improved and will be improved. There's significant demand for Hadoop in the enterprise and that's because it can do things that a traditional data warehouse or computing structure can't. It's got a lot of performance at a large scale and at a low cost. [Better performance, scalability and low cost are] three things the enterprise would love to have; they just have to work the bugs out. But once that happens, I think you'll see it really proliferate in production environments. Again, it's a question of what kind of application you're going to run on it, but I think all of that will get worked out in the next few years.
So bottom line: How necessary is Hadoop in big data environments, and if you didn't want to use it, what might some alternatives be?
Webster: There are, and have been, some Hadoop alternatives that live in the SQL community: MySQL, NoSQL, NewSQL. If you program parallel processing clusters in those languages and use those databases, you can do some very scalable analytics as an alternative to Hadoop. So there are definitely alternatives out there.
But just to give an example of the power of Hadoop, I was talking this morning to a financial services company that has five divisions, and each one of those divisions has its own set of records on [the] 32,000 companies they follow. What this company wanted to do was go through all of the records it had on all 32,000 companies that it watches to look for signs of trouble, red flags, both in financial results and in text data that is submitted to the FCC -- lots of different kinds of data -- structured, unstructured, et cetera. They tried to do this on their traditional computing platforms and came to the conclusion that doing the kinds of things they wanted to do would take months.
So they put up a Hadoop cluster, and not a very large one, and found they could do what they wanted to do in about 30 minutes on three terabytes of compressed data, which is pretty powerful. To go from an application that you think would take months on a standard platform, down to 30 minutes, that's like make or break. That means [administrators say], 'We can't do this application, but now we can with Hadoop.' That's what we're talking about. We're talking about people doing things they just couldn't do before.