Hadoop can be a beneficial tool for big data environments, but according to John Webster, a senior partner at Boulder, CO-based Evaluator Group Inc., a lot of the criticism of it stems from a lack of understanding of the uses for Hadoop. In the first part of this two-part podcast, Webster explains Hadoop's role in data storage, whether it can be used as an alternative to object storage and what needs to change to spread Hadoop adoption. Listen to the podcast or read the transcript below.
What do people mean when they talk about Hadoop and the big data lake?
John Webster: When I hear big data lake, I think of a big, scalable place where you can put all kinds of things and retrieve them when you need them. That's typically the sense in which I hear traditional data warehouse vendors use the term in reference to Hadoop. What they're trying to say is that the enterprise can use Hadoop as a place to put all kinds of data -- structured, unstructured, file, etc. -- that you want to try to make sense of in the context of the data warehouse. The traditional data warehouse isn't very good at handling that sort of data, so you have this big data lake you shove all this data into, and then you can feed that into a traditional or existing data warehouse. The Hadoop engine essentially becomes what's referred to in data warehouse speak as extract, transform and load [ETL]. So it's a place where you put all kinds of data, and then extract what you need out of Hadoop and into the data warehouse. It also serves as sort of an archival store in some cases.
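[Editor's illustration: The extract, transform and load flow Webster describes can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not how Hadoop or any particular warehouse implements it: the sample records, the field names `user` and `amount`, and the in-memory list standing in for the warehouse are all hypothetical.]

```python
import json

# Hypothetical raw records as they might land in a data lake:
# mixed formats, some of them unusable.
RAW_RECORDS = [
    '{"user": "alice", "amount": "19.99"}',  # JSON record
    'bob,5.00',                              # comma-separated record
    'not a parsable record',                 # junk that should be skipped
    '{"user": "carol", "amount": "7.50"}',
]


def extract(raw):
    """Parse one raw record into a dict, or return None if unrecognized."""
    try:
        obj = json.loads(raw)
        if isinstance(obj, dict):
            return obj
    except json.JSONDecodeError:
        pass
    parts = raw.split(",")
    if len(parts) == 2:
        return {"user": parts[0], "amount": parts[1]}
    return None


def transform(record):
    """Normalize a parsed record into a (user, amount_in_cents) row."""
    try:
        user = record["user"].strip().lower()
        cents = round(float(record["amount"]) * 100)
        return (user, cents)
    except (KeyError, ValueError, AttributeError, TypeError):
        return None


def load(rows, warehouse):
    """Append structured rows to the (stand-in) warehouse table."""
    warehouse.extend(rows)


def run_etl(raw_records):
    """Run extract, transform and load over raw records; return the table."""
    warehouse = []  # stand-in for a structured warehouse table
    rows = []
    for raw in raw_records:
        record = extract(raw)
        if record is None:
            continue  # unparsable data stays behind in the lake
        row = transform(record)
        if row is not None:
            rows.append(row)
    load(rows, warehouse)
    return warehouse


if __name__ == "__main__":
    print(run_etl(RAW_RECORDS))
```

[In this sketch the raw records stay in the lake untouched; only the rows that parse cleanly are moved into the structured table.]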
How does the fact that Hadoop is built on a file system help its role in storage?
Webster: It's not just that it's a file system -- to me it's a distributed file system, which is really the differentiator here. So the idea is that you have a file system that runs on a cluster of nodes and the cluster can be expanded to thousands of nodes, so it's a very elastic file system. You can expand it, contract it, and it spans lots of different servers and computing devices.
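[Editor's illustration: One way to picture a file system that spans a cluster is to split a file into fixed-size blocks and place copies of each block on several nodes. The Python sketch below is a simplified model under stated assumptions. The round-robin placement, the node names, and the tiny block size are hypothetical, and this is not the actual placement policy used by the Hadoop Distributed File System.]

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes, replication: int = 3):
    """Assign each block index to `replication` distinct nodes, round-robin.

    Simplified, hypothetical policy for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(b"x" * 10, block_size=4)
    cluster = ["node1", "node2", "node3", "node4"]  # hypothetical node names
    print(place_blocks(len(blocks), cluster, replication=2))
    # Expanding the cluster just means passing a longer node list.
    print(place_blocks(len(blocks), cluster + ["node5"], replication=2))
```

[Because every block lives on more than one node, the model tolerates the loss of a single node, and growing the cluster only requires adding entries to the node list.]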
Do you think that Hadoop can serve as an object store alternative for large data sets?
Webster: If I was a storage buyer and I was looking at object storage, probably for an archival application, which is one of the proposed applications for large object storage devices -- no. I wouldn't put it in [the object storage] category. Hadoop is something that you can program, and typical object storage devices, or SAN or NAS [network-attached storage] devices, aren't really programmable in the way that we think of a device [that can be programmed to] actually have an application run on top of it. That's what you can do with Hadoop. I think of Hadoop as a storage platform that can run an application as opposed to an object storage platform that essentially serves data to an application.
We hear a lot about Hadoop, but do you think that there are enough Hadoop-based applications to drive wide adoption of it?
Webster: There are two things that are going on here, at least from the enterprise perspective. The first thing is that there are a lot of enterprises out there that, in all likelihood, have a Hadoop cluster somewhere in the organization, either because the marketing department has gone out and bought one from any number of vendors who will support these in the form of an appliance, or because somebody in the IT department put together some servers, downloaded the free software, and brought Hadoop up in a sandbox environment just to see what is going on with it: What's all this stuff about Hadoop? What can we do with it? The marketing department that puts it up as shadow IT knows what to do with it at this point because they're [actively using it]. On the other hand, it's not uncommon to find Hadoop in these sandbox environments where IT is playing with it but they really don't see any applicability for it yet beyond just trying to figure out how to run it, what to do with it and how to program it. So those are the two ways Hadoop winds up in the enterprise environment.
In the case of IT, I think enterprise IT people are using it or playing with it and have yet to really figure out how they can run applications on it. They may have come to the conclusion that they have to build the applications themselves, which is quite often the case, but there are a lot of applications now appearing that will make it easier for the enterprise to say, "Okay, here are some uses for Hadoop; we don't have to hire a small army of data scientists at $300,000 a year to get something out of this platform." So I think we're right on the cusp of the enterprise becoming aware that there are some really valuable applications appearing in the marketplace that will do things with Hadoop and different kinds of data that they just haven't seen before. Some of the types of information that they're getting out of Hadoop are absolutely amazing.