In this SearchStorage.com podcast, associate site editor Ian Crowley speaks with Evaluator Group senior partner John Webster about the progress Hadoop vendors are making toward offering viable enterprise products for managing and processing "big data." Listen to the podcast or read the transcription below to find out more about the state of Hadoop architecture.
Can you give us a brief history of Hadoop, and explain how companies are working with the Apache community to implement new Hadoop capabilities into their products?
Apache Hadoop is an open source project that started at the large social media entities like Google, Amazon, Zynga and Facebook. It spans an arc from that sort of starting point to the enterprise, which is now starting to pick up Hadoop and use it as an alternative to traditional data warehousing. It got its start in 2004, when Google published a paper describing MapReduce, an architecture they had created to support their query engine. Yahoo picked up the paper and started an open source development project under Apache to bring the MapReduce framework forward. They also created a distributed file system to support the MapReduce process, called the Hadoop Distributed File System (HDFS). That's essentially the history of Hadoop.
Right now, it's still version 1.0, maybe 1.1 and Hortonworks might have pushed it to 1.2 -- but it's still in a distribution form available as version 1. Version 2 is due out later this year, and one of the objectives within version 2.0 is to address some of the availability and reliability issues within 1.0 that make Hadoop difficult for the enterprise to put into production. Hadoop 2.0 is sort of referred to, in my mind anyway, as enterprise-ready Hadoop.
We're going from a version of Hadoop that was built by the social media companies and could be supported by the social media companies to a version of Hadoop that is more enterprise-ready. Some of the vendors, particularly in the storage arena, that are working with Hadoop are making it, in whatever way they can, more enterprise-ready -- more acceptable to enterprise users and more manageable by enterprise IT staff.
How viable are cloud-based Hadoop solutions like Amazon's Elastic MapReduce when dealing with enterprise big data analytics?
I think they're very viable. If you're an enterprise IT staff person and you're looking at ways to implement Hadoop and the cloud turns out to be one of them, you are going to want to figure out [what happens] if you stand up a Hadoop instance with Amazon's Elastic MapReduce. How are you going to get the data out once you spin these things up? Now, I'm not saying that you can't. There are ways to do that, but that's one of the first things you need to understand. You might want to understand how to get it out of the cloud once you get it in.
I think it's a very viable way to start if you have some data that you want to run some analytics against. You don't have to download the code, build the clusters, install the code, hire people to run it, etc. You just spin up the instance. There's a growing ecosystem of partners to organizations like Amazon's Elastic MapReduce that will help people with data sources -- getting the data there, getting the analytics personnel, building the applications on top that will present the results of the analytics applications to the users.
To me it's a very viable option, but then it becomes a question of whether or not you want to keep the analytics application running in the cloud or you want to bring it back in house, and it's important to know how you're going to do that. Lots of people are doing this. In fact, they've been doing it for years. There have been analytics in the cloud for longer than there's been Hadoop.
What about using Hadoop with shared storage?
There are a number of different ways to look at this. Until very recently, there hasn't been much interest in doing this among the social media players (Facebook, Google, and the original creators and users of Hadoop), but there's a lot of interest now among enterprise users in figuring out how they can integrate the storage services they're familiar with into a Hadoop storage architecture.
They're also looking at Hadoop in terms of efficiency. As these clusters grow, they can be very inefficient users of computing resources in particular. Typically, you grow a cluster to take on more and more data. In order to do that, you add nodes. The nodes come along with compute capacity as well. It's not uncommon to see utilization of processing resources as low as 10%. So you've got an awful lot of processing resources that are going unused because you've grown the cluster to take on more and more storage. One of the reasons you might want to use shared storage is to disaggregate storage capacity from processing capacity. So you can grow storage capacity without adding processing capacity, which you may not need.
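[Editor's illustration: the sizing argument above can be sketched with some simple arithmetic. Every figure below is hypothetical and chosen only to reproduce the roughly 10% utilization mentioned in the interview; the function names are made up for this example, and the replication factor of 3 is the usual HDFS default, assumed here rather than stated by the speaker.]

```python
import math

def nodes_for_storage(data_tb, tb_per_node, replication=3):
    """Nodes needed to hold data_tb when every block is stored `replication` times."""
    return math.ceil(data_tb * replication / tb_per_node)

def cpu_utilization(busy_cores, nodes, cores_per_node):
    """Fraction of the cluster's cores that the workload keeps busy."""
    return busy_cores / (nodes * cores_per_node)

# Hypothetical cluster: 400 TB of data, 12 TB of disk and 16 cores per node,
# and an analytics workload that keeps about 160 cores busy.
DATA_TB = 400
TB_PER_NODE = 12
CORES_PER_NODE = 16
BUSY_CORES = 160

# Coupled design: the node count is driven by storage, so compute comes along with it.
coupled_nodes = nodes_for_storage(DATA_TB, TB_PER_NODE)
coupled_util = cpu_utilization(BUSY_CORES, coupled_nodes, CORES_PER_NODE)

# Disaggregated design: storage lives on a shared array, so the compute tier
# is sized for the workload alone.
compute_nodes = math.ceil(BUSY_CORES / CORES_PER_NODE)
compute_util = cpu_utilization(BUSY_CORES, compute_nodes, CORES_PER_NODE)

print(f"coupled: {coupled_nodes} nodes, CPU utilization {coupled_util:.0%}")
print(f"disaggregated: {compute_nodes} compute nodes, CPU utilization {compute_util:.0%}")
```

[With these made-up numbers, the coupled cluster needs ten times as many servers as the workload's compute demand alone would justify. A real sizing exercise would also have to account for the cost and throughput of the shared storage itself, which this sketch ignores.]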
So it sounds like combining shared storage and Hadoop makes things more complex and expensive. With the added expenses associated with Hadoop distributions, how do you see partnerships like Hortonworks-Microsoft and MapR-EMC shaking out?
Let me respond to your previous comment there. Sure, there are some tradeoffs to consider -- you might be spending extra money on the hardware side or the infrastructure side, but you could be saving money in terms of management costs. As these clusters grow, they typically run in some sort of failure mode. To keep them running, you have to employ people. So you can spend money on infrastructure in order to reduce the number of staff that you need to keep these things up and running and managed. There are some tradeoffs here that are worth considering. The other one is uptime of the cluster. If you've got a cluster that's now in production that users are depending on regularly and the cluster is unavailable for some period of time, that's going to cost you in terms of productivity and lost revenue. Enterprises have typically looked at tradeoffs like that and have been more willing to spend money on big data infrastructure to gain return on investment. Return on investment is typically in user productivity or simplified management.
But partnerships are critical here. You've got a group of base distributions and vendors that are pushing forward versions of the open source distributions of Apache Hadoop. Those are Cloudera, Hortonworks and MapR, which I would argue has its own proprietary distribution. I'm seeing all the major vendors form some sort of relationship with any -- and all -- of those three. You mentioned Hortonworks and Microsoft. Specifically, what's going on there is Hortonworks is figuring out ways to integrate Microsoft applications with Apache Hadoop. They're also working on running Hadoop under Hyper-V and virtual machines.
EMC and MapR -- MapR offers what they advertise as an enterprise version of Hadoop and obviously EMC has a customer base that's very heavily dominated by large enterprise users. So they're going to be interested in an enterprise version of Hadoop from vendors that they know and understand and can work with. [The relationship] is also around an appliance, and appliance versions of Hadoop are also an interesting way to approach this. The appliance being the entire framework -- servers, networks, storage integrated with Hadoop, all installed and supported by a single vendor. Those have become popular as well.
I mentioned NetApp and Cloudera. IBM has relationships with Cloudera and Hortonworks. The vendors who have an enterprise user constituency are seeing Hadoop enter the enterprise, and are building relationships with the developers of Hadoop. Through those relationships, the vendors gain an understanding of the platform and how it's progressing forward, and companies like Cloudera, MapR and Hortonworks get a better understanding of what the requirements are for Hadoop.
BIO: John Webster is a senior partner at Evaluator Group Inc., where he contributes to the firm's ongoing research into data storage technologies, including hardware, software and services management.