This article can also be found in the Premium Editorial Download "Storage magazine: Data archiving in the cloud."
What about Hadoop?
No column on big data would be complete without a discussion of Hadoop. The ability to accelerate an analytics cycle (cutting it from weeks to hours or minutes) without exorbitant costs is driving enterprises to look at Hadoop, an open source technology that’s often run on commodity servers with inexpensive direct-attached storage (DAS).
Hadoop is used to process very large amounts of data and consists of two parts: MapReduce and the Hadoop Distributed File System (HDFS). Put (very) simply, MapReduce handles the job of managing compute tasks, while HDFS automatically manages where data is stored on the compute cluster. When a compute job is initiated, MapReduce takes the job and splits it into subtasks that can be run in parallel. It queries HDFS to see where the data required to complete each subtask lives, and then sends each subtask out to run on the compute node where its data is stored. In essence, it’s sending the compute tasks to the data. The results of each subtask are sent back to the MapReduce master, which collates and delivers the final results.
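To make that flow concrete, here is a minimal toy sketch in plain Python. It is not Hadoop code and uses no Hadoop API: the node names, the BLOCKS_BY_NODE dictionary and the functions map_subtask and run_job are all hypothetical, and a thread pool merely stands in for a cluster. It only illustrates the idea of one subtask per node, run in parallel against the data that node already holds, with a "master" step that merges the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster layout: which node holds which blocks of text.
# In Hadoop, HDFS tracks this placement; here it is just a dictionary.
BLOCKS_BY_NODE = {
    "node-1": ["big data big storage"],
    "node-2": ["data moves slowly"],
    "node-3": ["send compute to data"],
}


def map_subtask(node, blocks):
    """Count words in the blocks stored on one node (the work goes to the data)."""
    counts = Counter()
    for block in blocks:
        counts.update(block.split())
    return node, counts


def run_job(blocks_by_node):
    """Create one subtask per node, run them in parallel, then collate."""
    with ThreadPoolExecutor(max_workers=len(blocks_by_node)) as pool:
        futures = [
            pool.submit(map_subtask, node, blocks)
            for node, blocks in blocks_by_node.items()
        ]
        partials = [future.result() for future in futures]

    total = Counter()
    for _node, counts in partials:
        total.update(counts)  # the "master" merges the partial results
    return total


word_counts = run_job(BLOCKS_BY_NODE)
print(word_counts["data"])  # 3
```

In a real cluster the subtasks run on separate machines and the framework handles scheduling and failures; the sketch leaves all of that out.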
Now compare that with a traditional system, which would need a big, expensive server with a lot of horsepower attached to a big, expensive storage array to complete the task. It would read all the required data, run the analysis and write the results in a fairly serial manner, which, at these volumes of data, takes a lot longer than the Hadoop-based MapReduce approach.
The differences can be summed up in a simple analogy. Let’s say 20 people are in a grocery store and they’re all processed through the same cash register line. If each person buys $200 worth of groceries and takes two minutes to have their purchases scanned and totaled, $4,000 is collected in 40 minutes by the star cashier hired to keep up. Here’s the Hadoop version of the scenario: Ten register lines are staffed by low-cost, part-time high school students who take 50% more time to finish each separate transaction (three minutes). It now takes six minutes to ring up the same 20 people but you still get $4,000 when they hand in their cash drawers. From a business standpoint, what’s the impact of reducing a job from 40 minutes to six minutes? How many more jobs can be run in that 34 minutes you just gained? How much more insight can you get and how much quicker can you react to market trends? This is equivalent to business-side colleagues not having to wait long for the results of analytical queries.
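The arithmetic behind the analogy can be checked with a few lines of Python. This is only a sketch of the numbers given above; the function name checkout_minutes is made up for illustration, and it assumes customers are spread evenly across the open registers.

```python
import math


def checkout_minutes(customers, registers, minutes_per_customer):
    """Elapsed time when customers are spread evenly across registers."""
    customers_per_register = math.ceil(customers / registers)
    return customers_per_register * minutes_per_customer


single = checkout_minutes(20, 1, 2)     # one star cashier, 2 minutes each
parallel = checkout_minutes(20, 10, 3)  # ten slower cashiers, 3 minutes each
revenue = 20 * 200                      # same takings either way

print(single, parallel, single - parallel, revenue)  # 40 6 34 4000
```

Note that the ten part-timers do more total work (20 x 3 = 60 cashier-minutes versus 40), yet the elapsed time still drops from 40 minutes to six. That is the trade the analogy describes: more, cheaper and individually slower workers in exchange for a much shorter wait.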
This was first published in May 2012