Dealing with big data: The storage implications


This article can also be found in the Premium Editorial Download "Storage magazine: Data archiving in the cloud."


What about Hadoop?

No column on big data would be complete without a discussion of Hadoop. The ability to accelerate an analytics cycle (cutting it from weeks to hours or minutes) without exorbitant costs is driving enterprises to look at Hadoop, an open source technology that’s often run on commodity servers with inexpensive direct-attached storage (DAS).

Hadoop is used to process very large amounts of data and consists of two parts: MapReduce and the Hadoop Distributed File System (HDFS). Put (very) simply, MapReduce handles the job of managing compute tasks, while HDFS automatically manages where data is stored on the compute cluster. When a compute job is initiated, MapReduce takes the job and splits it into subtasks that can be run in parallel. It queries HDFS to see where the data required to complete each subtask lives, and then sends each subtask out to run on the compute node where its data is stored. In essence, it’s sending the compute tasks to the data. The results of each subtask are sent back to the MapReduce master, which collates and delivers the final results.
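The flow described above can be sketched in a few lines of Python. This is a toy illustration only, not Hadoop's actual API: the block names, node names, and the `BLOCK_LOCATIONS` and `BLOCK_DATA` tables are hypothetical stand-ins for the placement information HDFS would track, and the "cluster" is simulated in a single process.

```python
from collections import defaultdict

# Hypothetical block-to-node placement, standing in for what HDFS tracks.
BLOCK_LOCATIONS = {
    "block-0": "node-a",
    "block-1": "node-b",
    "block-2": "node-a",
}

# Hypothetical block contents; on a real cluster each node reads its own local disks.
BLOCK_DATA = {
    "block-0": "big data big storage",
    "block-1": "data lives on nodes",
    "block-2": "storage nodes run tasks",
}


def plan_subtasks(blocks):
    """Split a job into one subtask per block, assigned to the node that holds the block."""
    return [(BLOCK_LOCATIONS[block], block) for block in blocks]


def map_subtask(block):
    """Subtask: count the words in a single block (meant to run where the data lives)."""
    counts = defaultdict(int)
    for word in BLOCK_DATA[block].split():
        counts[word] += 1
    return dict(counts)


def collate(partials):
    """Master step: merge the per-block counts into one final result."""
    totals = defaultdict(int)
    for partial in partials:
        for word, n in partial.items():
            totals[word] += n
    return dict(totals)


def run_job(blocks):
    """Plan the subtasks, run each one (sequentially here), and collate the results."""
    plan = plan_subtasks(blocks)
    partials = [map_subtask(block) for _node, block in plan]
    return plan, collate(partials)


plan, totals = run_job(sorted(BLOCK_LOCATIONS))

if __name__ == "__main__":
    for node, block in plan:
        print(f"{block} -> {node}")
    print(totals)
```

In this sketch the subtasks run one after another in a single process; on a real cluster they would run in parallel on the nodes named in the plan, which is where the time savings come from.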

Now compare that with a traditional system, which would need a big, expensive server with a lot of horsepower attached to a big, expensive storage array to complete the task. It would read all the required data, run the analysis and write the results in a fairly serial manner, which, at these volumes of data, takes a lot longer than the Hadoop-based MapReduce job would.

The differences can be summed up in a simple analogy. Let’s say 20 people are in a grocery store and they’re all processed through the same cash register line. If each person buys $200 worth of groceries and takes two minutes to have their purchases scanned and totaled, $4,000 is collected in 40 minutes by the star cashier hired to keep up. Here’s the Hadoop version of the scenario: Ten register lines are staffed by low-cost, part-time high school students who take 50% more time to finish each separate transaction (three minutes). With two customers per line, it now takes six minutes to ring up the same 20 people, but you still get $4,000 when the cashiers hand in their cash drawers. From a business standpoint, what’s the impact of reducing a job from 40 minutes to six minutes? How many more jobs can be run in the 34 minutes you just gained? How much more insight can you get and how much quicker can you react to market trends? This is equivalent to business-side colleagues not having to wait long for the results of analytical queries.
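The arithmetic behind the analogy can be checked with a short Python sketch. The function name `total_minutes` is made up for illustration, and the model assumes customers are spread evenly across registers and that every register works at the same pace.

```python
import math


def total_minutes(customers, registers, minutes_per_customer):
    """Elapsed time until the last customer is done.

    Each register serves its share of customers one after another,
    so the job ends when the busiest register finishes.
    """
    customers_per_register = math.ceil(customers / registers)
    return customers_per_register * minutes_per_customer


# One star cashier: 20 customers at two minutes each.
serial = total_minutes(customers=20, registers=1, minutes_per_customer=2)

# Ten slower cashiers: two customers per line at three minutes each.
parallel = total_minutes(customers=20, registers=10, minutes_per_customer=3)

# Revenue is the same either way: 20 customers at $200 each.
revenue = 20 * 200

if __name__ == "__main__":
    print(serial, parallel, serial - parallel, revenue)
```

Even though each individual transaction is 50% slower, dividing the work across ten lines shortens the overall job, which is the same trade Hadoop makes by using many commodity nodes instead of one fast server.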

This was first published in May 2012
