Dealing with big data: The storage implications


This article can also be found in the Premium Editorial Download "Storage magazine: Data archiving in the cloud."

Hadoop isn’t perfect. Clustered file systems are complex, and while much of this complexity is hidden from the HDFS admin, it can take time to get a Hadoop cluster up and running efficiently. Additionally, within HDFS, the NameNode, which holds the metadata that tracks where all the data lives, is a single point of failure in the current release of Apache Hadoop; fixing that is at the top of the list for the next major release. Data protection is up to the admin to control: a data replication setting determines how many copies of each data file are kept in the cluster. The default setting is 3, which can mean provisioning 3x the required usable capacity in raw storage. And that only protects data within the local cluster; backup and remote disaster recovery (DR) need to be handled outside of the current versions of Hadoop. There also isn’t a large body of trained Hadoop professionals on the market; while firms like Cloudera, EMC and MapR are doing a good job on the education front, it’ll take time to build a trained workforce. This last point shouldn’t be taken lightly. Recent studies show that projects planning to leverage contractors/consultants should budget as much as $250,000 per developer per year.
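To make the replication arithmetic concrete, here is a minimal Python sketch. The function name `raw_capacity_tb` and the 100 TB figure are illustrative assumptions, not anything from Hadoop itself, and the calculation deliberately ignores compression, temporary job space and file system overhead.

```python
def raw_capacity_tb(usable_tb: float, replication_factor: int = 3) -> float:
    """Estimate the raw disk capacity needed to store `usable_tb` of data
    when every block is kept `replication_factor` times in the cluster.

    Hypothetical helper for illustration only; it is not part of any
    Hadoop API and ignores compression and scratch space.
    """
    if usable_tb < 0:
        raise ValueError("usable_tb must be non-negative")
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return usable_tb * replication_factor


if __name__ == "__main__":
    usable = 100  # assumed data set size in TB
    for factor in (1, 2, 3):
        raw = raw_capacity_tb(usable, factor)
        print(f"replication={factor}: {raw} TB raw for {usable} TB usable")
```

With the default setting of 3, the assumed 100 TB data set needs roughly 300 TB of raw disk, and that still covers only copies inside the local cluster.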

Big data, bigger truth

This laundry list of shortcomings, combined with the potential commercial analytics market opportunity, is driving big storage companies like EMC, IBM and NetApp to look at the big data opportunity. Each company has introduced (or will, you can count on it) storage systems designed for Hadoop environments that help users cover the manageability, scalability and data protection angles that HDFS lacks. Most offer a replacement for the HDFS storage layer with open interfaces (such as NFS and CIFS), while others provide their own version of a MapReduce framework that performs better than the open source distribution. Some offer features that fill in the open source HDFS gaps, like the ability to share data with other apps via standard NFS and CIFS interfaces or much better data protection and DR capabilities.

NetApp is actually taking a radically different approach from most vendors. They’re embracing the open Hadoop standard and the use of data nodes with DAS. Instead of using their own file system with a wrapper for Hadoop, they’re turbo-charging the DAS with SAS-connected JBOD based on the low end of the Engenio platform. And for the NameNodes they’re using an NFS-attached FAS box to provide a quick recovery from a NameNode failure. It’s a “best of both worlds” hybrid approach to the problem.

Whether the market will pay a premium for the better availability and broader potential application leverage remains to be seen; these are still early days.

Big data is a reality, and not all big data is created equal: different types of big data need different storage approaches. Even if you have a big data problem and are hitting the barriers that indicate you need to do something differently, the best way to talk to vendors about your requirements is to cut right through the fluff and not talk about big data at all. Instead, talk about the business problem and the use cases, which will ultimately narrow the spectrum to a specific set of workload characteristics. The right storage approach will then quickly become evident.

BIO: Terri McClure is a senior storage analyst at Enterprise Strategy Group, Milford, Mass. 

This was first published in May 2012
