This article can also be found in the Premium Editorial Download "Storage magazine: Data archiving in the cloud."
Hadoop isn’t perfect. Clustered file systems are complex, and while much of this complexity is hidden from the HDFS admin, it can take time to get a Hadoop cluster up and running efficiently. Additionally, within HDFS, the NameNode, which holds the metadata that tracks where all the data lives, is a single point of failure in the current release of Apache Hadoop; fixing that is at the top of the list for the next major release. Data protection is up to the admin to control: a data replication setting determines how many copies of each data file are kept in the cluster. The default setting is 3, which means raw capacity must be three times the required usable capacity. And that only protects data within the local cluster; backup and remote disaster recovery (DR) need to be handled outside of the current versions of Hadoop. There also isn’t a large body of trained Hadoop professionals on the market; while firms like Cloudera, EMC and MapR are doing a good job on the education front, it’ll take time to build a trained workforce. This last point shouldn’t be taken lightly. Recent studies show that projects planning to use contractors or consultants should budget as much as $250,000 per developer per year.
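To put rough numbers on the replication overhead described above, here is a minimal Python sketch. It assumes replication is the only source of overhead (no scratch space, compression or growth headroom), and the function names are hypothetical, chosen only for illustration. In HDFS itself, the replication setting is the `dfs.replication` property.

```python
def raw_capacity_tb(usable_tb: float, replication: int = 3) -> float:
    """Raw disk capacity (TB) needed to store usable_tb of data.

    Assumption: every file is kept `replication` times and nothing
    else consumes space. Real clusters also need temporary space.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return usable_tb * replication


def usable_capacity_tb(raw_tb: float, replication: int = 3) -> float:
    """Usable capacity (TB) left from raw_tb of disk after replication."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return raw_tb / replication


if __name__ == "__main__":
    # Compare the default setting of 3 with lower replication factors.
    for factor in (1, 2, 3):
        needed = raw_capacity_tb(100, factor)
        print(f"replication={factor}: 100 TB usable needs {needed} TB raw")
```

Seen this way, the default setting of 3 turns a 100 TB data set into a 300 TB purchase, before any backup or DR copies outside the cluster are counted.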
Big data, bigger truth
This laundry list of shortcomings, combined with the potential commercial analytics market opportunity, is driving big storage companies like EMC, IBM and NetApp to look at the big data opportunity. Each company
NetApp is actually taking a radically different approach from most vendors. They’re embracing the open Hadoop standard and the use of data nodes with DAS. Instead of using their own file system with a wrapper for Hadoop, they’re turbo-charging the DAS with SAS-connected JBOD based on the low end of the Engenio platform. And for the NameNodes they’re using an NFS-attached FAS box to provide a quick recovery from a NameNode failure. It’s a “best of both worlds” hybrid approach to the problem.
Whether the market will pay a premium for better availability and broader potential application leverage remains to be seen; these are still early days.
Big data is a reality, and not all big data is created equal: different types of big data need different storage approaches. Even if you have a big data problem and are hitting the barriers that indicate you need to do something differently, the best way to talk to vendors about your requirements is to cut right through the fluff and not talk about big data at all. Instead, talk about the business problem and the use cases, which will ultimately narrow the spectrum to a specific set of workload characteristics. The right storage approach will quickly become evident.
This was first published in May 2012