Essential Guide

Managing Hadoop projects: What you need to know to succeed

A comprehensive collection of articles, videos and more, hand-picked by our editors

Storage considerations for a Hadoop implementation

Storage Switzerland analyst Colm Keegan explains how to determine whether a SAN or NAS should be used as primary storage with Hadoop.

Can SAN and NAS storage systems be used as the primary storage layer with a Hadoop implementation?

The short answer is yes. The longer answer is that, depending on the size of your Hadoop implementation and the number of nodes, it may not make sense from a cost perspective to use storage-area network (SAN) and network-attached storage (NAS) systems as the primary storage layer.

A good place to start is by understanding what Hadoop is. Simply put, it's an Apache open source project that allows users to perform highly intensive data analytics on structured and unstructured data across hundreds or thousands of nodes, whether local or geographically dispersed. Hadoop was designed to ingest mammoth amounts of unstructured data and, through the Hadoop Distributed File System (HDFS), distribute workloads across a vast network of independent compute nodes to rapidly map, sort and categorize data to facilitate big data analytical queries.

Hadoop was also designed to work natively with the internal disk resources in each of the independent compute nodes within its clustered framework, both for cost efficiency and to ensure that data is always available locally to the processing node. Jobs are managed and delegated across the cluster, where data is parsed, classified and stored on local disk. By default, one copy of each data block is written to local disk and two more copies are replicated to other nodes for redundancy. All copies are readable, so any replica can be used for processing tasks.
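As a point of reference, the triple-copy behavior described above is governed by the HDFS replication factor. A minimal hdfs-site.xml excerpt illustrates it (the property name is standard Hadoop; the value 3 is the stock default, shown here purely for illustration, and clusters backed by a SAN or NAS that already provides redundancy sometimes lower it):

```xml
<!-- hdfs-site.xml excerpt: a minimal sketch, not a complete configuration -->
<configuration>
  <!-- Number of copies HDFS keeps of each block. With the default of 3,
       one copy lands on the local disk of the writing node and two are
       replicated to other nodes for redundancy. -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```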

To answer the question, we have to consider that Hadoop is disk-agnostic, which means SAN and NAS resources can be used as the primary storage layer to service Hadoop workloads. But a natural follow-up question is this: Will SAN and NAS storage be the most cost-effective way to deploy Hadoop? If a Hadoop implementation will be confined to one or two locations, the benefits of managing a centralized storage resource should make sense, especially if an existing array has already been fully depreciated.

But if your firm's Hadoop implementation will consist of many hundreds or thousands of compute nodes across multiple data centers, deploying multiple SAN or NAS systems may prove prohibitively expensive. The bottom line is that Hadoop's native support for internal disk resources gives data center managers a lot of options: they can leverage existing SAN or NAS assets, low-cost commodity disks, or a combination of the two to support big data analytic workloads.
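The back-of-the-envelope math behind that bottom line can be sketched in a few lines of Python. All of the prices, node counts and capacities below are hypothetical and purely for illustration; the point is only that a fixed array cost plus a per-terabyte premium scales very differently from commodity local disk as the node count grows:

```python
def cluster_storage_cost(nodes: int, tb_per_node: int,
                         cost_per_tb: float, base_cost: float = 0.0) -> float:
    """Total storage cost for a cluster: an optional fixed cost (e.g., a
    shared array) plus a per-terabyte cost across all nodes."""
    return base_cost + nodes * tb_per_node * cost_per_tb

# Hypothetical figures: a shared SAN with a fixed acquisition cost and a
# higher per-TB cost, versus commodity local disk with no fixed cost.
san_small = cluster_storage_cost(nodes=20, tb_per_node=4,
                                 cost_per_tb=300, base_cost=50_000)
das_small = cluster_storage_cost(nodes=20, tb_per_node=4, cost_per_tb=100)

san_large = cluster_storage_cost(nodes=2_000, tb_per_node=4,
                                 cost_per_tb=300, base_cost=50_000)
das_large = cluster_storage_cost(nodes=2_000, tb_per_node=4, cost_per_tb=100)

# At 2,000 nodes the per-TB premium dwarfs the fixed cost, which is the
# scaling concern the answer raises for multi-data-center deployments.
```

Note that with these (made-up) numbers the per-terabyte premium, not the fixed array cost, dominates at scale; an already-depreciated array effectively sets base_cost to zero, which is why the one-or-two-location case in the answer can still favor the SAN.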

About the expert
Colm Keegan is an analyst at Texas-based firm Storage Switzerland LLC and has been in the IT industry for 22 years. His focus is on enterprise storage, backup and disaster recovery.

This was last published in May 2013






Join the conversation


How can I utilize the existing SAN storage? Is it going to degrade my cluster performance?
It really depends on what type of load you are going to be placing on the SAN from your server cluster. Ideally, the server nodes should be attached to the SAN over a local Fibre Channel fabric or Gigabit Ethernet network. In the article, I mentioned that in large Hadoop clusters, nodes can be highly distributed throughout an enterprise, and in general you want the storage resource to be located close to the node or cluster. If the cluster has to traverse a wide area network, for example, to perform storage I/O, performance could be negatively impacted. But if your cluster is in the same data center as your SAN, performance should be OK, assuming the SAN is not overburdened serving other application workloads.
Connyank: I'm new to the whole Hadoop and IT world. I see that you mentioned that local SAN arrays are good to go for Hadoop. Does that also apply to completely virtualizing Hadoop with the same local SAN in mind?
Hi Fastidious (BTW, this is Colm Keegan; connyank is my blog handle). Good question. By googling "Hadoop virtualization," I came across multiple articles and white papers that indicate some environments are starting to virtualize Hadoop. In fact, Chuck Hollis has a pretty good blog entry about this very topic.

In short, you can virtualize Hadoop and attach those VMs to a SAN. However, as you increase VM density on the hypervisors attached to the SAN, there is a possibility that you could see some performance degradation. One of the issues with running in a heavily virtualized infrastructure is the generation of multiple random read and write requests to the storage controllers on the SAN. This can cause a lot of I/O queuing and increase storage I/O latency, resulting in degraded application performance. You can design your storage infrastructure to alleviate these conditions by placing flash storage resources either in the SAN or directly in the servers where your most performance-sensitive VMs reside. You can find a number of different articles on how to architect flash into your virtual environment on our website. Thanks for the question!