Storage considerations for a Hadoop implementation

Storage Switzerland analyst Colm Keegan explains how to determine whether a SAN or NAS should be used as primary storage with Hadoop.

Can SAN and NAS storage systems be used as the primary storage layer with a Hadoop implementation?

The short answer is yes. The longer answer is that, depending on the size of your Hadoop implementation and the number of nodes, it may not make sense from a cost perspective to use storage-area network (SAN) or network-attached storage (NAS) systems as the primary storage layer.

A good place to start is by understanding what Hadoop is. Simply put, it's an Apache open source project that lets users run intensive analytics on structured and unstructured data across hundreds or thousands of nodes, whether local or geographically dispersed. Hadoop was designed to ingest mammoth amounts of unstructured data and, through its distributed file system, the Hadoop Distributed File System (HDFS), spread workloads across a vast network of independent compute nodes that rapidly map, sort and categorize data to support big data analytical queries.
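The map, sort and categorize flow described above can be illustrated with a minimal, single-process sketch. It does not use any Hadoop API; the function names (`map_phase`, `shuffle_and_sort`, `reduce_phase`, `run_job`) are hypothetical and chosen only for illustration, and a word count stands in for a real analytics job.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle_and_sort(pairs):
    """Group values by key, then return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Collapse all values collected for one key into a single count."""
    return key, sum(values)


def run_job(records):
    # On a real cluster, each batch of records would be mapped on the
    # node that holds it; here everything runs in one process.
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return [reduce_phase(key, values) for key, values in shuffle_and_sort(mapped)]


if __name__ == "__main__":
    for word, count in run_job(["big data", "Big storage"]):
        print(word, count)
```

The point of the sketch is the shape of the work: the map step touches each record independently, which is what allows it to be spread across many nodes.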

Hadoop was also designed to work natively with the internal disks in each of the independent compute nodes in its cluster, both for cost efficiency and to ensure that data is always available locally to the node that processes it. Jobs are managed and delegated across the cluster, where data is parsed, classified and stored on local disk. By default, each block of data is written once to local disk and replicated twice to other nodes for redundancy, for three copies in total. The replicas are readable, so they can also be used for processing tasks.
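The replication described above (one local copy plus two replicas) has a direct effect on how much raw disk a cluster needs. The following is a rough sizing sketch, assuming a replication factor of three and ignoring working space, compression and operating system overhead. The helper names and the example figures are hypothetical.

```python
import math


def raw_capacity_tb(dataset_tb, replication=3):
    """Raw disk needed to hold a dataset when every block is stored `replication` times."""
    if dataset_tb < 0 or replication < 1:
        raise ValueError("dataset_tb must be >= 0 and replication must be >= 1")
    return dataset_tb * replication


def nodes_needed(dataset_tb, disk_per_node_tb, replication=3):
    """Minimum node count whose combined internal disk covers the raw capacity."""
    if disk_per_node_tb <= 0:
        raise ValueError("disk_per_node_tb must be positive")
    return math.ceil(raw_capacity_tb(dataset_tb, replication) / disk_per_node_tb)


if __name__ == "__main__":
    # Placeholder inputs: a 100 TB dataset on nodes with 24 TB of internal disk each.
    print(raw_capacity_tb(100))   # 300
    print(nodes_needed(100, 24))  # 13
```

The same arithmetic applies if the blocks land on SAN or NAS volumes instead of internal disks: the replicas still consume three times the dataset size unless the replication factor is lowered.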

To answer the question, start from the fact that Hadoop is disk-agnostic: SAN and NAS resources can be used as the primary storage layer to service Hadoop workloads. The natural follow-up question is whether SAN and NAS storage will be the most cost-effective way to deploy Hadoop. If a Hadoop implementation will be confined to one or two locations, managing a centralized storage resource can make sense, especially if an existing array has already been depreciated.

But if your firm's Hadoop implementation will consist of many hundreds or thousands of compute nodes across multiple data centers, deploying multiple SAN or NAS systems may prove too costly. The bottom line is that Hadoop's disk-agnostic design, together with its native support for internal disks, gives data center managers a lot of options: they can use existing SAN or NAS assets, low-cost commodity disks, or a combination of the two to support big data analytic workloads.
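The cost trade-off above can be framed as simple arithmetic. The sketch below is a deliberately crude model, not a pricing tool: it assumes local storage cost scales with the number of nodes, while shared storage cost scales with the number of sites plus a per-node connection cost. Every function name and every figure in the usage example is a hypothetical placeholder, so substitute your own quotes.

```python
def local_disk_cost(nodes, disks_per_node, cost_per_disk):
    """Total cost of equipping every node with commodity internal disks."""
    return nodes * disks_per_node * cost_per_disk


def shared_storage_cost(sites, cost_per_array, nodes, cost_per_node_connection):
    """Total cost of one SAN or NAS array per site plus connecting each node to it."""
    return sites * cost_per_array + nodes * cost_per_node_connection


def cheaper_option(local_cost, shared_cost):
    """Name the lower-cost option, or 'equal' when the totals match."""
    if local_cost < shared_cost:
        return "local disk"
    if shared_cost < local_cost:
        return "shared storage"
    return "equal"


if __name__ == "__main__":
    # Placeholder scenario: one site, 10 nodes, an already-depreciated
    # array modeled as costing nothing to acquire.
    local = local_disk_cost(nodes=10, disks_per_node=4, cost_per_disk=100)
    shared = shared_storage_cost(sites=1, cost_per_array=0,
                                 nodes=10, cost_per_node_connection=50)
    print(cheaper_option(local, shared))
```

Setting `cost_per_array` to zero is one way to model the depreciated-array case mentioned earlier; raising `sites` and `nodes` shows how quickly the shared-storage total grows in a multi-data-center deployment.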

About the expert
Colm Keegan is an analyst at Texas-based firm Storage Switzerland LLC and has been in the IT industry for 22 years. His focus is on enterprise storage, backup and disaster recovery.
