
Can I run Hadoop analytics without using HDFS?

According to analyst George Crump, you might want to think about going with a non-traditional Hadoop architecture.

The short answer is that you must have the Hadoop Distributed File System, or storage that presents an HDFS-compatible interface, to perform Hadoop analytics. However, IT professionals who ask this question typically want to know whether the storage resource has to be direct-attached, which is the traditional Hadoop design. The answer to that question is no, and there are some compelling reasons not to follow that conventional design.

What are Hadoop analytics?

Hadoop is an environment used for business analytics processing. It applies massive amounts of compute resources to very large unstructured data sets. This data can come from a variety of sources, but one of the most common is sensor data generated as part of the Internet of Things. For its analytics processing to be of value, Hadoop has to process these data sets rapidly, and it accomplishes this with the Hadoop Distributed File System (HDFS). HDFS spreads the data across the nodes of the cluster, so Hadoop can essentially move the compute to the data instead of transferring the data to the compute.
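The idea of moving the compute to the data can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Hadoop's actual scheduler: the block names, node names and the `assign_tasks` helper are all hypothetical and exist only for illustration.

```python
# Toy model of data-local task placement; not Hadoop's real scheduler.
# Hypothetical map of which nodes hold a copy of each data block.
block_locations = {
    "block-1": ["node-a", "node-b"],
    "block-2": ["node-b", "node-c"],
    "block-3": ["node-c", "node-a"],
}


def assign_tasks(block_locations, free_slots):
    """Place each block's task on a node that already stores the block.

    free_slots maps node name -> number of tasks the node can still accept.
    Returns a dict of block -> (node, is_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not mutated
    assignments = {}
    for block, holders in block_locations.items():
        # Prefer a node that holds the data and still has a free slot.
        node = next((n for n in holders if slots.get(n, 0) > 0), None)
        is_local = node is not None
        if node is None:
            # Fall back to any node with capacity; the data must then
            # travel over the network to reach the compute.
            node = next((n for n, s in slots.items() if s > 0), None)
            if node is None:
                raise RuntimeError("no free slots in the cluster")
        slots[node] -= 1
        assignments[block] = (node, is_local)
    return assignments


plan = assign_tasks(block_locations, {"node-a": 1, "node-b": 1, "node-c": 1})
for block, (node, is_local) in plan.items():
    print(block, "->", node, "(local)" if is_local else "(remote read)")
```

With one free slot per node, every task in this example lands on a node that already holds its block, so no data has to cross the network.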

Most Hadoop environments consist of a cluster of commodity servers, all with local storage. Data is loaded onto these nodes, and each node processes the portion of the data set it holds. This processing follows the MapReduce model: each node runs the map step against its local data based on the request, and the per-node results are then consolidated in the reduce step. A master node coordinates the cluster and stores all the metadata associated with cluster management.
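The pattern of per-node processing followed by consolidation can be illustrated with a small, single-process Python sketch. It is not Hadoop code: the sensor readings, node names and the `map_phase` and `reduce_phase` helpers are assumptions made for the example, and the map step here also pre-aggregates locally, much as a combiner would.

```python
from collections import defaultdict

# Hypothetical sensor readings, split the way they might sit on three nodes.
node_data = {
    "node-a": [("sensor-1", 20.0), ("sensor-2", 31.0)],
    "node-b": [("sensor-1", 22.0), ("sensor-2", 29.0)],
    "node-c": [("sensor-1", 24.0)],
}


def map_phase(records):
    """Run against one node's local records: emit sensor -> (sum, count)."""
    partial = defaultdict(lambda: [0.0, 0])
    for sensor, value in records:
        partial[sensor][0] += value
        partial[sensor][1] += 1
    return {sensor: tuple(acc) for sensor, acc in partial.items()}


def reduce_phase(partials):
    """Consolidate the per-node partial results into one average per sensor."""
    totals = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for sensor, (total, count) in partial.items():
            totals[sensor][0] += total
            totals[sensor][1] += count
    return {sensor: total / count for sensor, (total, count) in totals.items()}


# Each node processes only its own data; only the small partials are combined.
partials = [map_phase(records) for records in node_data.values()]
averages = reduce_phase(partials)
print(averages)  # {'sensor-1': 22.0, 'sensor-2': 30.0}
```

Only the compact per-node summaries move between steps, which is why this model scales to data sets too large to ship to a single machine.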

Hadoop storage alternatives

Alternatives to the traditional Hadoop storage architecture use a shared storage environment to which the compute nodes connect. Vendors that provide this option have either built their own HDFS-compatible plug-in or used the Amazon Simple Storage Service (S3) interface that Hadoop provides.

Hadoop's S3 interface is a native file system for reading and writing regular files on Amazon S3 cloud storage. Many object storage systems support the S3 interface and, as a result, can support a Hadoop infrastructure running on a local private cloud instead of in the Amazon cloud. The advantage of this file system is that Hadoop can access files that were written with other tools or Internet-connected sensors. Conversely, other applications can access files written using Hadoop.
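As a rough illustration of pointing Hadoop at an S3-compatible object store, the Python sketch below renders a minimal core-site.xml fragment. The article does not name a specific connector, so this assumes Apache Hadoop's S3A connector and its `fs.s3a.*` properties. The bucket name and endpoint URL are hypothetical placeholders, the `build_core_site` helper is invented for the example, and credentials are deliberately left out.

```python
import xml.etree.ElementTree as ET


def build_core_site(properties):
    """Render a dict of Hadoop properties as core-site.xml text."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Assumed S3A settings; the bucket and endpoint below are placeholders.
s3a_properties = {
    # Use the object store bucket as the default file system.
    "fs.defaultFS": "s3a://analytics-bucket",
    # Point the connector at a private S3-compatible endpoint.
    "fs.s3a.endpoint": "https://objects.example.internal",
    # Many private object stores expect path-style bucket addressing.
    "fs.s3a.path.style.access": "true",
}

core_site_xml = build_core_site(s3a_properties)
print(core_site_xml)
```

A real deployment would also need credentials and the connector libraries on the classpath, and the exact settings depend on the Hadoop version and the storage vendor.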

Using a shared storage infrastructure to store Hadoop data has several advantages, including more efficient data protection, multi-application access to the data store and better protection of the Hadoop master node.
