buchachon - Fotolia
The short answer is that you must have Hadoop Distributed File System to perform Hadoop analytics. However, typically when asking this question, IT professionals are really asking if the storage resource has to be direct-attached, which is the traditional Hadoop design. The answer to that question is no, and there are some compelling reasons not to follow that conventional design.
What are Hadoop analytics?
Hadoop is an environment used for business analytics processing. It allows massive amounts of compute resources to process very large unstructured data sets. This data can come from a variety of sources, but one of the most common is data created by sensors as part of the Internet of Things. In order for its analytics processing to be of value, Hadoop has to process these data sets rapidly and it accomplishes this with the Hadoop Distributed File System (HDFS). HDFS essentially moves the compute to the data instead of transferring the data to the compute.
Most Hadoop environments consist of a cluster of commodity servers, all with local storage. Data is loaded onto these nodes and the processing of that data set is done there. This is called the MapReduce function. Once each node processes the data based on the request, those results are sent from each node and then consolidated on a master node. The master node also stores all the metadata associated with cluster management.
Hadoop storage alternatives
Alternatives to the traditional Hadoop storage architecture leverage a shared storage environment that the compute nodes connect to. Vendors that provide this solution have either made their own HDFS-compatible plug-in or leveraged Hadoop's provision of an Amazon Simple Storage Service (S3) interface.
S3 is a native file system for reading and writing files on Amazon Cloud storage. Many object storage systems support this interface, and as a result, can support a Hadoop infrastructure running on a local private cloud instead of in the Amazon Cloud. The advantage of this file system is that Hadoop can access files that were written with other tools or Internet-connected sensors. Conversely, other applications can access files written using Hadoop.
There are several advantages to using a shared storage infrastructure to store Hadoop data, including better, more efficient protection of data, multi-application access to the store and better protection of the Hadoop master node.
Complete guide to Hadoop storage and analytics
Easing issues with Hadoop with private cloud storage
Dig Deeper on Big data storage
Related Q&A from George Crump
Shadow IT means enterprises are at increasing risk of cloud data loss, but providing employees with comparable file sharing apps can help. Continue Reading
Cloud storage doesn't just have to be for backup. According to George Crump, cloud services can make deploying a new application or disaster recovery... Continue Reading
If your IT department has the skills set, OpenStack object or block storage might be a good idea, analyst George Crump said. Continue Reading