Hadoop was created to enable organizations to perform massive scale analytics processing across a very large, unstructured...
data set. This data can consist of millions, if not billions, of files that need to be read. To keep costs down and processing performance high, the data and the application reside on the same physical hardware. Doing so eliminates data movement, allows for local processing and enables the use of inexpensive, server-class storage. The Hadoop Distributed File System (HDFS) was developed to manage data across these dispersed nodes. But modern storage architectures now offer a compelling HDFS alternative: object storage. Here are three reasons why an object storage system could be a viable option for Hadoop analytics for your organization.
Reason 1: Object storage can offer better data protection
While HDFS leverages internal, server-class storage, it does make three copies of all data as a part of its standard data protection strategy. So while the use of internal, server-class hard disk drives is less expensive, it may not be as economical as was originally hoped when the capacity need is multiplied by three.
One alternative is to use an object-based storage system that provides Amazon Simple Storage Service protocol access, which Hadoop supports in addition to HDFS. These systems can be software-only, and therefore use commodity servers and server-class storage. But unlike default HDFS, many object storage systems offer erasure coding. This data protection is similar to RAID but more granular, operating at an object or sub-object level, spreading data and parity across nodes in a storage cluster. The result is that a similar or better level of data redundancy can be maintained for an overhead of approximately 25% to 30%, instead of the 200% overhead of the three-way replication that's standard for HDFS.
Reason 2: HDFS exposes the master node
HDFS has one master node and a series of slave nodes. The slave nodes process the data and send the result to the master. The master node also ensures that data replication policies are being maintained as well as general cluster management. If the master node fails, the rest of the cluster cannot be accessed. HDFS only provides limited protection for the master node, so organizations need to take special steps to implement their own high availability on the master node.
With an object storage system, as described above, the master node benefits from the same erasure coding data protection as the slave nodes. Also, all the metadata that the master node maintains to manage the Hadoop cluster can be stored on the centralized object storage system. This allows a slave node, or a stand-by node, to quickly become a master node if the master fails.
Reason 3: HDFS does not allow independent scaling
Hadoop, like any other architecture, will have varying degrees of demand for compute and storage capacity. The problem is that with HDFS, compute power and storage capacity need to scale in lockstep, meaning you can't add one resource without the other.
The most common way this manifests itself is when a Hadoop architecture runs out of storage space, since adding more capacity means adding another node full of hard disk, which also adds more compute power. The alternative is also true, as a Hadoop infrastructure often needs more processing power but has plenty of capacity. Most of the time when a new compute server is bought, it is filled with capacity. The result is that the Hadoop architecture is always wasting money on one resource and may not have enough of another.
Object storage allows for capacity and compute to be scaled independently. Compute nodes can now be 1U or 2U chassis with a solid-state drive for booting. The object storage system can be filled with high-capacity drives to keep the cost per gigabyte at a minimum. More importantly, as the environment evolves, each of these tiers can be scaled independently.
The advantages that HDFS brings to Hadoop are typically low cost and high performance, thanks to the local placement of data. An object storage system that leverages commodity storage can offer similar price savings, especially if erasure coding is used to improve data protection efficiency. High-speed 10 GbE networks are now very affordable, and these should eliminate any performance advantage that HDFS enjoyed by storing data with the compute. Object storage provides a more cost effective, more reliable, and at least equally performing infrastructure, and should be considered a viable HDFS alternative.
When to consider external data storage with Hadoop