
HDFS architecture optimizes performance

Hadoop Distributed File System architecture and its associated processes let the technology make better use of storage resources than many of today's enterprise storage arrays.

The Hadoop Distributed File System offers storage performance at large scale and low cost -- three things enterprise arrays have difficulty delivering at the same time.

Apache HDFS is a parallelized, distributed, Java-based file system designed for use in Hadoop clusters, which currently scale to 200 PB and can support single clusters of 4,000 nodes. HDFS architecture supports simultaneous data access from multiple applications and from Apache YARN (Yet Another Resource Negotiator). It is designed to be fault-tolerant, meaning it can withstand disk and node failures without interruption in cluster availability. Failed disks and cluster nodes can be replaced as needed.

The following HDFS architecture features are found in most deployments. Each one aids performance, minimizes workload and helps to keep costs down.

Core Hadoop Distributed File System features

Shared nothing, asymmetrical architecture. The major physical components of a Hadoop cluster are commodity servers with JBOD storage -- typically six to 12 disks in each server -- interconnected by an Ethernet network. InfiniBand is supported and used for performance acceleration.

The internal network is the only major physical component shared among cluster nodes; compute and storage are not shared.

Servers within the cluster are designated either as control nodes, which manage data placement and perform many other tasks, or as DataNodes. With earlier versions of Apache Hadoop, control node failure could result in a cluster outage. This problem has been addressed in recent Apache versions, by vendors offering commercial distributions of Hadoop, and by enterprise storage vendors.

Data locality. The Hadoop Distributed File System minimizes the distance between compute and data by moving compute processes to the data stored on Hadoop DataNodes where the actual processing takes place. This minimizes the load on the cluster's internal network and results in a high aggregate bandwidth.
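The idea can be sketched in a few lines of Python: a scheduler that prefers to run a task on a node already holding the block, so the data never crosses the network. The names (`pick_node`, `block_locations`) are illustrative and not part of any Hadoop API.

```python
# Illustrative data-locality scheduler (not a Hadoop API): prefer a node
# that already stores the block so no data crosses the internal network.

def pick_node(block_id, block_locations, available_nodes):
    """Pick a node to process block_id, preferring data-local nodes."""
    local = [n for n in block_locations.get(block_id, []) if n in available_nodes]
    if local:
        return local[0]            # data-local: compute moves to the data
    return available_nodes[0]      # fallback: remote read over the network

locations = {"blk_1": ["node2", "node5", "node7"]}
print(pick_node("blk_1", locations, ["node1", "node2", "node3"]))  # node2
```

The fallback branch is what the network exists for; the whole point of the placement logic is to make it the exception rather than the rule.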


Multiple full copies of data. As Hadoop ingests raw data, often from multiple sources, it spreads data across many disks within a Hadoop cluster. Query processes are then assigned by the control node to DataNodes in the cluster, which operate in parallel on the data spread across the disks directly attached to each processor. HDFS architecture makes multiple copies of data on ingest -- the default is three copies -- and spreads them across disks.
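This ingest-time behavior can be sketched as follows -- a simplified round-robin placement routine, not Hadoop's actual algorithm, that gives each block three copies on distinct nodes.

```python
import itertools

def place_replicas(blocks, nodes, replication=3):
    """Assign each block `replication` copies on distinct nodes, round-robin.
    A simplified stand-in for HDFS's real placement logic."""
    cycle = itertools.cycle(nodes)
    placement = {}
    for blk in blocks:
        targets = []
        while len(targets) < replication:
            node = next(cycle)
            if node not in targets:
                targets.append(node)
        placement[blk] = targets
    return placement

placement = place_replicas(["blk_1", "blk_2"], ["n1", "n2", "n3", "n4"])
# Each block now has three copies, each on a different node.
```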

Eventual consistency. Because data is replicated across multiple DataNodes, a change made on one node takes time to propagate across the cluster, depending on network bandwidth and other performance factors. Eventually, the change will be replicated across all copies. At that point, HDFS will be in a consistent state. However, it is possible for HDFS architecture to encounter simultaneous queries and run them against two different replicated versions of the data, leading to inconsistent results.
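A toy model makes the window of inconsistency concrete: a write lands on one replica first, so two queries hitting different replicas during propagation can disagree. This is an illustration of the concept, not HDFS internals.

```python
# Toy model of eventual consistency: three replicas of one value.
class Replica:
    def __init__(self, value):
        self.value = value

replicas = [Replica("v1"), Replica("v1"), Replica("v1")]

replicas[0].value = "v2"   # a write reaches one replica first

# During propagation, simultaneous queries can see different versions.
print(replicas[0].value, replicas[1].value)  # v2 v1

# Once replication completes, the system is back in a consistent state.
for r in replicas[1:]:
    r.value = replicas[0].value
print({r.value for r in replicas})  # {'v2'}
```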

Rack awareness and backup/archiving. HDFS tries to optimize performance by considering a DataNode's physical location when placing data, allocating storage resources and scheduling automated tasks. Some nodes can be configured with more storage for capacity while others can be configured to be more compute-heavy. However, the latter is not a standard practice. For backup and archiving, a storage-heavy Hadoop cluster is usually recommended.
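HDFS's default rack-aware policy places the first replica on the writer's node and the second and third on two different nodes in one remote rack, so losing an entire rack never loses all copies. The sketch below illustrates that logic; the topology format and function name are assumptions, not real Hadoop interfaces.

```python
import random

def rack_aware_targets(writer, topology):
    """Sketch of HDFS's default rack-aware placement: first replica on the
    writer's node, second and third on two different nodes in one remote
    rack. `topology` maps rack name -> list of node names (illustrative)."""
    node_rack = {n: r for r, nodes in topology.items() for n in nodes}
    remote_racks = [r for r in topology if r != node_rack[writer]]
    remote = random.choice(remote_racks)
    second, third = random.sample(topology[remote], 2)
    return [writer, second, third]

topology = {"rackA": ["a1", "a2"], "rackB": ["b1", "b2"]}
print(rack_aware_targets("a1", topology))  # e.g. ['a1', 'b2', 'b1']
```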

Erasure coding. By default, the Hadoop Distributed File System creates three copies of data by replicating each data block three times upon ingestion. While this practice is common across other distributed file systems as a way to keep clusters running despite failure, it makes for very inefficient use of storage resources. This is particularly true as more advanced users consider populating Hadoop clusters with flash drives.

The alternative, which will be delivered in an upcoming release by the Apache Software Foundation, is to use erasure coding in place of replication. According to Apache, the default erasure coding policy will save 50% of the storage space consumed by storing three replicas and will also tolerate more storage failures per cluster.
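The arithmetic behind the 50% figure is straightforward: three-way replication stores three units of raw data per logical unit, while a Reed-Solomon scheme with six data blocks and three parity blocks (the default policy Apache describes) stores 1.5. A short sketch of that calculation, with illustrative names:

```python
def raw_storage(data_units, scheme):
    """Raw storage consumed for `data_units` of logical data.
    scheme is ("replication", copies) or ("rs", data_blocks, parity_blocks)."""
    if scheme[0] == "replication":
        return data_units * scheme[1]
    _, data_blocks, parity_blocks = scheme
    return data_units * (data_blocks + parity_blocks) / data_blocks

replicated = raw_storage(100, ("replication", 3))  # 300 units
erasure = raw_storage(100, ("rs", 6, 3))           # 150 units
print(f"Saving: {1 - erasure / replicated:.0%}")   # Saving: 50%
```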

Individual directories can be configured with an erasure coding policy or the normal three-replica policy. However, there is a performance penalty in using an erasure coding policy, because data locality is decreased for erasure code-configured directories and I/O to them takes longer. As a result, Hadoop developers now see erasure coding as a form of storage tiering, whereby less frequently accessed and cold data is allocated to erasure-coded directories.

Next Steps

Does HDFS for storage work for you?

Hortonworks discusses product developments at Hadoop Summit

Hadoop challenges with big data analysis

This was last published in May 2016


Join the conversation



Which Hadoop file system features are most important to you?
The unstated problem with HDFS is that the answer to every question is "add more nodes." Even with commodity hardware, this is expensive (cost of hardware, cost of power, cost of software licensing per node) and a poor solution to the performance problem. I should add: if you are not concerned with performance, it's very likely your Hadoop system is not many-user, high-value or time-sensitive -- and all of these will likely change with time, so my suggestion is to prepare in advance.
The better answer is to make nodes go faster with SSDs. The problem with that approach is that the HDFS-plus-Java code stack is too slow to take advantage of SSDs.
So look for updates to HDFS, or alternatives like MapR, that can take advantage of SSDs.
Don't buy more nodes; instead use fewer, but better, nodes with SSDs.