The Hadoop Distributed File System (HDFS) has a growing list of storage management features and functions that are consistent with persisting data, but many enterprises find the technology falls short in their production environment. The challenge for IT administrators encountering Hadoop is to determine whether the HDFS storage layer can serve as an acceptable data preservation foundation for the Hadoop data analysis platform and its growing list of applications. Here are typical issues storage admins come across and what they can do to mitigate them.
Inadequate data protection and disaster recovery capabilities
HDFS relies on the creation of replicated data copies -- usually three -- at ingest to recover from disk failures, data loss scenarios, loss of connectivity and related outages. While this process does allow a cluster to tolerate disk failure and replacement without an outage, it slows data ingest operations, negatively impacts time to information and still doesn't totally cover data loss scenarios that include data corruption.
In a recent study, researchers at North Carolina State University found that while Hadoop data analysis provides fault tolerance, "data corruptions still seriously affect the integrity, performance and availability of Hadoop systems." Keeping three copies of everything also makes for very inefficient use of storage media, which is a critical concern when users need to persist data in the cluster for up to seven years due to regulatory compliance.
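To make the storage overhead concrete, here is a minimal sketch of the capacity arithmetic, assuming every block is stored three times and nothing is deleted during the retention period. The yearly ingest figure is a hypothetical value chosen for illustration, not a number from any real deployment.

```python
def raw_capacity_tb(data_tb: float, replication_factor: int = 3) -> float:
    """Raw disk capacity consumed when every block is stored
    `replication_factor` times (three by default, as described above)."""
    if data_tb < 0 or replication_factor < 1:
        raise ValueError("data_tb must be >= 0 and replication_factor >= 1")
    return data_tb * replication_factor


def storage_efficiency(replication_factor: int = 3) -> float:
    """Fraction of raw capacity that holds unique data."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be >= 1")
    return 1 / replication_factor


if __name__ == "__main__":
    yearly_ingest_tb = 100  # hypothetical: 100 TB of new data per year
    for year in range(1, 8):  # seven-year retention window
        retained = yearly_ingest_tb * year
        print(f"year {year}: {retained} TB of data, "
              f"{raw_capacity_tb(retained):.0f} TB of raw capacity")
```

Under these assumptions only a third of the purchased capacity holds unique data, and the gap widens in absolute terms every year the data is retained.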
The ability to replicate data synchronously between Hadoop clusters does not currently exist in HDFS. Synchronous replication can be a critical requirement for supporting production-level disaster recovery operations. While asynchronous replication is supported, it is open to the creation of file inconsistencies across local and remote cluster replicas over time.
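The kind of drift that asynchronous replication can introduce is easy to picture as a comparison of two file listings. The sketch below is illustrative only: it does not call any HDFS API, and the `FileInfo` record, the `digest` helper and the in-memory listings are hypothetical stand-ins for whatever listing and checksum data an administrator collects from each cluster.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical per-file record: size in bytes plus a content checksum."""
    size: int
    checksum: str


def digest(data: bytes) -> str:
    """Content checksum used for comparison (SHA-256 chosen for the sketch)."""
    return hashlib.sha256(data).hexdigest()


def diff_listings(local: dict[str, FileInfo],
                  remote: dict[str, FileInfo]) -> dict[str, list[str]]:
    """Classify paths that differ between a local and a remote listing."""
    return {
        # Present locally but never arrived at the remote site.
        "missing_on_remote": sorted(p for p in local if p not in remote),
        # Present remotely but no longer (or never) present locally.
        "extra_on_remote": sorted(p for p in remote if p not in local),
        # Present on both sides with different size or content.
        "mismatched": sorted(
            p for p in local if p in remote and local[p] != remote[p]
        ),
    }


if __name__ == "__main__":
    local = {
        "/data/a": FileInfo(3, digest(b"abc")),
        "/data/b": FileInfo(3, digest(b"xyz")),
    }
    remote = {
        "/data/a": FileInfo(3, digest(b"abc")),
        "/data/b": FileInfo(3, digest(b"xyw")),  # diverged copy
    }
    print(diff_listings(local, remote))
```

A report like this only detects inconsistencies. Deciding which copy is authoritative is still a manual step, which is the operational burden that synchronous replication is meant to avoid.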
Inability to disaggregate storage resources from compute resources
HDFS binds compute and storage together to minimize the distance between processing and data for performance at scale. However, this results in some unintended consequences when HDFS is used as a long-term, persistent storage environment. To add storage capacity in the form of data nodes, an administrator has to add processing and networking resources, whether or not they are needed. Note that, with three copies of every block, 1 TB of data consumes 3 TB of raw capacity.
This tight binding of compute and storage limits an administrator's ability to apply automated storage tiering to take advantage of hybrid solid-state drive and rotating disk architectures. When storage is disaggregated from compute, administrators can use flash to make up for, and even improve on, any performance lost to the separation, and gain a more efficient way to keep infrequently accessed data for years.
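As a rough illustration of what an automated tiering policy does, the sketch below assigns files to a solid-state or rotating-disk tier based on how recently they were accessed. The tier names, the 30-day threshold and the function names are assumptions made for this example and do not describe any particular product's policy engine.

```python
from datetime import datetime, timedelta


def choose_tier(last_access: datetime, now: datetime, hot_days: int = 30) -> str:
    """Return "ssd" for recently accessed data, "hdd" otherwise.

    `hot_days` is a hypothetical threshold; real policies may also weigh
    access frequency, file size or explicit administrator rules.
    """
    age = now - last_access
    return "ssd" if age <= timedelta(days=hot_days) else "hdd"


def plan_tiers(last_access_by_path: dict[str, datetime],
               now: datetime,
               hot_days: int = 30) -> dict[str, str]:
    """Map every path to the tier the policy would place it on."""
    return {
        path: choose_tier(last_access, now, hot_days)
        for path, last_access in last_access_by_path.items()
    }


if __name__ == "__main__":
    now = datetime(2024, 1, 31)
    files = {
        "/lake/clicks/today.log": datetime(2024, 1, 30),
        "/lake/audit/2019.parquet": datetime(2019, 6, 1),
    }
    for path, tier in plan_tiers(files, now).items():
        print(f"{path} -> {tier}")
```

The point of the sketch is that placement depends only on the data's access pattern, so capacity on each tier can grow without adding compute nodes.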
Data in and out processes can take longer than the actual query process
One of the major advantages of Hadoop data analysis lies in Hadoop's ability to run queries against very large volumes of unstructured data. For that reason, Hadoop is often positioned as a big data lake. The idea is to copy data from active data stores and move it to the data lake. This process can be time consuming and network resource-intensive, depending on the amount of data. But perhaps more critically from the standpoint of Hadoop in production, it can lead to data inconsistencies, causing application users to question whether or not they are querying a single source of the truth.
One way to solve this problem is to have the applications that produce the data and the applications that analyze it share the same storage system, eliminating the need to create, track and move data copies over a network. Using a multipurpose storage environment that serves many use cases simultaneously has the further advantage of not requiring modification of the transactional (OLTP) data architecture on which an enterprise may depend. Creating a shared, cluster-external storage environment also allows IT administrators to disaggregate compute from storage so that the two can be scaled separately.
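The cost of copying data into a lake, which a shared storage system avoids, can be estimated with simple arithmetic. This is a back-of-the-envelope sketch: the `efficiency` factor (the share of nominal link bandwidth actually achieved) and the example volumes are assumptions, and real transfers also depend on source read speed, ingest-time replication and competing traffic.

```python
def transfer_hours(data_tb: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Estimated hours to move `data_tb` terabytes over a `link_gbps` link.

    Uses decimal units (1 TB = 10**12 bytes, 1 Gbps = 10**9 bits per second).
    `efficiency` is an assumed fraction of nominal bandwidth, in (0, 1].
    """
    if data_tb < 0:
        raise ValueError("data_tb must be >= 0")
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency in (0, 1]")
    bits_to_move = data_tb * 1e12 * 8
    effective_bits_per_second = link_gbps * 1e9 * efficiency
    return bits_to_move / effective_bits_per_second / 3600


if __name__ == "__main__":
    # Hypothetical volumes copied over a 10 Gbps link.
    for volume_tb in (1, 50, 500):
        print(f"{volume_tb} TB: about {transfer_hours(volume_tb, 10):.1f} hours")
```

If the estimate for a routine copy approaches or exceeds the run time of the queries it feeds, data movement rather than analysis is the bottleneck.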
HDFS is complicated and hard to learn
To solve some of the production-inhibiting shortcomings of HDFS, the community often creates add-on projects. Recovery Algorithms for Fast-Tracking, known as RAFT, for example, can be used to recover from failures without re-computation. DistCp can be used for periodic synchronization of clusters across WAN distances, but it requires manual processes to reconcile differences when inconsistencies occur over time. Falcon addresses data lifecycle and management, while Ranger centralizes security administration.
However, each of these projects has to be learned and managed as a separate entity, and each has its own lifecycle that requires tracking, updating and administering. The HDFS environment also precludes running many of the common file commands, like copy, which steepens the learning curve and opens the system to human error. Enterprise Hadoop administrators will naturally gravitate to simplicity in this regard. A storage environment that already has built-in features to solve the problems these add-on projects address individually simplifies management and reduces opportunities for error.
Addressing Hadoop analysis issues
There are two other sources for answers to Hadoop data analysis shortcomings:
- Vendors of commercial Hadoop distributions, known as distros. The most popular distros are Cloudera, Hortonworks and MapR.
- The growing list of storage vendors seeking to integrate their data center-grade storage systems with Hadoop. These come with required data protection, integrity, security and governance features built in.