Commercial Hadoop distributors bring HDFS improvements

These three commercial distributions of Hadoop are alternative options for big data storage that can bypass data protection and performance problems common with HDFS.

Hadoop and its associated Hadoop Distributed File System are widely used for storage and analytics in big data environments. However, HDFS comes with some limitations: weak data protection capabilities, resource-intensive analytics and a steep learning curve.

There are three major commercial Hadoop distributors -- Hortonworks, Cloudera and MapR -- that can help enterprises avoid HDFS pitfalls. Of the three, Hortonworks maintains the closest implementation to Apache HDFS. Cloudera offers enhancements in the form of projects that are added to the Apache Hadoop projects catalog. MapR discovered early on that HDFS came with excess baggage that would create issues in enterprise data center implementations. As a result, MapR opted out of HDFS in favor of its own symmetrical file system.

Hortonworks: The 100% open source platform

Hortonworks executive management wants customers to understand that its Hortonworks Data Platform (HDP) is 100% open source. The company's business model is based on a fully supported, enterprise-ready Hadoop platform consumable by enterprise customers who offer the most significant growth potential. It claims a customer base of 800 with $122 million in annual revenue as of early 2016.

Hortonworks tends to innovate at the upper layers of the Hadoop stack. Within the last year, the company introduced Hortonworks Data Flow (HDF) -- based on Apache NiFi, Kafka and Storm -- for data routing, transformation and system mediation logic. HDF and HDP are offered separately or integrated. The integration can be particularly useful when Hadoop is used as a central data aggregation and processing foundation for Internet of Things applications.

However, it is at the storage layer where one sees the 100% adherence to the Apache Foundation's projects, including HDFS, Falcon for data lifecycle management, and Atlas for data governance and compliance.

Cloudera: Open source Hadoop with proprietary add-ons

Cloudera claims 850 enterprise subscription software customers for its commercial Hadoop distribution called Cloudera Data Hub (CDH). Like Hortonworks, Cloudera also positions itself as an enterprise-ready distribution of Hadoop with Apache open source code at the core. However, Cloudera is more willing to initiate and develop complementary but proprietary add-on projects that are also offered to the open source community for co-development.

Two storage-related examples of this are:

  • Kudu is positioned as a new storage engine for the Hadoop ecosystem. HDFS was originally architected to support MapReduce processing at large scale, so it performs best under large-block, sequential-access conditions. HBase was created to add online transaction processing (OLTP)-style random read/write access and excels in small-block, random-access environments. Kudu essentially combines the scale of HDFS with the OLTP capabilities of HBase.
  • Cloudera Navigator is a data governance product for Hadoop, offering data discovery, continuous optimization, audit, lineage, metadata management and policy enforcement. It can be used to address information lifecycle management and regulatory compliance requirements.
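To make the Kudu positioning concrete, the toy Python sketch below (illustrative only, not Kudu or HBase code) contrasts the two access patterns Kudu aims to serve from a single storage engine: an append-only log that favors whole-dataset sequential scans, and a keyed store that favors point lookups and updates.

```python
class AppendOnlyLog:
    """HDFS-style storage: fast sequential scans, no keyed updates."""
    def __init__(self):
        self.blocks = []

    def append(self, record):
        self.blocks.append(record)

    def scan(self):
        # Analytics jobs read the whole dataset front to back.
        return list(self.blocks)


class KeyedStore:
    """HBase-style storage: fast random reads and writes by key."""
    def __init__(self):
        self.rows = {}

    def put(self, key, value):
        self.rows[key] = value

    def get(self, key):
        return self.rows.get(key)


log = AppendOnlyLog()
store = KeyedStore()
for i in range(5):
    log.append(("user%d" % i, i * 10))
    store.put("user%d" % i, i * 10)

print(len(log.scan()))     # sequential scan for batch analytics
print(store.get("user3"))  # point lookup for OLTP-style access
```

A system like Kudu has to support both patterns at once, which is why it is described as combining HDFS scale with HBase-style OLTP access.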

Cloudera is open to Hadoop use cases where HDFS storage becomes disaggregated from the compute layer. In a future release, CDH will leverage Intel's 3D XPoint technology as persistent, high-performance storage. Node-based persistent memory will provide local affinity for hot data while using central HDFS storage for less frequently accessed data. Cloudera will also support erasure coding in the next major release of CDH.
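Erasure coding matters here because it cuts storage overhead relative to HDFS' classic triple replication. The Python sketch below is a simplified illustration, not Cloudera's implementation (production systems use Reed-Solomon codes, not single XOR parity): three data blocks plus one parity block can survive the loss of any one block at roughly 1.33x overhead, versus 3x for triple replication.

```python
def xor_blocks(blocks):
    """XOR equal-sized blocks together byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)


data = [b"AAAA", b"BBBB", b"CCCC"]  # three equal-sized data blocks
parity = xor_blocks(data)           # one parity block

# Simulate losing block 1 and rebuilding it from the survivors + parity:
# A ^ C ^ (A ^ B ^ C) = B.
recovered = xor_blocks([data[0], data[2], parity])
print(recovered)  # b'BBBB'
```

The same idea generalizes: Reed-Solomon coding tolerates multiple simultaneous failures while keeping overhead well below replication.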

Cloudera also has a close relationship with EMC. Isilon is supported as an external storage platform for Hadoop with additional functionality planned for future releases. Cloudera is working with EMC to port Hadoop to EMC's DSSD flash products.

MapR: The HDFS-free alternative

MapR has taken a progressive approach to the continued development of its commercial Hadoop distribution, the Converged Data Platform. This means it has addressed barriers to enterprise adoption of Hadoop when and where it has seen them. This process started with implementing a foundational Hadoop file system that bypasses critical issues previously identified in Apache HDFS, such as NameNode vulnerability. It extends to a robust implementation of snapshots, data governance features and data replication for disaster recovery (DR).

From a storage perspective, this progressive behavior can be seen in MapR's recently delivered Zeta Architecture. In Zeta, all MapR applications -- including standard MapReduce, HBase and Spark -- read and write to a common scalable, distributed file system. This is so a MapReduce process that needs HBase data doesn't have to import that data -- it's already in the Zeta storage layer. For real-time analytics, the Zeta Architecture supports the use of databases, including HBase and MapR-DB. Other supported databases are HP Vertica, MySQL and Sybase IQ. Supported storage protocol standards include NFS and S3.
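The practical payoff of a shared storage layer is that data never needs an export/import hop between engines. The Python sketch below is illustrative only (it uses an ordinary local directory to stand in for the MapR file system): an "operational" writer appends records, and a "batch analytics" reader scans the very same file in place over a POSIX-style path, much as NFS access to the Zeta storage layer would allow.

```python
import json
import os
import tempfile

shared_root = tempfile.mkdtemp()  # stands in for the shared file system
path = os.path.join(shared_root, "events.json")

# "Operational" writer appends records to the shared layer...
with open(path, "w") as f:
    for i in range(3):
        f.write(json.dumps({"event": i, "bytes": i * 100}) + "\n")

# ...and a "batch analytics" reader scans the same file, no import step.
with open(path) as f:
    total = sum(json.loads(line)["bytes"] for line in f)

print(total)  # 0 + 100 + 200 = 300
```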

Unlike Cloudera and, to some extent, Hortonworks, MapR positions its proprietary storage environment as not needing assistance from external storage platforms.

Hadoop storage-related tools and ancillary applications

Over the past few years, a number of startups have formed around the need to address Hadoop Distributed File System issues. Dataguise for data governance and security, and WANdisco for data replication and DR are two examples.

Dataguise offers a data security and governance platform that detects, audits, protects and monitors sensitive data assets in real time wherever they live and move across all repositories. All three Hadoop distributors noted earlier are supported.

WANdisco's Fusion Platform delivers continuous availability with guaranteed data consistency across clusters deployed on any combination of Hadoop distributions, Hadoop-compatible storage systems or cloud environments. It replicates data across clusters any distance apart and in different geographic locations.
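The key to consistency across widely separated clusters is that every replica applies the same writes in the same agreed-upon order. The Python toy below is a conceptual sketch only, not WANdisco's implementation (Fusion uses a Paxos-based coordination engine): once a single total order of operations is fixed, replicas at any distance converge to identical state.

```python
class Replica:
    """A cluster's view of the replicated namespace."""
    def __init__(self, name):
        self.name = name
        self.files = {}

    def apply(self, op):
        action, path, payload = op
        if action == "write":
            self.files[path] = payload
        elif action == "delete":
            self.files.pop(path, None)


# Writes may originate at any site; coordination assigns one total order.
agreed_order = [
    ("write", "/data/a.csv", "v1"),  # originated in "us-east"
    ("write", "/data/a.csv", "v2"),  # originated in "eu-west"
    ("delete", "/data/tmp", None),
]

replicas = [Replica("us-east"), Replica("eu-west"), Replica("ap-south")]
for op in agreed_order:          # every replica applies the same sequence
    for r in replicas:
        r.apply(op)

# All replicas converge to identical state.
print(replicas[0].files)  # {'/data/a.csv': 'v2'}
```

Without an agreed order, two sites applying the conflicting writes to /data/a.csv in different sequences would end up with divergent copies.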

Next Steps

Big data management with commercial Hadoop distributions

Complete guide to managing Hadoop implementations

HDFS is a better big data storage option than traditional arrays
