Apache Hadoop users typically build their own parallel computing clusters from commodity servers, each with internal storage, usually a JBOD arrangement of six to 12 disks. Hadoop clusters are commonly referred to as shared-nothing architectures because all processing is done in parallel by servers in the cluster that are self-contained processing units. They communicate with one another over a common network, but otherwise do not share computing, memory or storage resources within the cluster. External enterprise SAN and NAS systems, while scalable and resilient, are typically too expensive at scale, lack the locality of reference needed for high performance and break the shared-nothing rule. Therefore, Hadoop storage is commonly direct-attached storage embedded within each cluster node.
While still very controversial in Hadoop circles, an increasing number of enterprise Hadoop users are either substituting cluster-internal Hadoop Distributed File System (HDFS) file storage with a cluster-external, data center-grade storage system, or adding such a system as an adjunct for long-term, online archival storage -- also known as an active archive. This feature will focus on external storage systems used as HDFS alternatives. All reviewed systems are data center-grade products available from well-known commercial vendors.
However, before we highlight these HDFS alternatives and their specific capabilities in the context of Hadoop applications, it's important to review some of the reasons for considering HDFS alternatives:
- Inefficient and inadequate data protection and disaster recovery capabilities. HDFS relies on the creation of replicated data copies -- usually three -- at ingest to recover from disk failures, data loss scenarios, loss of connectivity and related outages. While this process does allow a cluster to tolerate disk failure and replacement without an outage, it slows data ingest operations, negatively impacts time to information and still doesn't totally cover data loss scenarios that include data corruption. Shared storage can be used to apply storage system-resident management functions such as data protection, replication for DR and storage tiering to maintain performance while preserving data for longer periods.
- Inability to disaggregate storage resources from compute resources. Compute and storage are bound together to minimize the distance between processing and data for performance at scale. However, to add data storage capacity, processing and networking resources have to be added, whether they are needed or not. Implementing shared storage allows administrators to scale compute and storage resources independently.
- Data-in and data-out processes can create production issues. Copying data from active data stores can be time-consuming and strain network resources. Perhaps more critically from the standpoint of Hadoop in production, it can lead to data inconsistencies, causing application users to question whether they are querying a single source of the truth. A high-capacity, shared-storage platform -- scale-out NAS, for example -- that supports Hadoop can hold application data in place, so it does not have to be migrated from other sources.
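The coupling of compute and storage described in the second point above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and the workload and node sizes are hypothetical, and the per-node capacity is assumed to be usable capacity after any replication.

```python
import math

def nodes_needed_coupled(capacity_tb, compute_cores, tb_per_node, cores_per_node):
    """Shared-nothing cluster: every node brings both disks and cores,
    so the larger of the two requirements dictates the node count."""
    by_capacity = math.ceil(capacity_tb / tb_per_node)
    by_compute = math.ceil(compute_cores / cores_per_node)
    return max(by_capacity, by_compute)

def nodes_needed_decoupled(compute_cores, cores_per_node):
    """Shared storage: compute nodes are sized for processing only;
    capacity is added on the external storage system instead."""
    return math.ceil(compute_cores / cores_per_node)

# Hypothetical workload: 960 TB usable and 320 cores,
# on nodes that each offer 24 TB usable and 16 cores.
coupled = nodes_needed_coupled(960, 320, tb_per_node=24, cores_per_node=16)
decoupled = nodes_needed_decoupled(320, cores_per_node=16)
print(coupled, decoupled)  # 40 nodes when coupled, 20 compute nodes when decoupled
```

In this made-up case the capacity requirement forces the purchase of 20 servers whose processors are not needed, which is the inefficiency shared storage is meant to avoid.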
In addition, storage system vendors offer the following reasons why analytics should be run on shared storage:
- Total cost of ownership reductions of 50% or more, resulting from the smaller infrastructure needed to support storage capacity and from the increased storage efficiency of no longer creating three full copies of data on ingest.
- Increased administrative efficiency by eliminating the need to install, reinstall and update HDFS on each individual node.
- Storage and management functions are off-loaded from the Hadoop processing cluster to the storage system -- multiple benefits include data protection, tiered storage, snapshots and security.
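To see why dropping the three-copy convention matters for storage efficiency, it helps to compare raw capacity requirements. The following is a rough sketch, not a vendor calculation: the 8+2 parity layout is an assumed example, and the result covers raw capacity only, not the total cost of ownership figure vendors quote.

```python
def raw_capacity_replication(usable_tb, replicas=3):
    """Raw capacity needed when every block is stored `replicas` times."""
    return usable_tb * replicas

def raw_capacity_parity(usable_tb, data_units, parity_units):
    """Raw capacity needed under a parity scheme (RAID or erasure coding)
    that stores `parity_units` of redundancy per `data_units` of data."""
    return usable_tb * (data_units + parity_units) / data_units

usable = 100  # TB of user data, hypothetical
replicated = raw_capacity_replication(usable)   # 300 TB
protected = raw_capacity_parity(usable, 8, 2)   # 125.0 TB with an assumed 8+2 layout
saving = 1 - protected / replicated
print(f"{replicated} TB vs {protected} TB raw, {saving:.0%} less")
```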
In an attempt to rectify these matters, an increasing number of vendors are merging data center-grade storage systems with Hadoop. These blended systems have the necessary data protection, integrity, security and governance features baked right into the product. HDFS alternatives on the list include EMC Isilon, IBM Spectrum Scale, NetApp Open Solution for Hadoop and SanDisk InfiniFlash.
EMC Isilon
EMC's Isilon IQ series consists of clustered nodes based on industry-standard Intel servers, with direct-attached disks and custom software to provide scale-out NAS along with software-enabled features implemented in the base file system, OneFS. To maintain a global file system image across the nodes, two InfiniBand links are used for intracluster communication and synchronization and provide a consistent method for communicating control information. EMC claims that more than 600 large companies have deployed Isilon for Hadoop applications.
The key to the value provided by Isilon is OneFS software; it is both the operating environment, based on BSD Unix, and the file system, which includes the volume manager and RAID protection in a single layer that is optimized for NAS functionality. OneFS controls placement of data on disks by managing redundancy on a file-by-file basis.
Isilon supports multiple Hadoop distributions, including Cloudera CDH, Hortonworks Data Platform and Pivotal HD, via an HDFS interface implemented at the Isilon cluster node level. Hadoop can connect to any node within the Isilon cluster, and every Isilon node can act as a name node or data node from the standpoint of HDFS. This can be done while Isilon supports other applications. Therefore, data created by these applications that could also be used by Hadoop does not have to be ingested from some other location -- it's already stored in the Isilon cluster. In addition, administrators can apply Isilon's embedded storage and data management features to Hadoop.
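From the Hadoop side, pointing a cluster at an external HDFS-compatible endpoint generally comes down to the standard fs.defaultFS property in core-site.xml. The sketch below generates such a fragment; the host name isilon.example.com and port 8020 are placeholders chosen for illustration, not values taken from EMC documentation, and a real deployment would follow the vendor's and the distribution's setup guides.

```python
import xml.etree.ElementTree as ET

def build_core_site(namenode_host, port=8020):
    """Return a minimal core-site.xml fragment that sets fs.defaultFS.

    namenode_host and port are assumptions supplied by the caller; they
    stand in for whatever address the storage system actually exposes.
    """
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = "fs.defaultFS"
    ET.SubElement(prop, "value").text = f"hdfs://{namenode_host}:{port}"
    return ET.tostring(configuration, encoding="unicode")

# Hypothetical endpoint name for the external storage cluster.
print(build_core_site("isilon.example.com"))
```

Because every storage node can answer name node requests, the address would typically be a single name that resolves to the storage cluster as a whole rather than to one server.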
IBM Spectrum Scale
IBM Spectrum Scale is another scale-out storage system that can be integrated natively -- no cluster-level code required -- with Hadoop. Spectrum Scale implements a unified storage environment, which in this case means support for both file and object-based data storage under a single global namespace.
For data protection and security, Spectrum Scale offers snapshots at the file-system or file-set level and backup to an external storage target -- backup appliance or tape. Storage-based security features include optional data-at-rest encryption, secure erase and Lightweight Directory Access Protocol/Active Directory for authentication. Synchronous and asynchronous data replication at LAN, metropolitan area network and WAN distances with transactional consistency is also available.
Spectrum Scale supports automated storage tiering, using flash for performance and multi-terabyte mechanical disks for inexpensive capacity, with automated, policy-driven data movement between storage tiers. Tape is supported as an additional archival storage tier.
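Policy-driven tiering of this kind can be pictured as a rule that maps each file to a tier based on how recently it was used. The toy function below is only a conceptual sketch: it is not Spectrum Scale's policy language, and the tier names and the 7-day and 365-day thresholds are invented for illustration.

```python
from datetime import datetime, timedelta

def choose_tier(last_access, now, hot_days=7, archive_days=365):
    """Pick a storage tier from a file's last access time.

    Thresholds are hypothetical; a real policy could also weigh file
    size, owner, file set or pool occupancy.
    """
    age = now - last_access
    if age <= timedelta(days=hot_days):
        return "flash"   # recently used data stays on the fast tier
    if age <= timedelta(days=archive_days):
        return "disk"    # cooler data moves to inexpensive capacity
    return "tape"        # rarely touched data goes to the archive tier

now = datetime(2016, 6, 1)
for days_idle in (1, 90, 800):
    print(days_idle, choose_tier(now - timedelta(days=days_idle), now))
```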
Policy-driven data compression can be implemented on a per-file basis for an approximately twofold improvement in storage efficiency and a reduced processing load on Hadoop cluster nodes. And for mainframe users, Spectrum Scale can be integrated with IBM z Systems -- often a remote data island when it comes to Hadoop.
NetApp Open Solution for Hadoop
The first thing to note about the NetApp Open Solution for Hadoop is that, among the HDFS alternatives highlighted, it preserves the shared-nothing architectural model. It provides direct-attached storage (DAS) to each data node within the Hadoop cluster in the form of a NetApp E-Series array. Each E-Series array used to support the Hadoop cluster can be configured as four volumes of DAS to support four data nodes. Each data node is allocated 14 disks within the E-Series array as its own nonshared set and "sees" only those disks. Each array has dual controllers with hardware-assisted computation of RAID parity. A NetApp FAS system running Data Ontap provides NFS-based storage for the Hadoop name node server, offering production data center-quality storage for Hadoop system metadata -- a critical component to the overall functioning and resiliency of the Hadoop cluster.
As with other external storage platforms for Hadoop, NetApp moves data protection processes, and the creation of data replicas needed for adverse event recovery purposes, off the Hadoop cluster and onto storage arrays. The HDFS convention of triple mirroring within the cluster consumes server and network bandwidth. Instead, NetApp allows Hadoop administrators to mirror HDFS data to a direct-attached E-Series array. Doing so replaces the triple mirror implemented in software that runs at the data node level with hardware RAID that runs at the array level.
Support for the nondisruptive, simultaneous rebuild of logical volumes within the E-Series array means disk failures can be handled without disrupting the cluster and without requiring administrator intervention. Use of the E-Series can also increase overall cluster performance -- even when JBODs are used within the data nodes -- by reducing the HDFS replica count and allowing the storage array to process that workload. In addition, the use of hardware RAID combined with caching at the array level adds a further margin of performance.
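The bandwidth effect of a lower replica count can be estimated with a simple model. The sketch below assumes the usual HDFS write pipeline, in which the first copy lands on one data node and each additional replica is forwarded to another node over the cluster network; the 10 TB ingest figure is hypothetical, and the source does not state which replica count NetApp recommends, so several values are shown.

```python
def replication_traffic_tb(ingest_tb, replicas):
    """Data forwarded between data nodes to create the extra replicas.

    Assumes each replica beyond the first crosses the cluster network
    once; client-to-cluster transfer is not counted.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return ingest_tb * (replicas - 1)

ingest = 10  # TB ingested, hypothetical
for replicas in (3, 2, 1):
    print(replicas, replication_traffic_tb(ingest, replicas))
```

In HDFS the replica count is controlled by the dfs.replication property, so reducing it while relying on array-level RAID shifts the protection workload from servers and the network to the storage hardware.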
SanDisk InfiniFlash
SanDisk characterizes InfiniFlash as a scale-out, solid-state disk storage platform. It is designed to offer massive scale while maintaining consistent performance and quality of service as storage capacity is increased. It is based on a new SanDisk high-capacity flash form factor. The InfiniFlash OS offers unified block, file and object support, as well as snapshots, replication and thin provisioning. Currently, InfiniFlash hardware features include:
- 512 TB capacity in three rack units (3U) with up to 64 8 TB InfiniFlash cards;
- Throughput of 7 GBps and 780,000 IOPS; and
- All hot-swappable components, low power consumption -- as compared to HDD systems of equal capacity -- and mean time between failures of 1.5 million hours.
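The figures in this list can be cross-checked and put into more familiar terms with some quick arithmetic. The sketch below uses only the numbers quoted above; the conversion from mean time between failures to an annualized failure rate assumes a constant failure rate (an exponential model), which is a simplification rather than anything SanDisk states.

```python
import math

def infiniflash_summary(cards=64, tb_per_card=8, rack_units=3,
                        iops=780_000, mtbf_hours=1_500_000):
    """Derive capacity, density, IOPS per TB and an approximate
    annualized failure rate from the quoted hardware figures."""
    capacity_tb = cards * tb_per_card
    hours_per_year = 8760
    return {
        "capacity_tb": capacity_tb,
        "tb_per_rack_unit": capacity_tb / rack_units,
        "iops_per_tb": iops / capacity_tb,
        # Constant-failure-rate assumption: AFR = 1 - exp(-t / MTBF)
        "annual_failure_rate": 1 - math.exp(-hours_per_year / mtbf_hours),
    }

for name, value in infiniflash_summary().items():
    print(f"{name}: {value:.4g}")
```

The card count and card size multiply out to the quoted 512 TB, and under the stated assumption the quoted MTBF corresponds to an annualized failure rate of a little under 0.6%.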
InfiniFlash can be integrated with Apache Hadoop, starting with InfiniFlash version 2.0, to provide a tiered storage environment for HDFS in which InfiniFlash retains the high-activity data while a scale-out HDD tier retains lower-activity data. SanDisk claims that deploying InfiniFlash in this way offers flash acceleration for both Hadoop MapReduce and streaming analytics applications with the least impact to the traditional HDFS storage model. It can also be deployed as a flash-optimized, Ceph-based object store for HDFS. In this case, Ceph provides the data-tiering function. A third deployment model replaces Hadoop node-resident HDDs with InfiniFlash as the primary storage layer, disaggregated from the compute layer, whereas standard HDFS keeps compute and storage together, distributed across nodes. Here, all three data replicas are maintained on InfiniFlash with erasure coding applied for low-overhead redundancy and data integrity.
External storage arrays can be used as an additional and potentially more effective way to establish a production data center-quality storage environment for Hadoop-based applications.
Many firmly believe that the best data storage architecture for Hadoop is small, distributed JBOD arrays embedded within each cluster node. The most common argument for maintaining this architectural model is the preservation of data locality -- the closest possible pairing of compute and storage to optimize performance at scale. However, these HDFS alternatives -- while still offering performance -- also address perceived inefficiencies, management complexity and exposure to data loss.