EMC Corp. is tackling Hadoop storage by integrating the open source “big data” analytics framework with EMC Isilon...
Today, EMC said its Isilon OneFS 6.5 operating system natively supports the Hadoop Distributed File System (HDFS) protocol. The vendor also released EMC Greenplum HD on Isilon. EMC launched Greenplum HD last May, putting Hadoop on its Greenplum data analytics appliance. Now the Greenplum appliance works with Isilon back-end storage.
HDFS support is free for existing EMC Isilon customers, and EMC Greenplum HD on Isilon is available starting today.
EMC claims it's the first vendor to natively integrate Hadoop with its storage systems. Last November, rival NetApp Inc. launched NetApp Open Solution for Hadoop, a packaged storage cluster that uses partner Cloudera’s Hadoop management and support.
According to Nick Kirsch, Isilon’s director of product management, the Isilon product separates the Hadoop data analytics engine from the HDFS storage back end. “We’re not using the Hadoop file system at all,” Kirsch said. “We’re just using the [HDFS] protocol the analytics engine uses to talk to the storage. So it thinks that we’re a regular Hadoop storage system.”
As a result, Kirsch said, customers get the Hadoop analytics engine and native HDFS integration with an enterprise storage platform, as well as advanced data protection and efficiency services such as backups, snapshots, replication and deduplication.
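To make Kirsch's point concrete: because the storage answers on the HDFS protocol, the compute side only needs to be told where its file system lives. The sketch below generates the kind of core-site.xml fragment a Hadoop client reads for that purpose. It is an illustration under stated assumptions, not EMC's documented procedure: the host name isilon.example.com is hypothetical, port 8020 is assumed because it is the port commonly used for the NameNode RPC service, and fs.default.name is the Hadoop property of that era (later releases renamed it fs.defaultFS).

```python
import xml.etree.ElementTree as ET


def build_core_site(endpoint_host, port=8020):
    """Return a core-site.xml fragment pointing Hadoop at an HDFS endpoint.

    endpoint_host and port are illustrative assumptions; the article does
    not describe how an Isilon cluster is actually addressed.
    """
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    # fs.default.name tells Hadoop clients which file system URI to use.
    ET.SubElement(prop, "name").text = "fs.default.name"
    ET.SubElement(prop, "value").text = f"hdfs://{endpoint_host}:{port}"
    return ET.tostring(configuration, encoding="unicode")


# Hypothetical host name standing in for the storage cluster.
core_site_xml = build_core_site("isilon.example.com")
print(core_site_xml)
```

From the analytics engine's point of view nothing else changes: jobs still open hdfs:// paths, and whatever answers at that address is treated as ordinary Hadoop storage.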
The Hadoop data analytics platform is a Java-based open source framework for distributed processing of large data sets across compute and storage clusters. It implements the MapReduce programming model and is designed to scale from a single server to thousands of machines. The Apache Hadoop open source kernel was created by Doug Cutting, who named it after his son’s toy elephant. EMC Greenplum HD includes implementation services, training, certification and customer support.
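For readers unfamiliar with MapReduce, the idea can be sketched in a few lines. The example below is a single-process toy in Python, not Hadoop's Java API: a map step emits key/value pairs for each input split, a shuffle step groups the values by key, and a reduce step combines each group. The function names and sample input are invented for illustration; in a real cluster the map and reduce steps run in parallel on many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce over a list of input splits."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Two made-up input splits standing in for files spread across a cluster.
splits = ["big data big clusters", "data nodes store data"]
counts = word_count(splits)
print(counts)
```

Because each map call looks only at its own split and each reduce call looks only at one key, the work can be spread across as many servers as there are splits and keys, which is what lets the model scale.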
Hadoop is gaining popularity in universities and research organizations, but Isilon’s Kirsch said several of its characteristics have kept the open source project from gaining widespread enterprise adoption. First, it has a single point of failure: each HDFS cluster is fronted by a NameNode that manages the file system metadata and the data nodes, so if the NameNode goes down, the entire Hadoop cluster is offline. Second, Hadoop requires a dedicated storage infrastructure with a one-to-one compute and storage configuration. It was engineered for low-cost commodity servers that include all the necessary compute and storage, not for larger storage arrays. In addition, Hadoop lacks native data protection and support for storage protocols such as NFS and CIFS.
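The single point of failure is easy to see in a toy model. The Python sketch below is a deliberate simplification with invented class names, not HDFS code: one metadata service records which data node holds each block, and every read has to ask it first. Once that service is marked offline, reads fail even though every block is still sitting on the data nodes.

```python
class NameNodeUnavailable(Exception):
    """Raised when the metadata service cannot be reached."""


class ToyNameNode:
    """Holds the only copy of the file-to-block map (toy model)."""

    def __init__(self):
        self.online = True
        self.block_map = {}  # file path -> list of (node_id, block_id)

    def locate(self, path):
        if not self.online:
            raise NameNodeUnavailable("metadata service is down")
        return self.block_map[path]


class ToyCluster:
    """One name node in front of several data nodes (toy model)."""

    def __init__(self, data_node_count):
        self.name_node = ToyNameNode()
        self.data_nodes = [{} for _ in range(data_node_count)]

    def write(self, path, blocks):
        locations = []
        for block_id, payload in enumerate(blocks):
            # Spread blocks across data nodes round-robin.
            node_id = block_id % len(self.data_nodes)
            self.data_nodes[node_id][(path, block_id)] = payload
            locations.append((node_id, block_id))
        self.name_node.block_map[path] = locations

    def read(self, path):
        # Every read starts with a metadata lookup on the name node.
        locations = self.name_node.locate(path)
        return "".join(self.data_nodes[n][(path, b)] for n, b in locations)


cluster = ToyCluster(data_node_count=3)
cluster.write("/logs/app.log", ["he", "ll", "o"])
contents_before = cluster.read("/logs/app.log")

cluster.name_node.online = False  # simulate the NameNode going down
try:
    cluster.read("/logs/app.log")
    outage_error = None
except NameNodeUnavailable as exc:
    outage_error = exc

# The data itself is untouched; only the path to it is gone.
blocks_still_stored = sum(len(node) for node in cluster.data_nodes)
print(contents_before, outage_error, blocks_still_stored)
```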
Building HDFS into the Isilon file system solves these problems, Kirsch claims. “So now you can input data using NFS, analyze it using Hadoop, and look at that data using a Windows workstation. [Enterprises] won’t have to retrain all their users and yet they can take advantage of the Hadoop platform,” he said.
Benjamin Woo, IDC’s program vice president for worldwide storage systems, said Hadoop distribution providers can address the framework’s shortcomings by altering the source code, but EMC’s approach is the first hardware-based attempt to make Hadoop enterprise-ready.
“This is the first time that this sort of 'enterprisation' has been instantiated from a hardware perspective,” Woo said. “And that makes it a big deal.”
Woo pointed to two other important features: Isilon’s decoupling of the one-to-one relationship between compute and storage, and its removal of Hadoop’s single points of failure.
For new Isilon customers, EMC recommends a five-to-10-node configuration with the Isilon X-Series. That configuration could handle between 100 TB and 300 TB, depending on the capacity chosen per node.