A software-only storage startup nurtured at the University of California at Berkeley has launched a distributed file system for big data analytics.
Alluxio Inc. today unveiled a free community edition and a subscription-based commercial version of its eponymous in-memory software. The startup's main focus is big data analytics jobs with Apache Spark.
Alluxio, based in San Mateo, Calif., describes its open source product as in-memory virtual distributed storage that enables data sharing across any application and any storage system at memory speed. The goal is to turn underused memory into storage capacity.
Alluxio was known as Tachyon until it changed its name in 2015, the same year venture capital firm Andreessen Horowitz made a $7.5 million investment. Alluxio lead investor Peter Levine is a former CEO at XenSource (now part of Citrix) and a former executive vice president at the original Veritas Software.
The Alluxio software layer installs between computational frameworks and storage to virtualize underlying file and object data stores. CEO Haoyuan Li said about a dozen web-scale companies have deployed Alluxio in production with Apache-based Flink, Samza, Spark and Storm. The list includes Alibaba Group, Barclays, Baidu, Capital One, CERN, Esri, Google, Juniper Networks, JD.com, Qunar.com, Swisscom and Yahoo.
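One way to picture a layer that virtualizes several back-end stores is a mount table that maps folders in a single namespace onto different storage systems. The sketch below is a toy model written for illustration only: the `MountTable` class, its method names and the example URIs are all hypothetical and are not Alluxio's actual API.

```python
class MountTable:
    """Toy unified namespace: maps mount points to backing-store URIs.

    Hypothetical illustration only; this is not Alluxio's API.
    """

    def __init__(self):
        self._mounts = {}

    def mount(self, mount_point, backing_uri):
        # Normalize so "/s3/" and "/s3" refer to the same mount point.
        key = mount_point.rstrip("/") or "/"
        self._mounts[key] = backing_uri.rstrip("/")

    def resolve(self, path):
        """Translate a namespace path into the URI of the backing store."""
        best = None
        for mount_point in self._mounts:
            prefix = mount_point.rstrip("/") + "/"
            if path == mount_point or path.startswith(prefix):
                # Prefer the longest (most specific) matching mount point.
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise KeyError(f"no mount point covers {path}")
        suffix = path if best == "/" else path[len(best):]
        return self._mounts[best] + suffix


# Example: two different storage systems appear as folders in one namespace.
table = MountTable()
table.mount("/s3", "s3://example-bucket/data")
table.mount("/hdfs", "hdfs://example-namenode:9000/warehouse")
print(table.resolve("/s3/logs/a.txt"))
print(table.resolve("/hdfs/sales/part-0"))
```

In this model an application only ever sees paths such as `/s3/logs/a.txt`, and the table decides which storage system actually holds the file.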
"What we do is unify data at memory speed," Li said. "You can mount different storage systems in Alluxio as a folder, which can be accessed through the [operations] layer. We basically virtualize the data for different storage systems and expose unified APIs in a global file-system namespace to the computational system."
Hadoop has been broadly adopted as the de facto standard for big data jobs. The core components of Hadoop are the Hadoop Distributed File System (HDFS) and MapReduce orchestration and processing. But Apache Spark is gaining ground as an adjunct, if not an outright alternative, to a Hadoop MapReduce configuration.
"Apache Spark is becoming more and more important as part of a big data processing framework," said Arun Chandrasekaran, a research vice president at Gartner Inc. "Alluxio wants to layer a very simple file system on top of the existing file system you already have. This gives you a memory-centric architecture.
"The other thing they want to do is to decouple the computation from the storage on the back end. They provide an API on the front end that is HDFS- or MapReduce-compatible for your classic Hadoop applications. That means you don't have to make any software changes on the front end."
Alluxio creates a columnar data format in memory that overlays disk-centric batch processing, allowing reads and writes to occur in memory. Hot files are cached in memory, and Alluxio's tiering engine siphons cold and warm data to back-end storage via standard file or object APIs. The Alluxio file system provides object storage interfaces for Amazon S3 and OpenStack Swift, and file storage interfaces for HDFS and Red Hat GlusterFS scale-out NAS.
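The hot-and-cold tiering idea can be sketched as a small cache that keeps recently used files in memory and pushes the least recently used ones to a back-end store. This is a minimal illustration, assuming a least-recently-used eviction policy and a capacity counted in files; the article does not describe Alluxio's actual tiering policy, and the `TieredStore` class is hypothetical.

```python
from collections import OrderedDict


class TieredStore:
    """Toy two-tier store: a bounded memory tier over a backing store.

    Assumes LRU eviction and a capacity measured in number of files.
    Hypothetical illustration only; not Alluxio's tiering engine.
    """

    def __init__(self, memory_capacity, backing_store):
        self.memory_capacity = memory_capacity
        self.memory = OrderedDict()   # ordered from coldest to hottest
        self.backing = backing_store  # any dict-like back end

    def write(self, name, data):
        self.memory[name] = data
        self.memory.move_to_end(name)  # a fresh write counts as hot
        self._evict()

    def read(self, name):
        if name in self.memory:
            self.memory.move_to_end(name)  # memory hit: mark as hot
            return self.memory[name]
        data = self.backing[name]  # miss: fetch from the back end
        self.memory[name] = data   # and cache it for later reads
        self._evict()
        return data

    def _evict(self):
        # Move the coldest files to the back end until memory fits.
        while len(self.memory) > self.memory_capacity:
            cold_name, cold_data = self.memory.popitem(last=False)
            self.backing[cold_name] = cold_data


store = TieredStore(memory_capacity=2, backing_store={})
for name in ("a", "b", "c"):
    store.write(name, f"contents of {name}")
print(list(store.memory))   # files currently held in memory
print(list(store.backing))  # files pushed to back-end storage
```

Repeated reads of the same file are then served from the memory tier, while files that fall out of use end up only on the back end.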
Internet provider Sparks faster reads
Baidu USA, the North American arm of Chinese-language internet provider Baidu, manages an Alluxio cluster that scales to 1,000 nodes and more than 2 PB, including 50 TB of memory storage and the balance in disk capacity. Business analysts and product managers mine the data to suggest product improvements.
Leo Wang, a software architect at Baidu USA, said read speeds are up to 50 times faster with Alluxio.
"Previously, it took hours to get the query results, which did not meet our business needs," Wang said. "Alluxio solves this problem by pooling all the hot data in memory for processing, thereby avoiding reads from remote [storage]. This is very important [since] our ad hoc query platform targets a response time within minutes."
The Alluxio Enterprise Edition embeds Kerberos authentication for security and data replication for high availability. Rather than replicating file system data across a cluster, Alluxio records all changes to file data and metadata and keeps the log available in memory. If a server is interrupted mid-calculation, Alluxio taps idle processing power to enable another server to resume the analysis from that point.
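The general technique of recording changes in a log and rebuilding state from it can be sketched in a few lines. The example below is a simplified model of journal replay, assuming just two operation types, `put` and `delete`; the `Journal` class and `replay` function are hypothetical names, and the article does not detail how Alluxio's own log is structured.

```python
class Journal:
    """Toy append-only log of changes. Hypothetical; not Alluxio's format."""

    def __init__(self):
        self.entries = []

    def append(self, op, path, value=None):
        self.entries.append((op, path, value))


def replay(entries):
    """Rebuild the file state by applying logged changes in order."""
    state = {}
    for op, path, value in entries:
        if op == "put":
            state[path] = value
        elif op == "delete":
            state.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return state


journal = Journal()
journal.append("put", "/results/stage1", "partial sums")
journal.append("put", "/results/tmp", "scratch")
journal.append("delete", "/results/tmp")

# A second server holding the same log can reconstruct the state
# and continue from there instead of starting over.
recovered = replay(journal.entries)
print(recovered)
```

Because the log captures every change in order, any node with access to it arrives at the same state, which is what lets work resume without a full copy of the data on each server.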
Data compression is not included in the Enterprise Edition, but Li said it is on the product roadmap for future software releases.
The free Alluxio Community Edition is a stripped-down package available for download from the Alluxio website. Alluxio Manager is included in both the community and enterprise versions to aid deployment, management and monitoring of Alluxio clusters.
Huawei storage integrates Alluxio
To get the best performance, Alluxio recommends installing its software on the same nodes that process big data computations.
"That way you always have the data closest to the compute. But you can install Alluxio on a subset of the nodes, if that makes more sense for what you are trying to do," said Neena Pemmaraju, Alluxio vice president of products.
Scale-out is achieved by installing Alluxio on each new node added to the compute cluster. Li and Pemmaraju did not disclose pricing, but Alluxio Enterprise Edition licenses will be based on the number of nodes on which the storage software is deployed.
The startup lists a dozen enterprise deployments and claims industry partnerships with Huawei, Intel, Mellanox Technologies and Rackspace, among others. Huawei in September said it was integrating Alluxio in its FusionStorage distributed elastic block storage software.
Alluxio is the second startup this year to pitch a memory-based storage software product. Plexistor emerged in January with software that uses nonvolatile memory as persistent storage to support in-memory databases and traditional enterprise applications, claiming it can eliminate the need for clustered compute and storage.