Hadoop as a service (HaaS), also known as Hadoop in the cloud, is a big data analytics offering in which a third-party vendor hosts and manages Hadoop in the cloud to store and analyze data. Because the vendor provides and manages the service, users do not have to invest in or install additional infrastructure on premises.
The open source Hadoop big data analytics framework allows large, unstructured data sets to be analyzed. Hadoop's storage mechanism, the Hadoop Distributed File System (HDFS), distributes data across multiple nodes so that it can be processed in parallel. One drawback of the open source Hadoop framework, however, is that it requires a specialized set of skills that many organizations do not have in-house or cannot afford. Hadoop as a service providers integrate proprietary programs with the Hadoop framework to make it easier for organizations to use, and they typically include management and support capabilities. Most HaaS offerings are cloud-based, and pricing is most often on a per-cluster, per-hour basis.
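The idea of splitting data into blocks and processing the blocks in parallel can be illustrated with a small word-count sketch. This is not Hadoop code: it is a minimal, single-machine approximation in plain Python, in which threads stand in for cluster nodes and the function names (split_into_blocks, map_block, word_count) are hypothetical names chosen for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(lines, block_size):
    """Split a list of lines into fixed-size blocks (a loose stand-in for HDFS blocks)."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def map_block(block):
    """Map step: count the words in one block, independently of all other blocks."""
    counts = Counter()
    for line in block:
        counts.update(line.lower().split())
    return counts


def word_count(lines, block_size=2, workers=4):
    """Count words by processing blocks in parallel, then merging the partial results."""
    blocks = split_into_blocks(lines, block_size)
    # Threads are an assumption here; in a real cluster each block would be
    # processed on a node that holds a copy of that block.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(map_block, blocks))
    # Reduce step: merge the per-block counts into a single total.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total
```

Because each block is counted independently, adding more workers (or, in a real cluster, more nodes) lets more blocks be processed at the same time; only the final merge needs to see all of the partial results.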
HaaS providers offer a variety of features and support, including:
- Hadoop framework deployment support
- Hadoop cluster management
- Alternative programming languages
- Data transfer between clusters
- Customizable and user-friendly dashboards and data manipulation
- Security features
Features to look for in a HaaS provider include:
- Persistent data storage in HDFS, which avoids the issues associated with translating data stored in other formats into HDFS.
- Elasticity to accommodate a wide variety of workloads.
- Ability to recover from processing failures without restarting the entire process (known as non-stop operations).
- A self-configuring environment that allows automatic configuration based on workload.
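The non-stop operations item in the list above, recovering from a processing failure without restarting the entire process, can be sketched as a job runner that retries only the tasks that failed. This is a simplified illustration in plain Python rather than any provider's actual mechanism; run_job and its max_attempts parameter are hypothetical names introduced for the example.

```python
def run_job(tasks, max_attempts=3):
    """Run every task, retrying only the failed ones and keeping completed results.

    tasks maps a task name to a zero-argument callable.
    """
    results = {}
    pending = dict(tasks)
    for _ in range(max_attempts):
        failed = {}
        for name, task in pending.items():
            try:
                results[name] = task()
            except Exception:
                # Remember the failed task; results already collected are untouched.
                failed[name] = task
        if not failed:
            return results
        # Only the failed tasks are attempted again on the next pass.
        pending = failed
    raise RuntimeError(
        f"tasks still failing after {max_attempts} attempts: {sorted(pending)}"
    )
```

The design choice illustrated here is that completed work is never repeated: a task that succeeded on the first pass runs exactly once, even if other tasks in the same job need several attempts.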
Amazon was the first major provider of Hadoop as a service. HaaS offerings currently in the market include:
- Amazon Elastic MapReduce
- Microsoft HDInsight
- IBM InfoSphere BigInsights
- Oracle Big Data Discovery Tool
- OpenStack Savanna (since renamed OpenStack Sahara)
- Google Cloud Dataproc