

Potential pitfalls with Hadoop data analytics

Before deciding on Hadoop to tackle your organization's big data analysis needs, review these points of concern, including data sources and sustainability.

An increasing number of CEOs and other organizational leaders believe new opportunities lie hidden in the rising tide of data that surrounds them -- and that IT-driven analytics processes will expose them. Big data is the term that captures the advance of analytics-led opportunities. Tools such as Hadoop are available to help with big data analysis, but Hadoop data analytics presents challenges of its own.

Big data is not just about capturing and analyzing data. It is even more useful for tapping into a growing number of data sources, such as online social media, and merging that data with existing customer data. This allows retailers to tailor marketing programs and develop tighter connections with their customers.

Data sources fall into two general categories:

  • People who employ a growing number of ways to connect with information and other people. In the process, they become data sources, and their numbers grow daily.
  • Electronic devices that are becoming a larger data source. New sources of machine-generated data -- and the unseen data produced by them -- appear every day.

The challenge is to make this growing list of data sources useful to information users and stakeholders, and to build IT architectures that can support them. However, there's an underlying assumption that enterprises will store all this data. Data captured is data stored. And because the default policy for retaining stored data within many enterprises is to "save everything forever," the need to store multiple petabytes of new data is entirely imaginable, even for medium-sized enterprises.

With the advance of data analytics, executives realize that data owned by an enterprise can have intrinsic value. In fact, it can be monetized. And there's an understanding among these executives that the value of owning data is increasing, while the value of owning IT infrastructure is decreasing -- as evidenced by unabated growth in public cloud usage. Storage may now be a more critical resource than ever before. When processing large sets of data, the storage needed has to be priced like a commodity and offer low latency at large scale -- something the data storage industry has had difficulty doing at the same time.

Are Hadoop data analytics and HDFS the answer?

Where can you store all this data in a way that is sustainable for the foreseeable future? Traditional enterprise storage platforms -- disk arrays and tape silos -- aren't up to the task. Data center arrays are too costly for the data volumes envisioned, and tape, while appropriate for large volumes at low cost, elongates time to information. One sought-after repository is often called a big data lake, and the most common instantiation of this is Hadoop data analytics. Hadoop offers large-scale, contiguous storage at commodity prices and is seen by many enterprises as an aggregation point for many different data sets and various structured and unstructured data types. 

Hadoop data analytics is used as a data aggregation point for analytic processes run on traditional data warehouses. But characterizing Hadoop as a big data lake is, at best, misleading. At its core, Hadoop is a platform that collects and analyzes structured and unstructured data from many disparate sources. It was designed to deliver high analytic performance coupled with large-scale storage at low cost. While Hadoop data analysis still happens there, the platform has become a software-defined way to run a wide range of analytics applications that span the processing spectrum from batch to interactive to real time.

The open source project known as Apache Hadoop grew out of the internet data centers of Yahoo, modeled on systems Google had described in published papers -- environments known for very large scale at low cost per unit of compute power. But there is a chasm between large internet data centers and enterprise data centers, defined by differences in management style, spending priorities, compliance and risk-avoidance profiles. The Hadoop Distributed File System (HDFS) was not originally designed for long-term data persistence, a quality needed for enterprise use cases. The assumption was that data would be loaded into a distributed cluster for MapReduce batch processing jobs and then unloaded, with the process repeated for successive jobs.
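The load-process-unload batch pattern that HDFS was built around is easiest to see in miniature. Below is a minimal word-count sketch in plain Python -- no cluster involved; the map, shuffle and reduce phases that a MapReduce job distributes across nodes are simulated in-process, and the function names are illustrative, not part of any Hadoop API:

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in an input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group intermediate values by key, as the framework
    does between the map and reduce stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle(pairs))

print(word_count(["big data", "big analytics"]))
# {'big': 2, 'data': 1, 'analytics': 1}
```

In a real cluster, each phase runs in parallel over blocks of data stored in HDFS, and the job's output is written back out when the batch completes -- which is why the original design assumed data came and went with each job.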

Now, enterprise users not only want to run successive MapReduce jobs, they want to build multiple applications for multiple types of analytics users on top of HDFS. These include OLTP-style workloads on HBase and real-time analytics with Storm and Spark. To do this, data needs to be persisted, protected and secured for multiple user groups and for long periods of time, which could present a problem for users of Hadoop data analytics.
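The shift described above -- from one-off load/process/unload cycles to many jobs sharing long-lived data -- can be sketched in plain Python. Local files stand in for HDFS storage here, and the record layout is a made-up example; the point is simply that once data is persisted, successive independent jobs read the same store without re-ingesting from the original source:

```python
import json
import os
import tempfile

# Persist a dataset once, as long-lived data on HDFS would be.
# (A local file stands in for HDFS; fields and paths are illustrative.)
records = [{"user": "a", "event": "click"},
           {"user": "b", "event": "view"},
           {"user": "a", "event": "view"}]

store = os.path.join(tempfile.mkdtemp(), "events.jsonl")
with open(store, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

def load(path):
    """Read the persisted dataset back, one JSON record per line."""
    with open(path) as f:
        return [json.loads(line) for line in f]

# Job 1: batch-style aggregation over the persisted data.
events_per_user = {}
for rec in load(store):
    events_per_user[rec["user"]] = events_per_user.get(rec["user"], 0) + 1

# Job 2: a second, independent job over the same stored data --
# no reload from the original source is needed.
clicks = [rec for rec in load(store) if rec["event"] == "click"]

print(events_per_user)  # {'a': 2, 'b': 1}
print(len(clicks))      # 1
```

This reuse is exactly what raises the persistence, protection and security requirements noted above: once several user groups depend on the same stored data, it can no longer be treated as disposable job input.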
