Whether it’s defining “big data,” understanding Hadoop or assessing the impact of large data stores, storage pros need a clear understanding of the big data trend.
It seems impossible to get away from the term “big data” nowadays. The challenge is that the industry lacks a standard definition for what big data is. Enterprise Strategy Group (ESG) defines big data as “data sets that exceed the boundaries and sizes of normal processing capabilities, forcing you to take a non-traditional approach.” We apply the term “big data” to any data set that breaks the boundaries and conventional capabilities of IT designed to support day-to-day operations.
These boundaries can be encountered on multiple fronts:
The transaction volume can be so high that traditional data storage systems hit bottlenecks and can’t complete operations in a timely manner. They simply don’t have enough processing horsepower to handle the volume of I/O requests, or they don’t have enough spindles in the environment to service them. This often leads users to “short stroke” their disk drives: they put less data on each drive, using only part of its capacity, to increase the ratio of spindles per GB of data and provide more drives to handle I/O. It also might lead users to deploy lots of storage systems side by side and not use them to their full capacity potential because of the performance bottlenecks, or to do both. This is an expensive proposition because it leads to buying lots of disk drives that will be mostly empty.
The size of the data (individual records, files or objects) can make it so that traditional systems don’t have sufficient throughput to deliver data in a timely manner. They simply don’t have enough bandwidth to handle the transactions. We see organizations short stroking drives in this case as well, adding spindles to increase system bandwidth, which, again, leads to poor utilization and increased expense.
The overall volume of content is so high that it exceeds the capacity threshold of traditional storage systems. They simply don’t have enough capacity to deal with the volume of data. This leads to storage sprawl -- tens or hundreds of storage silos, with tens or hundreds of points of management, typically with poor utilization and consuming an excessive amount of floor space, power and cooling.
It gets very intimidating when these things pile on top of each other -- there’s nothing that says users won’t experience a huge number of I/O requests for a ton of data consisting of extremely large files.
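The cost of short stroking described above can be made concrete with a little arithmetic. The following is a minimal sketch, not a sizing tool: the function name and every figure in it (36,000 IOPS, 20 TB, 180 IOPS and 600 GB per drive) are hypothetical values chosen only to illustrate the effect.

```python
import math

def drives_needed(required_iops, required_capacity_gb,
                  iops_per_drive, drive_capacity_gb):
    """Size a drive pool for both I/O and capacity (illustrative only).

    Returns the drive count and the fraction of purchased capacity
    that actually holds data.
    """
    for_iops = math.ceil(required_iops / iops_per_drive)
    for_capacity = math.ceil(required_capacity_gb / drive_capacity_gb)
    # Whichever requirement is larger dictates how many drives are bought.
    count = max(for_iops, for_capacity)
    utilization = required_capacity_gb / (count * drive_capacity_gb)
    return count, utilization

# Hypothetical workload: I/O demand, not capacity, drives the purchase.
count, utilization = drives_needed(
    required_iops=36_000, required_capacity_gb=20_000,
    iops_per_drive=180, drive_capacity_gb=600,
)
print(f"{count} drives, {utilization:.0%} of capacity used")
```

With these assumed numbers, capacity alone would call for 34 drives, but the I/O requirement pushes the count to 200, leaving roughly five-sixths of the purchased capacity empty.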
Supporting storage architectures
We’re seeing an evolution in storage architectures to help deal with the increasing volume of data associated with big data. Each has slightly different, but overlapping, characteristics.
On the I/O-intensive, high-transaction volume end, ESG sees a broad adoption of architectures that can scale up by adding spindles. That’s the traditional approach and systems like EMC VMAX, Hitachi Data Systems VSP and IBM DS8000 do well here.
On the large data size front, bleeding-edge industries that have been dealing with big data for years were early adopters of scale-out storage systems designed with enough bandwidth to handle large file sizes. We’re talking about systems from DataDirect Networks, Hewlett-Packard Ibrix, Isilon (now EMC Isilon) and Panasas, to name a few. Traditionally, scale-up implied there were eventual limits; scale-out has far less stringent limits and much more flexibility to add capacity or processing power. As big data sizes become more of a mainstream problem, some of these systems are finding more mainstream adoption. These more mainstream environments can be a mix of I/O- and throughput-intensive performance demands, so both scale-up and scale-out are often needed to keep up.
Finally, on the content volume front, we’re seeing more adoption of scale-out, object-based storage archive systems to make it easier to scale to billions of data objects within a single, easily managed system. The advantage of these systems is that they enable robust metadata for easier content management and tracking, and are designed to make use of dense, low-cost disk drives (Dell DX 6000 series is a good example here).
What about Hadoop?
No column on big data would be complete without a discussion of Hadoop. The ability to accelerate an analytics cycle (cutting it from weeks to hours or minutes) without exorbitant costs is driving enterprises to look at Hadoop, an open source technology that’s often run on commodity servers with inexpensive direct-attached storage (DAS).
Hadoop is used to process very large amounts of data and consists of two parts: MapReduce and the Hadoop Distributed File System (HDFS). Put (very) simply, MapReduce handles the job of managing compute tasks, while HDFS automatically manages where data is stored on the compute cluster. When a compute job is initiated, MapReduce takes the job and splits it into subtasks that can be run in parallel. It queries HDFS to see where the data required to complete each subtask lives, and then sends the subtasks out to run on the compute node where the data is stored. In essence, it’s sending the compute tasks to the data. The results of each subtask are sent back to the MapReduce master, which collates and delivers the final results.
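The flow just described can be mimicked in a few lines of plain Python. This is a toy sketch of the idea only; it doesn’t use Hadoop or any of its APIs, and the block names, node names and word-count job are all invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for what HDFS tracks: which node holds which block.
block_locations = {"block-1": "node-a", "block-2": "node-b", "block-3": "node-a"}
block_contents = {
    "block-1": "big data big storage",
    "block-2": "data storage data",
    "block-3": "big data",
}

def map_task(text):
    """One subtask: count the words in a single block."""
    return Counter(text.split())

def run_job(locations, contents):
    # Split the job into subtasks grouped by the node that already holds
    # the data, i.e. send the compute to the data.
    plan = defaultdict(list)
    for block, node in locations.items():
        plan[node].append(block)
    # Each node works through its local blocks. A real cluster would do
    # this in parallel; this sketch runs the subtasks one after another.
    partials = [map_task(contents[b]) for node in plan for b in plan[node]]
    # The master collates the partial results into the final answer.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(plan), total

plan, total = run_job(block_locations, block_contents)
print(plan)
print(total.most_common())
```

The point of the sketch is the plan step: no block is moved to the compute; each subtask is assigned to the node where its data already sits.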
Now compare that with a traditional system, which would need a big expensive server with a lot of horsepower attached to a big expensive storage array to complete the task. It would read all the required data, run the analysis and write the results in a fairly serial manner, which at these volumes of data, takes a lot longer than the Hadoop-based MapReduce job would.
The differences can be summed up in a simple analogy. Let’s say 20 people are in a grocery store and they’re all processed through the same cash register line. If each person buys $200 worth of groceries and takes two minutes to have their purchases scanned and totaled, $4,000 is collected in 40 minutes by the star cashier hired to keep up. Here’s the Hadoop version of the scenario: Ten register lines are staffed by low-cost, part-time high school students who take 50% more time to finish each separate transaction (three minutes). With two customers per line, it now takes six minutes to ring up the same 20 people, but you still get $4,000 when they hand in their cash drawers. From a business standpoint, what’s the impact of reducing a job from 40 minutes to six minutes? How many more jobs can be run in that 34 minutes you just gained? How much more insight can you get and how much quicker can you react to market trends? This is equivalent to business-side colleagues not having to wait long for the results of analytical queries.
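For readers who want to check the grocery store math, here’s a small sketch. The function name is ours, and it assumes customers are spread evenly across the open registers.

```python
import math

def checkout_minutes(customers, registers, minutes_per_customer):
    """Elapsed time until the last customer is rung up.

    Assumes customers are split evenly, so the busiest line
    determines when the whole job finishes.
    """
    customers_per_line = math.ceil(customers / registers)
    return customers_per_line * minutes_per_customer

serial = checkout_minutes(customers=20, registers=1, minutes_per_customer=2)
parallel = checkout_minutes(customers=20, registers=10, minutes_per_customer=3)
revenue = 20 * 200  # the same either way

print(serial, parallel, serial - parallel, revenue)  # 40 6 34 4000
```

Each part-time cashier is slower than the star, but ten of them working side by side still finish in a fraction of the time.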
Hadoop isn’t perfect. Clustered file systems are complex, and while much of this complexity is hidden from the HDFS admin, it can take time to get a Hadoop cluster up and running efficiently. Additionally, within HDFS, the data map (called the NameNode) that keeps track of where all the data lives (metadata) is a single point of failure in the current release of Apache Hadoop -- something that’s on the top of the list to be addressed in the next major release. Data protection is up to the admin to control; a data replication setting determines how many times a data file is copied in the cluster. The default setting is 3, which means the cluster can need three times the required usable capacity in raw disk. And that only protects data within the local cluster; backup and remote disaster recovery (DR) need to be considered outside of the current versions of Hadoop. There’s not a large body of trained Hadoop professionals on the market; while firms like Cloudera, EMC and MapR are doing a good job on the education front, it’ll take time to build a trained workforce. This last point shouldn’t be taken lightly. Recent studies show that projects planning to leverage contractors/consultants should budget as much as $250,000 per developer per year.
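The capacity cost of replication is easy to work out. The helper names and the 100 TB and 600 TB figures below are ours and purely illustrative; in Apache Hadoop the replication factor is typically set with the dfs.replication property.

```python
def raw_capacity_tb(usable_tb, replication=3):
    """Raw disk needed to hold usable_tb when every file is stored
    `replication` times (3 is the default cited above)."""
    return usable_tb * replication

def usable_capacity_tb(raw_tb, replication=3):
    """Usable capacity left from raw_tb of disk at a given replication."""
    return raw_tb / replication

# Hypothetical cluster: 100 TB of data to store.
print(raw_capacity_tb(100))          # 300
print(raw_capacity_tb(100, 2))       # 200
print(usable_capacity_tb(600))       # 200.0
```

Lowering the setting saves disk but leaves fewer copies to fall back on, which is why the choice is left to the admin.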
Big data, bigger truth
This laundry list of shortcomings, combined with the potential commercial analytics market opportunity, is driving big storage companies like EMC, IBM and NetApp to look at the big data opportunity. Each company has introduced (or will, you can count on it) storage systems designed for Hadoop environments that help users cover the manageability, scalability and data protection angles that HDFS lacks. Most offer a replacement for the HDFS storage layer with open interfaces (such as NFS and CIFS), while others provide their own version of a MapReduce framework that performs better than the open source distribution. Some offer features that fill in the open source HDFS gaps, like the ability to share data with other apps via standard NFS and CIFS interfaces or much better data protection and DR capabilities.
NetApp is actually taking a radically different approach from most vendors. They’re embracing the open Hadoop standard and the use of data nodes with DAS. Instead of using their own file system with a wrapper for Hadoop, they’re turbo-charging the DAS with SAS-connected JBOD based on the low end of the Engenio platform. And for the NameNodes they’re using an NFS-attached FAS box to provide a quick recovery from a NameNode failure. It’s a “best of both worlds” hybrid approach to the problem.
Whether or not the market will pay a premium for the better availability and broader potential application leverage still remains to be seen, as we’re in the early days yet.
Big data is a reality, and not all big data was created equal: various types of big data need different storage approaches. Even if you have a big data problem and are hitting those barriers that indicate you need to do something differently, the best way to talk to vendors about your requirements is to cut right through the fluff and not talk about big data at all. Instead, you should talk about the business problem and the use cases that will ultimately narrow the spectrum to a specific set of workload characteristics. The right storage approach will quickly become evident.