Choosing storage for streaming large files in big data sets
Big data technology is a big deal for storage shops, and a clear understanding of what it means -- and doesn't mean -- is required to successfully configure storage for big data apps.
I love the idea of changing the world through big data technology. Big data promises we'll all be IT superheroes just by storing more raw data than ever before and then using parallel processing techniques to yield great new insights that will catapult our company to the top. Good storage is costly and the rate at which interesting new data is produced increases daily, but the Apache Hadoop project calls for leveraging scale-out commodity server nodes with cheap local disk.
Of course, there's more to it. Conceptually, big data products bring new ways to store and analyze the mountains of data that we used to discard. There's certainly information and insight to be mined, but the definitions are fuzzy, the hype is huge and the mining technologies themselves are still rapidly evolving.
Adding to the confusion, big data technology has been enthusiastically marketed by just about every storage vendor on the planet. But despite the marketing, I believe it's just a matter of time before every competitive IT shop has a real big-data solution to implement or manage, if only because of staggering data growth. For those just setting out on a big data journey, watch out for these common myths.
Myth No. 1: Just do it
A sure way to waste a lot of money is to aggregate tons of data on endlessly scalable clusters and hope that your star data scientist will someday discover the hidden keys to eternal profit.
To succeed with any IT project, big data included, you need to have a business value proposition in mind and an achievable plan laid out. Research is good and those "aha" moments can be exciting, but by the time big data gets to IT, there needs to be a more practical goal than just a desire to "see what might be in there."
Myth No. 2: Store everything
One of the problems caused by big data hype is that unrealistic expectations are often built on the premise of "keeping it all." It may seem plausible for a company to use a big data platform to keep all its data forever. In fact, Cloudera, whose Hadoop distribution is the most widely adopted among enterprises, markets directly to that point. But is it true that accumulated data will become more valuable over time?
Storage experts, at this point, might want to ask, "Is all that data actually going to be accessible, usable, reliable, verifiable, available, secure, protected and, not least, affordable in the long run?"
For most organizations, far less than "all" of the data will prove to deliver value. And most data declines in relevance as it ages. The faster you can identify your valuable data "subset," the more you can direct your resources and attention to what is likely to be most successful. Somewhat ironically, the less data you store, the more efficient and cost-effective you can be with big data.
Myth No. 3: Big is simple
The Apache Hadoop Distributed File System (HDFS) makes it easy to store lots of high-volume, high-velocity and highly variable data across a scale-out cluster. It does so in a way that makes the data easy to process using highly parallel MapReduce-style algorithms that farm the heavy-lifting compute tasks out to the nodes holding each data chunk. HDFS also provides in-cluster replication, mainly to improve cluster availability.
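To make the MapReduce idea concrete, here is a minimal single-machine sketch in Python. It is not Hadoop code and uses no Hadoop API. A list of text lines stands in for a file, fixed-size slices stand in for HDFS blocks, and a thread pool stands in for the cluster nodes. The function names and the chunk size are illustrative assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(lines, chunk_size):
    """Stand-in for HDFS blocks: cut the input into fixed-size slices."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_chunk(chunk):
    """Map step: count words in one chunk, independently of all other chunks."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


def word_count(lines, chunk_size=2, workers=4):
    """Run the map step on every chunk in parallel, then reduce the partials."""
    chunks = split_into_chunks(lines, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(reduce_counts, partials, Counter())


if __name__ == "__main__":
    sample = ["big data big", "data storage", "big storage"]
    print(word_count(sample))
```

The point of the pattern is that each map call touches only its own chunk, so the work can be sent to wherever the chunk is stored instead of moving the data to the compute.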
But as suggested above, HDFS doesn't natively provide the advanced enterprise storage features that might be needed to support good data protection or disaster recovery. Although it is evolving, Hadoop 1.0 currently doesn't support snapshots, mirroring or remote replication. And there are no easy ways to further optimize space (deduplication, compression) or tweak I/O performance (targeted caching, judicious use of flash or highly parallel streaming).
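Because HDFS protects data by keeping full copies within the cluster rather than by using space-saving techniques, raw disk requirements grow as a multiple of the data stored. The back-of-the-envelope sketch below assumes the HDFS default replication factor of 3. The headroom fraction and the 100 TB figure are hypothetical values chosen only for illustration.

```python
def raw_capacity_tb(logical_tb, replication_factor=3, headroom=0.25):
    """Estimate the raw disk (in TB) needed to hold logical_tb of data.

    replication_factor=3 matches the HDFS default (dfs.replication).
    headroom is a hypothetical fraction of raw disk kept free for
    temporary and intermediate data; tune it for a real cluster.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    return logical_tb * replication_factor / (1 - headroom)


if __name__ == "__main__":
    # 100 TB of data, three copies, 25% of raw disk kept free.
    print(raw_capacity_tb(100))
```

Under these assumptions, 100 TB of data calls for roughly 400 TB of raw disk, which is one reason the "cheap local disk" argument deserves a closer look before committing to keep everything.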
If you have lifecycle data management or governance requirements for data stored in a big data environment, you might need to consider an enhanced Hadoop distribution like the one from MapR that provides a full-featured storage service layer that transparently replaces HDFS.
Myth No. 4: Serve everyone
Hadoop represents a new way of processing certain types of data in certain parallel ways. And there are some exciting advances coming (e.g., YARN) that enable Hadoop to become a more universal processing platform. But HDFS doesn't provide a universal data storage service. It's designed and optimized for high read-throughput batch processing, and HDFS has no way to target or deliver I/O performance by dataset or workload.
Data has to be specifically loaded into HDFS. It can be difficult to get new data into and results out of it for immediate use or direct access by other applications using other protocols (e.g., NFS, CIFS). And Hadoop's combined compute/storage node makes it challenging to grow compute and storage on different vectors.
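The coupling of compute and storage can be illustrated with simple sizing arithmetic. In the sketch below, every node carries a fixed amount of disk and a fixed number of cores, so the cluster has to be sized for whichever resource runs out first and the other is left partly idle. All node specifications and workload figures are hypothetical.

```python
import math


def size_cluster(storage_tb, cores, tb_per_node, cores_per_node):
    """Size a cluster of identical nodes that each bundle disk and cores.

    Returns the node count plus the disk and cores left unused because
    both resources can only be added together, one node at a time.
    """
    nodes_for_storage = math.ceil(storage_tb / tb_per_node)
    nodes_for_compute = math.ceil(cores / cores_per_node)
    nodes = max(nodes_for_storage, nodes_for_compute)
    return {
        "nodes": nodes,
        "idle_tb": nodes * tb_per_node - storage_tb,
        "idle_cores": nodes * cores_per_node - cores,
    }


if __name__ == "__main__":
    # Hypothetical storage-heavy workload: 480 TB raw, but only 160 cores of work.
    print(size_cluster(storage_tb=480, cores=160, tb_per_node=24, cores_per_node=16))
```

In this storage-heavy example the cluster needs 20 nodes to hold the data, even though 10 nodes would cover the compute, leaving 160 cores unused.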
Breaking the HDFS "local" storage paradigm can make a lot of sense. For example, an enterprise scale-out array like EMC's Isilon provides "remote" HDFS storage to a Hadoop cluster, while actually hosting data in its native storage array file system with multiprotocol access and all its other enterprise array features.
Myth No. 5: Big and fast
A common misconception about Hadoop is that it's fast. Actually, the core design is all about high-throughput "batch"-style processing and about avoiding the impact of the common hardware failures that limit the ultimate efficiency of many larger-scale computing designs (e.g., supercomputers). Hadoop just wasn't originally intended to be an interactive or real-time system.
However, due to demand, there are a lot of projects aimed at ramping up performance and expanding the application "footprint" of Hadoop to better support more interactive workloads. Some of these involve integrating traditional database, streaming data or in-memory processing products. There are also high-performance hardware offerings like DataDirect Networks' hScaler that take an "appliance" approach, with compute nodes running in the same rack as the company's SFA series storage and a customized Hortonworks Hadoop distribution.
Big data will get bigger
Some people may think big data technology is past its peak stage and is crossing a "chasm" of disappointment, but I think we've just seen the beginning of its potential and the start of the evolution of the real value proposition of big data to enterprise IT. Those who have approached it realistically are gaining valuable results.
Big data, in the form of Hadoop, is but the start of a broader change in how data processing will need to be approached, and how future data centers will be designed. Data will continue to increase, processing technologies are in high flux, and the most competitive organizations will strive to wring as much value out of as much of that data as they can. Today, most enterprises haven't yet invested in game-changing big data technology projects intended to move the bottom line, although many have deployed Hadoop as an extract, load and transform (ELT) "ingest" platform for their more traditional data warehousing/business intelligence offerings.
Big data projects are as much storage projects as they are parallel compute. Tell us about your adventures with big data as an enterprise storage solution.
About the author:
Mike Matchett is a senior analyst and consultant at Taneja Group.