Now that the cloud computing bandwagon is out of gas, vendors have jumped on the next one to roll down the pike: Big Data. And as with previous hype cycles, Big Data is now a source of confusion for users as vendors put forth their own unique and often conflicting definitions of the term.
The most common source of confusion results from the conflation of Big Data storage with Big Data analytics. The term “Big Data” originated from within the open source community, where there was an effort to develop analytics processes that were faster and more scalable than traditional data warehousing, and could extract value from the vast amounts of unstructured data produced daily by web users.
Big Data storage is related in that it also aims to address the vast amounts of unstructured data fueling data growth at the enterprise level. But the technologies underpinning Big Data storage, such as scale-out NAS and object-based storage, have existed for a number of years and are relatively well understood.
At its simplest, Big Data storage is nothing more than storage that handles a lot of data for applications that generate huge volumes of unstructured data. This includes high-definition video streaming, oil and gas exploration, genomics -- the usual suspects. A marketing executive at a large storage vendor that has yet to make a Big Data statement or product introduction told me his company was considering “Huge Data” as a moniker for its Big Data storage entry.
Big Data analytics is more emergent and multifaceted, but less understood by the IT generalist. Development of Big Data analytics processes has historically been driven by the web. However, applications for Big Data analytics are now growing rapidly in all major vertical industry segments, and that growth represents an opportunity for vendors that's worth all the hype.
Big Data analytics is an area of rapidly growing diversity. Therefore, trying to define it is probably not helpful. What is helpful, however, is identifying the characteristics that are common to the technologies now identified with Big Data analytics. These include:
- The perception that traditional data warehousing processes are too slow and limited in scalability
- The ability to converge data from multiple data sources, both structured and unstructured
- The realization that time to information is critical to extract value from data sources that include mobile devices, RFID, the web and a growing list of automated sensory technologies
In addition, there are at least four major developmental segments that underline the diversity to be found within Big Data analytics. These segments are MapReduce, scalable database, real-time stream processing and Big Data appliance.
Apache Hadoop is a good place to start with the MapReduce segment. Hadoop began conceptually with a paper published by Google in 2004 that described a process, which it called MapReduce, for parallelizing the processing of web-based data. Shortly thereafter, Apache Hadoop was born as an open source implementation of the MapReduce process. The community surrounding it is growing dramatically and producing add-ons that expand Apache Hadoop's usability within corporate data centers.
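To make the MapReduce idea concrete, here is a minimal single-process sketch in Python. It is not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. It shows the three steps the model relies on: mapping each input split to key/value pairs, grouping the pairs by key, and reducing each group to a result. In a real cluster, the map and reduce calls would run in parallel on many nodes.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group intermediate values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of text splits."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Because each call to `map_phase` touches only its own split and each call to `reduce_phase` touches only its own key, the work can be spread across machines without coordination beyond the shuffle step.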
Apache Hadoop users typically build their own parallelized computing clusters from commodity servers, each with dedicated storage in the form of a small disk array or, more recently, solid-state drives (SSDs) for performance. These are commonly referred to as “shared-nothing” architectures. Storage-area network (SAN) and network-attached storage (NAS), while scalable and resilient, are typically seen as lacking the kind of I/O performance these clusters need to rise above the capabilities of the standard data warehouse. Therefore, Hadoop storage is typically direct-attached storage (DAS). However, the use of SAN and NAS as “secondary” storage is emerging.
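The shared-nothing idea can be sketched in a few lines of Python. This is an illustrative toy, not how Hadoop places data: the `Node` class, the `owner` function and the CRC-based placement rule are all assumptions made for the example. Each node holds its own private store, every record lives on exactly one node, and a cluster-wide answer is assembled from results that each node computes against its local data only.

```python
import zlib


class Node:
    """One worker that owns its private (direct-attached) storage."""

    def __init__(self, name):
        self.name = name
        self.local_store = {}

    def put(self, key, value):
        self.local_store[key] = value

    def local_sum(self):
        # Computed entirely from this node's own data; no other node is consulted.
        return sum(self.local_store.values())


def owner(key, nodes):
    """Pick the single node responsible for a key (stable CRC32-based placement)."""
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]


def load(records, nodes):
    """Distribute records so that each one is stored on exactly one node."""
    for key, value in records.items():
        owner(key, nodes).put(key, value)


def cluster_sum(nodes):
    """Combine the per-node partial results into one cluster-wide answer."""
    return sum(node.local_sum() for node in nodes)
```

A typical use would be to create a few nodes, call `load` with a dictionary of records, and then call `cluster_sum`. Adding capacity means adding nodes, which is why this style of architecture pairs naturally with commodity servers and DAS.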
A potential Hadoop user is confronted with a growing list of sourcing choices that range from pure open source to highly commercialized versions. Apache Hadoop and related tools are available for free at the Apache Hadoop site. Cloudera Inc. offers a commercial version that includes some Cloudera add-ons and support. Other open source variants, such as the Facebook distribution, are also available from Cloudera. Commercial versions include MapR, which EMC Corp. now incorporates into a Hadoop appliance.
While Hadoop has grabbed most of the headlines because of its ability to process unstructured data in a data warehouse-like environment, there’s much more going on in the Big Data analytics space.
Structured data is also getting lots of attention. A vibrant and rapidly growing community surrounds NoSQL, an open source, non-relational, distributed and horizontally scalable collection of database structures that address the need for a web-scale database designed for high-traffic websites and streaming media. Document-oriented implementations available include MongoDB (as in “humongous” DB) and Terrastore.
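As a rough illustration of what "document-oriented" means, the following Python sketch keeps schemaless, JSON-style documents in memory and finds them by field value. It does not use MongoDB's or Terrastore's actual APIs; the `DocumentCollection` class and its `insert` and `find` methods are hypothetical names for this example, and a real document database would add persistence, indexing and distribution across nodes.

```python
import json


class DocumentCollection:
    """A toy in-memory collection of schemaless JSON-style documents."""

    def __init__(self):
        self._docs = {}
        self._next_id = 1

    def insert(self, document):
        """Store an independent copy of the document and return its id."""
        doc_id = self._next_id
        self._next_id += 1
        # The JSON round trip copies the document and rejects non-JSON values.
        self._docs[doc_id] = json.loads(json.dumps(document))
        return doc_id

    def find(self, **criteria):
        """Return every document whose fields match all the given values."""
        return [
            doc
            for doc in self._docs.values()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]
```

Note that the documents in one collection need not share the same fields, which is the main contrast with a relational table and its fixed columns.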
Another analytics-oriented database emanating from the open source community is SciDB, which is being developed for use cases that include environmental observation and monitoring, radio astronomy and seismology, among others.
Traditional data warehouse vendors aren't standing idly by. Oracle Corp. is building its “next-generation” big data platforms that will leverage its analytical platform and in-memory computing for real-time information delivery. Teradata Corp. recently acquired Aster Data Systems Inc. to add Aster Data’s SQL-MapReduce implementation to its product portfolio.
Real-time stream processing
The ability to do real-time analytics on multiple data streams using StreamSQL has been available since 2003. Up until now, StreamSQL has only been able to penetrate some relatively small niche markets in the financial services, surveillance and telecommunications network monitoring areas. However, with the burgeoning interest in all things Big Data, StreamSQL is bound to get more attention and find more market opportunities.
StreamSQL is an outgrowth of an area of computational research called Complex Event Processing (CEP), a technology for low-latency processing of real-world event data. Both IBM, with InfoSphere Streams, and StreamBase Systems Inc. have products in this space.
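The core operation in this kind of stream processing is a continuous query over a moving window of events. The Python sketch below is not StreamSQL syntax and is not taken from any vendor's product; the `SlidingWindowAverage` class is a hypothetical name, and the code assumes events arrive in timestamp order. It shows how a running average over the last N seconds can be updated as each event arrives, without rescanning stored data.

```python
from collections import deque


class SlidingWindowAverage:
    """Running average over the events seen in the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Ingest one event and return the average over the current window."""
        self.events.append((timestamp, value))
        self.total += value
        # Evict events that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.events[0][0] < cutoff:
            _, expired_value = self.events.popleft()
            self.total -= expired_value
        return self.total / len(self.events)
```

Each call to `add` does only a small, bounded amount of work per event, which is what keeps latency low compared with loading the data into a warehouse and querying it afterward.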
The Big Data appliance
As interest in Big Data analytics expands into enterprise data centers, the vendor community sees an opportunity to put together Big Data “appliances.” These appliances integrate server, networking and storage gear into a single enclosure and run analytics software that accelerates information delivery to users. They are targeted at enterprise buyers who will value the ease of implementation and ease of use inherent in Big Data appliances. Vendors in this space include EMC, with appliances built around the Greenplum database engine; IBM/Netezza; MapR, with its recently announced commercialized version of Hadoop; and Oracle and Teradata, with comparable pre-integrated systems.
Big Data storage for Big Data analytics
The practitioners of Big Data analytics processes are generally hostile to shared storage. They prefer DAS in its various forms, from SSD to high-capacity SATA disk buried inside parallel processing nodes. Shared storage architectures, such as SAN and NAS, are typically perceived as relatively slow, complex and, above all, expensive. These qualities aren't consistent with Big Data analytics systems, which thrive on system performance, commodity infrastructure and low cost.
Real-time or near-real-time information delivery is one of the defining characteristics of Big Data analytics; therefore, latency is avoided whenever and wherever possible. Data in memory is good; data on spinning disk at the other end of a Fibre Channel SAN connection is not. But perhaps worse than anything else, the cost of a SAN at the scale needed for analytics applications is thought to be prohibitive.
There's a case to be made for shared storage in Big Data analytics. However, storage vendors, and the storage community in general, have yet to make that case to practitioners of Big Data analytics. One early example can be seen in the integration of ParAccel’s Analytic Database (PADB) with NetApp SAN storage.
Developers of data storage technology are moving away from expressing storage as a physical device and toward the implementation of storage as a more virtual and abstract entity. As a result, the shared storage environment can and should be seen by Big Data practitioners as one in which they can find potentially valuable data services, such as:
- Data protection and system availability: Storage-based copy functions that don’t require database quiescence can create restartable copies of data to recover from system failures and data corruption occurrences.
- Reduced time to deployment for new applications and automated processes: Business agility is enhanced when new applications can be brought online quickly by building them around reusable data copies.
- Change management: Shared storage can potentially lessen the impact of required changes and upgrades to the online production environment by helping to preserve an “always-on” capability.
- Lifecycle management: The evolution of systems becomes more manageable and obsolete applications become easier to discard when shared storage can serve as the database of record.
- Cost savings: Using shared storage as an adjunct to DAS in a shared-nothing architecture reduces the cost and complexity of processor nodes.
Each of the above-mentioned benefits can be mapped to shared-nothing analytics architectures. One can expect to see more storage vendors doing this over time. For example, while it hasn’t been announced, EMC could integrate Isilon or Atmos storage with its MapR-based appliance.
Big Data is a Big Deal
Traditional data warehousing is a large but relatively slow producer of information to business analytics users. It draws from limited data resources and depends on iterative extract, transform and load (ETL) processes. Customers are now looking for quick access to information that is based on culling nuggets from multiple data sources concurrently. Big Data analytics can be defined, to some extent, in relation to the need to parse large data sets from multiple sources, and to produce information in real time or near real time.
Big Data analytics represents a big opportunity. IT organizations are exploring the analytics technologies outlined above to parse web-based data sources and extract value from the social networking boom. However, an even larger opportunity -- the Internet of Things -- is emerging as a data source. Cisco Systems Inc. estimates there are approximately 35 billion electronic devices that can connect to the Internet. Any electronic device can be connected (wired or wirelessly) to the Internet, and even automakers are building Internet connectivity into vehicles. “Connected” cars will become commonplace by 2012 and generate millions of transient data streams.
Harnessing the power of multiple data sources such as the Internet of Things will require technologies that go well beyond traditional data warehousing. It will require processes that imitate the way the human brain functions. Our brains take in massive streams of sensory data and make the necessary correlations that allow us to know where we are, what we’re doing, and ultimately what we're thinking -- all in real time. That’s the same kind of data processing Big Data analytics is after.
About the author
John Webster is a senior partner at Evaluator Group Inc., where he contributes to the firm's ongoing research into data storage technologies, including hardware, software and services management.