Choosing storage for streaming large files in big data sets
Synchronous analytics and asynchronous analytics are distinguished by the way they process data. But they both have big data storage appetites and specialized needs.
The term big data analytics has crept into the IT vernacular to represent our fixation on what might be called the "big data assumption" -- the belief that the answers to all our questions are buried in piles of data. Somehow, if we can compare and cross-reference enough data points, we'll gain insights that will help us beat the competition, catch all the crooks and save the world from the brink of disaster.
The problem is that all this analysis requires lots of data, and therein lies the challenge for IT: How do you capture, store, access and analyze enough data to garner those insights and justify the resources that have been committed to the task?
Big data analytics applications typically use information such as Web traffic, financial transactions and sensor data, instead of traditional forms of content. The value of the data is tied to comparing, associating or referencing it with other data sets. Analysis of big data usually deals with a very large quantity of small data objects with a low tolerance for storage latency.
There are two primary use cases for big data analytics, and they're distinguished by the way data is processed: synchronously, in real-time or near real-time; or asynchronously, where data is captured first, recorded and then analyzed after the fact using a batch process.
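The difference between the two modes can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the sensor readings, the `RunningAverage` class and the `batch_average` function are hypothetical names invented for the example, and a real deployment would use a streaming engine or a batch framework rather than in-memory lists.

```python
class RunningAverage:
    """Synchronous style: the result is updated as each event arrives."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        # Fold the new event into the running state and return the
        # up-to-date answer immediately.
        self.count += 1
        self.total += value
        return self.total / self.count


def batch_average(stored_values):
    """Asynchronous style: data is captured first and analyzed afterward."""
    return sum(stored_values) / len(stored_values)


# Hypothetical sensor readings
readings = [20.0, 22.0, 21.0, 25.0]

# Real-time path: an answer is available after every single reading.
live = RunningAverage()
live_results = [live.update(value) for value in readings]

# Batch path: readings are stored first, then analyzed in one pass.
stored = list(readings)
batch_result = batch_average(stored)
```

Both paths end up with the same average. What differs is when the answer is available, and therefore how much storage latency the application can tolerate.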
Why Hadoop matters
Hadoop is used extensively in big data applications where its flexibility supports the dynamic nature of the extract, transform, load (ETL) cycle in a big data environment. Hadoop's distributed architecture, which puts the processing engine close to the storage, is well suited for batch processing jobs like ETL where the output goes directly to storage. Hadoop's MapReduce function allows a large ingest job to be broken into smaller pieces and sent to multiple nodes (Map) and then combined (Reduce) into the final dataset that is loaded into the data warehouse.
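As a rough illustration of the Map and Reduce steps described above, the following Python sketch counts page hits in a handful of Web-traffic records. It runs on a single machine and does not use Hadoop's actual API; the record layout and the `split`, `map_phase` and `reduce_phase` functions are assumptions made for the example.

```python
from collections import defaultdict


def split(records, n_chunks):
    """Break the ingest job into smaller pieces, one per simulated node."""
    return [records[i::n_chunks] for i in range(n_chunks)]


def map_phase(chunk):
    """Map: emit a (key, 1) pair for each record in one piece of the input."""
    return [(record["page"], 1) for record in chunk]


def reduce_phase(mapped_chunks):
    """Reduce: combine the pairs from all pieces into per-key totals."""
    totals = defaultdict(int)
    for pairs in mapped_chunks:
        for key, count in pairs:
            totals[key] += count
    return dict(totals)


# Hypothetical Web-traffic records
records = [
    {"page": "/home"},
    {"page": "/cart"},
    {"page": "/home"},
    {"page": "/home"},
    {"page": "/cart"},
]

chunks = split(records, 3)
# On a real cluster each piece would be mapped on a different node,
# close to where that piece of the data is stored.
mapped = [map_phase(chunk) for chunk in chunks]
page_hits = reduce_phase(mapped)
```

Here `page_hits` holds the combined result, the equivalent of the final data set that would be loaded into the warehouse.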
One of the earliest examples of near real-time big data analytics is how supermarkets calculate your buying behavior and use it to print coupons with your register receipt. In reality, the buying behavior calculation was probably done ahead of time and just referenced when you checked out, but the concept is the same. Other examples include the constant profiling social media sites perform using your preferences and online activity, which is then sold to advertisers to create the pop-up experience you get from these same sites.
In retailing, some large stores are starting to use facial recognition software to identify shoppers in the parking lot so their buying profiles can be accessed to generate promotional materials that are emailed or texted to them as they walk around the store. In real-time use cases like these, speed is a critical factor, so the big data storage infrastructure must be designed to minimize latency.
Storage for synchronous analytics
Real-time analytics applications are typically run on NoSQL databases, which are massively scalable and can be supported with commodity hardware. Hadoop, on the other hand, is better suited for batch processing, the kind of work supporting asynchronous big data analytics. Since storage is a common source of latency, solid-state storage devices are popular options for real-time analytics.
Flash storage can be implemented in several ways: as a tier on a traditional disk array, as a network-attached storage (NAS) system or in the application server itself. This server-side flash implementation has gained popularity because it provides the lowest latency (storage is closest to the CPU) and offers a way to get started with only a few hundred gigabytes of capacity. SAS/SATA solid-state drives (SSDs) are an option, but PCI Express (PCIe) card-based solid-state is becoming the standard for performance applications like real-time analytics because that implementation offers the lowest latency.
Currently, a number of companies offer PCIe flash storage, including Fusion-io, LSI, Micron Technology, SanDisk, sTec (now part of HGST, a division of Western Digital), Violin Memory and Virident (to be acquired by Western Digital). All the major server and storage vendors offer PCIe solutions as well, many through OEM agreements with these solid-state companies.
Although PCIe cards are now available with as much as 10 TB of flash capacity, a shared storage pool may still be needed. One solution is to use a technology like Virident's FlashMAX Connect software, which can pool flash capacity across PCIe cards and even among servers via InfiniBand. This can be very useful for extending the available flash capacity, especially in servers with limited PCIe slot availability or to support VMware's vSphere Storage vMotion. By pooling flash on multiple servers, these solutions can also provide failover and high-availability capabilities.
Another option is an all-flash array connected via InfiniBand, Fibre Channel or even PCIe. Capacities for these systems range from fewer than 10 TB to more than 100 TB for those with scalable, modular architectures. These high-end solutions offer performance up to 1 million IOPS and nominal latencies as low as a few hundred microseconds. Most of the major storage players have something in the all-flash category but, with the exception of IBM's Texas Memory acquisition, smaller companies have more products to offer and longer track records. Those companies include Kaminario, Nimbus Data Systems, Pure Storage, Tegile, Whiptail (to be acquired by Cisco Systems) and Violin Memory.
Asynchronous big data analytics
Big data analytics that involves asynchronous processing follows a capture-store-analyze workflow where data is recorded (by sensors, Web servers, point-of-sale terminals, mobile devices and so on) and then sent to a storage system before it's subjected to analysis. Since these types of analytics are done using a traditional relational database management system (RDBMS), the data must be converted or transformed into a structure the RDBMS can use, such as rows and columns, and must be consistent with other data sets being analyzed.
This process is called extract, transform, load or ETL. It pulls (extracts) data from the source systems, normalizes (transforms) the data sets and then sends the data to a warehouse (load) for storage until it's analyzed. In traditional database environments this ETL step was straightforward because the analytics were fairly well known: financial reports, sales and marketing, enterprise resource planning and so on. But with big data, ETL can become a complex process in which the transformation step is different for every data source and every data source itself is different.
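A minimal sketch of the three ETL steps follows, using Python's built-in sqlite3 module as a stand-in for the warehouse. The two source formats, the field names and the `sales` table are hypothetical, chosen only to show why each source needs its own transformation before the data can sit in the same rows and columns.

```python
import sqlite3

# Extract: hypothetical raw records from two sources with different shapes.
pos_rows = [{"store": "12", "amount_cents": 1999, "ts": "2013-09-01T10:15:00"}]
web_rows = [{"site": "shop.example", "total": "24.50", "time": "2013-09-01 10:16:00"}]


def transform_pos(row):
    """Transform a point-of-sale record into the common row layout."""
    return ("pos", row["store"], row["amount_cents"] / 100.0,
            row["ts"].replace("T", " "))


def transform_web(row):
    """Transform a Web order record into the same row layout."""
    return ("web", row["site"], float(row["total"]), row["time"])


def run_etl(conn):
    """Load the normalized rows into a warehouse table; return the row count."""
    conn.execute(
        "CREATE TABLE sales (source TEXT, origin TEXT, amount REAL, sold_at TEXT)"
    )
    rows = [transform_pos(r) for r in pos_rows]
    rows += [transform_web(r) for r in web_rows]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# An in-memory database stands in for the data warehouse (an assumption).
conn = sqlite3.connect(":memory:")
loaded = run_etl(conn)

# Once loaded, the data can be analyzed with ordinary SQL.
total_sales = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

Each new data source would need its own transform function, which is where the complexity described above comes from.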
When the analytics are run, data is pulled from the warehouse and fed into the RDBMS with the results used to generate reports or to support other business intelligence applications. In keeping with the big data assumption, the raw data set is typically kept, as well as the transformed data, since it may need to be re-transformed for a future job.
Storage for asynchronous big data analysis
The storage challenges for asynchronous big data use cases concern capacity, scalability, predictable performance (at scale) and especially the cost to provide these capabilities. While data warehousing can generate very large data sets, the latency of tape-based storage may just be too great. In addition, traditional "scale-up" disk storage architectures aren't usually cost-effective at these capacity points.
Scale-out storage. A scale-out storage architecture using modules or nodes that are clustered to act as a single storage pool, usually with a file-system interface, can provide an appealing solution for big data analytics. Some examples include Dell EqualLogic, EMC Isilon, Exablox (also object-based), Gridstore, HP StoreAll (formerly Ibrix) and IBM Scale Out Network Attached Storage (SONAS). Since each node contains processing power and disk storage, these systems can scale performance along with capacity.
Hadoop is also being used as a storage framework, enabling companies to construct their own highly scalable storage systems using low-cost hardware and providing maximum flexibility. Hadoop runs on a cluster of nodes, each with storage capacity and the compute power to process the data it holds. Other nodes coordinate these processing jobs and manage the distributed storage pool, generally using the Hadoop Distributed File System (HDFS), although other storage systems can work with Hadoop clusters as well.
But Hadoop, specifically HDFS, keeps three copies of the data by default to support the high-availability environments it was designed for. That's fine for data sets in the terabyte range, but when capacity is in the petabytes, HDFS can make storage very expensive. Even scale-out storage systems can suffer from the same issues, as many use RAID to provide data protection at the volume level and replication at the system level. Object-based storage technologies can provide a solution for larger environments that may run into this data redundancy problem.
Object storage. Object-based storage architectures can greatly enhance the benefits of scale-out storage by replacing the hierarchical storage architecture that many use with flexible data objects and a simple index. This enables almost unlimited scaling and further improves performance. Object storage systems that include erasure coding don't need to use RAID or replication for data protection, resulting in dramatic increases in storage efficiency.
Rather than creating two or three additional copies (200% to 300% capacity overhead), plus the overhead of the RAID scheme in use, object storage systems with erasure coding can achieve even greater levels of data protection with just 50% or 60% overhead. In big data storage environments, the cost savings can be enormous. There are many object storage systems on the market, including Caringo, DataDirect Networks Web Object Scaler, NetApp StorageGRID, Quantum Lattus and the open source OpenStack Swift and Ceph.
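The arithmetic behind those overhead figures can be worked through with a short Python sketch. The 1 PB data set and the 10+6 and 10+5 fragment layouts are assumptions picked to match the 60% and 50% figures; actual erasure coding layouts vary by product.

```python
def raw_capacity_replication(usable_tb, total_copies):
    """Raw capacity needed when every object is stored total_copies times."""
    return usable_tb * total_copies


def raw_capacity_erasure(usable_tb, data_fragments, parity_fragments):
    """Raw capacity needed when each object is split into data fragments
    and protected by additional parity fragments."""
    return usable_tb * (data_fragments + parity_fragments) / data_fragments


def overhead_pct(raw_tb, usable_tb):
    """Extra capacity beyond the data itself, as a percentage of the data."""
    return (raw_tb - usable_tb) * 100 / usable_tb


usable_tb = 1000  # hypothetical 1 PB data set

# Three copies in total (the original plus two additional copies).
three_copies = raw_capacity_replication(usable_tb, 3)

# Assumed layouts: 10 data fragments plus 6 or 5 parity fragments.
ec_10_6 = raw_capacity_erasure(usable_tb, 10, 6)
ec_10_5 = raw_capacity_erasure(usable_tb, 10, 5)

replication_overhead = overhead_pct(three_copies, usable_tb)  # 200%
ec_10_6_overhead = overhead_pct(ec_10_6, usable_tb)           # 60%
ec_10_5_overhead = overhead_pct(ec_10_5, usable_tb)           # 50%
```

Under these assumptions, protecting 1 PB takes 3,000 TB of raw capacity with three copies but 1,600 TB with the 10+6 layout, and the 10+6 layout can still lose any six fragments of an object without losing data.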
Some object storage systems, like Cleversafe's, are even compatible with Hadoop. In those implementations, the Hadoop software components would run on the CPU in the object storage nodes and the object storage system would replace HDFS in the storage cluster.
Bottom line for big data storage
Big data analytics may seem to be an IT "wonder drug" that more and more companies believe will bring them success. But as is often the case with new treatments, there's usually a side effect -- in this case, it's the reality of current storage technology. Traditional storage systems can fall short for both real-time big data applications that need very low latency and data mining applications that can amass huge data warehouses. To keep the big data analytics beast fed, storage systems must be fast, scalable and cost-effective.
Flash storage solutions, implemented at the server level and with all-flash arrays, offer some interesting alternatives for high-performance, low-latency storage, from a few terabytes to a hundred terabytes or more in capacity. Object-based, scale-out architectures with erasure coding can provide scalable storage systems that eschew traditional RAID and replication methods to achieve new levels of efficiency and lower per-gigabyte costs.
About the author:
Eric Slack is a senior analyst at Storage Switzerland, an IT analyst firm focused on storage and virtualization.