Choosing storage for streaming large files in big data sets
Big data has been in the news a lot lately, highlighted, somewhat negatively, by the controversy over the National Security Agency's mining of telephone records. But the majority of big data objectives are of a decidedly mundane nature: combing through customer service incidents to improve product quality, determining which products sell well together to optimize merchandising or using diesel price data to route long-haul trucks cost-effectively. Indeed, big data value can be found in virtually all industries: financial services, health care, retail, natural resources and government are all good examples. Commercially, the objectives are pretty straightforward: Gain competitive advantage and improve profitability. Early adopters of big data will gain an advantage, while laggards will be in catch-up mode.
Storage is key to big data
Storage organizations focus on big data because it's up to them to house and manage potentially petabytes of information. From a business perspective, it's about big data analytics, or the application side of what can be derived from vast amounts of information. This is an important distinction: If the task were only to house a lot of data, the architecture would be simple -- the highest-capacity drives for the lowest cost and some measure of data protection. But when the objective is competitive advantage and increasing profit, timeliness and a data-crunching capability justify a higher price. Organizations that can recognize changes in consumer habits earlier than their competitors, for example, will have "first mover" advantage to potentially lucrative markets, fads or trends.
The term big data isn't all that useful a label because it raises such questions as "How big does it have to be to be big?" and "Is there such a thing as medium data?" Certainly, big data may involve petabytes of data, but not necessarily. It's about the analytic process more than the sheer size of the data store. Big data also involves the unpredictable nature of the incoming data in terms of source and format. Some observers will argue that big data includes traditional extract, transform and load (ETL) systems that feed data into commercial relational databases. More recently, however, it's thought of in terms of the Hadoop open source framework.
Theoretically, any organization can benefit from performing analytics on big data regardless of its size. The limiting factor is having the necessary critical mass of expertise to implement and gain value from the analytics rather than some arbitrary volume delineation. From a storage manager's standpoint, the critical issue can be summarized as how to provide agile, cost-effective access to unpredictable and potentially massive amounts of information. With all the data storage technologies available, storage should never be the limiting factor in big data analytics.
Big data or big I/O?
A better label than big data, at least from a storage manager's perspective, might be "big I/O." The unpredictable nature of big data inhibits a manager's ability to gauge which or how much data might be in demand at any point in time. Thus, predicting compute and I/O requirements may be an inexact science. Storage managers will want to select systems and architectures that provide maximum flexibility to adjust any given parameter in the performance equation.
Although ETL and data warehouse environments can be considered to be big data applications, there's an important distinction between these traditional analytical approaches and big data: real-time processing. Think of it as online transaction processing (OLTP) meets data warehousing. This adds a further element of unpredictability because the newer data processing may call for data that resides on low-IOPS hard disk drives (HDDs). From a storage perspective, this means big data may have the throughput requirements of OLTP with the capacity of a data warehouse.
I/O requirements will be further influenced by the nature of the data. Millions (or billions) of small files may be highly random in access. A few large files may be best served by long sequential reads. Understanding this distinction will help storage managers determine which architecture is best suited to their workload.
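The difference between the two access patterns can be made concrete with a back-of-envelope estimate. The sketch below is only illustrative: the function names are hypothetical, and the drive figures (150 random IOPS and 150 MB/s of sequential throughput for a single HDD) are assumptions chosen for the arithmetic, not measurements of any product.

```python
# Back-of-envelope model. All device figures below are illustrative
# assumptions, not measurements of any particular drive.


def random_read_seconds(file_count: int, iops: float) -> float:
    """Time to read many small files when each read costs about one random I/O."""
    return file_count / iops


def sequential_read_seconds(total_bytes: float, bytes_per_second: float) -> float:
    """Time to stream large files when throughput, not seek time, dominates."""
    return total_bytes / bytes_per_second


if __name__ == "__main__":
    hdd_iops = 150              # assumed random IOPS for one hard disk drive
    hdd_stream = 150e6          # assumed sequential throughput in bytes/second

    small_files = 1_000_000     # one million small files, one I/O each
    large_bytes = 100 * 10**9   # 100 GB held in a few large files

    hours = random_read_seconds(small_files, hdd_iops) / 3600
    minutes = sequential_read_seconds(large_bytes, hdd_stream) / 60
    print(f"random small-file reads: {hours:.1f} hours")
    print(f"sequential large-file reads: {minutes:.1f} minutes")
```

Under these assumed numbers, the small-file workload is bound by IOPS and the large-file workload by throughput, which is why the two tend to favor different storage architectures.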
Storage managers are accustomed to a full range of data service capabilities in arrays. Here's a rundown of how some of these may play in a big data environment.
RAID. RAID may seem obvious, but there are a few special considerations. Widely distributed data stores may operate conventionally with a RAID-5 configuration. In contrast, large-scale central data stores may demand RAID-6 functionality, given the sheer size of the store. However, each additional parity drive incurs both capacity and processing overhead. Object-based storage, yet another alternative, typically doesn't rely on RAID at all. Instead, it replicates data across distributed nodes, which places protected copies where they're needed.
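The capacity side of that tradeoff is simple arithmetic. The following sketch compares the usable fraction of raw capacity for single parity (RAID-5), dual parity (RAID-6) and replication. The 10-drive group and the three-copy replication factor are assumptions picked for illustration, and the function names are hypothetical.

```python
def raid_usable_fraction(drives: int, parity_drives: int) -> float:
    """Fraction of raw capacity left for data in one parity RAID group."""
    if not 0 < parity_drives < drives:
        raise ValueError("parity_drives must be between 1 and drives - 1")
    return (drives - parity_drives) / drives


def replication_usable_fraction(copies: int) -> float:
    """Fraction of raw capacity left for data when every object is stored `copies` times."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 / copies


if __name__ == "__main__":
    group_size = 10  # assumed number of drives per RAID group
    print(f"RAID-5: {raid_usable_fraction(group_size, 1):.0%} usable")
    print(f"RAID-6: {raid_usable_fraction(group_size, 2):.0%} usable")
    print(f"3 copies: {replication_usable_fraction(3):.0%} usable")  # assumed replication factor
```

The sketch covers capacity only. Parity calculation cost, rebuild time and the locality benefits of replication are not modeled.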
Thin provisioning. Because data volumes are unpredictable in big data, thin provisioning can help ensure that capacity is available without overprovisioning.
Encryption. Intuitively, it would seem that encryption isn't necessary for an application that's usually kept in-house and is transient in nature. However, if any of the incoming data is regulated, encryption is worth serious consideration.
Automated tiering. Unpredictable IOPS requirements can be solved with automated tiering schemes that move "hot" data to faster media and "cold" data to low-cost, high-capacity HDDs. Some automated tiering schemes move small amounts of data frequently, which may be ideal for high-volume, small-file systems. Other schemes move large blocks infrequently, which may be best-suited to large-file environments.
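As a rough illustration of the hot/cold idea, the sketch below assigns each block to a tier based on how often it was accessed during an observation window. It is a toy model under stated assumptions, not how any particular array implements tiering: the function name, the tier labels and the threshold are all hypothetical. The granularity choice described above corresponds to how large a "block" is and how often the function is rerun.

```python
from collections import Counter
from typing import Dict, Iterable


def assign_tiers(
    accesses: Iterable[str],
    all_blocks: Iterable[str],
    hot_threshold: int,
) -> Dict[str, str]:
    """Map each block ID to "fast" or "capacity" based on its access count.

    accesses: block IDs seen in the observation window, one entry per access.
    all_blocks: every block ID under management, including untouched ones.
    hot_threshold: assumed minimum access count for a block to be "hot".
    """
    counts = Counter(accesses)  # missing blocks count as zero accesses
    return {
        block: "fast" if counts[block] >= hot_threshold else "capacity"
        for block in all_blocks
    }


if __name__ == "__main__":
    window = ["blk-1", "blk-1", "blk-1", "blk-2"]  # hypothetical access log
    placement = assign_tiers(window, ["blk-1", "blk-2", "blk-3"], hot_threshold=2)
    for block, tier in placement.items():
        print(block, tier)
```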
Remote replication. Real-time big data applications may represent the cumulative results of weeks or months of processing. Often, these systems are more accurate over time. Thus, losing the data store may set the business back significantly and recreating it may be impossible. Consequently, remote replication may be required to avoid downtime in the event of significant system failure or disaster. The recovery point objective may be critical even if the recovery time objective is less stringent.
Organizations may determine that employing Hadoop or a similar stack is the most efficient big data implementation. From an IT perspective, the implementation is sufficiently different that a proof of concept is well justified: Seat-of-the-pants deployments are likely to yield frustration and failure.
Storage managers must factor in the reality that big data could reintroduce siloed storage to the data center. After having spent the last decade trying to reduce silos, IT organizations would be understandably reluctant to reintroduce them. Nevertheless, the benefits to the business of performing analytics on big data may far outweigh the difficulties and drive storage managers to understand and adjust. The result just might be some cool technology that's a business game changer.
About the author:
Phil Goodwin is a storage consultant and freelance writer.
Four categories of storage architecture for big data