Big data analytics creates some challenges for storage managers, but effective integration of big data and Hadoop into a storage environment can make them less daunting.
Today it seems as if big data is everywhere, but big is a relative term and the focus should be on the data portion. Big data analytics requires lots of data. So storing and protecting data, and making it accessible and available, requires some heavy-duty data management.
Many big data platforms such as Hadoop and other NoSQL and non-relational databases leverage a shared-nothing architecture. But that type of architecture can be problematic for storage managers.
Most storage professionals have spent years, if not decades, consolidating and concentrating data into as few storage silos as possible. End users have been told to save everything to a server so data can be backed up and properly managed. RAID systems have been optimized to deliver maximum performance and reliability in heavily multi-tenanted shared storage systems.
But then along comes Hadoop, and all that order breaks down. Hadoop is best run in a highly distributed environment with local server storage. This storage paradigm doesn't align with an enterprise's concept of reliability, availability and serviceability (RAS). In fact, for the most part it can be argued that the shared-nothing distributed nature of the architecture isn't enterprise-ready.
The big (storage) problem with big data
Since many enterprises are still getting their feet wet with big data platforms, it may make sense to create a separate environment specific to the big data project that follows the infrastructure architecture suggested by the platform vendor. So, in the case of Hadoop, that calls for many distributed nodes, each with local storage, all sitting on a common LAN. The benefit of this arrangement is that it segregates the sandbox of the Hadoop project away from the production environments.
However, this design is less than optimal in a couple of key areas:
- Data gets duplicated
- There's a lot of data movement resulting from extract, transform and load (ETL) processes
As mentioned earlier, storage administrators have spent a long time trying to normalize the data they store. Specialized products, such as data deduplication appliances, exist to make the normalization efforts possible. But if an enterprise is creating a discrete environment just for big data, many of the benefits of data dedupe and data capacity optimization go out the window.
Additionally, apart from managing duplicate data, another significant challenge is managing the amount of data moving from production and data warehousing environments to populate the big data environment. Depending on the design of the big data environment, data may be replicated and then persisted within the big data environment or, in many cases, imported anew for each iteration of big data processing. This import process is ETL: data is extracted from a source (e.g., a data warehouse), transformed into a form compatible with the big data environment and loaded into the target environment. The ETL process can place a considerable amount of stress on the storage network.
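Part of why ETL is so stressful on the network is that each processing iteration can re-read and re-ship the entire source data set. A minimal sketch of one ETL iteration, using hypothetical in-memory rows as a stand-in for a warehouse table and a local file as a stand-in for an HDFS target (all names and paths here are illustrative, not from any real system):

```python
import tempfile
from pathlib import Path

# Hypothetical source rows, standing in for a data warehouse table.
warehouse_rows = [
    {"order_id": 1, "amount": "19.99", "region": "EMEA"},
    {"order_id": 2, "amount": "5.00", "region": "APAC"},
]

def extract(rows):
    """Extract: pull every row from the source for each iteration."""
    return list(rows)

def transform(rows):
    """Transform: flatten rows into the delimited records a Hadoop
    job typically expects."""
    return [f"{r['order_id']}\t{r['amount']}\t{r['region']}" for r in rows]

def load(lines, target):
    """Load: write the records to the target. A local file stands in
    for an HDFS path such as /data/orders/part-00000."""
    target.write_text("\n".join(lines) + "\n")

target = Path(tempfile.mkdtemp()) / "part-00000"
load(transform(extract(warehouse_rows)), target)
```

Note that the full source is copied on every run; if storage could present the data to Hadoop in place, the extract and load steps (and their network traffic) would disappear.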
Finally, given that the big data infrastructure is distinct from the rest of the enterprise, traditional (i.e., existing) data management applications likely won't be capable of managing, optimizing and sustaining the big data infrastructure.
Big data storage product sampler
The following are examples of some of the hardware and software products currently available that can be used to address big data processing requirements within an enterprise storage environment.
EMC Isilon: Scale-out NAS storage systems that support the Hadoop Distributed File System.
Hewlett-Packard (HP) ConvergedSystem 300 for Vertica: Preconfigured hardware system to host the HP Vertica Analytics Platform.
IBM PureData System for Hadoop: Appliance that supports IBM's Hadoop-based InfoSphere BigInsights software.
MapR Technologies: Distribution of Hadoop with native support for NFS.
NetApp FlexPod Select with Hortonworks Data Platform (HDP): NetApp's reference architecture for a converged system that supports Hadoop via HDP.
Pivotal Data Computing Appliance: Big data analytics appliance that integrates the Pivotal Greenplum Database and Hadoop.
Storage nirvana is attainable (Read: HDFS)
Ideally, data should be normalized across all platforms from OLTP to OLAP to big data and so on. This concept is often referred to as the "single source of truth." The design is optimized for performance, capacity efficiency, availability and manageability.
To achieve this storage nirvana, storage managers need to embrace the inevitability of big data analytics coming to their data centers. Storage administrators should try to pave the way for the introduction of these new infrastructure designs. The best way to do that is to provide protocols such as the Hadoop Distributed File System (HDFS) as a new way of accessing data.
This approach is not only enterprise-ready, but also big data-ready. Currently, only a few storage systems can provide HDFS as an interface (most notably, perhaps, EMC Isilon storage arrays). An alternative is to select a big data distribution that supports traditional enterprise storage protocols such as NFS; MapR Technologies' distribution of Hadoop, for example, has native support for NFS.
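The practical appeal of an NFS-capable distribution is that ordinary POSIX file APIs are enough; no Hadoop client library is required to get data into or out of the cluster. A sketch of the idea, using a temporary local directory as a stand-in for an NFS mount point (the mount path, file name and record are hypothetical):

```python
import tempfile
from pathlib import Path

# Stand-in for an NFS mount of the cluster, e.g. a path like
# /mapr/my.cluster.com on a host with the cluster mounted.
mount_point = Path(tempfile.mkdtemp())

# Any application can write with plain file I/O -- no Hadoop client needed.
log_file = mount_point / "logs" / "events.log"
log_file.parent.mkdir(parents=True)
log_file.write_text("2014-06-01\tlogin\tuser42\n")

# A downstream Hadoop job would see the same bytes at the same path,
# with no extract or load step in between.
contents = log_file.read_text()
```

The same pattern applies in reverse: results a Hadoop job writes into the cluster become immediately readable by any NFS client.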
Ultimately, storage administrators will have to evolve their existing storage systems and storage architectures into an object-addressable storage architecture (also known as object-based storage).
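To make the term concrete: in object-addressable storage, data is retrieved by an identifier derived from the object itself rather than by a hierarchical path, which is also what makes capacity optimizations such as deduplication natural. A toy content-addressed store illustrating the idea (a teaching sketch, not any vendor's implementation):

```python
import hashlib

class ObjectStore:
    """Toy content-addressed store: the object ID is the SHA-256 hash
    of the object's bytes, so identical data is stored only once."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The address is derived from the content, not a path.
        oid = hashlib.sha256(data).hexdigest()
        self._objects[oid] = data  # a duplicate put lands on the same key
        return oid

    def get(self, oid: str) -> bytes:
        return self._objects[oid]

store = ObjectStore()
oid = store.put(b"quarterly sales figures")
dup = store.put(b"quarterly sales figures")
assert oid == dup              # same content, same address
assert len(store._objects) == 1  # deduplicated for free
```

Because the identifier travels with the data rather than with a file system hierarchy, any application that holds the ID can fetch the object, regardless of where it physically resides.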
What many storage managers may not have recognized is that a solution like Hadoop isn't a single piece of software, but rather a framework. The storage-related portion of that framework is HDFS.
As a file system, HDFS provides a level of data management. In fact, because HDFS stores data as large, write-once blocks, it can serve as the foundation of a fully object-addressable environment. More and more independent software vendors are providing gateways that enable HDFS to integrate with traditional networked storage. In the future, some enterprises may even drop some of the traditional POSIX-compliant file systems they're familiar with and replace them with HDFS-based storage devices.
The EMC Isilon storage system, for example, is a scale-out storage architecture. EMC Isilon can be managed using existing storage management and datacenter management solutions (such as VMware vCenter). The scale-out capabilities of EMC Isilon and its support for HDFS allow performance to be optimized by distributing I/Os across multiple controller nodes. But most critically, it allows data to stay in place because it won't have to be moved for big data analytic processing.
Data originating from an edge device (a mobile device, desktop or laptop) may start off being written through an SMB interface. That data may then be picked up and leveraged by a mission-critical application over NFS, and the same data can then, through an HDFS interface, become part of the Hadoop framework without ever having to be extracted, transformed or loaded from one system to another.
That type of approach has some very appealing benefits for an enterprise:
- Data can now be compressed or deduplicated according to enterprise policies.
- Data can be backed up and managed just like a traditional storage system.
- Data provenance can be accurately audited, providing new levels of governance and compliance.
The concept of object-addressable storage is no different in basic design from a file sync-and-share environment such as Box or Dropbox. However, instead of edge devices using the data, it's mission-critical applications that process the data. The portability of the data gives rise to new levels of opportunity for the data to be used and enterprise value to be derived.
This object approach also minimizes the pressures on both the LAN and the SAN, since ETL can essentially be eliminated.
Users should encourage their storage vendors to embrace and accelerate their HDFS integration. At the same time, users should be wary of storage vendors promoting (converged) infrastructures that are "designed" to work with various Hadoop distributions. In most cases, their claims simply mean that the RAID controller can create many LUNs and allows multiple (block) connections to a Hadoop cluster. It generally doesn't mean the storage system can "speak" HDFS with Hadoop.
A number of vendors have put together appliances that are designed for Hadoop. These products include NetApp's FlexPod Select for Hadoop (using Hortonworks Data Platform) reference architecture, Pivotal's Data Computing Appliance (DCA) and IBM's PureData System for Hadoop appliances. These offerings tightly integrate server, network, storage and Hadoop distributions to optimize deployment and maintenance. However, the underlying storage systems don't have native HDFS interfaces. In the case of the Pivotal DCA, all the storage is local to the compute nodes in each system.
Vendors that deliver native HDFS integration include EMC, with its Isilon storage systems, and Hewlett-Packard, with its Vertica Connector for Hadoop.
The exception to the rule
While much of the discussion above has focused on integrating HDFS with storage, there's another way to integrate storage with Hadoop. As noted earlier, Hadoop is a framework, and HDFS is a module that plugs into it. So, the exception to HDFS integration is to forgo HDFS altogether and replace it with another file system that can also plug into Hadoop.
For example, IBM's General Parallel File System (GPFS) is an alternative to HDFS. In essence, a storage admin would move some of the responsibility of integrating the data store to the Hadoop administrator. A challenge with that approach is that it increases the complexity of the big data environment considerably. The supportability of the big data environment is also in question. That said, IBM's PureData solution does use GPFS. If your big data environment is going to be all "Blue," then that IBM-supported route might be right for your organization.
Bottom line for big data storage
Big data and Hadoop are a great impetus to review your company's storage infrastructure. Storage administrators need to consider how they can evolve their existing infrastructures to be more flexible, dynamic and multi-application friendly.
The next generation of highly virtualized data centers will be data-centric, not compute-centric. Storage administrators will need to take into consideration that they'll have the responsibility of creating an architecture that minimizes (and optimally, eliminates) data movement between applications.
Storage managers also have to consider the impact this evolution will have on backup and disaster recovery strategies.
Perhaps the most important thing for storage managers to remember is that this is an evolution, not a revolution. That said, those enterprises that evolve more quickly will have a first-mover advantage; if the data users of those enterprises can exploit it, they should also expect to achieve a competitive advantage for their enterprises.
One thing is for sure when it comes to big data: if you're not doing it, your competitors probably are. Embrace the change to evolve the IT department from a cost center into a dynamic information service provider.
About the author:
Ben Woo is the founder and managing director of the market research firm Neuralytix. Ben is a frequent contributor to SearchStorage.com and other TechTarget publications, and a speaker at Storage Decisions.