Symantec Corp. today announced an Apache Hadoop add-on capability for its Veritas Cluster File System (CFS) to help run "big data" analytics on storage area networks (SANs) instead of on scale-out commodity servers using local storage.
Symantec has written a Hadoop Connector for the Hortonworks Data Platform that runs on top of Veritas CFS, which in turn sits on a SAN. The goal is to give data used in Hadoop analytics such enterprise features as high availability, snapshots, deduplication and compression.
Typically, Hadoop storage consists of distributed, scale-out processing nodes, because the Hadoop Distributed File System (HDFS) pools the local storage of each node into one larger file system. Symantec Hadoop Connector is a software layer that sits between the cluster file system and the Hortonworks Hadoop stack so Hadoop can run on networked storage instead of direct-attached storage (DAS). This enables a SAN to serve as Hadoop storage.
"Why build an all-new server [infrastructure] when you can use a perfectly good SAN?" said Dan Lamorena, director of product marketing for Symantec's Storage and Availability Management Group. "We say, 'Let data reside where it is and run analytics there.' Why create a new DAS environment?"
Symantec defines big data as large customer records that require heavy analytics rather than large files used for media, entertainment and genomics. The Hadoop connector can be downloaded for free by CFS customers, Lamorena said.
Much of the data managed by Veritas CFS is stored on a SAN, and it's the type of data customers want to analyze, said Mike Matchett, senior analyst and consultant at Hopkinton, Mass.-based Taneja Group Inc.
"HDFS is designed to work over DAS," Matchett said. "But HDFS doesn't protect data very well. It's difficult to back up. You can't take snapshots off it, and it's difficult to replicate over a WAN. Hadoop usually has no high availability and it's hard to access data from HDFS." The Symantec connector means CFS customers "can still run the Hadoop cluster and instead of using HDFS on each node, you point Hadoop to the Veritas Cluster File System running on a SAN," he said.
Matchett said there may be a performance tradeoff when running Hadoop on a SAN rather than distributing processing across nodes with local storage. Performance on Veritas CFS may be better or worse depending on the algorithm used. "Some algorithms improve performance when run over local storage," he said.