Clustering is one of the oldest concepts in storage, but it's getting a new spin as major vendors join smaller players in promoting offerings that can scale well beyond the traditional setup of two nodes tightly coupled for failover purposes.
At its most basic level, a cluster is a group of servers or storage nodes that, linked together logically, act as a single system. To an administrator or application, the clustered nodes appear as one pool: through a global address space in the case of block storage, or through a single namespace in the case of a file system.
One of the most appealing aspects for some users is the ability to grow, or scale out, capacity in a modular fashion, beyond the limits of a single standalone storage system or a single location. They simply add a node when the need arises, and the cluster automatically recognizes it.
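As a rough illustration of the scale-out idea, the sketch below models a cluster as a pool whose capacity is simply the sum of its member nodes. The `Node` and `Cluster` names and the 12 TB node size are hypothetical, chosen only for this example; they do not describe any vendor's product, and real systems also have to rebalance data and handle failures.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical storage node with a fixed raw capacity."""
    name: str
    capacity_tb: float


@dataclass
class Cluster:
    """A toy cluster: clients see one pool, not individual nodes."""
    nodes: list[Node] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        # Joining is modeled as a simple append; the pool grows
        # immediately, with no restart of the existing nodes.
        self.nodes.append(node)

    @property
    def capacity_tb(self) -> float:
        # The pool size is the sum of all member nodes.
        return sum(n.capacity_tb for n in self.nodes)


cluster = Cluster()
cluster.add_node(Node("node-1", 12.0))
cluster.add_node(Node("node-2", 12.0))
before = cluster.capacity_tb

# Scale out: add a third node when more space is needed.
cluster.add_node(Node("node-3", 12.0))
after = cluster.capacity_tb

print(f"capacity before: {before} TB, after: {after} TB")
```

The point of the model is only that growth is additive and incremental: capacity is bought one node at a time rather than by replacing a standalone system.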
The Spielberg Family Center for Applied Proteomics at Cedars-Sinai Medical Center, for instance, pumps a terabyte a day into its Isilon IQ clustered storage systems from Isilon Systems Inc. The Los Angeles-based center has approximately 20 storage nodes in the room where its research instruments generate data and about 800 nodes in a data center across the street.
"When you're generating data at that rate, it's convenient to be able to purchase storage as you grow and to not have downtime as you add storage," says Parag Mallick, director of clinical proteomics at the center. "With other systems we played with, you had to turn everything off and add disk space and [then] restart and turn everything back on again."
Cost-effective clustered storage
Cost-effective manageability is one of the major advantages a clustered storage system can provide, in addition to high availability, reliability, performance scaling beyond the limits of one device/system and, in some cases, load balancing.
Clustered storage systems come in several types: file, block and object based. Each provides a different mechanism for data access.
Greg Schulz, founder and analyst at StorageIO Group in Stillwater, Minn., points out that not all file-based options have a clustered file system. Those with one offer read/write access from any node on which the file system runs. But more traditional NAS devices without a clustered file system have read/write access from only one primary node at a time, he notes.
With the block-based options, some offer shared access to logical unit numbers (LUNs) from any node, while others afford data access from a given node on a primary basis, says Schulz.
The object-based option is the most distinctive. Products such as EMC Corp.'s Centera make use of content-addressed storage (CAS) technology to convert a file into an object and then assign a unique identifier or digital fingerprint to each one. The unique identifiers are accessible via a directory.
The traditional use case for CAS is archiving large amounts of data, such as email or medical records, for compliance or regulatory purposes. Performance scaling isn't a prime objective for CAS, as it is with block- and file-based clustered storage systems.
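The content-addressing idea can be sketched in a few lines. This is a generic illustration, not a description of how Centera works internally: the `ContentStore` class is hypothetical, and the use of SHA-256 as the "digital fingerprint" is an assumption made for the example.

```python
import hashlib


class ContentStore:
    """A toy content-addressed store (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}   # fingerprint -> content
        self._directory: dict[str, str] = {}   # file name -> fingerprint

    def put(self, name: str, data: bytes) -> str:
        # The identifier is derived from the content itself
        # (SHA-256 here is an assumption, not a vendor detail).
        object_id = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same identifier, so it is stored once.
        self._objects.setdefault(object_id, data)
        self._directory[name] = object_id
        return object_id

    def get(self, name: str) -> bytes:
        # Look up the identifier in the directory, then fetch the object.
        return self._objects[self._directory[name]]

    def object_count(self) -> int:
        return len(self._objects)


store = ContentStore()
id_a = store.put("scan-001.dat", b"patient record")
id_b = store.put("scan-001-copy.dat", b"patient record")
id_c = store.put("scan-002.dat", b"another record")

print(id_a == id_b)          # same content, same identifier
print(store.object_count())  # two distinct objects are stored
```

Because the identifier depends only on the content, a changed file gets a different identifier, which is one reason this style of storage suits archives kept for compliance.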
But Schulz advises users to consider clustered storage "beyond where it's been pigeonholed in the past. The answer to all problems isn't clustered storage, but it can be used for a lot more purposes than just high-performance computing."
The University of Georgia's Research Computing Center (RCC) in Athens, Ga., didn't go looking for clustered storage. It simply conveyed to vendors the need for a highly available, scalable, general-purpose system with the headroom to handle unpredictable growth that varies based on grant awards, and the flexibility to meet a wide range of storage requirements.
The center serves researchers across a variety of academic disciplines, from physics and astronomy to genetics, and some users have greater computational needs than others. Systems range from Linux and IBM AIX high-performance computing clusters to Windows servers, so the new storage system had to support both NFS and CIFS.
Jerry NeSmith, director of the Office of Research Services at The University of Georgia, which helps to administer the RCC, says the university considered SAN block-level and NAS file-based options and ultimately decided on the latter, buying NetApp's FAS3070 GX clustered system. Total cost of ownership was the tipping point.
"We're managing a single namespace instead of multiple servers, and we can do it with a minimal staff effort," says NeSmith. "That's what the cluster architecture does for us. We can grow our capacity and throughput without growing our staff."
He says he likes how the RCC can put a low-demand database on one NAS head and spread a data set requiring high throughput across multiple heads. He also likes the system's ability to grow without the need for time-consuming data conversions or migrations.
"We can add heads to increase our throughput, and those heads have access to all the data, not just to the data that's attached directly to them," says NeSmith. "Therefore, we can increase our data throughput not only by adding disks but by adding heads."
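The arrangement NeSmith describes, where every head can reach all of the data, can be modeled with a small sketch. Everything here is an assumption made for illustration: the `SharedNamespace` class, the head names and the round-robin dispatch are hypothetical and are not taken from NetApp's design.

```python
class SharedNamespace:
    """A toy single namespace served by several interchangeable heads."""

    def __init__(self, heads: list[str]) -> None:
        self.files: dict[str, bytes] = {}  # one namespace shared by all heads
        self.heads = list(heads)
        self._requests = 0

    def add_head(self, name: str) -> None:
        # A new head needs no data migration: it sees the same namespace.
        self.heads.append(name)

    def write(self, path: str, data: bytes) -> None:
        self.files[path] = data

    def read(self, path: str) -> tuple[str, bytes]:
        # Round-robin dispatch (an assumed policy): any head can serve any file.
        head = self.heads[self._requests % len(self.heads)]
        self._requests += 1
        return head, self.files[path]


ns = SharedNamespace(["head-a", "head-b"])
ns.write("/genomics/run42.dat", b"sequence data")

served_by = [ns.read("/genomics/run42.dat")[0] for _ in range(4)]
print(served_by)

# Add a head to spread the same data set across more front ends.
ns.add_head("head-c")
served_after = {ns.read("/genomics/run42.dat")[0] for _ in range(3)}
print(sorted(served_after))
```

In this model, adding a head increases the number of front ends that can answer requests for the same files, which is the sense in which throughput grows by adding heads as well as disks.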
Support for Fibre Channel block access
Los Angeles-based Shopzilla Inc., which hosts an online comparison shopping service for consumers, was looking for next-generation block-level storage with virtualized provisioning and performance. It discovered XIV Ltd.'s clustered product several months before IBM Corp. acquired the Tel Aviv, Israel-based company in January 2008.
Burzin Engineer, vice president of infrastructure technology, wanted a system that could support both Fibre Channel (FC) block access and a NAS gateway for file serving. He says that significantly narrowed his choices.
But Engineer is content with the three XIV systems now running in Los Angeles and Seattle, and especially with the absence of fees for maintenance and new features. "Over time, the operational costs of the purchase are as much as 30% to 50%, so not having any of that is a huge deal," he says. "I actually didn't believe it. I said, 'Are you serious?'"
XIV's use of off-the-shelf components is also appealing, since Engineer figures that could help to speed delivery of product upgrades and improvements. He was also impressed by XIV's ability to gain such high performance from SATA drives and internal TCP/IP, but he says the "secret sauce is in the software" of the clustered system.
"When you read, it can read from multiple locations, so performance is phenomenal," he says. He recommends XIV to anyone with a requirement for random small file reads.