Storage for high-performance computing

The storage and server cluster installed at The University of Texas at Austin is a lesson in how to do high-performance computing (HPC). Storage requirements for HPC go beyond massive capacity and include the use of high-performance file systems.

Imagine if a small law-enforcement office computer could recognize images of criminals caught on video surveillance systems or red-light traffic cameras from across the U.S., or if soldiers could locate buried bombs by scanning ground topography from a distance. Those are some of the practical uses that Rob Farber, senior research scientist in the molecular science computing facility at the Pacific Northwest National Laboratory (PNL) in Richland, WA, sees in the research he's doing.

Farber is looking at ways to identify specific faces and images from a massive amount of unstructured data and images. To do that, he has used a high-performance computing (HPC) and storage environment located at the Texas Advanced Computing Center (TACC) at The University of Texas (UT) at Austin. Because of the amount and different kinds of data generated by HPC, storage must be carefully tailored to the unique requirements of an HPC infrastructure.

"We're trying to answer the question 'Have I seen this face before?'" says Farber, who scheduled time on TACC's Ranger cluster to perform his analysis using the Sun Grid Engine, an application that schedules and distributes workload on the cluster. Farber says he uses the cluster because "Ranger is much, much larger than the HPC cluster being installed at PNL and shows marvelous scaling."

Ranger is significantly faster than most other HPC clusters installed at universities, research facilities and private industry; it ranks fourth on the Top500 list of supercomputer sites and is the fastest HPC cluster in an academic environment.

Farber's work is not unlike that being performed by commercial organizations on HPC clusters: weather forecasting, earthquake simulation, nanomaterials research, astrophysics and computational biology. Earl Joseph, program VP, high-performance computing at Framingham, MA-based IDC, estimates that 25% of HPC implementations are in private industry rather than universities and national laboratories.

UT Austin makes Ranger available to private industry. While 90% of Ranger's time is allocated to the National Science Foundation's TeraGrid--an organization of 11 universities and national laboratories--its remaining time is given to UT researchers as well as commercial organizations, which receive 5% of Ranger's time through its industry affiliates program.

One of the commercial organizations using Ranger is SiCortex Inc., a supercomputer startup in Maynard, MA. The company develops high-performance, low-power supercomputers and uses Ranger to get a sense of how applications perform.

"Since we are competing in the marketplace and looking at clusters--which we don't see as the wave of the future--we have to get a sense of how applications perform on x86-based platforms," says Avi Purkayastha, application engineer at SiCortex. "In order to do well in designing our systems, we have to understand our competition in the marketplace and that's where Ranger comes in."

Saudi Aramco in Dhahran uses Ranger to "simulate reservoir models that contain billions of cells [maps]," says Jorge Pita, petroleum engineering specialist. "That research requires systems with a very large number of processors and memory. The reason we want to do that type of simulation with billions of cells is to get more accuracy."

Saudi Aramco uses the information from the simulations to plan the production of oil fields in Saudi Arabia. "The more detail you have in the characterization of the reservoir, the more accurate your forecasts are going to be of the production of oil," says Pita. While Saudi Aramco has its own HPC clusters, they're not of a sufficient size to run simulations on billions of cells. Saudi Aramco's biggest reservoir simulation cluster has 4,096 cores and 4TB of memory. "That size is good enough for many things, but not for pushing the billion-cell-and-beyond envelope," says Pita. "On Ranger, we can do 1 billion or 6 billion cell simulations that we couldn't do on our local cluster."

A recent study conducted by the Columbus, OH-based Council on Competitiveness and IDC underscores the demand for HPC in private industry. Respondents to the study were 29 member companies of the Edison Welding Institute (EWI), which performs R&D for small- to medium-sized companies in the aerospace, automotive, government, energy, chemical, heavy manufacturing, medical and electronics industries. Many of those surveyed said they have important problems they're unable to solve with their current computers. They also cited a lack of strategic software, a shortage of HPC talent and cost constraints as barriers to HPC adoption.

According to Jie Wu, research manager for IDC's high-performance and technical computing team, the market for HPC servers grew 15% in 2007 to reach $11.6 billion. From 2002 to 2007, the HPC server market grew an aggregate 134%. IDC projects this market will reach $16 billion by 2011.

Just what is Ranger?
UT Austin installed the Ranger cluster in late 2007 as the result of a $59 million grant from the National Science Foundation. Of that figure, $39 million was used to buy the hardware and software that makes up the cluster, while $20 million was allocated for four years of managing and operating the cluster.

But Ranger isn't the first HPC cluster the university has deployed. TACC has three smaller clusters: two Dell Inc. Linux clusters--one with 5,840 compute processors and 176TB of disk storage, and another with 1,736 compute cores and 68TB of disk space--and an IBM Corp. Power5 System with 96 processors and 7.2TB of disk space. From its experience running these clusters, the university learned how to build the Ranger cluster.

"With Ranger, we knew it was going to be a very large system, so we tried to adopt what we had already learned on previous visits or experiences with small clusters," says Tommy Minyard, TACC's associate director, advanced computing systems.

For instance, Minyard and his team had experience with Linux and the open-source Lustre file system. "Based on our experiences of performance in a Linux-based environment over native InfiniBand, we decided Lustre was the appropriate decision," explains Minyard. "We have a lot of people on staff that have Linux and Lustre experience."

The Ranger cluster consists of 82 Sun Microsystems Inc. Blade 6048 Modular System racks, each populated with 48 Sun Blade server modules for a total of 3,936 compute nodes. Each blade has four quad-core AMD Inc. Barcelona processors for a total of 15,744 processors with 62,976 cores. These servers contain Mellanox Technologies Inc. ConnectX InfiniBand host channel adapters that connect to 24-port InfiniBand leaf switches and then to two redundant Sun Datacenter Switch 3456 units, each a 3,456-port InfiniBand switch.

The InfiniBand switches also connect to 72 Sun Fire X4500 (Thumper) servers that act as I/O servers and in aggregate contain as much as 1.73 petabytes (PB) of SATA disk. Six metadata servers (Sun Fire X4600 M2 servers) also connect to the InfiniBand switches; the metadata servers connect to a Sun StorageTek 6540 array with 9TB of storage and act as controllers for the Thumper object storage target arrays. All of this connectivity is managed and aggregated with the open-source Lustre file system and runs under CentOS, a Red Hat Enterprise Linux-based distribution.
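The component counts above hang together arithmetically; a quick sanity check, using only the figures reported in the text:

```python
# Ranger component counts as reported in the article (not an authoritative spec).
RACKS = 82
BLADES_PER_RACK = 48
PROCS_PER_BLADE = 4        # quad-core AMD Barcelona sockets per blade
CORES_PER_PROC = 4
IO_SERVERS = 72            # Sun Fire X4500 "Thumper" I/O servers
TB_PER_IO_SERVER = 24      # raw SATA capacity per Thumper

nodes = RACKS * BLADES_PER_RACK          # 3,936 compute nodes
procs = nodes * PROCS_PER_BLADE          # 15,744 processors
cores = procs * CORES_PER_PROC           # 62,976 cores
raw_tb = IO_SERVERS * TB_PER_IO_SERVER   # 1,728TB, i.e. roughly 1.73PB

print(nodes, procs, cores, raw_tb)
```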

The Ranger cluster has a peak performance of more than 570 teraflops; the Thumper storage servers, each with 24TB of disk, offer aggregate bandwidth of 72GB/sec. The largest Lustre file system on Ranger offers 1PB of storage.

Work on Ranger is scheduled through the Sun Grid Engine. Ranger also uses Rocks provisioning software to handle operating system and app deployments, the OpenFabrics stack that controls the InfiniBand interconnect and two message passing interface (MPI) implementations: MVAPICH and Open MPI (see "Ranger cluster diagram," below).

Ranger cluster diagram
The configuration of the Ranger cluster relies heavily on storage--72 Sun Fire X4500 (Thumper) servers connect to the InfiniBand fabric and then to the 3,936 compute nodes in the cluster. Six metadata nodes running under the Lustre file system manage and organize the data stored on the Thumper servers. The only Fibre Channel in the system is the 32-port switch that connects the metadata nodes with their shared storage.

Ranger's storage
Ranger's storage differs from conventional SANs in the way it's provisioned, managed and backed up. The type of storage used in commercial organizations isn't appropriate for HPC in clusters the size of Ranger. Commercial systems and their SANs are oriented toward processing transactions, while HPC systems are built to maximize bandwidth and quickly process large quantities of unstructured data and files.

Until a few years ago, HPC relied on massive symmetric multiprocessing computers; it has since moved to scale-out architectures with many x86 or commodity-based processors using DAS or NAS. The storage in these systems is aggregated under a small number of file systems such as Sun's open-source Lustre file system, Hewlett-Packard Co.'s StorageWorks Scalable File Share or Quantum Corp.'s StorNext.

Of the aggregated Thumper storage servers, six file servers with 144TB of disk space are allocated to user /home directories, 12 file servers with 288TB of disk space make up the work file system and 50 file servers with 1.2PB of disk space are reserved for scratch space. The remaining four Thumper servers are used as a "sandbox," says TACC's Minyard, to test file-system upgrades and new software versions.

In addition, the Lustre file system provides striping capability, in which data is divided and spread across several disks to increase performance (see "HPC: A study in tiered storage," below).
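As a rough conceptual illustration (not Lustre's actual implementation), striping amounts to cutting a file into fixed-size stripe units and dealing them round-robin across object storage targets (OSTs), so that reads and writes can proceed against many disks in parallel. The function names below are invented for this sketch:

```python
# Toy model of striping: fixed-size units dealt round-robin across OSTs.
# Concept only -- real Lustre striping involves layouts, RPCs and locking.
def stripe(data: bytes, stripe_size: int, ost_count: int) -> list:
    """Return a per-OST list of stripe units for `data`."""
    osts = [[] for _ in range(ost_count)]
    for i in range(0, len(data), stripe_size):
        # Unit number i // stripe_size lands on OST (unit number mod ost_count).
        osts[(i // stripe_size) % ost_count].append(data[i:i + stripe_size])
    return osts

def reassemble(osts: list) -> bytes:
    """Interleave the per-OST units back into the original byte stream."""
    out = []
    for round_idx in range(max(len(units) for units in osts)):
        for units in osts:
            if round_idx < len(units):
                out.append(units[round_idx])
    return b"".join(out)
```

With a large stripe count, a single big file is spread over many servers; with a stripe count of one, each file lives on a single OST, which suits workloads writing many independent files.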

HPC: A study in tiered storage
Bruce Allen, director of the Max Planck Institute for Gravitational Physics in Hannover, Germany, has implemented a three-tiered storage model to support his high-performance computing (HPC) environment. Each tier is connected to a 10Gb/sec Ethernet network.

"The most reliable level consists of Sun [Microsystems Inc.] Thumpers [servers]," says Allen. "There's less reliable storage made up of Linux SuperMicro boxes that have [Areca] 16-disk Serial ATA RAID controllers. The least reliable storage is the [internal] storage on the compute nodes themselves."

In Allen's network, 1,342 compute nodes operate in concert with the storage. The network is organized and managed by Sun's ZFS file system, and each type of storage in the network is provisioned according to its reliability.

Allen's tier one storage consists of 12 Sun Thumpers with 19TB of usable capacity. "What we typically store on the Thumpers is the users' /home directories, which we regard as the most valuable data," he notes. "We use the snapshot feature of ZFS to back up very fast."

Allen chose the Thumpers over other storage arrays based on the features promised with Sun's ZFS file system. "We liked the fact that with ZFS you can do snapshots very efficiently, use variable-sized striping and it incorporates block-level checksums in all the file-system data structures for guaranteed consistency," he explains. "And we liked the way the file system and the OS deal with bad blocks on the disk."

The next tier of storage for Allen is the SuperMicro storage servers. While Allen transfers some of the backup data from the Thumper boxes to the Linux boxes, "we typically use the Linux boxes for storing more experimental data; in most cases, we can get that data from tape archives located at CalTech. In some sense, that data is more expendable and less valuable than the /home directory data."

Finally, Allen has another 650TB of storage distributed across the compute nodes. "We typically mirror experimental data that's being accessed a lot across the compute nodes," he says. "Right now, we have a 40GB data set that's being accessed quite a lot; we have a copy of that data set on every single cluster node and programs access it locally. That gives us huge bandwidth because every node in parallel is reading off of the local disk."

However, Allen's happiness with the Sun storage system and ZFS is dampened by performance problems. "One of the things we haven't been so pleased with on the Sun storage side is that the most I/O we've been able to get out of the Thumpers is a couple of hundred megabytes per second," he says. "That's surprising because the local file system seems to be capable of 500MB/sec to 600MB/sec. We typically export that data to the cluster nodes by NFS. So far, we haven't even gotten close to saturating our wire. With [netperf], we can get about 700MB/sec reading and writing, not necessarily to the storage device."

Allen says a recent patch from Sun is expected to dramatically improve NFS performance.

"Lustre provides us good performance from a single node, so we can push 700MB/sec to 800MB/sec from a single node, which is better than you can get over Gigabit Ethernet," says Minyard. "Or on aggregate, on Ranger we can push 30GB/sec to the single largest file system. At peak performance we've hit over 40GB/sec," he adds.

"Users can control the level of striping according to the performance they need from the system or how their application works," says Minyard. "So, for example, if someone is running on 4,000 MPI tasks, they may want to collect all that data to a single system and write it out. In that case, they would want to stripe across a lot of servers because they will get really good performance from a single server as it's shooting data to a lot of different servers rather than to just one."

However, says Minyard, if an app writes one file per task and 4,000 tasks are each writing their own files, it's better to use a smaller stripe count, because the I/O load is then naturally spread across the servers. A disadvantage of Lustre vs. other file systems, he says, is how it handles server or storage failures: Lustre doesn't dynamically adjust to the failure of a node.

"If a storage node fails, Lustre will pause," says Minyard, who has had occasions where one of the Thumper servers failed. But he says that when a node fails, Lustre recovers gracefully, even though manual intervention is needed to restore the node.

"The nice thing about Lustre is its recovery mechanism," says Minyard. "[If] all of a sudden the server reboots or has a kernel panic, the Lustre file system will pause. Anything that's trying to write is sitting there waiting and once the file server comes back up, Lustre goes out and replays any of the transactions that were outstanding. All the jobs that were waiting to write continue to write as if nothing happened. It's only a 15 or 20 minute interruption."
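The pause-and-replay behavior Minyard describes can be sketched as a toy model; the class below (names invented for illustration) only captures the idea of queuing outstanding writes while a server is down and replaying them in order on recovery:

```python
# Conceptual sketch of pause-and-replay recovery. In real Lustre, clients
# block and the server replays outstanding transactions from its log; here
# we simply queue writes issued during the outage and apply them on recovery.
class ReplayServer:
    def __init__(self):
        self.up = True
        self.committed = []   # writes durably applied
        self.pending = []     # writes outstanding while the server was down

    def write(self, record):
        if self.up:
            self.committed.append(record)
        else:
            # In real life the client job sits waiting here, "as if nothing
            # happened" once the server returns.
            self.pending.append(record)

    def crash(self):
        self.up = False

    def recover(self):
        self.up = True
        # Replay outstanding transactions in the order they were issued.
        self.committed.extend(self.pending)
        self.pending.clear()
```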

There are other failover mechanisms built into Ranger. Each metadata server has active failover. The Sun StorageTek 6540 arrays with FC disks are mirrored to each other, says Minyard, so "they never lose metadata information." In addition, the Sun Datacenter Switch 3456 switches are paired for redundancy.

Backing up an HPC system
In typical HPC environments, Linux command-line utilities are used to copy the data users need to save, such as their /home directories, from one file system to another. The rest of the data is typically scratch data and output files from calculations. Scratch data doesn't need to be backed up because it would take longer to recover than to regenerate. Users will also save the output files, notes Richard Walsh, research director, high-performance and technical computing group at IDC.

Given the size of Ranger's environment, Minyard follows the experts' advice and doesn't back up the entire system. "We back up only the users' /home directories, just because of the sheer capacity of the data," he explains. "So far, we have almost 800TB. Trying to back that up isn't feasible. A lot of the data is scratch data--checkpoint data, restart files--data that users write that they don't have to save."

When users need to back up their data, he adds, "they do it themselves, using command line utilities we supply. The data is saved to the Sun StorageTek SL8500 tape library."

In addition, "we don't back up the whole file systems, just some of the /home file systems for users," says Minyard. "We have some homegrown utilities that use tar and some incremental mechanisms. The Lustre file system right now doesn't have hierarchical storage management capability, but they're working on it. That's something we want to explore in the future."
