The storage and server cluster installed at The University of Texas at Austin is a lesson in how to do HPC. Storage requirements for HPC go beyond massive capacity, and include the use of high-performance file systems.
IMAGINE IF A SMALL law-enforcement office computer could recognize images of criminals caught on video surveillance systems or red-light traffic cameras from across the U.S., or if soldiers could locate buried bombs by scanning ground topography from a distance. Those are some of the practical uses that Rob Farber, senior research scientist in the molecular science computing facility at the Pacific Northwest National Laboratory (PNL) in Richland, WA, sees in the research he's doing.
Farber is looking at ways to identify specific faces and images from a massive amount of unstructured data and images. To do that, he has used a high-performance computing (HPC) and storage environment located at the Texas Advanced Computing Center (TACC) at The University of Texas (UT) at Austin. Because of the amount and different kinds of data generated by HPC, storage must be carefully tailored to the unique requirements of an HPC infrastructure.
"We're trying to answer the question 'Have I seen this face before?'" says Farber, who scheduled time on TACC's Ranger cluster to perform his analysis using the Sun Grid Engine, an application that schedules and distributes workload on the cluster. Farber says he uses the cluster because "Ranger is much, much larger than the HPC cluster being installed at PNL and shows marvelous scaling."
Ranger is significantly faster than the other HPC clusters installed at universities, research facilities and private industry; it's actually the fourth largest HPC cluster on the Top 500 Supercomputer Sites and the largest HPC cluster in an academic environment.
Farber's work is not unlike that being performed by commercial organizations on HPC clusters: weather forecasting, earthquake simulation, nano-molecules, astrophysics or computational biology. Earl Joseph, program VP, high-performance computing at Framingham, MA-based IDC, estimates that 25% of HPC implementations are in private industry rather than universities and national laboratories.
UT Austin makes Ranger available to private industry. While 90% of Ranger's time is allocated to the National Science Foundation's TeraGrid--an organization of 11 universities and national laboratories--its remaining time is given to UT researchers as well as commercial organizations, which receive 5% of Ranger's time through its industry affiliates program.
One of the commercial organizations using Ranger is SiCortex Inc., a supercomputer startup in Maynard, MA. The company develops high-performance, low-power supercomputers and uses Ranger to get a sense of how applications perform.
"Since we are competing in the marketplace and looking at clusters--which we don't see as the wave of the future--we have to get a sense of how applications perform on x86-based platforms," says Avi Purkayastha, application engineer at SiCortex. "In order to do well in designing our systems, we have to understand our competition in the marketplace and that's where Ranger comes in."
Saudi Aramco in Dhahran uses Ranger to "simulate reservoir models that contain billions of cells [maps]," says Jorge Pita, petroleum engineering specialist. "That research requires systems with a very large number of processors and memory. The reason we want to do that type of simulation with billions of cells is to get more accuracy."
Saudi Aramco uses the information from the simulations to plan the production of oil fields in Saudi Arabia. "The more detail you have in the characterization of the reservoir, the more accurate your forecasts are going to be of the production of oil," says Pita. While Saudi Aramco has its own HPC clusters, they're not of a sufficient size to run simulations on billions of cells. Saudi Aramco's biggest reservoir simulation cluster has 4,096 cores and 4TB of memory. "That size is good enough for many things, but not for pushing the billion-cell-and-beyond envelope," says Pita. "On Ranger, we can do 1 billion or 6 billion cell simulations that we couldn't do on our local cluster."
A recent study conducted by the Columbus, OH-based Council on Competitiveness and IDC underpins the use of HPC in private industry. Respondents to the study were 29 member companies of the Edison Welding Institute (EWI), which performs R&D for small- to medium-sized companies in the aerospace, automotive, government, energy, chemical, heavy manufacturing, medical and electronics industries. Many of those surveyed said they have important problems they're unable to solve with their current computers. As barriers to HPC adoption, they cited a lack of strategic software, a shortage of adequate HPC talent and cost constraints.
According to Jie Wu, research manager for IDC's high-performance and technical computing team, the market for HPC servers grew 15% in 2007 to reach $11.6 billion. From 2002 to 2007, the HPC server market grew an aggregate 134%. IDC projects this market will reach $16 billion by 2011.
Just what is Ranger?
Ranger isn't the first HPC cluster the university has deployed. TACC has three smaller clusters: two Dell Inc. Linux clusters--one with 5,840 compute processors and 176TB of disk storage, and another with 1,736 compute cores and 68TB of disk space--and an IBM Corp. Power5 System with 96 processors and 7.2TB of disk space. From its experience running these clusters, the university learned how to build the Ranger cluster.
"With Ranger, we knew it was going to be a very large system, so we tried to adopt what we had already learned on previous visits or experiences with small clusters," says Tommy Minyard, TACC's associate director, advanced computing systems.
For instance, Minyard and his team had experience with Linux and the open-source Lustre file system. "Based on our experiences of performance in a Linux-based environment over native InfiniBand, we decided Lustre was the appropriate decision," explains Minyard. "We have a lot of people on staff that have Linux and Lustre experience."
The Ranger cluster consists of 82 Sun Microsystems Inc. Blade 6048 Modular System racks, each populated with 48 Sun Blade server modules for a total of 3,936 compute nodes. Each blade has four quad-core AMD Inc. Barcelona processors for a total of 15,744 processors and 62,976 cores. These servers contain Mellanox Technologies Inc. ConnectX IB InfiniBand host channel adapters that connect to 24-port InfiniBand leaf switches and then to two redundant Sun Datacenter Switch 3456 units, each a 3,456-port InfiniBand switch.
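The node, processor and core totals follow from straightforward multiplication. As a quick sanity check, here is a minimal Python sketch that uses only the figures quoted above; the variable names are illustrative and not taken from any TACC tooling.

```python
# Figures quoted in the article; variable names are illustrative only.
RACKS = 82
BLADES_PER_RACK = 48
PROCESSORS_PER_BLADE = 4   # four quad-core processors per blade
CORES_PER_PROCESSOR = 4    # quad-core

compute_nodes = RACKS * BLADES_PER_RACK
processors = compute_nodes * PROCESSORS_PER_BLADE
cores = processors * CORES_PER_PROCESSOR

print(f"compute nodes: {compute_nodes:,}")
print(f"processors:    {processors:,}")
print(f"cores:         {cores:,}")
```

The three results match the 3,936 nodes, 15,744 processors and 62,976 cores cited for the cluster.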
The InfiniBand switches also connect to 72 Sun Fire X4500 (Thumper) servers that serve as I/O servers and in aggregate contain as much as 1.73 petabytes (PB) of SATA disk. Six Sun Fire X4600 M2 servers acting as meta data servers also connect to the InfiniBand switches; they attach to a Sun StorageTek 6540 array with 9TB of storage and act as controllers for the Thumper object storage target arrays. All of this connectivity is managed and aggregated with the open-source Lustre file system and runs under CentOS, a Linux distribution derived from Red Hat Enterprise Linux.
The Ranger cluster delivers peak performance of more than 570 teraflops; the Thumper storage servers, each with 24TB of disk, offer aggregate bandwidth of 72Gb/sec. The largest Lustre file system on Ranger offers 1PB of storage.
Work on Ranger is scheduled through the Sun Grid Engine. Ranger also uses Rocks provisioning software to handle operating system and app deployments, the OpenFabrics stack that controls the InfiniBand interconnect and two message passing interface (MPI) implementations: MVAPICH and Open MPI (see "Ranger cluster diagram," below).
Until a few years ago, HPC relied on massive symmetric multiprocessing computers, but has moved to scale-out architectures with lots of x86 or commodity-based processors with DAS or NAS. The storage in these systems is aggregated under a small number of file systems such as Sun's open-source Lustre file system, Hewlett-Packard Co.'s StorageWorks Scalable File Share or Quantum Corp.'s StorNext.
Of the aggregated Thumper storage servers, six file servers with 144TB of disk space are allocated to user /home directories, 12 file servers with 288TB of disk space make up the work file system and 50 file servers with 1.2PB of disk space are reserved for scratch space. The remaining four Thumper servers are used as a "sandbox," says TACC's Minyard, to test file-system upgrades and new software versions.
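The capacity split can be checked the same way. The sketch below assumes the 24TB-per-Thumper figure and the 72-server total quoted earlier in the article; the dictionary keys and the helper name `capacity_tb` are hypothetical labels chosen for illustration.

```python
TB_PER_SERVER = 24    # each Thumper holds 24TB, per the article
TOTAL_SERVERS = 72    # Thumper servers attached to the InfiniBand fabric

# File system -> number of Thumper servers assigned to it (labels are illustrative).
allocation = {"home": 6, "work": 12, "scratch": 50}


def capacity_tb(servers: int) -> int:
    """Raw disk capacity in TB for a given number of Thumper servers."""
    return servers * TB_PER_SERVER


capacities = {name: capacity_tb(count) for name, count in allocation.items()}
sandbox_servers = TOTAL_SERVERS - sum(allocation.values())

for name, tb in capacities.items():
    print(f"{name:8s}{allocation[name]:3d} servers {tb:6d} TB")
print(f"sandbox {sandbox_servers:3d} servers")
print(f"total raw capacity: {capacity_tb(TOTAL_SERVERS)} TB")
```

The arithmetic reproduces the 144TB, 288TB and 1.2PB allocations, leaves four servers for the sandbox, and gives 1,728TB in total, which rounds to the 1.73PB aggregate figure.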
In addition, the Lustre file system provides striping capability, in which data is divided and spread across several disks to increase performance (see "HPC: A study in tiered storage," below).
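To make the idea of striping concrete, here is a minimal Python sketch of round-robin (RAID-0 style) placement, in which a file is cut into fixed-size chunks that are dealt out in turn to a set of storage targets. It is a simplified model for illustration, not Lustre's implementation; the function name `stripe_layout` and its parameters are assumptions.

```python
def stripe_layout(file_size: int, stripe_size: int, stripe_count: int) -> list[int]:
    """Return the number of bytes that land on each of `stripe_count` targets
    when a file is written round-robin in `stripe_size` chunks."""
    if stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("stripe_size and stripe_count must be positive")
    per_target = [0] * stripe_count
    offset = 0
    chunk_index = 0
    while offset < file_size:
        # The final chunk may be shorter than a full stripe.
        chunk = min(stripe_size, file_size - offset)
        per_target[chunk_index % stripe_count] += chunk
        offset += chunk
        chunk_index += 1
    return per_target


MB = 1024 * 1024
# A 10MB file written in 1MB stripes across four targets.
layout = stripe_layout(10 * MB, 1 * MB, 4)
print([n // MB for n in layout])  # [3, 3, 2, 2]
```

With a stripe count of one, the whole file sits on a single target; raising the count spreads the same bytes, and therefore the I/O, over more servers.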
"Lustre provides us good performance from a single node, so we can push 700MB/sec to 800MB/sec from a single node, which is better than you can get over Gigabit Ethernet," says Minyard. "Or on aggregate, on Ranger we can push 30GB/sec to the single largest file system. At peak performance we've hit over 40GB/sec," he adds.
"Users can control the level of striping according to the performance they need from the system or how their application works," says Minyard. "So, for example, if someone is running on 4,000 MPI tasks, they may want to collect all that data to a single system and write it out. In that case, they would want to stripe across a lot of servers because they will get really good performance from a single server as it's shooting data to a lot of different servers rather than to just one."
However, says Minyard, if an app is writing out files per task and every task is writing out 4,000 files, it's better to have a smaller stripe because then each server shares the computing load. A disadvantage of Lustre vs. other file systems, he says, is how it handles server or storage failures: Lustre doesn't dynamically adjust to the failure of one node or another.
"If a storage node fails, Lustre will pause," says Minyard, who has had occasions where one of the Thumper servers failed. But he says that when a node fails, Lustre performs well, even though manual intervention needs to take place to restore a node.
"The nice thing about Lustre is its recovery mechanism," says Minyard. "[If] all of a sudden the server reboots or has a kernel panic, the Lustre file system will pause. Anything that's trying to write is sitting there waiting and once the file server comes back up, Lustre goes out and replays any of the transactions that were outstanding. All the jobs that were waiting to write continue to write as if nothing happened. It's only a 15 or 20 minute interruption."
There are other failover mechanisms built into Ranger. Each meta data server has active failover. The Sun StorageTek 6540 arrays with FC disks are mirrored to each other, says Minyard, so "they never lose meta data information." In addition, the Sun Datacenter Switch 3456 switches are paired for redundancy.
As a result of the size of Ranger's environment, Minyard follows the advice of the experts and doesn't back up the entire system. "We back up only the user's /home directories, just because of the sheer capacity of the data," he explains. "So far, we have almost 800TB. Trying to back that up isn't feasible. A lot of the data is scratch data--checkpoint data, restart files--data that users write that they don't have to save."
When users need to back up their data, he adds, "they do it themselves, using command line utilities we supply. The data is saved to the Sun StorageTek SL8500 tape library."
In addition, "we don't back up the whole file systems, just some of the /home file systems for users," says Minyard. "We have some homegrown utilities that are using TAR and some incremental mechanisms. The Lustre file system right now doesn't have hierarchical storage management capability, but they're working on it. That's something we want to explore in the future."
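TACC's homegrown utilities aren't described in detail, so the following is only a rough sketch of what a tar-based incremental pass could look like: it archives files changed since a given timestamp, using Python's standard `tarfile` module. The function name `incremental_backup`, its parameters and the modification-time test are all assumptions made for illustration.

```python
import os
import tarfile


def incremental_backup(source_dir: str, archive_path: str, since: float) -> list[str]:
    """Archive regular files under `source_dir` modified after `since`
    (seconds since the epoch). Returns the archived paths, relative to
    `source_dir`. The archive should be written outside `source_dir`."""
    archived = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for root, _dirs, files in os.walk(source_dir):
            for name in sorted(files):
                full_path = os.path.join(root, name)
                # Skip anything that is not a regular file or has not changed.
                if os.path.isfile(full_path) and os.path.getmtime(full_path) > since:
                    rel_path = os.path.relpath(full_path, source_dir)
                    tar.add(full_path, arcname=rel_path)
                    archived.append(rel_path)
    return archived
```

A nightly job might call this with the timestamp of the previous run, so that only files touched since then are written to the new archive; a periodic full backup would still be needed as the baseline.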