The Los Alamos National Laboratory
However, the laboratory can't use NFS for weapons research because it doesn't scale high enough, said Gary Grider, deputy division leader for high-performance computing. So the lab uses Panasas' proprietary DirectFlow parallel client technology, as well as IBM's General Parallel File System (GPFS) and Oracle Corp.'s Lustre parallel file system.
Grider said he is interested to see if parallel NFS (pNFS) can give him the scalability NFS lacks, but he doesn't expect it to handle all of the lab's parallel processing needs right off the bat. He expects the lab will likely dabble with pNFS on just "a few thousand nodes" when it becomes available. "My guess is that it's got some growing pains to get up to a hundred thousand or a million processors," he said. "Nothing scales that well right out of the box."
For Grider, a few thousand nodes is small scale. Supercomputers at the Los Alamos National Laboratory have hundreds of thousands of processors and tens of thousands of nodes. According to Grider, the computer buildings occupy approximately 100,000 square feet of floor space, and the machines draw about 30 megawatts of power, at a cost of approximately $30 million a year.
With a mean time to failure (MTTF) of between eight and 24 hours for the supercomputers' processors, there are multiple failures every day, which can be a big problem when running an 18-month job.
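To put those figures in perspective, here is a rough back-of-the-envelope sketch of how many interruptions an 18-month job might see. It simply divides the job length by the MTTF; the 30-day month and the function name `expected_failures` are assumptions made for illustration, not figures from the lab.

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumption: 30-day months, good enough for a rough estimate


def expected_failures(job_months: float, mttf_hours: float) -> float:
    """Rough interruption count: total job hours divided by mean time to failure."""
    job_hours = job_months * DAYS_PER_MONTH * HOURS_PER_DAY
    return job_hours / mttf_hours


if __name__ == "__main__":
    # The article quotes an MTTF of between 8 and 24 hours and an 18-month job.
    for mttf in (8, 24):
        count = expected_failures(18, mttf)
        print(f"MTTF {mttf:>2} h: roughly {count:.0f} failures over 18 months")
```

Under these assumptions the job would be interrupted somewhere between several hundred and well over a thousand times, which is why it cannot simply be rerun from the beginning after each failure.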
The lab uses Panasas ActiveStor Series 7 and 8 systems, as well as the HPC series with DirectFlow, to run its checkpoint restart (restore point) application. The Panasas technology dumps the checkpoint data into back-end storage through multiple concurrent data streams, writing 40 TB to 100 TB to disk in 10 minutes. The lab also uses GPFS and Lustre for similar tasks.
Los Alamos National Laboratory has used Panasas parallel NAS since it brought online a 100-teraflop computer for weapons research in 2004. A one-teraflop computer can perform 1,000,000,000,000 (one trillion) floating-point operations per second.
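Those volumes translate into a sustained aggregate write rate that is easy to work out. The sketch below does the conversion, assuming decimal terabytes (1 TB = 10^12 bytes); the helper name `throughput_gb_per_s` is hypothetical.

```python
TB = 10**12  # assumption: decimal terabytes
GB = 10**9   # decimal gigabytes


def throughput_gb_per_s(terabytes: float, minutes: float) -> float:
    """Average write rate in GB/s needed to move `terabytes` in `minutes`."""
    seconds = minutes * 60
    return terabytes * TB / seconds / GB


if __name__ == "__main__":
    # The article's range: 40 TB to 100 TB written in 10 minutes.
    for size_tb in (40, 100):
        rate = throughput_gb_per_s(size_tb, 10)
        print(f"{size_tb} TB in 10 min: about {rate:.0f} GB/s aggregate")
```

That works out to very roughly 67 GB/s to 167 GB/s across all streams combined, which gives a sense of why a single-server file protocol is not enough for this workload.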
"We couldn't have done what we've done without the [Panasas] DirectFlow, Lustre or GPFS equivalent," Grider said. "NFS wouldn't do what we wanted it to do, so we had to go to something like DirectFlow. It's our workhorse."
Even with all the horsepower the lab has now, Grider's team must take extra precautions to keep its supercomputer online.
"The supercomputer chugs along for eight hours or so, until something fails and the job fails," Grider said. "So we do something called checkpoint restart every so often, hopefully before the machine fails. At that point, we checkpoint where the application and data was, and put it on disk."
When a job is restarted, Grider's team refers to the last checkpoint, reads it from the disk, puts it into memory and then takes off again. Over 18 months, Grider said, the system will generate an answer to a "deep, dark question" about what is happening inside a nuclear weapon.
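The checkpoint restart cycle Grider describes can be illustrated with a toy example. This is only a minimal single-process sketch, not the lab's application: the job, the function names (`save_checkpoint`, `load_checkpoint`, `run_job`) and the `fail_at` crash simulation are all hypothetical, and a real supercomputer job would write many terabytes in parallel rather than pickle a small dictionary.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash mid-write keeps the last good checkpoint."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the disk
    os.replace(tmp_path, path)  # atomic swap to the new checkpoint


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(path, total_steps, checkpoint_every, fail_at=None):
    """Toy job that sums 0..total_steps-1, resuming from the last checkpoint if one exists.

    `fail_at` simulates a hardware failure by raising before that step runs.
    """
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]  # the "compute" part
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)  # the "I/O" part
    return state["total"]


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
    try:
        run_job(ckpt, total_steps=100, checkpoint_every=10, fail_at=25)
    except RuntimeError as err:
        print("job failed:", err)
    print("restarting from step", load_checkpoint(ckpt)["step"])
    print("final result:", run_job(ckpt, total_steps=100, checkpoint_every=10))
```

In this sketch the work done between the last checkpoint and the failure is lost and recomputed after the restart, which is the trade-off that makes checkpoint frequency and checkpoint speed matter.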
The problem with creating multiple checkpoints is the speed with which the checkpoint application can dump the multi-terabyte data files into back-end storage. "We buy these supercomputers to compute," Grider said. "We don't buy them to do I/O. But we need them to do I/O because we need to do these checkpoints. We want to compute for a long time and not take very long to checkpoint. The faster we can checkpoint, the more compute time we get."
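Grider's point about checkpoint speed can be made concrete with a simple model. The sketch below assumes the job alternates between a fixed compute interval and a checkpoint during which no computing happens; the one-hour interval is a made-up figure for illustration, and the model ignores time lost to failures and restarts.

```python
def compute_fraction(compute_minutes: float, checkpoint_minutes: float) -> float:
    """Share of wall-clock time spent computing when each compute interval
    is followed by a blocking checkpoint (simplified model, no failures)."""
    return compute_minutes / (compute_minutes + checkpoint_minutes)


if __name__ == "__main__":
    interval = 60  # assumption: checkpoint once per hour of compute
    for ckpt_minutes in (10, 5, 1):
        share = compute_fraction(interval, ckpt_minutes)
        print(f"{ckpt_minutes:>2}-minute checkpoint: {share:.1%} of time spent computing")
```

Under this model, shortening the checkpoint from 10 minutes to 5 raises the compute share from about 86% to about 92%, which over an 18-month run adds up to weeks of extra compute time.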
This was first published in June 2010