With “big data” storage performance requirements that dwarf the needs of most IT shops, the National Center for Supercomputing Applications (NCSA) knew that scale-out file storage was
The NCSA team also realized its data storage performance would come down to its parallel file system, which required tuning and high-performance hardware to reach required levels.
Boeing, Caterpillar, GE, John Deere, Procter & Gamble, Rolls-Royce and others offload complex simulations to NCSA systems located at the University of Illinois at Urbana-Champaign. The private sector partners pay on a cost-recovery basis to work with NCSA.
“They want an environment where they can compute at speeds and levels of scale that are better than anything they can get internally,” said Evan Burness, project manager of the private sector program and economic development at NCSA.
On July 1, NCSA launched the second iteration of iForge. The upgraded compute side features 128 Dell M620 blade servers and a Dell PowerEdge C6145 with 2,048 cores of Intel Xeon E5-2670 processors and 64 cores of AMD Opteron 6282 SE processors. QDR InfiniBand (40 Gbps) from Mellanox Technologies provides the network connectivity to a pair of DataDirect Networks (DDN) SFA10K-X arrays, with a total of 700 TB of 7,200 rpm SATA disks.
Among supercomputers, iForge is modest in scale. For instance, NCSA’s Blue Waters project includes 380,000 cores, 25 PB of spinning disks, throughput of 1.2 TB per second, plus more than 380 PB of tape. NCSA’s retired Abe cluster had 10,000 cores. But the iForge project stands apart with one important requirement that many science-centric supercomputers don’t have: 99% uptime.
“In most supercomputing environments, reliability in the 92% to 95% range is fine. If you have an hour of downtime, it’s no big deal,” Burness said. “When you’re trying to figure out the origins of the universe, it doesn’t really make a difference whether you figure it out at 2 p.m. or 4 p.m. But if you’re Boeing on a deadline to produce an aircraft, and you have to have it done by the end of the month, time really, really matters.”
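To put those percentages in perspective, the short sketch below converts an uptime figure into the hours of downtime it allows over a year. The function name and the 8,760-hour year are illustrative assumptions, not anything NCSA described.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (assumption)


def downtime_hours(availability_pct: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime permitted in a period at a given uptime percentage."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return period_hours * (100.0 - availability_pct) / 100.0


# Compare the range Burness cites for typical supercomputers with iForge's 99% target.
for pct in (92.0, 95.0, 99.0):
    print(f"{pct:.0f}% uptime -> {downtime_hours(pct):.1f} hours down per year")
```

By this arithmetic, a 92% system can be down roughly 700 hours a year, while a 99% system has under 90 hours to spare.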
To that end, the first critical storage decision for NCSA was the file system. NCSA engineers had extensive experience working with both of the prominent high-performance options: the General Parallel File System (GPFS) from IBM and the open-source Lustre. Such massively scalable, parallel file systems are designed to provide high-speed data access concurrently to applications that run on multiple nodes of server clusters.
“Files themselves are spread across nodes in the cluster, and a single client can get aggregate throughput from multiple nodes in parallel” through the use of special client protocols, said Mike Matchett, a senior analyst at Hopkinton, Mass.-based Taneja Group, via an email. “In this way, many, many disks can participate in serving each I/O request.”
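The idea Matchett describes can be sketched in a few lines of Python. This is a simplified round-robin model written for illustration only; it is not how GPFS or Lustre actually place data, and the function names, block size and server count are all assumptions.

```python
def stripe_layout(file_size: int, block_size: int, num_servers: int) -> dict[int, list[int]]:
    """Map each storage server to the block indexes it would hold under round-robin striping."""
    if file_size < 0 or block_size <= 0 or num_servers <= 0:
        raise ValueError("file_size must be >= 0; block_size and num_servers must be > 0")
    num_blocks = -(-file_size // block_size)  # ceiling division
    layout: dict[int, list[int]] = {server: [] for server in range(num_servers)}
    for block in range(num_blocks):
        layout[block % num_servers].append(block)
    return layout


def ideal_aggregate_throughput(per_server_rate: float, num_servers: int) -> float:
    """Upper bound on client throughput when every server is read in parallel."""
    return per_server_rate * num_servers


# A hypothetical 10 MiB file in 1 MiB blocks spread over four servers.
layout = stripe_layout(file_size=10 * 1024**2, block_size=1024**2, num_servers=4)
for server, blocks in layout.items():
    print(f"server {server}: blocks {blocks}")
```

Because consecutive blocks land on different servers, one large read touches every server at once, which is why the aggregate rate can approach the sum of the individual rates.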
GPFS gets the nod based on 'reliability'
NCSA selected Lustre for the Blue Waters project and the Abe system. But for iForge, engineers chose GPFS largely for reliability reasons.
“Lustre is not very fault tolerant. If there’s any sort of blip in connectivity or the hardware, you’re looking at an outage,” said Alex Parga, a senior engineer at NCSA. “GPFS can handle hardware problems a lot better than Lustre can. When it comes to reliability, you can barely even compare them.
“When it comes to metadata,” he added, “GPFS has a huge advantage. Lustre’s metadata is currently a single point of failure. It has to be on a single, specific storage location; in GPFS, it can be distributed.”
Parga, who said he fought “tooth and nail” against Lustre for Blue Waters, maintains that the General Parallel File System also has a substantial edge in management capabilities. “GPFS is practically turnkey compared to Lustre. In order to implement Lustre well, there’s a lot more knowledge you need. It’s a lot more labor intensive,” he said.
The distinction between the General Parallel File System and Lustre “tends to be a little bit religious,” observed David Floyer, chief technology officer and co-founder of Wikibon, a community-focused research and analyst firm based in Marlborough, Mass.
“Good people can get good answers with both,” Floyer said, citing DDN’s support of both GPFS and Lustre. “Whenever you’re dealing with high-performance computing, and you are really at the end of performance, you have to know what you’re doing.”
Customers generally buy a DDN file storage system that includes either the General Parallel File System (with DDN’s GridScaler product) or Lustre (with ExaScaler), the servers that run the file system and a dual-controller disk array. Many also pay for consulting services to tune the system.
But with plenty of in-house expertise, NCSA chose to self-tune GPFS, which it licenses from IBM. NCSA runs the file system on Dell servers connected via InfiniBand to the compute nodes and the DDN storage controllers.
The compute nodes had local disks in the first iteration of iForge, as they do in many HPC environments. But they’re diskless in the second iteration, with the storage concentrated in the external GPFS/DDN system for performance, fault tolerance and availability.
NCSA’s Parga said he didn’t consider DDN’s GridScaler mature enough for production use last year when NCSA was putting together iForge, so he tuned the file system by fixing configuration issues and adjusting parameters such as page pool size and threads per disk.
“We kept on changing the parameters, being more and more aggressive with the load on the disk until we got right to the point where the disks were being maxed out,” Parga said. “With a parallel file system like this, you want to move your performance bottleneck to be the hardware. You’re trying to get the software out of the mix. You change your parameters so that you're basically taxing the hardware as much as it can handle.”
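The iterative approach Parga describes amounts to raising a setting until measured throughput stops improving. The sketch below shows that loop in Python under stated assumptions: the `measure` callback, the 2% gain threshold and the toy benchmark are all hypothetical stand-ins, and nothing here calls a real GPFS command or reflects NCSA's actual test procedure.

```python
from typing import Callable, Iterable


def tune_until_saturated(
    candidates: Iterable[int],
    measure: Callable[[int], float],
    min_gain: float = 0.02,
) -> tuple[int, float]:
    """Try settings in order and return the last one that still improved throughput."""
    best_setting, best_rate = None, 0.0
    for setting in candidates:
        rate = measure(setting)  # in practice: apply the setting, then run a benchmark
        if best_setting is not None and rate < best_rate * (1.0 + min_gain):
            break  # gain fell below the threshold, so the hardware is likely the limit
        best_setting, best_rate = setting, rate
    if best_setting is None:
        raise ValueError("no candidate settings supplied")
    return best_setting, best_rate


def toy_benchmark(threads: int) -> float:
    """Synthetic model, not real data: throughput rises with threads, then hits a hardware cap."""
    return min(0.8 * threads, 6.4)


setting, rate = tune_until_saturated([1, 2, 4, 8, 16, 32], toy_benchmark)
print(f"stopped at setting {setting} with {rate:.1f} GBps (synthetic)")
```

In a real tuning session the callback would change one parameter at a time and rerun the same workload, so that any plateau can be attributed to the disks and not to the software.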
Parga aimed for a performance target of approximately 4 GBps. He said he stopped tuning when iterative testing showed throughput at about 6 GBps. “There probably was more I could have done, but I was pretty happy considering the goal,” he said.
Jeff Denworth, DDN’s vice president of marketing, claimed the maximum possible performance for the SFA10K-X is 12.75 GBps and customers “can on average easily see 10 GBps if their I/O is sequential and well-formed in nature.”
However, Parga noted that NCSA’s DDN SFA10K-X arrays are populated only about 50%, and he said the lack of spinning disk is a major factor limiting the performance of the iForge file system to its current 6.4 GBps. Future iForge iterations may double the performance, he said.
“I just know [DDN’s] hardware is fast,” Parga said. “When we benchmark it, it screams.”
Prior to iForge’s initial launch last September, NCSA did benchmark tests with local hard drives inside compute nodes, with SSDs inside servers and with the large shared file system running GPFS. The GPFS/DDN system produced as much as a 30% to 40% boost in application performance, even using 2 TB SATA disks, according to NCSA’s Burness. He said the commodity components used in NCSA’s supercomputers often catch people off guard.
“The reality though is that we craft high performance from making systems operate at extreme scale,” Burness said. “So, the importance and speed of one component is demoted rather significantly.”
But NCSA is exploring the possibility of adding ultra-fast solid-state storage in a future iteration of iForge now that prices are dropping. So far, the chief industrial use case has been sequential streaming of huge files into main memory. If partners expand their use of iForge to database-driven applications, solid-state drives (SSDs) could make a difference, Burness said.
In addition, NCSA’s Parga said, “You could put the metadata on solid-state because SSDs have a huge advantage when it comes to a lot of small I/O.”
In the meantime, NCSA’s industry partners and the National Science Foundation (NSF) are seeing significant performance benefits using the current system for research, exploration and product optimization work. (NSF paid for a “Forge” supercomputer that uses the same GPFS/DDN system and InfiniBand network fabric that iForge does, although the two are logically different machines.)
One private sector company reduced its simulation time with the ANSYS Mechanical finite element application from approximately 160 hours using its shared memory solver/algorithm at 8 cores to a mere 2.6 hours using NCSA’s distributed memory solver at 192 cores, according to Burness.
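A quick calculation shows what those ANSYS Mechanical figures imply. The sketch below computes the speedup and the speedup per added core from the numbers Burness cited; the function names are illustrative, and because the run also switched from a shared memory solver to a distributed memory solver, the result is not a pure measure of how the application scales with cores.

```python
def speedup(baseline_hours: float, new_hours: float) -> float:
    """How many times faster the new run finished."""
    return baseline_hours / new_hours


def parallel_efficiency(observed_speedup: float, baseline_cores: int, new_cores: int) -> float:
    """Observed speedup divided by the increase in core count (1.0 means linear scaling)."""
    return observed_speedup / (new_cores / baseline_cores)


# Figures from the article: about 160 hours on 8 cores versus 2.6 hours on 192 cores.
s = speedup(160.0, 2.6)
e = parallel_efficiency(s, baseline_cores=8, new_cores=192)
print(f"speedup: {s:.1f}x on {192 // 8}x the cores, efficiency {e:.2f}")
# -> speedup: 61.5x on 24x the cores, efficiency 2.56
```

An efficiency above 1.0 means the job ran faster than the extra cores alone would explain, which is consistent with the change of solver contributing to the gain.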
Adams Thermal Systems Inc., a supplier to John Deere, was able to slash the simulation time for a CD-adapco Star-CCM+ application from as many as 10 days when running on four cores to just a few hours when shifted to 256 cores at NCSA, Burness said.
“When you aggregate the performance of several hundred disks,” Burness said, “you can get some really breathtaking performance figures.”