When designing its Sooner supercomputer, the University of Oklahoma Supercomputing Center for Education and Research (OSCER) improved the supercomputer's performance by segregating its storage networks and installing low-latency InfiniBand connectivity.
According to Henry Neeman, the university's IT director of supercomputing, more than 450 users are on the system, representing about 20 science and engineering disciplines. Weather forecasting is one of the major Sooner projects. On the storage side, "we have several different flavors," said Neeman. "Placing different populations of users on different types of storage was valuable."
According to Neeman, the storage is split up so each type of data is handled by the appropriate file system. "We discovered that when you have people who need large scale, high-performance parallel I/O and those who don't, they tend to interfere with each other if you put them on the same file system," he said. "The ones running lots of small files tend to bog the system down and those who need high performance can't get high performance. If you separate them, people doing lots of small files are happy, people doing high performance are happy, and you don't deal with contention. If you have a sociology problem, you use a sociology solution, not a technology solution."
A technology solution was required when setting up the cluster that university researchers will use to study and model tornadoes. The goal is to improve early warning systems and minimize damage from the windstorms.
For that, Sooner uses 20-Gbps InfiniBand QLogic 7200 host channel adapters, a 288-port QLogic SilverStorm 9420 InfiniBand director switch, and 37 SilverStorm 9024 24-port InfiniBand switches to connect 534 compute nodes.
There are 40-Gbps InfiniBand products hitting the market, but Neeman says they were "out of our price range" and 20-gig is sufficient bandwidth for the supercomputer's needs. Sooner achieved 28 teraflops (a teraflop is one trillion floating-point operations per second) at 83 percent efficiency in benchmarking for the Top500 supercomputer list.
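To put the benchmark figures in context, the short sketch below works out the theoretical peak they imply. It assumes the 83 percent efficiency is the usual ratio of measured performance to theoretical peak; the article does not state the peak figure, so the result is only an inference, and the function name is illustrative.

```python
def implied_peak_tflops(measured_tflops: float, efficiency: float) -> float:
    """Return the theoretical peak implied by a measured result.

    Assumes efficiency = measured / peak, which is an assumption
    about how the article's 83 percent figure was computed.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    return measured_tflops / efficiency


measured = 28.0    # teraflops, as reported in the article
efficiency = 0.83  # 83 percent, as reported in the article

peak = implied_peak_tflops(measured, efficiency)
print(f"Implied theoretical peak: {peak:.1f} teraflops")
```

Under that assumption, the cluster's theoretical peak would be a little under 34 teraflops.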
"We're much more interested in latency than bandwidth," Neeman says. "Our applications are much more sensitive to latency, so it's extremely important that our high-performance interconnect have very low latency. In general, having low latency but not having reliability is not terribly valuable. We are delighted with the high bandwidth 20-gig InfiniBand brings, but the key value is latency."
The QLogic InfiniBand devices were installed over the summer when OSCER migrated from its Xeon 64-bit TopDawg cluster with Cisco Topspin switches to the Sooner quad-core Xeon cluster. The trick was migrating without disrupting the availability of university production data stored on the supercomputer.
"We had the same space available as our old cluster was in," says Neeman. "So we had to do an in-space transition and gradually migrate from the old cluster to the new cluster, and we had to do it with minimal downtime because we're a production facility. We have users with publishing deadlines, some running real-time weather forecasting, and others completing dissertations so they can graduate."
The migration consisted of ramping up new compute nodes while bringing down old nodes, all in the same racks with the exact same power and cooling. Neeman said the migration began in mid-July and the first user was on the new system by mid-August. The transition was completed by October, and Neeman said that Sooner peaked at over 80 percent of capacity within 48 hours.