This article can also be found in the Premium Editorial Download "Storage magazine: CDP 2.0: Finding success with the latest continuous data protection tools."
"Lustre provides us good performance from a single node, so we can push 700MB/sec to 800MB/sec from a single node, which is better than you can get over Gigabit Ethernet," says Minyard. "Or on aggregate, on Ranger we can push 30GB/sec to the single largest file system. At peak performance we've hit over 40GB/sec," he adds.
"Users can control the level of striping according to the performance they need from the system or how their application works," says Minyard. "So, for example, if someone is running on 4,000 MPI tasks, they may want to collect all that data to a single system and write it out. In that case, they would want to stripe across a lot of servers because they will get really good performance from a single server as it's shooting data to a lot of different servers rather than to just one."
However, says Minyard, if an app is writing out files per task and every task is writing out 4,000 files, it's better to have a smaller stripe because then each server shares the computing load. A disadvantage of Lustre vs. other file systems, he says, is how it handles server or storage failures: Lustre doesn't dynamically adjust to the failure of one node or another.
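The striping control Minyard describes is typically set with Lustre's `lfs` utility. The sketch below shows the two patterns he contrasts: wide striping for a single large shared file, and a narrow stripe for file-per-task output. The paths and stripe counts are illustrative assumptions, not TACC's actual settings, and the commands require a mounted Lustre file system.

```shell
# Wide striping: one big checkpoint file written by many MPI tasks.
# Spreading it across many object storage targets (32 here) lets
# aggregate write bandwidth scale with the number of servers.
lfs setstripe -c 32 /scratch/shared_checkpoint

# File-per-task output: keep the stripe count at 1 so each of the
# thousands of small files lands on a single server, and the load
# spreads across servers file by file instead of stripe by stripe.
lfs setstripe -c 1 /scratch/per_task_output_dir

# Inspect the layout that a file or directory actually received.
lfs getstripe /scratch/shared_checkpoint
```

Setting the stripe count on a directory, as in the second command, makes new files created inside it inherit that layout, which is how per-application defaults are usually applied.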
"If a storage node fails, Lustre will pause," says Minyard, who has had occasions where one of the Thumper servers failed. But he says that when a node fails, Lustre copes well, even though manual intervention is needed to restore the node.
"The nice thing about Lustre is its recovery mechanism," says Minyard. "[If] all of a sudden the server reboots or has a kernel panic, the Lustre file system will pause. Anything that's trying to write is sitting there waiting, and once the file server comes back up, Lustre goes out and replays any of the transactions that were outstanding. All the jobs that were waiting to write continue to write as if nothing happened. It's only a 15- to 20-minute interruption."
There are other failover mechanisms built into Ranger. Each metadata server has active failover. The Sun StorageTek 6540 arrays with FC disks are mirrored to each other, says Minyard, so "they never lose metadata information." In addition, the Sun Datacenter Switch 3456 switches are paired for redundancy.
Given the size of Ranger's environment, Minyard follows the advice of the experts and doesn't back up the entire system. "We back up only the users' /home directories, just because of the sheer capacity of the data," he explains. "So far, we have almost 800TB. Trying to back that up isn't feasible. A lot of the data is scratch data (checkpoint data, restart files) that users write but don't have to save."
When users need to back up their data, he adds, "they do it themselves, using command line utilities we supply. The data is saved to the Sun StorageTek SL8500 tape library."
In addition, "we don't back up whole file systems, just some of the /home file systems for users," says Minyard. "We have some homegrown utilities that use tar and some incremental mechanisms. The Lustre file system right now doesn't have hierarchical storage management capability, but they're working on it. That's something we want to explore in the future."
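The tar-plus-incremental approach Minyard mentions can be sketched with GNU tar's snapshot mechanism: a level-0 full backup records file metadata in a snapshot file, and later runs against the same snapshot archive only what changed. The directory names and file contents below are hypothetical stand-ins, not TACC's actual scripts.

```shell
# Hypothetical staging areas standing in for a /home tree and a
# backup target (which at TACC would feed the SL8500 tape library).
DATA_DIR=$(mktemp -d)
BACKUP_DIR=$(mktemp -d)
echo "results" > "$DATA_DIR/run1.out"

# Level 0: full backup; home.snar records what was archived and when.
tar --create --file="$BACKUP_DIR/home-full.tar" \
    --listed-incremental="$BACKUP_DIR/home.snar" \
    -C "$DATA_DIR" .

# A new file appears between backup runs.
echo "more results" > "$DATA_DIR/run2.out"

# Level 1: rerunning with the same snapshot file archives only
# files that are new or changed since the level-0 run.
tar --create --file="$BACKUP_DIR/home-incr1.tar" \
    --listed-incremental="$BACKUP_DIR/home.snar" \
    -C "$DATA_DIR" .
```

Each incremental archive stays small because unchanged files are skipped; restoring means extracting the full archive first, then each incremental in order.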
This was first published in October 2008