According to analyst firm IDC, the market for high-performance computing servers will reach $15.6 billion by 2012. But for storage administrators, the growth of the HPC server market translates into unique backup challenges, created by the special requirements of HPC.
HPC raises two issues when it comes to backup and disaster recovery preparation: the volume of data and the sheer number of files.
The workload or data volume generated by HPC applications can be very large when dealing with files containing seismic or genomic information. "Those files can be incredibly large," says Gartner analyst David Russell. "Traditional backup approaches may not be adequate or may simply take too much time." For example, he notes, some HPC files can be in the petabyte range.
Some HPC applications also generate exceptionally large numbers of files – "literally millions," according to Russell. "The challenge of how you account for those files or the time it might take to go through an operating system and traverse the file system to see what files have changed is very much a 'heavy lifting' task." Getting that data on disk, or simply getting it through the server and switch, might take too much time. In short, he says, applying traditional backup tools directly to HPC tasks can be a formula for disaster.
As an alternative to traditional backup tools, Russell says that an HPC administrator could combine technologies such as array-based snapshots.
Still, vendors offering compression techniques, such as Ocarina Networks, "have figured out how to reverse-engineer giant files and look for redundancies," says Russell, and there may be ways to further improve the process.
But the number of files in HPC environments is still a major challenge for backup administrators. "If you have a million I/O cycles for a million files, the effort of interrogating all those files, even with a nightly update, will take a long time," says Russell. "I've heard of some HPC applications where it took 30 hours to do a full backup and 28 hours of that was just spent scanning to see what files had changed."
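Russell's point about scan time can be illustrated with a minimal sketch: an incremental backup must stat every file just to decide whether it changed, so with millions of files the metadata traversal itself dominates the backup window, even when almost nothing needs copying. The function below is illustrative, not from any particular backup product.

```python
import os


def changed_since(root, last_backup_ts):
    """Walk the tree and stat every file, returning paths modified
    since the last backup. Note that the scan touches every inode
    regardless of how few files actually changed -- this is the
    'heavy lifting' that dominates at millions of files."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_mtime > last_backup_ts:
                    changed.append(path)
            except OSError:
                continue  # file deleted mid-scan; skip it
    return changed
```

Block-level approaches such as array snapshots sidestep this cost entirely because they never traverse the namespace.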
In a world with no resource constraints, a storage administrator would have the necessary disk, power and floor space to handle all these backup tasks, says Russell. But what makes it even more difficult is that HPC environments are usually oriented towards scale-out, with lots of servers crunching data. That implies the need for tightly coordinated backup, because, notes Russell, "You don't want different points in time on 25 different servers." Backup can be coordinated, he notes, through "brute force methods" that flush buffers and set a machine check point.
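One way to picture the "brute force" coordination Russell describes is a flush-then-barrier sequence: each node quiesces writes and flushes its buffers, then waits until every other node has done the same, so that all nodes record the identical logical point in time. The sketch below simulates the nodes as threads; the buffer-clearing stand-in and node count are assumptions for illustration.

```python
import threading
import time

NUM_NODES = 3
barrier = threading.Barrier(NUM_NODES)
checkpoint_times = {}


def quiesce_and_checkpoint(node_id, write_buffer):
    # 1. Stop accepting new writes and flush what is buffered
    #    (clearing a list stands in for fsync of real buffers).
    write_buffer.clear()
    # 2. Wait until every node has flushed...
    barrier.wait()
    # 3. ...then all nodes mark the same consistent point in time.
    checkpoint_times[node_id] = time.time()


threads = [
    threading.Thread(target=quiesce_and_checkpoint, args=(i, [f"write-{i}"]))
    for i in range(NUM_NODES)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because no node records its checkpoint until the barrier releases, the resulting timestamps cluster tightly, avoiding the "different points in time on 25 different servers" problem.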
HPC can bear small amounts of downtime
David Hill, an analyst with storage analyst firm The Mesabi Group, points out that for many HPC applications, small amounts of downtime would not be noticeable to the user because many compute-intensive jobs are actually batch jobs. That means the user will not see the results until the job has run to completion. "For a 1-hour-plus job, would five minutes missing in the middle be noticeable?" asks Hill. "The answer is no."
According to Hill, "What these types of jobs really need is checkpoint/restart capabilities, where the state of the memory in the computing environment is written to disk periodically so that it can be restarted."
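A minimal sketch of the checkpoint/restart pattern Hill describes: the job periodically writes its in-memory state to disk, and on restart it resumes from the last saved state instead of recomputing from scratch. The file name, step counter, and toy workload below are all hypothetical.

```python
import json
import os

CHECKPOINT = "job_state.json"  # hypothetical checkpoint file


def load_checkpoint():
    """Resume from the last saved state, or start fresh."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_checkpoint(state):
    """Write state to a temp file, then rename atomically, so a
    crash mid-write cannot leave a corrupt checkpoint behind."""
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT)


def run_job(steps=1000, checkpoint_every=100):
    state = load_checkpoint()
    while state["step"] < steps:
        state["total"] += state["step"]  # stand-in for real compute
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state)
    return state["total"]
```

If the job dies between checkpoints, only the work since the last save is lost, which is exactly why a few minutes missing in the middle of a long batch run is invisible to the user.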
Depending on the value of timeliness and the value of the data, Hill says that businesses doing HPC might also be willing to consider an active-active failover strategy to a remote disaster recovery site, covering both operational recovery from a local problem and disaster recovery from a site-wide outage. Another option, according to Hill, is performing continuous data protection (CDP) locally, combined with a virtual tape library (VTL) and a standard backup-restore package.
About the author: Alan R. Earls is a Boston-area writer focusing on the intersection of technology and business.
This was first published in November 2008