IBM's revelation last week that it scanned 10 billion files in 43 minutes using its General Parallel File System (GPFS) technology and solid-state drives (SSDs) showed the potential of SSDs and parallel processing to help organizations manage their rapidly growing data stores.
IBM achieved the results with a combination of new algorithms in its clustered parallel file system and a hardware configuration of 10 eight-core IBM 1036 M2 servers and four Violin Memory 3205 flash SSD arrays that stored 6.5 TB of metadata for a file system containing 10 billion files, said Bruce Hillsberg, director of storage systems at IBM Research Almaden.
Charles King, president and principal analyst at Pund-IT, called the IBM GPFS performance test an interesting experiment but hesitated to call it a commercial product yet. The cost and performance of that setup would go beyond what many organizations can afford or need, although King said a scaled-down version would still bring performance gains over much of the technology in production today.
“This is like a super-charged GPFS system. There are a small number of applications that really need this high-end GPFS performance today,” he said. “But IBM has been very effective at creating commercial solutions on what began as company research and development projects. The company is extremely good at productizing their research. You can use this to create similar performance in a smaller footprint. The good thing about the technology is that it scales downward as well as upward.”
IBM’s Hillsberg said SSDs were a significant factor in the performance achieved in the most recent GPFS lab test. In 2007, IBM scanned 1 billion files in three hours with approximately 20 disk drives. Processing the metadata of 10 billion files without SSDs would require at least 200 disk drives, according to Hillsberg.
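To put those figures side by side, the short calculation below derives scan rates from the numbers quoted above. It is a back-of-the-envelope sketch, not IBM's own analysis: the variable names are illustrative, and the 200-drive estimate assumes scan throughput scales linearly with the number of disk drives, which the article does not state.

```python
# Figures quoted in the article.
FILES_SSD_TEST = 10_000_000_000   # files scanned in the recent SSD-based test
SECONDS_SSD_TEST = 43 * 60        # 43 minutes
FILES_2007_TEST = 1_000_000_000   # files scanned in the 2007 test
SECONDS_2007_TEST = 3 * 3600      # three hours
DRIVES_2007_TEST = 20             # "approximately 20 disk drives"


def scan_rate(files: int, seconds: float) -> float:
    """Return the average scan rate in files per second."""
    return files / seconds


rate_ssd = scan_rate(FILES_SSD_TEST, SECONDS_SSD_TEST)
rate_2007 = scan_rate(FILES_2007_TEST, SECONDS_2007_TEST)
speedup = rate_ssd / rate_2007

# Assumption (not from the article): throughput scales linearly with drives.
per_drive_2007 = rate_2007 / DRIVES_2007_TEST
hours_with_200_drives = FILES_SSD_TEST / (per_drive_2007 * 200) / 3600

print(f"SSD test:  {rate_ssd:,.0f} files/s")
print(f"2007 test: {rate_2007:,.0f} files/s")
print(f"Speedup in files/s: {speedup:.1f}x")
print(f"10 billion files on 200 drives: {hours_with_200_drives:.1f} hours")
```

Under that linear-scaling assumption, 200 disk drives would scan 10 billion files in about the same three hours the 2007 test needed for 1 billion, which is still far slower than the 43-minute SSD result.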
GPFS is widely deployed as search engine software for databases and high-performance computing. IBM uses GPFS in commercial products such as its Scale Out Network Attached Storage (SONAS) and Information Archive. Enhancements that IBM made to GPFS for the test will be included in subsequent releases.
“All changes made to GPFS will go into GPFS products, and software offerings and products based on GPFS,” Hillsberg said.
According to an IBM whitepaper, the information lifecycle management (ILM) function in GPFS acts like a database query engine to identify files of interest. GPFS runs in parallel and scales out as additional resources are added. Once the files of interest are identified, the GPFS data management function uses parallel access to move, back up or archive the user data. GPFS tightly integrates the policy-driven data management functionality into the file system. This high-performance engine allows GPFS to support policy-based file operations on billions of files.
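The general idea of a policy-driven scan can be sketched in a few lines of Python. This is an illustrative sketch only: it does not use GPFS's policy language or any GPFS API, and the `FileMeta` record, the `archive_candidate` rule and the sample catalog are all hypothetical. It shows a rule evaluated against file metadata, with the catalog split into partitions that are scanned concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical metadata record for one file."""
    path: str
    size_bytes: int
    days_since_access: int


Rule = Callable[[FileMeta], bool]


def scan_partition(records: List[FileMeta], rule: Rule) -> List[FileMeta]:
    """Return the records in one partition that match the rule."""
    return [r for r in records if rule(r)]


def parallel_scan(records: List[FileMeta], rule: Rule, workers: int = 4) -> List[FileMeta]:
    """Split the catalog into partitions and scan them concurrently."""
    chunks = [records[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda chunk: scan_partition(chunk, rule), chunks))
    # Merge per-partition results into one list, ordered by path.
    return sorted((r for part in parts for r in part), key=lambda r: r.path)


def archive_candidate(meta: FileMeta) -> bool:
    """Hypothetical policy: large files not accessed in over 90 days."""
    return meta.days_since_access > 90 and meta.size_bytes >= 1_000_000


catalog = [
    FileMeta("/data/a.log", 5_000_000, 200),
    FileMeta("/data/b.csv", 10_000, 400),
    FileMeta("/data/c.bin", 2_000_000, 10),
    FileMeta("/data/d.tar", 9_000_000, 120),
]

selected = parallel_scan(catalog, archive_candidate)
for meta in selected:
    print(meta.path)
```

Python threads are used here only to show the partition-and-merge structure; a real parallel file system spreads this work across many nodes and storage devices.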
“The reason this is important is that there's an explosion of data,” Hillsberg said. “Customers have to manage that data so they can identify what to back up, what to replicate for disaster recovery and to determine what data is appropriate for tiering. They have to scan files to figure out what changed.”
The lab test was conducted at the IBM Advanced Storage Laboratory at its Almaden Research Center in San Jose, Calif.