Editor's note: This is the third in a series of five articles highlighting the recent Storage Innovator award winners. Dartmouth's fMRI Data Center received an honorable mention on April 10 at the Storage Management 2003 conference in Chicago.
Scientists now regularly use medical imaging techniques such as magnetic resonance imaging (MRI) to study brain function. While most MRIs create a static record of the body, in vivo studies map brain functions over time. Such functional MRIs require costly equipment as well as appropriate subjects. They also generate huge amounts of data -- an in vivo functional magnetic resonance imaging (fMRI) dataset is typically 5 gigabytes (GB) and can run up to 45GB. With better equipment and techniques, file sizes keep growing.
The data generated by one study can be of immense value to other scientists and educators who are studying the same or similar topics. Unfortunately, while the studies are published in scientific journals with a couple of illustrations, much of the valuable raw functional imaging data isn't available to other researchers.
The fMRI Data Center at Dartmouth College, Hanover, NH, has a lofty aim: to archive the data from these studies and make it freely available to researchers around the world.
"We started this project to gather that data so people could subject it to new analyses and visualization techniques," said operations director John Van Horn.
He sees a secondary role for the archived data as "a fossil record for our field. In ten years, we can look back and see how far we've come."
The ever-expanding file sizes and growing archive quickly created storage problems for the fMRI data center. "We took a first pass at this with a Sun Enterprise 5500 system and a terabyte of disk space," Van Horn said. "We filled that up in a few months. We got another terabyte, and after that, we realized we'd better be a little more on the ball about how we're doing this."
Sheer bulk was not the only problem. "The first order of business," Van Horn said, "was to construct a framework that would allow scientists easy access to raw data from published, peer-reviewed studies." The data center needed a way to handle metadata about the studies: things like how old the subject was, what stimuli or tasks were given during the study, how often images were collected and the strength of the scanner.
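As a rough illustration of what such a metadata record might look like, here is a minimal Python sketch. The class name, field names and sample values are all hypothetical, chosen only to mirror the items listed above; the article does not describe the center's actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StudyMetadata:
    """Hypothetical metadata record; field names are illustrative only."""
    study_id: str
    subject_ages: List[int]          # age of each subject, in years
    tasks: List[str]                 # stimuli or tasks given during the study
    seconds_between_images: float    # how often images were collected
    scanner_field_tesla: float       # strength of the scanner


def find_by_task(catalog: List[StudyMetadata], task: str) -> List[str]:
    """Return the IDs of studies that used the given task, in catalog order."""
    return [s.study_id for s in catalog if task in s.tasks]


# Made-up sample entries, not real studies.
catalog = [
    StudyMetadata("study-A", [24, 31], ["word recall"], 2.0, 1.5),
    StudyMetadata("study-B", [45], ["face matching", "word recall"], 3.0, 3.0),
    StudyMetadata("study-C", [19, 22, 27], ["finger tapping"], 2.5, 1.5),
]

matches = find_by_task(catalog, "word recall")
```

With records like these, a researcher could narrow the archive to studies that used a particular task before requesting any raw image data.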
The resulting architecture is designed to take into account the "data life cycle." "When a study is first submitted," said senior systems administrator James Dobson, "it's new and exciting, so we expect a lot of requests for it. As newer studies come in, that study might not be requested as often and is moved to secondary storage."
Researchers submit studies in varying formats and via all kinds of media, including CD, tape and FTP. When a new study arrives at the fMRI data center, staff first organizes it and puts it into a standard format. Then it's placed in the disk-based section of the archive, composed of a Sun StorEdge 3910 RAID 5, a pair of Sun Fire V480R servers and Enterprise 5500 servers.
As requests for that study dwindle, it will be flushed off the disk to make way for more popular data. However, the fMRI data center data never dies; it's moved to a Sun StorEdge L700 tape library, with StorageTek tape drives and cartridges. If the study is requested again, it's pulled from the tape library and moved back onto disk, where it starts the disk-to-tape cycle again.
The biggest innovation in the system is a hierarchical storage management system (HSMS). For this, Van Horn uses Sun's StorEdge Utilization Suite with Sun StorEdge SAM-FS software, which automatically handles cycling data from disk to tape and back again. When the HSMS receives a request for a dataset, it automatically retrieves the data from wherever it resides in the HSMS space, whether on disk or on a tape in the library.
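The disk-to-tape cycle can be pictured with a toy two-tier model in Python. This is only a sketch under stated assumptions: the ToyHSM class and its methods are invented for illustration, eviction by least recent request is an assumed policy, and none of it reflects how SAM-FS is actually implemented or configured.

```python
from collections import OrderedDict


class ToyHSM:
    """Toy two-tier store: a limited disk tier in front of an unlimited tape tier."""

    def __init__(self, disk_capacity_gb: float):
        self.disk_capacity_gb = disk_capacity_gb
        self.disk = OrderedDict()  # study_id -> size in GB, least recently requested first
        self.tape = {}             # study_id -> size in GB

    def _make_room(self, size_gb: float) -> None:
        # Flush the least recently requested studies to tape until the new one fits.
        while self.disk and sum(self.disk.values()) + size_gb > self.disk_capacity_gb:
            old_id, old_size = self.disk.popitem(last=False)
            self.tape[old_id] = old_size

    def store(self, study_id: str, size_gb: float) -> None:
        """A newly submitted study lands on disk first."""
        if size_gb > self.disk_capacity_gb:
            raise ValueError("study is larger than the whole disk tier")
        self._make_room(size_gb)
        self.disk[study_id] = size_gb

    def request(self, study_id: str) -> str:
        """Serve a study from wherever it resides; return the tier it was found on."""
        if study_id in self.disk:
            self.disk.move_to_end(study_id)  # mark as recently requested
            return "disk"
        size_gb = self.tape.pop(study_id)    # raises KeyError for an unknown study
        self._make_room(size_gb)
        self.disk[study_id] = size_gb        # the recalled study restarts the cycle
        return "tape"


hsm = ToyHSM(disk_capacity_gb=100)
for name in ("a", "b", "c"):
    hsm.store(name, 45)             # storing "c" pushes "a" out to tape

first = hsm.request("a")            # recalled from tape, which pushes "b" out
second = hsm.request("c")           # still on disk
```

The caller never says which tier to read from; request() finds the study either way, which is the transparency the HSMS provides to researchers.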
"This lets us use the expensive, faster disk system for studies that are requested often," said Dobson. "It lets us automatically determine what is most important. From a total cost of ownership perspective, it lets us spend our money on the right stuff. It's a unique system and very exciting for us." The software also helps to find relationships between data sets, so that not only a frequently requested study but also related studies stay on the disk media.
This file system can read and write at very high speed because of its multithreaded architecture and its separation of metadata from the data volumes. This approach lets the fMRI Data Center (fMRIDC) tune the file system both for large volumes of small images and for large single-image files.
Besides the data center itself, the system architecture includes an "open-access cluster," a publicly available parallel computer system that's offered to researchers for complex meta-analysis, and basic infrastructure to handle computation and support the fMRIDC's database, website and file services. "Many centers in the country can't afford the scanner, but have very bright people who may have some ideas or theories they want to test," said Van Horn. "They should be able to have unhindered access to the data." Thanks to the fMRI Data Center, now they do.
Analysts agree that this method is valid for scaling to the levels the fMRI center will need in years to come. "What's unique is its architecture to support scientists interested in accessing raw data from scientific studies, and building a storage area network (SAN) datastore to support a significant data warehouse that will likely grow in the next few years," said Jamie Gruener, senior analyst with the research firm Yankee Group.