The GET Conference, held Tuesday at the Microsoft New England Research & Development Center, featured pioneers in the field of human genomics, including Nobel Prize winner James Watson, who helped establish the Human Genome Project and was among the first to have his personal genome sequenced. The conference was held to benefit the Personal Genome Project (PGP), a collaboration between researchers, biotechnology companies and volunteers from the general public looking to make personal gene sequencing more practical and affordable.
The principals of the Personal Genome Project state that gene sequencing of individuals could lead to new treatments for diseases and improve preventive care, while contributing to a new cultural understanding of genealogy and identity. But IT pros charged with supporting genome sequencing research said there are data management hurdles to clear before those goals can be met.
Data management is the next frontier
"We've seen about a 30x scale-up in our storage in the last three years," said Matthew Trunnell, manager of research computing and acting director of advanced IT for the Broad Institute of MIT and Harvard, a joint venture between the two schools that is devoted to the research of genomic medicine.
"As things evolve, I expect the error rate to go down and we won't have to over-sample as much," he said. "The growth that's so dominated our storage environment for the last three years — I can see an end to that."
But whatever storage Trunnell is able to free up by storing fewer images and data will be almost immediately repurposed to support the medical research that is the ultimate purpose of all this sequencing and computation.
"Today we have about 600 terabytes of research data total, but it's doubling every nine to 10 months — three years ago it was 50 terabytes," he said.
Gene sequencing data is instrument generated and structured, and it's relatively easy to predict and manage storage needs, but research data is "truly unstructured," Trunnell said. "There's growing concern about the implications of an ad hoc approach to data management." Right now there are 1.9 billion files that Trunnell said he has "already written off — I'll have to keep them forever because I'm never going to be able to ascribe value or meaningful ownership to them."
That statement sounds familiar to enterprise data storage managers who have grappled with data classification and long-term management for years, but Trunnell said data archiving and management tools still need to be developed for his field's particular needs. "So much of the software we use expects to interact with data through a Unix file system," he said, meaning most object-based or content addressable storage (CAS) systems out now aren't an option for him.
There is a movement to introduce a data archiving and management system to high-performance computing (HPC) environments called iRODS (Integrated Rule-Oriented Data System). iRODS is being developed by the Data Intensive Cyber Environments Center (DICE Center) at the University of North Carolina at Chapel Hill. But iRODS won't eliminate all data management obstacles in the field, Trunnell said.
"The key is data is growing so quickly that as a discipline we've moved from being relatively data poor to a period of enjoying a wealth of data," he said. "But that's also the beginning of a data-intensive era where we'll have to think about problems in a way we haven't before."
Stuart Glenn, a software engineer at the Oklahoma Medical Research Foundation, agreed that applications need to be updated to contend with storage growth. "A lot of software isn't built to analyze a large data set — it's looking to put it all in memory," he said. "We can store it all, but finding it and not storing it twice is still an issue."
Data growth and oversequencing
Early gene sequencing instruments generated high-resolution images that had to be stored and then broken down into a binary file that could be analyzed by a computer, but recent gene sequencing instruments have been able to eliminate that step, according to Jay Flatley, president and CEO at genotyping services firm Illumina Inc. Now each base pair in the genetic sequence can be stored in as little as one byte of storage capacity.
"There's been a constant reduction in data size. [Researchers] don't look at images now, so the results can be stored on regular drives," Flatley said.
But Stephen Quake, professor of bioengineering and co-chair of the department of bioengineering at Stanford University, pointed out that "oversequencing" still leads to explosive storage growth. While each base pair of the DNA sequence can be stored on its own, many researchers choose to keep the original encoded image with its binary-encoded "intensity" values so they can be re-analyzed as algorithms evolve. This can represent up to 30 times the amount of storage capacity needed to store each individual genome.
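Back-of-the-envelope numbers show why that 30x figure dominates budgets. Assuming roughly three billion base pairs per human genome at one byte per base (both assumptions, not figures from the article beyond the 30x multiplier):

```python
# Rough storage math for oversequencing.
# Assumptions: ~3 billion base pairs/genome, 1 byte/base, 30x raw data kept.
BASES_PER_GENOME = 3_000_000_000
OVERSAMPLING = 30

bytes_per_genome = BASES_PER_GENOME * OVERSAMPLING
gb_per_genome = bytes_per_genome / 1e9
genomes_per_30_tb = int(30e12 // bytes_per_genome)

print(f"{gb_per_genome:.0f} GB per genome with raw data")  # 90 GB
print(f"{genomes_per_30_tb} genomes fill a 30 TB array")   # 333
```

Under those assumptions, the 30 TB array Quake's department filled holds on the order of only a few hundred genomes' worth of raw data.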
"We spend almost as much on computer equipment and storage as we do on genetic sequencing tools," Quake said. He said his department at Stanford has already filled up 30 TB of Isilon's IQ clustered NAS storage it bought six months ago.
These issues are also being addressed in current research, said George Church, professor of genetics at Harvard Medical School and founder of the Personal Genome Project.
One approach to making gene sequencing storage more efficient, he said, would be to store a "pan-genome," a union of the common sequences of DNA across humanity, and then compare new genomes to the reference sequence — a method similar to storing only changes with space-efficient snapshots.
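The approach Church describes is essentially delta encoding against a shared reference, the same principle behind space-efficient snapshots. A minimal sketch (hypothetical helper functions, handling only simple substitutions and ignoring insertions/deletions):

```python
# Sketch of reference-based delta storage: keep only the positions
# where an individual genome diverges from a shared reference.
# Simplified to same-length substitutions; real variants include indels.
def diff(reference: str, genome: str) -> list[tuple[int, str]]:
    """Return (position, base) pairs where genome differs from reference."""
    return [(i, g) for i, (r, g) in enumerate(zip(reference, genome)) if r != g]

def reconstruct(reference: str, variants: list[tuple[int, str]]) -> str:
    """Rebuild the full genome from the reference plus its variant list."""
    seq = list(reference)
    for pos, base in variants:
        seq[pos] = base
    return "".join(seq)

reference = "ACGTACGTAC"
genome    = "ACGTTCGTAA"

variants = diff(reference, genome)
print(variants)  # [(4, 'T'), (9, 'A')]
assert reconstruct(reference, variants) == genome
```

Since human genomes differ from one another in only a small fraction of positions, storing just the variant list can shrink per-genome storage by orders of magnitude relative to keeping every full sequence.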
"If you find something you want to take into clinical research, some camps believe in storing as much data as possible," he said. "If we could get to the point where we could start to interpret images immediately and send them to sophisticated analysis tools in one fluid motion it might help, but we're not fully there yet, and people keep 'pack-ratting' away."