A research center at Johns Hopkins University turned to Caringo Inc.'s CAStor content-addressed storage (CAS) software to provide data archiving and also to manage its sensitive and rapidly expanding genotyping data.
The Center for Inherited Disease Research (CIDR) provides genotyping and statistical genetics services for investigators trying to identify genes that contribute to human disease. The work of CIDR is, to put it bluntly, a data hog. As part of its research, CIDR might scan up to 12 DNA samples on one slide, according to Lee Watkins, Jr., the Center's director of Bioinformatics. One sample can produce files ranging from 2 GB to 4 GB.
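To put those figures in perspective, here is a rough back-of-the-envelope sketch of how much data a single slide could produce. It assumes a slide carries the full 12 samples and that each sample yields between 2 GB and 4 GB; the function and constant names are illustrative and do not come from CIDR.

```python
# Back-of-the-envelope estimate using the figures quoted in the article.
SAMPLES_PER_SLIDE = 12        # "up to 12 DNA samples on one slide"
GB_PER_SAMPLE = (2, 4)        # "files ranging from 2 GB to 4 GB"


def slide_volume_gb(samples=SAMPLES_PER_SLIDE, per_sample=GB_PER_SAMPLE):
    """Return the (low, high) data volume in GB for one slide."""
    low, high = per_sample
    return samples * low, samples * high


low_gb, high_gb = slide_volume_gb()
print(f"One full slide: {low_gb} GB to {high_gb} GB")
```

By this estimate, a few dozen fully loaded slides are enough to produce a terabyte.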
CIDR uses CAStor to archive the data and delete it from the Windows file share. With data from tens of thousands of DNA samples in its system, the archive builds up fast. The Baltimore-based CIDR often generates terabytes of data a week, sometimes hitting a terabyte in one day. The Center used high-capacity PetaBox systems from Capricorn Technologies to store the data, but last summer the 50-person research team realized it needed help managing it all.
But perhaps the hardest part was finding technology that wouldn't deplete the budget. "We're well-funded, but we can't go out and buy a system from EMC or Hitachi to do this," Watkins said. "We said, 'There has to be somebody who has written software that can keep track of this.'"
CIDR became aware of Caringo through Capricorn. Caringo gave CIDR a free trial period to test CAStor. CAStor passed the test and CIDR became a paying customer last November. The Center started with a 30 TB CAStor cluster and is now up to a 99.9 TB cluster with 80 TB used, and it is still growing.
To keep up with its data growth, the Center is installing a high-density Rackable Systems array for more capacity and will install CAStor clusters on that as well. The new setup is scheduled to go live in August.
At first, CAStor had trouble keeping up with the data that the Center was throwing at the clusters. "It wasn't 100% robust," Watkins said. "There were cases where a disk wouldn't fail but it would stop performing and act weird, give us little hiccups now and then. They wrote a fix a few months ago, and we haven't had that problem."
Derek Gascon, Caringo marketing vice president, said, "They wanted to have disk capacity freed up much quicker, so we put together a new version for them that includes a faster turnaround in releasing disk capacity." That fix is now included in the general release of the product.
According to Watkins, no relief from data growth is in sight. "Our plan is to keep data online for a year," he said. "We haven't gotten to that point yet where we've released projects, so we can't predict our high water mark. But we suspect it will be between 300 and 400 TB."
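As a sanity check on that projection, the sketch below works out the average weekly ingest that a 300 TB to 400 TB high-water mark would imply. It assumes, purely for illustration, that data arrives at a steady rate and that the peak equals exactly one year (52 weeks) of retained data; neither assumption comes from CIDR, and the function name is hypothetical.

```python
# Implied average ingest rate for a one-year retention window.
WEEKS_RETAINED = 52                 # "keep data online for a year"
PEAK_TB_RANGE = (300, 400)          # "between 300 and 400 TB"


def implied_weekly_ingest_tb(peak_tb, weeks=WEEKS_RETAINED):
    """Average TB per week needed to reach peak_tb after `weeks` weeks."""
    return peak_tb / weeks


low_rate, high_rate = (implied_weekly_ingest_tb(tb) for tb in PEAK_TB_RANGE)
print(f"Implied ingest: {low_rate:.1f} to {high_rate:.1f} TB per week")
```

Under those assumptions the estimate works out to several terabytes a week, in line with the Center's reported output.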
CIDR keeps its data on tape for long-term archiving, but uses CAStor for active data. "We've had to recover a lot of stuff we didn't think we would have to recover, and it's there," Watkins said. "What we were doing before was not scalable, and we couldn't keep track of everything. We had to do everything on a separate storage device.
"Now it's simple as simple can be," he said. "You need more storage, add another storage device, boot up from a NetBoot server and you're done."
Watkins said CAStor has also helped provide disaster recovery, surviving various mechanical failures and even a flood in the lab where the clusters were temporarily installed while CIDR was expanding its server. "We've had random disk failures, and power failures where all the nodes went down and we had to power it back up," he said. "We never had a problem with that, which is amazing to me."