The growth of unstructured data in traditional data centers will far outstrip the growth of transaction-based data over the next four years, according to research firm IDC.
For some organizations, that wave of unstructured data is already washing ashore. Biotechnology company Complete Genomics plans to release a genome-sequencing program next year that will cost around $5,000 for each genome analyzed. In the course of developing its application, the firm has watched unstructured data in its environment go from 10 TB last spring to more than 700 TB, and it's still purchasing storage at a rate of about 200 TB per quarter, said vice president of software Bruce Martin.
"What's also changing for us is that there's not only more and more data, but that it needs to be processed quicker," he said. "Trying to do the kind of data capture and analysis we're doing at the $5,000 price point, we need an infrastructure that can support sequencing genomes simultaneously."
The company started with a single IQ 9000 cluster from Isilon Systems last fall. It has since added dozens of nodes, most recently Isilon's biggest and fastest X 12000 nodes.
Isilon also supports multiple operating systems and file-sharing protocols. Complete Genomics is using it for CIFS and NFS and has integrated the system with both LDAP and Active Directory for access control. Martin also praised the reliability of the clustered system. "A system this big means you're constantly losing drives; it's common with any file system of this size," he said. "The system behavior around this has been very good."
Unstructured data growth and the use of a clustered system also have him rethinking backup, which, on a system like this, is more than just challenging. "I would even go so far as to say impractical," he said.
The product development team may produce up to 10 TB of new data daily. Some corporate information that's staged to the cluster is sent for backup using NDMP, and the company does use Isilon's snapshots, mostly for test and development work. Otherwise, said Martin, "our [data protection] strategy is to run [the Isilon system] in a highly redundant N+3 configuration."
The cluster routinely exceeds 1 GB/sec throughput, even though the company has yet to fully ramp up its product for launch. "I hope Isilon continues to improve its performance; on paper it does look like I can still get higher bandwidth using something like InfiniBand and Lustre," Martin said. "We're always hungry for more performance and more density."
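A rough calculation suggests why Martin calls conventional backup impractical at this scale. The sketch below uses only the figures quoted above (700 TB stored, up to 10 TB of new data daily, roughly 1 GB/sec of cluster throughput) and rests on two simplifying assumptions that are not from the company: decimal units (1 TB = 1,000 GB), and the cluster's entire throughput being devoted to the copy, which would not be true in production. The function name `copy_seconds` is illustrative.

```python
TB_IN_GB = 1000  # assumption: decimal units, 1 TB = 1,000 GB


def copy_seconds(data_tb: float, throughput_gb_per_sec: float) -> float:
    """Seconds needed to move data_tb terabytes at a sustained rate."""
    if throughput_gb_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return data_tb * TB_IN_GB / throughput_gb_per_sec


# Figures quoted in the article; the throughput is assumed to be fully
# available for the copy, which is an optimistic best case.
full_copy_days = copy_seconds(700, 1.0) / 86_400
daily_delta_hours = copy_seconds(10, 1.0) / 3_600

print(f"Full copy of 700 TB: {full_copy_days:.1f} days")
print(f"One day's 10 TB of new data: {daily_delta_hours:.1f} hours")
```

Even under these best-case assumptions, a single full copy would occupy the cluster for more than a week, and each day's new data would take close to three hours to move.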
Research library uses BlueArc to feed the website beast
H.W. Wilson publishes print reference materials for librarians and researchers, and recently added CD-ROM and web-based publishing of resources for its clients. After the launch of the company's WilsonWeb product last year, the IT department began to evaluate products that would consolidate a mostly DAS environment to improve performance.
The company has about 20 TB of data, but new documents are added daily and file transfer performance for nearly 20 million document images on WilsonWeb was lagging. The IT staff evaluated NetApp's filers, EMC's Celerra NAS boxes and BlueArc's Titan NAS system.
According to vice president of information systems Lu Parziale, the publisher chose BlueArc because of its use of an FPGA (field-programmable gate array) that stores its file system in silicon rather than software, boosting performance.
In December, H.W. Wilson bought a 60 TB Titan system and uses it to feed WilsonWeb. "For new content our customers were noticing a 40-second response time went down to five to 10 seconds," Parziale said. The company is still importing data to the BlueArc system, and, he said, "its caching allows us to do it almost seamlessly."
Centralizing storage has made for easier backups, and the dual heads on the Titan box offer more redundancy than DAS. "We have to be up 24/7 as a web-based business," Parziale said. "We've had one part of the device go down and not even known about it until a day or two later."
Parziale would like to see BlueArc brush up its reporting capabilities by offering customizable reports and more granular information about virtual volumes and how they correspond to file systems. "It would be nice to customize reports for particular layers [of infrastructure]," he said.
The NAS wave is also hitting the military. When the Air Force Center for Engineering and the Environment (AFCEE) began noticing an increase in file sharing among its users two years ago, it moved from Xyratex RAID arrays and direct-attached internal storage spread across 52 servers to an OnStor NAS cluster.
AFCEE has slowly built up the cluster to incorporate OnStor Bobcat NAS gateways and Pantera systems with integrated disk, while its storage has risen from gigabytes to 36 TB. The center is now adding a second system at a secondary site for disaster recovery. "The fact that it's a gateway allows for us to repurpose older storage we still have," said AFCEE network administrator Ralph Miles.
Though OnStor's Bobcat and Pantera are primarily used as clustered NAS systems, Miles is also using the Pantera disk with a software-based iSCSI target to support block storage for SharePoint. "I want it all," he said. "I want the ability to do block transfers for an application while retaining flexible file storage for home directories."
Because the gateway is clustered, can be expanded on the fly and supports different arrays on the back end, it has also helped Miles keep up with ad hoc storage requests. "Sometimes," he said, "there's not time to build out a fixed, hierarchically managed system."