The project, a partnership between Stanford and the search engine giant, will digitize millions of books from the Stanford library for preservation and easier research. According to Keith Johnson, product manager for the Stanford Digital Repository (SDR), the archive currently holds several terabytes (TB) in nearline and tape archive storage, but it is expected to hit hundreds of terabytes in capacity within the year and multiple petabytes (PB) soon after that, as the scanning and formatting of billions of written pages takes place.
"Right now the bottleneck is in the process of preparing the books and digitizing them," Johnson said. He emphasized that although Stanford has already invested in Honeycomb and has taken delivery of a production system, just 1 TB of the first 36 TB has been put into production so far.
"Part of why we chose Sun is definitely because they can be a partner in helping us implement this system and take on this project," Johnson said.
However, according to Johnson, despite the personal connections and the fact that it is still a developing product, Stanford ultimately chose Honeycomb because of the promise of technical features, like programmable storage.
Honeycomb, begun three years ago, was originally developed to be similar to EMC Corp.'s Centera, IBM's DR550 and Hewlett-Packard Co.'s (HP) RISS systems. However, this past May, Sun announced that Honeycomb's story was changing -- now, instead of a "me-too" CAS system, the project had taken on a more ambitious goal of becoming an object-based programmable storage system.
"We did consider other CAS systems, and they could definitely have worked for us," Johnson said. However, given the nature of the library project, Stanford was attracted to the ability to program search engines and other processes within the archive according to its project's custom needs.
For example, Johnson said, Stanford is interested in the ability to create more complex, customizable searches than are available on other prepackaged CAS platforms. "Other CAS platforms are kind of a black box in this way," he said. "They have their own set of prewritten search capabilities. We are going to need to be able to go outside that."
Another example of how programmable storage will benefit the Stanford project, according to Johnson, is the ability to perform a takeoff on disk scrubbing he referred to as "format scrubbing."
"Everybody does disk scrubbing," he said. "What we need to do, though, is be able to look at objects at a higher level, according to format -- what version of a file format are they stored in? Are they still readable? We need to verify that over a long period of time and look at the data as higher level structures and objects instead of bits."
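To make the idea of "format scrubbing" concrete, the following is a minimal Python sketch of what such a pass might look like. It is an illustration only, not Honeycomb's API or Stanford's implementation: the signature table, the function names (`sniff_format`, `pdf_version`, `format_scrub`) and the set of "supported" formats are all hypothetical. The sketch identifies each stored object's format from its leading bytes, reads the version from a PDF header, and flags objects whose format is unknown or no longer on a supported list.

```python
from typing import Dict, Optional, Set

# Hypothetical signature table for illustration; a real archive would
# rely on a much fuller format registry.
MAGIC = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"II*\x00": "tiff",  # little-endian TIFF
    b"MM\x00*": "tiff",  # big-endian TIFF
}


def sniff_format(data: bytes) -> Optional[str]:
    """Return a format name based on the object's leading bytes, or None."""
    for magic, name in MAGIC.items():
        if data.startswith(magic):
            return name
    return None


def pdf_version(data: bytes) -> Optional[str]:
    """Read the version from a PDF header such as b'%PDF-1.4'."""
    if not data.startswith(b"%PDF-"):
        return None
    return data[5:8].decode("ascii", errors="replace")


def format_scrub(objects: Dict[str, bytes], supported: Set[str]) -> Dict[str, str]:
    """Classify each stored object by format rather than by raw bits."""
    report = {}
    for object_id, data in objects.items():
        fmt = sniff_format(data)
        if fmt is None:
            report[object_id] = "unknown format"
        elif fmt not in supported:
            report[object_id] = "at risk: " + fmt
        else:
            version = pdf_version(data) if fmt == "pdf" else None
            report[object_id] = "ok: " + fmt + (" " + version if version else "")
    return report
```

A disk scrub would confirm only that the bytes still match their checksums. A pass like this asks the higher-level question Johnson describes: whether the archive can still identify and read what it is holding.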
Andrew Rothfield, group product marketing manager for strategy and messaging with Sun, also emphasized that the Honeycomb and Google library projects still had a long way to go, and that much of the development of both projects would continue during this implementation.
"We definitely see Stanford as a long-term partner rather than a one-time customer," Rothfield said. Once the first implementation of the library archive is in place, Rothfield said, Sun and the university will both need to work on figuring out how to implement multiple tiers of storage within the archive.
Rothfield also revealed that, over the last three months, Sun has been shipping evaluation versions of the product to "select" independent software vendors (ISVs) and customers who have expressed an interest in Honeycomb. Sun is not ready to announce any other partners yet, he said.
"Honeycomb remains an early-access product in what we call limited revenue release," said Sun public relations spokesperson Michelle Parkinson. Sun is not yet ready to announce or lay out a timeframe for general availability of the product, she said.