IBM researchers are developing a cognitive storage system designed to automatically differentiate high- and low-value data and determine what information to keep, where to store it and how long to retain it.
Zurich-based IBM Research scientists Giovanni Cherubini, Jens Jelitto, and Vinodh Venkatesan introduced the concept of cognitive storage in a paper recently published in the IEEE’s Computer journal. The researchers see cognitive storage as a way to reduce the cost of storing big data.
The IBM Research team drew inspiration for cognitive storage from its collaborative work with the Netherlands Institute for Radio Astronomy (known as ASTRON) on a global project to build a new class of ultra-sensitive radio telescopes, called the Square Kilometre Array (SKA).
The SKA won’t be operational for at least five years. Once active, the system will generate petabytes of data on a daily basis through the collection of radio waves from the Big Bang more than 13 billion years ago, according to IBM. The system could reap significant storage savings if it could filter out useless instrument noise and other irrelevant data.
“Can we not teach computers what is important and what is not to the users of the system, so that it automatically learns to classify the data and uses this classification to optimize storage?” Venkatesan said.
Cherubini said the cognitive system draws a distinction between data value and data popularity. Data value is based on classes defined by random variables supplied by users, and it can vary over time, he said. Popularity reflects how frequently the data is accessed.
“We like to keep these two aspects separate, and they are both important. They both play a role in which tier we store the data and with how much redundancy,” Cherubini said.
The cognitive storage system consists of computing/analytics units responsible for real-time filtering and classification operations and a multi-tier storage unit that handles tasks such as data protection levels and redundancy.
Venkatesan said the analytics engine adds metadata and identifies the features necessary to classify a piece of information as important or unimportant. He said the system would learn user preferences and patterns and have the sophistication to detect context. In addition to real-time processing, the system has off-line processing units to monitor and reassess the relevance of the data over time and perform deeper analysis.
The information goes from the learning system into a “selector” to determine the storage device and redundancy level based on factors such as the relevance class, frequency of data access and historical treatment of other data of the same class, according to Venkatesan. The cognitive system would have different types of storage, such as flash and tape, to keep the data.
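To make the selector's role concrete, here is a minimal Python sketch of a placement decision. It is not IBM's implementation: the names `Placement` and `select_placement`, the access threshold, and the replica counts are all hypothetical, and the "historical treatment" factor Venkatesan mentions is left out. As a simplification, popularity alone picks the tier and value alone picks the redundancy, whereas the researchers say both factors play a role in each choice.

```python
from dataclasses import dataclass

# Hypothetical threshold, chosen only for illustration.
HOT_ACCESSES_PER_DAY = 10.0


@dataclass(frozen=True)
class Placement:
    tier: str      # "flash" or "tape", the two media named in the article
    replicas: int  # number of copies kept, a stand-in for "redundancy level"


def select_placement(relevance_class: int, accesses_per_day: float) -> Placement:
    """Choose a tier from popularity and a redundancy level from value.

    relevance_class: 0 (least important) to 2 (most important); assumed scale.
    accesses_per_day: observed access frequency for the data item.
    """
    if relevance_class not in (0, 1, 2):
        raise ValueError("relevance_class must be 0, 1 or 2")
    if accesses_per_day < 0:
        raise ValueError("accesses_per_day must be non-negative")

    # Popularity drives the tier: frequently read data goes to faster media.
    tier = "flash" if accesses_per_day >= HOT_ACCESSES_PER_DAY else "tape"
    # Value drives redundancy: more important data gets more copies.
    replicas = 1 + relevance_class
    return Placement(tier=tier, replicas=replicas)
```

Under these assumptions, a rarely read but highly valued file would land on tape with three copies, which illustrates why the researchers keep value and popularity as separate inputs.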
IBM researchers tested the cognitive storage system on 1.77 million files spanning seven users. They split the server data by user and let each one define different classes of files considered important. They categorized the data into three classes based on metadata such as user ID, group ID, file size, file permissions, file creation time/date, file extension and directories.
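The metadata fields listed above can all be read from an ordinary file system. The following Python sketch shows one way to collect them into a feature record that a classifier could consume. The function name `file_features` and the dictionary keys are assumptions made for illustration, and the article does not describe how the IBM team actually extracted the metadata.

```python
import os
import stat
from pathlib import Path


def file_features(path: str) -> dict:
    """Collect the per-file metadata fields named in the article."""
    st = os.stat(path)
    p = Path(path)
    # True creation time is only exposed on some platforms (st_birthtime);
    # fall back to the last-modified time elsewhere. This is an assumption.
    created = getattr(st, "st_birthtime", st.st_mtime)
    return {
        "user_id": st.st_uid,
        "group_id": st.st_gid,
        "size_bytes": st.st_size,
        "permissions": stat.filemode(st.st_mode),  # e.g. "-rw-r--r--"
        "created": created,
        "extension": p.suffix.lower(),
        "directory": str(p.parent),
    }
```

Each record could then be paired with a user-supplied importance label to train a classifier. The choice of classifier is left open here.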
Cherubini said the IBM Research team developed software for the initial testing using the information bottleneck algorithm. The team is currently building the predictive caching element, “the first building block” for the cognitive system, which he said should be ready for beta testing by year’s end.
“Beyond that, it’s harder to make predictions,” Cherubini said. “If everything goes well, I think we should be able to have the full system developed at least for the first beta tests within two years.”
IBM researchers said early testing showed good accuracy in predicting data value on the contained data set. But additional research is necessary to address challenges such as identifying standard principles to define data value and assessing the value of encrypted data.
Although the cognitive storage system is designed to classify and manage enormous amounts of data, the researchers said the benefits could extend to IT organizations. Venkatesan said the potential exists for a service-based offering.
“We think that this has a lot of potential for application in enterprises because that’s where the value of data becomes of highest importance,” Cherubini said.
The IBM Research team is looking for additional organizations to share data and ideas and collaborate on the cognitive storage system.