Small World Big Data
Published: 03 Oct 2016
Not too long ago, storage arrays were holed up deep in the data center and manageable without requiring much knowledge about the data actually stored therein. A storage admin might have known it was database data for a key application requiring high performance and solid backups, for example, but the database administrator took care of all the data-specific details. Today, this artificial wall separating information about data from the storage that holds it is coming down, and rapidly.
Convergence isn't only closing the gaps between silos of infrastructure; it is also collapsing the distance between the job of persistence on the back end in storage and what stored data actually means and is used for on the front end. It is no longer desirable or even sufficient to simply store and protect bit patterns deep in the innards of the data center; you must now manage storage in ways that directly advance business operations.
In fact, it's becoming a competitive necessity to leverage data at every level, or tier, of persistence throughout the data's lifecycle. This is good for IT folks, as new data-aware storage is helping IT come to the forefront of key business processes.
Smart storage systems are powered by a glut of CPU/cores, cheaper flash and memory, agile software-defined storage functions and lessons learned from the big data analytics world. Internally, smarter storage systems can do a better job of optimizing capacity and performance through smart deduplication and compression schemes, application-aligned caching and tiering, and policy-definable quality of service (QoS) and data protection schemes. Externally, smart storage systems can create and serve new kinds of metadata about the data inside, providing for better management and governance, application QoS reporting and alignment, and can even help to create direct business value.
The roots of data awareness
Data-aware storage has its roots in old archival "content-addressable storage" architectures, which were early object-based archives that kept additional metadata (i.e., data about data) in order to exactly manage retention requirements (and possibly help with legal discovery actions). Systems often indexed this metadata and made it accessible outside of the content itself and, eventually, even content was indexed and made searchable for e-discovery processing. However, as appropriate for archival cold storage, this data intelligence was created offline in post-processing and applied only to static archived data sets, and was therefore rarely used.
Ten years ago, the emergence of big data approaches demonstrated that masses of live, unstructured and highly varied data could have tremendous primary business value. Today, the massive web-scale object stores popular for cloud-building and used to power production web and mobile applications often store all kinds of metadata. In fact, these stores support user-defined metadata that developers can arbitrarily extend for advanced application-specific tagging or data labeling. Some advanced file systems directly incorporate content indexing on data ingest to enable end-users to query primary storage for content containing specific words or phrases.
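To make the idea of user-defined metadata concrete, here is a minimal sketch in Python. It is not any vendor's API: `TinyObjectStore`, its `put` and `find` methods, and the tag names are all hypothetical, and the store is just an in-memory dictionary. It only illustrates how application-specific tags stored next to each object let you query the store by meaning rather than by key.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    key: str
    data: bytes
    # Free-form, application-defined tags kept alongside the payload.
    user_metadata: dict[str, str] = field(default_factory=dict)


class TinyObjectStore:
    """Hypothetical in-memory stand-in for an object store (illustration only)."""

    def __init__(self) -> None:
        self._objects: dict[str, StoredObject] = {}

    def put(self, key: str, data: bytes, **user_metadata: str) -> None:
        """Store the payload together with whatever tags the caller supplies."""
        self._objects[key] = StoredObject(key, data, dict(user_metadata))

    def find(self, **wanted: str) -> list[str]:
        """Return the keys of objects whose metadata matches every requested tag."""
        return sorted(
            key
            for key, obj in self._objects.items()
            if all(obj.user_metadata.get(k) == v for k, v in wanted.items())
        )


store = TinyObjectStore()
store.put("scans/0001.png", b"...", department="radiology", retention="7y")
store.put("scans/0002.png", b"...", department="cardiology", retention="7y")
store.put("logs/app.log", b"...", department="it", retention="90d")

radiology_keys = store.find(department="radiology")
long_retention_keys = store.find(retention="7y")
```

A production object store would persist and index these tags at scale; the point here is only that the tags travel with the data and can be extended by developers without changing the store itself.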
As an example of this evolution, consider the difference between two popular online file-sharing services, Dropbox and Evernote. Both can be used to store and sync various files across devices and share them between groups of users. Dropbox was the baseline standard defining online file sharing and collaboration, but Evernote goes much further -- although for a narrower set of use cases -- by becoming innately content-aware with full content search, inline viewers and editors for common file types, extra metadata (e.g., URL source or reference if available, user tagging) and "similar content" recommendations. Although I use both daily, I view Dropbox as just another file-sharing alternative, while Evernote is critical to my workflow.
IT data awareness
Company lawyers (for e-discovery) and detectives (in security) require online systems that proactively identify abnormal behavior to produce early warnings on possible breaches. Smart data-aware storage systems can fold in auditing-type information and help correlate files, data and metadata with patterns of "events" -- such as application crashes, file systems filling up, new users being granted root access and key directories being shared or hidden.
I remember one particularly blatant case of storage misuse (on a DEC VAX!) when we caught someone hoarding huge amounts of NSFW material on a little-accessed file system. Today's more content-aware smart storage systems could alert security about such transgressions and warn creative boundary-pushing users before they cross into job-termination territory, or even prevent them from doing so to begin with.
Benefits of data-aware storage
Fine-grained data protection: Storage that knows, for example, which VM particular files or volumes belong to or -- even better -- the specific policy to enforce for that VM's data can directly ensure appropriate data protection (e.g., the right level of RAID or replication).
Fine-grained QoS: Similarly, storage that knows what database files require which kinds of performance acceleration can directly prioritize I/O and cache resources for optimal application performance.
Content indexing and search: Large stores used for text-based data can deliver extra value by indexing content upon ingestion and enabling built-in admin and (even) end-user search.
Social storage analysis: Storage can track usage and access by users and groups as metadata. Then other users can easily find out who in an organization had recent interest in certain content, identify group collaboration patterns and receive recommendations of new things to research based on collaborative filtering (e.g., "people who like the things I like also liked X").
Active capacity and utilization management: Storage can also track metadata about "per-data" system resource performance, capacity and utilization metrics. This enables storage admins to directly see what is going on in IT infrastructure for any piece or group of data tracked directly back to end users, departments and applications. Smart storage can also help optimize its own configuration and behavioral alignment to workloads.
Analytics and machine learning: As storage grows smarter, expect to see increasing amounts of both low-level compute processing and automated machine learning incorporated directly into the storage layer. Storage-side functions could then be used to automatically categorize, score, translate, transform, visualize and report on data even as it's being created and stored.
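The collaborative filtering mentioned under social storage analysis above ("people who like the things I like also liked X") can be sketched in a few lines of Python. This is a deliberately simple overlap-count heuristic, not how any particular product does it; the `access_log` data, user names and the `recommend` function are hypothetical.

```python
from collections import Counter

# Hypothetical access metadata: which users opened which files.
access_log = {
    "alice": {"q3-forecast.xlsx", "pricing-model.py", "churn-study.pdf"},
    "bob": {"q3-forecast.xlsx", "pricing-model.py", "competitor-notes.docx"},
    "carol": {"holiday-rota.xlsx"},
}


def recommend(user: str, log: dict[str, set[str]], top_n: int = 3) -> list[str]:
    """Suggest files opened by users whose access history overlaps with `user`."""
    seen = log[user]
    scores: Counter[str] = Counter()
    for other, files in log.items():
        if other == user:
            continue
        overlap = len(seen & files)  # shared interest = number of files in common
        if overlap == 0:
            continue  # no common ground, so this user's files carry no signal
        for name in files - seen:
            scores[name] += overlap
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]


suggestions = recommend("alice", access_log)
```

Because Bob shares two files with Alice, the one file he has opened that she has not is suggested to her, while Carol's unrelated file is ignored.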
Outside of governance and protection, large "flat" collections of files tend to hide many interesting kinds of information that could be valuable not just for searching content by keywords, terms or phrases, but also for finding material about related concepts (perhaps through a domain-specific "taxonomy"). For example, users could find documents about tomatoes and cucumbers when searching for vegetables, or be interested in "who" created some piece of data, who else copied it and shared it, and even how many times and for how long they looked at it. They could also find out which group is the biggest user of certain data sets, who collaborated on sets of documents, who else might have something they're interested in, or who has similar interests and so on.
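As a rough illustration of the tomatoes-and-cucumbers example, the Python sketch below indexes text as it is ingested and expands a query through a small taxonomy. Everything here is an assumption for illustration: the `IndexedStore` class, the one-entry `TAXONOMY` and the sample documents are hypothetical, and a real system would more likely lean on a search engine such as Lucene/Solr than on a hand-rolled index.

```python
import re
from collections import defaultdict

# Hypothetical domain taxonomy: broader term -> narrower terms.
TAXONOMY = {"vegetables": {"tomatoes", "cucumbers"}}


class IndexedStore:
    """Toy store that builds an inverted index at ingest time."""

    def __init__(self) -> None:
        self._index: dict[str, set[str]] = defaultdict(set)

    def ingest(self, name: str, text: str) -> None:
        """Index every word as the document is written."""
        for word in re.findall(r"[a-z]+", text.lower()):
            self._index[word].add(name)

    def search(self, term: str) -> list[str]:
        """Match the term itself plus any narrower terms from the taxonomy."""
        term = term.lower()
        terms = {term} | TAXONOMY.get(term, set())
        hits: set[str] = set()
        for t in terms:
            hits |= self._index.get(t, set())
        return sorted(hits)


docs = IndexedStore()
docs.ingest("salad.txt", "Slice the tomatoes and cucumbers thinly.")
docs.ingest("soup.txt", "Roast tomatoes before blending.")
docs.ingest("bread.txt", "Knead the dough for ten minutes.")

vegetable_docs = docs.search("vegetables")
```

Neither matching document contains the word "vegetables"; they are found only because the taxonomy links the broader concept to the terms that were indexed.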
Some data-aware storage also tracks metadata on the usage and quality of its own I/O services at a fine-grained "per-data" level. These smart storage systems can not only become aware of how each piece of data is logically used, but also record I/O access patterns (by users or applications) and performance over time. With time-series metadata about access patterns, delivered performance and required capacity for each piece of data, such smart storage systems could report on, optimize and work towards securing QoS promises at scale, as well as learn how to self-optimize and "drive" themselves.
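A minimal sketch of such per-data time-series metadata might look like the following Python. The `AccessTracker` class, its method names, the file paths and the latency target are all hypothetical; the sketch keeps every sample in memory, which a real array would never do at scale, but it shows how "which data was hot when" and "which data is missing its latency target" fall out of the same records.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IoSample:
    timestamp: float   # seconds on some shared clock
    client: str        # user or application that issued the I/O
    latency_ms: float  # service time observed for this request


class AccessTracker:
    """Toy per-path time series of I/O samples (illustration only)."""

    def __init__(self) -> None:
        self._samples: dict[str, list[IoSample]] = defaultdict(list)

    def record(self, path: str, timestamp: float, client: str, latency_ms: float) -> None:
        self._samples[path].append(IoSample(timestamp, client, latency_ms))

    def hottest(self, start: float, end: float, top_n: int = 3) -> list[tuple[str, int]]:
        """Paths with the most accesses in the half-open window [start, end)."""
        counts: Counter[str] = Counter()
        for path, samples in self._samples.items():
            n = sum(1 for s in samples if start <= s.timestamp < end)
            if n:
                counts[path] = n
        return sorted(counts.items(), key=lambda pc: (-pc[1], pc[0]))[:top_n]

    def over_latency_target(self, target_ms: float) -> list[str]:
        """Paths whose mean observed latency exceeds the target."""
        return sorted(
            path
            for path, samples in self._samples.items()
            if sum(s.latency_ms for s in samples) / len(samples) > target_ms
        )


tracker = AccessTracker()
tracker.record("/db/orders.ibd", 10.0, "app01", 2.0)
tracker.record("/db/orders.ibd", 20.0, "app01", 4.0)
tracker.record("/db/orders.ibd", 70.0, "app02", 3.0)
tracker.record("/home/report.pdf", 15.0, "laptop7", 12.0)

first_minute = tracker.hottest(0, 60)
slow_paths = tracker.over_latency_target(5.0)
```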
A new data-aware age
Today, we are at the beginning of the age of smart data-aware arrays. So while we've long had layered e-discovery (Lucene/Solr is open source for you DIYers), some established distributed storage vendors are now fully integrating search engine capabilities directly in cross-functional stacks. Tarmin GridBank includes a fully distributed metadata service that feeds identity, security and application-storage alignment activities, for example. Hewlett Packard Enterprise has been integrating its IDOL content-leveraging technology directly into storage -- today, attached to StoreAll object stores along with a newer high-speed ingest and search database called Express Query. And, two years ago, DataGravity rolled out a midrange array that automatically indexes content (on the passive side of its dual controller) for built-in text search and discovery of social usage patterns.
There's also Qumulo, a great example of data-aware storage that tracks performance and capacity metrics to users, applications and data objects. This enables Qumulo to apply and enforce data-level QoS policies, and provide great visibility into exactly who and what is using storage in various ways. Qumulo lets admins see what is actively going on in the storage system down to the file level, which makes it easy to see which files and directories are, or have been, hot at different times, and which clients are hitting which sections of the file structure. This is particularly useful as Qumulo can scale to store billions of objects, a point where external management tools would likely fall flat.
Another area of growing data awareness is products that are improving how to best cache, store and virtually present data based on expected usage. For example, Riverbed's SteelFusion knows enough about the data required locally in a remote office/branch office -- to run apps and virtual machines (VMs) -- that it can persist and protect all data in a data center, while projecting what's needed at edge locations (using Riverbed's WAN optimization technologies). The storage intelligence for this kind of edge "virtualization" requires knowledge about data content, data service requirements and the levels of data protection needed.
Storage can also grow more application-aware to work at a higher level externally with storage clients to provide application-accelerating and operational expense-reducing data services. So, instead of serving out LUNs, binary objects or files, storage serves application-level data constructs like VM images or database tables (or "chunks" of database records). Storage settings for data protection, availability and performance could then be managed in application terms.
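Managing storage "in application terms" could look something like the Python sketch below, where policies are attached to VMs rather than to LUNs or volumes and the storage layer resolves each file to its owning VM. The `Policy` fields, the VM names, the paths and the default values are all hypothetical, chosen only to illustrate the lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    replicas: int        # number of copies to keep
    cache_priority: str  # "high", "normal" or "low"


# Fallback for data that no application has claimed.
DEFAULT_POLICY = Policy(replicas=2, cache_priority="normal")

# Policies are expressed per VM, i.e., in application terms.
vm_policies = {
    "sql-prod-01": Policy(replicas=3, cache_priority="high"),
    "build-agent-07": Policy(replicas=1, cache_priority="low"),
}

# Metadata the storage keeps about which VM each file belongs to.
file_owner = {
    "/vols/a/sql-prod-01.vmdk": "sql-prod-01",
    "/vols/a/sql-prod-01-log.vmdk": "sql-prod-01",
    "/vols/b/build-agent-07.vmdk": "build-agent-07",
}


def policy_for(path: str) -> Policy:
    """Resolve a file to its owning VM, then to that VM's policy."""
    vm = file_owner.get(path)
    return vm_policies.get(vm, DEFAULT_POLICY)


database_policy = policy_for("/vols/a/sql-prod-01-log.vmdk")
unknown_policy = policy_for("/vols/c/scratch.tmp")
```

The admin only ever edits `vm_policies`; which files inherit the change is the storage system's problem, not the admin's.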
Tintri pioneered storage that directly serves VMs to the hypervisor and provides storage management in terms of VMs. VMware, meanwhile, has APIs (VAAI, et al.) to help enable this approach more broadly among traditional array vendors, and even offers VSAN software-defined storage that works at the VM level.
Some applications, in the meantime, have become more storage-aware. This is basically a key design principle of Hadoop and big data, which fundamentally converges customized storage at the application level. For example, Hadoop's HDFS works hand-in-hand with the main job scheduling service to send compute jobs to specific "storage" nodes where needed partitions of data are stored.
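The principle of sending compute to the data can be illustrated with a small Python sketch. To be clear, this is not Hadoop's actual scheduler or API: the `assign_tasks` function, the node names and the slot counts are hypothetical, and the greedy first-fit logic is only the simplest thing that shows locality being preferred and remote execution used as a fallback.

```python
def assign_tasks(
    block_locations: dict[str, list[str]],
    free_slots: dict[str, int],
) -> dict[str, tuple[str, bool]]:
    """Map each data block to (node, is_local), preferring nodes holding a replica."""
    slots = dict(free_slots)  # work on a copy so the caller's view is untouched
    plan: dict[str, tuple[str, bool]] = {}
    for block, replicas in block_locations.items():
        # First choice: a node that already stores this block and has capacity.
        local = [node for node in replicas if slots.get(node, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # Fallback: any node with capacity; the data must cross the network.
            remote = sorted(node for node, n in slots.items() if n > 0)
            if not remote:
                raise RuntimeError(f"no free slot for {block}")
            node, is_local = remote[0], False
        slots[node] -= 1
        plan[block] = (node, is_local)
    return plan


plan = assign_tasks(
    block_locations={
        "part-0": ["node1", "node2"],
        "part-1": ["node1", "node3"],
        "part-2": ["node1"],
    },
    free_slots={"node1": 1, "node2": 1, "node3": 1},
)
```

In this example the first two partitions run where their data lives, while the third has to run remotely because the only node holding it is already busy.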
Data intelligence to come
With affordable nonvolatile memory express flash and persistent memory (e.g., MRAM) on the horizon, storage will be brought even closer to compute and will become more data-aware. And, I have no doubt, the impending arrival of the internet of things and its accompanying data explosion will further give rise to highly intelligent converged storage/compute functionality.
Bottom line: Data always has value, but that value has to be mined, leveraged and made accessible. As a result, storage architectures have grown smarter to help recognize the inherent value in all kinds of data. The most competitive organizations will have the smartest storage systems.
About the author:
Mike Matchett is a senior analyst at Taneja Group.