Small World Big Data
Published: 03 Apr 2017
The amount of data available to today's enterprise is staggering. Yet the race to collect and mine even more data to gain competitive insight, deeply optimize business processes and better inform strategic decision-making is accelerating. Fueled by these new data-intensive capabilities, traditional enterprise business applications primarily focused on operational transactions are now quickly converging with advanced big data analytics to help organizations grow increasingly (albeit artificially) intelligent.
To help IT keep pace with data-intensive business applications that are now embedding operational analytics, data center infrastructure is also evolving rapidly. In-memory computing, massive server-side flash, software-defined resources and scale-out platforms are a few of the recent growth areas reshaping today's data centers. In particular, we are seeing storage infrastructure, long considered the slow-changing anchor of the data center, transforming faster than ever. You might say that we're seeing smarter storage.
Modern storage products take full advantage of newer silicon technologies, growing smarter with new inherent analytics, embedding hybrid cloud tiering and (often) converging with or hosting core data processing directly. Perhaps the biggest recent change in storage isn't with hardware or algorithms at all, but with how storage can now best be managed.
For a long time, IT shops had no option but to manage storage by deploying and learning a unique storage management tool for each type of vendor product in use. This wasted significant time on implementing, integrating and supporting one-off instances of complex vendor-specific management tools. But as the volume of data about business data (usage, performance, security and so on, see "Benefits of analytical supercharging") grows, simply managing a metrics database becomes a huge challenge as well. Also, with trends like the internet of things baking streaming sensors into everything, key systems metadata is itself becoming much more prolific and real-time.
It can take a significant data science investment to harvest the desired value out of it.
Storage analytics 'call home'
So while I'm all for DIY when it comes to unique integration of analytics with business processes and leveraging APIs to create custom widgets or reports, I've seen too many enterprises develop their own custom in-house storage management tools, only for those tools to eventually become as expensive and onerous to support and keep current as if they had just licensed one of those old-school "Big 4" enterprise management platforms (i.e., BMC, CA, Hewlett Packard Enterprise [HPE] and IBM). In these days of cloud-hosted software as a service (SaaS) business applications, it makes sense that such onerous IT management tasks should be subscribed out to and provided by a remote expert service provider.
Remote storage management on a big scale really started with the augmented vendor support "call home" capability pioneered by NetApp years ago. Log and event files from on-premises arrays are bundled up and sent daily back to the vendor's big data database "in the cloud." Experts then analyze incoming data from all participating customers with big data analysis tools (e.g., Cassandra, HBase and Spark) to learn from their whole pool of end-user deployments.
Benefits of analytical supercharging
Smarter infrastructure with embedded analytical intelligence can help IT do many things better, and in some cases even continue to improve with automated machine learning. Some IT processes already benefitting from analytical supercharging include the following:
- Troubleshooting. Advanced analytics can provide predictive alerting to warn of potential danger in time to avoid it, conduct root cause analyses when something does go wrong to identify the real problem that needs to be addressed and make intelligent recommendations for remediation.
- Resource optimization. By learning what workloads require for good service and how resources are used over time, analytics can help tune and manage resource allocations to both ensure application performance and optimize infrastructure utilization.
- Operations automation. Smarter storage systems can learn (in a number of ways) how to best automate key processes and workflows, and then optimally manage operational tasks at large scale -- effectively taking over many of today's manual DevOps functions.
- Brokerage. Cost control and optimization will become increasingly important and complex as truly agile hybrid computing goes mainstream. Intelligent algorithms will be able to make the best cross-cloud brokering and dynamic deployment decisions.
- Security. Analytical approaches to securing enterprise networks and data are key to processing the massive scale and nonstop stream of global event and log data required today to find and stop malicious intrusion, denial of service and theft of corporate assets.
That way, the array vendor can deliver valuable proactive advice and recommendations based on data any one organization simply couldn't generate on its own. With this SaaS model, IT doesn't have to manage their own historical database, operate a big data analysis platform or find the data science resources to analyze it. And the provider can gain insight into general end-user behavior, study actual feature usage and identify sales and marketing opportunities.
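To make the pooled-data idea concrete, here is a minimal sketch of the kind of fleet-wide comparison a call home service might run. It is an illustration only, not any vendor's actual method: the function name `fleet_outliers`, the latency metric and the z-score threshold are all assumptions.

```python
from statistics import mean, pstdev

def fleet_outliers(latency_ms_by_array, threshold=2.0):
    """Return names of arrays whose latency sits far above the fleet average.

    Hypothetical helper: flags any array whose z-score (distance from the
    fleet mean, in standard deviations) exceeds `threshold`.
    """
    values = list(latency_ms_by_array.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # every array reports the same value, nothing stands out
    return sorted(
        name
        for name, value in latency_ms_by_array.items()
        if (value - mu) / sigma > threshold
    )

# Nine well-behaved arrays and one slow one (made-up numbers).
fleet = {f"array-{i}": 1.0 for i in range(9)}
fleet["array-9"] = 5.0
slow = fleet_outliers(fleet)
```

The point is that the baseline comes from the whole customer pool. A single site with one or two arrays has nothing to compare its numbers against.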
Although it seems every storage vendor today offers call home support, you can differentiate between them. Some look at customer usage data at finer-grained intervals, even approaching real-time stream-based monitoring. Some work hard on improving visualization and reporting. And others intelligently mine collected data to train machine learning models and feed back smarter operational advice to users.
Though HPE recently announced it would acquire Nimble, the latter touts a predictive analytics angle to its InfoSight service. The service aims not just to prevent outages and downtime by automatically resolving what would be considered level-one and level-two support issues, but also to help forecast future capacity and performance through statistical comparison to Nimble's aggregated database.
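As a rough sketch of what capacity forecasting involves at its simplest, the snippet below fits a straight line to daily usage samples and projects when the array fills up. This is a generic least-squares trend, not InfoSight's method, and the names `fit_line` and `days_until_full` are made up for illustration. A real service would also compare against its aggregated fleet data.

```python
from statistics import mean

def fit_line(ys):
    """Least-squares slope and intercept for samples taken on days 0, 1, 2, ..."""
    if len(ys) < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(len(ys))
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    return slope, y_bar - slope * x_bar

def days_until_full(used_tb_per_day, capacity_tb):
    """Project how many days after the last sample usage reaches capacity.

    Returns None when usage is flat or shrinking.
    """
    slope, intercept = fit_line(used_tb_per_day)
    if slope <= 0:
        return None
    full_day = (capacity_tb - intercept) / slope
    return max(0.0, full_day - (len(used_tb_per_day) - 1))

# Usage grows 2 TB a day on a 28 TB array (made-up numbers).
remaining = days_until_full([10, 12, 14, 16, 18], capacity_tb=28)
```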
Managing and mining intensive IT management data isn't the only new challenge facing IT. The larger trend toward convergence, collapsing stacks of formerly siloed IT architecture into more cohesively deployed offerings (e.g., hyper-converged appliances, hybrid cloud platforms, big data clusters), also requires admins to better map actual storage usage and costs -- actually, the total cost of ownership for everything in IT -- to consuming applications. Fortunately, many built-in smarter storage services are emerging that can connect the dots, working with both inherent data and direct application awareness.
A good example of an increasingly intelligent storage management cloud service is Tintri's Predictive Analytics offering for virtual machine (VM)-aware storage. The all-flash array vendor has worked hard to distill a complex set of low-level VM, hypervisor and storage data (from hundreds of thousands of virtual machines) into three key performance indicator-like metrics that readily indicate remaining capacity, performance capability and how much flash would be optimal for the intended workload. Tintri's browser dashboard also offers model-based future trend projections, arbitrary application/user workloads analysis and predictive what-if scenario planning.
Compared to yesteryear's application-blind block storage, today's natively data-aware storage products internally track new metadata about all sorts of aspects of their operation. Storage today might inherently know which application creates, owns and accesses each chunk of stored data; the appropriate levels of security and protection each requires; how to optimally balance application I/O performance (via caching, placement, and so on) with capacity costs (varying compression and dedupe stages); and even which users have accessed, shared and might soon require each bit of data again. Storage platforms may also internally index textual data, analyze stored data for regulatory compliance (or a security breach), translate foreign text, transcode embedded media and even self-learn categorizations of content.
Qumulo offers a call home service for front-line support, but where it really shines is that its storage actually tracks performance metrics for the data it serves. Because of its distributed architecture, Qumulo can efficiently report on historical performance and other key metrics for every file and object stored, which helps immediately spot new usage patterns, abnormal behavior and performance-impacting hotspots, even at the scale of billions of objects.
DataGravity embedded a search engine in its normally passive backup controller to index all stored content in an array, which has proven useful in regulatory and e-discovery use cases. Reduxio, meanwhile, offers essentially infinite versioning and on-demand snapshots back to any past second by keeping metadata about the last modification time for each block in its array -- basically a fine-grained time machine for enterprise block storage that can help defend against malicious viruses.
Many scenarios converge storage and servers closely together. Convergence aims to reduce expensive and slow remote data access by ensuring that compute processes run very local to the data they need to process. This at least was the underlying motivation for both Hadoop Distributed File System big data storage and VMware's hypervisor-integrated vSAN. Full hyper-converged platforms like Nutanix and SimpliVity intimately integrate software-defined storage into an optimized data processing appliance that can further optimize storage operations by leveraging built-in analytics-driven algorithms running on the appliance server resources. Datrium's "open converged" approach splits the difference between performance and capacity optimization, keeping compute-intensive storage-optimizing analytics and flash-consuming hot storage local on each host server, while sharing colder networked capacity for data protection.
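The time-machine idea is easy to picture with a toy model. The sketch below keeps every write to a block along with its timestamp, so a read can ask for the block "as of" any past second. It is only an illustration of timestamped block metadata in general. The class `VersionedBlockStore` and its methods are hypothetical and say nothing about how Reduxio implements this internally.

```python
import bisect
from collections import defaultdict

class VersionedBlockStore:
    """Toy block store that remembers each block's contents at every write."""

    def __init__(self):
        self._times = defaultdict(list)  # block id -> write timestamps, ascending
        self._data = defaultdict(list)   # block id -> payloads, same order

    def write(self, block_id, timestamp, payload):
        # Assumption: writes to a given block arrive in timestamp order.
        self._times[block_id].append(timestamp)
        self._data[block_id].append(payload)

    def read_as_of(self, block_id, timestamp):
        """Return the block's contents as they stood at `timestamp`, or None."""
        times = self._times.get(block_id, [])
        idx = bisect.bisect_right(times, timestamp)
        if idx == 0:
            return None  # the block had not been written yet
        return self._data[block_id][idx - 1]

store = VersionedBlockStore()
store.write(7, 100, b"clean")
store.write(7, 200, b"encrypted-by-malware")
before_attack = store.read_as_of(7, 150)
```

Reading block 7 as of second 150 returns the contents from before the second write, which is the recovery path after an attack has overwritten good data.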
Lambda Lambda Lambda
Storage-hosted lambda functionality, the ability to have events trigger and execute arbitrary programmatic code within the storage system, is an extremely powerful capability. It opens up the storage layer for extension by third-party functionality modules, data-aware management and automation, custom integrations and important application optimization. Depending on the implementation, lambda functions could be triggered when data is written, updated or accessed; when certain users or applications make requests; or even when certain operating conditions occur.
Potential functionality could include the following:
- Advanced storage services. When data is written, lambda functions could kick off inline data encryption, deduplication, compression or replication tasks. They could also implement data management policies for compliance enforcement, data integrity and security.
- Workflow integration. Lambda functions can be used to converge existing data with streaming updates, implement data flow control and execute content transformations like automatic picture or video format conversion, speech-to-text or language translation.
- Application offloading. Application developers can use storage-side function execution to stage, serialize, index or precache important data. This could be an efficient way to train machine learning algorithms or score data with existing models. Lambda functions could be used to offload just about any data-intensive application processing.
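For a feel of how event-triggered functions inside a storage layer might look, here is a minimal in-memory sketch. The `TinyObjectStore` class, its `on` method and the "write" and "read" event names are invented for illustration and do not correspond to any real product's API. Compression stands in for the inline storage services described above.

```python
import zlib
from collections import defaultdict

class TinyObjectStore:
    """Toy object store that runs registered functions on write and read events."""

    def __init__(self):
        self._objects = {}
        self._handlers = defaultdict(list)  # event name -> list of functions

    def on(self, event, handler):
        """Register handler(key, data) -> data for the 'write' or 'read' event."""
        self._handlers[event].append(handler)

    def put(self, key, data):
        # Each write-triggered function may transform the data before it lands.
        for handler in self._handlers["write"]:
            data = handler(key, data)
        self._objects[key] = data

    def get(self, key):
        data = self._objects[key]
        # Read-triggered functions run on the way back out.
        for handler in self._handlers["read"]:
            data = handler(key, data)
        return data

store = TinyObjectStore()
store.on("write", lambda key, data: zlib.compress(data))   # inline compression
store.on("read", lambda key, data: zlib.decompress(data))  # transparent restore

original = b"sensor reading " * 100
store.put("logs/today", original)
round_trip = store.get("logs/today")
```

The client application only calls `put` and `get`. The compression runs next to the data without the application knowing about it, which is the appeal of pushing functions into the storage layer.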
All those architectures, though optimally converged, still maintain traditionally distinct I/O transfers between the storage and client applications. Containerization and microservice approaches are starting to break down this separation of data storage from compute, fostering lambda-functional approaches (aka serverless computing) that can operate on streaming data as it flows past. Despite the recent focus on the Amazon Web Services Lambda cloud service, lambda architectures are an old concept, and can be seen implemented in many enterprise databases as "stored procedures," little bits of remote code that get triggered by events to run directly in the database (rather than in application code). However, newer storage products that support containerized plug-ins right in the storage layer are opening up incredible opportunities for new kinds of highly efficient, scalable and real-time analysis and optimization. And while some storage vendors are writing new storage products to support containerized applications, many of those products are themselves also being written as containerized applications. Minio, for example, is an emerging open source object store that, due to its modern containerized architecture, readily supports tremendous scalability and native lambda functionality. Minio object storage can embed lambda functions for search, in-memory caching, messaging flow, pattern recognition, content transformations and other functionality that would be highly efficient when run directly in the data storage layer.
Full cloud management
Analytics improves, speeds up and smartens IT operations. I have no doubt we will see more storage analytics show up right in the storage layer close to the stored data and in the cloud applied to huge aggregations of storage metadata assembled from many storage systems (as found in call home services).
These worlds actually come together in management as a service (MaaS) offerings in which a cloud-based service provider fully manages and operates on-premises or hybrid infrastructure. For example, HyperGrid (formerly Gridstore) is pioneering a platform service in which you can subscribe (on demand) to hybrid cloud clusters of managed hyper-converged appliances, which can be ordered on-premises. Similarly, Galactic Exchange will remotely operate and manage your big data platforms as a service, while actual cluster compute and data nodes can live on premises or in a cloud.
As an example of MaaS storage, Igneous offers on-premises subscription object storage that it remotely operates and manages as a service. This helps an IT storage group realize cloud economics while retaining actual data storage in the data center. While the MaaS vendor creates and consumes many of the operational analytics, when it comes to data-centered intelligence, there is an opportunity to use APIs to extend MaaS storage platforms directly with cloud services (e.g., lambda processing or machine learning).
Where will data live?
The cloud-centric IT world is inevitable. We will all have to manage hybrid storage that spans from device to data center to global cloud hosting. As our data grows and spreads out, all of our analytics and intelligence will need to scale and follow along.
In the next few years, we'll see an explosion of data along with a tidal wave of new internet of things data sources. Most IT organizations won't be able to survive the predicted onslaught without a lot of help from smarter storage products and expert cloud-based management services.