Small World Big Data
Published: 03 Apr 2018
Back in the good old days, we mostly dealt with two storage tiers. We had online, high-performance primary storage directly used by applications and colder secondary storage used to tier less-valuable data out of primary storage. It wasn't that most data lost value on a hard expiration date, but primary storage was pricey enough to constrain capacity, and we needed to make room for newer, more immediately valuable data.
We spent a lot of time trying to intelligently summarize and aggregate aging data to keep some kind of historical information trail online. Still, masses of detailed data were sent off to bed, out of sight and relatively offline. That's all changing as managing unstructured data becomes a bigger concern. New services provide storage for big data analysis of detailed unstructured and machine data, as well as to support web-speed DevOps agility, deliver storage self-service and control IT costs. Fundamentally, these services help storage pros provide and maintain more valuable online access to ever-larger data sets.
Products for managing unstructured data may include copy data management (CDM), global file systems, hybrid cloud architectures, global data protection and big data analytics. These features help keep much, if not all, data available and productive.
Handling the data explosion
We're seeing a lot of high-variety, high-volume and unstructured data. That's pretty much everything other than highly structured database records. The new data explosion includes growing files and file systems, machine-generated data streams, web-scale application exhaust, endless file versioning, finer-grained backups and rollback snapshots to meet lower tolerances for data integrity and business continuity, and vast image and media repositories.
The public cloud is one way to deal with this data explosion, but it's not always the best answer by itself. Elastic cloud storage services are easy to use to deploy large amounts of storage capacity. However, unless you want to create a growing and increasingly expensive cloud data dump, advanced storage management is required for managing unstructured data as well. The underlying theme of many new storage offerings is to extend enterprise-quality IT management and governance across multiple tiers of global storage, including hybrid and public cloud configurations.
If you're architecting a new approach to storage, especially unstructured data storage at a global enterprise scale, here are seven advanced storage capabilities to consider:
- Automated storage tiering. Storage tiering isn't a new concept, but today it works across disparate storage arrays and vendors, often virtualizing in-place storage first. Advanced storage tiering products subsume yesterday's simpler cloud gateways. They learn workload-specific performance needs and implement key quality of service, security and business cost control policies.
Much of what used to make up individual products, such as storage virtualizers, global distributed file systems, bulk data replicators and migrators, and cloud gateways, is converging into single-console unifying storage services. Enmotus and Veritas offer these simple-to-use services. This type of storage tiering enables unified storage infrastructure and provides a core service for many different types of storage management products.
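To make the tiering idea concrete, here is a minimal sketch of an age-based placement policy with a policy override. The tier names, the day thresholds and the `pinned_to_primary` flag are all assumptions chosen for illustration; real products learn workload behavior and weigh far more signals than last-access age.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
HOT_DAYS = 30
WARM_DAYS = 180


@dataclass
class FileStats:
    path: str
    days_since_access: int
    pinned_to_primary: bool = False  # stands in for a QoS or business policy


def choose_tier(stats: FileStats) -> str:
    """Return the target tier for a file under a simple age policy."""
    if stats.pinned_to_primary or stats.days_since_access <= HOT_DAYS:
        return "primary"
    if stats.days_since_access <= WARM_DAYS:
        return "secondary"
    return "cloud-archive"


def plan_moves(files, current_tiers):
    """List (path, from_tier, to_tier) for files whose tier should change."""
    moves = []
    for f in files:
        target = choose_tier(f)
        current = current_tiers.get(f.path, "primary")
        if target != current:
            moves.append((f.path, current, target))
    return moves
```

Separating the decision (`choose_tier`) from the move plan (`plan_moves`) mirrors the point above: the policy can change without touching whatever actually migrates the data.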
How to make unstructured data more useful
- Build and serve out a global namespace to gain a governance control point and widen and simplify access.
- Create an online backup repository and archive to do the following:
  - deliver next-generation object storage APIs;
  - provide end-user file versioning;
  - reduce backup TCO; and
  - support high throughput and colder data reads for big data analytics.
- Support global data content and metadata search facilities.
- Implement analytics tools that provide business app owners with effective cost-improvement options.
- Metadata at scale. There's a growing focus on collecting and using storage metadata -- data about stored data -- when managing unstructured data. By properly aggregating and exploiting metadata at scale, storage vendors can better virtualize storage, optimize services, enforce governance policies and augment end-user analytical efforts.
Metadata concepts are most familiar in an object or file storage context. However, advanced block and virtual machine-level storage services are increasingly using metadata detail to help with tiering for performance. We also see metadata in data protection features. Reduxio's infinite snapshots and immediate recovery based on timestamping changed blocks take advantage of metadata, as do change data capture techniques and N-way replication. When looking at heavily metadata-driven storage, it's important to examine metadata protection schemes and potential bottlenecks. Interestingly, metadata-heavy approaches can improve storage performance because they usually allow for high metadata performance and scalability out of band from data delivery.
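As a rough illustration of recovery based on timestamped changed blocks, the sketch below keeps every version of a block alongside its write time and answers "what did this block look like at time T?" with a binary search over the timestamps. It is not any vendor's implementation; the `BlockHistory` class and its method names are hypothetical, and it assumes writes to each block arrive in increasing timestamp order.

```python
import bisect
from collections import defaultdict


class BlockHistory:
    """Keep every written version of each block, indexed by timestamp."""

    def __init__(self):
        # block_id -> parallel, time-ordered lists of timestamps and payloads
        self._times = defaultdict(list)
        self._data = defaultdict(list)

    def write(self, block_id, timestamp, payload):
        # Assumption: timestamps for a given block only ever increase.
        self._times[block_id].append(timestamp)
        self._data[block_id].append(payload)

    def read_as_of(self, block_id, timestamp):
        """Return the block content as of `timestamp`, or None if unwritten."""
        times = self._times.get(block_id, [])
        idx = bisect.bisect_right(times, timestamp)
        return self._data[block_id][idx - 1] if idx else None
```

Note that the lookup touches only the timestamp index until the final read, which is the sense in which metadata can be served out of band from the data itself.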
- Storage analytics. You can use metadata and other introspective analytics about storage use, gathered across enterprise storage, both for offline analysis and, increasingly, for dynamic optimization. Call-home management is one example of how these analytics are used to better manage storage. Komprise and other vendors use analytics to provide deep, workload-level reporting on storage use and even go beyond that to offer what-if planning before implementing storage hosting changes and storage virtualization. Using this insight to continually evaluate and optimize workload data storage decisions is key to controlling storage costs in the face of tremendous data growth.
Cloud storage service plans are especially competitive, and a wave of storage cost-capacity-performance brokerage is coming, in which IT will be able to use analytics to pit one provider or service against another in increasingly real-time contexts. Although we aren't quite there yet because data still has significant gravity, many of the new products mentioned can make data appear elsewhere in real time, without an application restart, while slowly migrating actual data under the hood to control costs.
- Capacity optimization. Data deduplication, compression and thin provisioning help to optimize capacity at the array level, of course. But limiting the number of copies of a data set floating around an enterprise can also reduce the management headaches and costs of dealing with massive unstructured data.
Companies such as Actifio and Delphix have done well with CDM that delivers a virtual copy, or clone, of data while protecting it with optimal change data protection schemes. Instead of, say, 15 copies of important data stored in various places across an enterprise, CDM deduplicates storage into one master copy and provides instant access to current virtual copies on demand.
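The one-master, many-virtual-copies idea can be sketched with a simple copy-on-write clone. This is only an illustration of the general technique, not how any CDM product is built; `MasterCopy`, `VirtualCopy` and the block-dictionary layout are hypothetical.

```python
class MasterCopy:
    """The single stored copy of a data set, held as block_id -> bytes."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)


class VirtualCopy:
    """A clone that shares the master's blocks until it overwrites one."""

    def __init__(self, master):
        self._master = master
        self._changed = {}  # only the blocks this clone has overwritten

    def read(self, block_id):
        if block_id in self._changed:
            return self._changed[block_id]
        return self._master.blocks[block_id]

    def write(self, block_id, payload):
        # Copy-on-write: the master and other clones never see this change.
        self._changed[block_id] = payload

    def private_bytes(self):
        """Storage this clone consumes beyond the shared master copy."""
        return sum(len(p) for p in self._changed.values())
```

Creating 15 `VirtualCopy` objects over one `MasterCopy` costs almost nothing up front; each clone consumes extra capacity only for the blocks it changes.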
- Smart data protection. Smart data protection vendors, such as Commvault, Rubrik, Strongbox Data Solutions and Veritas, deliver scalable, capacity-optimized backup storage. These vendors' products often use deep metadata and clever CDM-like techniques to provide instant data cloning and global recovery.
On a related note, storage archives have mostly become active archives. The active part means the new class of archival storage, based on web-scale object storage technology, keeps all data at hand. While perhaps not yet suitable for relational database management system style I/O, modern object storage can deliver massive I/O read throughput well-suited for information search and retrieval, file versioning, online data recovery and analytical processing.
Cohesity, Igneous Systems and others copy primary NAS data into a web-scale, object-based active archive. Doing that instead of just moving the data provides instant and robust backup data protection with immediate online recovery and restore at a granular level, if desired. You can also get the same primary storage file data in object form for things such as big data analytics, off-site replication and other tasks that might otherwise interfere with primary storage performance.
- Policy- and rules-based management. Increasing IT automation is key across the board, especially for scaling storage management and governance. Policy- and rules-based storage engines -- such as the open source iRODS, Starfish Storage and DataFrameworks' ClarityNow -- can ensure that access and compliance requirements are enforced. They can also implement other lifecycle management processes such as retention, aging, indexing, tracking provenance and checking data integrity. Rules engines can help with large-scale data ingest, drive complex background replication and tiering tasks, and even implement in-storage analytical functions. They're often used on data ingest to extract metadata and index data content to build a global data search function and trigger other storage services.
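A toy version of a rules engine running at ingest might look like the sketch below: each rule pairs a condition with an action, and ingest extracts metadata first so later rules can act on it. It does not use the rule language of iRODS or any other product; the `Record` type, the rule list and the action labels such as `tier:cold` are made up for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Record:
    name: str
    content: bytes
    metadata: dict = field(default_factory=dict)
    actions: list = field(default_factory=list)  # follow-up services to trigger


def extract_metadata(record):
    """Record size and a content checksum for later integrity checks."""
    record.metadata["size"] = len(record.content)
    record.metadata["sha256"] = hashlib.sha256(record.content).hexdigest()


# Each rule is (condition, action); both take a Record. Order matters:
# metadata extraction runs first so that later conditions can use it.
RULES = [
    (lambda r: True, extract_metadata),
    (lambda r: r.name.endswith(".log"), lambda r: r.actions.append("tier:cold")),
    (lambda r: r.metadata.get("size", 0) > 1_000_000,
     lambda r: r.actions.append("replicate")),
]


def ingest(record, rules=RULES):
    """Apply every matching rule to the record, in order, and return it."""
    for condition, action in rules:
        if condition(record):
            action(record)
    return record
```

For example, `ingest(Record("app.log", b"hello"))` returns a record whose metadata holds the size and checksum and whose `actions` list contains only `tier:cold`.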
- User services and usage oversight. End users increasingly expect enterprise storage services to work like personal storage services. For files, users want automatic versioning and self-service recovery, while block and object users want cloud-like elasticity on demand. Everyone wants immediate provisioning, direct cost visibility and guaranteed minimum service levels for performance, resiliency and availability. Many storage products provide this cloud-like experience to end users on behalf of IT, abstracting away the gritty details of the enterprise infrastructure.
IT must ensure data protection in terms of backup, disaster recovery and integrity, while also providing end-to-end security. Many storage products will layer on end-to-end data encryption, provide in-place data masking based on access policies, and track full audit trails of access and processing. With near-infinite elastic -- and costly -- capacity available, enforcing old-school quotas and file blocking, the way NTP Software's QFS does, can be more important now than ever.
How to improve governance and control of unstructured data
- Copy data management reduces copies and secures access to master data.
- Automated storage tiering moves colder data to cheaper storage, including clouds, without affecting running applications.
- Active backup and archive object storage reduces separate backup processing.
Get ready for super storage
Many storage vendors mentioned in this story span multiple categories of unstructured data management. You need good metadata to optimize or automate capacity and control costs, and to virtualize data across the enterprise so you can provide global end-user features and assure IT governance. Also, the capabilities on this list are often used together to ensure wider governance operations, such as enforcing data regionalization and globally scrubbing personally identifiable information by default.
Overall, many of these new data storage capabilities are focused on managing unstructured data sets, particularly large ones, in secondary storage. They're already letting us collect and exploit far more valuable information than ever before. These capabilities will converge to the point where we will see super storage emerge that combines all of what primary and secondary storage tools independently offer today.