AI and machine learning are positioned to become two of the most important tools to help businesses create competitive advantages using their core digital assets. But before buying AI data storage, an organization must consider a range of requirements based on how data is acquired, processed and retained by machine learning platforms.
Let's first examine the lifecycle of data used by machine learning software, as this helps businesses understand what to consider when selecting storage for AI. Initially, an organization must acquire data to train machine learning or AI algorithms. These are software tools that process data to learn a task, such as identifying objects, processing video and tracking movements. Data can come from a variety of sources and is typically unstructured, held as objects or files.
The training process takes data assets and uses machine learning or AI software to create algorithms for processing future data sources. In training or developing an algorithm, AI software processes source data to develop a model that delivers the insight or benefit the business wants to exploit.
Developing machine learning algorithms is rarely a one-time process. As businesses accumulate new data, algorithms are refined and improved. This means little data is thrown away; instead, the data set grows and is reprocessed over time.
Criteria for buying AI data storage
Before an organization selects storage for an AI platform, it must first consider the following:
1. Cost. The price of AI data storage is a critical factor for businesses. Obviously, the C-suite and those involved in purchasing decisions will want storage to be as cost-effective as possible, and in many instances, that will affect an organization's product choice and strategy.
2. Scalability. I've already highlighted the need to collect, store and process large volumes of data to create machine learning or AI models. Machine learning algorithms often require exponential increases in source data to realize only linear improvements in accuracy. Creating reliable and accurate machine learning models can require hundreds of terabytes or even petabytes of data, and this will only increase over time.
Building petabyte-scale storage systems typically means using object stores or scale-out file systems. Modern object stores can certainly address the capacity requirements of AI workloads, but they may not be able to keep up with other criteria, such as high performance. Scale-out file systems can offer high performance and good scalability, but storing entire data sets on a single platform can be expensive. Block storage isn't typically the right option for machine learning or AI, because of the scalability requirements and the cost of high-capacity products. The only exception here is in a public cloud, which is discussed later.
Variations in storage costs introduce the idea of tiering or using multiple types of storage to store data. For example, an object store is a good target for storing large volumes of inactive AI data. When data is needed for processing, it can be moved to a high-performance file-storage cluster or nodes within an object store that are designed for high performance, and the data can be moved back once processing is completed.
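To make the tiering argument concrete, here is a minimal sketch that compares keeping an entire data set on a high-performance tier against keeping only the active portion there. The function name, the prices and the 10% active fraction are all hypothetical, chosen only for illustration; the model also ignores the cost and time of moving data between tiers.

```python
def monthly_storage_cost(total_tb, active_fraction,
                         fast_price_per_tb, capacity_price_per_tb):
    """Return (single_tier_cost, tiered_cost) for one month.

    All prices are hypothetical inputs in the same currency per TB per month.
    The model ignores any cost of moving data between tiers.
    """
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be between 0 and 1")

    active_tb = total_tb * active_fraction
    inactive_tb = total_tb - active_tb

    # Everything on the high-performance tier.
    single_tier = total_tb * fast_price_per_tb
    # Only active data on the fast tier, the rest on the capacity tier.
    tiered = (active_tb * fast_price_per_tb
              + inactive_tb * capacity_price_per_tb)
    return single_tier, tiered


# Hypothetical example: a 2 PB data set with 10% being processed at a time.
single, tiered = monthly_storage_cost(
    total_tb=2000, active_fraction=0.10,
    fast_price_per_tb=50.0, capacity_price_per_tb=10.0)
print(f"single tier: {single:,.0f}  tiered: {tiered:,.0f}")
```

With real vendor quotes substituted for the placeholder prices, the same arithmetic gives a first estimate of whether a second tier is worth the added complexity.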
3. Performance. There are three aspects to storage performance for AI data. First, and possibly most important, is latency. This defines how quickly each I/O request that the software makes is processed. Low latency is important because it has a direct effect on how long it takes to create machine learning or AI models. Complex model development may take weeks or months to run. By shortening this development cycle, organizations can create and refine models much faster. When comparing latency figures, note that object stores quote time to first byte, rather than the latency of an individual I/O request, due to the streaming nature of object access.
Another aspect of performance is throughput, or how quickly data can be written to or read from a storage platform. System throughput is important because AI training processes huge data sets, often repeatedly reading and rereading the same data to accurately develop a model. Sources of machine learning and AI data, such as sensors on autonomous vehicles, can generate multiple terabytes of new data each day. All of this information must be added to an existing data store with minimal impact on any existing processing.
The final aspect of performance is parallel access. Machine learning and AI algorithms process data in parallel, running many tasks that can each read the same data multiple times. Object stores are good at parallel read I/O processing because there is no object locking and there are no attributes to manage. File servers track open I/O requests, or file handles, in memory, so the number of active I/O requests depends on the memory available on the platform.
Machine learning data can consist of large numbers of small files. This is an area where file servers can deliver better performance than object storage. A key question to ask AI storage vendors is how the performance characteristics of their products change between large and small files.
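The three performance aspects above can be turned into rough sizing numbers before talking to vendors. The sketch below is back-of-the-envelope arithmetic only: the function names and example figures are hypothetical, units are decimal (1 TB = 1,000,000 MB), and the small-file estimate assumes each read costs one full round trip with a fixed number of requests kept in flight.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_ingest_mb_per_s(tb_per_day):
    """Average write rate (MB/s) needed to absorb tb_per_day of new data."""
    return tb_per_day * 1_000_000 / SECONDS_PER_DAY


def full_pass_hours(dataset_tb, read_gb_per_s):
    """Hours needed to read the whole data set once at a sustained rate."""
    seconds = dataset_tb * 1_000 / read_gb_per_s
    return seconds / 3600


def small_file_reads_per_s(latency_ms, parallel_requests):
    """Upper bound on small-file reads per second.

    Assumes every read takes latency_ms and that parallel_requests
    reads are always in flight (concurrency divided by latency).
    """
    return parallel_requests * 1000.0 / latency_ms


# Hypothetical workload: 5 TB of new sensor data per day, a 500 TB
# training set read at 10 GB/s, and 64 parallel readers.
print(f"ingest: {sustained_ingest_mb_per_s(5):.1f} MB/s sustained")
print(f"one pass over the data: {full_pass_hours(500, 10):.1f} hours")
for latency in (1.0, 0.25):
    rate = small_file_reads_per_s(latency, parallel_requests=64)
    print(f"{latency} ms latency: up to {rate:,.0f} small reads/s")
```

The last loop shows why latency matters so much for small files: under these assumptions, cutting latency to a quarter allows four times as many reads per second for the same degree of parallelism.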
4. Availability and durability. Machine learning and AI models can run continuously for a long time. Developing algorithms through training can take days or weeks. Storage systems must be up and continuously available during that time. This means any upgrades, technology replacements or expansion of systems need to occur without downtime.
In large-scale systems, component failure is normal and must be handled as such. This means any platform used for AI work should be capable of recovering from device failure, such as the loss of an HDD or SSD, as well as node or server failure. Object stores use erasure coding to distribute data widely across many nodes and minimize the impact of component failure. Scale-out file systems can use erasure coding techniques to provide equivalent levels of resiliency. The efficiency of erasure coding schemes is important because it relates directly to the performance of read and write I/O, especially with small files.
As most large-scale object stores are too large to regularly back up, reliable erasure coding becomes an essential feature of AI storage platforms.
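As a rough illustration of what erasure coding efficiency means for capacity, the sketch below assumes a generic scheme that splits each object into k data fragments and adds m parity fragments, with any k fragments being enough to rebuild the object. The function name and the 10+4 layout are hypothetical examples, not a description of any particular product.

```python
def erasure_coding_profile(data_fragments, parity_fragments):
    """Summarize a k + m erasure coding layout.

    Assumes any data_fragments of the k + m fragments can rebuild the
    object, so up to parity_fragments fragments can be lost.
    """
    if data_fragments < 1 or parity_fragments < 0:
        raise ValueError("need at least one data fragment and m >= 0")

    total = data_fragments + parity_fragments
    return {
        # Raw capacity consumed per unit of usable data.
        "raw_per_usable": total / data_fragments,
        # Share of raw capacity that holds usable data.
        "usable_fraction": data_fragments / total,
        # Fragment (device or node) losses survivable without data loss.
        "failures_tolerated": parity_fragments,
    }


# Hypothetical comparison: three-way replication expressed as 1 + 2,
# against a wider 10 + 4 erasure coding layout.
for name, (k, m) in {"3x replication": (1, 2), "10+4 EC": (10, 4)}.items():
    p = erasure_coding_profile(k, m)
    print(f"{name}: {p['raw_per_usable']:.2f}x raw capacity, "
          f"survives {p['failures_tolerated']} failures")
```

Under this simple model, the wider layout survives more failures while consuming less than half the raw capacity of replication. The trade-off is that, in schemes that encode each object separately, every write produces k + m fragments, which is one reason small-file performance deserves attention.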
5. Public cloud. Developing machine learning and AI algorithms requires both high-performing storage and high-performance compute. Many AI systems, such as Nvidia DGX, are based on GPUs that offload many of the complex mathematical calculations involved in developing accurate algorithms.
Public cloud service providers have started to offer GPU-accelerated virtual instances that can be used for machine learning. Running machine learning tools in the public cloud reduces the capital cost of building infrastructure for machine learning development, while offering the ability to scale infrastructure needed to develop machine learning models.
The challenge in using public cloud compute is how to get data into public clouds in a cost-effective and practical manner. Cloud-based object stores are often too slow to keep up with the I/O demands of machine learning; in that case, local block storage must be used. Every minute of delay moving data represents a cost for running the infrastructure, plus a delay in performing machine learning.
Another issue with public clouds is the cost of data egress. Although cloud service providers don't charge to move data into their platforms, they do charge for data transferred out of their platforms over the public network. As a result, although public clouds offer flexibility in compute, getting data in and out of the cloud in a timely and cost-effective fashion isn't always straightforward.
Vendors are developing versions of their storage products that run in the public cloud and span on-premises and cloud environments. These products can efficiently replicate or move data into the cloud and move only the results back once processing is complete. These replication techniques are bandwidth-efficient, making it practical to store data on premises and import it to the cloud for analytics work.
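The data movement and egress concerns in this section can be estimated with simple arithmetic. The sketch below is illustrative only: the function names, link speed, efficiency factor and all prices are hypothetical placeholders rather than any provider's actual rates, and the idle-compute figure assumes the GPU instances are already running while the data is copied.

```python
def transfer_hours(dataset_tb, link_gbit_per_s, efficiency=0.8):
    """Hours to move dataset_tb (decimal TB) over a network link.

    efficiency is an assumed fraction of the nominal link speed that
    is actually achieved; 0.8 is a placeholder, not a measured value.
    """
    bits = dataset_tb * 8e12
    seconds = bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / 3600


def egress_cost(egress_gb, price_per_gb):
    """Charge for moving egress_gb out of the cloud at a flat rate."""
    return egress_gb * price_per_gb


def idle_compute_cost(wait_hours, instance_price_per_hour, instances=1):
    """Cost of instances that sit running while data is still in transit."""
    return wait_hours * instance_price_per_hour * instances


# Hypothetical scenario: 100 TB copied over a 10 Gbit/s link, four GPU
# instances waiting, and 1 TB of results moved back on premises.
hours = transfer_hours(100, link_gbit_per_s=10)
print(f"upload time: {hours:.1f} hours")
print(f"idle compute: {idle_compute_cost(hours, 3.0, instances=4):,.2f}")
print(f"egress for results: {egress_cost(1000, price_per_gb=0.05):,.2f}")
```

Moving only the results back, as the products described above do, keeps the egress line small; the dominant cost in this scenario is the time spent getting the source data into the cloud.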
6. Integration. Throughout this article, we've looked at the storage aspect of machine learning and AI in isolation from compute. Building AI data storage can be hard, as additional factors must be considered for storage networking and tuning storage to work with machine learning applications.
As I've written in regard to converged infrastructure, prepackaging products enables vendors to test and optimize their offerings before shipping them to the customer. There are storage products today that combine popular AI software, compute such as general-purpose CPUs and GPUs, networking and storage to deliver an AI-ready platform. Much of the detailed tuning work is done before these systems are deployed. Although cost may be an issue, for many customers, a prepackaged system could reduce barriers to adoption for AI storage.
Clearly, picking the right AI data storage platform is a balance of metrics, such as performance, scalability and cost. Getting the storage platform right is essential because the volumes of data involved are significant. Picking the wrong product can be a costly mistake. As with any storage product decision, it's important to talk with vendors to understand exactly how their products fit the needs of AI and machine learning. This engagement process should include demonstrations and evaluations as a prelude to any possible purchasing decision.