NVMe enables storage infrastructure to take full advantage of flash-based storage by improving the physical interface, as well as increasing command counts and queue depth. But NVMe also creates a challenge: NVMe is so latency-efficient that it exposes weaknesses in other components of the storage infrastructure. Any weak link in the infrastructure increases latency and reduces the value of NVMe.
One of the more problematic links in storage infrastructure is the file system. It's time for vendors to rethink file system structure. In particular, they must revise how their file systems interact with NVMe-enabled storage in order to avoid being the primary bottleneck.
Why do file systems matter?
File systems that service AI and high-velocity workloads are typically scale-out. A scale-out file system is made up of multiple storage servers, or nodes. The file system aggregates the internal storage in each of these nodes, presenting it as a single storage pool that users and applications can access. Traditional file systems can also scale out, but they are serial, meaning all I/O goes through a primary node. AI and high-velocity workloads can easily overwhelm that node, creating a bottleneck. These workloads count on a parallel file system structure that enables any node in the cluster to service I/O to the user or application, which makes network efficiency even more important.
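The difference between the serial and parallel layouts can be illustrated with a toy model. This is only a sketch: the node names, the request count and the round-robin distribution are assumptions made for illustration, not a description of how any particular file system routes I/O.

```python
from collections import Counter

def serial_route(requests, nodes):
    """Serial layout: every request passes through the primary node (nodes[0])."""
    return Counter({nodes[0]: len(requests)})

def parallel_route(requests, nodes):
    """Parallel layout: any node may service a request.
    Round-robin is an assumed, simplified distribution policy."""
    load = Counter()
    for i, _ in enumerate(requests):
        load[nodes[i % len(nodes)]] += 1
    return load

# Hypothetical four-node cluster and 1,000 I/O requests.
nodes = ["node-a", "node-b", "node-c", "node-d"]
requests = list(range(1000))

serial_load = serial_route(requests, nodes)
parallel_load = parallel_route(requests, nodes)

# The busiest node handles everything in the serial case,
# but only a quarter of the requests in the parallel case.
print("busiest node, serial:  ", max(serial_load.values()))
print("busiest node, parallel:", max(parallel_load.values()))
```

In this model the busiest node's load falls in proportion to the node count, which is why the parallel layout shifts the pressure from a single server to the network that ties the nodes together.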
Most NVMe storage systems are designed for block storage. As a result, they circumvent the performance overhead of the file system structure. In most cases, however, a file system is added to the block storage system so these AI and high-velocity workloads can use it. Most modern applications -- especially those for AI, machine learning and big data analytics -- count on a file system.
A well-designed block-based NVMe storage system with a file system added to it will still likely be faster than a block-based SAS storage system, but the performance drop between raw block storage and file system-controlled storage is significant. Organizations need file systems optimized for NVMe.
What to look for in benchmarks
There are several file system benchmarking tests vendors use to demonstrate their capabilities. Most of these tests use NVMe block storage with a parallel file system, such as IBM's Spectrum Scale file system. Vendors are free to use various configurations to claw their way to the top of the chart. This can be misleading.
For example, in the current Standard Performance Evaluation Corp. SFS 2014 benchmark, the top vendors vary significantly on the number of drives, types of drives and number of storage nodes in the test environment. In most cases, hardware vendors are trying to mitigate file system structure overhead by using more hardware than should be required and driving up the price beyond what's reasonable for most organizations.
What really matters is how well the hardware and file system will perform with an organization's workload types and budget. Most companies don't have unlimited funds to create the perfect NVMe-file system combination. IT professionals should look for the simplest configuration possible that achieves the results they need.
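One way to read benchmark results with this in mind is to normalize each published score by the hardware behind it and then pick the least expensive configuration that meets the target. The sketch below assumes hypothetical configurations; the names, scores, drive counts and prices are invented for illustration and do not come from any published result.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkEntry:
    name: str
    score: float       # benchmark score, in whatever unit the test reports
    drives: int        # number of drives in the tested configuration
    nodes: int         # number of storage nodes in the tested configuration
    price_usd: float   # price of the tested configuration

    def score_per_drive(self) -> float:
        return self.score / self.drives

    def score_per_1k_usd(self) -> float:
        return self.score / (self.price_usd / 1000)

def cheapest_meeting_target(entries, target_score):
    """Return the lowest-priced entry whose score meets the target, or None."""
    qualifying = [e for e in entries if e.score >= target_score]
    return min(qualifying, key=lambda e: e.price_usd) if qualifying else None

# Hypothetical entries: illustrative numbers only.
entries = [
    BenchmarkEntry("config-a", score=1000, drives=200, nodes=20, price_usd=2_000_000),
    BenchmarkEntry("config-b", score=600, drives=48, nodes=4, price_usd=400_000),
]

for e in entries:
    print(e.name, e.score_per_drive(), e.score_per_1k_usd())

best = cheapest_meeting_target(entries, target_score=500)
print("simplest configuration meeting the target:", best.name if best else None)
```

In this made-up comparison, the configuration at the top of the chart is not the one an organization with a target score of 500 would choose, because the smaller configuration meets the target at a fraction of the hardware.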
What to look for in a file system
There are primarily three limiting factors in file system performance:
- how efficiently the file system communicates with the storage node;
- how efficiently the file system manages the network that connects the various storage nodes and how efficiently it communicates with the clients; and
- how efficiently the file system manages metadata access.
In most modern application environments, metadata accounts for more than 80% of all I/O.
File systems usually communicate with storage media via the operating system I/O stack. Most advanced file systems are based on Linux and communicate through that stack. But the Linux stack adds overhead. An alternative is for the file system to create its own I/O channel to the NVMe-based drives. Communicating directly with the drives is more difficult from a file system development standpoint, but it gives file system users the best chance to derive maximum performance without having to overcompensate with expensive hardware.
File systems typically communicate with clients by using standard NFS protocols. But NVMe has a networking variant, NVMe over Fabrics (NVMe-oF). Modern file systems should provide client-side software that enables parallel, native NVMe-oF access. NVMe-oF could also be used to interconnect the various storage nodes. The result for the file system customer is the ease of access that file systems provide at direct-attached storage latencies.
In an all-NVMe file system structure, metadata access is by its very nature fast, but the way the metadata is laid out must be efficient so as to benefit from NVMe's low latencies. Optimizing metadata performance means striping it across all the nodes in the file system cluster so no single node bottlenecks performance.
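As a rough illustration of striping metadata across nodes, the sketch below hashes each file path to choose the node that owns its metadata. The node names and paths are hypothetical, and simple hash-modulo placement is an assumption chosen for brevity; it is not how any specific file system is known to lay out metadata.

```python
import hashlib
from collections import Counter

def metadata_node(path: str, nodes: list[str]) -> str:
    """Pick the node that owns a path's metadata by hashing the path.
    Hash-modulo placement is a simplifying assumption."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]

# Hypothetical cluster and file paths.
nodes = ["node-a", "node-b", "node-c", "node-d"]
paths = [f"/data/run{i // 100}/file{i}.bin" for i in range(10_000)]

placement = Counter(metadata_node(p, nodes) for p in paths)
for node in nodes:
    print(node, placement[node])
```

Because the hash spreads paths roughly evenly, no single node ends up holding most of the metadata. One trade-off of plain modulo placement is that changing the node count remaps most entries, which is why schemes such as consistent hashing are often preferred when clusters grow or shrink.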
How to get the most out of NVMe
More so than any other workload type, AI and high-velocity use cases can potentially take full advantage of NVMe. The challenge with these workloads is that applications typically access storage through a file system. Traditional file systems don't optimize their I/O for NVMe-based drives. Faster node hardware and NVMe drives deliver improved performance, but the file system structure doesn't allow the hardware to reach its full potential.
To circumvent this issue, look for file systems that write directly to NVMe drives instead of through the operating system's I/O stack. Also look for file systems that enable the client to communicate across NVMe-oF and that manage metadata so it isn't a performance bottleneck.