

The present and likely future of the NVMe protocol

The NVMe and PCIe combination of protocol and interconnect was an inevitable development once solid-state drives were created, as was moving NVMe to fabrics.

Nonvolatile memory express (NVMe) started out as one of a handful of protocols for connecting a flash drive to a PC faster over the PCI Express (PCIe) bus. While it is becoming the de facto standard for that job in the PC world, the enterprise story is about extending that local connection beyond the PC and into networked storage over a fabric. The goal is to give flash drives in a storage network the same I/O speed as one connected directly to a computer via PCIe. We'll look at what NVMe is, how it works and what its future holds.

The NVMe protocol was a logical consequence of the much higher performance of flash drives. Basically, the old SCSI-based storage stack in the operating system couldn't keep up with the very high rate of I/O operations. There were just too many interrupts, and the stack itself consumed thousands of CPU instructions per block of data.

Enterprises needed an approach that reduced interrupts dramatically, freeing the associated overhead for productive work on the CPU. By the same logic, a data transfer method that involved the host CPUs as little as possible made sense.
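To see why that instruction count matters, here is a back-of-envelope calculation. The per-block instruction count, the IOPS figure and the core speed are assumptions chosen for illustration, not measured values.

```c
#include <stdio.h>

int main(void) {
    /* All figures are illustrative assumptions, not measurements. */
    double instructions_per_io = 30000.0;  /* assumed cost of pushing one block through a SCSI-era stack */
    double iops = 1000000.0;               /* assumed workload: 1 million I/O operations per second */
    double core_rate = 3.0e9;              /* assumed core retiring ~3 billion instructions per second */

    double total = instructions_per_io * iops;  /* instructions consumed per second by the stack alone */
    printf("Legacy stack consumes %.1f full cores at %.0f IOPS\n",
           total / core_rate, iops);
    return 0;
}
```

Under these assumptions, the legacy stack burns roughly 10 cores' worth of work just moving blocks, which is the overhead NVMe set out to eliminate.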

All about that data transfer

The NVMe protocol evolved pretty rapidly to address these problems. Direct memory access over the PCIe bus addressed the interaction issue and was a well-tested method for moving data. It changed the transfer mechanism from a "push" system that required transfer requests and acknowledgments to a system that allowed the receiving node to "pull" data when ready. Experience had shown that this approach reduces CPU overhead to just a few percent.

The interrupt-driven model was replaced by a circular queue method, with one set of queues for pending commands and another for completion statuses. The drive pulls entries from the command queues using direct memory access, while the completion queues are handled in batches of responses, which effectively aggregates interrupts.
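To make the ring-queue mechanism concrete, here is a minimal sketch in C of a paired submission/completion ring. The structure layouts, the empty-slot marker and the simulated drive response are simplifications for illustration, not the NVMe specification's actual command format.

```c
/* A minimal sketch of the paired-ring idea behind NVMe queues: the host
 * posts commands to a submission ring and the drive pulls them via DMA,
 * while completions accumulate in a second ring and are reaped in
 * batches. Everything here is simplified for illustration. */
#include <stdint.h>
#include <stdio.h>

#define QUEUE_DEPTH 16        /* real NVMe queues can hold up to 64K entries */
#define CPL_EMPTY   0xFFFF    /* sketch-only marker for an unused completion slot */

struct sub_entry {            /* simplified submission-queue entry */
    uint16_t command_id;
    uint8_t  opcode;          /* e.g., read or write */
    uint64_t lba;             /* starting logical block */
    uint64_t buffer_addr;     /* host memory the drive DMAs into or out of */
};

struct cpl_entry {            /* simplified completion-queue entry */
    uint16_t command_id;
    uint16_t status;
};

static struct sub_entry sq[QUEUE_DEPTH];
static struct cpl_entry cq[QUEUE_DEPTH];
static uint16_t sq_tail, cq_head;

/* Host side: post a command and advance the tail. In real hardware the
 * final step is a doorbell register write; the drive then pulls the
 * entry when it is ready, with no per-command interrupt. */
static void submit(struct sub_entry cmd)
{
    sq[sq_tail] = cmd;
    sq_tail = (sq_tail + 1) % QUEUE_DEPTH;
    /* ...write sq_tail to the submission queue doorbell here... */
}

/* Host side: drain whatever completions have accumulated. Handling them
 * as a batch is what lets NVMe aggregate interrupts. */
static void reap_completions(void)
{
    while (cq[cq_head].status != CPL_EMPTY) {
        printf("command %u finished with status %u\n",
               cq[cq_head].command_id, cq[cq_head].status);
        cq[cq_head].status = CPL_EMPTY;
        cq_head = (cq_head + 1) % QUEUE_DEPTH;
    }
}

int main(void)
{
    for (int i = 0; i < QUEUE_DEPTH; i++)
        cq[i].status = CPL_EMPTY;          /* start with an empty completion ring */

    struct sub_entry cmd = { .command_id = 1, .opcode = 0x02,
                             .lba = 0, .buffer_addr = 0x1000 };
    submit(cmd);

    /* Stand-in for the drive: it pulled the command and posted a result. */
    cq[0].command_id = 1;
    cq[0].status = 0;

    reap_completions();
    return 0;
}
```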

One complaint about the SAS/SATA/Fibre Channel stacks was the lack of any sense of priority or source owner. NVMe attacks this elegantly, with up to 65,535 I/O queues, each of which identifies both originator and priority. This permits data to be delivered back to the originating core, for example, or to a specific application. This addressing scheme becomes even more powerful when we look at the extensions of NVMe to fabrics.
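As a rough illustration of that addressing, the sketch below gives each core its own queue pair tagged with a priority class. The core count, field names and priority levels are assumptions made for the example, not the spec's exact definitions.

```c
/* Sketch of the "one queue pair per core, with a priority class" idea.
 * The values and structures are illustrative assumptions. */
#include <stdio.h>

#define NUM_CORES 4

enum queue_priority { PRIO_URGENT, PRIO_HIGH, PRIO_MEDIUM, PRIO_LOW };

struct queue_pair {
    int core_id;               /* the core that owns this submission/completion pair */
    enum queue_priority prio;  /* arbitration class the controller honors */
    unsigned submitted;
    unsigned completed;
};

static struct queue_pair qp[NUM_CORES];

int main(void)
{
    /* Give one latency-sensitive core a higher arbitration class;
     * the assignment is arbitrary and only for illustration. */
    for (int core = 0; core < NUM_CORES; core++) {
        qp[core].core_id = core;
        qp[core].prio = (core == 0) ? PRIO_HIGH : PRIO_MEDIUM;
    }

    /* Because each core submits on its own queue, the matching completion
     * queue returns results straight to that core -- no cross-core locks,
     * no redirecting interrupts to whichever core happens to be handy. */
    qp[2].submitted++;
    qp[2].completed++;

    for (int core = 0; core < NUM_CORES; core++)
        printf("core %d: priority %d, %u submitted, %u completed\n",
               core, qp[core].prio, qp[core].submitted, qp[core].completed);
    return 0;
}
```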

The physical side of NVMe has settled on either an M.2 connector, which carries up to four PCIe lanes, or the standardized SATA Express (SATAe) connector, which carries two. Both are defined to allow SATA drives to be connected as an alternative to NVMe/PCIe.

M.2 technology allows very compact SSDs to be attached. These M.2 drives leave out cases and other drive hardware to save space and cost and, as a result, substantial capacities are available in packages roughly 1.5 to 3 inches long and just under an inch wide. We can expect 10 TB or more in these tiny form factors in 2018.

Size, though, is an unexpected benefit of NVMe. The real headline is summed up in announcements made at September's Flash Memory Summit, where several drives boasting on the order of a million IOPS were described. That compares with roughly 150 IOPS for a traditional HDD, and it is the primary reason enterprise HDD sales are falling rapidly.

In the process, NVMe has made the SAS interface obsolete. SAS is built on a SCSI software stack, and thus on the old, long-winded I/O path. It couldn't keep up with even small configurations of SSDs, and that was when drives delivered only a few hundred thousand IOPS each. Adding RDMA to SAS has been discussed, but NVMe/PCIe has clearly won the battle.

With SAS fading into the sunset, what about SATA itself? There is no cost difference in the actual PCIe or SATA interfaces. They are almost identical electrically, and chipset support for autodetection, along with common connection electronics, makes any host-side connection trivial.

The only reason to keep SATA is to maintain an artificial price differential between NVMe drives and SATA drives. The dynamics of pricing in the SSD market are much more sophisticated than in the HDD market. Latency, IOPS and drive durability are all considered alongside capacity, while the concept of enterprise versus commodity drives has become quite fuzzy as large cloud providers use these other differentiators to determine what they need.

Appliance-level redundancy and a no-repair maintenance approach mean the humble single-port commodity drive can move into the bulk storage space, while higher-performance and long-durability drives make up the primary tier. The upshot of this change in differentiators is that there is no long-term need for SATA, so it too will fade away, allowing NVMe/PCIe to be the local drive interface.

What technology will NVMe battle next?

All is not smooth sailing for the NVMe protocol, however. We have Intel's Optane technology to contend with. Intel plans to connect Optane SSDs using an Intel-developed -- and strictly proprietary -- fabric scheme. This will be based on Omni-Path, which will be native to the CPU. Intel's worldview is that this fabric will connect all the crucial modules in a server, including CPUs, graphics processing units and field-programmable gate arrays, as well as memory modules and local SSDs.

Intel drives the CPU business and can make this approach stick. Any alternative interfaces will need translation chips, which brings up the next evolution in NVMe. The idea of extending the NVMe protocol over fabrics has been in the works for years, dating back to the earliest NVMe discussions. It's a sensible extension of the approach, especially as we have huge amounts of experience with RDMA over Ethernet and InfiniBand.

We can expect something of a battle between Intel's Omni-Path approach and Ethernet/RDMA. The latter has the huge advantages of market experience, a strong technology roadmap, a large installed base and near ubiquity as a cluster interconnect. With industry leaders such as Mellanox designing translators from Omni-Path to Ethernet, I bet we'll see clusters connect using a hybrid of the two, with Omni-Path inside the servers and Ethernet/RDMA between them.

This is the future of storage systems. One of these fabric approaches will be anointed, and the industry will adopt it rapidly. The reason is that connecting all the major modules in a cluster into a virtualizable pool of resources, directly accessible across the fabric, will make for some very powerful and versatile systems.

NVMe wins whatever happens. The ring-buffer/RDMA approach has worked so well that all of these various offerings will use it to handle operations. It's very safe to say that the NVMe protocol is the future of computing.

