

With NVMe-oF, who needs rack-scale PCIe?

NVMe over networks is winning out as the best way to offer shared, flash-based storage systems, with Dell EMC, NetApp, Pure Storage, Western Digital and startups in the game.

Just a few years ago, before the emergence of the NVMe-oF specification, the future data center in my crystal ball was based on a rack-scale, switched PCIe fabric. I figured that OCuLink provided a practical standard for the necessary cables and connectors, while merchant PCIe chips from Broadcom or IDT would let Dell, Hewlett Packard Enterprise or a bunch of startups build a rack-scale system without even programming a field-programmable gate array, let alone spinning an ASIC.

Sure, NextIO Inc. and VirtenSys Ltd. had unsuccessfully tried externalizing PCIe a decade ago, using it to enable a rack full of servers to share a few expensive 10 Gbps Ethernet cards, Fibre Channel (FC) host bus adapters (HBAs) or RAID controllers. Sharing expensive peripherals, such as disk storage and laser printers, was, after all, one of the original justifications for networks from LANs to SANs.

The problem with their approach, to my mind, was that NextIO and VirtenSys products ended up costing almost as much as just putting a network interface card (NIC) and HBA in every server. Meanwhile, sharing those I/O cards reduced their value, since each host got only one-half or one-quarter of a card's bandwidth.

Then, DSSD arrived on the scene, with Andy Bechtolsheim, Jeff Bonwick and Bill Moore leading a well-funded, all-star cast that EMC acquired in 2014. Their approach was to share a block of flash across the hosts in a rack, which looked like a much better idea. Fusion-io cards and the other PCIe SSDs of the time were much more expensive than NICs and HBAs, but the real difference was that shared storage is much more valuable than the isolated puddles of ultralow latency that PCIe SSDs provided.

DSSD wasn't just a way to share SSDs; it was a shared storage system that could deliver 100 microsecond (µs) latency when all-flash array vendors were bragging about 1 millisecond latency. For 100 µs latency, I thought users would put up with having to load a kernel driver.

But while EMC was touting DSSD as the best thing since System/360, a group of pioneering startups that included Apeiron Data Systems, E8 Storage, Excelero and Mangstor demonstrated nonvolatile memory express (NVMe) over networks. These systems added just 5 to 20 µs to the 75 or 80 µs latency of NVMe SSDs, matching DSSD's magic 100 µs latency over Ethernet.

In 2017, with NVMe over Fabrics (NVMe-oF) on the horizon, Dell EMC wisely decided to shutter the DSSD platform. Few customers wanted to pay for such expensive custom hardware and PCIe when they could get the same latency at one-fifth the cost.

Today, most of those startups, along with industry titans, like Dell EMC, NetApp, Pure Storage and Western Digital, are shipping storage systems supporting the NVMe-oF protocol. These systems produce 100 µs-class latency over the standard Ethernet and FC network gear customers already use. Many of them use NVMe-oF as a front-end storage access protocol between hosts and storage systems, like iSCSI and FC, as well as a replacement for SAS as the back-end connection between storage systems and external media shelves.
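For readers new to the protocol, here is roughly what that front-end, host-to-array connection looks like on a Linux host using the nvme-cli package. The address, port and subsystem NQN below are hypothetical placeholders for a real array's values, and the `echo` prefix prints the commands as a dry run rather than executing them; drop it on a host that actually has an RDMA-connected NVMe-oF target.

```shell
# Hypothetical target details -- substitute your storage system's values.
TRADDR=192.0.2.10                  # NVMe-oF portal address (example IP)
TRSVCID=4420                       # default NVMe-oF service port
NQN=nqn.2018-09.example:subsys1    # hypothetical subsystem NQN

# Discover the subsystems the target exposes, then connect over an
# RDMA fabric (nvme-cli also supports -t fc for Fibre Channel).
# "echo" makes this a dry run; remove it to run the real commands.
echo nvme discover -t rdma -a "$TRADDR" -s "$TRSVCID"
echo nvme connect -t rdma -a "$TRADDR" -s "$TRSVCID" -n "$NQN"
```

Once connected, the remote namespace appears as an ordinary local NVMe block device (e.g., /dev/nvme1n1), which is what lets NVMe-oF slot in where iSCSI or FC LUNs sit today.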

[Figure: How NVMe over Fabrics works]

If NVMe-oF is becoming the standard storage protocol for host-to-array and array-to-shelf communications, is there a place for rack-scale PCIe? Or was rack-scale PCIe a technology like the dirigible, autogyro and digital watch that went from futuristic to quaint without a real stop at modern?

Composable infrastructure supplier Liqid and Gen-Z Consortium are talking up a rack-scale infrastructure that shares not just storage or I/O cards, but GPUs and memory. That's a story for another day.

This was last published in September 2018


Join the conversation



Which vendor do you think has the best NVMe-oF storage system?
Having spent 35 years as a consultant, I can only answer a general question like that with "It depends."

Do you want a JBOF or a full-fledged array? High capacity? Low cost? Need a trusted provider? What's the app? What's that app's I/O profile?

Best depends on all those things.

 - Howard
Thank you for an interesting POV.

Given that software-defined memory (SDM) is already available commercially from several vendors, the existing fabrics already provide (or can provide) the memory part of the composable infrastructure. That leaves only GPUs/FPGAs/accelerators, and given the current trend of putting more than one GPU into a node, and even connecting those GPUs with their own super-fast interconnect, it looks like the world is not in a rush to share GPUs beyond existing niche solutions like rCUDA.

So, indeed, PCIe rack-scale interconnect is being challenged. IMO, whether existing fabrics win vs. PCIe will be determined by the availability, cost and maturity of PCIe rack-scale switches and adapters vs. the ability of existing fabrics to provide much better QoS than they do today. 100 µs is nice, but with congested traffic latency sometimes reaching O(seconds), that could cause transaction failures or application crashes.
Software-defined memory is just marketing speak for doing virtual memory better. I don't buy it.

A system that's composable at the memory level can't have 100 µs or even 50 µs memory latency. Today we have to manage NUMA nodes because accessing a DIMM connected to another processor adds 200 ns; adding 50 µs is just too much.