Designing Powerful AI Systems with NVMe, PCIe and Logical Volumes

Explore some give-and-take and performance factors to consider when designing storage systems for AI and ML workloads. Technologies covered in this presentation include PCIe switching and fabrics, NVIDIA’s GPUDirect Storage and others.

00:01 Wilson Kwong: Hello, my name is Wilson Kwong, technical staff applications engineer at Microchip Technology. Today, I'll be talking about designing powerful AI systems with NVMe, PCIe and logical volumes. Let's get started.

I'll give a brief overview of the topic and the technologies involved, namely PCIe fabrics, NVMe and logical volumes, and GPUDirect Storage, and then we'll put it all together for AI and ML design.

00:43 WK: Traditional data movement between NVMe storage and the GPU requires CPU and bounce buffer involvement. The data sets for AI model training and big data analytics are reaching hundreds of terabytes, and that is a challenge for GPUs, with millions of file accesses by the GPU being bottlenecked by the CPU. Real-time decisions require sourcing and storing large data sets and converting them into actionable intelligence with low latency and high-speed processing in the field. It has become obvious that the conventional way of transferring data between storage and GPUs through the CPU is insufficient, severely restricting the ability of GPUs to utilize their enormous resources efficiently.

01:34 WK: Several advances have been made to solve the resulting bottlenecks, including Nvidia's GPUDirect Storage, but exploiting its full capabilities is hindered by traditional PCIe. We can solve this problem with GPUDirect Storage connected to logical RAID volumes over PCIe fabrics. Let's dive into the different technologies, starting with PCIe fabrics.

02:04 WK: There are a number of technologies and innovations to address this data challenge with big data and analytics. The first is the adoption of PCIe Gen 4 as the fundamental system interconnect within the storage subsystem. As we know, PCIe Gen 4 doubles the bandwidth of Gen 3, bringing 16 gigatransfers per second per lane between different components. But a traditional PCIe switch, even at Gen 4, retains the basic tree-based hierarchy, and with that come some complexities. Host-to-host communication requires the use of NTB to cross partitions, making it complex, especially in multi-host, multi-switch configurations. Fabric switches overcome these limitations of the traditional PCIe specification. Let's take a look at how.
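The Gen 3 to Gen 4 doubling mentioned above can be worked through with a little arithmetic. The sketch below assumes the standard per-lane rates (8 GT/s for Gen 3, 16 GT/s for Gen 4) and the 128b/130b line encoding both generations use; the function names are illustrative, and the results are raw link figures before packet and protocol overhead.

```python
# Per-lane signalling rates in gigatransfers per second (GT/s).
GT_PER_S = {3: 8.0, 4: 16.0}

# PCIe Gen 3 and Gen 4 both use 128b/130b line encoding.
ENCODING_EFFICIENCY = 128 / 130


def lane_bandwidth_gbytes(gen: int) -> float:
    """Bandwidth of one lane in one direction, in GB/s, before protocol overhead."""
    return GT_PER_S[gen] * ENCODING_EFFICIENCY / 8  # 8 bits per byte


def link_bandwidth_gbytes(gen: int, lanes: int) -> float:
    """Bandwidth of a link of the given width in one direction, in GB/s."""
    return lane_bandwidth_gbytes(gen) * lanes


if __name__ == "__main__":
    for gen in (3, 4):
        for lanes in (4, 16):
            bw = link_bandwidth_gbytes(gen, lanes)
            print(f"Gen {gen} x{lanes}: {bw:.2f} GB/s per direction")
```

Under these assumptions a Gen 4 x4 link carries a little under 8 GB/s in each direction and a Gen 4 x16 link a little under 32 GB/s.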

03:02 WK: For example, one such solution is the Microchip PAX fabric switch. It supports multi-host sharing of SR-IOV endpoints. It has the ability to dynamically partition a pool of GPUs and NVMe SSDs to be shared among multiple hosts, while supporting standard host system drivers. This means that no custom drivers need to be developed for this type of solution, which makes deployment much easier. The PAX fabric switch also supports peer-to-peer transfers directly across the fabric, decreasing root port congestion and eliminating CPU performance bottlenecks.

03:52 WK: With the latest release of the PAX fabric switch firmware, we also support fabric DMA. Fabric DMA is a high-performance, low-latency, cut-through architecture with maximum flexibility, designed specifically for PCIe fabrics. The PAX fabric switch contains two discrete but interoperable domains. There is a host virtual domain dedicated to each physical host and a fabric domain containing all endpoints and fabric links. Transactions from the host domains are translated into IDs and addresses in the fabric domain and vice versa, with nonhierarchical proprietary routing of the traffic in the fabric domain. This allows the fabric links connecting the switches and endpoints to be shared by all of the hosts in the system.
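The host-domain to fabric-domain translation described above can be pictured as a per-host table of address windows. The following is only a toy sketch of that idea: the `Window` and `HostDomainMap` names and all addresses are hypothetical, and the switch's actual proprietary routing and ID translation are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """One contiguous range mapped from a host domain into the fabric domain."""
    host_base: int
    fabric_base: int
    size: int


class HostDomainMap:
    """Hypothetical per-host translation table (not the real firmware structure)."""

    def __init__(self) -> None:
        self.windows = []

    def add_window(self, host_base: int, fabric_base: int, size: int) -> None:
        self.windows.append(Window(host_base, fabric_base, size))

    def to_fabric(self, host_addr: int) -> int:
        """Translate an address seen by the host into a fabric-domain address."""
        for w in self.windows:
            if w.host_base <= host_addr < w.host_base + w.size:
                return w.fabric_base + (host_addr - w.host_base)
        raise LookupError(f"host address {host_addr:#x} is not mapped")

    def to_host(self, fabric_addr: int) -> int:
        """Translate a fabric-domain address back into this host's view."""
        for w in self.windows:
            if w.fabric_base <= fabric_addr < w.fabric_base + w.size:
                return w.host_base + (fabric_addr - w.fabric_base)
        raise LookupError(f"fabric address {fabric_addr:#x} is not visible to this host")


if __name__ == "__main__":
    # Two hosts see different local addresses for endpoints in one fabric domain.
    host1, host2 = HostDomainMap(), HostDomainMap()
    host1.add_window(host_base=0x1000_0000, fabric_base=0x8000_0000, size=0x1000)
    host2.add_window(host_base=0x2000_0000, fabric_base=0x8000_1000, size=0x1000)
    print(hex(host1.to_fabric(0x1000_0010)))
    print(hex(host2.to_fabric(0x2000_0010)))
```

Because each host has its own table, every host can keep a conventional view of its devices while the endpoints themselves live in one shared fabric domain.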

04:57 WK: The fabric firmware runs on an embedded processor, and it has the ability to virtualize a PCIe spec-compliant switch with a configurable number of downstream ports. So, as you can see in this figure, GPUs that are bound to host 1 can be spread across multiple switches. However, with firmware virtualization, the fabric is able to make host 1 think that it has four endpoints connected to its four downstream ports in a PCIe spec-compliant hierarchy and then enumerate them properly. In the fabric switch, the switch intercepts all configuration plane traffic from the host, including the PCIe enumeration process. The fabric is low latency and has the ability to pick the optimal path. With PCIe fabrics, redundant paths are supported as well as loops, which is not possible in traditional PCIe.

06:11 WK: Let's now take a look at an example configuration and how the PAX fabric switch actually performs. As we can see, the fabric domain contains four PAX switches and a number of GPUs. Namely, we have four Nvidia Tesla GPUs in this test case. We also have a Samsung NVMe SSD with SR-IOV capability. The application that we ran on host 1, which runs a Windows OS, is a TensorFlow CIFAR-10 image classification training model.

06:53 WK: This workload requires the use of all four GPUs, so we've bound four GPUs to host 1. The GPUs appear directly connected to the virtual switch, and then CUDA is initialized to discover the GPUs. As we can see, the host 1 average bandwidth here for bidirectional peer-to-peer is 24.9 GBps. After the TensorFlow training is completed, two of these GPUs are released back into the pool and consumed by host 2 for a different workload. In this case, host 2 was able to average 24.6 GBps for bidirectional P2P with two GPUs. The fabric domain is controlled by a fabric manager that can be run over a UART connection, over Ethernet or in-band. It is the brains behind managing the pools of GPUs and assigning the GPUs to the hosts.
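The bind-and-release sequence in this test case can be sketched as simple pool bookkeeping. The `FabricManager` class below is a hypothetical illustration of that bookkeeping only; it is not the interface of the actual fabric manager, and the GPU and host names are made up.

```python
class FabricManager:
    """Toy model of a pool of GPUs that can be bound to and released from hosts."""

    def __init__(self, gpus):
        self.free = list(gpus)   # GPUs currently in the shared pool
        self.bound = {}          # host name -> list of GPUs bound to that host

    def bind(self, host, count):
        """Move `count` GPUs from the pool to `host`."""
        if count > len(self.free):
            raise RuntimeError("not enough free GPUs in the pool")
        taken, self.free = self.free[:count], self.free[count:]
        self.bound.setdefault(host, []).extend(taken)
        return taken

    def release(self, host, count):
        """Return `count` GPUs from `host` to the pool."""
        owned = self.bound.get(host, [])
        if count > len(owned):
            raise RuntimeError(f"{host} does not own {count} GPUs")
        split = len(owned) - count
        released = owned[split:]
        del owned[split:]
        self.free.extend(released)
        return released


if __name__ == "__main__":
    fm = FabricManager(["gpu0", "gpu1", "gpu2", "gpu3"])
    fm.bind("host1", 4)      # training job on host 1 uses all four GPUs
    fm.release("host1", 2)   # training done: two GPUs go back to the pool
    fm.bind("host2", 2)      # host 2 picks them up for a different workload
    print(fm.bound, fm.free)
```

The point of the sketch is that reassignment is a management-plane operation on the pool; nothing in it requires recabling or a custom driver on the hosts.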

So, with that in mind, let's move on to NVMe and logical volumes as the second piece of this design. NVMe SSDs are a critical technology that directly utilizes high-bandwidth, low-latency PCIe for enhanced data movement directly to and from solid-state storage. Typically, NVMe uses a by-four (x4) PCIe connection, and each drive is connected at 6 to 8 GBps full duplex at Gen 4.

08:45 WK: The NVMe protocol also eliminates overhead associated with legacy storage protocols. NVMe drives are much faster than traditional hard disks and traditional all-flash structures because those are limited to a single command queue, whereas the NVMe protocol supports multiple queues. NVMe SSDs have been broadly adopted as primary storage in the enterprise as well. Data availability and ease of operation have become paramount. We're seeing increased usage in data analytics and AI and ML, where data sets have become massive but also require fast access, and NVMe can be helpful in this area as well. With large data sets and a larger number of drives, data protection is often required. So, we can solve this with logical volumes and data protection by way of RAID. There are two flavors of RAID. There's software RAID, which is performed on the host CPU and so consumes additional CPU resources. Software RAID uses the CPU to provide parity and redundancy. And then there's hardware RAID, which uses dedicated hardware, such as a RAID controller.
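To make the parity work concrete, here is a minimal sketch of single XOR parity, the scheme used by RAID 5. The talk does not say which parity RAID level is meant, so treat this as an assumed example; the function name is illustrative. In software RAID this byte-by-byte work runs on the host CPU, which is the resource cost described above.

```python
def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    out = bytearray(length)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


if __name__ == "__main__":
    # One stripe: three data blocks, each on a different drive.
    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_blocks(data)  # written to a fourth drive

    # The drive holding data[1] fails: rebuild it from the survivors plus parity.
    rebuilt = xor_blocks([data[0], data[2], parity])
    print(rebuilt == data[1])
```

XOR is its own inverse, so the same routine both generates parity and reconstructs a missing block.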

10:29 WK: Hardware RAID uses the offload capabilities of an ASIC to generate parity and redundancy. It handles the overhead related to I/O operations. You get better performance with dedicated data protection through a hardware RAID-specific ASIC. There may be slight penalties on I/O or latency when dealing with offload operations. However, even with PCIe Gen 4  . . . NVMe SSD drives, performance of the storage subsystem can still be limited by the traditional data movement architectures of RAID technology. When we look at the actual architecture of software and hardware RAID, we see that there are different levels of interaction here. Software RAID utilizes inbox NVMe drivers to access the PCIe-attached NVMe drives either directly or through a switch. As you can see, we have the application layer followed by the OS storage stack layer, and within this OS storage stack layer there's software data protection. Again, because of this implementation, the consumption of expensive compute and memory resources on the host will hinder the performance of other applications running on the host as well.

11:55 WK: Traditional hardware RAID relieves the host of the parity management burden. It funnels all of the data to the RAID controller. So, on the OS side you have the application layer followed by the OS storage stack, and then the host driver, which talks to the controller. The hardware protection and encryption services and the NVMe drivers sit on the dedicated hardware controller, which then connects to the NVMe drive. So, all of the data needs to pass through the RAID controller before it is placed onto the drive's submission queue. Directing all of this data through the RAID controller adds complexity to the data path, even when it's not necessary.

So how do we solve that problem? Microchip has been working on NVMe-optimized hardware RAID. That is to say, we can combine a multipath driver with an embedded switch within the controller, which allows us to unlock the best of the previous architectures for NVMe drives.

13:18 WK: The embedded switch provides a streamlined data path unencumbered by firmware or the RAID-on-Chip controller. We can maintain the availability of the hardware-based protection and encryption services through the use of the multipath driver. The multipath driver intelligently routes data, based on the data service requirements, through either the switch or the RAID controller with negligible overhead.

13:53 WK: As you can see, combining the software and the hardware controller gives us the application layer on the right-hand side. And we can see how data that does not need to be encrypted or parity-protected can flow straight through the PCIe switch into the NVMe drive. But when those services are needed, the multipath driver can redirect the data, following the green arrow here, into the necessary blocks before committing it to the NVMe drive.
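The routing decision described above can be sketched as a small function. The names `IoRequest`, `Path` and `choose_path` are hypothetical and exist only for illustration; the real multipath driver's interface and policy are not described in the talk.

```python
from dataclasses import dataclass
from enum import Enum


class Path(Enum):
    SWITCH = "embedded switch (direct to NVMe drive)"
    RAID_ENGINE = "RAID engine (parity and encryption services)"


@dataclass
class IoRequest:
    lba: int                       # starting logical block address
    length: int                    # number of blocks
    needs_parity: bool = False     # volume is parity-protected
    needs_encryption: bool = False


def choose_path(req: IoRequest) -> Path:
    """Send I/O through the RAID engine only when a data service is required."""
    if req.needs_parity or req.needs_encryption:
        return Path.RAID_ENGINE
    return Path.SWITCH


if __name__ == "__main__":
    print(choose_path(IoRequest(lba=0, length=8)).value)
    print(choose_path(IoRequest(lba=0, length=8, needs_parity=True)).value)
```

Keeping the decision this cheap is what lets unprotected I/O take the streamlined path without paying for services it does not use.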

So now, with that in mind, we can look at creating logical RAID volumes. Throughput is extremely important when dealing with large data sets, and we have a couple of options. We have RAID 0, which stripes data across two or more disks for performance. It is generally well accepted that RAID 0 carries a sizable risk of data loss. But where that is acceptable, people will often want RAID 0 for the speed benefits. What's often not understood is that the speed benefits depend on the type of disk usage, and there are two main factors for this performance: access time and throughput.

15:19 WK: Access time dictates how quickly a drive can go from one operation to the next, and throughput determines how quickly the data can be read or written. RAID 0 does increase throughput, but it does not help access times. So, what does this all mean? If you're dealing with reading and writing a large number of small files, the performance benefit will be very minimal. The benefits are realized when reading or writing a large amount of data at once, in a single location on the disk. Therefore, RAID 0 can make sense when transferring and copying very large files. And then we can look at RAID 10, a stripe of mirrors, if data loss is not acceptable, as RAID 10 adds the level of protection that most people need. It's fast and resilient at the same time. So, if you need a hardware level of protection for faster storage, RAID 10 is a simple and relatively expensive fix, as you are securing data by mirroring duplicates of all your data. So, the final technology that we want to discuss here is GPUDirect Storage.
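The RAID 0 reasoning above (striping helps throughput but not access time) can be checked with a simple timing model. The numbers below are hypothetical, chosen only to make the effect visible, and the model assumes ideal striping in which throughput scales linearly with the number of disks.

```python
def transfer_time_s(file_bytes, access_time_s, throughput_bytes_per_s, disks=1):
    """Time for one file: fixed access time plus size divided by striped throughput."""
    return access_time_s + file_bytes / (throughput_bytes_per_s * disks)


def raid0_speedup(file_bytes, access_time_s, throughput_bytes_per_s, disks):
    """Speedup of an ideal RAID 0 set over a single disk for one file."""
    single = transfer_time_s(file_bytes, access_time_s, throughput_bytes_per_s)
    striped = transfer_time_s(file_bytes, access_time_s, throughput_bytes_per_s, disks)
    return single / striped


if __name__ == "__main__":
    ACCESS = 0.010         # hypothetical 10 ms access time per operation
    THROUGHPUT = 200e6     # hypothetical 200 MB/s per disk
    for label, size in (("4 KiB file", 4096), ("10 GB file", 10e9)):
        s = raid0_speedup(size, ACCESS, THROUGHPUT, disks=2)
        print(f"{label}: {s:.3f}x faster on two striped disks")
```

With these assumed figures the small file sees almost no gain, because the fixed access time dominates, while the large file approaches the ideal twofold speedup.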

16:44 WK: GPUDirect Storage enables a direct path between local or remote storage and GPU memory. The standard path between GPU memory and an NVMe drive uses a bounce buffer in the system memory that hangs off the CPU. The direct path from storage gets higher bandwidth by skipping the CPU altogether. We sometimes call this "zero copy." No need for a bounce buffer means no copies are needed into CPU memory. This eliminates the unnecessary copies, decreasing CPU overhead and reducing latency, and results in significantly better performance. The other thing we can do is use a DMA engine near the storage to move data directly into GPU memory.
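The difference between the two paths can be illustrated with a toy model that uses plain byte buffers to stand in for storage, system memory and GPU memory. This is not the GPUDirect Storage API; the function names are made up, and the only thing the sketch tracks is how many bytes get staged in CPU system memory on each path.

```python
def read_via_bounce_buffer(storage: bytes, gpu_memory: bytearray) -> int:
    """Standard path: storage -> bounce buffer in system memory -> GPU memory.

    Returns the number of bytes staged in system memory.
    """
    if len(gpu_memory) < len(storage):
        raise ValueError("GPU buffer too small")
    bounce = bytearray(len(storage))
    bounce[:] = storage                    # first copy: storage into system memory
    gpu_memory[:len(bounce)] = bounce      # second copy: system memory into GPU memory
    return len(bounce)


def read_direct(storage: bytes, gpu_memory: bytearray) -> int:
    """Direct path: storage -> GPU memory, with nothing staged in system memory."""
    if len(gpu_memory) < len(storage):
        raise ValueError("GPU buffer too small")
    gpu_memory[:len(storage)] = storage    # single transfer straight to the GPU
    return 0


if __name__ == "__main__":
    data = bytes(range(16))
    gpu_a, gpu_b = bytearray(16), bytearray(16)
    print("bounce path staged bytes:", read_via_bounce_buffer(data, gpu_a))
    print("direct path staged bytes:", read_direct(data, gpu_b))
    print("same result on the GPU:", gpu_a == gpu_b)
```

Both paths deliver the same bytes to the GPU; the direct path simply never touches system memory, which is where the CPU overhead and extra latency come from.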

17:42 WK: Expanding on the DMA point, direct memory access asynchronously moves blocks of data over PCIe using a copy engine, rather than using loads and stores. DMA engines are generally found in GPUs, NVMe drivers and storage controllers, but not in CPUs. Traditional GPU DMA engines cannot target storage, and storage DMA engines cannot target GPU memory without GPUDirect Storage. To decouple from the need for architecture-specific DMA engines, we can use DMA engines in the drive or near the storage, instead of using the GPU's DMA. This improves I/O bandwidth and simplifies design. So, as we can see, GPUDirect Storage gives two to eight times higher bandwidth for data transfers directly between storage and GPU, and up to 3.8 times lower end-to-end latency. The use of DMA engines near the storage not only interferes less with GPU load but is also less invasive to CPU load, and it can be made more generic as well. So, now with GPUDirect Storage, the GPU, rather than the CPU, becomes the compute element with the highest bandwidth.

19:18 WK: So, we've discussed three different technologies. Now let's talk about how we can put them all together to build an artificial intelligence and machine learning platform. First, using NVMe SSDs provides overall improved performance, but the advantage is more pronounced with larger files. For AI engineers, the advantage can be witnessed during the training phase, when the model is constantly reading and learning from files, most likely stored on the local file system. And we also need a way to ensure that these file systems enduring constant reads and writes have the necessary performance and data protection.

20:07 WK: So, as you can see on the right-hand side, this is what it would look like having a CPU, switch, accelerators and NVMe-controlled storage. So, first example. In this example, we're using a Microchip PCIe fabric switch, which allows dynamic allocation of pooled resources to multiple hosts and provides a DMA engine that is close to the storage, without using a GPU DMA or being dependent on NVMe driver DMA, for example. Then we can also utilize the Microchip NVMe RAID controller to create our logical volumes with the NVMe SSDs for high performance and increased storage size. Currently, the only RAID controller with the ability to accommodate a transfer rate of 24.9 gigabytes per second is the Microchip SmartROC 3200 RAID-on-Chip controller.

21:07 WK: It has very low latency, offers up to 16 lanes of PCIe Gen 4 to the host and is backwards compatible with PCIe Gen 2. And the final piece to talk about is GPUDirect Storage. It breaks the traditional model of accelerated data traversing the CPU bounce buffer before being committed to storage. Now, we can look at a case study done by Nvidia, the Bandwidth and CPU Load Case Study, using GPUDirect Storage through a switch and NVMe SSDs.

As we can see from this graph, in yellow, page caching introduces the overhead of an extra copy within CPU memory, but slightly outperforms reading data from storage to the GPU through a bounce buffer at small transfer sizes. As we move to large transfer sizes, you can see that DMA takes over in performance. However, GPUDirect Storage wins at any transfer size due to the removal of the CPU bottleneck, as we can see in blue.

22:22 WK: So, in summary, using PCIe fabrics with GPUDirect Storage and logical volumes can exploit the full ability of GPUDirect Storage to solve the bounce buffer problem. We can achieve close to . . . performance of 24.9 GBps, and utilize dynamic switch partitioning and SR-IOV sharing to allow GPUs and NVMe resources to be dynamically allocated to any host in a multi-host system. Using these technologies, we can design a very powerful system that meets the demands of artificial intelligence and machine learning workloads in real time.

Thank you very much for listening to this presentation, and if you have any questions don't hesitate to reach out and contact me. Thank you very much.
