00:10 Gopi Jandhyala: Hi everyone. Thank you for joining me for this rather unique version of FMS. This certainly has been an interesting and challenging year for all. And personally, I'm not new to FMS, and I have been a regular attendee for some time now. I've seen this conference grow in size, offerings and opportunities to mingle with like-minded people. And speaking of this year, while I certainly miss the in-person interactions with our customers, partners, and seeing the exciting innovations and the displays, I'm still very excited and thankful to be able to speak with you today.
Xilinx has had the opportunity to speak at Flash Memory Summit for several years in a row now, and this year I'd like to give an update on the trends that we talked about in previous years, and to provide our vision for the future adaptable data center. With this, let's dive right in.
01:10 GJ: First, I want to start by hitting on a few megatrends that are influencing all aspects of our industry. You may be aware of them, and most of you have probably heard them before, yet I believe they are critical enough that I want to reiterate them. The three trends that I want to highlight are the following. The first one is the evolution of compute architectures to continue delivering the performance that applications need. The second one is a dramatic bandwidth expansion in almost all peripherals, starting with the CPU complex. And finally, the third is the very exciting advancements in heterogeneous and distributed cloud architectures.
Now, let's target the first trend, the compute trend. We all know and see that frequencies have stopped increasing; the power wall has been hit. Transistor densities are still increasing, so performance gains can still be achieved, but only if the transistors are used more effectively and efficiently by the application. This takes architectural changes: needs for specialized accelerators are emerging for specific workloads, and we are seeing a rise in domain-specific architectures as potential solutions to address some of these as well.
02:33 GJ: Now, the next major trend is unfolding as we speak. After a period of relative stagnation at the Gen 3 speeds of PCIe, there has been a rapid increase in both the lane count as well as the bus performance, and this is leading to a rethinking of system architectures. Why continue to design dual-socket systems in the current era of 64-plus-core processors? If you free up the connections needed for the processor-to-processor link, it opens up almost twice the connectivity to other devices like accelerators and SSDs, and this extra connectivity can also eliminate the need for PCIe expansion, so that even a 24-SSD 2U system can directly connect to the CPU without needing a switch.
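The lane-budget argument above can be checked with simple arithmetic. The sketch below is only an illustration: the 24-SSD figure comes from the talk, while the x4 link per drive, the CPU lane counts, and the function names are assumptions chosen for the example.

```python
def lanes_needed(num_ssds, lanes_per_ssd=4):
    """PCIe lanes required to attach every SSD directly to the CPU."""
    return num_ssds * lanes_per_ssd


def needs_switch(num_ssds, cpu_lanes, lanes_per_ssd=4, reserved_lanes=0):
    """True if the SSDs cannot all be direct-attached within the lane budget."""
    usable = cpu_lanes - reserved_lanes
    return lanes_needed(num_ssds, lanes_per_ssd) > usable


# 24 SSDs is the figure from the talk; x4 per drive and the CPU lane
# counts below are illustrative assumptions, not vendor data.
direct_lanes = lanes_needed(24)
small_cpu = needs_switch(24, cpu_lanes=64)
large_cpu = needs_switch(24, cpu_lanes=128, reserved_lanes=16)
```

Under these assumptions the drives need 96 lanes, so a 64-lane budget forces a switch while a 128-lane budget with 16 lanes reserved for networking does not.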
03:24 GJ: Right now, there is also a dash towards CXL and PCIe Gen 5. So, not only is the bandwidth almost doubling, but the latency is also coming down as much . . . It's an order of magnitude. And furthermore, CXL opens up many interesting use cases with coherency. All I can say is that these are very exciting times, and architectures are constantly evolving to take advantage of more and more available performance. The final trend that's influencing IT architectures is the evolution of cloud architectures away from two separate worlds: cloud service providers providing instances running on black-box infrastructure, and on-premises, enterprise-managed systems, each with completely different policies, security teams, and management stacks.
04:19 GJ: Today, data owners require workload mobility -- they need the processing and data to come together under their own governance requirements, and this requires, in turn, these two architectures, namely the cloud and on-premises architectures, to blend. Already there are connectors and hybrid cloud software solutions coming up to fill this gap, but where software is leading, we all know for a fact that hardware will follow. Now, please stay with me as we look at what each of these trends means for the future adaptable data center and how FPGAs can play a crucial role in shaping it, with some real-life examples.
05:06 GJ: The first area we look at is solutions in the compute space that help scale performance without increasing cost. Now, the first example I want to walk you through is in the area of networking. As the data continues to grow, servers are being overloaded just by moving this data in and out from the network, rather than doing what we call useful work. And, as I previously discussed, there is a particular urgency in this area because network port speeds continue to grow at a pace that's far faster than CPU performance gains.
For instance, as this slide illustrates, in a virtualized environment a very common use case is the hypervisor needing to consult a virtual switch lookup table to deliver a packet to the correct VM. We can all clearly see that this packet processing is causing excessive CPU utilization, is adding latency while limiting data rates, and fundamentally is not scalable as network speeds grow. So, what's the solution? The solution is to move the processing to specialized hardware near the data, in this case the packets entering the server.
06:35 GJ: In this particular scenario, I'm showing a SmartNIC based on an FPGA, which is quite a common deployment model nowadays. In recent years, SmartNICs have emerged to address precisely this problem: rather than having the lookup tables in server memory, they stay right in the SmartNIC. This way the packets can be processed immediately without interrupting the server. The added advantage with an FPGA-based implementation like the one I'm showing here is that users can easily add customization rules and, most importantly, they can add them at line rate because it's optimized in hardware. And I want to highlight on this slide that leveraging this optimized hardware improves server utilization and increases data plane performance, all the while being able to scale. I'm sure that you all appreciate this.
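As a rough software model of the lookup offload described here, the sketch below keeps a flow table "on the NIC" and only hands packets to the host when no rule matches. The class and function names, the key fields, and the miss-handling policy are all hypothetical; a real SmartNIC does this in hardware pipelines, not in Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowKey:
    """Hypothetical match fields for a virtual-switch rule."""
    dst_mac: str
    vlan: int


class FlowTable:
    """Lookup table mapping a flow key to the destination VM port."""

    def __init__(self):
        self._entries = {}
        self.hits = 0
        self.misses = 0

    def add_rule(self, key, vm_port):
        self._entries[key] = vm_port

    def lookup(self, key):
        port = self._entries.get(key)
        if port is None:
            self.misses += 1
        else:
            self.hits += 1
        return port


def deliver(packets, nic_table):
    """Steer packets using the NIC-resident table; punt misses to the host."""
    delivered, punted = [], []
    for key, payload in packets:
        port = nic_table.lookup(key)
        if port is None:
            punted.append((key, payload))  # host CPU handles only these
        else:
            delivered.append((port, payload))
    return delivered, punted


table = FlowTable()
table.add_rule(FlowKey("aa:bb:cc:00:00:01", 10), "vm1")
table.add_rule(FlowKey("aa:bb:cc:00:00:02", 10), "vm2")
packets = [
    (FlowKey("aa:bb:cc:00:00:01", 10), b"hello"),
    (FlowKey("aa:bb:cc:00:00:02", 10), b"world"),
    (FlowKey("aa:bb:cc:00:00:03", 10), b"unknown"),
]
delivered, punted = deliver(packets, table)
```

In this model the host only ever sees the one packet with no matching rule, which is the utilization benefit the talk attributes to keeping the table on the SmartNIC.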
07:36 GJ: Now, let's look at another example, something that has to do with storage. We all know that many applications today need to process vast amounts of data. And what does that involve? This involves the CPU fetching the data from storage, routing it through the file system, and putting it in main memory -- only then can the CPU start processing it. And the CPU cores are just not optimized for the exact processing that needs to be done in most cases. As the slide highlights, this design results in excessive data transfers, high latency, and limited data processing throughput -- very similar drawbacks to the ones we discussed in the networking case.
So, for this storage scenario, what's the solution? The solution again is to move the processing to optimized hardware near the data, in this case the storage. This is exactly what the computational storage space is trying to address, by colocating the compute and storage. When one puts together this kind of architecture, the data doesn't need to leave the storage at all and can be processed locally within the drive. This reduces the bandwidth needs, reduces the latency, and frees up CPU cycles, all the while increasing the throughput of data processing.
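To make the data-movement argument concrete, here is a minimal sketch that counts the bytes crossing the host bus when a filter runs on the host versus inside the drive. The record layout, the predicate, and both function names are made up for illustration; the numbers say nothing about any real drive.

```python
def host_side_filter(records, predicate):
    """Conventional path: every record crosses the bus, then the CPU filters."""
    bytes_moved = sum(len(r) for r in records)
    matches = [r for r in records if predicate(r)]
    return matches, bytes_moved


def drive_side_filter(records, predicate):
    """Computational storage path: filter in the drive, move only the matches."""
    matches = [r for r in records if predicate(r)]
    bytes_moved = sum(len(r) for r in matches)
    return matches, bytes_moved


def is_error(record):
    return record.startswith(b"ERROR")


# Synthetic data set: 1,000 records of 100 bytes each, 1% of them errors.
records = [b"ok   " * 20] * 990 + [b"ERROR" * 20] * 10

host_matches, host_bytes = host_side_filter(records, is_error)
drive_matches, drive_bytes = drive_side_filter(records, is_error)
```

Both paths return the same matches, but in this synthetic case the drive-side path moves 1,000 bytes instead of 100,000, because only the matching records leave the drive.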
09:09 GJ: One subtle thing that I want to point out here is that this also gives the CPU complex a scaling factor that was not possible before, when one expands this to many such drives instead of one drive. So, now, computational storage is a topic that we discussed in previous Flash Memory Summits. It's very close to our heart, and we are happy to see that it's rapidly becoming mainstream. And I want to talk a little bit about computational storage becoming mainstream.
09:40 GJ: So, first, let's talk about the momentum we see in the standards bodies. There has been tremendous momentum within these bodies to support a common interface to computational storage, with the SNIA Compute, Memory, and Storage Initiative developing common use cases and terminology, and driving alliances to standardize interfaces.
Also, just this year, the NVMe consortium started an NVMe computational storage task group to look at adding computational storage devices to the wildly successful NVMe ecosystem. At the same time, we are starting to see computational storage products hitting the market, with growing customer deployments. In fact, we are very happy to announce that the SmartSSD, a joint solution from Samsung and Xilinx, has passed its qualification requirements and will be available as a mass-market computational storage drive at the end of this year. Very exciting times.
10:49 GJ: And for these products to gain traction, we need solutions, a whole lot of them. So, it's great to see the solution space continue to grow and continue to be adopted. Some of the key ones I'm highlighting here are transparent compression to save on storage costs; video file transcoding, which has more than taken off in the current sphere of remote conferences like the one we're having right now; Spark acceleration to increase ETL capabilities; and the exciting area of search-in-storage, which I'd like to dig into a little more deeply.
11:29 GJ: Let's look at the search-in-storage solution from Lewis Rhodes Labs. These guys have developed a converged storage and search appliance, and this follows in the footsteps of database appliances. However, when you look inside this box, rather than a proprietary implementation, they leverage an off-the-shelf computational storage drive, the SmartSSD. Let's talk a bit about the search algorithm.
LRL, or Lewis Rhodes Labs, has developed a fast, fixed-throughput, regular expression-based search. They're able to place it right next to the storage on each of these SmartSSDs. The solution is based on something called neuromorphic processing. This is a technique that uses insights on how the human brain processes data to match patterns in a single pass. Alternative solutions exist, but they either require complex indexing that limits the search domain or they have search times that become dramatically longer as the complexity of the searches increases.
12:46 GJ: So, how is the system . . .? So, the CPU in this case runs the software responsible for storage, the file system, and the software application, including the GUI, as well as the API that coordinates the operation of the search accelerator in each of these SmartSSDs. Each of these SmartSSDs also has a high-throughput parallel search engine, as I already said. When a user activates a search, they connect to the GUI that you see here and put in a search expression. In this case, we're showing . . . Illustrating a relatively trivial regular expression, looking for two words with really good capitalization in a large data set, so the appliance receives a command and then sets each of these modules to work. And the acceleration is multifold in this case; it's achieved both by the design of the search hardware loaded on each FPGA and by the federation of all 24 SmartSSDs working together.
13:58 GJ: And, in fact, the acceleration doesn't just stop there; multiple appliances can run in parallel. With just one half rack of systems, one can process almost a petabyte of data in just minutes. And while the example that I'm illustrating here is a simple one, much more complex searches can be run in the same time with minimal network and CPU demands.
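The federation pattern described for the appliance (one search expression broadcast to 24 drives, each scanning its own data and returning only hits) can be sketched in software as follows. This is not the Lewis Rhodes Labs engine or its API: Python's standard `re` module stands in for the hardware search engine, threads stand in for drives, and the function names, sample documents, and the case-insensitive two-word pattern are all illustrative assumptions.

```python
import re
from concurrent.futures import ThreadPoolExecutor

NUM_DRIVES = 24  # from the talk: 24 SmartSSDs per appliance


def search_drive(drive_id, documents, pattern):
    """Stand-in for the per-drive engine: scan local data, return hits only."""
    return [(drive_id, doc) for doc in documents if pattern.search(doc)]


def federated_search(shards, expression):
    """Broadcast one expression to every drive and merge the result lists."""
    pattern = re.compile(expression)
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = [
            pool.submit(search_drive, drive_id, docs, pattern)
            for drive_id, docs in enumerate(shards)
        ]
        results = []
        for future in futures:  # collect in drive order
            results.extend(future.result())
    return results


# Synthetic data: one routine entry per drive plus two planted matches.
shards = [[f"drive {i} routine log entry"] for i in range(NUM_DRIVES)]
shards[3].append("user typed FlAsH mEmOrY in the form")
shards[17].append("flash Memory summit notes")

hits = federated_search(shards, r"(?i)\bflash\s+memory\b")
```

The coordinating host only handles the expression and the merged hit list; the scanning work stays with the per-drive workers, which mirrors the minimal CPU and network demand claimed above.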
In fact, a recent large government customer of Lewis Rhodes Labs, working in the cyberforensics space, was able to see a reduction in search time from eight hours to just 18 seconds. It's my belief that we are just starting to see the true potential of computational storage, OK? And I believe there is a lot more to come in this space.
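For scale, the quoted reduction from eight hours to 18 seconds works out to a speedup factor that is easy to verify. The helper name below is arbitrary; the two durations are the ones stated in the talk.

```python
def speedup(before_seconds, after_seconds):
    """Ratio of the original run time to the accelerated run time."""
    return before_seconds / after_seconds


before = 8 * 60 * 60  # eight hours, in seconds
after = 18            # accelerated search time, in seconds
factor = speedup(before, after)
print(f"{factor:.0f}x faster")  # prints "1600x faster"
```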
14:57 GJ: Now, let's turn our attention to the increasing bandwidth available and how we can leverage it. Large-scale data centers today have moved towards a standard server template with fixed amounts of compute, storage, and accelerator resources, all connected on a high-performance network. This is great for the logistics of procuring and deploying hardware resources, but these fixed resources result in suboptimal utilization and constrain what developers can do. To solve this utilization problem, newer data center architectures are being developed in which compute, storage, and accelerators are being disaggregated.
15:47 GJ: So that, rather than provisioning a complete server, the necessary resources from each of the pools are picked and combined, then deployed to meet the demands of a certain workload. In many data centers, storage has already been disaggregated. And today, we're also seeing that even accelerators are being set up in pools. But I hope you notice that in the simplified diagram one of the most important parts of the server is missing: memory. After you remove the storage, according to an IHS Markit report, memory accounts for more than 50% of the value of the remaining silicon in a server.
In the simple view that I just presented in the previous slide, this memory that consumes so much hardware spending is just assumed to be a part of the CPU. And memory needs not only bandwidth; latency is also very critical. But wouldn't it be ideal if pools of memory could be provisioned as needed, shared, and extended with added features, the same way that storage is being managed today?
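The idea of composing a workload's resources from shared pools, including a memory pool, can be sketched as a small allocator. Everything here is hypothetical: the pool names, capacities, and the all-or-nothing rollback policy are assumptions for illustration and do not describe any particular orchestration product.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """A disaggregated resource pool with a fixed capacity."""
    name: str
    capacity: int
    allocated: int = 0

    def allocate(self, amount):
        if amount > self.capacity - self.allocated:
            raise RuntimeError(f"{self.name} pool exhausted")
        self.allocated += amount

    def release(self, amount):
        self.allocated -= amount


def compose(pools, request):
    """Reserve from every requested pool; roll back if any pool falls short."""
    reserved = []
    try:
        for name, amount in request.items():
            pools[name].allocate(amount)
            reserved.append((name, amount))
    except RuntimeError:
        for name, amount in reserved:
            pools[name].release(amount)
        return False
    return True


pools = {
    "cpu_cores": Pool("cpu_cores", 128),
    "memory_gb": Pool("memory_gb", 1024),
    "ssd_tb": Pool("ssd_tb", 100),
    "accelerators": Pool("accelerators", 8),
}

first = compose(pools, {"cpu_cores": 16, "memory_gb": 512, "accelerators": 2})
second = compose(pools, {"cpu_cores": 16, "memory_gb": 600})  # memory short
```

The second request fails because only 512 GB of memory remain, and the rollback returns its 16 cores to the pool, so resources are never stranded the way they can be in a fixed server template.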
17:09 GJ: With the higher PCIe bandwidth and higher-speed networking that is coming to market, we all know that bandwidth is certainly there. So, this Holy Grail of server disaggregation is coming down really to just a latency problem, isn't it? So, how are we addressing this latency?
Now, I want to talk a little bit about some of the enabling technologies that are addressing this latency. First, the emergence of storage class memory is offering a big reprieve on latency. Having latencies in the hundreds of nanoseconds makes it possible to have remote media without applications really noticing it. Now, increasing latency only makes sense if you get something in return. Fortunately, these new media offer both lower costs today and a path to continued reductions over time. With DRAM media having significant scaling challenges, the fact that these devices offer a path towards the future makes them even more important.
The second development and the other piece of the puzzle is the advent of a low-latency coherent technology, CXL. Now, Xilinx is on the CXL board, and we view CXL as a critical enabling technology.
18:40 GJ: It's been designed with low latency in mind from the beginning, and though it's built on PCIe 5.0, it can enable transfers that are about 10 times faster than PCIe. Additionally, the coherency aspects of CXL open up many interesting new use cases, where caches in the processors and accelerators can come into play together.
So, how is Xilinx enabling disaggregation? With our Versal adaptable device family, we have included everything that's needed to jump-start work in disaggregation. First, we have the high-speed programmable transceivers that can support PCIe, CXL, Ethernet, and other protocols; next, high-performance programmable I/Os that can be used to bring up both standard interfaces like DDR and new media types like SCM. And finally, there is the fabric that FPGAs are known for; this fabric opens up endless possibilities.
19:42 GJ: Protocols can be bridged, new standards can be implemented right as they are released or even while they are being developed, and value-added custom features can be implemented at hardware speeds, like the example I was talking about in . . . . We're working with several partners on some new and exciting designs in this space. And we hope to be able to share more details at future Flash Memory Summits.
Let's now turn our attention to the final megatrend, the movement of clouds to run on heterogeneous and distributed hardware. Before we look at what's happening now, it's always helpful to see where we have been. Before cloud architectures really took off, organizations had their own infrastructure, and they placed it in their private data centers, where physical security, the three Gs -- guards, gates, and guns -- was used for protection against unauthorized access. Security technology essentially was focused on any networking connections between these data centers and the untrusted internet.
20:53 GJ: Looking at today, there are predominantly two types of setups: private clouds and public clouds. In private clouds, a unified infrastructure team is responsible for both the external network security against the untrusted internet and the internal security policies that enable provisioning of the various resources to the . . . And I want to highlight that the three Gs still remain to protect physical access in this case.
And for public clouds, the story is similar, but with an important difference -- that is, being able to support multi-tenancy. And frankly, without trust in the security isolation provided by a public cloud, it's hard to imagine multi-tenancy. While there may be limited trust between workloads within an organization in a private cloud, it's very clear that public clouds are not possible with compromised security architectures. So, both for the purpose of isolating the management stack and for performance reasons, many clouds today are adopting separate hardware, for instance, in the form of, again, a SmartNIC that I'm showing here, and I'm just using the SmartNIC example a lot.
22:24 GJ: First, it provides added security by having isolated hardware that only the cloud provider's software can run on.
Second, it also offloads the processing involved in adding security to the shared resources, such as encrypting data that is stored remotely in a storage pool, or network link encryption to provide trusted connections over a shared network. While the infrastructure is still protected by the three Gs, the data-at-rest and data-in-motion encryption provided here requires much more sophisticated attacks to break this kind of infrastructure.
So, projecting into the future, where are we going? There is currently a shift underway where the distinction between public and private clouds is melting away; we all see that. The same management and security policies need to work seamlessly, whether a workload runs on-premises, in a public cloud, or in a mix of both. In many cases, this means extending the cloud management software out onto third-party managed hardware and letting the customer run their management software inside the cloud.
23:36 GJ: We already see this happening today in some use cases. And let's just step back and think about what this means as this concept expands and public cloud management software extends out to customer data centers. This management software now controls the keys to the kingdom, both the encryption keys and the access controls. The distributed nature of the hardware in this case also means that, from the management software's perspective, the three Gs of physical infrastructure protection are gone -- even the connections between the components and the server where the cloud security and provisioning software is running can no longer be trusted.
The good news is that establishing secure trusted connections is a problem that has been solved as part of the rapid growth of the internet. What needs to happen is that the control plane of the hardware security engine just needs to adopt the proven techniques to establish a trusted connection before exchanging keys or authorizing access to resources. Also, this must be extended across all the links, not just the network. This makes the need for both data-at-rest and data-in-motion security expand dramatically, and we are seeing that. Remember, the physical hardware security can no longer be . . . For example, data in the RAM chips could be considered data-in-motion, keychain buffers can be snooped, there are all kinds of scenarios that one must guard against.
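The principle of authenticating the other end before releasing any key can be shown with a deliberately simplified challenge-response exchange. This is a teaching sketch, not the mechanism used by any Xilinx device or cloud stack: a production system would typically rely on certificate-based mutual authentication such as TLS, and the `KeyManager` class, the device identifier, and the pre-shared secret below are all hypothetical.

```python
import hashlib
import hmac
import secrets


def respond(shared_secret, challenge):
    """Device side: prove knowledge of the secret without sending it."""
    return hmac.new(shared_secret, challenge, hashlib.sha256).digest()


class KeyManager:
    """Control-plane side: releases a data key only to authenticated devices."""

    def __init__(self, enrolled):
        self._enrolled = enrolled  # device_id -> pre-shared secret (assumption)
        self._pending = {}         # device_id -> outstanding one-time nonce

    def challenge(self, device_id):
        nonce = secrets.token_bytes(32)
        self._pending[device_id] = nonce
        return nonce

    def release_key(self, device_id, response):
        nonce = self._pending.pop(device_id, None)  # each nonce is single-use
        secret = self._enrolled.get(device_id)
        if nonce is None or secret is None:
            return None
        expected = respond(secret, nonce)
        if not hmac.compare_digest(expected, response):
            return None
        return secrets.token_bytes(32)  # fresh data-encryption key


manager = KeyManager({"nic-0": b"device-secret"})

nonce = manager.challenge("nic-0")
good_response = respond(b"device-secret", nonce)
key = manager.release_key("nic-0", good_response)

replayed = manager.release_key("nic-0", good_response)  # nonce already used
```

A device that answers the fresh challenge correctly receives a key, while a replayed or forged response gets nothing, which is the "trust first, keys second" ordering the talk calls for.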
25:25 GJ: I want to now talk about how Xilinx is helping with the enablement of confidential computing. We're very actively working with the industry in this area. Enabling the building blocks for secure and trusted design has been a key focus area for us.
Today, there is a platform controller in every Versal ACAP device that enables securely loading designs into our fabric as well as trusted bring-up of the embedded software. There is also a true random number generator for key generation, key management hardware for securely adding and revoking keys, and even physically unclonable functions that create a unique, unchangeable signature per device as each of these devices is being fabricated.
Another key focus area for Xilinx that I want to talk about is encryption. We have a heavy focus on providing encryption both at the IP level, for our customers to build their own solutions, and as turnkey solutions for storage and networking encryption. And we are continuing to work in this area, with the aim of having the industry move towards confidential computing in a zero-trust environment.
26:43 GJ: Cool. With this we've reached the end of my presentation, and now it's time to recap. We clearly covered a lot of ground today, but . . . to continue to get the performance improvements the IT industry is known for, architectures need to evolve. With computational storage becoming mainstream, there are already solutions available for data-intensive computing. Data center architectures are becoming disaggregated, and while the pieces are coming together to enable this transformation, Xilinx has the right offering to help the industry along.
Finally, security is becoming the very first design principle in almost all system architectures. Confidential computing in a zero-trust environment is possible as long as you have the right hardware support. With this, I conclude. I appreciate the time you have given me today, and look forward to meeting in person in the not-too-distant future. Thank you.
27:53 GJ: Now, let's take some questions from our audience. OK, here is a great question. "How are FPGAs different than other types of accelerators? What makes them well-suited for computational storage?" OK, that's a great question, and thank you for asking this, and it's very relevant to some of the material that I just presented. Let me share my thoughts on this.
First of all, unlike other accelerators, FPGAs do not have a fixed architecture, and this enables a developer to control how the data flows from one step to the next. So, rather than thinking about how to optimize their algorithm for a fixed architecture, they can think about how to optimize the architecture to meet the needs of the algorithm. And the reconfigurability of FPGAs provides programmability with hardware performance, an aspect I highlighted in some of the slides.
And the second part of the question is, "How are they suited for computational storage?" I'd say that, as for the future of computational storage, another broad strength that FPGAs bring to the table is the availability of high-performance programmable I/Os, and this is directly used in the computational storage use cases that I spoke about, where the accelerator . . . You want to fetch the data directly from storage or networking resources. So, I think it's a huge strength. Thank you.