Your Questions Answered on Persistent Memory, CXL, and Memory Tiering

With the persistent memory ecosystem continuing to evolve with new interconnects like CXL™ and applications like memory tiering, our recent Persistent Memory, CXL, and Memory Tiering-Past, Present, and Future webinar was a big success. If you missed it, watch it on demand HERE!

Many questions were answered live during the webinar, but we did not get to all of them. Our moderator Jim Handy from Objective Analysis, and experts Andy Rudoff and Bhushan Chithur from Intel, David McIntyre from Samsung, and Sudhir Balasubramanian and Arvind Jagannath from VMware have taken the time to answer them in this blog. Happy reading!

Q: What features or support is required from a CXL capable endpoint to e.g. an accelerator to support the memory pooling? Any references?

A: You will have two interfaces, one for the primary memory accesses and one for the management of the pooling device. The primary memory interface is the .mem and the management interface will be via the .io or via a sideband interface. In addition you will need to implement a robust failure recovery mechanism since the blast radius is much larger with memory pooling.

Q: How do you recognize weak information security (in CXL)?

A: CXL has multiple features around security and there is considerable activity around this in the Consortium. For specifics, please see the CXL Specification or send us a more specific question.

Q: If the system (e.g. x86 host) wants to deploy CXL memory (Type 3) now, is there any OS kernel configuration, BIO configuration to make the hardware run with VMWare (ESXi)? How easy or difficult this setup process?

A: A simple CXL Type 3 Memory Device providing volatile memory is typically configured by the pre-boot environment and reported to the OS along with any other main memory. In this way, a platform that supports CXL Type 3 Memory can use it without any additional setup and can run an OS that contains no CXL support and the memory will appear as memory belonging to another NUMA code. That said, using an OS that does support CXL enables more complex management, error handling, and more complex CXL devices.

Q: There was a question on ‘Hop” length. Would you clarify?

A: In the webinar around minute 48, it was stated that a Hop was 20ns, but this is not correct. A Hop is often spoken of as “Around 100ns.” The Microsoft Azure Pond paper quantifies it four different ways, which range from 85ns to 280ns.

Q: Do we have any idea how much longer the latency will be?

A: The language CXL folks use is “Hops.” An address going into CXL is one Hop, and data coming back is another. In a fabric it would be twice that, or four Hops. The latency for a Hop is somewhere around 100ns, although other latencies are accepted.

Q: For memory semantic SSD: There appears to be a trend among 2LM device vendors to presume the host system will be capable of providing telemetry data for a device-side tiering mechanism to decide what data should be promoted and demoted. Meanwhile, software vendors seem to be focused on the devices providing telemetry for a host-side tiering mechanism to tell the device where to move the memory. What is your opinion on how and where tiering should be enforced for 2LM devices like a memory semantic SSD?

A: Tiering can be managed both by the host and within computational storage drives that could have an integrated compute function to manage local tiering- think edge applications.

Q: Re VM performance in Tiering: It appears you’re comparing the performance of 2 VM’s against 1. It looked like the performance of each individual VM on the tiering system was slower than the DRAM only VM. Can you explain why we should take the performance of 2 VMs against the 1 VM? Is the proposal that we otherwise would have required those 2 VM’s to run on separate NUMA node, and now they’re running on the same NUMA node?

A: Here the use case was, lower TCO & increased capacity of memory along with aggregate performance of VM’s v/s running few VM’s on DRAM. In this use case, the DRAM per NUMA Node was 384GB, the Tier2 memory per NUMA node was 768GB. The VM RAM was 256GB.

In the DRAM only case, if we have to run business critical workloads e.g., Oracle with VM RAM=256GB, we could only run 1 VM (256GB) per NUMA Node (DRAM=384GB), we cannot over-provision memory in the DRAM only case as every NUMA node has 384GB only. So potentially we could run 4 such VM’s (VM RAM=256Gb) in this case with NUMA node affinity set as we did in this use case OR if we don’t do NUMA node affinity, maybe 5 such VM’s without completely maxing out the server RAM. Remember, we did NUMA node affinity in this use case to eliminate any cross NUMA latency.78

Now with Tier2 memory in the mix, each NUMA node has 384GB DRAM and 768GB Tier2 Memory, so theoretically one could run 16-17 such VM’s (VM RAM =256GB), hence we are able to increase resource maximization, run more workloads, increase transactions etc , so lower TCO, increased capacity and aggregate performance improvement.

Q: CXL is changing very fast, we have 3 protocol versions in 2 years, as a new consumer of CXL what are the top 3 advantages of adopting CXL right away v/s waiting for couple of more years?

A: All versions of CXL are backward compatible. Users should have no problem using today’s CXL devices with newer versions of CXL, although they won’t be able to take advantage of any new features that are introduced after the hardware is deployed.

Q: (What is the) ideal when using agilex FPGAs as accelerators?

A: CXL 3.0 supports multiple accelerators via the CXL switching fabric. This is good for memory sharing across heterogeneous compute accelerators, including FPGAs.

Thanks again for your support of SNIA education, and we invite you to write askcmsi@snia.org for your ideas for future webinars and blogs!

Leave a Reply