This Hacker News discussion revolves around Non-Uniform Memory Access (NUMA) architectures, the challenges they present for performance optimization, and various approaches to manage them.
The Necessity of Manual Pinning for High-Performance Workloads
A central theme is that applications demanding the highest throughput and lowest latency, especially in High-Performance Computing (HPC) or high-traffic system designs, often need their workloads manually pinned to specific CPU cores and NUMA nodes for optimal performance (a minimal pinning sketch follows the quotes below).
- "The long and short of it is that if you’re building an HPC application, or are sensitive to throughput and latency on your cutting-edge/high-traffic system design, then you need to manually pin your workloads for optimal performance." (stego-tech)
- stego-tech also noted, "I can count one time in my entire 15 years where I had to pin a production workload for performance, and it was Hyperion." This highlights that while necessary in some cases, it's not a common requirement for the average user.
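As a concrete illustration (not code from the thread), here is a minimal Linux sketch of the kind of pinning described above, assuming one worker thread per core and glibc's pthread_attr_setaffinity_np extension:

```c
/* Minimal pinning sketch: one worker per core, pinned before it starts.
   Build: gcc -pthread pin.c -o pin */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg) {
    long core = (long)arg;
    /* The per-core workload runs here; sched_getcpu() confirms placement. */
    printf("worker %ld on cpu %d\n", core, sched_getcpu());
    return NULL;
}

int main(void) {
    long ncores = sysconf(_SC_NPROCESSORS_ONLN);
    pthread_t tid[ncores];

    for (long i = 0; i < ncores; i++) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)i, &set);

        pthread_attr_t attr;
        pthread_attr_init(&attr);
        /* Pin thread i to core i so the scheduler never migrates it
           across cores, and therefore never across NUMA nodes. */
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);
        pthread_create(&tid[i], &attr, worker, (void *)i);
        pthread_attr_destroy(&attr);
    }
    for (long i = 0; i < ncores; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```

On a multi-socket machine the same cpu_set_t mechanism can hold all cores of one NUMA node rather than a single core, pinning a group of threads to a node instead of a core.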
Scalability Challenges of Manual Pinning
While manual pinning can yield significant performance gains, a key concern raised is that it does not scale: as CPU core counts and chiplet counts continue to grow, managing pinning by hand becomes increasingly complex and error-prone.
- "One thing the writeup didn’t seem to get into is the lack of scalability of this approach (manual pinning). As core counts and chiplets continue to explode, we still needbetter ways of scaling manual pinning or building more NUMA-aware OSes/applications that can auto-schedule with minimal penalties." (stego-tech)
Potential for Automation and Orchestration
Several users suggest that automation and orchestration platforms could play a role in managing NUMA awareness. Kubernetes is mentioned as a potential candidate if it were to implement NUMA-aware scheduling and affinity mechanisms.
- "This strikes me as something that Kubernetes could handle if it could support it. You can use affinity to ensure workloads stay together on the same machines, if K8s was NUMA aware, you could extend that affinity/anti-affinity mechanism down to the core/socket level." (jasonjayr)
- jasonjayr later clarified, "EDIT: aaaand ... I commented before reading the article, which describes this very mechanism."
NUMA Optimization as an "Edge Case" for Most Users
A counterpoint to the need for manual pinning is that for the majority of users and applications, manual NUMA tuning is an unnecessary optimization. Easier performance gains are usually available elsewhere, such as optimizing database queries or fixing inefficient code.
- "This is one of those way down the road optimizations for folks in fairly rare scale situations in fairly tight loops." (colechristensen)
- "Most of us are in the realm of the lowest hanging fruit being database queries that could be 100x faster and functions being called a million times a day that only need to be called twice." (colechristensen)
- "In 99% of use cases, there’s other, easier optimizations to be had. You’ll know if you’re in the 1% workload pinning is advantageous to." (stego-tech)
Ease of Pinning for Specific Workload Types
For applications that saturate entire CPUs with dedicated threads, pinning can be relatively straightforward; the difficulty lies in verifying that pinning actually helps, and in messier cases like network socket handling (a socket-steering sketch follows the quotes below).
- "Cpu pinning can be super easy too. If you have an application that uses the whole machine, you probably already spawn one thread per cpu thread. Pinning those threads is usually pretty easy. Checking if it makes a difference might be harder... For most applications, it won't make a big difference, but some applications will see a big difference." (toast0)
- "If you want to cpu pin network sockets, that's not as easy, but it can also make a big difference in some circumstances; mostly if you're a load balancer/proxy kind of thing where you don't spend much time processing packets, just receive and forward." (toast0)
Specific Use Cases Where NUMA Pinning is Critical
Users shared scenarios where NUMA awareness and pinning proved critical for performance, including software-defined networking projects and database management systems (a libnuma placement sketch follows the quotes below).
- "Yeah, I was once in this situation with a perf-focused software defined networking project. Pinning to the wrong NUMA node slowed it down badly." (frollogaston)
- "Probably another situation is if you're working on a DBMS itself." (frollogaston)
- "Last time I was architect of a network chip, 21 years ago, our library did that for the user. For workloads that use threads that consume entire cores, it's a solved problem." (ccgreg)
Emerging Solutions and Tools for NUMA Management
There's an ongoing effort within the Linux kernel and in specialized tools to improve NUMA handling. Projects like mpibind are cited as solutions already deployed in HPC environments, with interest in similar approaches for cloud workloads.
- "There are some solutions that try to tackle this in HPC. For example https://github.com/LLNL/mpibind is deployed on El Capitan. Would be interesting to see if something similar appears for cloud workloads." (PerryStyle)
- The discussion also references ongoing work in the Linux kernel related to NUMA locality and scheduling, with mentions of Phoronix as a source for such updates. (jauntywundrkind)
Alternatives to Manual Pinning: Uniformly Slow Architecture
An alternative strategy for avoiding NUMA-related complexity is to choose a configuration that makes memory access uniformly slow, effectively abstracting NUMA away (a software-level interleaving sketch follows the quotes below).
- "If auto-NUMA doesn't handle your workload well and you don't want to manually pin anything, it's always possible to use single-socket servers and set NPS=1. This will make everything uniformly 'slow' (which is not that slow)." (wmf)
- "Historically, the Sparc 6400 was derided for not being NUMA, but instead being Uniformly Slow." (ccgreg)
The Evolving Definition of NUMA and Future Architectures
The discussion closes on the increasing complexity of modern CPUs, where even a single socket can contain multiple NUMA zones due to chiplet designs; the sysfs sketch after the quotes below shows one way to inspect this. This trend suggests a future where architectures might move beyond NUMA entirely, with independent compute clusters communicating over high-speed interconnects.
- "Even today, the title here is woefully under-describing the problem. A Epyc chip is actually multiple different compute die, each with their own NUMA zone and their own L3 and other caches. For now yes each socket's memory is all via a single IO die & semi uniform, but whether that holds is in question, and even today, the multiple NUMA zones on one socket already require careful tuning for efficient workload processing." (jauntywundrkind)
- "I have this pocket belief that eventually we might see post NUMA post coherency architectures, where even a single chip acts more like multiple independent clusters, that use something more like networking (CXL or UltraEthernet or something) to allow RDMA, but without coherency." (jauntywundrkind)