NUMA (non-uniform memory access) is a memory architecture in which each processor or core has its own local memory that it can access with lower latency than the memory attached to other processors. For any one CPU, access time therefore varies depending on which memory node it reaches, which is where the "non-uniform" comes from. All processors in a NUMA multiprocessor setup still share one memory space and work together, much like a cluster of computers would, to handle memory-intensive tasks. The design optimizes for fast local access rather than for the speed of reaching remote memory.
NUMA differs from UMA (uniform memory access) architectures in that each processor in a NUMA setup reaches different regions of memory at different speeds. A CPU's access to its own local memory is very fast, and its access to memory on neighboring nodes is reasonably fast as well. However, reaching the memory attached to physically distant CPUs is considerably slower. In a uniform system, by contrast, every processor has equal access to memory through one central memory controller, and that single shared access point can become a slowdown at scale.
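To make this concrete, here's a minimal sketch, assuming a Linux system with libnuma installed (link with -lnuma), that asks the kernel which NUMA node the current CPU belongs to:

```c
/* Minimal libnuma topology query; assumes Linux with libnuma available.
 * Build with: gcc topo.c -lnuma */
#define _GNU_SOURCE
#include <numa.h>     /* numa_available(), numa_max_node(), numa_node_of_cpu() */
#include <sched.h>    /* sched_getcpu() */
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "This system does not expose a NUMA topology.\n");
        return 1;
    }

    int cpu  = sched_getcpu();            /* CPU we are currently running on */
    int node = numa_node_of_cpu(cpu);     /* NUMA node that CPU belongs to */

    printf("Highest NUMA node: %d\n", numa_max_node());
    printf("Running on CPU %d, which lives on node %d\n", cpu, node);
    return 0;
}
```

On a single-node machine this reports node 0 for every CPU; on a multi-socket server, CPUs report different nodes, which is the non-uniformity described above.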
NUMA first arrived on the scene in mainstream x86 servers with the AMD Opteron processor in 2003. Intel took some time to adapt its x86 CPU platform to the technology, waiting until 2007 to unveil support. Since then, processor technology has advanced to meet scalability needs for enterprises and everyday users alike.
How does NUMA (non-uniform memory access) work?
NUMA attempts to solve the memory access bottleneck that plagues UMA systems at scale. NUMA architectures distribute memory across nodes so that each CPU has fast local memory, avoiding contention on one central bus. This memory lives physically closer to its processor than in other architectures, which shortens internal pathways and helps both data and instructions travel more efficiently.
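In code, keeping data on the node where it will be used looks something like this minimal libnuma sketch (the 64 MiB size and the choice of node 0 are illustrative placeholders):

```c
/* Minimal node-local allocation sketch with libnuma on Linux.
 * Build with: gcc alloc.c -lnuma */
#include <numa.h>    /* numa_available(), numa_alloc_onnode(), numa_free() */
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0)
        return 1;

    size_t len = 64 * 1024 * 1024;    /* 64 MiB, an arbitrary example size */

    /* Ask for memory physically backed by node 0, so a thread running
     * on node 0's cores reaches it over the short local path. */
    char *buf = numa_alloc_onnode(len, 0);
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }

    memset(buf, 0, len);    /* touch the pages so they are actually faulted in */
    printf("Allocated %zu bytes preferring node 0\n", len);

    numa_free(buf, len);
    return 0;
}
```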
In practice, this means that a large number of CPUs (64, 128, or even more) can work in tandem on tasks within the same operating system. NUMA influences kernel space by shaping how CPU workloads are scheduled, and it delivers noticeable user experience improvements as well. However, NUMA can be challenging to get right at scale, especially as configurations grow larger, and it can be expensive. These tradeoffs may be worthwhile for demanding workloads where squeezing the utmost performance out of the system is important.
Each NUMA CPU, together with its local memory, forms a manageable node that the OS can take advantage of. Adding more CPU cores also doesn't guarantee better performance in this setup, since two cores spaced further apart will perform worse together than two spaced more closely. Software optimized for NUMA processors therefore favors performing tasks using threads, cores, and memory in close proximity to one another, as the sketch below illustrates.
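Here's one way that proximity-aware pattern can look with libnuma, as a minimal sketch rather than a production recipe: pin the calling thread to one node's CPUs, then allocate its working set from that same node (node 0 is an arbitrary illustrative choice; real code would pick a node per worker):

```c
/* Keep compute and memory together: pin the thread to a node, then
 * allocate from that same node. Build with: gcc local.c -lnuma */
#include <numa.h>    /* numa_available(), numa_run_on_node(), numa_alloc_onnode() */
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0)
        return 1;

    int node = 0;    /* illustrative choice; pick per worker in real code */

    /* Restrict this thread to the CPUs of one node... */
    if (numa_run_on_node(node) < 0) {
        fprintf(stderr, "numa_run_on_node failed\n");
        return 1;
    }

    /* ...and keep its working set in that node's local memory. */
    size_t len = 1 << 20;
    char *work = numa_alloc_onnode(len, node);
    if (!work)
        return 1;

    /* Operate on `work` here: every access stays on the short local path. */

    numa_free(work, len);
    return 0;
}
```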
What are the benefits of NUMA (non-uniform memory access)?
Compared with other multiprocessing systems, and especially older uniform designs, NUMA offers the following advantages:
At large scale, it's faster on average than uniform systems, which are prone to bottlenecks and lagging data transmission between memory and the CPU (the micro-benchmark sketch after this list shows one way to observe the local-versus-remote gap).
It offers more aggregate memory bandwidth, since distributed memory removes the limitation of one shared access point.
NUMA setups are highly scalable and can support concurrent, intensive workloads on massively multi-core systems.
Though not exclusive to NUMA, such setups allow for smarter memory management — especially as thread, core, and processor counts increase within the shared system.
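To make the speed difference tangible, here's a rough micro-benchmark sketch, assuming a Linux machine with at least two NUMA nodes and libnuma installed; touch_pages is an illustrative helper, and absolute numbers will vary widely by machine:

```c
/* Rough local-vs-remote access comparison; a sketch, not a rigorous benchmark.
 * Build with: gcc bench.c -O2 -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Walk the buffer one cache line at a time and return elapsed seconds. */
static double touch_pages(volatile char *buf, size_t len, int rounds)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < rounds; r++)
        for (size_t i = 0; i < len; i += 64)
            buf[i]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "Need a NUMA system with at least two nodes.\n");
        return 1;
    }

    size_t len = 256UL * 1024 * 1024;    /* 256 MiB working set */
    numa_run_on_node(0);                 /* execute on node 0's cores */

    char *local  = numa_alloc_onnode(len, 0);                /* same node */
    char *remote = numa_alloc_onnode(len, numa_max_node());  /* distant node */
    if (!local || !remote)
        return 1;
    memset(local, 0, len);               /* fault pages in up front */
    memset(remote, 0, len);

    printf("local:  %.3f s\n", touch_pages(local, len, 4));
    printf("remote: %.3f s\n", touch_pages(remote, len, 4));

    numa_free(local, len);
    numa_free(remote, len);
    return 0;
}
```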
Does HAProxy support NUMA?
Yes! HAProxy is designed to boost processing performance on multi-core, multi-processor systems and can apply smarter multi-threading by maintaining awareness of your CPU architecture. HAProxy supports other performance features such as automatic CPU binding (CPU affinity) to take full advantage of complex multi-processor systems with minimal manual configuration. HAProxy 3.3 (and onward) is better optimized for NUMA thanks to intelligent multithreading and caching improvements, effectively unlocking peak CPU performance within complex, scalable computing systems.
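For illustration only, a global section along these lines uses HAProxy's nbthread and cpu-map directives to pin threads to specific CPUs; the CPU numbers are placeholders for a topology where CPUs 0-7 share one NUMA node, and recent HAProxy versions can typically perform this kind of binding automatically without explicit mapping:

```
global
    # Run 8 threads and pin thread group 1's threads 1-8 to CPUs 0-7,
    # which in this illustrative layout all sit on the same NUMA node.
    nbthread 8
    cpu-map auto:1/1-8 0-7
```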
To learn more about NUMA support and related CPU optimizations in HAProxy, check out our How HAProxy takes advantage of multi-core CPUs blog.