Performance

Performance optimization for large systems

This page applies to:

  • HAProxy Enterprise 2.4r1 and newer
  • HAProxy 2.4 and newer

This page explores tuning the load balancer for use on large machines, or those that have large numbers of cores. Keep in mind that HAProxy is designed to be deployed in clusters and therefore scales well horizontally. It may be more cost effective to deploy the load balancer on many small machines instead of on a small number of large machines.

Using HAProxy Enterprise or HAProxy ALOHA?

If you’re using HAProxy Enterprise or HAProxy ALOHA and have questions about optimizing the performance of your load balancer, please contact HAProxy Technologies Support. We can help you determine the best settings for your specific system.

The load balancer runs in multithreaded mode automatically. This means that it can take advantage of systems with multiple CPUs by performing many processing tasks concurrently, distributing its computational load across many CPUs.

By default since version 2.4, the load balancer analyzes your system to determine how it should configure itself for your specific hardware so that it performs optimally on your machine. We’ve found in testing that the settings it determines provide the best performance out of the box on most systems, with no additional configuration or tuning required. We strongly recommend that you run with the automatic settings first before you try any of the configuration changes on this page.

However, if your system has:

  • more than 64 CPU threads that share only a few unified L3 caches
  • multiple physical CPU sockets
  • CPUs having multiple CCXs
  • CPUs having different types of cores, for example, “performance” cores and “efficiency” cores

Then you could potentially see better load distribution across your CPUs, better throughput, and ultimately more optimized performance by tuning how the load balancer interacts with your hardware. Keep in mind, though, that the configuration changes required are not something to implement lightly and without in-depth analysis of your system and traffic. They also require you to perform substantial benchmarking on your system with production-level traffic to ensure the settings are optimal for your system.

In sections that follow, we’ll cover the following:

  • How the load balancer uses multithreading to make the best use of system resources.
  • An overview of common CPU architectures and what they mean for the load balancer.
  • How the load balancer automatically applies the performance optimization settings it determines are the most appropriate for your hardware.
  • How to find more information about your specific hardware.
  • How you can tune the load balancer’s automatic CPU binding.
  • What to consider when adjusting the load balancer’s process management settings.
  • When to consider different configurations to optimize for performance.

Remember to test any performance tuning changes you make before you deploy them to your production environment, as some settings could cause performance degradation. Before making changes, ask these questions:

  • Are you running in a virtualized environment?
  • What are the characteristics of your traffic?
  • What are the specifics of your system’s hardware?

Concepts

In the subsections that follow, we’ll walk you through what being a multi-threaded application means for HAProxy, how it adapts based on CPU architecture, and how you can examine your system to determine how best to configure the load balancer.

Multithreading and the load balancer

The load balancer is a multi-threaded application. This means that it uses multiple threads within a single process to run tasks concurrently, with the threads sharing the same memory space. Because of this shared memory, the load balancer must ensure thread safety by coordinating access to memory through synchronization, which adds latency. This latency depends on the characteristics of your CPU topology: the distance between the threads sharing data dramatically affects it. Ranked from fastest to slowest, this communication can occur between two threads:

  • of the same core (this is the fastest)
  • of different cores, but belonging to the same CCX or node
  • of two cores of a unified L3 cache
  • of two cores belonging to different CCX or node
  • of two cores belonging to different CPU sockets (this is the slowest and should be avoided)

To illustrate these different distances between threads, the diagram below represents a NUMA system with 8 CCX and 32 hyperthreaded cores. NUMA (Non-Uniform Memory Access) is a computer memory design where subsets of memory and CPUs co-located physically are grouped into nodes, reducing memory access time by preferring that the CPUs access the memory closest to them, or within the same node.

CPU Diagram with 2 NUMA nodes

For each of the different distances, let’s consider the locations of threads in the diagram:

  • Two threads of the same core.
    • These are T0 and T1, located next to each other on each of the cores.
  • Two threads of the same CCX.
    • These are any two threads that belong to cores that share an L3 cache. For example, for L3 Cache Instance 0, these would be threads running on cores 0, 1, 2, and 3. Though not on the same core, they are still physically close.
  • Two threads of different CCX.
    • In this diagram, each node has 4 CCX. Any two threads running on cores of different CCX on the same node, for example core 0 and core 4, each have a different L3 cache but are still closely located, though not as closely as threads within the same CCX.
  • Two threads running on cores of different CPU sockets.
    • These would be any two threads that must communicate between Node 0 and Node 1 over a socket interconnect.
      • As this is a very expensive operation and should be avoided, you could consider assigning an additional NIC to the second socket (shown optionally in this diagram) to ensure that all communication between the load balancer threads and a NIC happens on the same CPU socket. This requires additional configuration, and we touch on this in more detail when we discuss considerations for a heavy SSL load.
      • As for DRAM memory, there is much higher latency associated with accessing a different node’s memory than accessing the memory local to the node. A thread running on Node 0 would experience much greater latency accessing the memory local to Node 1 than it would accessing the memory local to Node 0.

The diagram does not show the scenario of threads running on cores of a unified L3 cache; on such systems, all cores share one L3 cache instead of being divided into clusters or CCX.

The load balancer’s architecture allows it to minimize sharing data between threads where possible. Huge development efforts have been underway since version 2.4 to optimize the load balancer’s operations to avoid process-wide data sharing. Preferably, data should be shared only between very close threads, that is, threads that share an L3 cache, or those that belong to the same CCX or node. The most optimal scenario is not sharing data between threads at all, though some operations require it.

Introduced in version 2.7, the load balancer’s thread groups assign threads by locality into independent groups. By default, the load balancer operates with a single thread group. Using thread groups, the load balancer can limit communication between distant threads.
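
For illustration, here is a minimal sketch of defining two thread groups in the global configuration section. The thread counts are placeholders only; later sections show how to derive values from your actual topology. Note that setting these directives manually disables the load balancer’s automatic CPU binding:

haproxy
global
# Illustrative values: 32 threads split into two groups of 16
nbthread 32
thread-groups 2
thread-group 1 1-16
thread-group 2 17-32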

This concept of latency between threads is not universal across all systems. The load balancer behaves differently on different CPU architectures, as we’ll discuss next.

CPU architectures and automatic CPU binding

Differences in CPU architectures affect the latencies incurred when synchronizing memory access across threads. The load balancer’s automatic settings adjust based on the characteristics of the system.

The load balancer takes into account the following, ensuring that its threads are grouped in the most performant arrangement possible for your system:

  • On NUMA systems, the load balancer configures itself to use the cores of a single NUMA node. This avoids communicating across nodes, or worse, across CPU sockets. In our testing, we’ve found that the load balancer generally performs best when constrained to a single node, specifically the same one to which the NIC is assigned.
    • This effectively limits inter-CCX communication where possible, which otherwise introduces latency.
    • The load balancer tries to arrange threads per locality. The closer the threads, per the CPU on which they execute, the less latency there will be in inter-thread communication. Threads that share an L3 cache experience the lowest latency.
  • On systems that have a single, unified L3 cache, or on very large systems that use multiple physical CPUs, each with its own unified L3 cache, the load balancer will use all available cores on a single physical CPU. It defaults to using one thread per available core and assigns all threads to the same thread group. Thread groups have a limit of 64 threads, so using more than 64 threads requires defining multiple thread groups.

Since version 2.4, the load balancer implements these automatic settings on your specific system using a mechanism called automatic CPU binding. CPU binding is the assignment of specific thread sets to specific CPU sets, with the goal of optimizing performance on some systems. The load balancer determines the NUMA settings for your specific system, that is, whether your system is NUMA-aware and how many nodes it has, and limits itself to a single node.

As of version 3.2, the load balancer’s automatic CPU binding mechanism analyzes the entire CPU topology of your specific system in detail to determine how it should most optimally assign its threads. CPU topology includes CPU packages, the arrangement of NUMA nodes, CCX, L3 caches, cores, available threads, and related settings. The load balancer considers all of these things when it determines its settings.

Taking those considerations into account, there are a few things to note:

  • As of version 3.2, there are options for easily tuning the automatic behavior.
  • In terms of networking sockets: with the default configuration, where the load balancer runs on the first NUMA node only and creates a single thread group to run on the CPUs of that node, the load balancer creates a single socket per thread group per listener (in this case, one socket) to avoid the overhead associated with using multiple listening sockets. If you have large numbers of threads per thread group, a single socket per thread group per listener could cause contention between threads, so there are options for using more than one socket for the same address and port.
  • We’ve determined in our tests that the load balancer generally performs best when its threads execute on a single NUMA node, and as such, the default is to arrange the threads in this way. This means that on a system with several NUMA nodes and/or CPU sockets, the load balancer is probably not using all available hardware resources. In general, the performance costs associated with sharing data across nodes and CPU sockets are not worth what you might gain with additional threads and cores, but there are cases where this does not hold, depending on your system and traffic characteristics. We explore these cases in subsequent sections.

Prior to making adjustments, you must examine your system, as this information will inform what specific options you should use.

Tip

As of version 3.2, if you determine through examining your system that your CPUs have multiple L3 caches, you may benefit from using a cpu-policy other than the default. See tuning the load balancer’s automatic CPU binding for more information about these options.

Examine your system

Use the following commands and utilities to learn more about your system and configuration.

Learn more about your hardware

These commands return the hardware details of your system.

  • Retrieve details about your system’s available CPUs, cores, caches, and their arrangement.

    nix
    lscpu -e
  • Show the number of NUMA nodes on your system, which indicates whether your system is NUMA-aware.

    nix
    lscpu | grep -i numa
  • Show which NUMA node a NIC is bound to (nodes are numbered starting at 0). Replace <NIC name> with the name of your interface, such as eth0.

    nix
    cat /sys/class/net/<NIC name>/device/numa_node
  • Show the NUMA-related system log entries. The kernel logs these initialization details at startup.

    nix
    dmesg | grep -i numa
  • Show kernel configuration settings related to NUMA.

    nix
    cat /boot/config-$(uname -r) | grep NUMA
  • Show the online NUMA nodes.

    nix
    sudo cat /sys/devices/system/node/online
  • Show the CPUs associated with each NUMA node as well as the distances between nodes. You may need to install the numactl utility.

    nix
    sudo numactl --hardware
    Install the numactl utility

    The numactl utility can help gather information about the NUMA nodes on your system. Install it using the package manager for your system:

    nix
    sudo apt-get install numactl
    nix
    sudo yum install numactl
    nix
    sudo zypper install numactl
    nix
    sudo pkg install numactl

Examine the load balancer’s automatic configuration for your hardware

To see what settings the load balancer has determined are best for your system, you can run the load balancer with the -vv option:

nix
/opt/hapee-<VERSION>/sbin/hapee-lb -vv

Example for HAProxy Enterprise 3.1r1:

nix
/opt/hapee-3.1/sbin/hapee-lb -vv
nix
/usr/sbin/haproxy -vv
output
text
Built with multi-threading support (MAX_TGROUPS=16, MAX_THREADS=256, default=8).

You can also use the Runtime API command show info:

nix
echo "show info" | \
sudo socat stdio tcp4-connect:127.0.0.1:9999
output
text
Name: HAProxy
Version: 3.1.2-cda631a
Release_date: 2025/01/08
Nbthread: 22
[...]

View automatic CPU binding

Available since:

  • HAProxy 3.2

To see the results of the load balancer’s automatic CPU binding in action, run the load balancer with the -dc command line option. It will log the arrangement of threads, thread groups, and CPU sets that it has determined is optimal based on your CPU topology. Example:

text
CPU clusters:
0 cpus= 16 cores= 8 capa=1064
1 cpus= 16 cores= 8 capa=1064
2 cpus= 16 cores= 8 capa=1064
3 cpus= 16 cores= 8 capa=1064
Thread CPU Bindings:
Tgrp/Thr Tid CPU set
1/1-32 1-32 32: 0-15,64-79

Here the load balancer created a single thread group with 32 threads and assigned the group to CPUs 0-15 and 64-79.

For more information including a full example with corresponding output from the lscpu -e command and explanations for why the load balancer applied this particular configuration, see tuning the load balancer’s automatic CPU binding.

When to consider a thread configuration other than the default

You may need additional settings to tune the load balancer’s configuration for your specific system. Or you may want to manually control how thread sets are mapped to CPU sets, though we do not recommend it, as it is easy to misconfigure and is not portable between machines.

Case #1: You probably don’t need to do anything

We recommend that you try using the default, automatic settings first before you implement any of the options on this page. In our testing, the settings the load balancer determines automatically and specifically for your system generally provide the best performance. However, you can tune certain behaviors for your system if benchmarks show that the defaults do not meet your needs. The cases that follow describe situations where you may benefit from tuning the load balancer configuration for your specific system.

One such case: if your system has multiple CCX or L3 caches, you may see performance gains via additional options that optimize how the load balancer distributes threads among the CCX.

Case #2: Insufficient throughput under heavy load

Prerequisite - OS compatibility

Linux versions 3.9 and greater, or versions that otherwise support SO_REUSEPORT, are able to take advantage of binding multiple sockets to the same address and port.
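
To confirm which kernel version you’re running:

nix
uname -r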

Shards create identical sockets that enable multiple listeners, that is, bind directives, on the same address and port. By doing this, you’re using a feature of Linux that enables you to offload to the kernel the load balancing of connections among multiple sockets. This allows the load balancer’s threads to then work in parallel, processing connections across many sockets, rather than all threads competing for work from one socket. Using multiple sockets could potentially reduce contention between threads and increase performance.

NUMA best practice

Prior to version 3.2, our benchmarks determined that in most cases, a single socket per thread group per bind directive incurs less overhead than having multiple sockets for the same bind when run on a single NUMA node, and as such, this is the default behavior. Version 3.2 includes significant updates that allow for better scaling of the load balancer’s subsystems across multiple NUMA nodes to improve performance for CPU-intensive workloads, such as high data rates.

Remember, the load balancer’s default configuration creates a single thread group of threads that will run on the CPUs of the first NUMA node. If you decide to change these default settings for shards, be sure to thoroughly test on your system. The characteristics of your specific system and of your traffic will affect how changing these settings impacts performance.

To change the load balancer’s default shards settings, add the shards argument to your bind directive. For example, if you want to create exactly two listeners on the same address and port, specify 2 as the number of shards for that bind directive:

haproxy
frontend fe_main
bind 192.168.50.10:80 name website-1 shards 2

The work from both listeners will then be evenly distributed across all threads available to this frontend.

You can achieve the same effect as shards by manually replicating your bind directives instead of using the shards keyword, as long as your system supports binding multiple sockets to the same address and port, which also enables this functionality for earlier versions of the load balancer. For each additional, identical bind directive, the load balancer will create an additional socket (file descriptor), producing another listener for that address and port. Your configuration would then look something like:

haproxy
frontend fe_main
bind 192.168.50.10:80 name website-1
bind 192.168.50.10:80 name website-2
[...]
# additional bind directives with identical address:port depending on your desired number of listeners
[...]

You may want to use this syntax if there are additional bind options you wish to enable per listener.
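
For example, here is a sketch where each duplicated bind directive carries its own per-listener options, in this case pinning each listener to a different thread group (this assumes you have defined at least two thread groups, as shown later on this page):

haproxy
frontend fe_main
# Identical address:port; each listener gets its own socket and options
bind 192.168.50.10:80 name website-1 thread 1/all
bind 192.168.50.10:80 name website-2 thread 2/all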

Caution

Do not use shards in conjunction with port ranges. Each additional shard requires at least one additional file descriptor per port. If you specify that you want one shard per thread, the number of file descriptors then becomes number of shards x number of threads x number of ports. This could quickly use up all available file descriptors.

Also, if you set shards to a hardcoded number, as the example above shows, be sure to set this value to a number higher than the number of thread groups. By default, the load balancer creates a single thread group; changing the default settings could result in more thread groups. As of version 3.2, view the automatic CPU binding by running the load balancer with the -dc option to determine how many thread groups it is using.

The shards option has additional arguments you can use to further tune its behavior:

  • Available since HAProxy 2.8: Set shards by-group to create a number of shards equal to the number of thread groups. This is the default. We recommend this option if you are not defining a specific number of shards. This creates an identical listener socket per thread group that’s shared among all threads of that group. It’s computationally expensive for distant threads to access the same listener socket, so each thread group having its own is beneficial. Each new connection is handled by the least-loaded thread of the shard. This effectively load balances the connections across the threads, helping to better distribute the load across CPUs; this is particularly important for SSL/TLS connections, which are CPU intensive at connection initialization.
  • Available since HAProxy 2.8: You can also set your shards policy globally with the global directive tune.listener.default-shards. This setting applies to all listeners (binds). You can override this setting per bind directive if you have a listener that requires different settings.
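
For example, here is a minimal sketch of setting the policy globally and overriding it for a single listener (the addresses are illustrative):

haproxy
global
# Default sharding policy for all listeners
tune.listener.default-shards by-group
frontend fe_main
# Inherits by-group from the global setting
bind 192.168.50.10:80 name website-1
# Overrides the global default for this listener only
bind 192.168.50.10:8080 name website-2 shards 2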

Tip

If you are using version 2.5 or 2.6, use caution when using the shards keyword, as the load balancer cannot take advantage of thread groups (introduced in version 2.7). Be sure to test on your system to find the optimal configuration.

There is a by-thread option for shards that creates one shard per thread. The load balancer will automatically create a number of listener sockets that matches the number of threads available to that listener. Use caution with this option: a large number of threads results in a large number of listener sockets, and therefore file descriptors, and some operating systems limit the number of sockets that you can bind to the same address and port. You should probably use by-group instead, as shards by-thread also performs no load balancing of connections across the threads, which means that some threads may be more loaded than others.
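
For reference, the by-thread syntax looks like this (keeping the cautions above in mind):

haproxy
frontend fe_main
# One listener socket per thread available to this listener
bind 192.168.50.10:80 name website-1 shards by-thread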

Tip

Use the perf tool to see where your CPU utilization is high. With regard to shards, high CPU usage in the kernel’s native_queued_spin_lock_slowpath function indicates resource contention, which in this case could be the listener sockets.

nix
sudo perf top
output
text
Samples: 61K of event 'cpu-clock:ppp', 4000 Hz, Event count (approx.): 10088586179 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
35.39% [kernel] [k] native_queued_spin_lock_slowpath

Check your configuration to ensure you have as many or more shards than there are thread groups.

Case #3: Your system has many cores and only a few unified L3 caches

The load balancer supports using a maximum of 64 threads by default without further configuration. On larger Intel x86 or ARM systems, you can make use of more cores and threads by manually setting the global directive nbthread, which allows you to specify how many threads the load balancer should use, and by defining thread groups. Thread groups, introduced in version 2.7, allow you to split threads into independent groups that can contain a maximum of 64 threads each. Once you have defined your thread groups, assign them to the appropriate CPU sets using cpu-map.

NUMA best practice

On NUMA-aware systems, the latest versions of the load balancer intelligently determine the appropriate values for the number of threads (nbthread) and the number of thread groups (default: 1), and limit themselves to the first NUMA node to limit inter-CCX communication. We don’t recommend that you change these settings, as defining a value for nbthread in your configuration disables the load balancer’s automatic configuration that best optimizes these values for NUMA on your specific system.

Consider an example system where examination showed 128 CPUs that share L3 caches as follows:

CPU L3 Cache
0-63 0
64-127 1

This means that half of the CPUs share one L3 cache (0), and the other half share a different L3 cache (1). For the sake of simplicity of this example, this example system does not have hyperthreading.

We can define thread groups and assign them to sets of CPUs using cpu-maps while keeping the following in mind:

  • Group CPUs by locality.
    • In this example, half of the CPUs share one L3 cache and the other half share a different cache. In this case we could consider the CPUs to be grouped by L3 cache, which then gives us two groups:
      • Group 1: 0-63
      • Group 2: 64-127
    • Note that there is no directive specifically for defining these groups of CPUs; rather, this step is conceptual and will help you visualize how best to define your thread groups. The cpu-map directives that assign threads to CPU sets are the actual definition in the configuration.
  • Define thread groups such that they reflect the groups of CPUs (CPU sets).
    • Assuming that the CPUs are split into two groups, the minimum number of thread groups we would need is two, one for each group.
      • thread-group 1 1-64
      • thread-group 2 65-128
    • More than two groups of CPUs will require as many thread groups. If you grouped the CPUs such that there were four distinct groups, you would need four thread groups.
  • Assign no more threads to a CPU set than there are CPUs in that set, as that would cause some threads to compete for the same CPU. The load balancer will log a warning in this case:
    • This configuration binds 96 threads to a total of 64 CPUs via cpu-map directives. This means that some threads will compete for the same CPU, which will cause severe performance degradation. Please fix either the 'cpu-map' directives or set the global 'nbthread' value accordingly.
    • As there are 128 CPUs in this example, the maximum number of threads we can use is 128.
    • Since we created two thread groups, we can now assign them to CPU sets with the same number of CPUs, using the conceptual CPU grouping we determined earlier:
      • cpu-map 1/1-64 0-63
      • cpu-map 2/1-64 64-127

This example details a case where we wanted to use all cores on the system, and as such, needed to use more than 64 threads. Depending on whether you intend to use all available threads and all available cores or subsets of either affects how many threads you should put in each thread group. You will need to test different configurations on your system to determine the optimal configuration. Generally, you can put as many threads in a thread group as there are CPUs per CPU set for them to run on.
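
Putting these directives together, the global section for this example 128-CPU system would look like the following sketch (adjust the values to your own topology):

haproxy
global
# One thread per CPU on this example system
nbthread 128
thread-groups 2
# Threads 1-64 in group 1, threads 65-128 in group 2
thread-group 1 1-64
thread-group 2 65-128
# Pin each group to the CPUs that share an L3 cache
cpu-map 1/1-64 0-63
cpu-map 2/1-64 64-127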

Caution

If your system uses hyperthreaded cores, be sure to arrange your cpu-maps such that threads of the same thread group have the opportunity to run on the same core. For an example configuration on a system with hyperthreaded cores, see Example: Use cpu-policy to enable more threads.

As of version 2.8, all threads are available to all listeners, so you don’t have to add configuration settings to your listeners to enable the thread groups. However, to further refine how thread groups are applied to your listeners, or to use thread groups in version 2.7, you must reference the specific thread groups with your bind directives to assign the thread groups that should handle processing for each listener. For example, to use two thread groups in a single frontend, reference each of them on a separate bind directive:

haproxy
frontend fe_main
bind :80 thread 1/all
bind :80 thread 2/all
Want to enable multiple sockets per the same address and port?

If after defining thread groups and cpu-maps you also want to enable multiple sockets per the same address and port, you can add the shards keyword in conjunction with the thread keyword. Exercise caution with this, as it will create more file descriptors.

For example, if you want one of your listeners to create two sockets:

haproxy
global
nbthread 128
thread-groups 2
thread-group 1 1-64
thread-group 2 65-128
cpu-map 1/1-64 0-63
cpu-map 2/1-64 64-127
frontend web1
bind :80 thread 1/all shards 2
frontend web2
bind :81 thread 2/all

This example also shows how to map specific thread groups to specific listeners (binds). Of the two thread groups, the frontend web1 will use threads from the first, and the frontend web2 will use threads from the second. The frontend web1 will use two listener sockets, as denoted by shards 2, and the frontend web2 will use one, which is the default. You can assign any combination of thread groups and shards to your listeners, just be sure to test on your system for optimal performance depending on each listener’s traffic load.

Please note that this example is for illustrative purposes only, and you must create a custom configuration including an appropriate value for nbthread, thread groups, and cpu-maps that is tailored for your specific machine, or let the load balancer do it for you. Keep in mind as well that the arrangement of your NIC(s) may inform your configuration, as we’ve found that generally, except in cases where CPUs are saturated with NIC traffic, the load balancer sees the best performance when it runs on the same node as the NIC.

Case #4: You have a heavy SSL load

If you have a dual-socket CPU, we don’t recommend that you split processing across the physical sockets; instead, use all cores on a single physical CPU. This is because there’s significant overhead that results from communicating between the two physical CPUs. If, along with your dual-socket CPU, you also have a heavy SSL load, you could see an increase in performance by defining multiple thread groups and assigning them intentionally to each physical CPU using cpu-map, where the thread groups on one CPU handle SSL/TLS operations and the thread groups on the other CPU handle non-SSL/TLS operations.

Keep in mind the following when defining your thread groups and cpu-maps:

  • You may see better performance assigning only the threads that will process SSL/TLS traffic to the node to which the NIC is assigned. That way, that node is dedicated to NIC operations and SSL operations. The rest of the threads can run on another node and not be impacted by these expensive operations.
  • Examine your system to see which node(s) the NIC(s) are assigned to. This may inform how you configure the threads.
  • You may see a performance increase when using multiple NICs, with each assigned to only one node or physical CPU socket. In this configuration, group your load balancer threads such that they will process only the traffic for a particular node’s NIC.
    • The goal with this is to avoid cross-NIC traffic. For example:
      • One listener uses threads bound to the first node. You should configure your listener to only process traffic for the addresses associated with the NIC bound to the first node.
      • Another listener uses threads bound to the second node. You should configure this listener to only process traffic for addresses associated with the NIC bound to the second node.

Be sure to test this on your system to see if the performance gain of having additional CPUs available for SSL/TLS operations outweighs the performance cost of inter-socket or inter-CCX communication.
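
As an illustration only, here is a hedged sketch of this arrangement on a hypothetical dual-socket system. The CPU numbering, addresses, frontend names, and certificate path are assumptions; your thread-group and cpu-map values must match your own topology:

haproxy
global
nbthread 128
thread-groups 2
thread-group 1 1-64
thread-group 2 65-128
# Assumed topology: CPUs 0-63 on socket 0 (the NIC's node), CPUs 64-127 on socket 1
cpu-map 1/1-64 0-63
cpu-map 2/1-64 64-127
frontend fe_tls
# SSL/TLS terminates on the socket local to the NIC
bind 192.168.50.10:443 ssl crt /path/to/cert.pem thread 1/all
frontend fe_plain
# Non-SSL traffic runs on the other socket
bind 192.168.50.10:80 thread 2/all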

Case #5: Your system has multiple types of cores

This section applies to:

  • HAProxy 3.2 and newer

If your system is heterogeneous, in that it has multiple types of cores, you may want the load balancer to use only one type of core. For example, your system may have both performance cores and efficiency cores, or “big” and “little” cores. Performance cores are designed for demanding workloads, whereas efficiency cores are designed to prioritize power savings for light-weight tasks.

Set cpu-policy to performance to use only performance cores, or set it to efficiency to use only efficiency cores.

You could see a performance boost on such systems by adding the following to your global configuration section:

haproxy
global
cpu-policy performance

For more information about cpu-policy, see tuning the load balancer’s automatic CPU binding.

Case #6: Your system has multiple CCX or L3 caches

If your system has multiple CCX, such as with AMD EPYC, or multiple L3 caches, you could see performance gains by defining a thread group per CCX. In other words, you can pin thread groups, using cpu-map, to cores that share the same L3 cache to optimize performance.

This is much easier in version 3.2

To accomplish this in version 3.2, add one of the following cpu-policy settings to your global configuration section:

Either performance, where the load balancer will enable all available threads and organize them into efficient thread groups (grouping them by shared L3 cache):

haproxy
global
cpu-policy performance

Or group-by-ccx, where the load balancer will group threads per CCX.

haproxy
global
cpu-policy group-by-ccx

As this could then use multiple nodes on your system, be sure to test the change to make sure that it’s more performant: though thread groups help to reduce data-sharing latency between threads, inter-CCX communication between threads could introduce latency.

For versions 2.7 to 3.1, you will need to define your configuration manually. Keep in mind the following when defining your thread groups and cpu-maps:

  • You should group threads such that they reflect the arrangement of your CCXs. For example, if your system has four CCX, you should probably have four thread groups.
  • You should assign the thread groups to CPU sets that are grouped by CCX.
  • Be sure to create no more threads than there are CPUs, as that would cause some threads to compete for the same CPU.
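
For example, here is a sketch for a hypothetical system with 32 CPUs arranged as four CCX of eight CPUs each (adapt the numbers to your own topology):

haproxy
global
# One thread per CPU on this hypothetical system
nbthread 32
thread-groups 4
thread-group 1 1-8
thread-group 2 9-16
thread-group 3 17-24
thread-group 4 25-32
# Each group pinned to the CPUs of one CCX (one shared L3 cache)
cpu-map 1/1-8 0-7
cpu-map 2/1-8 8-15
cpu-map 3/1-8 16-23
cpu-map 4/1-8 24-31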

Tuning the automatic CPU binding

This section applies to:

  • HAProxy 3.2 and newer

In this section, we’ll walk through an example that explains how to gather information about your system and use it to inform how you can tune the load balancer’s automatic CPU binding behavior. Consider the case where you’re running the load balancer on a NUMA-aware system and have decided that you’d like the load balancer to use the CPUs from more than one NUMA node, rather than the default of a single node.

Prior to version 3.2, accomplishing this required very specific management and mapping of threads and cores using cpu-maps, thread groups, and other process-management-related directives like nbthread. The drawbacks for configuring the binding manually are:

  • Each system requires very specific settings.
  • It’s dangerously easy to configure the CPU bindings in ways that detrimentally impact performance.
  • The configuration becomes more complex.
  • The configuration may not be portable across all of your machines.

Version 3.2 introduced a middle ground between the default, automatic configuration and complex manual configurations: two simple directives, cpu-policy and cpu-set, let you tune how the load balancer applies its automatic CPU binding.

The global directive cpu-policy, which defaults to first-usable-node, offers you flexibility in how the load balancer arranges its threads across CPUs. The default behavior is that it uses the first NUMA node with all its available CPUs, and it creates a matching number of threads in a single thread group. This behavior was introduced as the default in version 2.4 and remains the default behavior through version 3.2.

However, if you have more than one node on your system, you may want the load balancer to create threads that will run on those nodes as well. In that case, you can set cpu-policy to one of the following, depending on the needs of your system:

  • group-by-cluster: The load balancer will create one thread group for each CPU cluster. We recommend this option for taking advantage of multiple NUMA nodes and for systems having multiple CPU sockets.
  • Three more options, group-by-2-clusters, group-by-3-clusters, and group-by-4-clusters, could potentially provide better thread grouping for your system than group-by-cluster. In each case, threads are grouped by sets of two, three, or four clusters, which could spread the load across clusters. Be sure to benchmark this on your system, as communication between clusters could have negative effects. We recommend you try group-by-cluster first.

On most large server systems, cluster and CCX are synonymous. However, there may be some systems where this is not the case. On such a system, grouping CPUs by CPU cluster may be detrimental to performance if the CPUs are not grouped by their shared L3 caches. In this case, you can set cpu-policy instead to:

  • group-by-ccx: This is similar to group-by-cluster, except that CPUs are grouped by shared last level cache, usually the L3 cache.
  • Similarly to the group by cluster options, you can also group by two, three, or four CCX using group-by-2-ccx, group-by-3-ccx, and group-by-4-ccx, respectively.

Additional policies are available to further restrict which CPUs the load balancer will use:

  • resource: Use this option when you need to restrict the load balancer to the smallest and most efficient CPU cluster for cost or power savings.
  • efficiency: Use this option when you’re running on a system that has distinct performance and efficiency cores. In this case, the load balancer doesn’t use the most performant and powerful cores, saving them for other CPU-intensive operations.
  • performance: Similar to the efficiency setting except it does the opposite; the load balancer won’t use efficiency cores, or cores whose capacity is considerably less than other cores, as their inclusion could prove to be counterproductive if you’re expecting equal performance across cores.
  • none: This disables all of the automatic detection and enables as many threads, in a single thread group, as there are available CPUs. We don’t recommend using this option without careful consideration and analysis of your system, as it could have dramatically negative impacts on performance, as threads won’t be grouped by locality and could share data across CCXs, or worse, CPU sockets.

There may be cases where you don’t want the load balancer to use specific CPUs, or you want the load balancer to run only on specific CPUs. You can use cpu-set for this. It allows you to symbolically notate which CPUs you want the load balancer to use. It also includes an option reset that will undo any limitation put in place on the load balancer, for example by taskset.

Use drop-cpu <CPU set> to specify which CPUs to exclude or only-cpu <CPU set> to include only the CPUs specified. You can also set this by node, cluster, core, or thread instead of by CPU set. Once you’ve defined your cpu-set, the load balancer then applies your cpu-policy to assign threads to the specific CPUs.

For example, if you want the load balancer to run only on node 0 and bind only one thread per core (the first hardware thread of each core), you can set cpu-set as follows:

haproxy
global
cpu-set only-node 0 only-thread 0

You can then use the default cpu-policy or choose which one you want the load balancer to use.
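
For instance, here is a sketch that combines cpu-set with a non-default policy; the assumption here is that you want to exclude CPUs 0-3 (perhaps reserved for other workloads) and group the remaining CPUs by shared L3 cache:

haproxy
global
# Remove CPUs 0-3 from the usable set
cpu-set drop-cpu 0-3
# Group threads on the remaining CPUs by CCX
cpu-policy group-by-ccx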

Info

If you include nbthread, thread groups, or cpu-maps in your configuration, the automatic behaviors are disabled and the load balancer ignores cpu-policy and cpu-set.

Example: Use cpu-policy to enable more threads

To see the automatic CPU binding in action, run the load balancer with the -dc command line option, also introduced in version 3.2. By setting this argument, the load balancer will log the arrangement of threads, thread groups, and their associated CPUs that it has determined is optimal based on your CPU topology. Changing the values of cpu-policy and cpu-set should show changes in the arrangement, depending on the policy and/or the specific CPU sets. Keep in mind cpu-set acts first to eliminate cores you want to exclude and then cpu-policy arranges threads onto the remaining CPUs.

Here is the output the load balancer logs when run with the -dc option on our example system:

text
CPU clusters:
0 cpus= 16 cores= 8 capa=36800
1 cpus= 16 cores= 8 capa=36800
2 cpus= 16 cores= 8 capa=36800
3 cpus= 16 cores= 8 capa=36800
4 cpus= 16 cores= 8 capa=36800
5 cpus= 16 cores= 8 capa=36800
6 cpus= 16 cores= 8 capa=36800
7 cpus= 16 cores= 8 capa=36800
Thread CPU Bindings:
Tgrp/Thr Tid CPU set
1/1-64 1-64 128: 0-127

Without additional configuration, the load balancer uses the default cpu-policy, first-usable-node, creates a single thread group with 64 threads (the maximum number of threads for a single thread group), and assigns the group to all of the available CPUs of the first node. As all of these CPUs are part of the same node and physical socket, this arrangement of threads should be sufficient for our needs, but we’ll examine our system using the lscpu -e command to see if there are other configurations we could consider.

nix
lscpu -e
output
text
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
0 0 0 0 0:0:0:0 yes
1 0 0 1 1:1:1:0 yes
2 0 0 2 2:2:2:0 yes
3 0 0 3 3:3:3:0 yes
4 0 0 4 4:4:4:0 yes
5 0 0 5 5:5:5:0 yes
6 0 0 6 6:6:6:0 yes
7 0 0 7 7:7:7:0 yes
8 0 0 8 8:8:8:1 yes
9 0 0 9 9:9:9:1 yes
10 0 0 10 10:10:10:1 yes
11 0 0 11 11:11:11:1 yes
12 0 0 12 12:12:12:1 yes
13 0 0 13 13:13:13:1 yes
14 0 0 14 14:14:14:1 yes
15 0 0 15 15:15:15:1 yes
16 0 0 16 16:16:16:2 yes
17 0 0 17 17:17:17:2 yes
18 0 0 18 18:18:18:2 yes
19 0 0 19 19:19:19:2 yes
20 0 0 20 20:20:20:2 yes
21 0 0 21 21:21:21:2 yes
22 0 0 22 22:22:22:2 yes
23 0 0 23 23:23:23:2 yes
24 0 0 24 24:24:24:3 yes
25 0 0 25 25:25:25:3 yes
26 0 0 26 26:26:26:3 yes
27 0 0 27 27:27:27:3 yes
28 0 0 28 28:28:28:3 yes
29 0 0 29 29:29:29:3 yes
30 0 0 30 30:30:30:3 yes
31 0 0 31 31:31:31:3 yes
32 0 0 32 32:32:32:4 yes
33 0 0 33 33:33:33:4 yes
34 0 0 34 34:34:34:4 yes
35 0 0 35 35:35:35:4 yes
36 0 0 36 36:36:36:4 yes
37 0 0 37 37:37:37:4 yes
38 0 0 38 38:38:38:4 yes
39 0 0 39 39:39:39:4 yes
40 0 0 40 40:40:40:5 yes
41 0 0 41 41:41:41:5 yes
42 0 0 42 42:42:42:5 yes
43 0 0 43 43:43:43:5 yes
44 0 0 44 44:44:44:5 yes
45 0 0 45 45:45:45:5 yes
46 0 0 46 46:46:46:5 yes
47 0 0 47 47:47:47:5 yes
48 0 0 48 48:48:48:6 yes
49 0 0 49 49:49:49:6 yes
50 0 0 50 50:50:50:6 yes
51 0 0 51 51:51:51:6 yes
52 0 0 52 52:52:52:6 yes
53 0 0 53 53:53:53:6 yes
54 0 0 54 54:54:54:6 yes
55 0 0 55 55:55:55:6 yes
56 0 0 56 56:56:56:7 yes
57 0 0 57 57:57:57:7 yes
58 0 0 58 58:58:58:7 yes
59 0 0 59 59:59:59:7 yes
60 0 0 60 60:60:60:7 yes
61 0 0 61 61:61:61:7 yes
62 0 0 62 62:62:62:7 yes
63 0 0 63 63:63:63:7 yes
64 0 0 0 0:0:0:0 yes
65 0 0 1 1:1:1:0 yes
66 0 0 2 2:2:2:0 yes
67 0 0 3 3:3:3:0 yes
68 0 0 4 4:4:4:0 yes
69 0 0 5 5:5:5:0 yes
70 0 0 6 6:6:6:0 yes
71 0 0 7 7:7:7:0 yes
72 0 0 8 8:8:8:1 yes
73 0 0 9 9:9:9:1 yes
74 0 0 10 10:10:10:1 yes
75 0 0 11 11:11:11:1 yes
76 0 0 12 12:12:12:1 yes
77 0 0 13 13:13:13:1 yes
78 0 0 14 14:14:14:1 yes
79 0 0 15 15:15:15:1 yes
80 0 0 16 16:16:16:2 yes
81 0 0 17 17:17:17:2 yes
82 0 0 18 18:18:18:2 yes
83 0 0 19 19:19:19:2 yes
84 0 0 20 20:20:20:2 yes
85 0 0 21 21:21:21:2 yes
86 0 0 22 22:22:22:2 yes
87 0 0 23 23:23:23:2 yes
88 0 0 24 24:24:24:3 yes
89 0 0 25 25:25:25:3 yes
90 0 0 26 26:26:26:3 yes
91 0 0 27 27:27:27:3 yes
92 0 0 28 28:28:28:3 yes
93 0 0 29 29:29:29:3 yes
94 0 0 30 30:30:30:3 yes
95 0 0 31 31:31:31:3 yes
96 0 0 32 32:32:32:4 yes
97 0 0 33 33:33:33:4 yes
98 0 0 34 34:34:34:4 yes
99 0 0 35 35:35:35:4 yes
100 0 0 36 36:36:36:4 yes
101 0 0 37 37:37:37:4 yes
102 0 0 38 38:38:38:4 yes
103 0 0 39 39:39:39:4 yes
104 0 0 40 40:40:40:5 yes
105 0 0 41 41:41:41:5 yes
106 0 0 42 42:42:42:5 yes
107 0 0 43 43:43:43:5 yes
108 0 0 44 44:44:44:5 yes
109 0 0 45 45:45:45:5 yes
110 0 0 46 46:46:46:5 yes
111 0 0 47 47:47:47:5 yes
112 0 0 48 48:48:48:6 yes
113 0 0 49 49:49:49:6 yes
114 0 0 50 50:50:50:6 yes
115 0 0 51 51:51:51:6 yes
116 0 0 52 52:52:52:6 yes
117 0 0 53 53:53:53:6 yes
118 0 0 54 54:54:54:6 yes
119 0 0 55 55:55:55:6 yes
120 0 0 56 56:56:56:7 yes
121 0 0 57 57:57:57:7 yes
122 0 0 58 58:58:58:7 yes
123 0 0 59 59:59:59:7 yes
124 0 0 60 60:60:60:7 yes
125 0 0 61 61:61:61:7 yes
126 0 0 62 62:62:62:7 yes
127 0 0 63 63:63:63:7 yes

This output shows that our system has:

  • One NUMA node (numbered 0)
  • One physical CPU socket (numbered 0)
  • CPUs grouped in sets of eight, each set sharing an L3 cache (the caches numbered 0-7)
  • 64 hyperthreaded cores, supporting two threads per core to total 128 CPUs.
    • In order, CPUs 64-127 belong to cores 0-63.

The arrangement of the CPUs per their L3 cache is as follows:

CPU L3 Cache
0-7, 64-71 0
8-15, 72-79 1
16-23, 80-87 2
24-31, 88-95 3
32-39, 96-103 4
40-47, 104-111 5
48-55, 112-119 6
56-63, 120-127 7

With this information, we could consider some adjustments to the settings the load balancer has determined automatically, if testing shows that we are experiencing uneven or high CPU load.

These adjustments include:

  • Create multiple thread groups such that they are organized by CPUs that share an L3 cache.
    • This should reduce the communication latency between threads, as this will help minimize sharing data between distant threads.
    • It will also allow the load balancer to use more than 64 threads.
  • Assign the thread groups to CPUs such that the threads can make the best use of the hyperthreaded cores.
    • The communication latency between threads is the lowest between two threads of the same core.

We can add this line to the global section of the load balancer configuration to change the cpu-policy:

haproxy
global
cpu-policy group-by-cluster

The load balancer now applies the group-by-cluster policy, which creates as many threads as there are CPUs and divides the threads into one thread group per CPU cluster. Each CPU cluster in this case is a group of CPUs that share an L3 cache. From what we saw in our system analysis, we would now expect the load balancer to create 128 threads and divide them into groups of CPUs that share L3 caches. The load balancer logs the change in the thread arrangement:

text
CPU clusters:
0 cpus= 16 cores= 8 capa=36800
1 cpus= 16 cores= 8 capa=36800
2 cpus= 16 cores= 8 capa=36800
3 cpus= 16 cores= 8 capa=36800
4 cpus= 16 cores= 8 capa=36800
5 cpus= 16 cores= 8 capa=36800
6 cpus= 16 cores= 8 capa=36800
7 cpus= 16 cores= 8 capa=36800
Thread CPU Bindings:
Tgrp/Thr Tid CPU set
1/1-16 1-16 16: 0-7,64-71
2/1-16 17-32 16: 8-15,72-79
3/1-16 33-48 16: 16-23,80-87
4/1-16 49-64 16: 24-31,88-95
5/1-16 65-80 16: 32-39,96-103
6/1-16 81-96 16: 40-47,104-111
7/1-16 97-112 16: 48-55,112-119
8/1-16 113-128 16: 56-63,120-127

Indeed, the load balancer created 128 threads and divided them among 8 thread groups, where each thread group will execute on sets of CPUs that share an L3 cache. This arrangement also takes advantage of the hyperthreaded cores, as CPUs of the same cores are placed together in the groups. This means that when the threads share data within the group, there is a good chance that two threads sharing data are executing on the same physical core.

Tip

Though this configuration appears to be more performant, be sure to test it on your system if you decide to use it.

All of this is accomplished by the addition of a single line to the configuration. The example below shows the corresponding manual configuration for this specific system. A configuration such as this would be required instead on versions 2.7-3.1, which can’t use the cpu-set and cpu-policy directives.

Example: Manually configure cpu-maps

In the example above, we used a single global configuration directive to change the automatic CPU binding behavior on our system. On previous versions (2.7-3.1), the same changes would require the following:

haproxy
global
nbthread 128
thread-groups 8
thread-group 1 1-16
thread-group 2 17-32
thread-group 3 33-48
thread-group 4 49-64
thread-group 5 65-80
thread-group 6 81-96
thread-group 7 97-112
thread-group 8 113-128
cpu-map 1/1-16 0-7,64-71
cpu-map 2/1-16 8-15,72-79
cpu-map 3/1-16 16-23,80-87
cpu-map 4/1-16 24-31,88-95
cpu-map 5/1-16 32-39,96-103
cpu-map 6/1-16 40-47,104-111
cpu-map 7/1-16 48-55,112-119
cpu-map 8/1-16 56-63,120-127

This configuration includes the following:

  • Our system analysis showed we can use 128 threads, as we have 128 CPUs. We set the number of threads to 128 using nbthread.
  • As there are 8 groups of 16 CPUs that share a cache (for example CPUs 0-7 and 64-71 use L3 cache 0), we define 8 thread groups of 16 threads each.
  • We define CPU sets using cpu-map:
    • Each cpu-map pins 16 threads to as many CPUs. The 16 threads can run on any of the CPUs in the set.
    • The CPUs in each set share an L3 cache. This also groups the CPUs by core, which ensures that the threads that run on those CPUs make the best use of the hyperthreaded cores.

Troubleshooting and pitfalls

Keep in mind the following when manually changing the load balancer’s process management settings:

  • In the vast majority of cases, the configuration the load balancer determines automatically for your system will provide the best performance.
  • We’ve observed the best performance when assigning thread sets, defined in thread groups, to CPU sets versus assigning one particular thread to one particular CPU. This is because if the single CPU on which a thread is allowed to run is otherwise occupied by another operation outside of the load balancer, the thread may have to wait.
  • If you define a value for nbthread, thread-groups or cpu-map in your configuration, this disables the load balancer’s automatic configuration that best optimizes its settings for NUMA.
  • Only threads in the same thread group can work together. Threads in the same group should not be split across multiple NUMA nodes or CPU sockets, as this would cause performance degradation.
  • If you don’t explicitly define your thread groups, that is, which threads belong to each group, and instead let the load balancer distribute them equally, the number of thread groups you define must be a divisor of the value you specify for nbthread in order for the threads to divide equally among the groups (see the sketch after this list).
  • The load balancer can use a maximum of 64 threads automatically without further configuration. A thread group can consist of at most 64 threads, and you can define up to 16 thread groups, which enables you to use more threads on systems that support it.
  • The number of threads you define for nbthread and assign to thread groups must not exceed the number of CPUs to which you assign them in your cpu-maps, as discrepancies, such as defining a number of threads larger than the number of cores, can cause a significant decrease in performance.
  • If you’re using more than 32 CPUs and are also using Lua scripts, you must use lua-load-per-thread instead of lua-load. With lua-load, the script runs on a single CPU at a time because it can share state and therefore must lock, which stalls the other CPUs.
  • In the case of hyperthreaded cores or vCPUs, threads should be grouped by physical CPU, that is, a thread should be able to run on either a physical core or its virtual counterpart, as the communication between two threads of the same core has the lowest latency.
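
To illustrate the divisor rule mentioned above, here is a minimal sketch where the thread count divides evenly among implicitly defined groups:

haproxy
global
# 12 threads divide evenly among 3 groups (4 threads per group);
# a value such as 10 would not divide evenly among 3 groups
nbthread 12
thread-groups 3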

See also

  • For complete information on the shards bind option including syntax and options, see shards
  • For complete information on the cpu-map global directive including syntax, options, and more examples, see cpu-map
  • For complete information on the thread-groups global directive including syntax and options, see thread-groups
  • For complete information on the thread-group global directive, see thread-group
  • For complete information on the thread bind option, see thread
  • For complete information on the nbthread global directive, see nbthread
  • For complete information on the tune.listener.default-shards global directive, see tune.listener.default-shards
  • For complete information on the lua-load-per-thread global directive including syntax and options, see lua-load-per-thread
  • For complete information on the lua-load global directive syntax and options, see lua-load
