There are two possible ways to have HAProxy run on multiple CPU cores:

  1. By using the multiprocess model, where HAProxy automatically starts a number of separate system processes (method available since HAProxy version 1.1.7)
  2. By using the multithreading model, where HAProxy automatically starts a number of threads within a single process (method available since HAProxy version 1.8)

The traditional multiprocess approach currently achieves better performance, but the new multithreading model solves all of the limitations typically associated with multiprocess configurations and could certainly be interesting for early adopters who prefer ease of management over maximum performance.

The choice of the method is also somewhat dependent on the specific user needs and configuration. We know that SSL offloading and HTTP compression scale well in a multithreading model, at least on a relatively small number of threads (2 to 4). For other uses or a larger number of threads, we are still in the process of gathering definitive benchmarks and experiences.

In this blog post, we are going to take you on a tour of the multithreading functionality in HAProxy 1.8. We will provide you with background information, configuration instructions, a more detailed technical overview, and some debugging tips.

So let’s start!

From Multiprocess to Multithreading

Starting with HAProxy version 1.1.7 released in 2002, it has been possible to automatically start multiple HAProxy processes. This was done using the configuration directive “nbproc”, and later individual processes were also mapped to individual CPU cores using “cpu-map”.

These multiprocess configurations were a standard way to scale users’ workloads and, with the correct settings, each individual process was able to take full advantage of the high-performance, event-driven HAProxy engine.

Multiprocess configurations also had additional, specialized uses. For example, they were the configuration of choice for massive SSL offloading solutions. The general recipe for SSL offloading was:

  1. Dedicate all but one HAProxy process to offloading SSL traffic
  2. Have all those processes send decrypted traffic to the remaining process which handles the actual application logic (compression, HTTP headers modification, stickiness, routing, etc.)
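
As an illustration only, the recipe above might be laid out roughly as follows. This is a sketch rather than a tested configuration: the certificate path, the abstract socket name “clear-http”, and the backend name “servers” are all assumptions, and the backend definition itself is omitted.

```
global
  nbproc 4

# Processes 2-4 only terminate SSL and pass the clear traffic on
listen ssl-offload
  mode tcp
  bind-process 2-4
  bind :443 ssl crt /etc/haproxy/site.pem     # hypothetical certificate path
  server clear abns@clear-http send-proxy-v2  # hypothetical socket name

# Process 1 handles the actual application logic
frontend app
  mode http
  bind-process 1
  bind abns@clear-http accept-proxy
  default_backend servers                     # hypothetical backend, not shown
```

The PROXY protocol (“send-proxy-v2” and “accept-proxy”) is used here so that the process applying the application logic still sees the original client address.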

However, multiprocess configurations come with certain limitations:

  1. The HAProxy peers protocol, which is used to synchronize stick tables across HAProxy instances, may only be used in single-process configurations, leading to complexity when many tables need to be synchronized
  2. There is no information sharing between HAProxy processes, so all data, configuration parameters, statistics, limits, and rates are per-process
  3. Health checks are performed by each individual process, resulting in more health checking traffic than strictly necessary and causing transient inconsistencies during state changes
  4. The HAProxy Runtime API is applicable to a single process, so any Runtime API commands need to be sent to all individual processes
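
To make the last limitation more concrete, here is a sketch of how a Runtime API socket is typically declared per process in a multiprocess setup. The socket paths are hypothetical.

```
global
  nbproc 4
  # One Runtime API socket per process (paths are assumptions)
  stats socket /var/run/haproxy-1.sock process 1
  stats socket /var/run/haproxy-2.sock process 2
  stats socket /var/run/haproxy-3.sock process 3
  stats socket /var/run/haproxy-4.sock process 4
```

With such a layout, a command that changes or reads state has to be sent to each of the four sockets, and any results have to be aggregated by the caller.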

Consequently, while multiprocess configurations are useful in many cases, the performance benefits come combined with increased management complexity.

In the multithreading model, HAProxy starts multiple threads within a single process rather than starting multiple individual processes, and as such, it avoids all of the aforementioned problems.

Multithreading support has been implemented and included in HAProxy starting with HAProxy 1.8.

Our goal for the first multithreading release was to produce a stable, thread-safe implementation with an innovative and extensible design. The initial work took us 8 months to complete and we believe we have accomplished the task, but multithreading support will remain labeled experimental until we improve its overall performance and confirm stability in the largest installations.

Multithreading Support

Before multithreading can be activated, HAProxy must be compiled with multithreading support. This is done by default on Linux 2.6.28 and greater, FreeBSD, OpenBSD, and Solaris. For other target platforms, it must be explicitly enabled by using the flag “USE_THREAD=1”. Conversely, on the platforms where multithreading is enabled by default, it can be disabled by using “USE_THREAD=”.

To check for multithreading support in your HAProxy, please run “haproxy -vv”. If multithreading is enabled, you will see the text “Built with multithreading support” in the output:

    $ haproxy -vv

    HA-Proxy version 1.8-rc4-358847-18 2017/11/20
    Copyright 2000-2017 Willy Tarreau <willy@haproxy.org>
     ...
    Default settings :
     maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
     ...
    Built with multithreading support.
    Available polling systems :
     epoll : pref=300, test result OK
     poll : pref=200, test result OK
     select : pref=150, test result OK
    Total: 3 (3 usable), will use epoll.
    ...

Multithreading Configuration

By default, HAProxy will start one process and one thread. To start more threads, you should set the option “nbthread” in the global configuration section.

Please note that the option “nbthread” is compatible with “nbproc”, which means that it is even possible to start multiple HAProxy processes with multiple threads in each.

Both the processes and threads should then also be mapped to CPU cores by using the configuration directive “cpu-map”.

The complete configuration needed to run a single HAProxy process (1) with 4 threads (1-4) mapped to the first four CPU cores (0-3) would look like the following:

    global
      nbproc 1
      nbthread 4
      cpu-map auto:1/1-4 0-3

And that is basically all there is to it for a simple, fully functional use case!
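
For readers wondering what the “auto:” prefix does, the “cpu-map auto:1/1-4 0-3” line above should be equivalent to the explicit form below, where each process/thread pair is bound to one CPU. This is a sketch based on our reading of the cpu-map documentation, so please check it against the Configuration Guide for your version.

```
global
  nbproc 1
  nbthread 4
  cpu-map 1/1 0   # process 1, thread 1 -> CPU 0
  cpu-map 1/2 1   # process 1, thread 2 -> CPU 1
  cpu-map 1/3 2   # process 1, thread 3 -> CPU 2
  cpu-map 1/4 3   # process 1, thread 4 -> CPU 3
```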

Please refer to the HAProxy Configuration Guide, sections 3.1-nbproc, 3.1-nbthread, and 3.1-cpu-map, for the complete description of all the available options.

Advanced: Multithreading Architecture

This section provides a deeper technical overview for those wishing to get better insight and understanding of the multithreading functionality in HAProxy.

From an architectural point of view, numerous parts of HAProxy were improved as part of adding the multithreading support.

Instead of having one thread for the scheduler and a number of threads for the workers, we decided to run a scheduler in every thread. This allowed the proven, high-performance, event-driven engine component of HAProxy to run per-thread and to remain essentially unchanged. It also made the multithreading behavior very similar to the multiprocess one as far as usage is concerned, but without the multiprocess limitations!

In the multithreading model, each entity (a task, fd, or applet) is processed by only one thread at a time, and all entities attached to the same session are processed by the same thread. This means that all of the processing related to a specific session is serialized, avoiding most of the locking issues that would otherwise be present.

Thread affinity is also set on each entity. The session-related entities stick to the thread which accepted the incoming connection. Global entities (listeners, peers, checks, DNS resolvers, etc.) have no affinity and all threads are likely to process them, but always one at a time in accordance with the description given in the previous paragraph.
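
The idea of one scheduler per thread with session affinity can be illustrated with a small, self-contained C sketch. It is not HAProxy code: the names (“struct task”, “thread_for_session”, “run_all_schedulers”) and the modulo-based affinity rule are assumptions made for the example, and each scheduler drains its queue once instead of looping on events.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NB_THREADS 4   /* plays the role of "nbthread" in this sketch */
#define MAX_TASKS  64  /* arbitrary per-thread queue capacity */

/* Hypothetical task: every task of a session carries the same session id. */
struct task {
    unsigned int session_id;
    int processed_by;  /* index of the thread that ran it, -1 if not run yet */
};

/* One private run queue per thread, so draining it needs no lock. */
struct run_queue {
    struct task *tasks[MAX_TASKS];
    size_t count;
};

static struct run_queue queues[NB_THREADS];

/* Affinity rule assumed here: a session always maps to the same thread. */
static int thread_for_session(unsigned int session_id)
{
    return (int)(session_id % NB_THREADS);
}

/* Queue a task on the thread owning its session. Call before threads start. */
static int enqueue_task(struct task *t)
{
    struct run_queue *q = &queues[thread_for_session(t->session_id)];

    if (q->count == MAX_TASKS)
        return -1;
    t->processed_by = -1;
    q->tasks[q->count++] = t;
    return 0;
}

/* Per-thread scheduler: each thread drains only its own queue. */
static void *scheduler_loop(void *arg)
{
    int tid = (int)(intptr_t)arg;
    struct run_queue *q;
    size_t i;

    assert(tid >= 0 && tid < NB_THREADS);
    q = &queues[tid];
    for (i = 0; i < q->count; i++)
        q->tasks[i]->processed_by = tid;
    q->count = 0;
    return NULL;
}

/* Start one scheduler per thread and wait for all of them to finish. */
static int run_all_schedulers(void)
{
    pthread_t threads[NB_THREADS];
    int started = 0, ret = 0, i;

    for (i = 0; i < NB_THREADS; i++) {
        if (pthread_create(&threads[i], NULL, scheduler_loop,
                           (void *)(intptr_t)i) != 0) {
            ret = -1;
            break;
        }
        started++;
    }
    for (i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    return ret;
}
```

Because two tasks with the same session id always land in the same queue, they are processed one after the other by a single thread, which is the serialization property described above.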

Another important subject related to multithreading is the handling of changes in backend server states and their propagation. Changes to server states are now done in a single place, synchronously, removing the need to use locks in places where they would normally be needed.

Any remaining multithreading topics mostly boil down to locks and atomic operations. In this initial release of HAProxy 1.8.0, some parts are conservatively locked to make them thread-safe, and we will surely improve performance over time by refining or removing some of these locks.

For example, one of the areas that will receive improvements in the future is Lua multithreading performance. Lua's design forced us to use a global lock, which means that using Lua scripts with several threads will have a noticeable cost as the scripts will essentially run single-threaded.

Some other places in the code have been made thread-local to avoid the need for locks, but consequently, they could slightly change the expected behaviour. For instance, reusable connections to backends are available only to sessions that stick to the same thread.

Finally, in terms of the low level details, it should be mentioned that we use pthreads to create threads and GCC’s atomic built-in functions to do atomic operations. We use progressive locks invented by our HAProxy Technologies CTO Willy Tarreau for all spinlocks and RWlocks. We use macros to abstract all the details, and all of this can be seen in the HAProxy header file “include/common/hathreads.h”.
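
To give a feel for what such macros can look like, here is a minimal sketch built on GCC's “__atomic” built-ins and pthreads. The macro and function names are hypothetical and are not the ones found in the HAProxy header, and the lock shown is a plain test-and-set spinlock, not a progressive lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define SK_MAX_THREADS 16

/* Hypothetical macro name; it only wraps a GCC __atomic built-in. */
#define SK_ATOMIC_ADD(ptr, v) __atomic_add_fetch((ptr), (v), __ATOMIC_SEQ_CST)

/* Plain test-and-set spinlock, NOT the progressive locks used by HAProxy. */
struct sk_spinlock {
    int locked;
};

#define SK_SPIN_LOCK(l)                                                    \
    do {                                                                   \
        while (__atomic_exchange_n(&(l)->locked, 1, __ATOMIC_ACQUIRE))     \
            ; /* busy-wait until the previous value was 0 */               \
    } while (0)

#define SK_SPIN_UNLOCK(l) __atomic_store_n(&(l)->locked, 0, __ATOMIC_RELEASE)

static unsigned long atomic_hits;  /* updated with one atomic add, no lock */
static unsigned long locked_hits;  /* plain variable protected by hits_lock */
static struct sk_spinlock hits_lock;

/* Worker: bump both counters the requested number of times. */
static void *count_hits(void *arg)
{
    unsigned long loops = *(const unsigned long *)arg;
    unsigned long i;

    for (i = 0; i < loops; i++) {
        SK_ATOMIC_ADD(&atomic_hits, 1UL);

        SK_SPIN_LOCK(&hits_lock);
        locked_hits++;
        SK_SPIN_UNLOCK(&hits_lock);
    }
    return NULL;
}

/* Run nb_threads workers concurrently and wait for all of them. */
static int run_counters(int nb_threads, unsigned long loops)
{
    pthread_t threads[SK_MAX_THREADS];
    int started = 0, ret = 0, i;

    assert(nb_threads > 0 && nb_threads <= SK_MAX_THREADS);
    for (i = 0; i < nb_threads; i++) {
        if (pthread_create(&threads[i], NULL, count_hits, &loops) != 0) {
            ret = -1;
            break;
        }
        started++;
    }
    for (i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    return ret;
}
```

Both counters end up with the same value. The difference is the cost: the atomic add is a single operation, while the locked update makes every thread wait its turn, which is why refining or removing locks is expected to improve performance.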

Advanced: Debugging

As mentioned, the multithreading support is labeled experimental. User feedback and any bug reports will be very helpful to us to reach the final desired level of performance and stability.

To help us diagnose locking costs or problems, you could enable debug mode by compiling HAProxy with the option “DEBUG=-DDEBUG_THREAD”. With it, HAProxy will provide statistics on the locks. Currently, we display this information when HAProxy is stopped, but we are considering adding the equivalent option to the Runtime API too.

From a performance perspective, it is always helpful to know the costs of the locks as this could help highlight bottlenecks. Here is an excerpt from a sample output:

    Stats about Lock THREAD_SYNC:
    # write lock : 0
    # write unlock: 0 (0)
    # wait time for write     : 0.000 msec
    # wait time for write/lock: 0.000 nsec
    # read lock : 0
    # read unlock : 0 (0)
    # wait time for read     : 0.000 msec
    # wait time for read/lock : 0.000 nsec

    Stats about Lock FDTAB:
    # write lock : 139315
    # write unlock: 139315 (0)
    # wait time for write     : 13.739 msec
    # wait time for write/lock: 98.622 nsec
    # read lock : 0
    # read unlock : 0 (0)
    # wait time for read     : 0.000 msec
    # wait time for read/lock : 0.000 nsec
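
When reading these figures, the per-lock value appears to be simply the total wait time divided by the number of lock operations. The small helper below reproduces that arithmetic. The function name is hypothetical and the formula is our reading of the output, not something taken from the HAProxy sources.

```c
#include <assert.h>

/* Average wait per lock operation in nanoseconds, from a total given in
 * milliseconds. Returns 0 when the lock was never taken. */
static double avg_wait_nsec(double total_wait_msec, unsigned long nb_locks)
{
    assert(total_wait_msec >= 0.0);
    if (nb_locks == 0)
        return 0.0;
    return total_wait_msec * 1e6 / (double)nb_locks;
}
```

With the FDTAB figures above, 13.739 msec spread over 139315 write locks gives roughly 98.6 nsec per lock, which agrees with the reported 98.622 nsec up to the rounding of the millisecond total.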

Debugging information is also useful in tracing back deadlocks or double locks. For example, an attempt to do a double lock will fail and HAProxy will exit. For each lock, we keep track of the last place where it was locked and that information can then easily be printed in gdb:

    (gdb) p rq_lock
    $1 = {lock = 0, info = {owner = 0, waiters = 0, last_location = {function = 0x5abd80 <__func__.26911> "process_runnable_tasks", file = 0x5abd25 "src/task.c", line = 252}}}
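
The bookkeeping behind such output can be sketched in a few lines of C. The structure below loosely mirrors the fields visible in the gdb output, but its layout, the names, and the return codes are assumptions made for illustration. In particular, this sketch reports a double lock through a return value, whereas HAProxy exits.

```c
#include <assert.h>
#include <stddef.h>

/* Field names loosely mirror the gdb output; the layout is hypothetical. */
struct sk_lock_location {
    const char *function;
    const char *file;
    int line;
};

struct sk_debug_lock {
    int lock;             /* 0 = free, 1 = held */
    unsigned long owner;  /* id of the holding thread, 0 when free */
    struct sk_lock_location last_location;
};

/* Try to take the lock and remember where the call came from.
 * Returns 0 on success, -1 on a double lock by the same thread,
 * -2 if another thread currently holds the lock. */
static int sk_lock_at(struct sk_debug_lock *l, unsigned long tid,
                      const char *function, const char *file, int line)
{
    int expected = 0;

    assert(tid != 0);  /* 0 is reserved for "no owner" */
    if (__atomic_load_n(&l->owner, __ATOMIC_RELAXED) == tid)
        return -1;
    if (!__atomic_compare_exchange_n(&l->lock, &expected, 1, 0,
                                     __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
        return -2;
    __atomic_store_n(&l->owner, tid, __ATOMIC_RELAXED);
    l->last_location.function = function;
    l->last_location.file = file;
    l->last_location.line = line;
    return 0;
}

/* Release the lock; the last location is kept for later inspection. */
static void sk_unlock(struct sk_debug_lock *l)
{
    __atomic_store_n(&l->owner, 0UL, __ATOMIC_RELAXED);
    __atomic_store_n(&l->lock, 0, __ATOMIC_RELEASE);
}

/* Record the caller's function, file, and line automatically. */
#define SK_LOCK(l, tid) sk_lock_at((l), (tid), __func__, __FILE__, __LINE__)
```

Since the location survives an unlock, a debugger can still show where the lock was last taken, which is what makes the gdb output above useful when tracing back a deadlock.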

Conclusion

We hope you have enjoyed this introduction to the multithreading functionality in HAProxy, its configuration, and basic troubleshooting procedures.

If you would like to use multithreading with HAProxy 1.8 in your infrastructure backed by enterprise support from HAProxy Technologies, please see our HAProxy Enterprise Edition – Trial Version or contact us for expert advice.

Happy multithreading and stay tuned!
