The operating system and its tuning have a strong impact on the overall performance of the load balancer. Typical CPU usage figures generally show:

  • 15% of the processing time spent in HAProxy versus 85% in the kernel in TCP or HTTP close mode

  • 30% for HAProxy versus 70% for the kernel in HTTP keep-alive mode

Usage can vary depending on whether the focus is on bandwidth, request rate, connection concurrency, or SSL performance. This section aims to provide a few elements to consider as you set up your configuration.

Evaluating the costs of processing requests

It is important to keep in mind that every operation comes with a cost. Each operation adds its overhead on top of the others, and that overhead can be negligible in some circumstances and dominant in others.

When processing requests from a connection, we observe that:

  • Forwarding data costs less than parsing request or response headers

  • Parsing request or response headers costs less than establishing and then closing a connection to a server

  • Establishing and closing a connection costs less than a TLS resume operation

  • A TLS resume operation costs less than a full TLS handshake with a key computation

  • An idle connection costs less CPU than a connection whose buffers hold data

  • A TLS context costs even more memory than a connection with data

So in practice, it is cheaper to process payload bytes than header bytes. It is therefore easier to achieve high network bandwidth with large objects (few requests per volume unit) than with small objects (many requests per volume unit). This explains why maximum bandwidth is always measured with large objects, while request rates and connection rates are measured with small objects.
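To make the relationship between object size and request rate concrete, the following Python sketch computes how many requests per second are needed to fill a link with objects of a given size. The function name and the sample sizes are illustrative assumptions. The calculation also assumes 1 kB = 1000 bytes and ignores header and protocol overhead, so treat the results as rough orders of magnitude.

```python
def requests_per_second(link_gbps: float, object_kb: float) -> float:
    """Request rate needed to fill a link with objects of a given size.

    Assumes 1 kB = 1000 bytes and ignores header and protocol overhead.
    """
    bytes_per_second = link_gbps * 1e9 / 8
    return bytes_per_second / (object_kb * 1000)


# Illustrative object sizes: the smaller the object, the more requests
# (and header parsing) are needed to move the same volume of data.
for gbps, size_kb in [(20, 256), (10, 41), (10, 1)]:
    rate = requests_per_second(gbps, size_kb)
    print(f"{gbps} Gbps with {size_kb} kB objects: about {rate:,.0f} requests/s")
```

With 1 kB objects, filling a 10 Gbps link would require more than a million requests per second, which is why small objects are used to measure request rates rather than bandwidth.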

Some operations scale well on multiple processes spread over multiple processors, such as:

  • The request rate over persistent connections: This does not involve much memory or network bandwidth and does not require access to locked structures.

  • TLS key computation: This is completely CPU-bound.

  • TLS resume (moderately well): This operation reaches its limits around 4 processes, when the overhead of accessing the shared table offsets the small gains expected from more power.

Other operations do not scale as well, such as:

  • Network bandwidth: The CPU is rarely the bottleneck for large objects.

  • Connection rate: This is due to a few locks in the system when dealing with the local ports table.

Optimizing performance

The performance values you can expect from a very well-tuned system are in the following range.

Note

It is important to take these values as orders of magnitude and to expect significant variations in any direction based on the processor, IRQ setting, memory type, network interface type, operating system tuning, and so on.

The following values were measured on a Core i7 running at 3.7 GHz, equipped with a dual-port 10 Gbps NIC, running Linux kernel 3.10, HAProxy 1.6, and OpenSSL 1.0.2.

HAProxy was running as a single process on a single dedicated CPU core, and two extra cores were dedicated to network interrupts:

  • 20 Gbps of maximum network bandwidth in clear text for objects of 256 kB or larger, 10 Gbps for 41 kB or larger

  • 4.6 Gbps of TLS traffic using AES256-GCM cipher with large objects

  • 83000 TCP connections per second from client to server

  • 82000 HTTP connections per second from client to server

  • 97000 HTTP requests per second in server-close mode (keep-alive with the client, close with the server)

  • 243000 HTTP requests per second in end-to-end keep-alive mode

  • 300000 filtered TCP connections per second (anti-DDoS)

  • 160000 HTTPS requests per second in keep-alive mode over persistent TLS connections

  • 13100 HTTPS requests per second using TLS resumed connections

  • 1300 HTTPS connections per second using TLS connections renegotiated with RSA2048

  • 20000 concurrent saturated connections per GB of RAM, including the memory required for system buffers; it is possible to do better with careful tuning, but this figure is easy to achieve

  • About 8000 concurrent TLS connections (client-side only) per GB of RAM, including the memory required for system buffers

  • About 5000 concurrent end-to-end TLS connections (both sides) per GB of RAM, including the memory required for system buffers
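The per-GB concurrency figures above can be turned into a first-pass memory estimate. The Python sketch below is an illustration only: the dictionary keys and the function name are hypothetical, and the figures are the order-of-magnitude values from this list, so real requirements may differ significantly.

```python
import math

# Concurrent connections per GB of RAM, taken from the figures above.
# These include the memory required for system buffers.
CONNS_PER_GB = {
    "clear": 20000,           # saturated clear-text connections
    "tls_client": 8000,       # TLS on the client side only
    "tls_end_to_end": 5000,   # TLS on both sides
}


def ram_gb_needed(concurrent_connections: int, mode: str) -> int:
    """Rough RAM estimate in whole GB, rounded up."""
    return math.ceil(concurrent_connections / CONNS_PER_GB[mode])


for mode in CONNS_PER_GB:
    print(f"100000 connections, {mode}: about {ram_gb_needed(100000, mode)} GB")
```

Rounding up is a deliberate choice here, since running out of memory under load is costlier than provisioning slightly too much.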

Guidelines about sizing

There are a few rules of thumb to keep in mind in your sizing exercise:

  • The request rate is divided by roughly 10 between TLS keep-alive and TLS resume, and again between TLS resume and TLS renegotiation, while it is only divided by 3 between HTTP keep-alive and HTTP close.

  • A high frequency core with AES instructions can do around 5 Gbps of AES-GCM per core.

  • Having more cores is rarely helpful (except for TLS), and can even be counter-productive due to the lower frequency. In general, it is better to have a small number of high-frequency cores.
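The first rule of thumb can be checked against the benchmark figures listed earlier in this section. The Python sketch below simply divides those figures. The dictionary keys and the helper name are hypothetical labels chosen for this illustration.

```python
# Requests or connections per second, copied from the benchmark figures
# earlier in this section (single process, single dedicated core).
FIGURES = {
    "http_keep_alive": 243000,  # end-to-end keep-alive
    "http_close": 82000,        # one HTTP connection per request
    "tls_keep_alive": 160000,   # persistent TLS connections
    "tls_resume": 13100,        # TLS resumed connections
    "tls_rsa2048": 1300,        # TLS renegotiated with RSA2048
}


def slowdown(faster: str, slower: str) -> float:
    """How many times lower the rate is in the slower mode."""
    return FIGURES[faster] / FIGURES[slower]


pairs = [
    ("http_keep_alive", "http_close"),
    ("tls_keep_alive", "tls_resume"),
    ("tls_resume", "tls_rsa2048"),
]
for faster, slower in pairs:
    print(f"{faster} -> {slower}: divided by {slowdown(faster, slower):.1f}")
```

The ratios come out close to 3, 12, and 10 respectively, which is consistent with treating these rules as orders of magnitude rather than exact factors.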

On the same server, HAProxy is able to saturate approximately:

  • 5-10 static file servers or caching proxies

  • 100 anti-virus proxies

  • 100-1000 application servers depending on the technology in use