The operating system and its tuning have a strong impact on the overall performance of the load balancer. Typical CPU usage figures show:

  • 15% of the processing time spent in HAProxy versus 85% in the kernel in TCP or HTTP close mode

  • 30% for HAProxy versus 70% for the kernel in HTTP keep-alive mode

These figures vary depending on whether the focus is on bandwidth, request rate, connection concurrency, or SSL performance. This section describes a few factors to consider when setting up your configuration.

Evaluating the costs of processing requests

Every operation comes with a cost, and each operation adds its overhead on top of the others. Depending on the circumstances, a given overhead can be negligible or can dominate.

When processing requests from a connection, we observe that:

  • Forwarding data costs less than parsing request or response headers

  • Parsing request or response headers costs less than establishing then closing a connection to a server

  • Establishing and closing a connection costs less than a TLS resume operation

  • A TLS resume operation costs less than a full TLS handshake with a key computation

  • An idle connection costs less CPU than a connection whose buffers hold data

  • A TLS context costs even more memory than a connection with data

In practice, payload bytes are cheaper to process than header bytes, so high network bandwidth is easier to achieve with large objects (few requests per unit of volume) than with small objects (many requests per unit of volume). This is why maximum bandwidth is always measured with large objects, while request rates and connection rates are measured with small objects.
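To make the trade-off concrete, the arithmetic below computes how many requests per second are needed to fill a link for a few object sizes. It is a minimal sketch that ignores header and protocol overhead, so real request rates would be somewhat higher; the 10 Gbps target and the object sizes are illustrative assumptions, not measured values.

```python
def requests_per_second(bandwidth_gbps: float, object_size_bytes: int) -> float:
    """Requests per second needed to fill a link with objects of one size.

    Ignores headers and protocol overhead: only payload bytes are counted.
    """
    bits_per_object = object_size_bytes * 8
    return bandwidth_gbps * 1e9 / bits_per_object


if __name__ == "__main__":
    # Hypothetical 10 Gbps link and three illustrative object sizes.
    for size in (1_000, 100_000, 10_000_000):
        rate = requests_per_second(10, size)
        print(f"{size:>10} bytes -> {rate:>12,.0f} req/s")
```

With these inputs, 1 kB objects need 1.25 million requests per second to fill 10 Gbps, while 10 MB objects need only 125, which is why small objects stress the request path and large objects stress bandwidth.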

Some operations scale well across multiple processes spread over multiple processors, such as:

  • The request rate over persistent connections: This does not involve much memory or network bandwidth and does not require access to locked structures.

  • TLS key computation: This is completely CPU-bound.

  • TLS resume (moderately well): This operation reaches its limit at around 4 processes, where the overhead of accessing the shared table offsets the small gains expected from additional processing power.

Other operations do not scale as well, such as:

  • Network bandwidth: The CPU is rarely the bottleneck for large objects.

  • Connection rate: This is limited by a few locks in the system when dealing with the local ports table.

Optimizing performance

On a very well-tuned system, you can expect values of the following orders of magnitude:

  • TCP connections per GB of RAM: 29000

  • TLS frontend connections per GB of RAM: 7900

  • TLS end-to-end connections per GB of RAM: 7100
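These per-GB figures lend themselves to quick capacity estimates. The sketch below simply multiplies or divides by the quoted values; the dictionary keys and function names are hypothetical, and it assumes all of the given RAM is available for connections, ignoring memory used by the operating system and other processes.

```python
# Orders of magnitude quoted above for a very well-tuned system (per GB of RAM).
# The key names are hypothetical labels for this sketch.
CONNECTIONS_PER_GB = {
    "tcp": 29_000,
    "tls_frontend": 7_900,
    "tls_end_to_end": 7_100,
}


def max_connections(ram_gb: float, kind: str) -> int:
    """Rough ceiling on concurrent connections that fit in `ram_gb` of RAM."""
    return int(ram_gb * CONNECTIONS_PER_GB[kind])


def ram_needed_gb(connections: int, kind: str) -> float:
    """Rough RAM, in GB, needed to hold `connections` concurrent connections."""
    return connections / CONNECTIONS_PER_GB[kind]


if __name__ == "__main__":
    print(max_connections(16, "tls_frontend"))   # connections in 16 GB
    print(round(ram_needed_gb(100_000, "tcp"), 2))  # GB for 100k TCP connections
```

For example, 16 GB holds roughly 126,400 TLS frontend connections by this rule; actual capacity depends on buffer sizes and tuning.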

The following results were measured with HAProxy Enterprise 2.1r1 on Linux kernel 4.19, with 2x i40g network interfaces and a Core i7-8700 processor:

Note

Treat these values as orders of magnitude, and expect significant variations in either direction depending on the processor, IRQ settings, memory type, network interface type, operating system tuning, and so on.

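Frontend performance** (requests for a cached object, not forwarded to a server)

| Frontend protocol            | Max requests per second | Max bandwidth |
|------------------------------|-------------------------|---------------|
| HTTP/1.1 connection close    | 315k                    | 35 Gb/s       |
| HTTP/1.1 keep-alive          | 750k                    | 35 Gb/s       |
| HTTPS (HTTP/1.1) keep-alive* | 670k                    | 34 Gb/s       |
| HTTPS (HTTP/2)*              | 740k                    | 29 Gb/s       |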

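Edge load balancer performance** (frontend protocol: HTTPS (HTTP/2)*)

| Server-side protocol         | Max requests per second | Max bandwidth |
|------------------------------|-------------------------|---------------|
| HTTP/1.1 keep-alive          | 580k                    | 21 Gb/s       |
| HTTPS (HTTP/1.1) keep-alive* | 450k                    | 19 Gb/s       |
| HTTPS (HTTP/2)*              | 500k                    | 12 Gb/s       |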

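Classic/legacy load balancer performance** (HTTP/1.1 keep-alive on the server side)

| Frontend protocol            | Max requests per second | Max bandwidth |
|------------------------------|-------------------------|---------------|
| HTTP/1.1 connection close    | 195k                    | 33 Gb/s       |
| HTTP/1.1 keep-alive          | 540k                    | 33 Gb/s       |
| HTTPS (HTTP/1.1) keep-alive* | 360k                    | 30 Gb/s       |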

TCP/HTTP filtering performance**

| Frontend protocol           | Requests per second |
|-----------------------------|---------------------|
| HTTP/1.1 connection close   | 380k                |
| HTTPS (HTTP/1.1) keep-alive | 380k                |
| HTTPS (HTTP/2)*             | 1350k               |

TLS key computation performance

| Key type  | Keys per second |
|-----------|-----------------|
| RSA 2048  | 9000            |
| RSA 4096  | 1500            |
| ECDSA 256 | 25000           |

Note

  • (*) Using TLSv1.2 with a 2048-bit key and the ECDHE-RSA-AES256-GCM-SHA384 cipher

  • (**) Using 1000 concurrent transactions
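The key computation rate bounds how many full TLS handshakes per second the tested machine can absorb. Assuming one server private-key operation per full handshake (resumed sessions skip it), a minimal sketch of that check follows; the function name is hypothetical, and the rates are the measured values from the table, so they only hold for comparable hardware.

```python
# Keys per second measured on the test system (see the table above).
KEYS_PER_SECOND = {"RSA 2048": 9_000, "RSA 4096": 1_500, "ECDSA 256": 25_000}


def handshake_utilization(full_handshakes_per_second: float, key_type: str) -> float:
    """Fraction of the measured key-computation capacity that a load would use.

    Assumption: each full handshake costs exactly one server private-key
    operation; resumed sessions are not counted. A value above 1.0 means the
    load exceeds what the test system measured.
    """
    return full_handshakes_per_second / KEYS_PER_SECOND[key_type]


if __name__ == "__main__":
    for key in KEYS_PER_SECOND:
        print(key, handshake_utilization(4_500, key))
```

With a hypothetical 4,500 new sessions per second, RSA 2048 would use half of the measured capacity, while RSA 4096 would need three times more than the test system delivered, which illustrates why key type matters so much for connection-heavy TLS workloads.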

Guidelines about sizing

Keep the following rules of thumb in mind when sizing your deployment:

  • The request rate drops by a factor of about 10 from TLS keep-alive to TLS resume, and again by a factor of about 10 from TLS resume to full TLS negotiation, whereas it drops only by a factor of about 3 from HTTP keep-alive to HTTP close.

  • A high-frequency core with AES instructions can process around 5 Gbps of AES-GCM traffic.

  • Having more cores is rarely helpful (except for TLS) and can even be counterproductive, because processors with more cores tend to run at lower frequencies. In general, it is better to have a small number of high-frequency cores.
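These rules of thumb can be turned into a quick back-of-the-envelope calculator. The sketch below hard-codes the factors of 10 and 3 and the figure of about 5 Gbps of AES-GCM per core; the function names are hypothetical, and the results are rough estimates, not benchmarks.

```python
import math

# Rough figure for one high-frequency core with AES instructions.
AES_GCM_GBPS_PER_CORE = 5.0


def tls_rate_estimates(keepalive_rate: float) -> dict:
    """Apply the divide-by-10 rule, starting from a TLS keep-alive request rate."""
    resume_rate = keepalive_rate / 10          # keep-alive -> TLS resume
    full_handshake_rate = resume_rate / 10     # TLS resume -> full negotiation
    return {
        "keep-alive": keepalive_rate,
        "resume": resume_rate,
        "full handshake": full_handshake_rate,
    }


def http_close_rate(keepalive_rate: float) -> float:
    """Apply the divide-by-3 rule from HTTP keep-alive to HTTP close."""
    return keepalive_rate / 3


def cores_for_tls_bandwidth(target_gbps: float) -> int:
    """Cores needed for AES-GCM encryption alone at the target bandwidth."""
    return math.ceil(target_gbps / AES_GCM_GBPS_PER_CORE)


if __name__ == "__main__":
    print(tls_rate_estimates(500_000))
    print(http_close_rate(750_000))
    print(cores_for_tls_bandwidth(20))
```

For example, starting from a hypothetical 500k requests per second over TLS keep-alive, the rules give roughly 50k with TLS resume and 5k with full negotiation, and a 20 Gb/s TLS target needs about 4 cores for AES-GCM alone, before any other processing.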

Running on the same hardware, HAProxy can saturate approximately:

  • 5-10 static file servers or caching proxies

  • 100 anti-virus proxies

  • 100-1000 application servers depending on the technology in use