HAProxy 2.0 introduced Layer 7 retries, which provide resilience against unreachable nodes, network latency, slow servers, and HTTP errors.
HAProxy powers the uptime of organizations with even the largest traffic demands by giving them the flexibility and confidence to deliver websites and applications with high availability, performance and security at any scale and in any environment. As the world’s fastest and most widely used software load balancer, ruggedness is one of its essential qualities.
When HAProxy receives a request but can't establish a TCP connection to the selected backend server, it automatically tries again after an interval set by timeout connect. This behavior has been baked in from the beginning, and it smooths out short-lived network flakiness and brief downtime caused by server restarts.
You can further customize this by setting a retries directive in a backend to the desired number of attempts; it defaults to three. Also, if you add option redispatch to the backend, HAProxy tries another server instead of repeatedly trying the same one.
Now with HAProxy 2.0, you aren't limited to retrying based on a failed connection only. The new retry-on directive lets you list other kinds of failures that will trigger a retry, covering both Layer 4 and Layer 7 events. For example, if messages time out after the connection has been established, due to network latency or because the web server is slow to respond, retry-on tells HAProxy to trigger a retry.
You can think of it like this:
- retries says how many times to try
- retry-on says which events trigger a retry
- option redispatch says whether to try with a different server
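Putting the three together, a minimal backend might look like this (server names and addresses are illustrative):

```haproxy
backend webservers
    balance roundrobin
    # How many times to try (three is also the default)
    retries 3
    # Which events trigger a retry, beyond the Layer 4 default
    retry-on conn-failure empty-response
    # Send the retry to a different server instead of the same one
    option redispatch
    server s1 192.168.0.10:80 check
    server s2 192.168.0.11:80 check
```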
While I was learning about the new retry-on feature, it got me thinking about novel ways of adapting to failure. In particular, I began looking at Chaos Engineering and how purposefully injecting faults can guide you toward a better, stronger system. In this blog post, I'll share what I learned about testing for failure and how retry-on is a powerful tool for building system resilience.
Ultimately, you want to build systems that can adapt to unusual conditions and that keep on humming even in the face of failed components and turbulence. A resilient system can bounce back in the face of adversity. What sort of adversity? Adrian Cockcroft, VP of Cloud Architecture at AWS, gave the keynote presentation at Chaos Conf 2018. He listed possible faults that may happen within your infrastructure or software. Here are just a few of the potential disasters related to infrastructure alone:
- Device failures (disk, power supply, cabling…)
- CPU failures (cache corruption, logic bugs…)
- Datacenter failures (power, connectivity, cooling, fire…)
- Internet failures (DNS, ISP, routing…)
There’s plenty that can go wrong, but oftentimes we shy away from purposefully trying to break our systems to find and fix weaknesses. The result is that we don’t adequately test whether the mitigations we’ve put in place actually work. Are the load balancer settings you’ve configured optimal for reducing outages?
Acting out real-world failure modes is the best way to test whether your system is resilient. What actually happens when you start killing web server nodes? What is the effect of inducing latency in the network? If a server returns HTTP errors, will downstream clients be affected and, if so, to what degree?
I found that by applying some techniques from Chaos Engineering, by intentionally injecting faults into the system, I began to see exactly how I should tune HAProxy. For example, I saw how best to set various timeouts and which events I should set to trigger a retry.
Creating Chaos is Getting Easier
Maybe it’s due to the maturation of Chaos Engineering, which is a maturation of our collective incident management knowledge, but the tooling available for creating chaos in your infrastructure is getting better and better.
Gremlin allows you to unleash mayhem such as killing off nodes, simulating overloaded CPU and memory, and filling up disk space.
Pumba is a command-line tool that lets you simulate bad network conditions such as latency, corrupted data, and packet loss.
Muxy can be used to alter the responses from your web servers, such as to return HTTP errors.
First, let’s take a look at killing off a node that’s serving as a backend server in HAProxy. If I use Gremlin or the Docker CLI to stop one of the web servers, then HAProxy will fail to connect to that node. This assumes that HAProxy has not already removed it from the load-balancing rotation during its regular health checks. For testing, I disabled health checking in order to allow HAProxy to attempt to connect to a down server.
Gremlin can be run as a Docker container, giving it access to other containers in the network. Then, you can use its UI to kill off nodes. For my experiment, I ran a group of web servers in Docker containers alongside Gremlin.
My backend configuration looked like this:
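The configuration was along these lines (server names and addresses are illustrative; health checks are left off so that HAProxy will still select the downed server, as described above):

```haproxy
backend webservers
    balance roundrobin
    # Health checks intentionally disabled for this experiment so that
    # HAProxy still attempts to connect to a stopped container
    server s1 192.168.0.10:80
    server s2 192.168.0.11:80
```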
HAProxy adds an implicit retries directive, so it will automatically retry a failed connection three times. You can also set retries explicitly to the number of desired attempts. After killing a node and trying to connect to it, the HAProxy log entry looked like this:
This shows that there were three retries to the same offline server, as indicated by the last number in the 1/1/0/0/3 section. Ultimately, it ended in a 503 Service Unavailable response. Notice that the termination code is sC, meaning that there was a timeout while waiting to connect to the server.
To handle this scenario, you should add an option redispatch directive so that instead of retrying with the same server, HAProxy tries a different one.
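Added to the sketch of a backend from before (addresses illustrative), it looks like this:

```haproxy
backend webservers
    balance roundrobin
    # On a failed retry, pick a different server from the pool
    option redispatch
    server s1 192.168.0.10:80
    server s2 192.168.0.11:80
```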
Then, if HAProxy tries to connect to a down server and hits the timeout specified by timeout connect, it will try again with a different server. You'll see the successful attempt in the logs with a +1 as the last number in the 1/1/0/0/+1 section, indicating that there was a redispatch to the s2 server.
HAProxy keeps trying servers until they’ve all been tried up to the retries number. So, for this type of Layer 4 disconnection, you don’t need retry-on at all.
There is another scenario: the connection was established fine, but then the server disconnected while HAProxy was waiting for the response. You can test this by injecting a delay using Muxy (discussed later in this article) and then killing Muxy before the response is sent. With the current configuration, which includes option redispatch, this type of failure causes the client to receive a 502 Bad Gateway response. The HAProxy logs show an SH termination code, indicating that the server aborted the connection midway through the communication:
Here is where retry-on comes into play. You would add a retry policy of empty-response to guard against this:
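In the backend, that amounts to (addresses illustrative):

```haproxy
backend webservers
    balance roundrobin
    option redispatch
    # Retry when the server closes the connection without responding
    retry-on empty-response
    server s1 192.168.0.10:80
    server s2 192.168.0.11:80
```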
Now, if HAProxy successfully connects to the server, but the server then aborts, the request will be retried with a different server.
Another failure mode to test is latency in the network. Using Pumba, you can inject a delay for all responses coming from one or more of your web servers running in Docker containers. For my experiment, I added a five-second delay to the first web server, and no delay for the others. The command looks like this:
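The exact invocation wasn't preserved above; with Pumba's netem delay subcommand it would look roughly like the following, where the container name and flag values are assumptions (check pumba netem delay --help for your version):

```shell
# Add ~5 seconds of latency to traffic from the web1 container for 5 minutes
pumba netem --duration 5m delay --time 5000 web1
```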
First, note that the defaults section of my HAProxy configuration looked like this:
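Only timeout connect is stated explicitly in this post; the other timeout values below are illustrative placeholders:

```haproxy
defaults
    mode http
    # Time allowed for establishing a TCP connection to a server
    timeout connect 3s
    # Illustrative client/server inactivity timeouts
    timeout client 10s
    timeout server 10s
```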
timeout connect, which is the time allowed for establishing a connection to a server, is set to three seconds. My hypothesis for this experiment was that the HTTP request would be delayed and hit the timeout server limit. What actually happened was that the connection timeout struck first, giving me an sC termination code in the HAProxy logs, which means that there was a server-side timeout while waiting for the connection to be made.
From this I learned that generic network latency affects all aspects of a request, Layer 4 through Layer 7. In other words, the HTTP messages did not have a chance to time out because even establishing a connection was timing out first. It sounds obvious, but until I tested it, I was only focused on Layer 7.
Retrying when there is a connection timeout is covered by adding a conn-failure retry policy. You can append it, like this:
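The resulting directive carries both policies:

```haproxy
retry-on conn-failure empty-response
```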
If you don’t set retry-on at all, then conn-failure is on by default. However, since we’ve set retry-on in order to include empty-response, we need to include conn-failure explicitly as well. So, whether the server is completely down or just slow to connect, it’s counted as a connection timeout. Also note: how quickly HAProxy will retry depends on your timeout settings.
To learn what would happen if latency affected only the HTTP messages and not the initial connection, I moved on to using a different tool named Muxy. Muxy is a proxy that can change a request or response as it passes through. You can run it in a Docker container so that it has access to muck with messages from other containers in the network, hence its name. Use it to add a delay to one of the web server’s responses so that the connection is established fine, but the application appears sluggish. The following Muxy configuration injects a five-second delay for responses coming from server1:
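The Muxy configuration was not preserved here; a sketch of a delay middleware follows, with field names that should be checked against the Muxy documentation for your version (they are assumptions, not verified):

```yaml
# Illustrative only -- consult Muxy's docs for the exact schema
proxy:
  - name: http_proxy
    config:
      host: 0.0.0.0
      port: 8080
      proxy_host: server1
      proxy_port: 80

middleware:
  - name: delay
    config:
      response_delay: 5000   # milliseconds
```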
You’ll need to point HAProxy to Muxy instead of the actual backend server:
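For example, s1 would address the Muxy container instead of server1 (hostname and port illustrative):

```haproxy
backend webservers
    balance roundrobin
    option redispatch
    retry-on conn-failure empty-response
    # s1 now points at the Muxy proxy, which forwards to server1
    server s1 muxy:8080
    server s2 192.168.0.11:80
```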
This causes a different type of timeout in HAProxy, one that’s triggered when timeout server strikes. The client receives a 504 Gateway Timeout response, as shown in the HAProxy logs:
Append the response-timeout retry policy to cover this scenario:
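The directive now covers all three events:

```haproxy
retry-on conn-failure empty-response response-timeout
```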
Suppose that there was no latency, but the server returned an HTTP error. You can deal with this type of chaos too. It is important that you know how your application behaves with Layer 7 retries enabled; caution must be exercised when retrying non-idempotent requests such as POSTs. Be sure to read the next section regarding retrying POST requests!
To change the returned status to 500 Server Error before it reaches HAProxy, use a Muxy configuration like this:
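A sketch of such a configuration follows, using Muxy's http_tamperer middleware; the field names are assumptions and should be checked against the Muxy documentation:

```yaml
# Illustrative only -- consult Muxy's docs for the exact schema
middleware:
  - name: http_tamperer
    config:
      response:
        status: 500
```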
You’ll need to update your retry policy to look for certain HTTP response status codes. In the following example, a retry happens if there’s a timeout, a connection failure, or an HTTP 500 status returned:
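That means appending the status code to the directive:

```haproxy
retry-on conn-failure empty-response response-timeout 500
```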
With this in place, the HAProxy log shows that the request was routed away from the failing server, s1, to the healthy server, s2:
Recall that it is actually option redispatch that tells HAProxy to try a different server. The retry-on directive only configures when a retry should happen. If you wanted to keep trying the same server, you’d remove option redispatch.
You can keep appending more retry policies to mitigate different types of failure modes. Or, you can use the all-inclusive option, all-retryable-errors:
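In the configuration, that is simply:

```haproxy
retry-on all-retryable-errors
```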
It’s the same as if you’d specified all of the following parameters:
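Per the HAProxy documentation, all-retryable-errors expands to:

```haproxy
retry-on conn-failure empty-response junk-response response-timeout 0rtt-rejected 500 502 503 504
```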
You will find all of the available options in the HAProxy documentation.
Beware of POSTs
Retrying requests that fetch data is often safe enough, although be sure to test! The application may have unknown side effects that make it unsafe to retry. However, it’s almost never safe to retry a request that writes data to a database, since you may insert duplicate data. For that reason, you’ll often want to add a rule that disables retries for POST requests. Use the http-request disable-l7-retry directive, like this:
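The directive pairs with HAProxy's predefined METH_POST ACL (server addresses illustrative):

```haproxy
backend webservers
    balance roundrobin
    option redispatch
    retry-on all-retryable-errors
    # Never retry POST requests at Layer 7
    http-request disable-l7-retry if METH_POST
    server s1 192.168.0.10:80
    server s2 192.168.0.11:80
```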
In this blog post, you learned about the retry-on directive that was added in HAProxy 2.0 and complements the existing option redispatch feature. This is a versatile feature that lets you mitigate various types of failures by specifying the events that should trigger a retry. However, you never know how things will truly work until you inject some faults into the system and see how it responds. By using Chaos Engineering techniques, I was able to verify that this directive adds resilience against unreachable nodes, network latency, slow servers, and HTTP errors.
Contact us to learn more about HAProxy Enterprise, which combines HAProxy, the world’s fastest and most widely used open-source load balancer and application delivery controller, with enterprise-class features, services, and premium support. You can also sign up for a free trial.