With HAProxy, you can implement a circuit breaker to protect services from widespread failure.
Martin Fowler, the author of Refactoring and Patterns of Enterprise Application Architecture, hosts a website where he catalogues software design patterns. He defines the Circuit Breaker pattern like this:
The basic idea behind the circuit breaker is very simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all.
A circuit breaker has these characteristics:
- It monitors services in real time, checking whether failures exceed a threshold;
- If failures become too prevalent, the circuit breaker shuts off the service for a time.
When enough errors are detected, a circuit breaker flips into the open state, which means it shuts off the service. When that happens, the calling service or application realizes that it cannot reach the service. It could have some contingency in place, such as increasing its retry period, rerouting to another service, or switching to some form of degraded functionality for a while. These fallback plans prevent the application from tying up any more time trying to use the service.
You would never use a circuit breaker between end users and your application, since it would lead to a bad user experience. Instead, it belongs between services in your backend infrastructure that depend on one another. For example, if an order fulfillment service needs to call an address verification service, then there is a dependency between those two services. Dependencies like that are common in a distributed environment, such as in a microservices architecture. In that context, a circuit breaker isolates failure to a small part of your overall system before it has a chance to seriously impact other parts.
In this blog post, you will learn how to implement a circuit breaker with HAProxy. You’ll see a simple way, which relies on HAProxy’s observe keyword, and a more complex way that allows greater customization.
Why We Need Circuit Breaking
A circuit breaker isolates failure to a small part of your overall system. To understand why it’s needed, let’s consider what other mechanisms HAProxy has in place to protect clients from faulty services.
First up, consider active health checks. When you add the check parameter to a server line in your HAProxy configuration, HAProxy pings that server to see if it’s up and working properly. This can either be an attempt to connect over TCP/IP or an attempt to send an HTTP request and get back a valid response. If the ping fails enough times, HAProxy stops load balancing to that server.
Because they target the service’s IP and port or a specific URL, active health checks monitor a narrow range of the service. They work well for detecting when a service is 100% down. But what if the error happens only when a certain API function is called, which is not monitored by the health checks? The health checks would report that the service is functioning properly, even if 70 or 80% of requests are calling the critical function and failing. Unlike active health checks, a circuit breaker monitors live traffic for errors, so it will catch errors in any part of the service.
Another mechanism built into HAProxy is automatic retries, which let you attempt a failed connection or HTTP request again. Retrying is an intrinsically optimistic operation: It expects that calling the service a second time will succeed, which is perfect for transient errors such as those caused by a momentary network disruption. Retries do not work as well when the errors are long-lived, such as those that happen when a bad version of the service has been deployed.
A circuit breaker is more pessimistic. After errors exceed a threshold, it assumes that the disruption is likely to be long-lived. To protect clients from continuously calling the faulty service, it shuts off access for a specified period of time. The hope is that, given enough time, the service will recover.
You can combine active health checks, retries and circuit breaking to get full coverage protection.
Implement a Circuit Breaker: The Simple Way
Since 2009, HAProxy has had the observe keyword, which enables live monitoring of traffic for detecting errors. It operates in either layer4 mode or layer7 mode, depending on whether you want to watch for failed TCP/IP connections or failed HTTP requests. When errors reach a threshold, the server is taken out of the load balancing rotation for a set period of time.
Consider the following example of a backend section that uses the observe layer7 keyword to monitor traffic for HTTP errors:
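A minimal sketch of such a backend might look like the following. The backend name serviceA matches the one used later in this post; the server names, the addresses, and the maxconn value of 10 are placeholders, and the sketch assumes a defaults section that sets mode http.

```haproxy
backend serviceA
    # maxconn 10 is an assumed value; the remaining keywords use the
    # values described in this section
    default-server maxconn 10 check observe layer7 error-limit 50 on-error mark-down inter 1s rise 30 slowstart 20s

    # Placeholder server names and addresses
    server s1 192.168.0.10:80
    server s2 192.168.0.11:80
```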
Keywords on the default-server line apply to all server lines that follow. These keywords mean the following:
|maxconn|The maximum number of connections HAProxy should open to the server in parallel.|
|check|Enables active health checking.|
|observe layer7|Monitor live traffic for HTTP errors.|
|error-limit 50|If consecutive errors reach 50, trigger the on-error action.|
|on-error mark-down|What to do when the error limit is reached: mark the server as down.|
|inter 1s|How often to send an active health check; in conjunction with rise, this sets the period to keep the server offline.|
|rise 30|How many active health checks must pass before bringing the server back online.|
|slowstart 20s|After the server recovers and is brought back online, this sends traffic to it gradually over 20 seconds until it reaches 100% of maxconn.|
With these keywords in place, HAProxy will perform live monitoring of traffic at the same time as it performs active health checking. If 50 consecutive requests fail, the server is marked as down and taken out of the load balancing rotation. The period of down time lasts for as long as it takes for the active health checks to report that the server is healthy again. Here, we’ve set the interval of the active health checks to one per second. There must be 30 successful health checks. So, a service will be shut off for a minimum of 30 seconds.
We’re also including the slowstart keyword, which eases the server back into full service once it becomes healthy, sending traffic to it gradually over 20 seconds. In circuit breaker terminology, this is called putting the server into a half-open state. A limited number of requests are allowed to invoke the service during this time.
With this implementation, each server is taken out of the load balancing rotation individually, as the load balancer detects a problem with it. However, if you prefer, you can put a rule in place that will quicken your reaction time by removing the entire pool of servers from active service once a certain number of servers have failed. For example, if you started with ten servers, but six have failed and been circuit broken, assume it won’t be long before the other four will fail too. So, circuit break them now.
To do this, add an http-request return line that uses the nbsrv fetch method to check how many servers are still up. If that number falls below a threshold, it returns a 503 error status for all requests. HAProxy will continue to check the servers in the background and will bring them back online when they are healthy again.
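As a sketch, the rule might look like this for a pool of ten servers. The threshold of four remaining servers follows the example above; the maxconn value and the server names and addresses are placeholders.

```haproxy
backend serviceA
    default-server maxconn 10 check observe layer7 error-limit 50 on-error mark-down inter 1s rise 30 slowstart 20s

    # Circuit break the whole pool: once four or fewer servers are still up,
    # answer every request with a 503 (the threshold is an assumption)
    http-request return status 503 if { nbsrv() le 4 }

    # Placeholder server names and addresses
    server s1 192.168.0.10:80
    server s2 192.168.0.11:80
    server s3 192.168.0.12:80
    server s4 192.168.0.13:80
    server s5 192.168.0.14:80
    server s6 192.168.0.15:80
    server s7 192.168.0.16:80
    server s8 192.168.0.17:80
    server s9 192.168.0.18:80
    server s10 192.168.0.19:80
```

Called without an argument inside a backend, nbsrv counts the usable servers of that backend.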
Implement a Circuit Breaker: The Advanced Way
There’s another way to implement a circuit breaker—one that isn’t as simple, but offers more ways for you to customize the behavior. It relies on several of HAProxy’s unique features including stick tables, ACLs, and variables.
Consider the following example. Everything in the backend section except for the server lines makes up our circuit breaker logic:
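Pieced together from the lines discussed in the rest of this section, the backend might look like the sketch below. The server names and addresses and the table size are assumptions, and the table name is passed to table_gpc1 explicitly here.

```haproxy
backend serviceA
    # Storage for the request rate, error count, error rate, and circuit state
    # (size 1 is an assumption: the table only ever holds one record)
    stick-table type string size 1 expire 30s store http_req_rate(10s),gpc0,gpc0_rate(10s),gpc1

    # Is the circuit open?
    acl circuit_open be_name,table_gpc1(serviceA) gt 0

    # Reject the request if the circuit is open
    http-request return status 503 content-type "application/json" string "{ \"message\": \"Circuit Breaker tripped\" }" if circuit_open

    # Begin tracking requests, keyed by the backend name
    http-request track-sc0 be_name

    # Count HTTP 5xx server errors
    http-response sc-inc-gpc0(0) if { status ge 500 }

    # Store the HTTP request rate and error rate in variables
    http-response set-var(res.req_rate) sc_http_req_rate(0)
    http-response set-var(res.err_rate) sc_gpc0_rate(0)

    # Open the circuit if more than 50% of recent requests were errors
    http-response sc-inc-gpc1(0) if { int(100),mul(res.err_rate),div(res.req_rate) gt 50 }

    # Placeholder server names and addresses
    server s1 192.168.0.10:80 check
    server s2 192.168.0.11:80 check
```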
When our circuit breaker detects that more than 50% of recent requests have resulted in an error, it shuts off the entire backend—not only a single server—and rejects all incoming requests for the next 30 seconds.
Let’s step through this configuration. First, we define a stick table:
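A sketch of that line, using the counters and the 30-second open period described in this section. The size of 1 is an assumption, based on the table holding a single record.

```haproxy
backend serviceA
    # Keyed by a string (the backend name); the record expires after 30 seconds
    stick-table type string size 1 expire 30s store http_req_rate(10s),gpc0,gpc0_rate(10s),gpc1
```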
A stick table tracks information about requests flowing through the load balancer. It stores a key, which in this case is a string, and associates it with counters. Here, the key is the name of the backend, serviceA. The counters include:
- http_req_rate(10s): the HTTP request rate over the last 10 seconds;
- gpc0: a general purpose counter, which will store a cumulative count of errors;
- gpc0_rate: the rate at which the general purpose counter (errors) is increasing over 10 seconds;
- gpc1: a second general purpose counter, which will store a 0 or 1 to indicate whether the circuit is open.
This stick table will store the HTTP request rate and the error rate for the backend. When errors exceed a set percentage of the requests, we set the second general purpose counter, gpc1, to 1, opening the circuit and shutting off the service. The stick table’s expire parameter is set to 30s, which means 30 seconds, which is how long the circuit will stay open before it reverts back.
When errors have reached the threshold, you can think of the stick table as holding a single record with the key serviceA and a gpc1 value of 1: the circuit is open, and the service is offline for 30 seconds.
After the stick-table line, we define an ACL named circuit_open:
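A sketch of the ACL, assuming the stick table is the one defined in the serviceA backend. The table name is written out as an argument to the converter here.

```haproxy
    # True when the record keyed by the backend name has gpc1 greater than zero
    acl circuit_open be_name,table_gpc1(serviceA) gt 0
```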
This line defines an expression that checks the gpc1 counter to see whether its value is greater than zero. If it is, then the circuit_open ACL will return true. Note that we’re using the table_gpc1 converter to get the value. There’s an important difference between this and the similar fetch method sc_get_gpc1(0). The sc_get_gpc1(0) fetch method will reset the expiration on the record when it’s used, but the table_gpc1 converter will not. In this instance we do not want to reset the expiration because that would extend the period of time that the service is down every time someone makes a request. With the converter, the expiration counts down 30 seconds and then restores the service, regardless of whether clients are trying to call the service in the meantime.
After that, an http-request return line rejects all requests if the circuit is open, returning an HTTP 503 status:
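One way to write that rule, assuming a JSON body with a single message field:

```haproxy
    # Short-circuit the request with a 503 and a JSON body while the circuit is open
    http-request return status 503 content-type "application/json" string "{ \"message\": \"Circuit Breaker tripped\" }" if circuit_open
```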
To give the caller more information, it sends back a JSON response with the message Circuit Breaker tripped. If this line is invoked, it ends processing of the request and the rest of the lines will not be called.
The next line begins tracking requests:
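A sketch of the tracking line, using the be_name fetch as the key so that the record is stored under the backend’s name:

```haproxy
    # Track the request in this backend's stick table under the key "serviceA"
    http-request track-sc0 be_name
```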
It adds a record to the stick table if it doesn’t exist and also updates the counters during each subsequent request. Note that the action is called track-sc0, which means it should start tracking sticky counter sc0. A sticky counter is a temporary variable that holds the value of the key long enough to add or fetch the record from the table. It is a slot HAProxy uses to track a request as it passes through. The http-request track-sc0 line assigns the sticky counter variable to use, sc0, and stores the backend’s name in it.
By default, you can use up to three sticky counters: track-sc0, track-sc1, and track-sc2. You can increase the number of sticky counters when compiling HAProxy.
The next line uses the sc-inc-gpc0(0) function to increment the first general purpose counter in the stick table if the server returned a status greater than or equal to 500:
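As a sketch, written as a response rule because the status is only known once the server has answered:

```haproxy
    # Increment gpc0 on the record tracked by sc0 for every HTTP 5xx response
    http-response sc-inc-gpc0(0) if { status ge 500 }
```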
The expression “status ge 500” counts any errors in the HTTP 5xx range. We’re counting errors manually, which allows us to control which error codes we care about. Later, we calculate the rate of errors using the gpc0_rate counter.
How should you read the funky syntax of the sc-inc-gpc0(0) function? It says: Look up the sticky counter 0, which is the sticky counter we chose previously—it’s the number in parentheses—and find the associated counter called gpc0. Then increment it. In other words, find the record that has the key serviceA and increment the error counter. Granted, in this configuration, the table will only ever have one record in it, since the stick table is defined in one backend, but not used anywhere else.
HAProxy also has built-in error counters, http_err_cnt and http_err_rate, but these look for errors with HTTP 4xx statuses only.
The next two lines store the HTTP request rate and error rate in variables. The error rate is the rate at which the gpc0 counter is being incremented. The helper functions sc_http_req_rate and sc_gpc0_rate return these values. We store them in variables named res.req_rate and res.err_rate:
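A sketch of those two lines, reading both rates from the record tracked by sticky counter 0:

```haproxy
    # Request rate and error rate for the tracked record
    http-response set-var(res.req_rate) sc_http_req_rate(0)
    http-response set-var(res.err_rate) sc_gpc0_rate(0)
```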
We must store them in variables because the next line uses the mul and div functions, which accept only an integer or a variable name as their argument, not a fetch method:
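One way to express the check, starting from the constant 100 and applying the two converters in turn:

```haproxy
    # 100 * err_rate / req_rate, compared against the 50% threshold
    http-response sc-inc-gpc1(0) if { int(100),mul(res.err_rate),div(res.req_rate) gt 50 }
```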
This line increments the second general purpose counter, gpc1, if more than 50% of the requests had errors. Once gpc1 is greater than zero, the circuit is open. There’s some math here that creates a percentage showing the rate of errors relative to the rate of all requests:
100 x error_rate / request_rate = X
if X > 50, open circuit
That’s it. That configures a circuit breaker for this backend that will shut off the service if more than 50% of the recent requests were errors. You can adjust the circuit breaker threshold by changing the number 50, or adjust the time that the circuit breaker stays open by changing the expire field on the stick table. You may also want to require a minimum number of requests before the error rate is checked. For example, maybe you only care if 50% of requests are errors if you’ve had at least 100 requests in the past ten seconds. If so, change the last line to this, which includes that extra condition:
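A sketch of the extended rule. Two conditions listed side by side must both be true, and the explicit -m int match on the variable is a cautious choice rather than something this post prescribes.

```haproxy
    # Open the circuit only if the error percentage exceeds 50
    # and at least 100 requests were seen in the last 10 seconds
    http-response sc-inc-gpc1(0) if { int(100),mul(res.err_rate),div(res.req_rate) gt 50 } { var(res.req_rate) -m int ge 100 }
```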
The circuit breaker pattern is ideal for detecting service failures that might not be caught by active health checks. It protects a system from widespread failure by isolating a faulty service, restricting access to it for a time. Clients can be designed to expect a circuit break and fall back to another service or simply deactivate that part of the application. HAProxy offers both a simple and an advanced way to implement the pattern, giving you plenty of flexibility.