HAProxy is a high-performance load balancer that provides advanced defense capabilities for detecting and protecting against malicious bot traffic to your website. Combining its unique ACL, map, and stick table systems with its powerful configuration language allows you to track and mitigate the full spectrum of today’s bot threats. Read on to learn how.

Read our blog post Application-Layer DDoS Attack Protection with HAProxy to learn why HAProxy is a key line of defense against DDoS used by many of the world’s top enterprises.


It is estimated that bots make up nearly half the traffic on the Internet. When we say bot, we’re talking about a computer program that automates a mundane task. Typical bot activities include crawling websites for indexing, such as how Googlebot finds and catalogues your web pages. Or, you might sign up for services that watch for cheap airline tickets or aggregate price lists to show you the best deal. These types of bots are generally seen as beneficial.

Unfortunately, a large portion of bots are used for malicious reasons. Their intentions include web scraping, spamming, request flooding, brute forcing, and vulnerability scanning. For example, bots may scrape your price lists so that competitors can consistently undercut you or build a competitive solution using your data. Or they may try to locate forums and comment sections where they can post spam. At other times, they’re scanning your site looking for security weaknesses.

HAProxy has best-in-class defense capabilities for detecting and protecting against many types of unwanted bot traffic. Its unique ACL, map, and stick table systems, as well as its flexible configuration language, are the building blocks that allow you to identify any type of bot behavior and neutralize it. Furthermore, HAProxy is well known for maintaining its high performance and efficiency while performing these complex tasks. For those reasons, companies like StackExchange have used HAProxy as a key component in their security strategy.

In this blog post, you’ll learn how to create an HAProxy configuration for bot protection. As you’ll see, bots typically exhibit unique behavior and catching them is a matter of recognizing the patterns. You’ll also learn how to whitelist good bots.

HAProxy Load Balancer

To create an HAProxy configuration for bot protection, you’ll first need to install HAProxy and place it in front of your application servers. All traffic is going to be routed through it so that client patterns can be identified. Then, proper thresholds can be determined and response policies can be implemented.

In this blog post, we’ll look at how many unique pages a client is visiting within a period of time and determine whether this behavior is normal or not. If it crosses the predetermined threshold, we’ll take action at the edge before it gets any further. We’ll also go beyond that and see how to detect and block bots that try to brute-force your login screen and bots that scan for vulnerabilities.

Bot Protection Strategy

Bots can be spotted because they exhibit non-human behavior. Let’s look at a specific behavior: web scraping. In that case, bots often browse a lot of unique pages very quickly in order to find the content or types of pages they’re looking for. A visitor that’s requesting dozens of unique pages per second is probably not human.

Our strategy is to set up the HAProxy load balancer to observe the number of requests each client is making. Then, we’ll check how many of those requests are for pages that the client is visiting for the first time. Remember, web scraping bots want to scan through many pages in a short time. If the rate at which they’re requesting new pages is above a threshold, we’ll flag that user and either deny their requests or route them to a different backend.

You’ll want to avoid blocking good bots like Googlebot though. So, you’ll see how to define whitelists that permit certain IP addresses through.

Detecting Web Scraping

Stick tables store and increment counters associated with clients as they make requests to your website. If you’d like an in-depth introduction, check out our blog post Introduction to HAProxy Stick Tables. To configure one, add a backend section to your HAProxy configuration file and then add a stick-table directive to it. Each backend can only have a single stick-table definition. We’re going to define two stick tables, as shown:
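The definitions below are a sketch based on the behavior described in this section; the table sizes and expirations are illustrative values to tune for your own traffic, and `len 8` assumes the 4-byte URL hash plus 4-byte IPv4 address produced by the `url32+src` fetch used later:

```haproxy
# Tracks per-IP, per-URL request rates over 24 hours.
backend per_ip_and_url_rates
    stick-table type binary len 8 size 1m expire 24h store http_req_rate(24h)

# Tracks a per-IP "new pages visited" counter and its 30-second rate.
backend per_ip_rates
    stick-table type ip size 1m expire 24h store gpc0,gpc0_rate(30s)
```

Note that these backends contain no servers; they exist only to host the stick tables.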

The first table, which is defined within your per_ip_and_url_rates backend, will track the number of times that a client has requested the current webpage during the last 24 hours. Clients are tracked by a unique key. In this case, the key is a combination of the client’s IP address and a hash of the path they’re requesting. Notice how the stick table’s type is binary so that the key can be this combination of data.

The second table, which is within a backend labelled per_ip_rates, stores a general-purpose counter called gpc0. You can increment a general-purpose counter when a custom-defined event occurs. We’re going to increment it whenever a client visits a page for the first time within the past 24 hours.

The gpc0_rate counter is going to tell us how fast the client is visiting new pages. The idea is that bots will visit more pages in less time than a normal user would. We’ve arbitrarily set the rate period to thirty seconds. Most of the time, bots are going to be fast. For example, the popular Scrapy bot is able to crawl about 3,000 pages per minute. On the other hand, bots can be configured to crawl your site at the same pace as a normal user would. Just keep in mind that you may want to change the rate period from thirty seconds to something longer, like 24 hours (24h), depending on how many pages a normal user is likely to look at within that amount of time.

Next, add a frontend section for receiving requests:
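A minimal frontend wiring both tables together might look like the following sketch; the bind port and backend name are placeholders:

```haproxy
frontend fe_main
    bind :80
    # sc0: track the client IP in the per_ip_rates table
    http-request track-sc0 src table per_ip_rates
    # sc1: track IP + URL hash, skipping static assets
    http-request track-sc1 url32+src table per_ip_and_url_rates unless { path_end .css .js .png .jpeg .gif }
    # First request for this page within 24h? Bump the per-IP new-page counter.
    http-request sc-inc-gpc0(0) if { sc_http_req_rate(1) eq 1 }
    default_backend web_servers
```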

The line http-request track-sc1 adds the client to the stick-table storage. It uses a combination of the client's IP address and the page they're visiting as the key, which you get with the built-in fetch method url32+src. A fetch method collects information about the current request.

Web pages these days pull in a lot of supporting files: JavaScript scripts, CSS stylesheets, images. By adding an unless statement to the end of your http-request track-sc1 line, you can exclude those file types from the count of new page requests. So, in this example, it won’t track requests for JavaScript, CSS, PNG, JPEG and GIF files.

The http-request track-sc1 line automatically updates any counters associated with the stick table, including the http_req_rate counter. So, in this case, the HTTP request count for the page goes up by one. When the count is exactly one for a given source IP address and page, it means the current user is visiting the page for the first time. When that happens, the conditional statement if { sc_http_req_rate(1) eq 1 } on the last line becomes true and the directive http-request sc-inc-gpc0(0) increments the gpc0 counter in our second stick table.

Now that you’re incrementing a general-purpose counter each time a client, identified by IP address, visits a new page, you’re also getting the rate at which that client is visiting new pages via the gpc0_rate(30s) counter. How many unique page visits over thirty seconds denotes too many? Tools like Google Analytics can help you here with its Pages / Session metric. Let’s say that 15 first-time page requests over that time constitutes bot-like behavior. You’ll define that threshold in the upcoming section.

Setting a Threshold

Now that you’re tracking data, it’s time to set a threshold that will separate the bots from the humans. Bots will request pages much faster, over a shorter time. Your first option is to block the request outright. Add an http-request deny directive to your frontend section:
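In sketch form, assuming the tracking rules from the previous section are in place (the threshold of 15 matches the earlier discussion):

```haproxy
frontend fe_main
    bind :80
    # ...tracking rules from the previous section...
    acl exceeds_limit sc_gpc0_rate(0) gt 15
    # Stop incrementing once the client is over the limit, so the ban
    # lapses when the 30-second rate window rolls over.
    http-request sc-inc-gpc0(0) if { sc_http_req_rate(1) eq 1 } !exceeds_limit
    http-request deny if exceeds_limit
```

To return 429 instead of 403, change the last line to `http-request deny deny_status 429 if exceeds_limit`.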

With this, any user who requests more than 15 unique pages within the last thirty seconds will get a 403 Forbidden response. Optionally, you can use deny_status to return an alternate code such as 429 Too Many Requests. Note that the user is only blocked for the duration of the rate period, or thirty seconds in this case, after which the rate falls back to zero. That's because we've added !exceeds_limit to the end of the http-request sc-inc-gpc0(0) line, so if the user keeps requesting new pages while over the limit, the counter won't keep incrementing and the block won't be extended indefinitely.

To go even further, you could use a general-purpose tag (gpt0) to tag suspected bots so that they can be denied from then on, even after their new-page request rate has dropped. This ban will last until their entry in the stick table expires, or 24 hours in this case. Expiration of records is set with the expire parameter on the stick-table. Start by adding gpt0 to the list of counters stored by the per_ip_rates stick table:
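The amended table definition would then read (other parameters unchanged from the earlier sketch):

```haproxy
backend per_ip_rates
    stick-table type ip size 1m expire 24h store gpc0,gpc0_rate(30s),gpt0
```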

Then, add http-request sc-set-gpt0(0) to your frontend to set the tag to 1, using the same condition as before. We’ll also add a line that denies all clients that have this flag set.
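A sketch of those two lines, reusing the exceeds_limit ACL defined earlier:

```haproxy
# Tag the client as a bot once it crosses the threshold...
http-request sc-set-gpt0(0) 1 if exceeds_limit
# ...and keep denying it for as long as its stick-table entry lives.
http-request deny if { sc_get_gpt0(0) eq 1 }
```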

Alternatively, you can send any tagged IP addresses to a special backend by using the use_backend directive, as shown:
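Here, be_bot_jail is a hypothetical backend name; the tagging line is the same as before:

```haproxy
http-request sc-set-gpt0(0) 1 if exceeds_limit
# Route tagged clients to a dedicated backend instead of denying them.
use_backend be_bot_jail if { sc_get_gpt0(0) eq 1 }
```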

This backend could, for example, serve up a cached version of your site or have server directives with a lower maxconn limit to ensure that they can’t swamp your server resources. In other words, you could allow bot traffic, but give it less priority.

Observing the Data Collection

You can use the Runtime API to see the data as it comes in. If you haven’t used it before, check out our blog post Dynamic Configuration with the HAProxy Runtime API to learn about the variety of commands available. In a nutshell, the Runtime API listens on a UNIX socket and you can send queries to it using either socat or netcat.

The show table [table name] command returns the entries that have been saved to a stick table. After setting up your HAProxy configuration and then making a few requests to your website, take a look at the contents of the per_ip_and_url_rates stick table, like so:
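Assuming your stats socket lives at /var/run/haproxy.sock (an assumption; substitute your own stats socket path), the query and an illustrative response look like:

```shell
$ echo "show table per_ip_and_url_rates" | socat stdio /var/run/haproxy.sock

# table: per_ip_and_url_rates, type: binary, size:1048576, used:2
0x10ab92c: key=203E97AA7F000001 use=0 exp=557590 http_req_rate(86400000)=1
0x10afd7c: key=3CBC49B17F000001 use=0 exp=596584 http_req_rate(86400000)=5
```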

I've made one request to /foo and five requests to /bar, all from the source IP 127.0.0.1. Although the key is in binary format, you can see that the first four bytes are different. Each key is a hash of the path I was requesting combined with my IP address, so it's easy to see that I've requested different pages. The http_req_rate value tells you how many times I've accessed these pages.

Did You Know? You can key off of IPv6 addresses with this configuration as well, by using the same url32+src fetch method.

Use the Runtime API to inspect the per_ip_rates table too. You’ll see the gpc0 and gpc0_rate values:
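Again assuming a stats socket at /var/run/haproxy.sock, an illustrative query and response:

```shell
$ echo "show table per_ip_rates" | socat stdio /var/run/haproxy.sock

# table: per_ip_rates, type: ip, size:1048576, used:1
0x10ab616: key=127.0.0.1 use=0 exp=594039 gpc0=2 gpc0_rate(30000)=2
```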

Here, the two requests for unique pages over the past 24 hours are reported as gpc0=2. The number of those that happened during the last thirty seconds was also two, as indicated by the gpc0_rate(30000) value.

If you’re operating more than one instance of HAProxy, combining the counters that each collects will be crucial to getting an accurate picture of user activity. HAProxy Enterprise provides cluster-wide tracking with a feature called the Stick Table Aggregator that does just that. This feature shares stick table data between instances using the peers protocol, adds the values together, and then returns the combined results back to each instance of HAProxy. In this way, you can detect patterns using a fuller set of data. Here’s a representation of how multiple peers can be synced:

Verifying Real Users

The risk in rate limiting is accidentally locking legitimate users out of your site. HAProxy Enterprise has the reCAPTCHA module that’s used to present a Google reCAPTCHA v2 challenge page. That way, your visitors can solve a puzzle and access the site if they’re ever flagged. In the next example, we use the reCAPTCHA Lua module so that visitors aren’t denied outright with no way to get back in.
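The wiring might look like the sketch below. The script path is an assumption, and the service and fetch names (lua.request_recaptcha, lua.verify_solved_captcha) are those exposed by the reCAPTCHA Lua module; check the module's documentation for the exact names and required parameters:

```haproxy
global
    # Path to the reCAPTCHA Lua module is an assumption; adjust to your install.
    lua-load /etc/haproxy/scripts/recaptcha.lua

frontend fe_main
    # Serve a reCAPTCHA challenge unless the client has solved one
    # or was never flagged (gpt0 still 0).
    http-request use-service lua.request_recaptcha unless { lua.verify_solved_captcha "ok" } or { sc_get_gpt0(0) eq 0 }
```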

Now, once an IP is marked as a bot, the client will just get reCAPTCHA challenges until they solve one, at which point they can go back to browsing normally.

HAProxy Enterprise has another great feature: the Antibot module. When a client behaves suspiciously by requesting too many unique pages, HAProxy will send them a JavaScript challenge. Many bots aren’t able to parse JavaScript at all, so this will stop them dead in their tracks. The nice thing about this is that it isn’t disruptive to normal users, so customer experience remains good.

Beyond Scrapers

So far, we’ve talked about detecting and blocking clients that access a large number of unique pages very quickly. This method is especially useful against scrapers, but similar rules can also be applied to detecting bots attempting to brute-force logins and scan for vulnerabilities. It requires only a few modifications.

Brute-force Bots

Bots attempting to brute force a login page have a couple of unique characteristics: They make POST requests and they hit the same URL (a login URL), repeatedly testing numerous username and password combinations. In the last section, we were tracking HTTP request rates for a given URL on a per-IP basis with the following line:
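That line, as sketched earlier (the static-asset exclusion list is illustrative):

```haproxy
http-request track-sc1 url32+src table per_ip_and_url_rates unless { path_end .css .js .png .jpeg .gif }
```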

We’ve been using http-request sc-inc-gpc0(0) to increment a general-purpose counter, gpc0, on the per_ip_rates stick table when the client is visiting a page for the first time.

You can use this same technique to block repeated hits on the same URL. The reasoning is that a bot that is targeting a login page will send an anomalous amount of POST requests to that page. You will want to watch for POST requests only.

First, because the per_ip_and_url_rates stick table is watching over a period of 24 hours and is collecting both GET and POST requests, let’s make a third stick table to detect brute-force activity. Add the following stick-table definition:
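A sketch, using a ten-minute expiration and a three-minute rate window as plausible starting values:

```haproxy
# Tracks POST request rates per IP + login URL over three minutes.
backend per_ip_and_url_bruteforce
    stick-table type binary len 8 size 1m expire 10m store http_req_rate(3m)
```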

Then add an http-request track-sc2 and an http-request deny line to the frontend:
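In sketch form; the threshold of 10 POSTs per three minutes is an assumption to tune for your login flow:

```haproxy
# Track only POST requests to the login page (METH_POST is a built-in ACL).
http-request track-sc2 url32+src table per_ip_and_url_bruteforce if METH_POST { path /login }
http-request deny if { sc_http_req_rate(2) gt 10 }
```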

You now have a stick table and rules that will detect repeated POST requests to the /login URL, as would be seen when an attacker attempts to find valid logins. Note how the ACL { path /login } restricts this to a specific URL. This is optional, as you could rate limit all paths that clients POST to by omitting it. Read our post Introduction to HAProxy ACLs for more information about defining custom rules using ACLs.

In addition to denying the request, you can also use any of the responses discussed in the Verifying Real Users section above in order to give valid users who happen to get caught in this net another chance.

Vulnerability Scanners

Vulnerability scanners are a threat you face as soon as you expose your site or application to the Internet. Generic vulnerability scanners will probe your site for many different paths, trying to determine whether you are running any known vulnerable, third-party applications.

Many site owners, appropriately, turn to a Web Application Firewall for such threats, such as the WAF that HAProxy Enterprise provides as a native module. However, many security experts agree that it’s beneficial to have multiple layers of protection. By using a combination of stick tables and ACLs, you’re able to detect vulnerability scanners before they are passed through to the WAF.

When a bot scans your site, it will typically try to access paths that don't exist within your application, such as /phpmyadmin and /wp-admin. Because the backend will respond with 404s to these requests, HAProxy can detect these conditions using the http_err_rate fetch. This keeps track of the rate of requests the client has made that resulted in a 4xx response code from the backend.

These vulnerability scanners usually make their requests pretty quickly. However, as high rates of 404s are fairly uncommon, you can add the http_err_rate counter to your existing per_ip_rates table:
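The amended definition might read as follows; the five-minute error-rate window is an illustrative value:

```haproxy
backend per_ip_rates
    stick-table type ip size 1m expire 24h store gpc0,gpc0_rate(30s),http_err_rate(5m)
```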

Now, with that additional counter and the http-request track-sc0 already in place, you can see, and query via the Runtime API, the 4xx rate for each client. Block offenders by adding the following line:
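For example, with a threshold of 10 errors over the table's rate window (the threshold is an assumption):

```haproxy
http-request deny if { sc_http_err_rate(0) gt 10 }
```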

You can also use the gpc0 counter that we are using for the scrapers to block them for a longer period of time:
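A sketch: once a client's error rate reaches the threshold, bump gpc0 so that the gpt0-based tagging from the scraper rules takes over. Using eq rather than gt means the counter is incremented once per rate window rather than on every further error:

```haproxy
http-request sc-inc-gpc0(0) if { sc_http_err_rate(0) eq 10 }
```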

Now the same limits that apply to scrapers will apply to vulnerability scanners, blocking them quickly before they succeed in finding vulnerabilities.

Alternatively, you can shadowban these clients and send their requests to a honeypot backend, which will not give the attacker any reason to believe that they have been blocked. Therefore, they will not attempt to evade the block. To do this, add the following in place of the http-request deny above. Be sure to define the backend be_honeypot:
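For example, using the same illustrative threshold as above:

```haproxy
use_backend be_honeypot if { sc_http_err_rate(0) gt 10 }
```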

Whitelisting Good Bots

Although our strategy is very effective at detecting and blocking bad bots, it will also catch Googlebot, Bingbot, and other friendly search crawlers with equal ease. You will want to welcome these bots, not banish them.

The first step to fixing this is to decide which bots you want to allow so that they don't get blocked. You'll build a list of good bot IP addresses, which you will need to update on a regular basis. The process takes some time, but is worth the effort! Google provides a helpful tutorial. Follow these steps:

  1. Make a list of strings found in the User-Agent headers of good bots (e.g. GoogleBot).
  2. Grep for the above strings in your access logs and extract the source IP addresses.
  3. Run a reverse DNS query to verify that the IP is indeed a valid good bot. There are plenty of bad bots masquerading as good ones.
  4. Check the forward DNS of the record you got in step 3 to ensure that it maps back to the bot’s IP, as otherwise an attacker could host fake reverse DNS records to confuse you.
  5. Use whois to extract the IP range from the whois listing so that you cover a larger number of IPs. Most companies are good about keeping their search bots and proxies within their own IP ranges.
  6. Export this list of IPs to a file with one IP or CIDR netmask per line (e.g. 192.168.0.0/24).

Now that you have a file containing the IP addresses of good bots, you will want to apply that to HAProxy so that these bots aren’t affected by your blocking rules. Save the file as whitelist.acl and then change the http-request track-sc1 line to the following:
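A sketch, assuming the file is stored at /etc/haproxy/whitelist.acl (the path is an assumption):

```haproxy
# Skip tracking for static assets OR for whitelisted source IPs.
http-request track-sc1 url32+src table per_ip_and_url_rates unless { path_end .css .js .png .jpeg .gif } or { src -f /etc/haproxy/whitelist.acl }
```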

Now, search engines won’t get their page views counted as scraping. If you have multiple files, such as another for whitelisting admin users, you can order them like this:
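For example, where admin.acl is a hypothetical second file:

```haproxy
http-request track-sc1 url32+src table per_ip_and_url_rates unless { path_end .css .js .png .jpeg .gif } or { src -f /etc/haproxy/admin.acl } or { src -f /etc/haproxy/whitelist.acl }
```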

When using whitelist files, it’s a good idea to ensure that they are distributed to all of your HAProxy servers and that each server is updated during runtime. An easy way to accomplish this is to purchase HAProxy Enterprise and use its lb-update module. This lets you host your ACL files at a URL and have each load balancer fetch updates at a defined interval. In this way, all instances are kept in sync from a central location.

Identifying Bots By Their Location

When it comes to identifying bots, using geolocation data to place different clients into categories can be a big help. You might decide to set a different rate limit for China, for example, if you were able to tell which clients originated from there.

In this section, you’ll see how to read geolocation databases with HAProxy. This can be done with either HAProxy Enterprise or HAProxy Community, although in different ways.

Geolocation with HAProxy Enterprise

HAProxy Enterprise provides modules that will read MaxMind and Digital Element geolocation databases natively. You can also read them with HAProxy Community, but you must first convert them to map files and then load the maps into HAProxy.

Let’s see how to do this with MaxMind using HAProxy Enterprise.

MaxMind

First, load the database by adding the following directives to the global section of your configuration:
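A sketch; the module and database paths are assumptions, and the exact directive spelling should be checked against your HAProxy Enterprise version's documentation:

```haproxy
global
    module-load hapee-lb-maxmind.so
    maxmind-load COUNTRY /etc/hapee/geolocation/GeoLite2-Country.mmdb
```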

Within your frontend, use http-request set-header to add a new HTTP header to all requests, which captures the client’s country:
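For example, using a header name of our choosing (X-GeoIP-Country) and the module's lookup converter; treat the converter arguments as a sketch to verify against the Enterprise documentation:

```haproxy
frontend fe_main
    http-request set-header X-GeoIP-Country %[src,maxmind-lookup(COUNTRY,country,iso_code)]
```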

Now, requests to the backend will include a new header that looks like this:
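Assuming the header name chosen above and a client in the United States:

```
X-GeoIP-Country: US
```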

You can also add the line maxmind-update url https://example.com/maxmind.mmdb to have HAProxy automatically update the database from a URL during runtime.

Digital Element

If you’re using Digital Element for geolocation, the same thing as we did for MaxMind can be done by adding the following to the global section of your configuration:

Then, inside of your frontend add an http-request set-header line:

This adds a header to all requests, which contains the client’s country:

To have HAProxy automatically update the Digital Element database during runtime, add netacuity-update url https://example.com/netacuity_db to your global section.

Read the next section if you’re using HAProxy Community, otherwise skip to the Using the Location Information section.

Geolocation with HAProxy Community

If you’re using HAProxy Community, you’ll first want to convert the geolocation database to a map file. In the following example, we will show converting the MaxMind city database into an HAProxy map file.

First, make a file named read_city_map.py with the following contents:

Next, download the MaxMind City database (with minor modifications, the script will also work with country-only databases). Either the GeoLite City or the paid City database CSV files will produce the same output. Then, extract the zip file.

When you run this script with a Blocks CSV as the first argument and the Locations CSV as the second argument, it will produce the files country_iso_code.map, city_name.map, and gps.map.

Use http-request set-header to add an HTTP header, as we did in the previous Enterprise examples:
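A sketch using HAProxy's standard map_ip converter; the map path and header name are assumptions:

```haproxy
frontend fe_main
    # Look up the client's country code in the generated map file.
    http-request set-header X-GeoIP-Country %[src,map_ip(/etc/haproxy/maps/country_iso_code.map)]
```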

Once again we will end up with a header that contains the client’s country.

We’ll use it in the next section.

Using the Location Information

Whether you used HAProxy Enterprise or HAProxy Community to get the geolocation information, you can now use it to make decisions. For example, you could route clients that trigger too many errors to a special, honeypot backend. With geolocation data, the threshold that you use might be higher or lower for some countries.

Since this information is stored in an HTTP header, your backend server will also have access to it, which gives you the ability to take further action from there. We won’t get into it here, but HAProxy also supports device detection and other types of application intelligence databases.

Conclusion

In this blog post, you learned how to identify and ban bad bots from your website by using the powerful configuration language within the HAProxy load balancer. Placing this type of bot protection in front of your servers will protect you from these crawlers as they attempt content scraping, brute forcing and mining for security vulnerabilities.

HAProxy Enterprise gives you several options in how you deal with these threats, allowing you to block them, send them to a dedicated backend, or present a challenge to them. Need help constructing an HAProxy configuration for bot detection and protection that accommodates your unique environment? Contact us to learn more or sign up for a free trial. HAProxy Technologies’ expert support team has many decades of experience mitigating many types of bot threats. They can help provide an approach tailored to your needs.

Are you using HAProxy for your bot defense? Let us know in the comment section below! Want to stay up to date as we publish similar topics? Subscribe to our blog and follow us on Twitter!
