HAProxy Enterprise Documentation 2.1r1

Verify Crawler

HAProxy Enterprise offers a module that verifies search engine crawlers. If a client sends a User-Agent HTTP header that claims it is a search engine crawler, the Verify Crawler daemon checks whether that claim is true.

Here is how it works:

  1. When HAProxy Enterprise receives a request from a client whose User-Agent string matches a known search engine crawler (e.g. Googlebot), it puts that client's information into a stick table called "unchecked_crawler".

  2. The Verify Crawler daemon runs in the background. It polls the Runtime API using the "show table" command to get the list of unverified crawlers that HAProxy Enterprise has flagged.

  3. For each unverified crawler, the daemon does a reverse DNS lookup on the client's source IP address. If the IP address resolves to a name under the expected search engine's domain (e.g. googlebot.com), the client passes the first test.

  4. The daemon then performs a forward DNS lookup on the search engine's domain name (e.g. googlebot.com) and gets back a list of IP addresses. If the client's source IP address is in this list, the client passes the second test.

  5. The daemon puts the client's source IP address into either the "valid_crawler" or "invalid_crawler" stick table so that HAProxy Enterprise knows the client's status the next time it receives a request from that client.
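
Steps 2 through 4 can be sketched in shell as follows. This is a minimal illustration under stated assumptions, not the daemon's actual implementation. The function names are hypothetical, dig is assumed to be installed, and stick table keys are assumed to have the "<source IP>|<domain>" form produced by the X-crawler-ipdomain header configured during installation. The sketch also performs the forward lookup on the hostname returned by the reverse lookup, which is how forward-confirmed reverse DNS is conventionally done; the daemon's exact behavior may differ.

```shell
#!/bin/sh
# Hypothetical helpers; none of these names come from hapee-verify-crawler.

# Step 2: read "show table unchecked_crawler.local" output on stdin and
# print "ip domain" for each entry. Assumes keys look like "<src>|<domain>".
parse_unchecked() {
  sed -n 's/^0x[0-9a-f]*: key=\([^| ]*\)|\([^ ]*\) .*/\1 \2/p'
}

# DNS helpers, kept separate so they can be swapped or stubbed (assumes dig).
reverse_lookup() { dig +short -x "$1"; }
forward_lookup() { dig +short "$1" A; dig +short "$1" AAAA; }

# Steps 3 and 4: succeed only if a PTR name for the IP falls under the
# expected domain and that same name resolves back to the IP.
verify_crawler() {
  vc_ip=$1
  vc_domain=$2
  for vc_name in $(reverse_lookup "$vc_ip"); do
    vc_name=${vc_name%.}                  # drop the trailing dot from the PTR name
    case "$vc_name" in
      "$vc_domain" | *".$vc_domain") ;;   # name is the domain or a subdomain of it
      *) continue ;;
    esac
    # Plain text comparison; IPv6 addresses written in a different form will not match.
    if forward_lookup "$vc_name" | grep -Fqx "$vc_ip"; then
      return 0
    fi
  done
  return 1
}
```

To try it against a running load balancer, assuming socat is installed and the socket path used elsewhere on this page, a pipeline along these lines should list the flagged clients:

    echo "show table unchecked_crawler.local" | \
      sudo socat stdio UNIX-CONNECT:/var/run/hapee-2.1/hapee-lb.sock | parse_unchecked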

It is up to you whether to log, deny, or take another action on invalid crawlers.
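
As one possible policy, the fragment below rejects requests from clients that claim to be a crawler but failed verification. It is only a sketch: it assumes the crawler and invalid_crawler ACLs from the frontend configuration in the installation steps are already defined in the same frontend, and the 403 status is an arbitrary choice.

```haproxy
frontend example
  # Assumes the "crawler" and "invalid_crawler" ACLs are defined earlier in this frontend.
  # Deny clients whose User-Agent matches a known crawler but whose IP failed DNS verification.
  http-request deny deny_status 403 if crawler invalid_crawler
```

Clients that are still unchecked are not affected by this rule, since neither table contains them until the daemon finishes verification.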

Installing the Verify Crawler daemon

  1. Install the Verify Crawler package:

    $ # On Debian/Ubuntu
    $ sudo apt-get install hapee-extras-verify-crawler
    $ # On CentOS/RedHat/Oracle
    $ sudo yum install hapee-extras-verify-crawler
    $ # On SUSE
    $ sudo zypper install hapee-extras-verify-crawler
    $ # On FreeBSD
    $ sudo pkg install hapee-extras-verify-crawler

  2. Start the daemon:

    $ sudo /opt/hapee-extras/sbin/hapee-verify-crawler \
      -s /var/run/hapee-2.1/hapee-lb.sock \
      -d

    The -s flag points to the HAProxy Enterprise socket.

    The -d flag runs the program in daemon / background mode.

    Alternatively, add a program section that starts the daemon to your configuration file and then restart the load balancer. The daemon will be managed by the HAProxy Enterprise Process Manager and will run as a child process under the main load balancer process:

    program verifycrawler
      command /opt/hapee-extras/sbin/hapee-verify-crawler -s /var/run/hapee-2.1/hapee-lb.sock -d
      no option start-on-reload

  3. Add the following backend sections to your configuration. These set up the required stick tables.

    backend unchecked_crawler.local
      stick-table type string len 60 size 1m expire 24h store gpc0
    
    backend valid_crawler.local
      stick-table type ip size 1m expire 24h store gpc0
    
    backend invalid_crawler.local
      stick-table type ip size 1m expire 24h store gpc0

  4. In the frontend or listen section where you would like to enable crawler verification, add the following lines:

    frontend example
      acl crawler hdr_sub(user-agent) -M -f /etc/hapee-extras/crawler.map
      acl unchecked_crawler sc_get_gpc0(5) -m int eq 0
      acl invalid_crawler   src,table_gpc0(invalid_crawler.local) -m int gt 0
      acl valid_crawler     src,table_gpc0(valid_crawler.local) -m int gt 0
    
      http-request set-header X-crawler-ipdomain %[src]|%[hdr(user-agent),map_sub(/etc/hapee-extras/crawler.map)]
      http-request track-sc5 req.hdr(X-crawler-ipdomain) table unchecked_crawler.local if !valid_crawler !invalid_crawler crawler

    These lines flag search engine crawlers for verification. They do not take any action on invalid crawlers. The daemon performs verification in the background and ultimately adds the client to either the valid_crawler or invalid_crawler stick table, at which point the corresponding ACL evaluates to true. Verification does not block the current request; the result becomes available the next time the client makes a request.

  5. To have HAProxy Enterprise recognize additional search engine crawlers, add them to /etc/hapee-extras/crawler.map:

    Googlebot googlebot.com
    Yandexbot yandex.com
    YandexImages yandex.com
    bingbot msn.com
    Baiduspider baidu.com
    coccocbot coccoc.com
    SeznamBot seznam.cz

    Each line in this file contains a substring to match in the User-Agent header, followed by the DNS domain name to which the crawler's IP address should resolve.

  6. OPTIONAL: Capture the verification status of crawlers in the access logs by adding the following lines to your frontend or listen section:

    frontend example
      # Captures the first 10 characters of the User-Agent header so that they appear in the log
      http-request capture req.hdr(user-agent) len 10
    
      # Sets a variable named 'CrawlerStatus' to 'Unchecked Crawler'
      http-request set-var(txn.CrawlerStatus) str("Unchecked Crawler") if crawler !valid_crawler !invalid_crawler
    
      # Sets a variable named 'CrawlerStatus' to 'Invalid Crawler'
      http-request set-var(txn.CrawlerStatus) str("Invalid Crawler") if crawler invalid_crawler
    
      # Sets a variable named 'CrawlerStatus' to 'Valid Crawler'
      http-request set-var(txn.CrawlerStatus) str("Valid Crawler") if crawler valid_crawler
    
      # Sets a custom log format based on the default HTTP log format. The 'CrawlerStatus' variable is logged at the end.
      log-format "%ci:%cp [%tr] %ft %b/%s %TR/%Tw/%Tc/%Tr/%Ta %ST %B %CC %CS %tsc %ac/%fc/%bc/%sc/%rc %sq/%bq %hr %hs %{+Q}r %{+Q}[var(txn.CrawlerStatus)]"

    The snippet below shows an example log line:

    192.168.50.1:63620 [01/Sep/2020:19:49:11.535] fe_main be_servers/s1 0/0/0/4/4 200 479 - - ---- 1/1/0/0/0 0/0 {Googlebot} "GET / HTTP/1.1" "Invalid Crawler"
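
Once the status is logged, a short shell helper can summarize it. The sketch below is hypothetical (the function name is not part of HAProxy Enterprise) and assumes every line follows the custom log format above, so that the CrawlerStatus value is the last double-quoted field on the line.

```shell
#!/bin/sh
# Hypothetical helper: count access log lines by their final quoted field.
# Reads log lines on stdin and prints "<count> <status>", highest count first.
crawler_status_counts() {
  awk -F'"' '
    NF >= 5 { count[$(NF - 1)]++ }    # $(NF-1) is the text inside the last pair of quotes
    END { for (s in count) print count[s], s }
  ' | sort -rn
}
```

For example, crawler_status_counts < /path/to/access.log, where the path is a placeholder for wherever your access logs are written.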
