Verify Crawler
HAProxy Enterprise offers a module that verifies search engine crawlers. If a client presents a User-Agent HTTP header claiming to be a search engine crawler, this program verifies whether that claim is true.
Here is how it works:
When HAProxy Enterprise receives a request from a client whose User-Agent string matches a known search engine crawler (e.g. Googlebot), it puts that client's information into a stick table named unchecked_crawler.
Verify-Crawler runs in the background as a daemon. It polls the Runtime API "show table" command to get the list of unverified crawlers that HAProxy Enterprise has flagged.
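You can run the same Runtime API command yourself to see which clients are awaiting verification. A minimal sketch, assuming socat is installed and that you use the socket path and stick table name shown later in this guide:

$ # Lists entries in the table of crawlers awaiting verification
$ echo "show table unchecked_crawler.local" | sudo socat stdio /var/run/hapee-2.5/hapee-lb.sock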
For each unverified crawler, the daemon does a reverse DNS lookup on the client's source IP address. If the IP address resolves to the expected search engine's domain name (e.g. googlebot.com), then it has passed the first test.
The daemon then performs a forward DNS lookup on the search engine's domain name (e.g. googlebot.com) and gets back a list of IP addresses. If the client's source IP address is in this list, it passes the second test.
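You can reproduce both DNS tests by hand with standard tools. The sketch below uses an illustrative Googlebot source address; the exact addresses and host names you see will differ:

$ # First test: reverse DNS lookup on the client's source IP
$ host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
$ # Second test: forward DNS lookup on the name returned above
$ host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

The client passes if the reverse lookup lands in the expected domain (here, googlebot.com) and the forward lookup returns the original IP address.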
The daemon puts the client's source IP address into either the "valid_crawler" or "invalid_crawler" stick table so that HAProxy Enterprise knows the client's status the next time it receives a request from that client.
It is up to you whether to log, deny, or take another action on invalid crawlers.
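For example, a minimal sketch of a deny rule, using the invalid_crawler ACL that is defined in the frontend configuration later in this guide (rejecting with HTTP 403 is one choice, not a requirement of the module):

frontend example
   http-request deny deny_status 403 if invalid_crawler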
Installing the Verify Crawler daemon
- Install the Verify Crawler package:
$ # On Debian/Ubuntu
$ sudo apt-get install hapee-extras-verify-crawler

$ # On CentOS/RedHat/Oracle/Photon OS
$ sudo yum install hapee-extras-verify-crawler

$ # On SUSE
$ sudo zypper install hapee-extras-verify-crawler

$ # On FreeBSD
$ sudo pkg install hapee-extras-verify-crawler
- Start the daemon:

$ sudo /opt/hapee-extras/sbin/hapee-verify-crawler \
     -s /var/run/hapee-2.5/hapee-lb.sock \
     -D
The -s flag points to the HAProxy Enterprise socket. The -D flag runs the program in daemon / background mode.

Alternatively, add a program section that starts the daemon to your configuration file and then restart the load balancer. The daemon will be managed by the HAProxy Enterprise Process Manager and will run as a child process under the main load balancer process:

global
   master-worker

program verifycrawler
   command /opt/hapee-extras/sbin/hapee-verify-crawler -s /var/run/hapee-2.5/hapee-lb.sock -D
   no option start-on-reload
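In either case, you can confirm that the daemon is running with a standard process lookup (a quick check, assuming the binary path shown above):

$ pgrep -af hapee-verify-crawler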
- Add the following backend sections to your configuration. These set up the required stick tables:

backend unchecked_crawler.local
   stick-table type string len 60 size 1m expire 24h store gpc0

backend valid_crawler.local
   stick-table type ip size 1m expire 24h store gpc0

backend invalid_crawler.local
   stick-table type ip size 1m expire 24h store gpc0
- In the frontend or listen section where you would like to enable crawler verification, add the following lines:

frontend example
   acl crawler hdr_sub(user-agent) -M -f /etc/hapee-extras/crawler.map
   acl unchecked_crawler sc_get_gpc0(5) -m int eq 0
   acl invalid_crawler src,table_gpc0(invalid_crawler.local) -m int gt 0
   acl valid_crawler src,table_gpc0(valid_crawler.local) -m int gt 0
   http-request set-header X-crawler-ipdomain %[src]|%[hdr(user-agent),map_sub(/etc/hapee-extras/crawler.map)]
   http-request track-sc5 req.hdr(X-crawler-ipdomain) table unchecked_crawler.local if !valid_crawler !invalid_crawler crawler

These lines flag search engine crawlers for verification. They do not take any action on invalid crawlers. The daemon performs verification in the background and, ultimately, either valid_crawler or invalid_crawler will be set to true. This process does not block the current request, but makes these ACLs available the next time the client makes a request.
- Add more search engine crawlers to /etc/hapee-extras/crawler.map for HAProxy Enterprise to recognize them as crawlers:

Googlebot     googlebot.com
Yandexbot     yandex.com
YandexImages  yandex.com
bingbot       msn.com
Baiduspider   baidu.com
coccocbot     coccoc.com
SeznamBot     seznam.cz
Each line in this file is the User-Agent string followed by the DNS domain name to which the crawler's IP address should resolve.
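For example, to recognize Applebot, whose source addresses Apple documents as reverse-resolving within the applebot.apple.com domain, a hypothetical entry would look like the following (verify the domain against the search engine's own documentation before relying on it):

Applebot applebot.apple.com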
- OPTIONAL: Capture the verification status of crawlers in the access logs by adding the following lines to your frontend or listen section:

frontend example
   # Logs the User-Agent header
   http-request capture req.hdr(user-agent) len 10
   # Sets a variable named 'CrawlerStatus' to 'Unchecked Crawler'
   http-request set-var(txn.CrawlerStatus) str("Unchecked Crawler") if crawler !valid_crawler !invalid_crawler
   # Sets a variable named 'CrawlerStatus' to 'Invalid Crawler'
   http-request set-var(txn.CrawlerStatus) str("Invalid Crawler") if crawler invalid_crawler
   # Sets a variable named 'CrawlerStatus' to 'Valid Crawler'
   http-request set-var(txn.CrawlerStatus) str("Valid Crawler") if crawler valid_crawler
   # Sets a custom log format based on the default HTTP log format. The 'CrawlerStatus' variable is logged at the end.
   log-format "%ci:%cp [%tr] %ft %b/%s %TR/%Tw/%Tc/%Tr/%Ta %ST %B %CC %CS %tsc %ac/%fc/%bc/%sc/%rc %sq/%bq %hr %hs %{+Q}r %{+Q}[var(txn.CrawlerStatus)]"
The snippet below shows an example log line:
192.168.50.1:63620 [01/Sep/2020:19:49:11.535] fe_main be_servers/s1 0/0/0/4/4 200 479 - - ---- 1/1/0/0/0 0/0 {Googlebot} "GET / HTTP/1.1" "Invalid Crawler"
Next up
Handling a DoS attack