This article describes how to block automated scanners from probing your website. The technique should work against any modern web scanner that parses robots.txt, which all popular web scanners do.
Website owners use the robots.txt file to give instructions about their site to web robots, such as Google’s indexing bot. The /robots.txt file is a text file, with one or more records, and usually contains a list of directories/URLs that should not be crawled and indexed by search engines. Legitimate search engine robots follow the instructions in robots.txt. Any type of scanning software that does not follow these instructions can be considered malicious. This article proceeds to explain how the sources of these scan attempts can be logged and blocked.
For example, a record made of the two lines "User-agent: *" and "Disallow: /cgi-bin/" instructs all search engines not to crawl or index the directory /cgi-bin/.
Web scanners, by contrast, parse robots.txt and crawl every URL listed in it, because administrators often list administrative interfaces or other high-value resources there, and these are very interesting from a security point of view.
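To make the scanner behaviour concrete, here is a minimal Python sketch of how a tool might turn the Disallow entries of robots.txt into a list of URLs to crawl. The function name, the example.com host and the /admin/ entry are hypothetical and only serve the illustration; real scanners are more elaborate.

```python
from urllib.parse import urljoin


def disallowed_urls(robots_txt: str, base_url: str) -> list[str]:
    """Return absolute URLs for every non-empty Disallow entry."""
    urls = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            urls.append(urljoin(base_url, value.strip()))
    return urls


# Hypothetical robots.txt body; /admin/ is an invented entry.
sample = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /admin/\n"
print(disallowed_urls(sample, "https://example.com"))
```

A well-behaved crawler uses the same list to stay away from those paths; a scanner uses it as a list of targets, which is what the honeypot exploits.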
To configure our automated scanner honeypot, we’ll start by adding a robots.txt entry for a directory that contains no other content used by the site.
Normal visitors would not know about this directory, since nothing on the site links to it. Search engines would not visit it either, since robots.txt excludes it. Only web scanners or curious people would find this directory.
Inside this directory we place a page containing a hidden form that is visible only to web scanners, since they ignore CSS stylesheets.
The HTML of this page contains a div, set to be invisible, that wraps a form. The form has a single input and points to a file named dontdoit.php.
Normal visitors who open this page in a browser will not see the form, because the div that wraps it is hidden.
A web scanner, on the other hand, ignores CSS styles and therefore sees the form.
The web scanner will submit this form and start testing the form inputs with various payloads looking for vulnerabilities.
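The difference between the two views can be sketched in Python. The HTML below is a hypothetical reconstruction of such a page: only the hidden div, the single input and the dontdoit.php target come from the description above, while the input name and the POST method are assumptions. The parser collects forms without evaluating any CSS, which is roughly how a scanner ends up finding a form that a browser user never sees.

```python
from html.parser import HTMLParser

# Hypothetical honeypot page: the div hides the form from browser users only.
HONEYPOT_HTML = """
<html><body>
<div style="display:none">
  <form action="dontdoit.php" method="post">
    <input type="text" name="email">
  </form>
</div>
</body></html>
"""


class FormCollector(HTMLParser):
    """Collect form actions and input names, paying no attention to CSS."""

    def __init__(self):
        super().__init__()
        self.forms = []  # list of {"action": str, "inputs": [str, ...]}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.forms.append({"action": attrs.get("action", ""), "inputs": []})
        elif tag == "input" and self.forms:
            self.forms[-1]["inputs"].append(attrs.get("name", ""))


def find_forms(html: str) -> list:
    """Return every form found in the markup, hidden or not."""
    collector = FormCollector()
    collector.feed(html)
    return collector.forms


print(find_forms(HONEYPOT_HTML))
```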
At this point, if somebody requests dontdoit.php, it’s either a web scanner or a hacker. We have two options: log the hacking attempt, or automatically block the IP address making the request.
If we want to automatically block the IP address making the request, we could use a tool such as Fail2ban, which scans log files and bans IP addresses that show signs of malicious activity, usually by updating firewall rules.
We could log every request to dontdoit.php into a log file and configure Fail2ban to parse that file and temporarily block, at the firewall, every IP address listed in it.
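The logging step might look like the following sketch. The handler in this article is a PHP file, so the Python here only illustrates the idea and is not the original code. The "[hit from ...]" text follows the pattern used by the sample filter; the leading timestamp and both function names are assumptions.

```python
from datetime import datetime


def format_hit(ip: str, when: datetime) -> str:
    """Build one log line; "[hit from <ip>]" is the part Fail2ban looks for."""
    # The timestamp prefix is an assumption so each hit carries its own time.
    return f"{when:%Y-%m-%d %H:%M:%S} [hit from {ip}]\n"


def log_hit(ip: str, log_path: str) -> None:
    """Append a hit for the given client IP to the honeypot log."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(format_hit(ip, datetime.now()))
```

The handler behind dontdoit.php would call something like log_hit(client_ip, "/var/log/apache2/automated-scanners.log"), using the same path that the Fail2ban logpath setting points to.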
Sample Fail2ban configuration:
# jail section (section name assumed to match the filter name)
[block-automated-scanners]
enabled = true
port = 80,443
filter = block-automated-scanners
# path to your logs
logpath = /var/log/apache2/automated-scanners.log
# max number of hits
maxretry = 1
# bantime in seconds - set negative to ban forever (18000 = 5 hours)
bantime = 18000
action = iptables-multiport[name=HTTP, port="80,443", protocol=tcp]
         sendmail-whois[name=block-automated-scanners, dest=firstname.lastname@example.org, sender=email@example.com]
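Sample filter definition (Fail2ban resolves "filter = block-automated-scanners" to filter.d/block-automated-scanners.conf):
# Fail2Ban configuration file
# Author: Bogdan Calin
[Definition]
# Option: failregex
failregex = \[hit from <HOST>\]
# Option: ignoreregex
# Notes.: regex to ignore. If this regex matches, the line is ignored.
# Values: TEXT
ignoreregex =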
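Before relying on the filter, it can help to check the expression against a sample log line. The sketch below does this in plain Python with the square brackets escaped so that they match literally rather than forming a character class. HOST_PATTERN is a simplified, IPv4-only stand-in for Fail2ban's <HOST> tag and is an assumption; the real expansion also covers hostnames and IPv6.

```python
import re

# Simplified IPv4-only stand-in for Fail2ban's <HOST> tag (an assumption).
HOST_PATTERN = r"(?P<host>\d{1,3}(?:\.\d{1,3}){3})"


def compile_failregex(failregex: str) -> re.Pattern:
    """Expand <HOST> and compile the expression for local testing."""
    return re.compile(failregex.replace("<HOST>", HOST_PATTERN))


def banned_host(failregex: str, line: str):
    """Return the host captured from a log line, or None if nothing matches."""
    match = compile_failregex(failregex).search(line)
    return match.group("host") if match else None


print(banned_host(r"\[hit from <HOST>\]", "[hit from 203.0.113.7]"))
```

Fail2ban ships its own fail2ban-regex command for testing a filter against a real log file, which is the more authoritative check.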
Scanning a site with a honeypot page
If you decide to scan a site that has a honeypot page with a Web Application Security Scanner, configure the scanner not to access the honeypot page; otherwise the scanner's IP address will be blocked too.
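How the exclusion is configured depends on the scanner. As a rough illustration of the idea only, a crawl queue could be filtered like this; the /honeypot/ prefix and the function name are hypothetical, since the honeypot directory is whatever you listed in robots.txt.

```python
from urllib.parse import urlsplit


def should_scan(url: str, excluded_prefixes: list[str]) -> bool:
    """Return False when the URL path falls under an excluded prefix."""
    path = urlsplit(url).path or "/"
    return not any(path.startswith(prefix) for prefix in excluded_prefixes)


# Hypothetical honeypot location; replace with your own directory.
print(should_scan("https://example.com/honeypot/dontdoit.php", ["/honeypot/"]))
```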