This blog post describes how to block automated scanners from scanning your website. The technique should work with any modern web scanner that parses robots.txt (all popular web scanners do this).

Website owners use the robots.txt file to give instructions about their site to web robots, such as Google’s indexing bot. The /robots.txt file is a plain text file with one or more records, and it usually contains a list of directories or URLs that should not be crawled and indexed by search engines. Legitimate search engine robots follow the instructions in robots.txt; any scanning software that ignores them can be considered malicious. This blog post explains how the sources of such scan attempts can be logged and blocked. A typical robots.txt record looks like this:

User-agent: *
Disallow: /cgi-bin/

In this example, search engines are instructed not to index the directory /cgi-bin/.

Web scanners parse robots.txt and crawl every URL listed in it, because administrators usually list administrative interfaces and other high-value resources there, and these are very interesting from a security point of view.

To configure our automated scanner honeypot, we’ll start by adding a robots.txt entry for a directory that does not contain any content used by the site.

User-agent: *
Disallow: /dontVisitMe/

Normal visitors will not know about this directory, since it is not linked from anywhere on the site, and search engines will not visit it, since robots.txt instructs them not to. Only web scanners or curious people will ever reach it.

Inside this directory we place a page containing a form that is hidden with CSS, so it is visible only to web scanners, since they ignore CSS stylesheets.

The HTML of this page is similar to the following:
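Here is a minimal sketch of such a page; the CSS class name, input name, and form method are purely illustrative, the only requirement is that the hidden form points to dontdoit.php.

<html>
  <head>
    <style type="text/css">
      /* the honeypot form is hidden from human visitors */
      .dontshow { display: none; }
    </style>
  </head>
  <body>
    <div class="dontshow">
      <!-- only scanners, which ignore CSS, will see and submit this form -->
      <form action="dontdoit.php" method="post">
        <input type="text" name="search" value="">
      </form>
    </div>
  </body>
</html>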

In the sample code, we placed a form inside a div that is set to be invisible. The form has a single input and points to a file named dontdoit.php.

When this page is visited by a normal visitor in a browser, they will see the following:

[Screenshot block-automated-scanners-human: the page as rendered by a browser, with the form hidden]

A web scanner, on the other hand, will see something different, because it ignores the CSS styling:

[Screenshot block-automated-scanners-scanner: the page as seen by a web scanner, with the form exposed]

The web scanner will submit this form and start testing the form inputs with various payloads, looking for vulnerabilities.

At this point, anybody who visits the file dontdoit.php is either a web scanner or a hacker. We have two options: log the hacking attempt, or automatically block the IP address making the request.

If we want to automatically block the IP address making the request, we can use a tool such as Fail2ban.

Fail2ban scans log files (e.g. /var/log/apache/error_log) and bans IPs that show malicious signs, such as too many password failures or exploit attempts.

We can log all requests to dontdoit.php into a dedicated log file and configure Fail2ban to parse this file and temporarily block the offending IP addresses at the firewall.
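A minimal dontdoit.php could look like the sketch below; the log path and the line format are chosen to match the logpath and failregex in the Fail2ban configuration that follows, and are otherwise just assumptions.

<?php
// dontdoit.php - honeypot handler (sketch)
// Appends one line per request in the format expected by the Fail2ban filter,
// e.g. "2016-05-10 14:32:07 [hit from 203.0.113.5]"

$logfile = '/var/log/apache2/automated-scanners.log'; // must be writable by the web server user
$ip      = $_SERVER['REMOTE_ADDR'];

$line = date('Y-m-d H:i:s') . ' [hit from ' . $ip . ']' . PHP_EOL;
file_put_contents($logfile, $line, FILE_APPEND | LOCK_EX);

// Return a generic response so the scanner learns nothing useful
http_response_code(403);
echo 'Forbidden';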

Sample Fail2ban configuration:

/etc/fail2ban/jail.conf


[block-automated-scanners]
enabled = true
port = 80,443
filter = block-automated-scanners
# path to your logs
logpath = /var/log/apache2/automated-scanners.log
# max number of hits
maxretry = 1
# bantime in seconds - set negative to ban forever (18000 = 5 hours)
bantime = 18000
# action
action = iptables-multiport[name=HTTP, port="80,443", protocol=tcp]
         sendmail-whois[name=block-automated-scanners, dest=admin@site.com, sender=fail2ban@site.com]

/etc/fail2ban/filter.d/block-automated-scanners.conf


# Fail2Ban configuration file
#
# Author: Bogdan Calin
#
[Definition]

# Option: failregex
# Notes.: regex to match honeypot hits, e.g. a log line containing "[hit from 192.0.2.1]".
#         The square brackets are escaped so they are matched literally; <HOST> captures the source IP.
# Values: TEXT
#
failregex = \[hit from <HOST>\]

# Option: ignoreregex
# Notes.: regex to ignore. If this regex matches, the line is ignored.
# Values: TEXT
#
ignoreregex =