This blog post describes how to block automated scanners from scanning your website. The technique should work with any modern web scanner that parses robots.txt (all popular web scanners do this).

Website owners use the robots.txt file to give instructions about their site to web robots, such as Google’s indexing bot. The /robots.txt file is a plain text file with one or more records, and usually contains a list of directories/URLs that should not be crawled and indexed by search engines. Legitimate search engine robots follow the instructions in robots.txt; any scanning software that ignores these instructions can be considered malicious. This blog post explains how the sources of such scan attempts can be logged and blocked. A minimal robots.txt looks like this:

User-agent: *
Disallow: /cgi-bin/

In this example, search engines are instructed not to index the directory /cgi-bin/.

Web scanners parse and crawl all the URLs from robots.txt because administrators usually place administrative interfaces or other high-value resources (that could be very interesting from a security point of view) there.

To configure our automated scanner honeypot, we’ll start by adding an entry in robots.txt for a directory. This directory does not contain any content used by the site.

User-agent: *
Disallow: /dontVisitMe/

Normal visitors would not know about this directory since it is not linked from anywhere on the site. Search engines would not visit it either, since they are restricted by robots.txt. Only web scanners or curious people would find this directory.

Inside this directory we place a page containing a hidden form that is only visible to web scanners, as they ignore CSS stylesheets.

The HTML of this page is similar to the following:
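The original sample was shown as an image; the sketch below is a hypothetical reconstruction based on the description in this post (the input name "address" and the POST method are assumptions):

```html
<!-- The div is hidden via CSS, so browsers never render the form,
     but scanners that ignore stylesheets will find and submit it. -->
<div style="display: none;">
  <form action="dontdoit.php" method="post">
    <input type="text" name="address" value="">
  </form>
</div>
```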

In the sample code, we placed a div that includes a form; the div is set to be invisible. The form has a single input and points to a file named dontdoit.php.

When this page is visited by normal visitors with a browser, the hidden form is not rendered, so they see the page as usual. A web scanner, on the other hand, ignores the CSS style and therefore sees the form and its input field.
The web scanner will submit this form and start testing the form inputs with various payloads looking for vulnerabilities.

At this point if somebody visits the file dontdoit.php, it’s either a web scanner or a hacker. We have two options: either log the hacking attempt or automatically block the IP address making the request.

If we want to automatically block the IP address making the request we could use a tool such as Fail2ban.

Fail2ban scans log files (e.g. /var/log/apache/error_log) and bans IPs that show malicious signs, such as too many password failures or attempts to find exploits.

We could log all requests to dontdoit.php into a log file and configure Fail2ban to parse this file and automatically block, at the firewall and for a limited time, the IP addresses listed in it.
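The article does not show dontdoit.php itself; a minimal sketch could look like the following (the log path is taken from the Fail2ban configuration below; the log line format must match the failregex in the filter, and the 404 response is an assumption made to reveal nothing to the scanner):

```php
<?php
// dontdoit.php — honeypot endpoint. Anyone requesting it is either a
// web scanner or an attacker, so log the source IP in the format the
// Fail2ban failregex expects: [hit from <ip>]
$ip   = $_SERVER['REMOTE_ADDR'];
$line = sprintf("[hit from %s]\n", $ip);

// Path must match the `logpath` setting in the Fail2ban jail.
file_put_contents('/var/log/apache2/automated-scanners.log',
                  $line, FILE_APPEND | LOCK_EX);

http_response_code(404); // pretend the page does not exist
```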

Sample Fail2ban configuration:


[block-automated-scanners]
enabled = true
port = 80,443
filter = block-automated-scanners
# path to your logs
logpath = /var/log/apache2/automated-scanners.log
# max number of hits
maxretry = 1
# bantime in seconds - set negative to ban forever (18000 = 5 hours)
bantime = 18000
# action
action = iptables-multiport[name=HTTP, port="80,443", protocol=tcp]


The matching filter (e.g. saved as /etc/fail2ban/filter.d/block-automated-scanners.conf):

# Fail2Ban configuration file
# Author: Bogdan Calin

[Definition]

# Option: failregex
failregex = \[hit from <HOST>\]

# Option: ignoreregex
# Notes.: regex to ignore. If this regex matches, the line is ignored.
# Values: TEXT
ignoreregex =
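Before enabling the jail, it is worth checking that the log lines written by the honeypot actually match the failregex. The fail2ban-regex tool can do this against the real filter file; the quick sanity check below instead uses grep with an equivalent escaped pattern (the /tmp path and sample IP are made up):

```shell
# Write one log line in the honeypot's format, then count how many
# lines match an escaped version of the Fail2ban failregex.
log=/tmp/automated-scanners.log          # stand-in for the real logpath
echo "[hit from 203.0.113.5]" > "$log"
grep -cE '^\[hit from [0-9a-fA-F.:]+\]$' "$log"   # prints 1 (one match)
```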
Bogdan Calin


  • Hi, this is a great idea, but what if scrapers are using dynamic IPs which may later be assigned to a normal user? That user won’t be able to view your page. OK, IPv4 has many addresses and IPv6 has many more, so the chance is very small, but the probability is not null.

    I’m using something like yours, a bit different. There are some hidden fields in my forms; scrapers fill them out because they are marked as necessary. If a form has these fields filled, I log the IP for banning. If the scraper does this too often, the IP is banned. ^^ But I also have no idea how to deal with the problem mentioned above.

    • Hi Viktor,

      The article addresses this by banning the IP address for only 5 hours. This is specified in the ‘bantime’ setting. This minimizes the risk of blocking legitimate users.

      Your idea is a very good alternative to what we have described. Thanks for sharing.

    • Yes, dynamic IPs are always a problem. It helps to block the IP for a short period of time (like 4-5 hours). This way, you still block the attack but cause less disruption for people with dynamic IPs.

      • True, dynamic IPs become a concern, but since it’s a temporary block, it’ll just frustrate and slow down the hacker. This is a great post, I must say. Will give it a try.

  • Hi, I wonder, is it possible to block all traffic from the attacker’s network? I mean, the attacker’s IP format is A.B.C.D (IPv4), and his/her network is A.B.C.0/24.

    Can we block all traffic from A.B.C.0/24 with Fail2ban?

    • I don’t think this is possible. You have no certain way to determine the subnet that the IP address belongs to. It could be /24, it could be /16, it could be /32. And even if you knew, it wouldn’t make much sense to block a whole /16 network.

  • Love this tactic and am going to employ it for logging reasons and out of curiosity on my sites. I wonder if you’d be able to automatically append to your .htaccess file using fwrite to include a Deny from IP address, etc.?

  • This is a great idea. I’m sure one could even make it work with hosted WAF solutions such as Incapsula, CloudFlare, Neustar Site Protect, etc. with a little bit of script tuning to call those web application firewall services’ public APIs.
