How to Block Automated Scanners from Scanning your Site

This blog post describes how to block automated scanners from scanning your website. The technique works against any web scanner that parses robots.txt, which all popular web scanners do.

Website owners use the robots.txt file to give instructions about their site to web robots, such as Google’s indexing bot. The /robots.txt file is a plain text file with one or more records, and it usually lists the directories and URLs that should not be crawled and indexed by search engines. Legitimate search engine robots follow the instructions in robots.txt; any scanning software that ignores them can be considered malicious. This blog post explains how the sources of such scan attempts can be logged and blocked. A typical robots.txt looks like this:

User-agent: *
Disallow: /cgi-bin/

In this example, search engines are instructed not to index the directory /cgi-bin/.

Web scanners parse robots.txt and crawl all the URLs listed in it, because administrators often use it to hide administrative interfaces and other high-value resources that are very interesting from a security point of view.

To set up our automated scanner honeypot, we start by adding an entry in robots.txt for a directory that contains no content used by the rest of the site.

User-agent: *
Disallow: /dontVisitMe/

Normal visitors will not know about this directory since it is not linked from anywhere on the site, and search engines will not visit it because robots.txt forbids it. Only web scanners or curious people will end up there.

Inside this directory we place an index page. This page contains a hidden form that is only visible to web scanners, since they ignore CSS stylesheets.

The HTML of this page is similar to the following.
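
Here is a minimal sketch of such a page; the field name and exact markup are illustrative, and the only important parts are the invisible div and the form action pointing to dontdoit.php.

<html>
<body>
<!-- The div is hidden via CSS, so a browser never renders the form -->
<div style="display: none;">
<form action="dontdoit.php" method="post">
<input type="text" name="address" value="">
</form>
</div>
</body>
</html>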

In the sample code, we placed a form inside a div that is set to be invisible. The form has a single input and points to a file named dontdoit.php.

When this page is visited by normal visitors with a browser, the CSS hides the div, so they see nothing but an empty page.

A web scanner, on the other hand, ignores the CSS and sees the raw HTML, including the form and its input field.

The web scanner will submit this form and start testing its input with various payloads, looking for vulnerabilities.

At this point, anyone requesting the file dontdoit.php is either a web scanner or a hacker. We have two options: log the hacking attempt, or automatically block the IP address making the request.

If we want to automatically block the IP address making the request, we can use a tool such as Fail2ban.

Fail2ban scans log files (e.g. /var/log/apache/error_log) and bans IPs that show malicious signs, such as too many password failures or requests that look for exploits.

We can log all requests to dontdoit.php into a dedicated log file and configure Fail2ban to parse this file and temporarily block, at the firewall, every IP address that appears in it.
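
As a sketch of the logging side, dontdoit.php could simply record the client address in the format the Fail2ban filter below expects. The timestamp format and log path are assumptions; the web server user must be able to write to the log file.

<?php
// dontdoit.php - honeypot target. Anyone requesting this file has ignored
// robots.txt, so record the client IP in a line the Fail2ban filter can match.
$ip   = $_SERVER['REMOTE_ADDR'];
$line = date('Y-m-d H:i:s') . ' [hit from ' . $ip . ']' . PHP_EOL;

// The log path must match the 'logpath' of the Fail2ban jail below and be
// writable by the web server user (e.g. www-data).
file_put_contents('/var/log/apache2/automated-scanners.log', $line, FILE_APPEND | LOCK_EX);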

Sample Fail2ban configuration:

/etc/fail2ban/jail.conf


[block-automated-scanners]
enabled = true
port = 80,443
filter = block-automated-scanners
# path to your logs
logpath = /var/log/apache2/automated-scanners.log
# max number of hits
maxretry = 1
# bantime in seconds - set negative to ban forever (18000 = 5 hours)
bantime = 18000
# action
action = iptables-multiport[name=HTTP, port="80,443", protocol=tcp]
         sendmail-whois[name=block-automated-scanners, dest=admin@site.com, sender=fail2ban@site.com]

/etc/fail2ban/filter.d/block-automated-scanners.conf


# Fail2Ban configuration file
#
# Author: Bogdan Calin
#
[Definition]

# Option: failregex
failregex = \[hit from <HOST>\]

# Option: ignoreregex
# Notes.: regex to ignore. If this regex matches, the line is ignored.
# Values: TEXT
#
ignoreregex =
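
Before enabling the jail, the filter can be checked against the honeypot log with Fail2ban’s fail2ban-regex tool, and Fail2ban must then be restarted so the new jail is picked up (the exact service command depends on your distribution):

fail2ban-regex /var/log/apache2/automated-scanners.log /etc/fail2ban/filter.d/block-automated-scanners.conf
sudo service fail2ban restart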

Comments

  • Hi, this is a great idea, but what if scrapers are using dynamic IPs that may later be assigned to a normal user? Those users would not be able to view your page. Granted, IPv4 has many addresses and IPv6 has far more, so the chance is tiny, but the probability is not zero.

    I’m using something similar, but a bit different: my forms contain hidden fields that scrapers fill out because they are marked as required. If a submitted form has these fields filled in, I log the IP for banning, and if a scraper does this too often, its IP gets banned. But I also have no idea how to deal with the issue mentioned above.

    • Hi Viktor,

      The article addresses this by banning the IP address for only 5 hours, as specified by the ‘bantime’ setting. This minimizes the risk of blocking legitimate users.

      Your idea is a very good alternative to what we have described. Thanks for sharing.

    • Yes, dynamic IPs are always a problem. It helps to block the IP for a short period of time (like 4-5 hours): you still block the attack but cause less disruption for people with dynamic IPs.
