Site Crawler

Introduction

The Site Crawler analyses a target website and builds the site structure from the information collected, including the site’s directories, files and other objects.

Screenshot – The crawler tool interface

The interface of the Site Crawler consists of:

  • Site structure window (left-hand side) – Displays target site information fetched by the crawler, e.g. cookies, robots.txt, files and directories.
  • Details window (right-hand side) – Displays general information about a file selected in the site structure window (e.g. filename, file path, etc.).
    A series of tabs at the bottom of the Details window displays further information about the selected object.

Starting a Website Crawl

  1. Select ‘Tools > Site Crawler’.
  2. Enter the URL of the target website (e.g. http://testphp.vulnweb.com/).
  3. If you want to use a recorded login sequence during the crawl, select it from the ‘Login Sequence’ drop-down menu.
  4. Click the Start button to begin the crawling process.
  5. If the website, or any part of it, requires HTTP authentication to be accessed, a pop-up window will automatically appear for you to enter the correct credentials, unless they were already configured in the HTTP Authentication settings node.

The site structure will be displayed on the left-hand side. For each directory found, a node will be created together with sub-nodes for each file. The Site Crawler will also create a Cookies node, which displays information about the cookies used.
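As a rough illustration of how a crawler can derive such a directory/file tree from the links it discovers, the Python sketch below follows same-host links and groups the discovered paths by directory. It is a minimal, hypothetical example only (the function name, page limit and use of the third-party requests and beautifulsoup4 packages are assumptions), not Acunetix’s implementation.

    from urllib.parse import urljoin, urlparse
    import requests
    from bs4 import BeautifulSoup

    def crawl(base_url, max_pages=50):
        """Build a {directory: set-of-files} map by following same-host links."""
        host = urlparse(base_url).netloc
        seen, queue, tree = set(), [base_url], {}
        while queue and len(seen) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = requests.get(url, timeout=10).text
            except requests.RequestException:
                continue
            for anchor in BeautifulSoup(html, "html.parser").find_all("a", href=True):
                link = urljoin(url, anchor["href"])
                parsed = urlparse(link)
                if parsed.netloc != host:
                    continue                      # stay on the target host
                directory, _, filename = parsed.path.rpartition("/")
                tree.setdefault(directory or "/", set()).add(filename or "(index)")
                queue.append(link)
        return tree

    for directory, files in sorted(crawl("http://testphp.vulnweb.com/").items()):
        print(directory, sorted(files))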

It is also possible to load the results of a previously saved crawl or save the results of a completed crawl.

Crawling Options

Crawler configuration settings can be modified by navigating to ‘Configuration > Scan Settings > Crawling Options’. The following Site Crawler options are available:

Screenshot - Crawling Options

  • Start HTTP Sniffer for manual crawling at the end of the scan process - This starts the HTTP Sniffer at the end of the crawl so that the user can manually browse parts of the site that were not discovered by the crawler. The Acunetix Web Vulnerability Scanner crawler is typically able to crawl the entire website, though there are some scenarios where it fails to do so automatically. The crawler will update the website structure with the newly discovered links and pages.
  • Get first URL only - Scans the index or first page of the target site only and does not crawl any links.
  • Do not fetch anything above start folder - By enabling this option the crawler will not traverse any links that point to a location above the base URL. E.g. if http://testphp.vulnweb.com/wvs/ is the base URL, the crawler will not follow links which point to a location above it, such as http://testphp.vulnweb.com.
  • Fetch files below base folder - By enabling this option the crawler will follow links that point to locations below the base folder. E.g. if http://testphp.vulnweb.com/ is the base URL, it will still traverse links which point to an object in a sub-directory below the base folder, such as http://testphp.vulnweb.com/wvs/. With this option disabled, the crawler will not crawl any objects in the base folder’s sub-directories. (A simple sketch of these scope checks follows this list.)
  • Fetch directory indexes even if not linked – When enabled the crawler will try to request the directory index for every discovered directory even if the directory index is not directly linked from another source.
  • Retrieve and process robots.txt, sitemap.xml - By enabling this option the crawler will search for a robots.txt or sitemap.xml file on the target website and follow all the links specified in them if either file is found.
  • Ignore CASE differences in paths - By enabling this option the crawler will ignore any case difference in the links found on the website. E.g. “/Admin” will be considered the same as “/admin”.
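To picture the scope checks behind ‘Do not fetch anything above start folder’ and ‘Fetch files below base folder’, the Python sketch below decides whether a discovered link falls within the configured start folder. The function name and exact logic are assumptions made for illustration; this is not the product’s implementation.

    from urllib.parse import urlparse

    def in_scope(link, base_url, fetch_below_base=True):
        """Return True if 'link' may be crawled given the base-folder options."""
        base, target = urlparse(base_url), urlparse(link)
        if target.netloc != base.netloc:
            return False                              # different host: out of scope
        base_path = base.path if base.path.endswith("/") else base.path + "/"
        if not target.path.startswith(base_path):
            return False                              # points above the start folder
        is_deeper = target.path.count("/") > base_path.count("/")
        return fetch_below_base or not is_deeper      # optionally skip deeper sub-folders

    print(in_scope("http://testphp.vulnweb.com/wvs/page.php", "http://testphp.vulnweb.com/wvs/"))  # True
    print(in_scope("http://testphp.vulnweb.com/index.php", "http://testphp.vulnweb.com/wvs/"))     # False: above the start folder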

Screenshot - Crawling Options

  • Enable CSA (analyze and execute JavaScript/AJAX) – The Client Script Analyzer (CSA) is enabled by default during crawling. This will execute JavaScript/AJAX code on the website to gather a more complete site structure.
  • Fetch external scripts – With this option enabled, the CSA engine will fetch all external resources linked through client scripts running on the target. The external resources will only be crawled and will not be scanned. If this option is not enabled and a client script uses external resources, the CSA engine will not be able to analyze the client script correctly, which might result in an incomplete crawl.
  • Fetch default index files (index.php, Default.asp …) - If this option is enabled, the crawler will try to fetch common default index filenames (such as index.php, Default.asp) for every folder, even if not directly linked.
  • Try to prevent infinite directory recursion – Certain websites are designed in a way which may cause the scanner to enter a loop when trying to fetch the same directory recursively (e.g. /images/images/images/images/…). This setting tries to prevent this situation by identifying repeated directory names in the path (a small sketch of the idea follows this list).
  • Warn user if URL rewrite is detected – Enable this option to be notified if URL rewrite is detected during the crawling stage of a scan.
  • Ignore parameters on file extensions like .js, .css etc – When enabled, Acunetix Web Vulnerability Scanner will not scan parameters on files which are not typically accessed directly by a user, such as .js and .css files.
  • Disable auto custom 404 detection – With this option enabled, Acunetix Web Vulnerability Scanner will not automatically detect 404 error pages, thereby requiring 404 recognition patterns to be configured manually. You can read more about Custom 404 Error Page rules from page  of this manual.
  • Consider www.domain.com and domain.com as the same host – If this option is enabled, Acunetix Web Vulnerability Scanner will scan both sites www.domain.com and domain.com and treat them as one instead of separate hosts.
  • Enable Input limitation heuristics – If this option is enabled and more than 20 identical input schemes are detected on files in the same directory, the crawler will only crawl the first 20 identical input schemes.
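The directory-recursion idea can be pictured with the small Python sketch below, which flags paths whose consecutive directory names repeat. The threshold and function are assumptions made purely for illustration, not the heuristic Acunetix actually uses.

    def looks_recursive(path, max_repeats=3):
        """Flag paths such as /images/images/images/... as likely recursion loops."""
        segments = [segment for segment in path.split("/") if segment]
        run, longest_run = 1, 1
        for previous, current in zip(segments, segments[1:]):
            run = run + 1 if current == previous else 1
            longest_run = max(longest_run, run)
        return longest_run >= max_repeats

    print(looks_recursive("/images/images/images/images/logo.gif"))  # True
    print(looks_recursive("/images/2014/images/logo.gif"))           # False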

Screenshot – Crawling Options

  • Maximum number of variations – In this option you can specify the maximum number of variations for a file. E.g. index.asp has a GET parameter ID for which the crawler discovered 10 possible values from links requesting the page. Each of these links is considered a variation, and each variation will appear under the file in the Scan Tree during crawling (a small sketch follows this list).
  • Link Depth Limitation – This option allows you to configure the maximum number of links to crawl from the root URL.  
  • Structure Depth Limitation – This option allows you to configure the maximum number of directories to crawl from the root URL.
  • Maximum number of sub-directories – This option allows you to configure the maximum number of sub directories Acunetix Web Vulnerability Scanner should crawl in a website.
  • Maximum number of files in a directory – In this option you can configure the maximum number of files the crawler will process within a single directory.
  • Maximum number of path schemes – In this option you can specify the maximum number of path schemes that should be detected by the crawler. You should only tweak this setting if you are crawling a very large website and notice that some path schemes are not being crawled.
  • Crawler file limit – This option allows you to configure the maximum number of files the crawler should crawl during a website crawl.
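To picture how the ‘Maximum number of variations’ limit works, the Python sketch below groups requests by file path and counts each distinct set of parameter values as one variation, refusing new ones once a cap is reached. The cap value and function name are assumptions for illustration only.

    from urllib.parse import urlparse, parse_qsl

    MAX_VARIATIONS = 10    # corresponds to the 'Maximum number of variations' setting
    variations = {}        # file path -> set of parameter value tuples seen so far

    def record_variation(url):
        """Return True if this URL is an allowed variation of its file."""
        parsed = urlparse(url)
        values = tuple(sorted(parse_qsl(parsed.query)))
        seen = variations.setdefault(parsed.path, set())
        if values in seen:
            return True                 # variation already known
        if len(seen) >= MAX_VARIATIONS:
            return False                # limit reached; ignore further variations
        seen.add(values)
        return True

    print(record_variation("/index.asp?id=1"))   # True: first variation of /index.asp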

Acunetix DeepScan

Most websites make use of JavaScript, or are implemented entirely in it; these include websites that use AJAX and Single Page Applications (SPAs). Acunetix Web Vulnerability Scanner uses DeepScan technology, which allows the Crawler to crawl any type of website effectively, including ones which make heavy use of JavaScript.

File Extension Filters

Screenshot - Crawling Options - File Extension Filters

It is possible to configure a list of file extensions to be included in or excluded from a crawl (a short sketch follows the note below). This is done by adding the extensions to one of the following lists:

  • Include List - Process all files fitting the wildcard specified.
  • Exclude List - Ignore all files fitting the wildcard specified.

Note: Binary files such as images, movies and archives are excluded by default to avoid unnecessary traffic.
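A minimal Python sketch of how such include/exclude wildcard lists could be applied to a filename is shown below. The patterns are examples only and the evaluation order is an assumption made for illustration, not the scanner’s code.

    import fnmatch

    include = ["*.php", "*.asp", "*.aspx"]
    exclude = ["*.jpg", "*.png", "*.zip"]    # binary files are typically excluded

    def should_crawl(filename):
        """Exclude list wins; otherwise the file must match an include pattern."""
        if any(fnmatch.fnmatch(filename, pattern) for pattern in exclude):
            return False
        return any(fnmatch.fnmatch(filename, pattern) for pattern in include)

    print(should_crawl("login.php"))    # True
    print(should_crawl("banner.jpg"))   # False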


Directory and File Filters

This node enables you to specify a list of directories or filenames to be excluded from a crawl (a short sketch follows the note below). Filters can be configured using directory or file names, as well as wildcards to match multiple directories or files with the same filter. Regular expressions can also be used to match a number of directories or files. If a regular expression is specified as a filter, toggle the value under the ‘Regex’ column to Yes by clicking on it.

Screenshot – Directory and File Filter rules

To add a directory or file rule:

  1. Click the Add URL button and specify the address of the website where the directory or file is located.
  2. Click the Add Filter button and specify the directory or filename, a wildcard, or a regular expression. When specifying a directory, do not add a slash ‘/’ in front of the directory name; a trailing slash is automatically added to the end of the website URL.

Note: Directory and file filters specified for the root or any other directory of a website are not inherited by their sub-directories; therefore, filters must be specified separately for sub-directories, as shown in the screenshot above.
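The difference between a plain wildcard filter and a regular-expression filter (the ‘Regex’ column set to Yes) can be pictured with the Python sketch below. The helper function and example entries are assumptions for illustration, not how the scanner evaluates its filters internally.

    import fnmatch
    import re

    def filter_matches(entry, is_regex, name):
        """Evaluate one exclusion entry against a directory or file name."""
        if is_regex:
            return re.search(entry, name) is not None   # 'Regex' column set to Yes
        return fnmatch.fnmatch(name, entry)             # plain name or wildcard

    print(filter_matches("admin*", False, "admin_backup"))          # True (wildcard)
    print(filter_matches(r"^backup_\d{4}$", True, "backup_2014"))   # True (regular expression)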

URL Rewrite rules

Screenshot - URL Rewrite Configuration

Many web applications – such as shopping carts and off-the-shelf applications like WordPress and Joomla – use URL rewrite rules. Acunetix needs to understand these rewrite rules in order to navigate the website structure, identify the actual files behind the rewritten URLs, and avoid crawling non-existent objects.

Screenshot - URL Rewrite Rule

Adding a URL rewrite rule manually:

  1. Navigate to the ‘Configuration > Scan Settings > Crawling Options > URL rewrite’ node.
  2. Click the Add Ruleset button to open the URL rewrite editor window and enter the hostname of the target website for which the rule will be used. Click the Add Rule button to open the Add Rule dialogue.
  3. Specify whether the rule-set is generic for the whole website by ticking General rule. If it applies to a specific directory only, tick Directory rule and specify the directory name.
  4. In the Regular Expression input field, specify the part of the URL, expressed as a regular expression (or a group of regular expressions), which Acunetix Web Vulnerability Scanner should use to recognize a rewritten URL. E.g. “Details/.*/(\d+)” matches anything after the Details/ directory followed by a final path segment made up of digits, which is captured as a group.
  5. In the Replace with input field, specify the URL Acunetix Web Vulnerability Scanner should request instead of the rewritten URL, e.g. /Mod_Rewrite_Shop/details.php?id=$1. The $1 will be replaced with the value captured by the first regular expression group specified in the Regular Expression input field, in this case (\d+). For example, if Acunetix finds the URL /Mod_Rewrite_Shop/Details/network-storage-d-link-dns-313-enclosure-1-x-sata/1, it will request /Mod_Rewrite_Shop/details.php?id=1 instead. (A small sketch of this substitution follows these steps.)
  6. Tick the Last rule option to indicate that no more rules should be executed after this one.
  7. Tick Case insensitive if the URLs are not case sensitive.
  8. Tick Match on the full URI option so that the regular expression is executed on the whole URI with the query, instead of the path only.
  9. Tick IIS URL rewrite rule if the target website is using Microsoft Windows IIS URL rewrite rules (http://www.iis.net/download/urlrewrite).
  10. To test the URL rewrite rule, enter a URL and click Test Rule.
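The substitution described in steps 4 and 5 can be pictured with the Python sketch below, which applies the example regular expression and replacement from this section. The helper function is an assumption made for illustration and is not Acunetix’s rewrite engine.

    import re

    pattern = re.compile(r"Details/.*/(\d+)")            # 'Regular Expression' field
    template = r"/Mod_Rewrite_Shop/details.php?id=\1"    # 'Replace with' field ($1 becomes \1 in Python)

    def apply_rewrite_rule(url):
        """Return the URL the scanner should request instead of the rewritten one."""
        match = pattern.search(url)
        return match.expand(template) if match else url

    rewritten = "/Mod_Rewrite_Shop/Details/network-storage-d-link-dns-313-enclosure-1-x-sata/1"
    print(apply_rewrite_rule(rewritten))   # /Mod_Rewrite_Shop/details.php?id=1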

Importing a URL Rewrite rule configuration from an Apache web server

To import the rewrite rule logic for Apache web servers:

  1. To open the Import Rewrite rules wizard, click Add Ruleset and then click Import rule. In the Filename field, enter the path of the Apache httpd.conf or .htaccess file (the file which contains the URL rewrite rules).
  2. Select the type of configuration to import (httpd.conf or .htaccess). If .htaccess is used, it is important to specify the hostname of the website (e.g. www.acunetix.com) and webserver directory (e.g. sales) on which the URL rewrite configuration is set.

Importing a URL Rewrite rule configuration from an IIS web server

If using Microsoft IIS as your web server, you can automatically import the rewrite rule logic:

  1. To open the Import Rewrite rules wizard, click Add Ruleset and then click Import rule. In the Filename field, enter the path of the web application web.config file that contains the URL rewrite rules.
  2. Select the ‘IIS URL Rewrite’ (web.config) node and specify the hostname of the website (e.g. www.acunetix.com) and web server directory (e.g. sales) on which the URL rewrite configuration is set.

Note: Every Scan Settings template can have different crawler settings. Refer to page  of this user manual to read more on how to modify or create new Scan Settings templates.

Custom Cookies

You can create a custom cookie, which can be used during a website crawl to emulate a user or to automatically log in to a section of the website (without requiring the Login Sequence Recorder).

Screenshot - Custom Cookies

To add a custom cookie:

  1. Navigate to the Configuration > Scan Settings > Custom cookies node.
  2. Click the Add Cookie button to add a new blank cookie to the list.
  3. Enter the URL of the site for which the cookie will be used in the left-hand URL column.
  4. Enter the custom string that will be sent with the cookie, e.g. if the cookie name is ‘Cookie_Name’ and its content is ‘XYZ’, enter ‘Cookie_Name=XYZ’ (see the sketch after the note below).
  5. Click Apply to save the changes.

Tick the option “Lock custom cookies during scanning and crawling” so that the custom cookies are never overwritten by new cookies sent from the website during a crawl or scan.
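As an illustration of what such a custom cookie looks like on the wire, the Python sketch below issues a request carrying the example cookie string ‘Cookie_Name=XYZ’. It uses the third-party requests package and is an assumption for illustration only, not what the scanner itself does internally.

    import requests

    url = "http://testphp.vulnweb.com/"          # site entered in the URL column
    headers = {"Cookie": "Cookie_Name=XYZ"}      # name=value exactly as entered in the cookie string

    response = requests.get(url, headers=headers)
    print(response.status_code)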

Configuring Input Fields to Traverse Web Form Pages

Many websites include web forms that capture visitor data, like download forms. Acunetix Web Vulnerability Scanner can be configured to automatically submit random data or specific values to web forms during the crawl and scan stages of a security audit.

Note: By default Acunetix Web Vulnerability Scanner uses a generic submit rule that will submit generic and random values to any kind of web form encountered during a crawl or scan.

Screenshot - Input Fields

To specify a list of predefined values that must be automatically entered on a web form or web service:

  1. Navigate to the Configuration > Scan Settings > Input Fields node.
  2. Enter the URL of the webpage or web service containing the specific form or list of operations to which pre-defined values must be passed, and click the Parse from URL button.
  3. The resulting list will then be automatically populated with the form fields found at the given URL.
  4. Enter the values for the required fields by double-clicking the respective value column. Click Apply to save the changes.
  5. Input fields also support wildcards to match a broad range of data (a short sketch follows this list). Below you can find a number of examples:
  • *cus* is used to match any number of characters before and after the pattern ‘cus’
  • *cus is used to match any number of characters before the pattern ‘cus’
  • cus* is used to match any number of characters after the pattern ‘cus’
  • ?cus is used to match a single character before the pattern ‘cus’
  • c?us is used to match a single character as the second character in the pattern specified
  6. Alternatively, you can configure Acunetix Web Vulnerability Scanner to automatically randomize the values for each input field by entering one of the variable names below in the parameter’s value field:
  • ${alpharand} – Automatically submit random alphabetical characters (a–z)
  • ${numrand} – Automatically submit random numeric characters (0–9)
  • ${alphanumrand} – Automatically submit random alphabetical and numeric characters (a–z, 0–9)
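The wildcard semantics and randomization variables listed above can be pictured with the Python sketch below. The field name, patterns and generated lengths are arbitrary examples; the code is an illustration only and does not reflect how the scanner generates its values.

    import fnmatch
    import random
    import string

    field_name = "customer_email"
    for pattern in ["*cus*", "*cus", "cus*", "?cus", "c?us"]:
        print(pattern, fnmatch.fnmatch(field_name, pattern))   # only '*cus*' and 'cus*' match this field name

    # Rough equivalents of ${alpharand}, ${numrand} and ${alphanumrand}:
    print("".join(random.choices(string.ascii_lowercase, k=8)))                  # ${alpharand}
    print("".join(random.choices(string.digits, k=8)))                           # ${numrand}
    print("".join(random.choices(string.ascii_lowercase + string.digits, k=8)))  # ${alphanumrand}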

You can also change the priority of a specific input field by highlighting it, and then using the Up and Down arrows to give it higher or lower priority respectively.

Note: If a unique set of data must be submitted to different forms, then a new rule-set must be created for each form respectively.
