David Jones

The anatomy of a robots.txt file

The purpose of the robots.txt file is to give instructions to web robots. This is useful because it lets a developer block web robots entirely or restrict which parts of the site they may access.

This is all well and good if a robot sees and abides by your robots.txt file, but any robot is free to ignore it completely. This is usually the case with robots designed to scrape content from your site or scan it for security vulnerabilities.

The location of your robots.txt file is important, as a robot will typically only look in one place: your website's root directory. For example, if you have a web page with the URL http://www.example.co.uk/hello/world, the path component (/hello/world) is ignored and the robot will look for the file at http://www.example.co.uk/robots.txt.

I mentioned earlier that a robot could be used to scan sites for security vulnerabilities, but this doesn't mean the robots.txt file should be used as a security measure. Anybody can view this file, so listing sensitive paths in it actually advertises their existence. As a demonstration, why not view the robots.txt file for this website?
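For instance, a file like this (the paths are hypothetical) keeps well-behaved robots out, but it also tells a curious human or a malicious robot exactly where to look:

User-agent: *
Disallow: /admin/
Disallow: /backup/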

Now let's look at the contents of a robots.txt file in more detail.

User-agent: *
Disallow: /

This configuration blocks every robot that honours the file from accessing the entire website. This would be useful if you were running an internal intranet system that you did not want crawled by anything. If you wanted to allow every robot to access every area of your site you could omit the robots.txt file altogether, but I would not recommend that: well-behaved robots will still request the file, and every miss will show up as a 404 in your logs.
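A better approach is to serve a file that allows everything explicitly. An empty Disallow value means nothing is disallowed:

User-agent: *
Disallow: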

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

In the example above you can see that we have defined User-agent twice. Each user agent can be given its own block of rules. Here we are saying that a robot identifying itself as Googlebot can access everything, because its empty Disallow value disallows nothing, while every other robot is disallowed from everything. A robot obeys the block that matches its own user agent and falls back to the * block if none does.
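Blocks are not limited to all-or-nothing rules either. The Disallow value is a path prefix, so you can close off individual areas while leaving the rest of the site open. The paths here are purely illustrative:

User-agent: *
Disallow: /search
Disallow: /tmp/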

You can also list multiple user agents in one block. Matching against robot names is case-insensitive, so googlebot matches Googlebot. We can modify the previous example to look like this.

User-agent: googlebot
User-agent: YahooSeeker
Disallow:

User-agent: *
Disallow: /

Lists of known user agents, with more information on each one, are available online.

The last thing to mention is subdomains. A robots.txt file only applies to the host it is served from. In our previous example we learned that the robots.txt file for http://www.example.co.uk would be located at http://www.example.co.uk/robots.txt. This means that if we had a subdomain such as http://domain.example.co.uk, our robots.txt would not be picked up there; the subdomain needs its own file at its own root. The same goes for different protocols and port numbers, which also count as separate hosts.
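To make that concrete, each of the following is a separate host and would need its own robots.txt file:

http://www.example.co.uk/robots.txt
http://domain.example.co.uk/robots.txt
https://www.example.co.uk/robots.txt
http://www.example.co.uk:8080/robots.txt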

For more information you can read the specification.