A crawler, robot (or “bot”), spider or wanderer is a computer program that searches the Internet and records information about web pages. For example, a “spam bot” will search the pages in web sites and record the e-mail addresses linked in home pages.
Although you can’t keep all robots away, you can give instructions to those that follow the “Robots Exclusion Protocol”. This method requires that you create a plain text file called robots.txt and place it in your site’s root directory. For example, if your web site URL is http://www.mycompanydomain.com/ then the file must be accessible at http://www.mycompanydomain.com/robots.txt for robots to find it. Note that the file must be in your root directory (i.e. ~/public_html/) and no other.
The contents of robots.txt consist mainly of two commands: “User-agent” and “Disallow”.
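As a rough illustration of the location rule above, the following Python sketch derives the robots.txt address from any URL on a site. The function name robots_txt_url is made up for this example, and the page URL in the usage line is an assumption, not a real page.

```python
from urllib.parse import urljoin


def robots_txt_url(site_url: str) -> str:
    """Return the address where robots.txt is expected for this site.

    Hypothetical helper for illustration only.
    """
    # The leading slash makes urljoin discard any path in site_url,
    # so the result always points at the root of the host.
    return urljoin(site_url, "/robots.txt")


# The deeper page path is an assumed example.
root_case = robots_txt_url("http://www.mycompanydomain.com/")
deep_case = robots_txt_url("http://www.mycompanydomain.com/products/index.html")
print(root_case)
print(deep_case)
```

Both calls should give the same address, since the file is looked up only at the root of the site and never in a subdirectory.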
The “User-agent” command allows you to set restrictions on a robot with a particular name (or signature). You can set this to the asterisk (*) to specify that restrictions apply to all robots that aren’t identified elsewhere in the file.
The “Disallow” command allows you to deny access to certain directories in your web site.
Say your URL is http://www.mysite.com and you put a robots.txt file containing the following into your ~/public_html/ directory:
Figure 1: sample robots.txt file
User-agent: *
Disallow: /neat_stuff/
Disallow: /my_pvt_stuff/

User-agent: WebCrawler
Disallow: /
In this example, every robot other than WebCrawler has free access to the web site except for files contained in the /neat_stuff/ and /my_pvt_stuff/ directories, while the WebCrawler robot is denied all access to the site.
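One way to check how a well-behaved robot would read these rules is Python’s standard urllib.robotparser module. This is a minimal sketch: the robot name “SomeBot” and the page paths are assumptions made up for the example, and the rules are parsed from a string rather than fetched from a live site.

```python
from urllib.robotparser import RobotFileParser

# The rules from Figure 1, with a blank line separating the two records.
RULES = """\
User-agent: *
Disallow: /neat_stuff/
Disallow: /my_pvt_stuff/

User-agent: WebCrawler
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# "SomeBot" is a hypothetical robot that is not named in the file,
# so the rules under "User-agent: *" apply to it.
home_allowed = parser.can_fetch("SomeBot", "http://www.mysite.com/index.html")
neat_allowed = parser.can_fetch("SomeBot", "http://www.mysite.com/neat_stuff/page.html")

# WebCrawler has its own record, which disallows the whole site.
crawler_allowed = parser.can_fetch("WebCrawler", "http://www.mysite.com/index.html")

print(home_allowed, neat_allowed, crawler_allowed)
```

This only shows how a compliant robot is expected to interpret the file. It does not prevent a robot that ignores the protocol from requesting the pages.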
NOTE: Since not all robot authors acknowledge the “Robots Exclusion Protocol”, it is not possible to stop all robots using this method. However, most search engine robots do follow this protocol. Please refer to documentation on their sites for more information on this.
Using the META tag method, you can specify restrictions in your web pages individually. Unfortunately, this method is less widely recognized by robots than the robots.txt method. The content of the META tag takes two parameters: INDEX or NOINDEX, and FOLLOW or NOFOLLOW. See the following examples:
Figure 2: sample of “ROBOTS” META tag
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
In this example, the robot is instructed neither to index the current page nor to follow links in the page for indexing.
Figure 3: sample of “ROBOTS” META tag
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
In this example, the robot is instructed not to index the current page, but it is allowed to follow links in the page for indexing. The structure of the META tag should be clear by now.
NOTE: All META tags should be specified within the <HEAD> block of your HTML document.
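To see how a robot might read these tags, here is a small Python sketch built on the standard html.parser module. It assumes the conventional form of the tag, a META element with NAME="ROBOTS" and a comma-separated CONTENT value. The class name, the function name and the sample page are all invented for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from a ROBOTS META tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().upper())


def robots_directives(html: str) -> set:
    """Return the set of ROBOTS directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Assumed sample page: do not index it, but do follow its links.
page = (
    '<HTML><HEAD><META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"></HEAD>'
    "<BODY>Hello</BODY></HTML>"
)
directives = robots_directives(page)
may_index = "NOINDEX" not in directives
may_follow = "NOFOLLOW" not in directives
print(sorted(directives), may_index, may_follow)
```

The sketch treats a missing tag as permission to index and follow. That default is a design choice made for this example, and an individual robot may behave differently.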
- For best results, you should use the robots.txt method if possible. Simply FTP the file into your ~/public_html/ directory.
- The META tag method can be used if you cannot use the robots.txt method.
- Both methods can be used together.
- If the privacy of your web pages is essential, password-protect them using an .htaccess file (shell access is needed for this). Robots can’t enter a password-protected page without the password.