View Full Version : The robots.txt file

09-08-2010, 01:58 PM
In earlier posts we have looked at length at how to give your website the best possible exposure on Google as well as on the other search engines, and at the best ways to SEO (Search Engine Optimise) your site. It has taken a great deal of work on your part to make sure that your website is accessible to Google and its Googlebot: plenty of keywords, plenty of quality links and a sitemap for it to follow. Today, however, we are not making your website more accessible to the Googlebot and the other search engine spiders. Quite the opposite...

Today we will be discussing the unthinkable: how to keep search engine spiders off your website, or restrict them so they can only look at (or index) parts of your website. It may feel strange, having done so much SEO work, to hide it or parts of it. In this article we will be looking at the anti-sitemap: the robots.txt file (or “Robot Exclusion Standard / Robots Exclusion Protocol (http://en.wikipedia.org/wiki/Robot.txt)” if you are a fan of particularly long phrases...).

The robots.txt file is the opposite of your sitemap and exists to stop cooperating web spiders visiting all or part of your website (it tells them where they cannot go). It was started in the summer of 1994 by agreement among the members of the robots mailing list because, quite simply, it seemed like a good idea. It was popularised by AltaVista, then the other big search engines caught on in the following years and started using the robots.txt standard too.

While it may seem that we are actually hurting ourselves by not letting web crawlers/spiders/robots look at our website in its entirety, this is not the case. There may be pages on your website that, while essential, do not actually help its SEO. It might be a sales page that does not contain any of your keywords (maybe only: “Click Here To Confirm” or “Enter Your Credit Card Details”), and letting a robot index pages like that can dilute your ranking on Google (more content, but no additional keywords).

The information that you should be restricting using the robots.txt file is information that does not help in any way towards the SEO of your website, but we’ll discuss that again later.

So, let’s create a robots.txt file for your website...
It’s a simple plain text file (.txt), so we can create one using the most basic tools on your home computer. You should note that each domain should have its own robots.txt file, and that includes sub-domains. Separate robots.txt files should be created for “yourwebsite.com”, “about.yourwebsite.com” and “waffles.yourwebsite.com”.

1) Open up a text editor...
For example: Notepad in Windows; TextEdit in Mac OS X

2) Start writing your robots.txt file...
Writing your robots.txt file is very straightforward. The first thing you do is specify which web crawler/spider/robot the text applies to. This is done using the “User-agent” statement. A “*” is a wildcard and means EVERYBODY (all cooperating web crawlers/spiders/robots). You then make a “Disallow” statement telling the web crawler/spider/robot where it is not allowed to go.

As a result, the most simple form of the robots.txt file is as follows:

User-agent: *
Disallow: /

The above robots.txt file entry tells ALL cooperating web crawlers, spiders and robots to avoid ALL of your website. Obviously this is something you are never going to do... You can also do the exact opposite. The below robots.txt entry (an empty Disallow statement) allows ALL cooperating web crawlers/spiders/robots to visit ALL of your website.

User-agent: *
Disallow:
Using the robots.txt file you can keep cooperating robots away from specific files too, as in the below example:

User-agent: *
Disallow: /directory/file.html

Using the robots.txt file you can tell cooperating web crawlers/spiders/robots to stay away from one or several directories...

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
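A cooperating crawler reads these rules before it fetches anything. If you want to sanity-check a robots.txt like the one above before uploading it, Python’s standard library ships a parser (urllib.robotparser); here is a quick sketch, with example.com standing in as a placeholder domain:

```python
# Check a draft robots.txt locally with Python's standard-library parser.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /images/",
    "Disallow: /tmp/",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

# Disallowed directories are refused; everything else is allowed.
print(rp.can_fetch("*", "http://example.com/tmp/cache.html"))  # False
print(rp.can_fetch("*", "http://example.com/index.html"))      # True
```

can_fetch() answers exactly the question a polite crawler asks before requesting a URL.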

3) In this way, you can write more specific robots.txt documents...

In the below example, I want to keep the Googlebot out of my /images/ directory, but I also want to keep Yahoo!’s bot (Slurp) out of the /videos/ directory. In addition I want to keep ALL cooperating bots out of my /cgi/ and /tmp/ directories. As a final stipulation, I also want VodaBot (okay, I made this one up) to stay away from an image file called pointless.jpg which is in my /images/ directory.

User-agent: Googlebot
Disallow: /images/

User-agent: Slurp
Disallow: /videos/

User-agent: *
Disallow: /cgi/
Disallow: /tmp/

User-agent: VodaBot
Disallow: /images/pointless.jpg

Finally, you will note that while the fictitious VodaBot cannot access the file pointless.jpg it can access the rest of my /images/ directory ... but what if I wanted it the other way round? What if I wanted the excellently named VodaBot to NOT be able to access anything in the /images/ directory EXCEPT an image file called “meaning-of-life.jpg”? Then I would use an Allow statement in my robots.txt file.

User-agent: VodaBot
Disallow: /images/
Allow: /images/meaning-of-life.jpg

Note that crawlers differ in how they handle Allow statements: the original standard did not include Allow at all, some parsers apply the first rule that matches, and Googlebot applies the most specific (longest) matching rule. Keep each Allow line next to the Disallow it refines, and test the file before relying on it.

You should also be careful when using “/”, as depending on how you use it, it can mean different things. Disallow rules are prefix matches: “/images/” matches only paths inside that directory, while “/images” (without the trailing “/”) matches ANY path beginning with “/images” — including the /images/ directory itself, but also files such as /images.html. In other words, “Disallow: /images” blocks the /images/ directory and more besides, so use the trailing slash when you mean just the directory.
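The trailing-slash distinction is easy to check locally. Robots.txt Disallow rules are prefix matches, which Python’s standard-library urllib.robotparser reproduces (example.com is a placeholder):

```python
# Compare "Disallow: /images" (no trailing slash) with "Disallow: /images/".
from urllib.robotparser import RobotFileParser

no_slash = RobotFileParser()
no_slash.parse(["User-agent: *", "Disallow: /images"])

with_slash = RobotFileParser()
with_slash.parse(["User-agent: *", "Disallow: /images/"])

# "/images" is a bare prefix: it blocks the directory AND /images.html.
print(no_slash.can_fetch("*", "http://example.com/images/a.jpg"))    # False
print(no_slash.can_fetch("*", "http://example.com/images.html"))     # False

# "/images/" only blocks paths inside that directory.
print(with_slash.can_fetch("*", "http://example.com/images/a.jpg"))  # False
print(with_slash.can_fetch("*", "http://example.com/images.html"))   # True
```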

Have a look at Wikipedia’s robots.txt file (http://en.wikipedia.org/robots.txt) as an example. It uses comments (the # symbol) to explain how the file works. This is a great resource if you’re writing your first robots.txt file.

4) Save and upload...
Save your document in plain text format as robots.txt, making sure that the extension of the text document is .txt. The file can then be uploaded straight to the root (home) directory of the website it applies to.

The robots.txt file is a double-edged sword, however. You will notice that I keep referring to “cooperating” spiders. Many people assume that the robots.txt file can be used to hide parts of their website from the search engines. I cannot stress enough how wrong this is.

There is no official standards body for the robots.txt protocol, and there are a great many search engines out there on the Internet, each with its own crawler/spider/robot... These must be programmed to follow the instructions laid out in your robots.txt document. Imagine if a crawler or spider were programmed to visit ONLY the links that the robots.txt told it not to visit! There is nothing to stop it doing this, and in fact there are more and more instances of bots being programmed to seek out such 'hidden' pages as hackers look for a backdoor way to compromise websites...

Any parts of your website that you do not want to be visible to anybody should:
(a) Not be uploaded to your website at all
(b) Be password protected

Of these two options, (a) is by far the most effective.

In general, the robots.txt file is not there for security in any way. It is there to improve the Search Engine Optimisation of your site, making sure all the hard work you have done SEOing your website is used in the best possible way. It is there to stop the Googlebot finding things that would hurt the SEO of your website or that are pointless as far as its theme or content goes.

Suggested further reading:

How to make the googlebot love ya! (http://www.vodahost.com/vodatalk/google/58699-googlebot-love-ya.html)

Google Webmaster Tools 101 (http://www.vodahost.com/vodatalk/google/57792-google-webmaster-tools-101-a.html)

09-08-2010, 02:52 PM
Excellent ... though for maximum benefit it should be updated with the sitemap.xml auto-discovery line below.

A complete robots.txt example for XML sitemap auto-discovery (with no 'disallow' parameters) just adds the "Sitemap" line as shown:

User-agent: *
Disallow:
Sitemap: http://www.yoursitename.com/sitemap.xml

If you have created a sitemap index file (where you mirror your 'do not follow' choices by manually deleting the page/item entries that were auto-generated by the sitemap generator), you can reference that instead by using these lines:

User-agent: *
Disallow: (enter specific files/pages not to be read)
Sitemap: http://www.yoursitename.com/sitemap-index.xml

Basically, before you upload your sitemap.xml file, delete the coding that maps the pages you do not want spidered. The sitemap then "mirrors" the 'disallow' instructions in your robots.txt file by simple omission. Be sure to alter the robots.txt file as shown above to include the sitemap "auto-discovery" line so it becomes a 'Rule'!
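For what it's worth, the Sitemap line really is machine-discoverable. Python's urllib.robotparser (3.8+) collects any Sitemap entries it finds, so you can confirm the auto-discovery line is picked up (yoursitename.com is the same placeholder as above):

```python
# Confirm that a Sitemap line in robots.txt is discoverable by a parser.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Sitemap: http://www.yoursitename.com/sitemap.xml",
])

# site_maps() returns the list of sitemap URLs found in the file.
print(rp.site_maps())  # ['http://www.yoursitename.com/sitemap.xml']
```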

09-09-2010, 03:15 AM
This is my first post here and I'm not that html brained, but I did understand the above post on the sitemap.
I have a google sitemap installed on my website, but it won't allow Googlebot-Images access to the images.

This is what I have in the robots.txt file for crawler access:
User-agent: *

Disallow: /cgi-bin
Disallow: /admin
Disallow: /account.php
Disallow: /advanced_search.php
Disallow: /checkout_shipping.php
Disallow: /create_account.php
Disallow: /login.php
Disallow: /password_forgotten.php
Disallow: /shopping_cart.php
Disallow: /_vti_bin
Disallow: /_vti_cnf
Disallow: /_vti_log
Disallow: /_vti_pvt
Disallow: /_vti_txt

User-agent: Googlebot-Image

Disallow: /

Should I take out the "Disallow: /" or put an "Allow: /images" under the Disallow?

I will be thankful for any replies.


09-18-2010, 05:29 PM
Vasili, I am having a problem reaching either of the two links in your post. I am using Firefox/3.6.9. Please advise if these are available elsewhere.

09-26-2010, 12:23 AM
You cannot have conflicting instructions between the files: the robots.txt file needs to clearly state any disallow, and in this case you must specifically 'rule' that your images are disallowed from being cached.
Also, after auto-generating your sitemap.xml (I prefer not to use Google's version, as it is geared to the advantage of their overall scheme rather than being purely W3C compliant), you must carefully delete the code "mention" of your image file/page, leaving no gap or spacing in the code and no mention of the file/page in existence. The robots.txt file creates a Rule based on a single stated disallow, and the sitemap then contains no "affirmation" of the resource (no mention of the file or page, since it was deleted from the XML sitemap, see?).
The above was in keeping with the context of the earlier article discussing "hiding" page views, but to answer your question directly: yes, in your case you would create a specific 'Agent' mention and a proper 'Allow' Rule, as you show in your post.
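If it helps to see it concretely, here is a quick local check using Python's standard-library robots.txt parser, with an Allow rule for Googlebot-Image alongside a blanket Disallow. Note that this parser applies the first matching rule, so the Allow line comes first; Googlebot itself uses the most specific match, which gives the same answer here (example.com is a placeholder):

```python
# Let Googlebot-Image into /images/ while keeping it out of everything else.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: Googlebot-Image",
    "Allow: /images/",
    "Disallow: /",
])

print(rp.can_fetch("Googlebot-Image", "http://example.com/images/pic.jpg"))   # True
print(rp.can_fetch("Googlebot-Image", "http://example.com/videos/clip.mp4"))  # False
```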

The links above were SAMPLES (note the word "YourSiteName" in them?)
Replace "yoursitename" with your domain name ....

You can generate a compliant robots.txt and a sitemap.xml both at this site (http://www.xml-sitemaps.com).

06-28-2011, 12:02 PM
i'm adding it now as we speak hahahahahahahaha!!

07-03-2011, 05:38 PM

I'm a total beginner at creating websites, so forgive the dumb questions, please.

Where should the "robots.txt" information (Disallow, Allow) be placed?
Somewhere in the html below and on every page in the website? (I have about 35 pages):
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<HTML>
<HEAD>
<TITLE>xxxxx</TITLE>
<META HTTP-EQUIV="Pragma" CONTENT="no-cache">
<META Name="Keywords" Content="xxxxx">
<META Name="Description" Content="xxxxx">
<META NAME="ROBOTS" CONTENT="ALL">
<META NAME="revisit-after" CONTENT="10 days">
<META NAME="author" content="xxxxx">
<META NAME="copyright" content="Copyright 1980-2007 by xxxxx. All Rights Reserved.">
<META NAME="resource-type" content="document">
<META NAME="distribution" content="global">


Also, if there is another better way to create the above, I would really
appreciate knowing that.

Thank you so much!


07-03-2011, 06:50 PM
Where should the "robots.txt" information (Disallow, Allow) be placed?
Somewhere in the html below and on every page in the website? (I have about 35 pages)
Also, if there is another better way to create the above, I would really appreciate knowing that. Thank you so much!

L.N. As I explained above, the robots.txt is created and saved as a file to be uploaded to your public_html folder (root, or home directory); it is not part of any web page and is not meant to be pasted into your HTML as coding whatsoever!

The robots.txt, sitemap.xml, and the sitemap.html files can all be auto-created for you without errors in the format required at the site Vasili suggested: www.xml-sitemaps.com (http://www.xml-sitemaps.com)
Don't forget that the robots.txt and sitemap.xml files should mirror each other's rules. You can edit any of these files using Notepad, as I mentioned, being sure to save them in the same format with the proper extensions.

07-06-2011, 09:51 PM
I don't have any pages I wanted to 'disallow', but I did notice an increase in organic traffic when I uploaded a blank robots.txt file to the folder.