
Thread: The robots.txt file
      
   

  1. #1
    Join Date
    Mar 2005
    Location
    Wilmington, Delaware USA
    Posts
    11,767

    The robots.txt file

    In earlier posts we have looked at length at how to give your website the best possible exposure on Google and the other search engines, and at the best ways to SEO (Search Engine Optimise) your site. It has been a great deal of work on your part to make sure that your website is accessible to Google and its Googlebot, that there are plenty of keywords, plenty of quality links and a sitemap for it to follow. Today, however, we are not making your website more accessible to the Googlebot and the other search engine spiders. Quite the opposite...

    Today we will be discussing the unthinkable: how to keep search engine spiders off your website, or restrict them so they can only look at (that is, index) parts of it. It may feel strange to have done so much SEO work only to hide some or all of it. In this article we will be looking at the anti-sitemap: the robots.txt file (or “Robots Exclusion Standard / Robots Exclusion Protocol” if you are a fan of particularly long phrases...).

    GOOD BOTS
    The robots.txt file is the opposite of your sitemap: it exists to tell cooperating web spiders where they cannot go, so that they stay away from all or part of your website. It was started in the summer of 1994 by agreement of the members of the robots mailing list because, quite simply, it seemed like a good idea. It was made more popular by Alta Vista, then the other big search engines caught on in the following years and started using the robots.txt standard too.

    While it may seem that we are hurting ourselves by not letting web crawlers/ spiders/ robots look at our website in its entirety, this is not the case. There may be pages on your website that, while essential, do not help the SEO of your website. It might be a sales page that does not contain any of your keywords (maybe only: “Click Here To Confirm” or “Enter Your Credit Card Details”), and letting a robot look at those pages can mean a worse ranking on Google (more content; fewer keywords).

    The information that you should be restricting using the robots.txt file is information that does not help in any way towards the SEO of your website, but we’ll discuss that again later.

    So, let’s create a robots.txt file for your website...
    It’s a simple plain text file (.txt), so we can create one using the most basic tools on your home computer. You should note that each domain should have its own robots.txt file, and that includes sub-domains. A separate robots.txt file should be created for each of “yourwebsite.com”, “about.yourwebsite.com” and “waffles.yourwebsite.com”.

    1) Open up a text editor...
    For example: Notepad in Windows; TextEdit in Mac OS X

    2) Start writing your robots.txt file...
    Writing your robots.txt file is very straightforward. The first thing you do is specify which web crawler/ spider/ robot the text applies to. This is done using the “User-agent” statement. A “*” is a wildcard and it means EVERYBODY (all cooperating web crawlers/ spiders/ robots). You then make a “Disallow” statement telling the web crawler/ spider/ robot where it is not allowed to go.

    As a result, the simplest form of the robots.txt file is as follows:

    ------
    User-agent: *
    Disallow: /
    ------

    The above robots.txt file entry tells ALL cooperating web crawlers, spiders and robots to avoid ALL of your website. Obviously this is something you are never going to do... You can also do the exact opposite. The below robots.txt entry allows ALL cooperating web crawlers/ spiders/ robots to visit ALL of your website.

    ------
    User-agent: *
    Disallow:
    ------
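
    If you want to check what a robots.txt file really says before uploading it, the two entries above can be tried out with the robots.txt parser that ships with Python. This is only a rough sketch: "AnyBot" and "/index.html" are made-up names, and other robots may read edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in a string (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

# "Disallow: /" asks every cooperating robot to stay out of the whole site.
block_all = build_parser("User-agent: *\nDisallow: /")

# An empty "Disallow:" forbids nothing, so the whole site may be crawled.
allow_all = build_parser("User-agent: *\nDisallow:")

# "AnyBot" and the page path are hypothetical names used for illustration.
print(block_all.can_fetch("AnyBot", "/index.html"))  # False
print(allow_all.can_fetch("AnyBot", "/index.html"))  # True
```

    The one character after "Disallow:" is the whole difference between the two files, so it is worth a quick check like this before you upload.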

    Using the robots.txt file you can keep cooperating robots away from specific files too, as in the example below:

    ------
    User-agent: *
    Disallow: /directory/file.html
    ------

    Using the robots.txt files you can tell cooperating web crawlers/ spiders/ robots to stay away from one or several directories...

    ------
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /images/
    Disallow: /tmp/
    Disallow: /private/
    ------
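
    To see which pages a set of directory rules like this would cover, a small check along the following lines may help. It is a sketch using Python's built-in parser; the sample page paths and the robot name "AnyBot" are invented for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def blocked_paths(paths, agent="AnyBot"):
    """Return the paths that the given robot is asked not to fetch."""
    return [p for p in paths if not parser.can_fetch(agent, p)]

# Hypothetical pages on the site.
sample = ["/index.html", "/images/logo.png", "/private/notes.txt", "/about/team.html"]
print(blocked_paths(sample))  # ['/images/logo.png', '/private/notes.txt']
```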

    3) In this way, you can write more specific robots.txt documents...

    In the below example, I want to keep the Googlebot out of my /images/ directory but I also want to keep Yahoo!’s bot out of the /videos/ directory. In addition I want to keep ALL cooperating bots out of my /cgi/ and /tmp/ directories. As a final stipulation, I also want VodaBot (okay, I made this one up) to stay away from an image file called pointless.jpg which is in my /images/ directory.

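    ------
    # A robot obeys only the most specific group that names it, so the
    # shared /cgi/ and /tmp/ rules are repeated in every named group.
    User-agent: Googlebot
    Disallow: /images/
    Disallow: /cgi/
    Disallow: /tmp/

    User-agent: yahoo
    Disallow: /videos/
    Disallow: /cgi/
    Disallow: /tmp/

    User-agent: VodaBot
    Disallow: /images/pointless.jpg
    Disallow: /cgi/
    Disallow: /tmp/

    User-agent: *
    Disallow: /cgi/
    Disallow: /tmp/
    ------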
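
    One detail worth checking when a file has several User-agent groups: a robot that finds a group naming it follows that group and does not also read the "*" group. The sketch below shows this with Python's built-in parser and a cut-down file; "SomeOtherBot" and the page paths are made up for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /images/

User-agent: *
Disallow: /cgi/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot has its own group, so only that group's rules apply to it.
googlebot_images = parser.can_fetch("Googlebot", "/images/photo.jpg")
googlebot_tmp = parser.can_fetch("Googlebot", "/tmp/cache.html")

# A robot with no group of its own falls back to the "*" group.
other_images = parser.can_fetch("SomeOtherBot", "/images/photo.jpg")
other_tmp = parser.can_fetch("SomeOtherBot", "/tmp/cache.html")

print(googlebot_images, googlebot_tmp)  # False True
print(other_images, other_tmp)          # True False
```

    So if a rule is meant for everybody, it has to be repeated inside each named group as well as under "User-agent: *".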

    Finally, you will note that while the fictitious VodaBot cannot access the file pointless.jpg it can access the rest of my /images/ directory ... but what if I wanted it the other way round? What if I wanted the excellently named VodaBot to NOT be able to access anything in the /images/ directory EXCEPT an image file called “meaning-of-life.jpg”? Then I would use an Allow statement in my robots.txt file.

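    ------
    User-agent: VodaBot
    Allow: /images/meaning-of-life.jpg
    Disallow: /images/
    ------

    Note that Allow is a later extension rather than part of the original 1994 standard, and robots differ in how they combine it with Disallow. Googlebot applies the most specific (longest) matching rule whatever the order, while some simpler parsers stop at the first rule that matches. Putting the Allow line before the Disallow line it overrides works for both kinds.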
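
    Whether the order of Allow and Disallow matters depends on the robot reading the file. As a rough illustration, the sketch below feeds both orderings to Python's built-in parser, which, as far as I know, stops at the first rule that matches. Robots that pick the longest matching rule instead (Google documents this behaviour) would allow the file in both cases.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt, agent, path):
    """Parse the given robots.txt text and ask whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

ALLOW_FIRST = """\
User-agent: VodaBot
Allow: /images/meaning-of-life.jpg
Disallow: /images/
"""

DISALLOW_FIRST = """\
User-agent: VodaBot
Disallow: /images/
Allow: /images/meaning-of-life.jpg
"""

path = "/images/meaning-of-life.jpg"

# With the Allow line first, the exception is honoured by this parser.
print(is_allowed(ALLOW_FIRST, "VodaBot", path))

# With the Disallow line first, a first-match parser never reaches the Allow.
print(is_allowed(DISALLOW_FIRST, "VodaBot", path))
```

    Listing the Allow line first is the ordering that gives the intended result under both styles of matching.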

    You should also be careful when using “/”, because a Disallow value is matched as a prefix of the URL path. “Disallow: /images/” covers only what is inside the /images/ directory. “Disallow: /images” (without “/” at the end) covers every path that begins with “/images”: the /images/ directory, but also files such as /images.html and directories such as /images-old/. Leave off the trailing “/” only if you really mean to block all of those.
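
    The effect of the trailing “/” is easy to test. The sketch below compares "Disallow: /images" with "Disallow: /images/" using Python's built-in parser, which treats the value as a path prefix; the page paths and the robot name "AnyBot" are invented for the example.

```python
from urllib.robotparser import RobotFileParser

def blocked(rule_path, url_path):
    """True if 'Disallow: <rule_path>' asks all robots to skip url_path."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {rule_path}"])
    return not parser.can_fetch("AnyBot", url_path)

# Columns printed: path, blocked by "/images", blocked by "/images/".
for url_path in ["/images/logo.png", "/images.html", "/images-old/a.png", "/index.html"]:
    print(url_path, blocked("/images", url_path), blocked("/images/", url_path))
```

    Without the trailing slash the rule also catches /images.html and /images-old/a.png; with it, only paths inside /images/ are covered.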

    Have a look at Wikipedia’s robots.txt file ( http://en.wikipedia.org/robots.txt ) as an example. It uses comments (lines starting with the # symbol) to explain how the file works. This is a great resource if you’re writing your first robots.txt file.

    4) Save and upload...
    Save your document in plain text format, as robots.txt, making sure that the extension of the text document is .txt. The file you have can be uploaded straight to the root (home) directory of the website it applies to.
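
    Robots look for the file at one fixed address on each host, which is why it has to sit in the root directory and why every sub-domain needs its own copy. As a small sketch, this helper (the function name and the page addresses are made up for illustration) works out where the file for any given page would be requested from:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt location for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any sub-domain); replace the path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://yourwebsite.com/shop/cart.html"))
# http://yourwebsite.com/robots.txt
print(robots_url("http://waffles.yourwebsite.com/recipes/"))
# http://waffles.yourwebsite.com/robots.txt
```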

    BAD BOTS
    The robots.txt file is a double-edged sword, however. You will notice that I keep referring to “cooperating” spiders. Many people assume that the robots.txt file can be used to hide parts of their website from the search engines. I cannot stress enough how wrong this is.

    Compliance with the robots.txt protocol is entirely voluntary, and there are a great many search engines out there on the Internet, each with its own crawler, spider or robot... Each of these must be programmed to follow the instructions laid out in your robots.txt document. Imagine if a crawler or spider was programmed to visit ONLY the links that the robots.txt told it not to visit. There is nothing to stop it doing this.
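
    To make the point concrete: the robots.txt file is public, and pulling the "hidden" paths out of it takes only a few lines. The sketch below is a deliberately simple illustration, not a full parser, and the sample file contents are invented.

```python
def disallowed_paths(robots_txt: str) -> list:
    """Collect every non-empty Disallow path, regardless of User-agent."""
    paths = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths

# Hypothetical file that tries to "hide" a private directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/   # meant to stay out of search results
"""

print(disallowed_paths(ROBOTS_TXT))  # ['/cgi-bin/', '/private/']
```

    Anything listed in a Disallow line is, in effect, advertised to whoever chooses to read the file.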

    Any parts of your website that you do not want to be visible to anybody should:
    (a) Not be uploaded to your website at all
    (b) Be password protected

    Of these two options, (a) is by far the more effective.

    In general, the robots.txt file is not there for security in any way. It is there to improve the Search Engine Optimization of your site, so that all the hard work you have done SEOing your website is put to the best possible use. It is there to stop Googlebot finding things that would hurt the SEO of your website or that are pointless as far as the theme or content of your website goes.

    Suggested further reading:

    How to make the googlebot love ya!

    Google Webmaster Tools 101


    VodaHost

    Your Website People!
    1-302-283-3777 North America / International
    02036089024 / United Kingdom
    291916438 / Australia

    ------------------------

    Top 3 Best Sellers

    Web Hosting - Unlimited disk space & bandwidth.

    Reseller Hosting - Start your own web hosting business.

    Search Engine & Directory Submission - 300 directories + (Google,Yahoo,Bing)



  2. #2
    Join Date
    Mar 2006
    Posts
    14,586

    Re: The robots.txt file

    Excellent ... should be updated somewhat though for maximum benefit, with the addition of auto-discovery coding below for the sitemap.xml:

    Complete robots.txt example for XML sitemaps autodiscovery (with no 'disallow' parameters) by adding the "sitemap" line as shown below:

    User-agent: *
    Allow:
    Allow: (etc. for as many as allowing)
    Sitemap: http://www.yoursitename.com/sitemap.xml
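
    A quick way to confirm that the Sitemap line is picked up is to read the file back with Python's built-in parser (its site_maps() method needs Python 3.8 or later). This is only a sketch: it uses an empty "Disallow:" line as the allow-everything rule, and the address is the same placeholder as above.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow:
Sitemap: http://www.yoursitename.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none.
print(parser.site_maps())  # ['http://www.yoursitename.com/sitemap.xml']

# The empty Disallow leaves every page open to cooperating robots.
print(parser.can_fetch("AnyBot", "/index.html"))  # True
```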


    If you have created a sitemap index file (one from which you have manually deleted the auto-generated page/item entries you do not want followed, so that it echoes your Disallow rules), you can reference that instead by using these lines in place of the above:

    User-agent: *
    Disallow: (enter specific files/pages not to be read)
    Sitemap: http://www.yoursitename.com/sitemap-index.xml


    Basically, before you upload your sitemap.xml file, delete the coding that maps the pages you do not want spidered ... thus, it "mirrors" your 'disallow' instructions in your robots.txt file via simple omission, being sure to alter the robots.txt file as shown above by including the "auto-discovery" of the sitemap code so it becomes a 'Rule'!
    . VodaWebs....Luxury Group
    * Success Is Potential Realized *

  3. #3
    Join Date
    Sep 2010
    Posts
    3

    Re: The robots.txt file

    Hi,
    This is my first post here and I'm not that html brained, but I did understand the above post on the sitemap.
    I have a google sitemap installed on my website, but it won't allow Googlebot-Images access to the images.

    This is what I have in the sitemap for crawler access:
    User-agent: *

    Disallow: /cgi-bin
    Disallow: /admin
    Disallow: /account.php
    Disallow: /advanced_search.php
    Disallow: /checkout_shipping.php
    Disallow: /create_account.php
    Disallow: /login.php
    Disallow: /password_forgotten.php
    Disallow: /shopping_cart.php
    Disallow: /_vti_bin
    Disallow: /_vti_cnf
    Disallow: /_vti_log
    Disallow: /_vti_pvt
    Disallow: /_vti_txt

    User-agent: Googlebot-Image

    Disallow: /

    Should I take out the "Disallow: /" or put "Allow: /images" under the Disallow?

    I will be thankful for any replies.

    Jen

  4. #4
    Join Date
    Sep 2010
    Location
    SW Florida
    Posts
    3

    Re: The robots.txt file

    Vasili, I am having a problem reaching either of the two links in your post. I am using Firefox/3.6.9. Please advise if these are available elsewhere.
    Thanks.

  5. #5
    Join Date
    Mar 2006
    Posts
    14,586

    Re: The robots.txt file

    JENVIN
    You cannot have conflicting instructions between the files: the robots.txt file will need to clearly state any disallow, and in this case, you must specifically 'rule' that your images are disallowed to be cached.
    Also, after auto-generating your sitemap.xml (I prefer not to use Google's version, as it is geared to the advantage of their overall scheme rather than being purely W3C compliant), you must carefully delete the entry for your image file/page, leaving no gap or spacing in the code and no mention that the file/page exists. The robots.txt file creates a Rule from a single stated Disallow, and the sitemap then carries no "affirmation" of the resource (no mention of the file or page, since it has been deleted from the xml sitemap, see?).
    The above was in keeping with the context of the earlier article discussing "hiding" page views, but to answer your question directly, "Yes, in your case you would create a specific 'Agent' mention and a proper 'Allow' Rule, as you show above in your post."
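
    As a rough check of the two options, the sketch below tests what each version of the Googlebot-Image group does to an image path, using Python's built-in parser. The image path is made up, the "*" group is shortened to a single rule, and the Allow line is listed first so that parsers which stop at the first matching rule also honour it.

```python
from urllib.robotparser import RobotFileParser

def image_bot_allowed(image_group_rules, path):
    """Check a path for Googlebot-Image given the rule lines of its group."""
    lines = ["User-agent: *", "Disallow: /admin", "",
             "User-agent: Googlebot-Image", *image_group_rules]
    parser = RobotFileParser()
    parser.parse(lines)
    return parser.can_fetch("Googlebot-Image", path)

photo = "/images/product.jpg"  # hypothetical image path

# As posted: "Disallow: /" keeps Googlebot-Image away from everything.
print(image_bot_allowed(["Disallow: /"], photo))                     # False

# With an Allow for /images listed first, the image becomes reachable...
print(image_bot_allowed(["Allow: /images", "Disallow: /"], photo))   # True

# ...while the rest of the site stays off limits to the image robot.
print(image_bot_allowed(["Allow: /images", "Disallow: /"], "/admin/index.php"))  # False
```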

    HALFDIME
    The links above were SAMPLES (note the word "YourSiteName" in them?)
    Replace "yoursitename" with your domain name ....


    You can generate a compliant robots.txt and a sitemap.xml both at this site.

  6. #6
    Join Date
    Jul 2008
    Location
    London United Kingdom
    Posts
    30

    Re: The robots.txt file

    i'm adding it now as we speak hahahahahahahaha!!

  7. #7
    Join Date
    Jul 2011
    Posts
    3

    Re: The robots.txt file

    Hello,

    I'm a total beginner at creating websites, so forgive the dumb questions, please.

    Where should the "robots.txt" information (Disallow, Allow) be placed?
    Somewhere in the html below and on every page in the website? (I have about 35 pages):

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <HTML>
    <HEAD>
    <TITLE>xxxxx</TITLE>
    <META HTTP-EQUIV="Pragma" CONTENT="no-cache">
    <META Name="Keywords" Content="xxxxx">
    <META Name="Description" Content="xxxxx">
    <META NAME="ROBOTS" CONTENT="ALL">
    <META NAME="revisit-after" CONTENT="10 days">
    <META NAME="author" content="xxxxx">
    <META NAME="copyright" content="Copyright 1980-2007 by xxxxx. All Rights Reserved.">
    <META NAME="resource-type" content="document">
    <META NAME="distribution" content="global">

    </HEAD>

    Also, if there is another better way to create the above, I would really
    appreciate knowing that.

    Thank you so much!

    L.N.

  8. #8
    Join Date
    Mar 2005
    Location
    Wilmington, Delaware USA
    Posts
    11,767

    Re: The robots.txt file

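    The Disallow and Allow lines do not go inside your HTML pages at all. They live in one separate plain text file called robots.txt, so just pop it into your public_html folder.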




  9. #9
    Join Date
    Jan 2010
    Location
    Tampa, FL
    Posts
    24

    Re: The robots.txt file

    I don't have any pages I wanted to 'disallow' but I did notice an increase in organic traffic when I uploaded a blank robots.txt file to the folder.
    Jason Stallworth
    TheMuscleProgram.com (main/first site)
    http://www.themuscleprogram.com/

    My product site:
    http://www.hardcoremusclebuildingprogram.com/

    http://www.jasonstallworth.com

    Working on a metal/guitar site...coming soon!
