Jun 15, 2012
 
The Robots Exclusion Protocol (REP), commonly known simply as robots.txt, is a text file that bloggers and webmasters create to instruct search engine robots on how to crawl and index their blog and its pages. Content Management Systems (CMS) like WordPress generate a default robots.txt, or you can create one manually with any text editor. Just make sure the file is saved as plain text (ASCII-encoded), not as an HTML file.

Learning About Robots.txt:


Many new bloggers are unaware of the robots.txt file and its advantages. It can be used to instruct search engine robots on whether or not to crawl a particular page. It helps to prevent duplicate content issues and also adds a layer of security by discouraging search engines from indexing sensitive parts of a blog or website.

Below is the default robots.txt that comes with a WordPress installation:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/

The syntax “Disallow: /wp-admin/” tells all robots not to index files and information inside the wp-admin folder.
The syntax “Disallow: /wp-includes/” tells all robots not to index files and information inside the wp-includes folder.

The default robots.txt adds some security to your blog, but it leaves several other folders open to crawlers and cannot save you from duplicate content issues.
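To see what the default rules actually do, here is a minimal sketch that feeds them to Python's standard-library robots.txt parser. The domain www.example.com and the two test URLs are placeholders chosen for illustration, and this parser only approximates how a real search engine reads the file.

```python
from urllib.robotparser import RobotFileParser

# The default WordPress rules quoted above.
DEFAULT_RULES = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
"""

parser = RobotFileParser()
parser.parse(DEFAULT_RULES.splitlines())

# www.example.com is a placeholder domain, not a real blog.
admin_allowed = parser.can_fetch("*", "http://www.example.com/wp-admin/options.php")
post_allowed = parser.can_fetch("*", "http://www.example.com/my-post/")

print("wp-admin crawlable:", admin_allowed)
print("post crawlable:", post_allowed)
```

The admin URL is reported as not crawlable while the ordinary post URL is, because a Disallow line matches every path that starts with the given prefix.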

Below is the robots.txt I use for my blog, followed by a detailed description of the elements in it:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Disallow: /go/*
Disallow: /wp-includes/
Disallow: */trackback/
Disallow: /author/
Disallow: /cgi-bin/
Disallow: /?p=*
Allow: /wp-content/uploads/

User-agent: Mediapartners-Google*
Allow: /

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

User-agent: Adsbot-Google
Allow: /

User-agent: Googlebot-Mobile
Allow: /

Sitemap: http://www.techblazes.com/sitemap-image.xml
Sitemap: http://www.techblazes.com/sitemapindex.xml

You have already learned about the syntax “Disallow: /wp-admin/” and “Disallow: /wp-includes/”, so we shall proceed with others.

Preventing Search Engine Robots from Indexing Plugins and Theme files:

The syntax for this is:

Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/

This instructs search engine robots not to index the plugin and theme files, thereby keeping sensitive information away from robots and from hackers who rely on search engines. By adding these lines to your robots.txt, you add a security layer to your blog by hiding these folders from search engines. Note, though, that anyone familiar with WordPress knows these default folders exist.

Preventing Search Engine Robots from Indexing Affiliate Links:

Do you promote products as an affiliate? Then you need to prevent the affiliate links from being indexed by Google and other search engines. To do so, you can install a plugin named “GoCodes”, which allows you to cloak your affiliate URLs in the format “http://www.yourdomain.com/go/affiliatename”.

So we need to use the syntax “Disallow: /go/*”. Did you notice the asterisk? It is a wildcard that prevents search engine robots from indexing any URL that begins with /go/. This way your affiliate URLs are not indexed by Google and other search engines, which helps protect your rankings in the SERPs.

 

Note that many plugins cloak affiliate URLs, including EasyAZON, which helps users with Amazon affiliate sales. Whichever plugin you use, it is recommended that you disallow the path prefix of your cloaked affiliate URLs.
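The wildcard behaviour can be sketched in a few lines of Python. This is only an illustration of the matching style that the major search engines document (an asterisk matches any run of characters, and a trailing dollar sign anchors the end of the URL); the function name rule_matches and the sample paths are made up for this example and are not part of any plugin or library.

```python
import re

def rule_matches(rule: str, path: str) -> bool:
    """Return True if a robots.txt path rule matches the given URL path.

    Hypothetical helper: '*' matches any sequence of characters and a
    trailing '$' anchors the end of the path. Matching always starts
    at the beginning of the path.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape every literal piece, then join the pieces with '.*'.
    pattern = ".*".join(re.escape(piece) for piece in rule.split("*"))
    if anchored:
        pattern += "$"
    return re.match(pattern, path) is not None

print(rule_matches("/go/*", "/go/affiliatename"))  # cloaked affiliate link
print(rule_matches("/go/", "/go/affiliatename"))   # same rule without the asterisk
print(rule_matches("/go/*", "/gopher-guide/"))     # unrelated post
```

Under these assumptions the first two calls match and the third does not, which also shows that the trailing asterisk is optional: rules are prefix matches, so “Disallow: /go/” covers the same URLs.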

Disallowing Trackbacks from being Indexed by Search Engines:

The syntax “Disallow: */trackback/” prevents search engines from indexing trackbacks and pingbacks from your site.

Preventing Search Engines from indexing Author pages:

WordPress, by default, creates an author page for each and every author/contributor on a blog. The URL displays the username of the author, and if the username is known to a hacker, they have one less thing to guess when trying to crack the password.

We can hide the author URL on our blog, but to prevent search engine robots from indexing it, we need to add the syntax “Disallow: /author/”. This instructs search engine bots not to index the author URLs. By adding this line to your robots.txt, you add another security layer to your blog.

Disallowing cgi-bin files from being indexed:

The syntax “Disallow: /cgi-bin/” tells all robots not to index files and sensitive information inside the cgi-bin folder.

Disallowing URLs that end with Post ID:

Did you know that the blog posts on your website can also be accessed by their post IDs? It does not matter whether you have changed the permalink structure or not. For example, “http://www.mydomain.com/my-post” can also be accessed at “http://www.mydomain.com/?p=1234”. This can create duplicate content issues. To prevent it, add the syntax “Disallow: /?p=*” and the URLs with post IDs will not be indexed.
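As a rough illustration of which URLs this rule targets, the sketch below rebuilds the part of a URL that robots.txt rules are compared against (the path plus the query string) and checks whether it starts with /?p=. The function name blocked_by_post_id_rule is hypothetical, and the URLs are the placeholder examples from the paragraph above.

```python
from urllib.parse import urlsplit

def blocked_by_post_id_rule(url: str) -> bool:
    """Return True if the URL falls under a 'Disallow: /?p=' style rule.

    Hypothetical helper: robots.txt rules are prefix matches against the
    path and query string, so we rebuild that part of the URL and compare.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target.startswith("/?p=")

print(blocked_by_post_id_rule("http://www.mydomain.com/?p=1234"))
print(blocked_by_post_id_rule("http://www.mydomain.com/my-post"))
```

Only the post-ID form of the address is caught, so the permalink version of the same post stays available to search engines.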

Some Bots by Google:

Google has 4 other bots that you can address in robots.txt besides the main bot: the AdSense bot, the AdWords bot, the Mobile Rendering bot and the Image bot. Below are the entries you need to add to allow the respective bots.

For AdSense bot:

User-agent: Mediapartners-Google*
Allow: /

For Image bot:

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

For AdWords bot:

User-agent: Adsbot-Google
Allow: /

For Mobile Rendering bot:

User-agent: Googlebot-Mobile
Allow: /
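One thing worth checking before you copy these entries: a crawler that finds a group written for its own name follows that group instead of the “User-agent: *” group, not in addition to it. The sketch below illustrates this with Python's standard-library parser on a shortened rule set. The domain www.example.com and the bot name SomeOtherBot are placeholders, and a real search engine may differ in details from this parser.

```python
from urllib.robotparser import RobotFileParser

# A shortened version of the rules above: one general group, one bot-specific group.
RULES = """\
User-agent: *
Disallow: /wp-admin/

User-agent: Adsbot-Google
Allow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

url = "http://www.example.com/wp-admin/"  # placeholder domain

# A bot with no group of its own falls back to the '*' rules.
generic_allowed = parser.can_fetch("SomeOtherBot", url)

# The AdWords bot has its own group, so only 'Allow: /' applies to it.
adsbot_allowed = parser.can_fetch("Adsbot-Google", url)

print("SomeOtherBot may crawl wp-admin:", generic_allowed)
print("Adsbot-Google may crawl wp-admin:", adsbot_allowed)
```

In this sketch the bot-specific “Allow: /” opens the admin folder to that bot even though the general group disallows it. If you want a named bot to stay out of a folder, repeat the relevant Disallow lines inside its own group.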

Sitemap URLs:

If you have a sitemap and want search engines to index it, then you can add its URL after the syntax “Sitemap:”, one line per sitemap.
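If you want to confirm that your Sitemap lines are being picked up, Python's standard-library parser can list them. This is a small sketch that assumes Python 3.8 or newer (where the site_maps() method is available) and uses the two sitemap URLs from the robots.txt shown earlier in this post.

```python
from urllib.robotparser import RobotFileParser

# Only the parts of the robots.txt needed for this check.
RULES = """\
User-agent: *
Disallow: /wp-admin/

Sitemap: http://www.techblazes.com/sitemap-image.xml
Sitemap: http://www.techblazes.com/sitemapindex.xml
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# site_maps() returns a list of sitemap URLs, or None if there are none.
sitemaps = parser.site_maps() or []
for sitemap_url in sitemaps:
    print(sitemap_url)
```

Sitemap lines are independent of the User-agent groups, so they can sit anywhere in the file.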

Do you think I missed some important things or codes to add to robots.txt? If yes, then please let me know via comments.


Naser

Hello friends, my name is Naser Mohd Baig from techiefusion. I also manage techiedrive. I am interested in reading and writing about technology and the latest gadgets.
  • http://www.finditinhednesford.co.uk/ Mitchell

    Thanks for the read rather not get penalized for affiliate links on my webpage when it’s easily avoidable with the solution you’ve provided.

  • http://rat7.net/ Mel Rat 7

    Thanks for explaining this, but what is the actual reason that we don’t want google to index our affiliate links.  Because google will still know they are there…  Please let me know as I’m keen to find out.  I already cloaked my links with pretty link so would have to go back again and do then with gocodes so just wondering if it’s worth it. thanks

  • http://sharemarket.org.in/ share market

    One of my blog also has this in robots.txt
    Disallow: /wget/
    Disallow: /httpd/
    Disallow: /i/
    Disallow: /f/
    Disallow: /t/
    Disallow: /c/
    Disallow: /j/
    Disallow: /*?
     
    Have seen on some site about it.

  • http://tutorialmantap.blogspot.com/ dgunzsmoker

    thank a lot, I will try this tips to my blog

  • sintu kumar

    thanks for tips…great work

  • sintu kumar

    thanks for the tips

  • http://vialenet.wordpress.com/ ingress

    a very important File !

  • http://www.facebook.com/PipoftheWirral Philippa Wynn

    wow, I had no idea! Still not much wiser I’m afraid as I’m pretty new to blogging ans SEO stuff – think I need a beginners info guide, literally from step 1. Could anyone recommend anything I can read to clue me up a bit?
    Thanks, and GREAT blog btw, as per!