The Robots Exclusion Protocol (REP), better known as robots.txt, is a plain text file that bloggers and webmasters create to tell search engine robots how to crawl and index a blog and its pages. Content Management Systems (CMS) like WordPress create a default robots.txt automatically, or you can write one by hand in any text editor; just be careful to save it as an ASCII-encoded text file, not an HTML file.


Many newbie bloggers are unaware of the robots.txt file and its advantages. It can tell search engine robots whether or not to crawl a particular page, which helps prevent duplicate content issues and adds a layer of security by keeping search engines from indexing sensitive parts of a blog or website.

Below is the default robots.txt that comes with a WordPress installation:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/

The directive “Disallow: /wp-admin/” prevents all robots from indexing the files and information inside the wp-admin folder.
The directive “Disallow: /wp-includes/” prevents all robots from indexing the files and information inside the wp-includes folder.
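If you want to sanity-check rules like these before publishing them, Python's standard library ships a robots.txt parser, `urllib.robotparser`, that can evaluate a rule set against sample URLs. A minimal sketch (the example.com URLs are placeholders, not from this post):

```python
# Check the default WordPress rules with Python's stdlib robots.txt
# parser. The example.com URLs below are placeholders.
import urllib.robotparser

rules = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# The two disallowed folders are blocked for every user agent:
print(rp.can_fetch("*", "http://example.com/wp-admin/options.php"))      # False
print(rp.can_fetch("*", "http://example.com/wp-includes/js/jquery.js"))  # False
# Ordinary posts remain crawlable:
print(rp.can_fetch("*", "http://example.com/my-post/"))                  # True
```

`can_fetch()` takes a user agent token and a full URL, so you can also test rules that apply only to a specific bot.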

The default robots.txt adds some security to your blog, but the protection is minimal, and it cannot save you from duplicate content issues.

Below is the robots.txt I use for my blog, with a detailed description of each element:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Disallow: /go/*
Disallow: /wp-includes/
Disallow: */trackback/
Disallow: /author/
Disallow: /cgi-bin/
Disallow: /?p=*

Allow: /wp-content/uploads/

User-agent: Mediapartners-Google*
Allow: /

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

User-agent: AdsBot-Google
Allow: /

User-agent: Googlebot-Mobile
Allow: /

Sitemap: http://www.techblazes.com/sitemap-image.xml
Sitemap: http://www.techblazes.com/sitemapindex.xml

You have already learned about the directives “Disallow: /wp-admin/” and “Disallow: /wp-includes/”, so let us proceed to the others.

Preventing Search Engine Robots from Indexing Plugin and Theme Files:

The directives for these are:

Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/

This prevents search engine robots from indexing the plugin and theme files, thereby keeping that sensitive information away from robots and, to some extent, from hackers. By adding these directives to your robots.txt, you add a security layer to your blog by hiding these folders from search engines. Note, however, that anyone familiar with WordPress knows these default folders exist, so this is obscurity rather than real protection.

Preventing Search Engine Robots from Indexing Affiliate Links:

Do you promote products as an affiliate? Then you need to prevent your affiliate links from being indexed by Google and other search engines. One way to do so is to install a plugin named “GoCodes”, which lets you cloak your affiliate URLs into the format “http://www.yourdomain.com/go/affiliatename”.

So we need to use the directive “Disallow: /go/*”. Did you notice the asterisk? It is a wildcard that tells search engine robots not to index any URL that begins with /go/. Your affiliate URLs are therefore not indexed by Google and other search engines, which helps preserve your search rankings.
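One caveat: the `*` wildcard is an extension supported by Google and standardized in RFC 9309, but Python's stdlib `urllib.robotparser` only does plain prefix matching and does not expand it. As a rough illustration of what the wildcard means, here is a small sketch that translates a wildcard rule into a regular expression; the `blocked_by` helper is my own name for illustration, not part of any library:

```python
# Sketch of Google-style wildcard matching for robots.txt rules.
# The helper name blocked_by is made up for this example.
import re

def blocked_by(rule: str, path: str) -> bool:
    """Return True if `path` matches a Disallow pattern like '/go/*'."""
    # Escape regex metacharacters, then turn the robots.txt '*'
    # wildcard into '.*'; rules are anchored at the start of the path.
    pattern = re.escape(rule).replace(r"\*", ".*")
    return re.match(pattern, path) is not None

print(blocked_by("/go/*", "/go/hosting-deal"))    # True
print(blocked_by("/go/*", "/blog/hosting-deal"))  # False
print(blocked_by("/?p=*", "/?p=1234"))            # True
```

The same logic explains the “Disallow: /?p=*” rule discussed further below: anything after the wildcard is treated as a match.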


Note that many plugins cloak affiliate URLs, including EasyAZON, which helps users with Amazon affiliate sales; it is recommended that you disallow your cloaked affiliate URL path as well.

Disallowing Trackbacks from being Indexed by Search Engines:

The directive “Disallow: */trackback/” prevents search engines from indexing the trackbacks and pingbacks on your site.

Preventing Search Engines from indexing Author pages:

WordPress, by default, creates an author page for each author/contributor on a blog. The URL exposes the author's username, and if a hacker knows the username, cracking the password becomes much easier.

We can hide the author URL on our blog, but to prevent search engine robots from indexing it, we need to add the directive “Disallow: /author/”. This instructs search engine bots not to index the author URLs. By adding this directive to your robots.txt, you add another security layer to your blog.

Disallowing cgi-bin files from being indexed:

The directive “Disallow: /cgi-bin/” prevents all robots from indexing the files and sensitive information inside the cgi-bin folder.

Disallowing URLs that end with Post ID:

Did you know that the posts on your blog can also be accessed by their post ID, regardless of your permalink structure? For example, “http://www.mydomain.com/my-post” can also be reached at “http://www.mydomain.com/?p=1234”. This can create duplicate content issues. To prevent it, add the directive “Disallow: /?p=*”, and URLs containing a post ID will not be indexed.

Other Bots by Google:

Besides the main Googlebot, Google has four other bots you can address in robots.txt: the AdSense bot, the AdWords bot, the mobile bot, and the image bot. Below are the directives for each.

For AdSense bot:

User-agent: Mediapartners-Google*
Allow: /

For Image bot:

User-agent: Googlebot-Image
Allow: /wp-content/uploads/

For AdWords bot:

User-agent: AdsBot-Google
Allow: /

For Mobile Rendering bot:

User-agent: Googlebot-Mobile
Allow: /

Sitemap URLs:

If you have a sitemap and want search engines to find it, add its URL after the “Sitemap:” directive.
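The Sitemap lines are machine-readable too: on Python 3.8 or later, `urllib.robotparser` can read them back out of a parsed robots.txt via `site_maps()`. A small sketch using the sitemap URLs from the example file above:

```python
# Extract Sitemap URLs from a robots.txt with the stdlib parser.
# Requires Python 3.8+ for RobotFileParser.site_maps().
import urllib.robotparser

rules = """\
User-agent: *
Disallow: /wp-admin/

Sitemap: http://www.techblazes.com/sitemap-image.xml
Sitemap: http://www.techblazes.com/sitemapindex.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())
print(rp.site_maps())
# ['http://www.techblazes.com/sitemap-image.xml',
#  'http://www.techblazes.com/sitemapindex.xml']
```

`site_maps()` returns `None` when the file lists no sitemaps, so it doubles as a quick check that your Sitemap lines are spelled correctly.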

Do you think I missed some important things or codes to add to robots.txt? If yes, then please let me know via comments. 🙂

Naser

Hello friends, my name is Naser Mohd Baig. I blog about outdoor activities.