How to Create a Good Robots.txt File for Your Site

Robots.txt is a fairly technical topic, and the term may be new to many people. Yet this small text file can decide your website's future.

How is that possible?

It is. This small text file can control your site's traffic: if you get it wrong, your pages might not show up in search results at all. So it is important to know how to use it properly.

It is one of the simplest and easiest SEO methods you can apply to your site. You don't need deep technical knowledge to harness the power of robots.txt; once you can reach your site's root directory, the rest is easy.

Robots.txt File

Also, placing robots.txt just anywhere on the site won't help. You first have to find your site's root directory and keep the file there. Only then will web crawlers be able to identify your instructions and act accordingly.

From this article you will get the answer to the following questions:

  • What is a robots.txt file?
  • What are the uses of a robots.txt file?
  • How does it work?
  • How do you create one?
  • Why is the robots.txt file important?
  • What should you include in this file?

First, let me explain the term.

What is a Robots.txt File?

Robots.txt is a text file located in a site's root directory. It controls how search engine crawlers and spiders visit a particular website; in other words, it tells search engines which pages of the website they should or should not visit.

Every website owner tries to get noticed nowadays, and this small text file can help with that. It lets you include or exclude particular pages from search results. You will get a clearer idea of this after reading this article.

When a crawler accesses a site, the first thing it requests is the ‘robots.txt’ file. If such a file exists, the crawler follows the indexing instructions in it before going any further.

If you haven't added a robots.txt file, search engines can freely crawl anywhere on your site and index everything they find. Even so, it is good practice to add one and specify your sitemap in it; that makes it easier for search engines to find new content without delay.
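For instance, the simplest possible robots.txt, which lets every crawler access everything, is just two lines (an empty Disallow value means nothing is blocked):

User-agent: *
Disallow: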

Uses of robots.txt:

  • You can keep crawlers away from duplicate pages
  • You can stop search engines from indexing your internal search result pages
  • You can stop search engines from indexing certain areas of your site, or the whole site
  • You can prevent certain images or files from being indexed
  • You can point search engines to your sitemap
  • You can set a crawl delay to stop your server from being overloaded when crawlers request many pieces of content at the same time

Only use robots.txt when you need to control access to particular pages. If there is nothing like that to restrict, you don't have to use it.

How the Robots.txt File Works:

A search engine has two main functions.

  1. Crawling websites to discover content
  2. Indexing that content to serve searchers who are looking for particular information

A search engine crawls from one site to another, and in this way it works its way across billions of sites. This crawling process is also known as spidering.

After arriving at a website, and before crawling through it on its way to the next site, the search crawler looks for the robots.txt file. If it finds one, the crawler reads it before continuing on that site. The robots.txt file contains instructions for the web crawler, telling it where it may and may not proceed. If the crawler cannot find any directions on what to do, it will simply carry on crawling the whole site.

Where does robots.txt go?

Robots.txt is the first thing a web crawler or search engine looks for when it visits a site, and it only looks in the main directory. If the file is not found there, the crawler proceeds to crawl everything on the site. So it is essential to place the robots.txt file in the main directory of the root domain.

To explain this, let's take wordpress.com as an example. If the user agent visits www.wordpress.com/robots.txt and there is no robots file there, it assumes the site has no instructions and starts to index each and every page. If the robots file instead sits at www.wordpress.com/index/robots.txt or www.wordpress.com/homepage/robots.txt, the user agent will not find it, and the site will be treated as one without a robots.txt.

Steps to Create a Robots.txt File

A robots.txt file contains two kinds of fields: one line carries a user-agent name, and one or more lines below it carry directives. The directive lines tell that crawler what it has to do on the website. Let's check how to create a robots.txt file.

  • The first step is to open a new text file. You can use Notepad on a PC or a plain text editor on a Mac, and save the file as plain text.
  • Upload it to your root directory. This is the root-level folder, usually called ‘htdocs’ or ‘www’, that sits directly behind your domain name.
  • If you have subdomains, create a separate robots.txt file for each subdomain.

Here is the basic format of robots.txt:

User-agent: [user-agent-name]

Disallow: [URL string that is not to be crawled]

Together, these lines make up a robots.txt file. There can be multiple user-agent lines and directives, covering anything from allows and disallows to crawl delays.
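As a quick sketch of a file with more than one group (Googlebot is Google's crawler; the /drafts/ path is just a placeholder):

User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow:

The first group keeps Google's crawler out of the /drafts/ folder, while the second group leaves every other crawler unrestricted.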

Technical terms in robots.txt:

There are some common terms in the robots.txt language, known as robots.txt syntax. Five main directives are commonly used in a robots.txt file. They are:

User-agent:

User-agent names the web crawler or search engine to which you are giving instructions.

Disallow:

This command instructs the crawler not to crawl a particular URL. Each Disallow line can hold only one URL path.

Allow:

This command is mainly relevant to Googlebot. It tells Googlebot that it may access a subfolder or page even though its parent folder is disallowed.
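A small sketch of how that looks (the /photos/ paths are placeholders):

User-agent: Googlebot
Disallow: /photos/
Allow: /photos/public/

Here Googlebot is kept out of /photos/ but may still crawl anything under /photos/public/.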

Crawl-delay:

It indicates how many seconds the crawler should wait before loading and crawling page content. Googlebot does not obey this directive, but you can set a crawl rate for Google in Google Search Console instead.

Sitemap:

It is used to call out the location of any XML sitemap associated with the site. It is only supported by Google, Yahoo, Bing and Ask.

These are the most common terms you should know in robots.txt syntax. Once you know them, you can work out what a file does just by reading it.
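Putting the five directives together, a small example file might look like this (example.com, the paths and the ten-second delay are placeholder values; Bing is among the crawlers that have honoured Crawl-delay):

User-agent: Bingbot
Crawl-delay: 10

User-agent: Googlebot
Disallow: /private/
Allow: /private/overview.html

User-agent: *
Disallow: /private/

Sitemap: https://www.example.com/sitemap.xml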

What to include in a Robots.txt file?

Robots.txt simply gives web robots instructions about what they may or may not access. If you don't want a webpage to show up for searchers, you can steer crawlers away from it using the robots.txt file, or protect it with a password instead. In this way you can keep the location of admin or other private pages out of search results, because robots are prevented from crawling them.
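As a quick illustration before the fuller examples below: on a typical WordPress install the admin area lives under /wp-admin/, so a rule like this (adjust the path to your own setup) keeps crawlers out of it:

User-agent: *
Disallow: /wp-admin/

Remember, though, that robots.txt is publicly readable, so it hides pages from search results, not from people.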

Now let's check how to do it with some examples.

  • Allow everything and submit a sitemap:

This is a good option for most sites. It allows search engines to crawl everywhere and index all your data, and it also shows the XML sitemap location so that crawlers can easily reach new pages.

User-agent: *

Allow: /

#sitemap reference

Sitemap: https://www.wordpress.com/sitemap.xml

  • Allow everything except certain subdirectories

Sometimes there will be areas of your site that you don't want to show in search results. It could be anything: an image folder, a checkout area, files, an audit section and so on. You can disallow them:

User-agent: *

Allow: /

# disallowed subdirectory

Disallow: /checkout/

Disallow: /images/

Disallow: /audit-report/

  • Allow everything apart from certain file types:

Sometimes you may want to show media, images or documents on your website without letting them appear in search results. You can hide GIF, PDF or PHP files as shown below:

User-agent: *

Allow: /

#Disallow file types

Disallow: /*.gif$

Disallow: /*.pdf$

Disallow: /*.php$

  • Allow everything apart from certain webpages:

Sometimes you may want to hide pages that are not meant for general reading; it could be anything from your terms and conditions to sensitive topics you don't want to show everyone. You can hide them as follows:

User-agent: *

Allow: /

#disallow web pages

Disallow: /terms.html

Disallow: /secret-list-of-contacts.php

  • Allow everything except certain URL patterns

Sometimes you may want to disallow certain URL patterns, such as test pages or internal search result pages:

User-agent: *

Allow: /

#disallow URL patterns

Disallow: /*search=

Disallow: /*test.php$

In the examples above you will have noticed a few special symbols. Here is what each of them actually means:

  • The star symbol (*) stands for any sequence of characters, including none.
  • The dollar symbol ($) marks the end of the URL. If you forget it, you can accidentally block a huge number of URLs, as the example below shows.
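For example (with a made-up path):

# blocks only URLs that end in .pdf, such as /guide.pdf
Disallow: /*.pdf$

# also blocks any URL that merely contains .pdf, such as /guide.pdf?download=1
Disallow: /*.pdf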

Note: be careful not to disallow the whole domain. Sometimes you will see a command like this:

User-agent: *

Disallow: /

Do you know what this means? You are telling search engines to stay away from your whole domain. They won't index any of your webpages, and you won't appear in any search results. So be careful not to add this accidentally.

Final Testing:

It is important to check whether your robots.txt file is working. Even if you are sure you have done everything right, a proper check is recommended.

You can use Google's robots.txt testing tool to find out whether everything is okay with your file. First, you need to register the site that uses the robots.txt file in Google's webmaster tools (Search Console). After registering, log in to the tool and select your site; Google will then display any errors it finds in the file.

How to check whether your site already has a robots.txt file?

You can check this easily. Let's take the earlier WordPress example: type your website address, www.wordpress.com, and add /robots.txt to it, i.e. www.wordpress.com/robots.txt. You will then see whether your site has a robots.txt file or not.

Other quick robots.txt tips:

  • Place robots.txt in the website's top-level directory so that it is easily found
  • If you disallow a subdirectory, every file and webpage within that subdirectory is disallowed as well
  • The filename is case sensitive: it must be entered as robots.txt, otherwise it won't work
  • Some user agents may ignore your robots.txt file; crawlers such as email scrapers and malware robots often do
  • /robots.txt is publicly available, so it is better not to list any private user information in it. Anyone can add /robots.txt to the end of a root domain and see which pages you do or do not want crawled, if the site has a robots.txt file
  • It can take several days for a search engine to notice a disallowed URL and remove it from its index
  • Each subdomain of a root domain uses its own robots.txt file. For example, blog.wordpress.com and wordpress.com use separate files: blog.wordpress.com/robots.txt and wordpress.com/robots.txt
  • It is better to add the location of your sitemap at the bottom of the robots.txt file, as in the sample after this list
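A tidy file that follows these tips might look like this (the paths and the sitemap URL are placeholders to adapt to your own site):

User-agent: *
Disallow: /wp-admin/
Disallow: /*search=

Sitemap: https://www.example.com/sitemap.xml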

Have you got the idea now? It's a simple concept, right? You can apply it to your site and improve its performance. You don't have to expose everything on your site: you can hide your admin pages, terms and conditions and the like from searchers, and the robots.txt file will help you do that. Use it wisely to indicate your sitemap and make your site's indexing faster.

Robots.txt is not only about disallowing unwanted content or files; it also helps search engines pick up your pages faster. And it is easy to do: the task needs no special technical knowledge, and anybody can manage it after a careful analysis of their site. After applying it, don't forget to test it with Google's robots.txt tool, which helps you identify whether there are any errors in the text you added.

It is essential to keep yourself up to date on all aspects of SEO. You are in a market where changes happen daily, so you have to know what is happening around you. Try to implement the most modern techniques to make your site a huge success.
