What is Robots.txt & Why You Need One


What is a Robots.txt File?

A robots.txt file is an instruction file that tells Google and other search engines which pages of your website they may crawl. In other words, it is a document that controls crawler access to your website's content.

Example Robots.txt formats:

These are common formats used to control how search engines crawl and index a website.

Allow all search engines to crawl everything

User-agent: *
Disallow:

Or

User-agent: *
Allow: /

Block all search engines from crawling the entire website

User-agent: *
Disallow: /

Block all search engines from crawling a specific folder

User-agent: *
Disallow: /folder_name/

Block Googlebot from crawling a folder, while still allowing it to crawl one file in that folder

User-agent: Googlebot
Disallow: /folder1/
Allow: /folder1/myfile.html

Robots.txt Wildcard Matching

Google and Microsoft's Bing allow the use of wildcards in robots.txt files.

To block access to all URLs that include a question mark (?), such as /products?page=2, you could use the following entry:

User-agent: *
Disallow: /*?

You can use the $ character to match the end of a URL. For instance, to block URLs that end with .php (but not, say, /page.php?x=1, which does not end in .php), you could use the following entry:

User-agent: Googlebot
Disallow: /*.php$

What is a user-agent in Robots.txt?

A user-agent in robots.txt identifies the search engine bot that the rules which follow it apply to.
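
For example, you can give different bots different rules by addressing each one in its own group (the folder names below are placeholders; substitute your own):

User-agent: Googlebot
Disallow: /not-for-google/

User-agent: *
Disallow: /private/

Here, Googlebot follows only its own group, while every other bot follows the * group.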

Here is a list of the top 10 search engine bots:

Search Engine Name                     Search Engine Bot
Google                                 Googlebot
Bing and Yahoo                         Bingbot
Yahoo (Yahoo Mobile Search results)    Slurp
DuckDuckGo                             DuckDuckBot
Baidu (Chinese search engine)          Baiduspider
Yandex (Russian search engine)         YandexBot
Sogou.com                              Sogou
Exalead (France)                       Exabot
Facebook                               Facebot or FacebookExternalHit
Amazon’s Alexa Internet rankings       ia_archiver (Alexa crawler)

Why do you need a Robots.txt file?

Before a search engine crawls your website, it looks at your robots.txt file for directions on which pages it is permitted to crawl (visit) and index (save) in search engine results. Robots.txt files are useful:

1. If you want search engines to ignore any duplicate pages on your site

2. If you don’t want search engines to index your internal search results pages

3. If you don’t want search engines to index certain areas of your website, or your whole website

4. If you don’t want search engines to index certain files on your website (images, PDFs, etc.)

5. If you want to tell search engines where your sitemap is located (see the example after this list)
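
For example, a robots.txt that hides internal search results pages and points crawlers to a sitemap might look like this (the /search/ path and sitemap URL are placeholders for your own):

User-agent: *
Disallow: /search/
Sitemap: https://www.example.com/sitemap.xml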

If there are no areas on your site to which you want to control user-agent access, you may not need a robots.txt file at all.

Checking if you have a robots.txt file

Not sure if you have a robots.txt file? Simply type in your root domain, then add /robots.txt to the end of the URL.

For instance, Sonu Prasad Gupta’s robots file is located at http://www.sonuprasadgupta.com/robots.txt. If no such page appears, you do not currently have a (live) robots.txt file.
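
You can also check programmatically. As a minimal sketch, Python’s built-in urllib.robotparser module can fetch a robots.txt file and tell you whether a given bot may crawl a given URL (the domain and path here are placeholders):

import urllib.robotparser

# Point the parser at the site's robots.txt (placeholder domain)
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # fetches and parses the file

# Ask whether Googlebot may crawl a specific URL under the parsed rules
print(rp.can_fetch("Googlebot", "https://www.example.com/folder1/myfile.html"))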

How to create a robots.txt file?

If you found you didn’t have a robots.txt file or want to alter yours, creating one is a simple process. This article from Google walks through the robots.txt file creation process, and this tool allows you to test whether your file is set up correctly.
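
As a starting point, here is a simple robots.txt that combines the directives covered above (the folder name and sitemap URL are placeholders; adjust them to your own site):

User-agent: *
Disallow: /folder_name/
Disallow: /*?

Sitemap: https://www.example.com/sitemap.xml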

If you have any questions or suggestions about this post, robots.txt, or SEO, please leave a comment. If you enjoyed this post, I’d be very grateful if you’d help it spread by emailing it to a friend, or sharing it on Twitter or Facebook.

Thank you!