robots.txt - why is it needed and how do you create it?

Started by arthyk, Aug 24, 2022, 05:03 AM


arthyk (Topic starter)

I've learned that for a site to interact properly with search engines, it needs a robots.txt file. Supposedly it is meant for communicating with search robots and contains instructions for them. Can you briefly explain how to create it, and does it have to be done manually?

Harry_99

This is perhaps the most important file for search engines. If traffic to your site has suddenly dropped, robots.txt is the first thing to check. The file must meet several requirements.

1. The file must be written in UTF-8 encoding; other encodings may be misread or not recognized by search engines.
2. The file must live only in the root directory of the site, e.g. https://site.com/robots.txt.
And most importantly: without robots.txt, information that should stay hidden from prying eyes can end up in public search results, and both you and your site will suffer for it. A quick way to check the two requirements above is sketched below.
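A minimal Python sketch for checking both points, assuming your file should sit at https://site.com/robots.txt (site.com is a placeholder for your own domain):

import urllib.error
import urllib.request

URL = "https://site.com/robots.txt"  # placeholder: replace site.com with your domain

# Requirement 2: the file must answer at the site root.
try:
    with urllib.request.urlopen(URL) as response:
        raw = response.read()
except urllib.error.HTTPError as err:
    raise SystemExit(f"robots.txt is not reachable at the root: HTTP {err.code}")

# Requirement 1: the file must decode cleanly as UTF-8.
try:
    text = raw.decode("utf-8")
except UnicodeDecodeError:
    raise SystemExit("robots.txt is not valid UTF-8")

print("robots.txt is in place; first line:", text.splitlines()[0] if text else "(empty file)")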

Observe digital hygiene!

xerbotdev

The robots.txt file, also known as the robots exclusion standard, is a text file that stores instructions for search engine robots.
Before a website appears in search results and takes its place there, it is examined by these robots. They are the ones that pass information to the search engines, after which your resource shows up in search.
Robots.txt performs an important function: it can keep the entire site, or some of its sections, out of the index. This is especially relevant for online stores and other resources that take online payments. You don't want your customers' payment details to suddenly become known all over the Internet, do you? That is what robots.txt is for.

By default, search robots crawl every link they find unless they are restricted. To restrict them, robots.txt contains specific commands, or instructions for action. Such instructions are called directives.
The main opening directive, with which the file begins, is User-agent.
It may look like this:
User-agent: Bingbot
Or so:
User-agent: *

Or like this:
User-agent: Googlebot

User-agent addresses a specific robot, and the rules that follow apply only to it.
Thus, in the first case the instructions apply only to Bing's robot, in the second to the robots of all search engines, and in the third only to Google's main crawler.

It is reasonable to ask: why address robots separately? The point is that different search "messengers" approach indexing differently. For example, both Google and Bing understand the Sitemap directive (described below), while the Clean-param directive, which lets you exclude duplicate pages created by URL parameters, is recognized only by Yandex.
However, if you have a simple website with straightforward sections, we recommend not making exceptions and addressing all robots at once with the * symbol.
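To see how a robot picks the group of rules addressed to it, here is a small sketch using Python's standard urllib.robotparser; the rules and URLs are made up for illustration:

from urllib import robotparser

# A made-up rule set: Googlebot gets its own group, everyone else falls under *.
rules = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow:
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "https://site.com/drafts/post1"))  # False: its own group blocks /drafts/
print(rp.can_fetch("Bingbot", "https://site.com/drafts/post1"))    # True: the * group allows everything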
The second most important directive is Disallow
It forbids robots from crawling certain pages. As a rule, administrative files, duplicate pages, and confidential data are closed off with Disallow.

In our opinion, any personal or corporate information should be protected more strictly, that is, placed behind authentication. Nevertheless, as an extra precaution, we also recommend disallowing indexing of such pages in robots.txt.

The directive may look like this:
User-agent: *
Disallow: /wp-admin/

Or so:
User-agent: Googlebot
Disallow: */index.php
Disallow: */section.php

In the first example we closed the site's admin panel from indexing, and in the second we prohibited robots from crawling the index.php and section.php pages. The * sign means "any sequence of characters" to robots, while / marks the path from the site root.
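A quick local check of such rules is possible with the same urllib.robotparser. Note that the standard-library parser does not implement the * wildcard inside paths, so patterns like */index.php are better verified in the search engines' own testing tools; the sketch below only covers the simple /wp-admin/ case:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /wp-admin/
""".splitlines())

# Everything under /wp-admin/ is closed; the rest of the site stays open.
print(rp.can_fetch("Googlebot", "https://site.com/wp-admin/options.php"))  # False
print(rp.can_fetch("Googlebot", "https://site.com/blog/first-post"))       # True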
The next directive is Allow
In contrast to the previous one, this command explicitly allows content to be indexed.
It may seem strange: why allow something if the search robot is ready to crawl everything by default? It turns out this is needed for selective access. For example, suppose you want to block a section of the site named /korsobka/.

Then the command will look like this:
User-agent: *
Disallow: /korsobka/

But at the same time, that section contains a bag and an umbrella that you would not mind showing to users.
Then:
User-agent: *
Disallow: /korsobka/
Allow: /korsobka/sumka/
Allow: /korsobka/zont/

Thus, you have closed the korsobka section as a whole but opened access to the pages with the bag and the umbrella.
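Search engines resolve such Allow/Disallow conflicts by taking the most specific (longest) matching rule, with Allow winning an exact tie, whereas Python's standard parser simply applies rules in file order. The hand-rolled helper below is a simplified sketch of that longest-match logic (it ignores wildcards, and the function name and test paths are made up for illustration):

# Simplified longest-match check (no * wildcard or $ anchor support).
def is_allowed(path, rules):
    best_directive, best_pattern = "Allow", ""  # no matching rule at all means the path is allowed
    for directive, pattern in rules:
        if path.startswith(pattern) and len(pattern) >= len(best_pattern):
            # A longer match wins; on an exact tie, Allow (the less strict rule) wins.
            if len(pattern) > len(best_pattern) or directive == "Allow":
                best_directive, best_pattern = directive, pattern
    return best_directive == "Allow"

rules = [
    ("Disallow", "/korsobka/"),
    ("Allow", "/korsobka/sumka/"),
    ("Allow", "/korsobka/zont/"),
]

print(is_allowed("/korsobka/other-item/", rules))  # False: only the Disallow rule matches
print(is_allowed("/korsobka/sumka/", rules))       # True: the longer Allow rule wins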
Sitemap is another important directive. The name suggests that this instruction has something to do with the site map, and that is exactly right.

If you want search robots to reach certain sections first when crawling your site, place your sitemap file in the root directory of the site. Unlike robots.txt, this file is stored in XML format.
If you imagine that a search robot is a tourist who has come to your city (that is, your website), it is logical to assume he will need a map. With it, he will find his way around better and know which places to visit (that is, index) first.
The Sitemap directive serves as a signpost for the robot: the map is over there. After that, it will easily find its way around your site.
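If a Sitemap line is present, it can also be read out programmatically; a small sketch with urllib.robotparser (site_maps() exists from Python 3.8 on, and the URL is a placeholder):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /wp-admin/
Sitemap: https://site.com/sitemap.xml
""".splitlines())

# Returns the listed sitemap URLs, or None if the directive is missing (Python 3.8+).
print(rp.site_maps())  # ['https://site.com/sitemap.xml']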
How to create and verify robots.txt
The robots exclusion file is usually created in a plain text editor (for example, Notepad). The file is named robots and saved in .txt format.
Next, it must be placed in the root directory of the website. If everything is done correctly, it will be available at "your site address"/robots.txt.
The search engines' own help pages will help you write the directives and figure everything out.
Use whichever you prefer: Bing's or Google's. With their help, even an inexperienced user can grasp the basics in about an hour.

When the file is ready, it should definitely be checked for errors. For this, the major search engines provide special webmaster tools.
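Alongside those tools, a quick spot check of the live file can be scripted with urllib.robotparser; the domain and paths below are placeholders:

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://site.com/robots.txt")  # placeholder: your own domain
rp.read()

# Spot-check a few URLs that should stay open and a few that should stay closed.
for url in ("https://site.com/", "https://site.com/wp-admin/"):
    verdict = "allowed" if rp.can_fetch("Googlebot", url) else "blocked"
    print(url, "->", verdict)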