How is the text source determined by search engines?

Started by Adam Greer, Aug 20, 2022, 01:32 AM


Adam Greer (topic starter)

Question: Let's say we have two websites. I publish a unique article on site A, then publish the same article on site B. Suppose the article page on site B is indexed faster than the one on site A. Will the search engine treat the article on site B as unique and the one on site A as plagiarism?


The search engine will do just that. But in your scenario it no longer matters, since the articles were not originally yours, so I don't think it's worth worrying too much about. If you want search engines to index your articles faster, eliminate all errors on the website and update it more often.

Then the bots will visit more often and your articles will enter the index sooner. In general, though, I doubt there will be such demand for a specific article from a specific "dead" website that someone else publishes that very article on their own site at exactly the same moment as you. By probability theory, the chance of such an event tends to zero.


A search robot is a program that performs the search engine's core function: discovering new data sources (pages). Because it moves freely around the World Wide Web, it became known as the "spider".
The principle of the spider's operation is quite simple: landing on a web page, it looks for links to other pages and follows each of them, repeating the same actions. Along the way, the robot indexes each page (stores basic information about it in the database) and sends a copy of every page it finds to the archive.
Note that "every" here means every page that meets the search engine's criteria. Before entering the index, new pages are checked for viruses, technical errors and plagiarism; bad pages are discarded immediately. And, given how the spider works, it is clear that the more links point to a site (both internal and external count), the faster it gets into the index.
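The spider's loop described above — visit a page, extract its links, queue the unvisited ones, repeat — can be sketched in a few lines of Python. This is only an illustration of the principle, not a real crawler: the function names are invented, and a tiny in-memory "web" stands in for real HTTP fetching so the sketch runs offline.

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags, which is how a spider finds new pages."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start, fetch):
    """Breadth-first traversal: fetch a page, extract its links, queue the
    unvisited ones -- the basic cycle a search-engine spider repeats."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        order.append(url)  # a real spider would index and archive the page here
        parser = LinkExtractor()
        parser.feed(fetch(url))
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Tiny in-memory "web" instead of real HTTP requests.
site = {
    "/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "/b": '<a href="/a">A</a>',
    "/c": "",
}
print(crawl("/a", site.get))  # each page is visited exactly once
```

The `seen` set is what keeps the spider from looping forever on pages that link back to each other, and the more pages link to a given URL, the sooner it surfaces in the queue — which mirrors the point above about link count speeding up indexing.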

Information collection and indexing

In addition to qualitative checks, there are quantitative limits on adding pages to the index. A search engine has limited resources and cannot scan every site on the Internet instantly, or even in a month. Therefore each site has a "crawl budget": the number of pages a search robot can crawl per visit, and the maximum number of documents from that site it will index.
For large websites this can be the main reason index updates lag. The most effective remedy is to configure sitemap.xml, a specially formatted site map that guides the spider's work. In it you indicate which pages of the site are updated most often, which are higher priority for indexing, what the robot should pay attention to, and when each page last changed.
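A minimal sitemap.xml following the standard sitemap protocol might look like the fragment below. The URLs are placeholders; `<changefreq>` and `<priority>` are the hints about update frequency and indexing priority mentioned above, and `<lastmod>` tells the robot when a page last changed (note that search engines treat these fields as hints, not commands).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2022-08-15</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/articles/some-article</loc>
    <lastmod>2022-08-10</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
```

The file is usually placed at the site root (e.g. https://example.com/sitemap.xml) and referenced from robots.txt or submitted through the search engine's webmaster tools.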

In any case, page indexing does not happen instantly, since the search robot cannot traverse the entire web in a second. Today the process takes no more than one to three weeks, and for high-quality, useful and properly optimized websites it can take just a few days.
Working to shorten a site's indexing time is an important part of its development: the Internet gains new resources every second, and search engines cannot improve at the same pace.