The first thing you need to understand is what a web crawler, or spider, is and how it works. A search engine spider (also known as a crawler, robot, search bot, or simply a bot) is a program that most search engines use to find what’s new on the Internet. Google’s web crawler is known as Googlebot. There are many types of web spiders in use, but for now we’re only interested in the bot that actually “crawls” the web and collects documents to build a searchable index for the different search engines. The program starts at a website and follows the hyperlinks it finds on each page.
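To make the idea of "following hyperlinks" concrete, here is a minimal sketch of how a crawler might pull the links out of one page. It uses only Python's standard library. The names `LinkExtractor`, `extract_links`, and `sample_html`, along with the example.com addresses, are made up for illustration and are not taken from any real search engine.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own address.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the list of links found in one page of HTML."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content, used only for illustration.
sample_html = (
    '<p><a href="/about.html">About</a> '
    '<a href="https://example.org/">Other site</a></p>'
)
print(extract_links("https://example.com/index.html", sample_html))
# ['https://example.com/about.html', 'https://example.org/']
```

A real spider would then fetch each of those addresses in turn and repeat the process on every new page.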
So we can say that most publicly linked pages on the web will eventually be found and spidered, as the so-called “spider” crawls from one website to another. Search engines may run thousands of instances of their web crawling programs simultaneously, on multiple servers. When a web crawler visits one of your pages, it loads the page’s content into a database. Once a page has been fetched, its text is loaded into the search engine’s index, which is a massive database of words and where they occur on different web pages. All of this may sound too technical for most people, but it’s important to understand the basics of how a web crawler works.
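The "database of words and where they occur" is often called an inverted index. The sketch below shows the idea on a tiny scale, assuming the pages have already been fetched as plain text. The function `build_index` and the sample pages are hypothetical, and a real search engine index stores far more than this (word positions, rankings, and so on).

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page URLs where it occurs."""
    index = defaultdict(set)
    for url, text in pages.items():
        # Lowercase the text and split it into simple alphanumeric words.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


# Hypothetical pages, used only for illustration.
pages = {
    "https://example.com/a": "Spiders crawl the web",
    "https://example.com/b": "The web is large",
}
index = build_index(pages)
print(sorted(index["web"]))
# ['https://example.com/a', 'https://example.com/b']
```

Looking up a word then becomes a quick dictionary access instead of a scan through every page.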
So there are basically three steps involved in the web crawling procedure. First, the search bot starts by crawling the pages of your site. Then it indexes the words and content of the site, and finally it visits the links (web page addresses, or URLs) found on your site. When the spider can’t find a page, that page will eventually be deleted from the index. However, some spiders will check a second time to verify that the page really is offline.
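The three steps can be sketched as a simple loop. To keep the example runnable without a network connection, the "site" here is a hypothetical in-memory dictionary (`SITE`) rather than real web pages, and `crawl` is an illustrative name. Note one simplification: this sketch drops a missing page from the index right away, whereas a real spider may check again before removing it.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (page text, outgoing links).
SITE = {
    "/": ("welcome home", ["/about", "/gone"]),
    "/about": ("about this site", ["/"]),
}


def crawl(start, site, index=None):
    """Fetch each page, index its words, then queue its links."""
    index = {} if index is None else index
    queue, seen = deque([start]), {start}
    while queue:
        url = queue.popleft()
        page = site.get(url)            # step 1: fetch the page
        if page is None:
            index.pop(url, None)        # page not found: drop it from the index
            continue
        text, links = page
        index[url] = set(text.split())  # step 2: index the words
        for link in links:              # step 3: follow the links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


# "/gone" was indexed on an earlier visit but no longer exists.
index = crawl("/", SITE, index={"/gone": {"stale"}})
print(sorted(index))
# ['/', '/about']
```

The `seen` set is what stops the spider from going around in circles when two pages link to each other.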
The first thing a spider is supposed to do when it visits your website is look for a file called “robots.txt”. This file contains instructions for the spider on which parts of the website it may crawl and which parts to ignore. A robots.txt file is the standard way to control what a spider looks at on your site, although it is a request rather than a lock: it only works for spiders that choose to obey it. All spiders are supposed to follow these rules, and the major search engines do follow them for the most part. Fortunately, major search engines such as Google and Bing are finally working together on standards.
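Here is a small sketch of how a well-behaved spider might consult those instructions, using the `urllib.robotparser` module from Python's standard library. The robots.txt content, the bot name `ExampleBot`, and the example.com addresses are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# Parse the rules from text instead of downloading them from a site.
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a given bot is allowed to fetch a given address.
print(parser.can_fetch("ExampleBot", "https://example.com/index.html"))
# True
print(parser.can_fetch("ExampleBot", "https://example.com/private/notes.html"))
# False
```

Against a live site, a spider would typically point the parser at the site's robots.txt address with `set_url()` and call `read()` before fetching anything else.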