How does a Web Crawler work?

The first thing you need to understand is what a Web Crawler or Spider is and how it works. A Search Engine Spider (also known as a crawler, Robot, SearchBot or simply a Bot) is a program that most search engines use to find what’s new on the Internet. Google’s web crawler is known as Googlebot.

There are many types of web spiders in use, but for now, we’re only interested in the Bot that actually “crawls” the web. This Bot collects documents to build a searchable index for the different search engines. The program starts at a website and follows every hyperlink on each page.
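The "start at a website and follow every hyperlink" idea can be sketched in a few lines of Python. This is only a minimal illustration, not how any real search engine is implemented: the `crawl` function, the `fetch` callable and the in-memory `FAKE_WEB` dictionary are all assumptions made for the example, so that it runs without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl starting at start_url.

    fetch is a hypothetical callable that returns the HTML of a URL,
    or None if the page cannot be retrieved.
    """
    seen = {start_url}          # URLs already queued, to avoid loops
    queue = deque([start_url])  # URLs waiting to be visited
    pages = {}                  # URL -> HTML of every page fetched
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# A tiny in-memory "web" standing in for real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a> <a href="/missing">gone</a>',
}

crawled = crawl("http://example.com/", FAKE_WEB.get)
print(sorted(crawled))
```

The `seen` set is what keeps the spider from visiting the same page twice when pages link back to each other, and `max_pages` is an arbitrary safety limit chosen for this sketch.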

In principle, any page that can be reached through links will eventually be found and spidered, although pages that nothing links to, or that are blocked from crawlers, may never be visited. The so-called “spider” crawls from one website to another. Search engines may run thousands of instances of their web crawling programs simultaneously, on multiple servers.

When a web crawler visits one of your pages, it fetches the page’s content and stores it. Once a page has been fetched, its text is loaded into the search engine’s index. The index is a massive database of words and of where they occur on different web pages. All of this may sound too technical for most people, but it’s important to understand the basics of how a Web Crawler works.
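A "database of words, and where they occur" is often called an inverted index. The sketch below shows the basic idea on two made-up pages; the function names `build_index` and `search` and the example URLs are assumptions for illustration, and a real search engine index stores far more than this (word positions, ranking signals and so on).

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


pages = {
    "http://example.com/spiders": "Web spiders crawl the web",
    "http://example.com/index": "The index maps words to pages",
}
index = build_index(pages)
print(sorted(search(index, "web crawl")))
```

Looking a word up in the index is a single dictionary access, which is why a search does not need to re-read every page each time.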

The 3 steps of the Google Crawler

There are basically three steps involved in the web crawling procedure. First, the search bot crawls the pages of your site. Then it indexes the words and content of those pages. Finally, it visits the links (web page addresses or URLs) that it found on your site. When the spider can no longer find a page, that page will eventually be deleted from the index. However, some spiders will check a second time to verify that the page really is offline.
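The "check again before deleting" behaviour could look roughly like the following. This is a guess at the general shape of such logic, not a description of what Google or any other engine actually does: `refresh_index`, the `misses` counter and the threshold of two failed visits are all hypothetical.

```python
def refresh_index(index, fetch, misses, max_misses=2):
    """Re-fetch every indexed URL; drop it after max_misses failures in a row.

    index maps URL -> page text, fetch is a hypothetical callable returning
    the text or None, misses maps URL -> consecutive failure count.
    """
    for url in list(index):
        text = fetch(url)
        if text is not None:
            index[url] = text   # page is still there: update it
            misses[url] = 0
            continue
        misses[url] = misses.get(url, 0) + 1
        if misses[url] >= max_misses:
            del index[url]      # page was missing on repeated visits
            del misses[url]


index = {
    "http://example.com/old": "old text",
    "http://example.com/live": "live text",
}
live_web = {"http://example.com/live": "live text, updated"}
misses = {}

refresh_index(index, live_web.get, misses)  # first miss: page is kept
first_pass = sorted(index)
refresh_index(index, live_web.get, misses)  # second miss: page is dropped
second_pass = sorted(index)
print(first_pass, second_pass)
```

Waiting for a second failed visit protects a page from being dropped because of a short outage.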

The first thing a spider is supposed to do when it visits your website is look for a file called “robots.txt”. This file contains instructions for the spider on which parts of the website it may crawl and which parts to ignore. A robots.txt file is the main way to control what a spider crawls on your site, although it is not the only one: a robots meta tag in a page, for example, can ask search engines not to index that page.
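To see how a well-behaved spider might read these instructions, here is a small sketch using Python’s standard `urllib.robotparser` module. The robots.txt content, the `/private/` directory and the bot name `ExampleBot` are invented for the example; a real crawler would download the file from the site instead of using a string.

```python
from urllib.robotparser import RobotFileParser

# Example rules: every bot may crawl everything except /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a (hypothetical) bot is allowed to fetch each URL.
allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://example.com/private/data.html")
print(allowed, blocked)
```

Note that robots.txt is only a request: the file itself does not stop a badly behaved bot from fetching a page.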

All spiders are supposed to follow some rules, and the major search engines do follow these rules for the most part. Fortunately, the major search engines like Google or Bing are finally working together on standards.

14 Comments

jack parler
September 28, 2009 at 5:33 PM

Very nice information. Thanks for this.
Potenzmittel
March 20, 2010 at 11:58 PM

Thanks for Sharing the Information, and keep Working like that!
marck_don
April 27, 2010 at 10:07 AM

very nice Information. and i think i got a good information for ur articales
Abdul munam
June 22, 2011 at 5:32 PM

This is a good information about web crawler.
shilpa
August 17, 2011 at 11:31 PM

thanx for such nice information
Karen
April 17, 2012 at 11:14 AM

Thanks!
Emily Desre
April 26, 2012 at 11:28 PM

I haven’t heard of this very wonderful bot spider on google. I was very impress on it’s capabilities on searching hyperlinks and the newest details posted. This is another thing that google surpasses my expectation. Awesome!
pravalli
January 31, 2013 at 12:03 PM

how the crawler can index the words?If it is a program…how it works…?can u plz rply me.!
adam
February 4, 2013 at 8:23 AM

Please give answer for the following question:
bot or spiders are installed on web server where web sites are hosted or they are installed on google servers from where crawl the web sites????
Roberto
February 8, 2013 at 3:24 PM

Really useful informations. Thanks!
I was trying to think to a system to prevent crawlers to vote a poll accidentally, but I think it may be enough to prevent using the “a” tag, and use just only javascript to activate commands.
Tanveer Ahmed
February 22, 2013 at 1:56 AM

The information was so easy to understand. Great Work!
chandan
March 1, 2013 at 4:30 AM

good job ,keep it up………..