How search engine spiders work

13 September 2005

Search engines gather their information by using 'spiders', which are robots that crawl the Internet identifying websites. They collect information from web pages, which they then analyse and index. The resulting 'directory' of content is what the search engine scans when a user types in a query.

Spiders regularly return to web pages they have crawled before to look for changes. You can also submit your website for the spiders to crawl so that they can index your content and include it in their databases.

What is a 'spider'?

A spider is a piece of software which allows a search engine to travel to individual websites and take the information it needs to build a directory for use by people searching the web.

Spiders work in different ways and at different speeds, so a basic knowledge of their behaviour is useful for businesses trying to design effective and popular websites.

What do spiders do?

Spiders start from a central point, such as a popular homepage, and then follow the hyperlinks to other pages on the same website or travel out to other websites. The biggest search engines have many spiders all working at the same time, looking at hundreds of sites every second.

When it arrives at a web page, the spider reads and stores the information on it, starting from the top and moving downwards.
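
As a rough illustration of the crawl described above, here is a minimal Python sketch. It is not how any particular search engine works: the `PAGES` dictionary is a hypothetical in-memory stand-in for the web (a real spider would fetch pages over HTTP), and `crawl`, `LinkExtractor` and `max_depth` are names invented for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags in the order they appear on the page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical in-memory "web": URL -> HTML (assumption for this sketch).
PAGES = {
    "http://example.com/": '<a href="/about">About</a> <a href="/products/">Products</a>',
    "http://example.com/about": '<a href="/">Home</a>',
    "http://example.com/products/": '<a href="/products/widgets/blue">Blue widget</a>',
    "http://example.com/products/widgets/blue": "<p>A blue widget.</p>",
}


def crawl(start_url, pages, max_depth=2):
    """Breadth-first crawl from start_url, following links at most max_depth hops."""
    index = {}                      # URL -> stored page content
    queue = deque([(start_url, 0)])
    seen = {start_url}              # avoids visiting the same page twice
    while queue:
        url, depth = queue.popleft()
        html = pages.get(url)
        if html is None:            # page does not exist in our toy web
            continue
        index[url] = html           # "read and store" the page
        if depth == max_depth:      # do not follow links any deeper
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)   # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return index
```

Calling `crawl("http://example.com/", PAGES, max_depth=1)` stores the homepage and the two pages it links to, but not the page two links away; raising `max_depth` to 2 picks that one up as well.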

What happens next depends on how the search engine treats the information. Some ignore all the HTML code and concentrate on the text; others, such as Alta Vista, take into account meta information such as keywords (for more about meta information, see this guide to meta tags).
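
To make the difference concrete, the following Python sketch separates a page's visible text from its 'keywords' meta tag. It is only an illustration under stated assumptions: `PageReader`, `read_page` and the `SAMPLE` page are invented for this example, and real search engines decide for themselves how much weight, if any, to give each part.

```python
from html.parser import HTMLParser


class PageReader(HTMLParser):
    """Separates visible text from the 'keywords' meta tag."""

    def __init__(self):
        super().__init__()
        self.text_parts = []   # visible text, in document order
        self.keywords = []     # entries from <meta name="keywords">
        self._skip = 0         # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "meta":
            attrs = dict(attrs)
            if (attrs.get("name") or "").lower() == "keywords":
                content = attrs.get("content") or ""
                self.keywords = [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep text only when it is not script or style content.
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())


# Hypothetical sample page (assumption for this sketch).
SAMPLE = """<html><head>
<meta name="keywords" content="widgets, blue widgets, gadgets">
<style>p { color: blue; }</style>
</head><body>
<h1>Blue widgets</h1>
<p>We sell <b>blue</b> widgets.</p>
</body></html>"""


def read_page(html):
    """Returns (visible_text, keywords) for one HTML page."""
    reader = PageReader()
    reader.feed(html)
    reader.close()
    return " ".join(reader.text_parts), reader.keywords
```

A text-only engine would work from the first value returned by `read_page(SAMPLE)`, while one that takes meta information into account could also use the second.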

Website owners wanting to publicise a new site can encourage the spiders to visit by submitting their pages to the search engines. For example, see Google's submission page or Alta Vista's page for adding a new site.

Tips for improving your site for spiders

  • Spiders vary in how deeply they burrow into a site. For example, Google's spiders can index a homepage very soon after it's been submitted - sometimes even the same day. But if your site directory includes a lot of deep links (for example www.site.com/sub-directory/sub-directory/page), the spiders will take longer to reach those pages and may never index some of them. So it makes sense to keep your site directory reasonably shallow.
  • Frame pages, especially where these are the homepage of a website, can create obstacles for some spiders. A spider arrives on the page, takes one look, and decides there is nothing else there. You need to ensure spiders can find the other pages on your site, either by submitting a different page or by making sure your frames have outgoing links to the rest of the site.
  • Sometimes you may prefer that spiders do not follow certain links, perhaps because the pages they lead to are not ready to be indexed. In that case, you can use the 'nofollow' value in a robots meta tag. Spiders can be given further instructions using a 'robots.txt' file; you can find more details about this at robotstxt.org.
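
As an illustration of the last tip, this Python sketch uses the standard library's `urllib.robotparser` module to check whether a well-behaved spider would be allowed to fetch a URL. The robots file contents, the `/drafts/` directory, the `ExampleSpider` user agent and the `allowed` helper are all hypothetical; whether a given spider honours these rules is up to that spider.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots file: asks every spider to stay out of /drafts/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
"""


def allowed(robots_txt, user_agent, url):
    """Returns True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())   # parse rules from text, no network access
    return parser.can_fetch(user_agent, url)
```

With these rules, `allowed(ROBOTS_TXT, "ExampleSpider", "http://www.site.com/drafts/new-page.html")` is `False`, while the same call for `http://www.site.com/index.html` is `True`.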

Finally, a word of warning: spider behaviour can change from one month to the next. You should follow developments by monitoring sites such as SearchEngineWatch, and be flexible enough to adapt to new circumstances.
