Thursday, June 21, 2012


The basics: Crawler

The Web consists of many billions of pages. Each page has a unique URL, content (text, pictures, video) and links that connect it to other pages. One page links to another, which links to another, and so on. This setup creates a huge “web” of interconnected pages. A web crawler is a computer program that gathers and categorizes information on the Internet.
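As a rough sketch of that idea (not any real engine’s data model), you can picture each page as a small record holding its URL, its text and the links it points to. The `Page` class and the example.com addresses below are purely illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    url: str                                    # the page's unique address
    content: str = ""                           # text only, ignoring images and video
    links: list = field(default_factory=list)   # URLs this page points to

# Two pages linking to each other already form a tiny "web"
home = Page("https://example.com/", "Welcome!", ["https://example.com/about"])
about = Page("https://example.com/about", "About us", ["https://example.com/"])
```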

Crawling – Indexing – Retrieving – Ranking


Crawling is done by the search engines’ automated robots, commonly referred to as “spiders”, and is one of the main functions of a search engine. The spiders “read” one page and then follow any links from that page to other pages. Through these links the spiders can reach billions of interconnected documents.
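A toy version of that “follow the links” behaviour looks roughly like the breadth-first loop below. This is only a sketch: real spiders also respect robots.txt, throttle their requests, parse HTML properly and run across thousands of machines, none of which is shown here.

```python
import re
import urllib.request
from collections import deque

def crawl(seed_url, max_pages=10):
    """Toy spider: fetch a page, then follow its links, breadth-first."""
    seen = {seed_url}
    queue = deque([seed_url])
    pages = {}                      # url -> raw HTML we managed to fetch
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                # skip pages that fail to load
        pages[url] = html
        # Naive link extraction with a regex; real crawlers use an HTML parser
        for link in re.findall(r'href="(https?://[^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```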

Indexing is the process by which search engines select pieces of relevant code (including keywords and surrounding text) from each web page and catalogue them. They store that code and related information in massive data centers located all around the world. This is no small task.
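At its simplest, the catalogue is an inverted index: a map from each keyword to the pages that contain it. The sketch below builds one from the `pages` dictionary returned by the crawler sketch above; real indexes store far more (positions, anchor text, metadata) and are sharded across data centers.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each keyword to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index
```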

Retrieving comes into play when a search engine user types in a keyword or a string of keywords. The search engine retrieves all of the stored URLs that are relevant to the keywords and returns this information to the user.
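With an inverted index in hand, a very simplified retrieval step is just a set lookup: find the pages that contain every word in the query. Real engines do far more (stemming, synonyms, phrase matching), but the idea is the same:

```python
def retrieve(index, query):
    """Return URLs containing every keyword in the query (AND semantics)."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results
```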

Ranking of web pages is essential for satisfying the user’s query. Search engines rank each web page they find according to signals like trust factors and PageRank, and even go as far as considering the user’s search history and geographic location. Hundreds of factors are weighed by the engine in concert with one another.
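No one outside the search engines knows the exact formula, but the general shape is a weighted blend of many signals per page. The factor names and weights below are made up purely to illustrate the idea of combining signals into a single score:

```python
def rank(results, scores):
    """Order retrieved URLs by a weighted blend of illustrative signals.
    `scores` maps url -> dict of factor values in [0, 1]."""
    weights = {"trust": 0.4, "link_score": 0.4, "freshness": 0.2}  # invented weights
    def total(url):
        s = scores.get(url, {})
        return sum(weights[f] * s.get(f, 0.0) for f in weights)
    return sorted(results, key=total, reverse=True)
```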
