The basics: Crawler
The Web consists of many billions of pages. Each page has a unique URL, content (text, pictures, video) and links that connect it to other pages. One page links to another, which links to another, and so on. This setup creates a huge “web” of interconnected pages. A web crawler is a computer program that gathers and categorizes information on the Internet by following those links.
Crawling – Indexing – Retrieving – Ranking
Crawling is one of the main functions of a search engine and is carried out by its automated robots, commonly referred to as “spiders”. A spider “reads” one page and then follows any links from that page to other pages. By following links, spiders can reach billions of interconnected documents.
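As a rough illustration of the idea, and not of any real engine’s spider, the Python sketch below fetches pages with the standard library, pulls the href links out of the HTML, and follows them breadth-first up to an assumed page limit. A production crawler would also honour robots.txt, throttle its requests and deduplicate content.

# Minimal spider sketch (Python standard library only).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href values of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    seen, queue, pages = {seed_url}, deque([seed_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="replace")
        except OSError:
            continue                      # unreachable page: skip it
        pages[url] = html                 # keep the raw text for indexing later
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:         # follow links to other pages
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages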
Indexing is the process by which search engines select pieces of relevant code (including keywords and the surrounding text) from each web page and catalogue them. That code and related information are stored, in an organized form, in massive data centers located all around the world. This is no small task.
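One common way to organize such a catalogue, shown here only as a deliberately simplified sketch, is an inverted index that maps each keyword to the set of URLs it appears on. The pages argument is assumed to be the {url: text} dictionary produced by the crawl() sketch above.

# Inverted index sketch: keyword -> set of URLs containing that keyword.
import re
from collections import defaultdict

def build_index(pages):
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)      # record that this word appears on this page
    return index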
Retrieving comes into play when a search engine user types in a keyword or a string of keywords. The search engine retrieves all of the stored URLs that are relevant to those keywords and returns this information to the user.
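Continuing the same toy example, a retrieval step over that inverted index might look like the sketch below: the query is split into keywords and every stored URL containing all of them is returned.

# Retrieval sketch: return the URLs that match every keyword in the query.
def retrieve(index, query):
    keywords = query.lower().split()
    if not keywords:
        return set()
    results = index.get(keywords[0], set()).copy()
    for word in keywords[1:]:
        results &= index.get(word, set())   # keep only URLs matching every keyword
    return results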
Ranking of web pages is essential for satisfying the user’s query. Search engines rank each page they find according to signals such as trust factors and PageRank, and even go as far as considering the user’s search history and geographic location. Hundreds of factors are weighted and considered by the engine in concert with one another.
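A hedged sketch of that last step, using entirely made-up signal names and weights rather than any real engine’s formula, combines a few per-page scores into a single weighted value and sorts the retrieved URLs by it, for example rank(retrieve(index, "web crawler"), signals).

# Ranking sketch: combine assumed signals into one weighted score per page.
def rank(results, signals):
    # signals: {url: {"trust": 0..1, "link_score": 0..1, "geo_match": 0..1}}
    weights = {"trust": 0.5, "link_score": 0.3, "geo_match": 0.2}  # assumed weights
    def score(url):
        page = signals.get(url, {})
        return sum(weights[name] * page.get(name, 0.0) for name in weights)
    return sorted(results, key=score, reverse=True)   # best-scoring pages first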