Web Crawling: Web crawling is the process by which the web spiders of different search engines index websites across the World Wide Web (WWW). Most search engines limit their searches and focus mainly on the content of web pages. Before a search engine can tell you where a file or document is kept (in the form of a URL), that page must first exist in its database. To gather this information, search engines employ special software robots called spiders to build lists of the words and keywords found on websites. When a spider builds its lists, the process is called web crawling. In order to build and maintain a useful list of words, a search engine spider needs to look at a great many pages. All of this information is generally stored in a database. When someone searches for a particular keyword, the search engine looks up that keyword in its database.
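The keyword lookup described above is commonly implemented as an inverted index: a mapping from each word to the pages that contain it, so a search never has to rescan the pages themselves. A minimal sketch in Python (the page URLs and contents here are made-up examples, not real crawled data):

```python
# A minimal inverted index: maps each word to the set of URLs
# where it appears, so keyword lookups avoid rescanning pages.
from collections import defaultdict

def build_index(pages):
    """pages: dict mapping URL -> page text."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, keyword):
    """Return the URLs whose text contains the keyword."""
    return sorted(index.get(keyword.lower(), set()))

# Hypothetical crawled pages used only for illustration.
pages = {
    "http://example.com/a": "web crawling builds lists of words",
    "http://example.com/b": "search engines index words in a database",
}
index = build_index(pages)
print(search(index, "words"))      # both pages contain "words"
print(search(index, "crawling"))   # only the first page does
```

Real engines store this structure on disk at enormous scale, but the lookup idea is the same: the query word is a key, and the stored URL list is the answer.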
How does the spider start its journey across the web? Good question. Generally the spider's starting points are heavily used servers that host a large number of websites. The spider begins with a popular site, indexing every word and keyword on its pages and following every link it finds.
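The crawl loop itself can be sketched as a queue of URLs: the spider starts from a seed page, records the words it finds, and pushes every link onto the queue. The sketch below stands in for real HTTP fetching with an in-memory dict of hypothetical pages, so it runs without network access:

```python
from collections import deque
from html.parser import HTMLParser

class PageParser(HTMLParser):
    """Collects the href of every <a> tag and all visible words."""
    def __init__(self):
        super().__init__()
        self.links, self.words = [], []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)
    def handle_data(self, data):
        self.words.extend(data.lower().split())

def crawl(seed, fetch):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    seen, queue, word_lists = set(), deque([seed]), {}
    while queue:
        url = queue.popleft()
        if url in seen:          # never revisit a page
            continue
        seen.add(url)
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        word_lists[url] = parser.words   # the spider's word list
        queue.extend(parser.links)       # follow every link found
    return word_lists

# Hypothetical in-memory "web" standing in for real HTTP fetches.
web = {
    "http://popular.example/": '<a href="http://popular.example/about">about</a> welcome page',
    "http://popular.example/about": "hosted on a heavily used server",
}
result = crawl("http://popular.example/", web.get)
```

A production spider would add politeness delays, robots.txt handling and URL normalization, but the skeleton (seed, queue, visited set, word lists) is the crawling process described above.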
Most Popular Search Engines

Google, one of the oldest and most popular search engines around, began as an academic search engine. When the Google spider (bot) looked at an HTML page, it took note of the following things:
(1) The words within the main content.
(2) Where the words were found – their position and relative importance on the page.
Words appearing in important HTML tags (such as title, meta, H1 and H2) and in other positions of relative importance are indexed for special consideration during a subsequent search through Google's interface. The approach Google used while indexing a web page was to index every word on the page, leaving out the articles "a", "an" and "the". Other spiders take different approaches. Each spider is designed so that it indexes quickly, lets users search more efficiently, or both. Some spiders keep track of the words in the title, the subheadings and the links, along with the 100 most frequently used words on the page (keyword density); Lycos used this approach while indexing the web. Other systems, such as AltaVista, go in the opposite direction and index every word on the page, including the articles. In the next article, which will be a continuation of this one, I will discuss some very important aspects of a web page that popular search engines like Google, Yahoo, MSN and AltaVista take very seriously: meta tags, building the index, building a search, future search and more.
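The differences between these engines come down to which words a spider keeps and how much weight it gives them. The sketch below combines the ideas above: it skips the articles, scores words higher when they appear in the title or headings, and reports the most frequent words in the style of a keyword-density count. The tag weights and stop-word list are illustrative assumptions, not any engine's actual values:

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the"}  # the articles Google's early indexer skipped
# Illustrative weights only: title words count most, body words least.
TAG_WEIGHTS = {"title": 5, "h1": 3, "h2": 2, "body": 1}

def score_words(tagged_text):
    """tagged_text: list of (tag, text) pairs extracted from a page."""
    scores = Counter()
    for tag, text in tagged_text:
        weight = TAG_WEIGHTS.get(tag, 1)
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS:
                scores[word] += weight
    return scores

# A hypothetical page broken down by tag.
page = [
    ("title", "Web Crawling Basics"),
    ("h1", "How a spider indexes the web"),
    ("body", "The spider follows every link on a page and records the words."),
]
scores = score_words(page)
top = [w for w, _ in scores.most_common(3)]  # keyword-density style top words
```

Here "web" scores highest because it appears in both the title and the H1, while "the" and "a" never enter the index at all, which mirrors the tag-weighting and stop-word behaviour described above.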