Definition
A web crawler (also known as a web spider or web robot) is a program or automated script that browses the internet looking for web pages to process. Many applications, mostly search engines, crawl websites every day in order to find up-to-date data. Most web crawlers save a copy of each visited page so that it can be indexed later. Others scan the pages for a narrower purpose only, such as harvesting email addresses (for spam).
How Does It Work?
A crawler needs a starting point, which is a web address (a URL). To browse the internet, it uses the HTTP network protocol, which lets it talk to web servers and download data from them or upload data to them. The crawler fetches the starting URL and then looks for hyperlinks (the "a" tag in HTML). It then follows those links and repeats the same steps for every page it reaches. That is the basic idea. How we proceed from here depends entirely on the purpose of the software itself.
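The loop described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not a production crawler: the names LinkExtractor, fetch_http and crawl, the max_pages limit and the breadth-first visiting order are all assumptions made for the example. It also ignores robots.txt, politeness delays and non-HTML content, which a real crawler would have to handle.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_http(url):
    """Download a page over HTTP and return its body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_http, max_pages=50):
    """Breadth-first crawl from start_url; returns {url: html}.

    `fetch` is a hypothetical hook so the download step can be swapped out.
    """
    queue = deque([start_url])
    seen = {start_url}          # URLs already queued, to avoid loops
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue            # skip pages that fail to download
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop the "#fragment" part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

Calling crawl("http://example.com/") would download the start page and then every page reachable from it, up to max_pages. Passing a different fetch function makes it possible to try the loop against canned pages without touching the network.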
If we only want to grab emails, we search the text on each web page (including hyperlinks) and look for email addresses. This is the easiest type of crawler to develop. Search engines are much more difficult to develop. When building a search engine, we need to take care of a few other things:
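As a rough illustration of the email case, the sketch below collects the visible text and the link targets of a page and runs a regular expression over them. The pattern EMAIL_RE is a deliberately simple assumption and does not cover every valid address; the names TextAndLinkCollector and extract_emails are made up for this example.

```python
import re
from html.parser import HTMLParser

# Simplified pattern: an assumption for illustration, not a full address grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class TextAndLinkCollector(HTMLParser):
    """Gathers page text plus href values, so mailto: links are searched too."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.chunks.append(value)


def extract_emails(html):
    """Return the unique, lower-cased email addresses found in a page."""
    collector = TextAndLinkCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    text = " ".join(collector.chunks)
    return sorted({match.lower() for match in EMAIL_RE.findall(text)})
```

Feeding each downloaded page to extract_emails and merging the results is all the per-page processing such a crawler needs, which is why it is the simplest kind to write.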
- Size - Some websites are very large and contain many directories and files. Harvesting all of the data may take a lot of time.
- Change Frequency - A website may change very often, even a few times a day. Pages can be added and deleted every day. We need to decide when to revisit each site and each page within a site.
- HTML Processing - How do we process the HTML output? If we build a search engine, we want to understand the text rather than just treat it as plain text. We must tell the difference between a caption and an ordinary sentence. We must look for bold or italic text, font colors, font sizes, paragraphs and tables. This means we must know HTML very well, and we need to parse it first. A tool that helps with this task is an "HTML to XML converter."
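To make the last point more concrete, here is a small sketch that parses a page and records which emphasis tags surround each piece of text. It uses Python's built-in html.parser instead of an HTML to XML converter, so it is a stand-in for the approach mentioned above rather than the same tool. The tag set EMPHASIS_TAGS and the class name TaggedTextParser are assumptions chosen for the example, and font colors and sizes are not handled.

```python
from html.parser import HTMLParser

# Tags treated as "more important than plain text" (an illustrative choice).
EMPHASIS_TAGS = {"title", "h1", "h2", "h3", "caption", "b", "strong", "i", "em"}


class TaggedTextParser(HTMLParser):
    """Splits a page into (text, enclosing emphasis tags) segments."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # emphasis tags currently open, outermost first
        self.segments = []    # list of (text, tuple_of_tags)

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag; tolerate sloppy markup.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            self.segments.append((text, tuple(self.open_tags)))


def tag_text(html):
    """Return the list of (text, tags) segments for one page."""
    parser = TaggedTextParser()
    parser.feed(html)
    parser.close()
    return parser.segments
```

An indexer could then give a segment tagged ("h1",) or ("b",) more weight than one with an empty tag tuple. How much weight to give each tag is a design decision this sketch leaves open.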
History
- 28 November, 2006: Article posted