
A Simple Crawler Using C# Sockets

19 Mar 2006
A multi-threaded simple crawler with C# sockets to solve the WebRequest.GetResponse() locking problem.

Sample Image - Crawler.jpg


Introduction

A web crawler (also known as a web spider or ant) is a program that browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. Crawlers can also be used to automate maintenance tasks on a web site, such as checking links or validating HTML code. Finally, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).

Crawler Overview

In this article, I will introduce a simple Web crawler with a simple interface, to describe the crawling story in a simple C# program. My crawler uses the familiar input interface of an Internet browser to keep the process simple: the user just enters the URL to be crawled in the navigation bar and clicks "Go".

Web Crawler Architecture from Wikipedia, the free encyclopedia

The crawler has a URL queue that is equivalent to the URL server in any large scale search engine. The crawler works with multiple threads to fetch URLs from the crawler queue. Then the retrieved pages are saved in a storage area as shown in the figure.

The fetched URLs are requested from the Web using C# sockets directly, to avoid the locking behavior of the standard C# networking classes. The retrieved pages are parsed to extract new URL references, which are put back into the crawler queue, up to a certain depth defined in the Settings.
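The fetch-parse-enqueue cycle described above can be sketched roughly as follows. This is a minimal sketch only: DownloadPage, SavePage, ExtractLinks, and the settings fields are illustrative names, not the actual methods in the attached source; MyUri, EnqueueUri, and DequeueUri are shown later in the article.

```csharp
// Illustrative worker-thread loop: dequeue a URI, download it, save it,
// then parse out new links and enqueue them (names are assumptions).
void ThreadRunFunction()
{
    while (true)
    {
        MyUri uri = DequeueUri();          // returns null if the queue is empty
        if (uri == null)
        {
            Thread.Sleep(SleepWhenQueueEmpty);  // settings value
            continue;
        }
        string page = DownloadPage(uri);   // socket-based HTTP fetch
        SavePage(uri, page);               // storage area (download folder)
        foreach (MyUri link in ExtractLinks(page, uri))
            EnqueueUri(link);              // appended at the tail of the queue
    }
}
```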

In the next sections, I will describe the program views, and discuss some technical points related to the interface.

Crawler Views

My simple crawler contains three views that let you follow the crawling process, check request details, and review crawling errors.

Threads view

Threads view is just a window that displays the activity of all worker threads. Each thread takes a URI from the URI queue and starts connection processing to download the URI object, as shown in the figure.

Threads tab view.

Requests view

Requests view displays a list of the recent requests downloaded in the threads view, as in the following figure:

Requests tab view

This view enables you to watch each request header, like:

GET / HTTP/1.0
Host: www.cnn.com
Connection: Keep-Alive

You can watch each response header, like:

HTTP/1.0 200 OK
Date: Sun, 19 Mar 2006 19:39:05 GMT
Content-Length: 65730
Content-Type: text/html
Expires: Sun, 19 Mar 2006 19:40:05 GMT
Cache-Control: max-age=60, private
Connection: keep-alive
Proxy-Connection: keep-alive
Server: Apache
Last-Modified: Sun, 19 Mar 2006 19:38:58 GMT
Vary: Accept-Encoding,User-Agent
Via: 1.1 webcache (NetCache NetApp/6.0.1P3)

The view also lists the URLs found in the downloaded page:

Parsing page ...
Found: 356 ref(s)
http://www.cnn.com/
http://www.cnn.com/search/
http://www.cnn.com/linkto/intl.html

Crawler Settings

Crawler settings are not complicated; they are options selected from many working crawlers in the market, including settings such as supported MIME types, the download folder, the number of worker threads, and so on.

MIME types

MIME types are the content types the crawler is allowed to download, and the crawler ships with a default set. The user can add, edit, and delete MIME types, or simply allow all MIME types, as in the following figure:

Files Matches Settings
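As a rough sketch of how such a filter can be applied (the method and parameter names below are assumptions for illustration, not names from the attached source), the response's Content-Type header is checked before the body is stored:

```csharp
using System.Collections.Generic;

// Illustrative MIME filter: accept a response only if its Content-Type
// matches one of the configured types (or everything is allowed).
static bool IsAllowedMimeType(string contentType, bool allowAll,
                              List<string> allowedTypes)
{
    if (allowAll)
        return true;
    // strip parameters: "text/html; charset=utf-8" -> "text/html"
    int semicolon = contentType.IndexOf(';');
    if (semicolon >= 0)
        contentType = contentType.Substring(0, semicolon);
    return allowedTypes.Contains(contentType.Trim().ToLower());
}
```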

Output

Output settings include the download folder, and the number of requests to keep in the requests view for reviewing request details.

Output Settings

Connections

Connection settings contain:

  • Thread count: the number of concurrent worker threads in the crawler.
  • Thread sleep time when the refs queue is empty: how long each thread sleeps when the refs queue is empty.
  • Thread sleep time between two connections: how long each thread sleeps after handling a request; this value is very important to prevent hosts from blocking the crawler due to heavy load.
  • Connection timeout: the send and receive timeout for all crawler sockets.
  • Navigate through pages to a depth of: the depth of navigation in the crawling process.
  • Keep same URL server: limits the crawling process to the same host as the original URL.
  • Keep connection alive: keeps the socket connection open for subsequent requests, to avoid reconnection time.

Connections Settings
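For example, the connection timeout can be applied to both directions of a crawler socket with SetSocketOption. This is a sketch: connectionTimeout stands in for the value read from the Settings tab.

```csharp
using System.Net.Sockets;

// Apply the "Connection timeout" setting (in milliseconds) to both the
// send and receive directions of a crawler socket.
Socket socket = new Socket(AddressFamily.InterNetwork,
                           SocketType.Stream, ProtocolType.Tcp);
int connectionTimeout = 30000;   // example value: 30 seconds
socket.SetSocketOption(SocketOptionLevel.Socket,
                       SocketOptionName.SendTimeout, connectionTimeout);
socket.SetSocketOption(SocketOptionLevel.Socket,
                       SocketOptionName.ReceiveTimeout, connectionTimeout);
```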

Advanced

The advanced settings contain:

  • Code page used to encode the downloaded text pages.
  • A user-defined list of restricted words, to let the user block undesirable pages.
  • A user-defined list of restricted host extensions, to avoid being blocked by these hosts.
  • A user-defined list of restricted file extensions, to avoid parsing non-text data.

Advanced Settings

Points of Interest

  1. Keep Alive connection:

    Keep-Alive is a request from the client to the server to keep the connection open after the response is finished, for use by subsequent requests. This is done by adding an HTTP header to the request sent to the server, as in the following request:

    GET /CNN/Programs/nancy.grace/ HTTP/1.0
    Host: www.cnn.com
    Connection: Keep-Alive

    "Connection: Keep-Alive" tells the server not to close the connection. The server may choose to keep the connection open or close it, but it should tell the client its decision in the reply. The server signals that it will keep the connection open by including "Connection: Keep-Alive" in its reply, as follows:

    HTTP/1.0 200 OK
    Date: Sun, 19 Mar 2006 19:38:15 GMT
    Content-Length: 29025
    Content-Type: text/html
    Expires: Sun, 19 Mar 2006 19:39:15 GMT
    Cache-Control: max-age=60, private
    Connection: keep-alive
    Proxy-Connection: keep-alive
    Server: Apache
    Vary: Accept-Encoding,User-Agent
    Last-Modified: Sun, 19 Mar 2006 19:38:15 GMT
    Via: 1.1 webcache (NetCache NetApp/6.0.1P3)

    Or it can tell the client that it refuses, as follows:

    HTTP/1.0 200 OK
    Date: Sun, 19 Mar 2006 19:38:15 GMT
    Content-Length: 29025
    Content-Type: text/html
    Expires: Sun, 19 Mar 2006 19:39:15 GMT
    Cache-Control: max-age=60, private
    Connection: Close
    Server: Apache
    Vary: Accept-Encoding,User-Agent
    Last-Modified: Sun, 19 Mar 2006 19:38:15 GMT
    Via: 1.1 webcache (NetCache NetApp/6.0.1P3)
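    In code, the client's side of this negotiation boils down to checking the reply's Connection header before deciding whether the socket may be reused. The following is a sketch under that assumption, not the actual code from the download:

    ```csharp
    // Under HTTP/1.0 the connection may be reused only if the server
    // explicitly answered "Connection: Keep-Alive" in its reply header.
    static bool CanReuseConnection(string responseHeader)
    {
        foreach (string line in responseHeader.Split('\n'))
        {
            int colon = line.IndexOf(':');
            if (colon > 0 &&
                line.Substring(0, colon).Trim().ToLower() == "connection")
                return line.Substring(colon + 1).Trim().ToLower() == "keep-alive";
        }
        return false;   // no Connection header: assume the server will close
    }
    ```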
  2. WebRequest and WebResponse problems:

    When I started the code for this article, I was using the WebRequest and WebResponse classes, as in the following code:

    WebRequest request = WebRequest.Create(uri);
    WebResponse response = request.GetResponse();
    Stream streamIn = response.GetResponseStream();
    BinaryReader reader = new BinaryReader(streamIn, TextEncoding);
    byte[] RecvBuffer = new byte[10240];
    int nBytes, nTotalBytes = 0;
    while((nBytes = reader.Read(RecvBuffer, 0, 10240)) > 0)
    {
        nTotalBytes += nBytes;
        ...
    }
    reader.Close();
    streamIn.Close();
    response.Close();

    This code works, but it has a very serious problem: the WebRequest class's GetResponse function blocks access for all other requests until the retrieved response is closed, as in the last line of the previous code. So I noticed that only one thread was ever downloading while the others waited inside GetResponse. To solve this serious problem, I implemented my own two classes, MyWebRequest and MyWebResponse.

    MyWebRequest and MyWebResponse use the Socket class to manage connections. They are similar to WebRequest and WebResponse, but they support multiple concurrent responses. In addition, MyWebRequest supports a built-in flag, KeepAlive, for Keep-Alive connections.

    So, my new code would be like:

    request = MyWebRequest.Create(uri, request/*to Keep-Alive*/, KeepAlive);
    MyWebResponse response = request.GetResponse();
    byte[] RecvBuffer = new byte[10240];
    int nBytes, nTotalBytes = 0;
    while((nBytes = response.socket.Receive(RecvBuffer, 0, 
                    10240, SocketFlags.None)) > 0)
    {
        nTotalBytes += nBytes;
        ...
        if(response.KeepAlive && nTotalBytes >= response.ContentLength 
                              && response.ContentLength > 0)
            break;
    }
    if(response.KeepAlive == false)
        response.Close();

    Just replace GetResponseStream with direct access to the socket member of the MyWebResponse class. To make the next socket read start right after the reply header, I use a simple trick: read one byte at a time until the end of the header is detected, as in the following code:

    /* reading response header */
    Header = "";
    byte[] bytes = new byte[10];
    while(socket.Receive(bytes, 0, 1, SocketFlags.None) > 0)
    {
        Header += Encoding.ASCII.GetString(bytes, 0, 1);
        if(bytes[0] == '\n' && Header.EndsWith("\r\n\r\n"))
            break;
    }

    So, the user of the MyWebResponse class simply continues receiving from the first byte of the page body.

  3. Thread management:

    The number of threads in the crawler is user defined through the settings. Its default value is 10 threads, but it can be changed from the Settings tab, Connections. The crawler code handles this change using the property ThreadCount, as in the following code:

    // number of running threads
    private int nThreadCount;
    private int ThreadCount
    {
        get    {    return nThreadCount;    }
        set
        {
            Monitor.Enter(this.listViewThreads);
            try
            {
                for(int nIndex = 0; nIndex < value; nIndex ++)
                {
                    // check if thread not created or not suspended
                    if(threadsRun[nIndex] == null || 
                       threadsRun[nIndex].ThreadState != ThreadState.Suspended)
                    {    
                        // create new thread
                        threadsRun[nIndex] = new Thread(new 
                               ThreadStart(ThreadRunFunction));
                        // set thread name equal to its index
                        threadsRun[nIndex].Name = nIndex.ToString();
                        // start thread working function
                        threadsRun[nIndex].Start();
                        // check if the thread hasn't been added to the view yet
                        if(nIndex == this.listViewThreads.Items.Count)
                        {
                            // add a new line in the view for the new thread
                            ListViewItem item = 
                              this.listViewThreads.Items.Add(
                              (nIndex+1).ToString(), 0);
                            string[] subItems = { "", "", "", "0", "0%" };
                            item.SubItems.AddRange(subItems);
                        }
                    }
                    // check if the thread is suspended
                    else if(threadsRun[nIndex].ThreadState == 
                                     ThreadState.Suspended)
                    {
                        // get thread item from the list
                        ListViewItem item = this.listViewThreads.Items[nIndex];
                        item.ImageIndex = 1;
                        item.SubItems[2].Text = "Resume";
                        // resume the thread
                        threadsRun[nIndex].Resume();
                    }
                }
                // change thread value
                nThreadCount = value;
            }
            catch(Exception)
            {
            }
            Monitor.Exit(this.listViewThreads);
        }
    }

    If ThreadCount is increased by the user, the code creates new threads or resumes suspended ones. If it is decreased, the system leaves the suspension of the extra worker threads to the threads themselves, as follows: each worker thread has a name equal to its index in the thread array. When a thread finishes its current job and finds that its name (index) value is greater than or equal to ThreadCount, it suspends itself.
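    That self-suspension check might look like the following sketch (illustrative only; the actual code in the download may differ):

    ```csharp
    // Each thread's Name holds its index, so after finishing a request it
    // can compare that index against the current ThreadCount and park itself.
    int nIndex = int.Parse(Thread.CurrentThread.Name);
    if (nIndex >= ThreadCount)
    {
        // no longer needed; stays suspended until ThreadCount grows again
        Thread.CurrentThread.Suspend();
    }
    ```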

  4. Crawling depth:

    Crawling depth is how far the crawler goes in the navigation process. Each URL gets a depth equal to its parent's depth plus one, with depth 0 for the first URL entered by the user. URLs fetched from a page are inserted at the end of the URL queue, giving first-in, first-out operation. Any thread can insert into the queue at any time, as shown in the following code:

    void EnqueueUri(MyUri uri)
    {
        Monitor.Enter(queueURLS);
        try
        {
            queueURLS.Enqueue(uri);
        }
        catch(Exception)
        {
        }
        Monitor.Exit(queueURLS);
    }

    And each thread can retrieve the first URL in the queue to request it, as shown in the following code:

    MyUri DequeueUri()
    {
        Monitor.Enter(queueURLS);
        MyUri uri = null;
        try
        {
            uri = (MyUri)queueURLS.Dequeue();
        }
        catch(Exception)
        {
        }
        Monitor.Exit(queueURLS);
        return uri;
    }
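    Tying the two together, the depth rule can be sketched as follows when links extracted from a downloaded page are enqueued (ExtractLinks and MaxDepth are illustrative names, not the actual identifiers in the source):

    ```csharp
    // Each extracted link is one level deeper than the page it came from;
    // links beyond the configured depth are simply dropped.
    foreach (MyUri child in ExtractLinks(page, parent))
    {
        child.Depth = parent.Depth + 1;
        if (child.Depth <= MaxDepth)    // "Navigate through pages to a depth of"
            EnqueueUri(child);          // FIFO: appended at the tail
    }
    ```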

References

  1. Web crawler from Wikipedia, the free encyclopedia.
  2. RFC 1945, Hypertext Transfer Protocol -- HTTP/1.0.

Thanks to...

Thank God!

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


Article Copyright 2006 by Hatem Mostafa
Everything else Copyright © CodeProject, 1999-2014