
A Simple Crawler Using C# Sockets

A simple multi-threaded crawler using C# sockets to solve the WebRequest.GetResponse() locking problem.

Sample Image - Crawler.jpg


Introduction

A web crawler (also known as a web spider or ant) is a program that browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. Crawlers can also be used to automate maintenance tasks on a web site, such as checking links or validating HTML code. They can also be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).

Crawler Overview

In this article, I will introduce a simple Web crawler with a simple interface, to describe the crawling story in a simple C# program. My crawler mimics the input interface of any Internet browser to simplify the process: the user just enters the URL to be crawled in the navigation bar and clicks "Go".

Web Crawler Architecture from Wikipedia, the free encyclopedia

The crawler has a URL queue that is equivalent to the URL server in any large-scale search engine. The crawler works with multiple threads that fetch URLs from this queue, and the retrieved pages are saved in a storage area, as shown in the figure.

The fetched URLs are requested from the Web using a C# sockets library, to avoid the locking problems of the standard C# networking classes (discussed under Points of Interest below). Each retrieved page is parsed to extract new URL references, which go back into the crawler queue, down to the depth defined in the Settings.
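To make this flow concrete, the heart of each worker thread can be summarized as a loop like the following. This is a simplified sketch, not the article's literal code: DequeueUri and EnqueueUri are listed later in the article, while DownloadAndParse and the settings fields are hypothetical placeholders.

C#
// Simplified sketch of one worker thread's main loop (uses System.Threading).
// DequeueUri/EnqueueUri appear later in the article; DownloadAndParse
// and the settings fields are hypothetical placeholders.
void ThreadRunFunction()
{
    while (true)
    {
        MyUri uri = DequeueUri();
        if (uri == null)
        {
            // queue is empty: sleep for a while, then try again
            Thread.Sleep(settings.SleepWhenQueueEmpty);
            continue;
        }
        // download the page, save it, and extract its references
        foreach (MyUri newUri in DownloadAndParse(uri))
            EnqueueUri(newUri);
        // pause between connections to avoid overloading hosts
        Thread.Sleep(settings.SleepBetweenConnections);
    }
}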

In the next sections, I will describe the program views, and discuss some technical points related to the interface.

Crawler Views

My simple crawler contains three views that let you follow the crawling process, check the details of individual requests, and view crawling errors.

Threads view

The Threads view is just a window that displays the activity of all worker threads. Each thread takes a URI from the URIs queue and starts connection processing to download the URI object, as shown in the figure.

Threads tab view.

Requests view

The Requests view displays a list of the most recent requests downloaded by the worker threads, as in the following figure:

Requests tab view

This view enables you to watch each request header, like:

GET / HTTP/1.0
Host: www.cnn.com
Connection: Keep-Alive

You can also inspect each response header, like:

HTTP/1.0 200 OK
Date: Sun, 19 Mar 2006 19:39:05 GMT
Content-Length: 65730
Content-Type: text/html
Expires: Sun, 19 Mar 2006 19:40:05 GMT
Cache-Control: max-age=60, private
Connection: keep-alive
Proxy-Connection: keep-alive
Server: Apache
Last-Modified: Sun, 19 Mar 2006 19:38:58 GMT
Vary: Accept-Encoding,User-Agent
Via: 1.1 webcache (NetCache NetApp/6.0.1P3)

And you can see the list of URLs found in the downloaded page:

Parsing page ...
Found: 356 ref(s)
http://www.cnn.com/
http://www.cnn.com/search/
http://www.cnn.com/linkto/intl.html
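The parser that produces this list is not shown in this section, but a minimal sketch of the extraction step, assuming a regex-based approach (a hypothetical helper, not the article's actual parsing code), could look like this:

C#
using System;
using System.Collections;
using System.Text.RegularExpressions;

// Sketch: extract href targets from a downloaded page and resolve
// them against the page's own URI (hypothetical helper).
static ArrayList ExtractRefs(Uri pageUri, string html)
{
    ArrayList refs = new ArrayList();
    // matches href="..." or href='...' attributes
    foreach (Match m in Regex.Matches(html,
             "href\\s*=\\s*[\"']([^\"'#]+)[\"']", RegexOptions.IgnoreCase))
    {
        Uri uri;
        // resolve relative references against the current page
        if (Uri.TryCreate(pageUri, m.Groups[1].Value, out uri))
            refs.Add(uri);
    }
    return refs;
}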

Crawler Settings

The crawler settings are not complicated; they are options selected from many working crawlers in the market, including settings such as supported MIME types, the download folder, the number of working threads, and so on.

MIME types

MIME types are the file types the crawler is allowed to download, and the crawler ships with a set of default types. The user can add, edit, and delete MIME types, or choose to allow all MIME types, as in the following figure:

Files Matches Settings
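Internally, a check along these lines could decide whether a downloaded response should be kept, based on its Content-Type header. This is a hedged sketch; the parameter names are assumptions, not the article's actual members:

C#
// Sketch: decide whether a response should be kept, given its
// Content-Type header value and the configured MIME settings.
static bool IsAllowedMimeType(string contentType, bool allowAll,
                              string[] allowedTypes)
{
    if (allowAll)
        return true;
    // strip parameters such as "; charset=utf-8"
    int semicolon = contentType.IndexOf(';');
    if (semicolon >= 0)
        contentType = contentType.Substring(0, semicolon);
    contentType = contentType.Trim().ToLower();
    foreach (string mime in allowedTypes)
        if (contentType == mime)
            return true;
    return false;
}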

Output

Output settings include the download folder, and the number of requests to keep in the requests view for reviewing request details.

Output Settings

Connections

Connection settings contain:

  • Thread count: the number of concurrent worker threads in the crawler.
  • Thread sleep time when the refs queue is empty: how long each thread sleeps when the refs queue is empty.
  • Thread sleep time between two connections: how long each thread sleeps after handling a request. This value is very important, as it prevents hosts from blocking the crawler due to heavy load.
  • Connection timeout: the send and receive timeout applied to all crawler sockets (see the sketch after the figure below).
  • Navigate through pages to a depth of: the depth of navigation in the crawling process.
  • Keep same URL server: limits the crawling process to the same host as the original URL.
  • Keep connection alive: keeps the socket connection open for subsequent requests, to avoid reconnection time.

Connections Settings
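For example, the connection timeout above maps directly onto the Socket class's send and receive timeout options. A minimal sketch, assuming the timeout is held in milliseconds:

C#
using System.Net.Sockets;

// Sketch: apply the configured connection timeout (in milliseconds)
// to a crawler socket's send and receive operations.
static void ApplyTimeout(Socket socket, int timeoutMs)
{
    socket.SetSocketOption(SocketOptionLevel.Socket,
                           SocketOptionName.SendTimeout, timeoutMs);
    socket.SetSocketOption(SocketOptionLevel.Socket,
                           SocketOptionName.ReceiveTimeout, timeoutMs);
}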

Advanced

The advanced settings contain:

  • The code page used to encode the downloaded text pages.
  • A user-defined list of restricted words, letting the user filter out bad pages.
  • A user-defined list of restricted host extensions, to avoid blocking by these hosts.
  • A user-defined list of restricted file extensions, to avoid parsing non-text data (a filtering sketch follows the figure below).

Advanced Settings
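A sketch of how these filters might be applied before a reference is processed follows. The parameter names are assumptions, not the article's actual member names:

C#
using System;

// Sketch: apply the advanced filters to a candidate reference.
// The three list parameters stand in for the user-defined settings.
static bool IsRestricted(Uri uri, string pageText,
                         string[] restrictedHosts,
                         string[] restrictedExtensions,
                         string[] restrictedWords)
{
    foreach (string host in restrictedHosts)
        if (uri.Host.EndsWith(host))
            return true;        // host extension is blocked
    foreach (string ext in restrictedExtensions)
        if (uri.AbsolutePath.ToLower().EndsWith(ext))
            return true;        // non-text file extension
    foreach (string word in restrictedWords)
        if (pageText.IndexOf(word) >= 0)
            return true;        // page contains a restricted word
    return false;
}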

Points of Interest

  1. Keep Alive connection:

    Keep-Alive is a request from the client to the server to keep the connection open after the response is finished, for use by subsequent requests. This is done by adding an HTTP header to the request sent to the server, as in the following request:

    GET /CNN/Programs/nancy.grace/ HTTP/1.0
    Host: www.cnn.com
    Connection: Keep-Alive

    The "Connection: Keep-Alive" header tells the server not to close the connection. The server may keep the connection open or close it, but it should tell the client its decision in the reply. The server signals that it will keep the connection open by including "Connection: Keep-Alive" in its reply, as follows:

    HTTP/1.0 200 OK
    Date: Sun, 19 Mar 2006 19:38:15 GMT
    Content-Length: 29025
    Content-Type: text/html
    Expires: Sun, 19 Mar 2006 19:39:15 GMT
    Cache-Control: max-age=60, private
    Connection: keep-alive
    Proxy-Connection: keep-alive
    Server: Apache
    Vary: Accept-Encoding,User-Agent
    Last-Modified: Sun, 19 Mar 2006 19:38:15 GMT
    Via: 1.1 webcache (NetCache NetApp/6.0.1P3)

    Or it can tell the client that it refuses, as follows:

    HTTP/1.0 200 OK
    Date: Sun, 19 Mar 2006 19:38:15 GMT
    Content-Length: 29025
    Content-Type: text/html
    Expires: Sun, 19 Mar 2006 19:39:15 GMT
    Cache-Control: max-age=60, private
    Connection: Close
    Server: Apache
    Vary: Accept-Encoding,User-Agent
    Last-Modified: Sun, 19 Mar 2006 19:38:15 GMT
    Via: 1.1 webcache (NetCache NetApp/6.0.1P3)
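
    In code, issuing such a request over a raw socket amounts to writing the header text and then reading the reply. A minimal sketch (the host and path are just examples):

    C#
    using System.Net.Sockets;
    using System.Text;

    // Sketch: send a Keep-Alive HTTP/1.0 request over a raw TCP socket.
    Socket socket = new Socket(AddressFamily.InterNetwork,
                               SocketType.Stream, ProtocolType.Tcp);
    socket.Connect("www.cnn.com", 80);      // example host
    string request = "GET / HTTP/1.0\r\n" +
                     "Host: www.cnn.com\r\n" +
                     "Connection: Keep-Alive\r\n" +
                     "\r\n";
    socket.Send(Encoding.ASCII.GetBytes(request));
    // ... read the response, and reuse the socket for the next request
    // only if the reply also carries "Connection: Keep-Alive"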
  2. WebRequest and WebResponse problems:

    When I started the code for this article, I was using the WebRequest and WebResponse classes, as in the following code:

    C#
    WebRequest request = WebRequest.Create(uri);
    WebResponse response = request.GetResponse();
    Stream streamIn = response.GetResponseStream();
    BinaryReader reader = new BinaryReader(streamIn, TextEncoding);
    byte[] RecvBuffer = new byte[10240];
    int nBytes, nTotalBytes = 0;
    while((nBytes = reader.Read(RecvBuffer, 0, 10240)) > 0)
    {
        nTotalBytes += nBytes;
        ...
    }
    reader.Close();
    streamIn.Close();
    response.Close();

    This code works well, but it has a very serious problem: the WebRequest class's GetResponse function locks access for all other requests until the retrieved response is closed, as in the last line of the previous code. So I noticed that only one thread was ever downloading, while the others were waiting inside GetResponse. To solve this serious problem, I implemented my own two classes, MyWebRequest and MyWebResponse.

    MyWebRequest and MyWebResponse use the Socket class to manage connections, and they are similar to WebRequest and WebResponse, but they support concurrent responses at the same time. In addition, MyWebRequest supports a built-in flag, KeepAlive, to support Keep-Alive connections.

    So, my new code would be like:

    C#
    request = MyWebRequest.Create(uri, request/*to Keep-Alive*/, KeepAlive);
    MyWebResponse response = request.GetResponse();
    byte[] RecvBuffer = new byte[10240];
    int nBytes, nTotalBytes = 0;
    while((nBytes = response.socket.Receive(RecvBuffer, 0, 
                    10240, SocketFlags.None)) > 0)
    {
        nTotalBytes += nBytes;
        ...
        if(response.KeepAlive && nTotalBytes >= response.ContentLength 
                              && response.ContentLength > 0)
            break;
    }
    if(response.KeepAlive == false)
        response.Close();

    Just replace GetResponseStream with direct access to the socket member of the MyWebResponse class. To make the socket's next read start right after the response header, I used a simple trick: read one byte at a time until the header terminator is detected, as in the following code:

    C#
    /* reading response header */
    Header = "";
    byte[] bytes = new byte[10];
    while(socket.Receive(bytes, 0, 1, SocketFlags.None) > 0)
    {
        Header += Encoding.ASCII.GetString(bytes, 0, 1);
        if(bytes[0] == '\n' && Header.EndsWith("\r\n\r\n"))
            break;
    }

    So, the user of the MyWebResponse class just continues receiving from the first byte of the page body.
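
    Once the raw header has been captured this way, values such as the Content-Length used in the receive loop above can be pulled out of it. A small sketch, not the article's exact parsing code:

    C#
    // Sketch: extract the Content-Length value from the captured header.
    static int ParseContentLength(string header)
    {
        foreach (string line in header.Split('\n'))
        {
            // header lines look like "Content-Length: 65730"
            if (line.ToLower().StartsWith("content-length:"))
                return int.Parse(line.Substring(15).Trim());
        }
        return -1;  // header absent: read until the connection closes
    }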

  3. Thread management:

    The number of threads in the crawler is user-defined through the settings. Its default value is 10 threads, but it can be changed from the Connections section of the Settings tab. The crawler code handles changes to this value through the ThreadCount property, as in the following code:

    C#
    // number of running threads
    private int nThreadCount;
    private int ThreadCount
    {
        get    {    return nThreadCount;    }
        set
        {
            Monitor.Enter(this.listViewThreads);
            try
            {
                for(int nIndex = 0; nIndex < value; nIndex ++)
                {
                    // check if thread not created or not suspended
                    if(threadsRun[nIndex] == null || 
                       threadsRun[nIndex].ThreadState != ThreadState.Suspended)
                    {    
                        // create new thread
                        threadsRun[nIndex] = new Thread(new 
                               ThreadStart(ThreadRunFunction));
                        // set thread name equal to its index
                        threadsRun[nIndex].Name = nIndex.ToString();
                        // start thread working function
                        threadsRun[nIndex].Start();
                        // check if the thread hasn't been added to the view yet
                        if(nIndex == this.listViewThreads.Items.Count)
                        {
                            // add a new line in the view for the new thread
                            ListViewItem item = 
                              this.listViewThreads.Items.Add(
                              (nIndex+1).ToString(), 0);
                            string[] subItems = { "", "", "", "0", "0%" };
                            item.SubItems.AddRange(subItems);
                        }
                    }
                    // check if the thread is suspended
                    else if(threadsRun[nIndex].ThreadState == 
                                     ThreadState.Suspended)
                    {
                        // get thread item from the list
                        ListViewItem item = this.listViewThreads.Items[nIndex];
                        item.ImageIndex = 1;
                        item.SubItems[2].Text = "Resume";
                        // resume the thread
                        threadsRun[nIndex].Resume();
                    }
                }
                // change thread value
                nThreadCount = value;
            }
            catch(Exception)
            {
            }
            Monitor.Exit(this.listViewThreads);
        }
    }

    If ThreadCount is increased by the user, the code creates new threads or resumes suspended ones. Otherwise, the system leaves the suspension of the extra worker threads to the threads themselves, as follows: each worker thread has a name equal to its index in the thread array, and if that index is beyond ThreadCount, the thread finishes its current job and then suspends itself, as in the sketch below.
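    Inside the worker function, that self-suspension check could look like the following sketch (the article relies on Thread.Suspend/Resume, which later .NET versions deprecate):

    C#
    // Sketch: self-suspension check inside a worker thread's loop.
    // The thread's Name was set to its index when it was created.
    int nIndex = int.Parse(Thread.CurrentThread.Name);
    if (nIndex >= ThreadCount)
    {
        // this thread is beyond the configured count: park it here
        // until the ThreadCount setter resumes it again
        Thread.CurrentThread.Suspend();
    }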

  4. Crawling depth:

    It is the depth to which the crawler navigates. Each URL has a depth equal to its parent's depth plus one, with a depth of 0 for the first URL inserted by the user. URLs fetched from a page are inserted at the end of the URL queue, giving "first in, first out" operation. All threads can insert into the queue at any time, as shown in the following code:

    C#
    void EnqueueUri(MyUri uri)
    {
        Monitor.Enter(queueURLS);
        try
        {
            queueURLS.Enqueue(uri);
        }
        catch(Exception)
        {
        }
        Monitor.Exit(queueURLS);
    }

    And each thread can retrieve the first URL in the queue to request it, as shown in the following code:

    C#
    MyUri DequeueUri()
    {
        Monitor.Enter(queueURLS);
        MyUri uri = null;
        try
        {
            uri = (MyUri)queueURLS.Dequeue();
        }
        catch(Exception)
        {
        }
        Monitor.Exit(queueURLS);
        return uri;
    }
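
    The depth bookkeeping itself can be as simple as the following sketch; MyUri is the article's class, but the Depth member and MaxDepth setting names are assumptions:

    C#
    // Sketch: propagate depth when enqueuing references found in a page.
    // The Depth member and MaxDepth setting names are assumptions.
    foreach (MyUri newUri in foundRefs)
    {
        newUri.Depth = parentUri.Depth + 1;
        // follow references only within the configured depth
        if (newUri.Depth <= settings.MaxDepth)
            EnqueueUri(newUri);
    }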

References

  1. Web crawler from Wikipedia, the free encyclopedia.
  2. RFC 766.

Thanks to...

Thanks God!


