
A Simple Crawler Using C# Sockets

19 Mar 2006
A multi-threaded simple crawler with C# sockets to solve the WebRequest.GetResponse() locking problem.

Sample Image - Crawler.jpg


Introduction

A web crawler (also known as a web spider or ant) is a program that browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. Crawlers can also be used to automate maintenance tasks on a web site, such as checking links or validating HTML code. Finally, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).

Crawler Overview

In this article, I will introduce a simple Web crawler with a simple interface, to describe the crawling story in a simple C# program. My crawler uses the familiar input interface of an Internet browser to keep the process simple: the user just enters the URL to be crawled in the navigation bar and clicks "Go".

Web Crawler Architecture from Wikipedia, the free encyclopedia

The crawler has a URL queue that is equivalent to the URL server in any large scale search engine. The crawler works with multiple threads to fetch URLs from the crawler queue. Then the retrieved pages are saved in a storage area as shown in the figure.

The fetched URLs are requested from the Web using C# sockets directly, to avoid the locking behavior of the standard C# networking classes. The retrieved pages are parsed to extract new URL references, which are put back into the crawler queue, up to a certain depth defined in the Settings.
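The fetch-parse-enqueue cycle described above can be sketched roughly as follows. This is a minimal sketch only: DownloadPage, SavePage, ExtractLinks, and the settings fields are illustrative names, not the actual methods in the attached source; MyUri, EnqueueUri, and DequeueUri are shown later in the article.

```csharp
// Illustrative worker-thread loop: dequeue a URI, download it, save it,
// then parse out new links and enqueue them (names are assumptions).
void ThreadRunFunction()
{
    while (true)
    {
        MyUri uri = DequeueUri();          // returns null if the queue is empty
        if (uri == null)
        {
            Thread.Sleep(SleepWhenQueueEmpty);  // settings value
            continue;
        }
        string page = DownloadPage(uri);   // socket-based HTTP fetch
        SavePage(uri, page);               // storage area (download folder)
        foreach (MyUri link in ExtractLinks(page, uri))
            EnqueueUri(link);              // appended at the tail of the queue
    }
}
```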

In the next sections, I will describe the program views, and discuss some technical points related to the interface.

Crawler Views

My simple crawler contains three views that let you follow the crawling process, check request details, and review crawling errors.

Threads view

Threads view is just a window that displays the activity of all worker threads. Each thread takes a URI from the URI queue and starts connection processing to download the URI object, as shown in the figure.

Threads tab view.

Requests view

Requests view displays a list of the recent requests downloaded in the threads view, as in the following figure:

Requests tab view

This view enables you to watch each request header, like:

GET / HTTP/1.0
Host: www.cnn.com
Connection: Keep-Alive

You can watch each response header, like:

HTTP/1.0 200 OK
Date: Sun, 19 Mar 2006 19:39:05 GMT
Content-Length: 65730
Content-Type: text/html
Expires: Sun, 19 Mar 2006 19:40:05 GMT
Cache-Control: max-age=60, private
Connection: keep-alive
Proxy-Connection: keep-alive
Server: Apache
Last-Modified: Sun, 19 Mar 2006 19:38:58 GMT
Vary: Accept-Encoding,User-Agent
Via: 1.1 webcache (NetCache NetApp/6.0.1P3)

The view also lists the URLs found in the downloaded page:

Parsing page ...
Found: 356 ref(s)
http://www.cnn.com/
http://www.cnn.com/search/
http://www.cnn.com/linkto/intl.html

Crawler Settings

Crawler settings are not complicated; they are options selected from many working crawlers in the market, including settings such as supported MIME types, the download folder, the number of worker threads, and so on.

MIME types

MIME types are the content types the crawler is allowed to download, and the crawler ships with a default set. The user can add, edit, and delete MIME types, or simply allow all MIME types, as in the following figure:

Files Matches Settings
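As a rough sketch of how such a filter can be applied (the method and parameter names below are assumptions for illustration, not names from the attached source), the response's Content-Type header is checked before the body is stored:

```csharp
using System.Collections.Generic;

// Illustrative MIME filter: accept a response only if its Content-Type
// matches one of the configured types (or everything is allowed).
static bool IsAllowedMimeType(string contentType, bool allowAll,
                              List<string> allowedTypes)
{
    if (allowAll)
        return true;
    // strip parameters: "text/html; charset=utf-8" -> "text/html"
    int semicolon = contentType.IndexOf(';');
    if (semicolon >= 0)
        contentType = contentType.Substring(0, semicolon);
    return allowedTypes.Contains(contentType.Trim().ToLower());
}
```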

Output

Output settings include the download folder, and the number of requests to keep in the requests view for reviewing request details.

Output Settings

Connections

Connection settings contain:

  • Thread count: the number of concurrent worker threads in the crawler.
  • Thread sleep time when the refs queue is empty: how long each thread sleeps when the refs queue is empty.
  • Thread sleep time between two connections: how long each thread sleeps after handling a request; this value is very important to prevent hosts from blocking the crawler due to heavy load.
  • Connection timeout: the send and receive timeout for all crawler sockets.
  • Navigate through pages to a depth of: the depth of navigation in the crawling process.
  • Keep same URL server: limits the crawling process to the same host as the original URL.
  • Keep connection alive: keeps the socket connection open for subsequent requests, to avoid reconnection time.

Connections Settings
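For example, the connection timeout can be applied to both directions of a crawler socket with SetSocketOption. This is a sketch: connectionTimeout stands in for the value read from the Settings tab.

```csharp
using System.Net.Sockets;

// Apply the "Connection timeout" setting (in milliseconds) to both the
// send and receive directions of a crawler socket.
Socket socket = new Socket(AddressFamily.InterNetwork,
                           SocketType.Stream, ProtocolType.Tcp);
int connectionTimeout = 30000;   // example value: 30 seconds
socket.SetSocketOption(SocketOptionLevel.Socket,
                       SocketOptionName.SendTimeout, connectionTimeout);
socket.SetSocketOption(SocketOptionLevel.Socket,
                       SocketOptionName.ReceiveTimeout, connectionTimeout);
```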

Advanced

The advanced settings contain:

  • Code page used to encode the downloaded text pages.
  • A user-defined list of restricted words, to let the user block undesirable pages.
  • A user-defined list of restricted host extensions, to avoid being blocked by these hosts.
  • A user-defined list of restricted file extensions, to avoid parsing non-text data.

Advanced Settings

Points of Interest

  1. Keep Alive connection:

    Keep-Alive is a request from the client to the server to keep the connection open after the response is finished, for use by subsequent requests. This is done by adding an HTTP header to the request sent to the server, as in the following request:

    GET /CNN/Programs/nancy.grace/ HTTP/1.0
    Host: www.cnn.com
    Connection: Keep-Alive

    "Connection: Keep-Alive" tells the server not to close the connection. The server may choose to keep the connection open or close it, but it should tell the client its decision in the reply. The server signals that it will keep the connection open by including "Connection: Keep-Alive" in its reply, as follows:

    HTTP/1.0 200 OK
    Date: Sun, 19 Mar 2006 19:38:15 GMT
    Content-Length: 29025
    Content-Type: text/html
    Expires: Sun, 19 Mar 2006 19:39:15 GMT
    Cache-Control: max-age=60, private
    Connection: keep-alive
    Proxy-Connection: keep-alive
    Server: Apache
    Vary: Accept-Encoding,User-Agent
    Last-Modified: Sun, 19 Mar 2006 19:38:15 GMT
    Via: 1.1 webcache (NetCache NetApp/6.0.1P3)

    Or it can tell the client that it refuses, as follows:

    HTTP/1.0 200 OK
    Date: Sun, 19 Mar 2006 19:38:15 GMT
    Content-Length: 29025
    Content-Type: text/html
    Expires: Sun, 19 Mar 2006 19:39:15 GMT
    Cache-Control: max-age=60, private
    Connection: Close
    Server: Apache
    Vary: Accept-Encoding,User-Agent
    Last-Modified: Sun, 19 Mar 2006 19:38:15 GMT
    Via: 1.1 webcache (NetCache NetApp/6.0.1P3)
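    In code, the client's side of this negotiation boils down to checking the reply's Connection header before deciding whether the socket may be reused. The following is a sketch under that assumption, not the actual code from the download:

    ```csharp
    // Under HTTP/1.0 the connection may be reused only if the server
    // explicitly answered "Connection: Keep-Alive" in its reply header.
    static bool CanReuseConnection(string responseHeader)
    {
        foreach (string line in responseHeader.Split('\n'))
        {
            int colon = line.IndexOf(':');
            if (colon > 0 &&
                line.Substring(0, colon).Trim().ToLower() == "connection")
                return line.Substring(colon + 1).Trim().ToLower() == "keep-alive";
        }
        return false;   // no Connection header: assume the server will close
    }
    ```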
  2. WebRequest and WebResponse problems:

    When I started the code for this article, I was using the WebRequest and WebResponse classes, as in the following code:

    WebRequest request = WebRequest.Create(uri);
    WebResponse response = request.GetResponse();
    Stream streamIn = response.GetResponseStream();
    BinaryReader reader = new BinaryReader(streamIn, TextEncoding);
    byte[] RecvBuffer = new byte[10240];
    int nBytes, nTotalBytes = 0;
    while((nBytes = reader.Read(RecvBuffer, 0, 10240)) > 0)
    {
        nTotalBytes += nBytes;
        ...
    }
    reader.Close();
    streamIn.Close();
    response.Close();

    This code works, but it has a very serious problem: the WebRequest class's GetResponse function blocks access for all other requests until the retrieved response is closed, as in the last line of the previous code. So I noticed that only one thread was ever downloading while the others waited inside GetResponse. To solve this serious problem, I implemented my own two classes, MyWebRequest and MyWebResponse.

    MyWebRequest and MyWebResponse use the Socket class to manage connections. They are similar to WebRequest and WebResponse, but they support multiple concurrent responses. In addition, MyWebRequest supports a built-in flag, KeepAlive, for Keep-Alive connections.

    So, my new code would be like:

    request = MyWebRequest.Create(uri, request/*to Keep-Alive*/, KeepAlive);
    MyWebResponse response = request.GetResponse();
    byte[] RecvBuffer = new byte[10240];
    int nBytes, nTotalBytes = 0;
    while((nBytes = response.socket.Receive(RecvBuffer, 0, 
                    10240, SocketFlags.None)) > 0)
    {
        nTotalBytes += nBytes;
        ...
        if(response.KeepAlive && nTotalBytes >= response.ContentLength 
                              && response.ContentLength > 0)
            break;
    }
    if(response.KeepAlive == false)
        response.Close();

    Just replace GetResponseStream with direct access to the socket member of the MyWebResponse class. To make the next socket read start right after the reply header, I use a simple trick: read one byte at a time until the end of the header is detected, as in the following code:

    /* reading response header */
    Header = "";
    byte[] bytes = new byte[10];
    while(socket.Receive(bytes, 0, 1, SocketFlags.None) > 0)
    {
        Header += Encoding.ASCII.GetString(bytes, 0, 1);
        if(bytes[0] == '\n' && Header.EndsWith("\r\n\r\n"))
            break;
    }

    So, the user of the MyWebResponse class simply continues receiving from the first byte of the page body.

  3. Thread management:

    The number of threads in the crawler is user defined through the settings. Its default value is 10 threads, but it can be changed from the Settings tab, Connections. The crawler code handles this change using the property ThreadCount, as in the following code:

    // number of running threads
    private int nThreadCount;
    private int ThreadCount
    {
        get    {    return nThreadCount;    }
        set
        {
            Monitor.Enter(this.listViewThreads);
            try
            {
                for(int nIndex = 0; nIndex < value; nIndex ++)
                {
                    // check if thread not created or not suspended
                    if(threadsRun[nIndex] == null || 
                       threadsRun[nIndex].ThreadState != ThreadState.Suspended)
                    {    
                        // create new thread
                        threadsRun[nIndex] = new Thread(new 
                               ThreadStart(ThreadRunFunction));
                        // set thread name equal to its index
                        threadsRun[nIndex].Name = nIndex.ToString();
                        // start thread working function
                        threadsRun[nIndex].Start();
                        // check if the thread hasn't been added to the view yet
                        if(nIndex == this.listViewThreads.Items.Count)
                        {
                            // add a new line in the view for the new thread
                            ListViewItem item = 
                              this.listViewThreads.Items.Add(
                              (nIndex+1).ToString(), 0);
                            string[] subItems = { "", "", "", "0", "0%" };
                            item.SubItems.AddRange(subItems);
                        }
                    }
                    // check if the thread is suspended
                    else if(threadsRun[nIndex].ThreadState == 
                                     ThreadState.Suspended)
                    {
                        // get thread item from the list
                        ListViewItem item = this.listViewThreads.Items[nIndex];
                        item.ImageIndex = 1;
                        item.SubItems[2].Text = "Resume";
                        // resume the thread
                        threadsRun[nIndex].Resume();
                    }
                }
                // change thread value
                nThreadCount = value;
            }
            catch(Exception)
            {
            }
            Monitor.Exit(this.listViewThreads);
        }
    }

    If ThreadCount is increased by the user, the code creates new threads or resumes suspended ones. If it is decreased, the system leaves the suspension of the extra worker threads to the threads themselves, as follows: each worker thread has a name equal to its index in the thread array. When a thread finishes its current job and finds that its name (index) value is greater than or equal to ThreadCount, it suspends itself.
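    That self-suspension check might look like the following sketch (illustrative only; the actual code in the download may differ):

    ```csharp
    // Each thread's Name holds its index, so after finishing a request it
    // can compare that index against the current ThreadCount and park itself.
    int nIndex = int.Parse(Thread.CurrentThread.Name);
    if (nIndex >= ThreadCount)
    {
        // no longer needed; stays suspended until ThreadCount grows again
        Thread.CurrentThread.Suspend();
    }
    ```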

  4. Crawling depth:

    Crawling depth is how far the crawler goes in the navigation process. Each URL gets a depth equal to its parent's depth plus one, with depth 0 for the first URL entered by the user. URLs fetched from a page are inserted at the end of the URL queue, giving first-in, first-out operation. Any thread can insert into the queue at any time, as shown in the following code:

    void EnqueueUri(MyUri uri)
    {
        Monitor.Enter(queueURLS);
        try
        {
            queueURLS.Enqueue(uri);
        }
        catch(Exception)
        {
        }
        Monitor.Exit(queueURLS);
    }

    And each thread can retrieve the first URL in the queue to request it, as shown in the following code:

    MyUri DequeueUri()
    {
        Monitor.Enter(queueURLS);
        MyUri uri = null;
        try
        {
            uri = (MyUri)queueURLS.Dequeue();
        }
        catch(Exception)
        {
        }
        Monitor.Exit(queueURLS);
        return uri;
    }
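    Tying the two together, the depth rule can be sketched as follows when links extracted from a downloaded page are enqueued (ExtractLinks and MaxDepth are illustrative names, not the actual identifiers in the source):

    ```csharp
    // Each extracted link is one level deeper than the page it came from;
    // links beyond the configured depth are simply dropped.
    foreach (MyUri child in ExtractLinks(page, parent))
    {
        child.Depth = parent.Depth + 1;
        if (child.Depth <= MaxDepth)    // "Navigate through pages to a depth of"
            EnqueueUri(child);          // FIFO: appended at the tail
    }
    ```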

References

  1. Web crawler from Wikipedia, the free encyclopedia.
  2. RFC 1945, Hypertext Transfer Protocol -- HTTP/1.0.

Thanks to...

Thank God!

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


Article Copyright 2006 by Hatem Mostafa
Everything else Copyright © CodeProject, 1999-2014