Sample Image - Crawler.jpg

Introduction

A web crawler (also known as a web spider or ant) is a program that browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. Crawlers can also be used to automate maintenance tasks on a web site, such as checking links or validating HTML code. They can also gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).

Crawler Overview

In this article, I introduce a simple Web crawler with a simple interface, to tell the crawling story in a small C# program. To keep things simple, the crawler borrows the input interface of a web browser: the user just types the URL to be crawled in the navigation bar and clicks "Go".

Web Crawler Architecture from Wikipedia, the free encyclopedia

The crawler has a URL queue that is equivalent to the URL server in any large-scale search engine. Multiple threads fetch URLs from this queue, and the retrieved pages are saved in a storage area, as shown in the figure.

The fetched URLs are requested from the Web using the C# Socket class directly, to avoid the locking behaviour of other C# libraries (described under Points of Interest). The retrieved pages are parsed to extract new URL references, which are put back in the crawler queue, up to a certain depth defined in the Settings.
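
To make the socket-based fetch a little more concrete, here is a minimal sketch of how the request text for a URL could be assembled by hand before it is written to a socket. The class and method names (`RawRequest`, `Build`) are illustrative assumptions and are not taken from the article's source; the header layout simply mirrors the request headers shown in the Requests view.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of the article's code.
public static class RawRequest
{
    // Builds the text of an HTTP/1.0 GET request for the given URI.
    public static string Build(Uri uri, bool keepAlive)
    {
        var sb = new StringBuilder();
        // Request line: method, path (with query string), protocol version.
        sb.Append("GET ").Append(uri.PathAndQuery).Append(" HTTP/1.0\r\n");
        sb.Append("Host: ").Append(uri.Host).Append("\r\n");
        if (keepAlive)
            sb.Append("Connection: Keep-Alive\r\n");
        // A blank line terminates the header.
        sb.Append("\r\n");
        return sb.ToString();
    }
}
```

The resulting string would then be encoded as ASCII bytes and sent over a connected socket.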

In the next sections, I will describe the program views, and discuss some technical points related to the interface.

Crawler Views

My simple crawler contains three views that let you follow the crawling process, check the details, and view the crawling errors.

Threads view

Threads view is just a window that shows the user what each thread is doing. Each thread takes a URI from the URI queue and starts a connection to download the URI object, as shown in the figure.

Threads tab view.

Requests view

Requests view displays a list of the most recent requests downloaded by the threads, as in the following figure:

Requests tab view

This view lets you inspect each request header, like:

GET / HTTP/1.0
Host: www.cnn.com
Connection: Keep-Alive

You can watch each response header, like:

HTTP/1.0 200 OK
Date: Sun, 19 Mar 2006 19:39:05 GMT
Content-Length: 65730
Content-Type: text/html
Expires: Sun, 19 Mar 2006 19:40:05 GMT
Cache-Control: max-age=60, private
Connection: keep-alive
Proxy-Connection: keep-alive
Server: Apache
Last-Modified: Sun, 19 Mar 2006 19:38:58 GMT
Vary: Accept-Encoding,User-Agent
Via: 1.1 webcache (NetCache NetApp/6.0.1P3)

It also lists the URLs found in the downloaded page:

Parsing page ...
Found: 356 ref(s)
http://www.cnn.com/
http://www.cnn.com/search/
http://www.cnn.com/linkto/intl.html
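
The article does not show how the references are found, so the following is only a rough sketch of one way to do it: a regular expression picks out `href` attributes and each value is resolved against the page's own address. The names `LinkExtractor` and `Extract` are assumptions made for this example, and a regex is a simplification that will miss some links a full HTML parser would find.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of the article's code.
public static class LinkExtractor
{
    // Matches href="..." or href='...' and captures the attribute value.
    private static readonly Regex HrefPattern = new Regex(
        "href\\s*=\\s*[\"']([^\"']+)[\"']", RegexOptions.IgnoreCase);

    // Returns the distinct absolute http/https URLs referenced by the page.
    public static List<string> Extract(string html, Uri pageUri)
    {
        var found = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            Uri absolute;
            // Resolve relative references against the page address.
            if (!Uri.TryCreate(pageUri, m.Groups[1].Value, out absolute))
                continue;
            // Skip mailto:, javascript:, ftp: and other non-web schemes.
            if (absolute.Scheme != Uri.UriSchemeHttp &&
                absolute.Scheme != Uri.UriSchemeHttps)
                continue;
            // Drop the #fragment so the same page is not queued twice.
            string url = absolute.GetLeftPart(UriPartial.Query);
            if (!found.Contains(url))
                found.Add(url);
        }
        return found;
    }
}
```

Each returned URL would then be checked against the settings and added to the crawler queue.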

Crawler Settings

Crawler settings are not complicated; they are options selected from many working crawlers on the market, such as supported MIME types, download folder, number of working threads, and so on.

MIME types

MIME types are the content types that the crawler is allowed to download; the crawler ships with a default list. The user can add, edit, and delete MIME types, or choose to allow all MIME types, as in the following figure:

Files Matches Settings

Output

Output settings include the download folder, and the number of requests to keep in the requests view for reviewing request details.

Output Settings

Connections

Connection settings contain:

Connections Settings

Advanced

The advanced settings contain:

Advanced Settings

Points of Interest

  1. Keep-Alive connection:

    Keep-Alive is a request from the client to the server to keep the connection open after the response is finished, so that it can be reused for subsequent requests. This is done by adding an HTTP header to the request, as in the following:

    GET /CNN/Programs/nancy.grace/ HTTP/1.0
    Host: www.cnn.com
    Connection: Keep-Alive

    The "Connection: Keep-Alive" header asks the server not to close the connection. The server is free to keep it open or to close it, but it should tell the client which it has decided. It confirms that it will keep the connection open by including "Connection: Keep-Alive" in its reply, as follows:

    HTTP/1.0 200 OK
    Date: Sun, 19 Mar 2006 19:38:15 GMT
    Content-Length: 29025
    Content-Type: text/html
    Expires: Sun, 19 Mar 2006 19:39:15 GMT
    Cache-Control: max-age=60, private
    Connection: keep-alive
    Proxy-Connection: keep-alive
    Server: Apache
    Vary: Accept-Encoding,User-Agent
    Last-Modified: Sun, 19 Mar 2006 19:38:15 GMT
    Via: 1.1 webcache (NetCache NetApp/6.0.1P3)

    Or it can tell the client that it refuses, as follows:

    HTTP/1.0 200 OK
    Date: Sun, 19 Mar 2006 19:38:15 GMT
    Content-Length: 29025
    Content-Type: text/html
    Expires: Sun, 19 Mar 2006 19:39:15 GMT
    Cache-Control: max-age=60, private
    Connection: Close
    Server: Apache
    Vary: Accept-Encoding,User-Agent
    Last-Modified: Sun, 19 Mar 2006 19:38:15 GMT
    Via: 1.1 webcache (NetCache NetApp/6.0.1P3)
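
    The two replies above differ only in their Connection header, so the client's decision can be sketched as a small header check. This is a minimal illustration, not the article's implementation: `KeepAliveCheck` and `ServerKeepsConnectionOpen` are made-up names, and treating a missing Connection header as "close" is an assumption that matches HTTP/1.0 behaviour.

    ```csharp
    using System;

    // Hypothetical helper for illustration only; not part of the article's code.
    public static class KeepAliveCheck
    {
        // Returns true when the response header says the connection stays open.
        public static bool ServerKeepsConnectionOpen(string responseHeader)
        {
            string[] lines = responseHeader.Split(
                new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);
            foreach (string line in lines)
            {
                int colon = line.IndexOf(':');
                if (colon <= 0)
                    continue; // status line or malformed line: no "Name: value"
                string name = line.Substring(0, colon).Trim();
                string value = line.Substring(colon + 1).Trim();
                // Header names and this value are compared case-insensitively.
                if (name.Equals("Connection", StringComparison.OrdinalIgnoreCase))
                    return value.Equals("keep-alive",
                                        StringComparison.OrdinalIgnoreCase);
            }
            // Assumption: with HTTP/1.0, no Connection header means "close".
            return false;
        }
    }
    ```
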
  2. WebRequest and WebResponse problems:

    When I started writing the code for this article, I was using the WebRequest and WebResponse classes, as in the following code:

    WebRequest request = WebRequest.Create(uri);
    WebResponse response = request.GetResponse();
    Stream streamIn = response.GetResponseStream();
    BinaryReader reader = new BinaryReader(streamIn, TextEncoding);
    byte[] RecvBuffer = new byte[10240];
    int nBytes, nTotalBytes = 0;
    while((nBytes = reader.Read(RecvBuffer, 0, 10240)) > 0)
    {
        nTotalBytes += nBytes;
        ...
    }
    reader.Close();
    streamIn.Close();
    response.Close();

    This code works, but it has a very serious problem: the WebRequest method GetResponse blocks all other callers until the previously retrieved response has been closed, as in the last line of the code above. As a result, I noticed that only one thread was ever downloading while the others were waiting in GetResponse. To solve this problem, I implemented my own two classes, MyWebRequest and MyWebResponse.

    MyWebRequest and MyWebResponse use the Socket class to manage connections. They are similar to WebRequest and WebResponse, but they support concurrent responses. In addition, MyWebRequest has a built-in flag, KeepAlive, to support Keep-Alive connections.

    So, my new code looks like this:

    request = MyWebRequest.Create(uri, request/*to Keep-Alive*/, KeepAlive);
    MyWebResponse response = request.GetResponse();
    byte[] RecvBuffer = new byte[10240];
    int nBytes, nTotalBytes = 0;
    while((nBytes = response.socket.Receive(RecvBuffer, 0, 
                    10240, SocketFlags.None)) > 0)
    {
        nTotalBytes += nBytes;
        ...
        if(response.KeepAlive && nTotalBytes >= response.ContentLength 
                              && response.ContentLength > 0)
            break;
    }
    if(response.KeepAlive == false)
        response.Close();

    The change is simply to replace GetResponseStream with direct access to the socket member of the MyWebResponse class. For that to work, the next read on the socket must start right after the reply header. I used a simple trick: read the header one byte at a time and stop when the blank line that ends it is seen, as in the following code:

    /* reading response header */
    Header = "";
    byte[] bytes = new byte[10];
    while(socket.Receive(bytes, 0, 1, SocketFlags.None) > 0)
    {
        Header += Encoding.ASCII.GetString(bytes, 0, 1);
        if(bytes[0] == '\n' && Header.EndsWith("\r\n\r\n"))
            break;
    }

    So, the user of the MyWebResponse class just continues receiving from the first byte of the page.
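
    The same byte-at-a-time idea can be sketched against any Stream, which makes it easy to try without a network connection. This is a variation under stated assumptions rather than the article's code: `HeaderReader` and `ReadHeader` are made-up names, a small counter replaces the repeated EndsWith call, and in a real crawler the stream would presumably be a NetworkStream created over the socket.

    ```csharp
    using System.IO;
    using System.Text;

    // Hypothetical helper for illustration only; not part of the article's code.
    public static class HeaderReader
    {
        // Reads up to and including the "\r\n\r\n" that ends the header,
        // leaving the stream positioned at the first byte of the body.
        public static string ReadHeader(Stream stream)
        {
            const string terminator = "\r\n\r\n";
            var header = new StringBuilder();
            int matched = 0; // how many terminator characters matched so far
            int b;
            while ((b = stream.ReadByte()) != -1)
            {
                char c = (char)b; // header bytes are ASCII
                header.Append(c);
                if (c == terminator[matched])
                    matched++;
                else
                    matched = (c == '\r') ? 1 : 0; // a '\r' may start a new match
                if (matched == terminator.Length)
                    break;
            }
            return header.ToString();
        }
    }
    ```

    After ReadHeader returns, further reads from the same stream yield the page content, which is the property the crawler relies on.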

  3. Thread management:

    The number of threads in the crawler is user defined through the settings. Its default value is 10 threads, but it can be changed from the Settings tab, Connections. The crawler code handles this change using the property ThreadCount, as in the following code:

    // number of running threads
    
    private int nThreadCount;
    private int ThreadCount
    {
        get    {    return nThreadCount;    }
        set
        {
            Monitor.Enter(this.listViewThreads);
            try
            {
                for(int nIndex = 0; nIndex < value; nIndex ++)
                {
                    // check if thread not created or not suspended
    
                    if(threadsRun[nIndex] == null || 
                       threadsRun[nIndex].ThreadState != ThreadState.Suspended)
                    {    
                        // create new thread
    
                        threadsRun[nIndex] = new Thread(new 
                               ThreadStart(ThreadRunFunction));
                        // set thread name equal to its index
    
                        threadsRun[nIndex].Name = nIndex.ToString();
                        // start thread working function
    
                        threadsRun[nIndex].Start();
                        // check if the thread has not been added to the view yet
    
                        if(nIndex == this.listViewThreads.Items.Count)
                        {
                            // add a new line in the view for the new thread
    
                            ListViewItem item = 
                              this.listViewThreads.Items.Add(
                              (nIndex+1).ToString(), 0);
                            string[] subItems = { "", "", "", "0", "0%" };
                            item.SubItems.AddRange(subItems);
                        }
                    }
                    // check if the thread is suspended
    
                    else if(threadsRun[nIndex].ThreadState == 
                                     ThreadState.Suspended)
                    {
                        // get thread item from the list
    
                        ListViewItem item = this.listViewThreads.Items[nIndex];
                        item.ImageIndex = 1;
                        item.SubItems[2].Text = "Resume";
                        // resume the thread
    
                        threadsRun[nIndex].Resume();
                    }
                }
                // change thread value
    
                nThreadCount = value;
            }
            catch(Exception)
            {
            }
            Monitor.Exit(this.listViewThreads);
        }
    }

    If ThreadCount is increased by the user, the code creates new threads or resumes suspended ones. Otherwise, the system leaves the suspension of extra working threads to the threads themselves: each working thread has a name equal to its index in the thread array, and if that value is greater than ThreadCount, the thread finishes its current job and then goes into suspension mode.
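
    Thread.Suspend and Thread.Resume are marked obsolete in later versions of .NET, so the following sketch shows the same "extra threads park themselves" idea with Monitor.Wait instead. It is an alternative technique, not the article's code: `WorkerGate` and its members are hypothetical names, and it assumes zero-based worker indices, so a worker is parked when its index is greater than or equal to the active count.

    ```csharp
    using System.Threading;

    // Hypothetical helper for illustration only; not part of the article's code.
    public sealed class WorkerGate
    {
        private readonly object sync = new object();
        private int activeCount;

        public WorkerGate(int initialCount)
        {
            activeCount = initialCount;
        }

        // Number of workers allowed to run; changing it wakes parked workers
        // so that they can re-check whether they are now inside the range.
        public int ActiveCount
        {
            get { lock (sync) { return activeCount; } }
            set { lock (sync) { activeCount = value; Monitor.PulseAll(sync); } }
        }

        // True when the worker with this zero-based index may keep working.
        public bool IsActive(int workerIndex)
        {
            lock (sync) { return workerIndex < activeCount; }
        }

        // Called by a worker between jobs; blocks while the worker is extra.
        public void WaitUntilActive(int workerIndex)
        {
            lock (sync)
            {
                while (workerIndex >= activeCount)
                    Monitor.Wait(sync);
            }
        }
    }
    ```

    Each worker would call WaitUntilActive with its own index after finishing a URL, and the settings code would only need to assign ActiveCount.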

  4. Crawling depth:

    This is the depth to which the crawler follows links. Each URL has a depth equal to its parent's depth plus one, with depth 0 for the first URL entered by the user. URLs fetched from any page are inserted at the end of the URL queue, which gives "first in, first out" behaviour. Any thread can insert into the queue at any time, as shown in the following code:

    void EnqueueUri(MyUri uri)
    {
        Monitor.Enter(queueURLS);
        try
        {
            queueURLS.Enqueue(uri);
        }
        catch(Exception)
        {
        }
        Monitor.Exit(queueURLS);
    }

    And each thread can retrieve the first URL in the queue to request it, as shown in the following code:

    MyUri DequeueUri()
    {
        Monitor.Enter(queueURLS);
        MyUri uri = null;
        try
        {
            uri = (MyUri)queueURLS.Dequeue();
        }
        catch(Exception)
        {
        }
        Monitor.Exit(queueURLS);
        return uri;
    }
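
    To show how the depth rule and the first-in-first-out queue work together, here is a single-threaded sketch of a depth-limited crawl. It is an illustration under assumptions: `DepthLimitedCrawl` and `Crawl` are made-up names, the `getLinks` delegate stands in for "download the page and parse its references", and the `seen` set that prevents queueing a URL twice is an addition that the article does not describe.

    ```csharp
    using System;
    using System.Collections.Generic;

    // Hypothetical helper for illustration only; not part of the article's code.
    public static class DepthLimitedCrawl
    {
        // Visits pages breadth-first and returns them in visiting order.
        // Pages at maxDepth are visited, but their links are not followed.
        public static List<string> Crawl(string startUrl, int maxDepth,
                                         Func<string, IEnumerable<string>> getLinks)
        {
            var visited = new List<string>();
            var seen = new HashSet<string> { startUrl };
            var queue = new Queue<KeyValuePair<string, int>>();
            queue.Enqueue(new KeyValuePair<string, int>(startUrl, 0)); // depth 0

            while (queue.Count > 0)
            {
                KeyValuePair<string, int> item = queue.Dequeue();
                visited.Add(item.Key);
                if (item.Value >= maxDepth)
                    continue; // depth limit reached: do not expand this page
                foreach (string link in getLinks(item.Key))
                {
                    // A child's depth is its parent's depth plus one.
                    if (seen.Add(link))
                        queue.Enqueue(new KeyValuePair<string, int>(
                            link, item.Value + 1));
                }
            }
            return visited;
        }
    }
    ```

    In the multi-threaded crawler the queue operations would be wrapped in the locking shown above; the ordering and depth bookkeeping stay the same.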

References

  1. Web crawler from Wikipedia, the free encyclopedia.
  2. RFC766.

Thanks to...

Thanks god!

General: how to stop the web crawler
sheharbano
2:22 28 Feb '10  
the demo works just perfectly....but the source files work very unpredictably....when i press the stop button, sometimes it works and most of the times, the application gets hung up?
anyone else experienced similar problem?
solution anyone?
General: Solution to Cross-Thread calls not allowed
sheharbano
23:03 27 Feb '10  
like mentioned in a couple of other threads here, i added the line ComboBox.CheckForIllegalCrossThreadCalls=false

but the main problem encountered, no matter what i did, was that only one thread did all the work when i hit the GO button for the first time. if i paused and resumed or stopped and resumed, then it worked perfectly ok.

what solved the problem for me was adding this line right in the beginning of void StartParsing()method

ThreadCount = Settings.GetValue("Threads count", 10);
General: Email harvesting using web crawler
sheharbano
3:36 24 Feb '10  
Hello.Your program is very well written. i am doing a project that involves harvesting or extracting email addresses from a website/URL. can you please point out where exactly should i put in my "email extraction" logic??? i have tried alot on my own but out of sheer desperation plus frustration, i turn to you!
General: simultaneous connections with WebRequest and WebResponse classes is possible
H.Matin
11:28 6 Dec '09  
Just make this piece of configuration in your application

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
    <system.net>
        <connectionManagement>
            <add address="*" maxconnection="100" />
        </connectionManagement>
    </system.net>
</configuration>

This allows for up to 100 simultaneous connections to any server.

“Clients that use persistent connections SHOULD limit the number of simultaneous connections that they maintain to a given server. A single-user client SHOULD NOT maintain more than 2 connections with any server or proxy. A proxy SHOULD use up to 2*N connections to another server or proxy, where N is the number of simultaneously active users. These guidelines are intended to improve HTTP response times and avoid congestion.”

Oops, Ah, this explains why Internet Explorer only allows to download two files at the same time!
General: Very nice work.
poulraider
23:52 2 Dec '09  
If anyone has similar code lying around that also supports POST requests and cookie handling, that would be awesome.

Good work.
General: Re: Very nice work.
poulraider
1:12 3 Dec '09  
I'm having a little issue. Let's say I'm only reading the header and not the content of the request. Then I create a new request based on the old kept-alive request, and the header read gets the old content first before it gets to the header of the new request. Is there any way to flush out the unread data from the socket in the Create method?
General: web crawler as console application
xxsaxx
3:20 1 Dec '09  
is there any way to use this as console application instead of win app?
General: Re: web crawler as console application
Hatem Mostafa
18:56 1 Dec '09  
Good idea. I'll try to do that but with c++, I forgot c# at all.
General: web crawler
Adi1708
0:27 1 Dec '09  
Hello,

I am trying to incorporate your source code in one of my projects. It's really awesome what you have done here, but I have a problem because it's hard to find what I am looking for. I need to locate the function that takes the input string = url. What I am trying to do is to call the function, pass the string to it, and then initiate the crawling process. Would you please help me navigate to the right function in your source code. Where can I manually assign the url string and call for the crawling to begin.

I look forward to hearing from you!
General: Re: web crawler
Hatem Mostafa
18:59 1 Dec '09  
bool AddURL(ref MyUri uri)

is the function u need. Just fill uri.uriString with ur URL

Thanks
Hatem
General: How to add proxy request !?
Tran Van Huy
4:05 6 Aug '09  
Thanks first, it works very well. But some sites check the requests, so I want to add a proxy to the request to get past the blocks set by the site's administrator. Please help me!
General: web crawler
amina konsaw
4:54 4 Jul '09  
I want to know what happens after the user enters a URL into the crawler.
I know that the URL is taken and put in the queue, but which thread takes the URL and parses it to find URLs?
Is this process infinite? When does it finish?
Please reply, this is important to me.
General: Re: web crawler
Hatem Mostafa
8:31 4 Jul '09  
check
A Simple Crawler Using C# Sockets
It includes the answer.
General: http://arachnode.net
Teufel1212
20:50 17 Mar '09  
http://arachnode.net Very helpful... Also a good resource.
General: problem loading the exe file....could anyone help please...
mav058
1:06 16 Mar '09  
..
Question: Any one has solution to exception "Cross-thread operation not valid"?
raghavlyon
20:37 23 Feb '09  
Cross-thread operation not valid: Control 'comboBoxWeb' accessed from a thread other than the thread it was created on.
This is a problem when running with VS2005. If anyone has a solution to this then please post the code for the same. I will really appreciate the efforts. A delegate does not seem to work in this case.
Answer: Re: Any one has solution to exception "Cross-thread operation not valid"?
Hatem Mostafa
21:46 23 Feb '09  
add

ComboBox.CheckForIllegalCrossThreadCalls = false;
ListView.CheckForIllegalCrossThreadCalls = false;

after

InitializeComponent();
General: Re: Any one has solution to exception "Cross-thread operation not valid"?
raghavlyon
22:03 23 Feb '09  
Thanks. I was trying the same solution but I was using it on an object. Being a new .net person I didn't realize that CheckForIllegalCrossThreadCalls is a static property and should be called directly on Control, not on an object.
Thanks, you have saved the day.
General: Re: Any one has solution to exception "Cross-thread operation not valid"?
raghavlyon
22:06 23 Feb '09  
It works though but executing only one thread at a one time. Now it is not multi-threaded
General: Re: Any one has solution to exception "Cross-thread operation not valid"?
Hatem Mostafa
22:09 23 Feb '09  
try this only
ComboBox.CheckForIllegalCrossThreadCalls = false;
General: It shows list of error in the mainform design->errors
vibhav010
16:08 30 Jan '09  
Whenever i click on go button it always gives me errors in i.e.
ID Date error
1 1/31/2009 www.cnn.com

Is there any change in port number or any other setting?
thanks
General: Demo 'works' but Compiled Project doesn't
rowifi
14:16 31 Dec '08  
I like this code to learn from, but the Demo program only begins to work after it is 'paused' then 'restarted'.

The source code - imported into latest C# gives warnings about deprecated methods, but compiles ok apart from one cross thread error. Commenting that out - the program 'works' but no web sites are captured or parsed - no messages at all.

I've been trying to work through it, but would appreciate any clues!

Thanks
Rob
General: Re: Demo 'works' but Compiled Project doesn't
ionden
4:03 30 Sep '09  
Try moving the line:
nThreadCount = value;

before starting the threads (before 'for')
General: Is there any option to set the Crawler property to "True" while making the Web Request?
Nagaraj Muthuchamy
0:05 15 Dec '08  
Thanks,
nagaraj

Last Updated 19 Mar 2006 | Copyright © CodeProject, 1999-2010