Introduction
A web crawler (also known as a web spider or ant) is a program that browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches. Crawlers can also automate maintenance tasks on a web site, such as checking links or validating HTML code, and they can gather specific types of information from web pages, such as harvesting e-mail addresses (usually for spam).
Crawler Overview
In this article, I introduce a simple Web crawler, written in C#, to illustrate how crawling works. The crawler borrows the interface of an Internet browser to keep things simple: the user just types the URL to be crawled in the navigation bar and clicks "Go".

The crawler has a URL queue that is equivalent to the URL server in any large-scale search engine. Multiple threads fetch URLs from this queue, and the retrieved pages are saved in a storage area, as shown in the figure.
The fetched URLs are requested from the Web using the C# Sockets library, to avoid the blocking I ran into with other C# libraries. The retrieved pages are parsed to extract new URL references, which are put back in the crawler queue, up to a certain depth defined in the Settings.
In the next sections, I will describe the program views, and discuss some technical points related to the interface.
Crawler Views
My simple crawler contains three views for following the crawling process, checking the details, and viewing the crawling errors.
Threads view
Threads view is a window that shows the user what each thread is doing. Each thread takes a URI from the URI queue and opens a connection to download the URI object, as shown in the figure.
Requests view
Requests view displays a list of the recent requests downloaded in the threads view, as in the following figure:

This view enables you to watch each request header, like:
GET / HTTP/1.0
Host: www.cnn.com
Connection: Keep-Alive
You can watch each response header, like:
HTTP/1.0 200 OK
Date: Sun, 19 Mar 2006 19:39:05 GMT
Content-Length: 65730
Content-Type: text/html
Expires: Sun, 19 Mar 2006 19:40:05 GMT
Cache-Control: max-age=60, private
Connection: keep-alive
Proxy-Connection: keep-alive
Server: Apache
Last-Modified: Sun, 19 Mar 2006 19:38:58 GMT
Vary: Accept-Encoding,User-Agent
Via: 1.1 webcache (NetCache NetApp/6.0.1P3)
It also lists the URLs found in the downloaded page:
Parsing page ...
Found: 356 ref(s)
http://www.cnn.com/
http://www.cnn.com/search/
http://www.cnn.com/linkto/intl.html
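The "Found: N ref(s)" step can be pictured with a short sketch. The article does not show its parser, so the class name `LinkExtractor`, the method `Extract`, and the regular expression below are illustrative assumptions rather than the crawler's actual code. A regex is also a simplification: it will miss unquoted attributes and links built by scripts.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not taken from the article's source code.
public static class LinkExtractor
{
    // Matches href="..." or href='...' regardless of case.
    private static readonly Regex HrefPattern = new Regex(
        "href\\s*=\\s*[\"'](?<url>[^\"']+)[\"']",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns the absolute http/https URLs referenced by the page,
    // in document order, without duplicates and without #fragments.
    public static List<Uri> Extract(Uri pageUri, string html)
    {
        var found = new List<Uri>();
        var seen = new HashSet<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            Uri absolute;
            // Resolve relative references such as "/search/" against the page URL.
            if (!Uri.TryCreate(pageUri, m.Groups["url"].Value, out absolute))
                continue;
            // Skip mailto:, javascript:, ftp: and other non-HTTP schemes.
            if (absolute.Scheme != Uri.UriSchemeHttp &&
                absolute.Scheme != Uri.UriSchemeHttps)
                continue;
            // Drop the fragment so "page.html#a" and "page.html#b" count once.
            string key = absolute.GetLeftPart(UriPartial.Query);
            if (seen.Add(key))
                found.Add(new Uri(key));
        }
        return found;
    }
}
```

Calling `LinkExtractor.Extract(pageUri, html)` on a downloaded page would give the list of references to feed into the crawler queue.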
Crawler Settings
Crawler settings are not complicated; they are options selected from many working crawlers on the market, such as supported MIME types, download folder, number of working threads, and so on.
MIME types
MIME types are the content types that the crawler is allowed to download, and the crawler ships with a default list. The user can add, edit, and delete MIME types, or choose to allow all MIME types, as in the following figure:

Output
Output settings include the download folder, and the number of requests to keep in the requests view for reviewing request details.

Connections
Connection settings contain:
- Thread count: the number of concurrent working threads in the crawler.
- Thread sleep time when the refs queue is empty: the time that each thread sleeps when the refs queue is empty.
- Thread sleep time between two connections: the time that each thread sleeps after handling any request, which is a very important value to prevent hosts from blocking the crawler due to heavy loads.
- Connection timeout: the send and receive timeout for all crawler sockets.
- Navigate through pages to a depth of: the depth of navigation in the crawling process.
- Keep same URL server: limits the crawling process to the host of the original URL.
- Keep connection alive: keeps the socket connection open for subsequent requests to avoid reconnection time.

Advanced
The advanced settings contain:
- Code page used to encode the downloaded text pages.
- A user-defined list of restricted words, to let the user block any bad pages.
- A user-defined list of restricted host extensions, to avoid being blocked by these hosts.
- A user-defined list of restricted file extensions, to avoid parsing non-text data.
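As a rough illustration of how such restriction lists might be applied, here is a minimal sketch. The class `CrawlFilter`, its members, and the sample list entries are hypothetical and chosen only for the example; the article does not show how the crawler stores or checks these settings.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical filter; names and default entries are assumptions.
public class CrawlFilter
{
    // File extensions that should not be downloaded or parsed.
    public HashSet<string> RestrictedFileExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".gif", ".jpg", ".zip", ".exe" };

    // Host name endings that the crawler should stay away from.
    public List<string> RestrictedHostExtensions = new List<string> { ".gov" };

    // Words that mark a page as unwanted.
    public List<string> RestrictedWords = new List<string> { "casino" };

    // True when the URL passes the host and file extension checks.
    public bool IsUrlAllowed(Uri uri)
    {
        string extension = Path.GetExtension(uri.AbsolutePath);
        if (RestrictedFileExtensions.Contains(extension))
            return false;
        foreach (string suffix in RestrictedHostExtensions)
        {
            if (uri.Host.EndsWith(suffix, StringComparison.OrdinalIgnoreCase))
                return false;
        }
        return true;
    }

    // True when the page text contains none of the restricted words.
    public bool IsTextAllowed(string pageText)
    {
        foreach (string word in RestrictedWords)
        {
            if (pageText.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0)
                return false;
        }
        return true;
    }
}
```

Checking the URL before it is queued and the text after it is downloaded keeps the two kinds of restriction separate, which is one possible design among several.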

Points of Interest
- Keep Alive connection:
Keep-Alive is a request from the client to the server to keep the connection open after the response is finished, so that it can serve subsequent requests. The client asks for it by adding an HTTP header to the request, as in the following request:
GET /CNN/Programs/nancy.grace/ HTTP/1.0
Host: www.cnn.com
Connection: Keep-Alive
The "Connection: Keep-Alive" header asks the server not to close the connection. The server may keep it open or close it, but it should tell the client its decision in the reply. The server agrees to keep the connection open by including "Connection: Keep-Alive" in its reply, as follows:
HTTP/1.0 200 OK
Date: Sun, 19 Mar 2006 19:38:15 GMT
Content-Length: 29025
Content-Type: text/html
Expires: Sun, 19 Mar 2006 19:39:15 GMT
Cache-Control: max-age=60, private
Connection: keep-alive
Proxy-Connection: keep-alive
Server: Apache
Vary: Accept-Encoding,User-Agent
Last-Modified: Sun, 19 Mar 2006 19:38:15 GMT
Via: 1.1 webcache (NetCache NetApp/6.0.1P3)
Or it can tell the client that it refuses, as follows:
HTTP/1.0 200 OK
Date: Sun, 19 Mar 2006 19:38:15 GMT
Content-Length: 29025
Content-Type: text/html
Expires: Sun, 19 Mar 2006 19:39:15 GMT
Cache-Control: max-age=60, private
Connection: Close
Server: Apache
Vary: Accept-Encoding,User-Agent
Last-Modified: Sun, 19 Mar 2006 19:38:15 GMT
Via: 1.1 webcache (NetCache NetApp/6.0.1P3)
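The exchange above can be sketched in a few lines. The helper class `KeepAliveHelper` and its two methods are hypothetical names used for illustration, not members of MyWebRequest or MyWebResponse. The sketch also assumes HTTP/1.0 behavior, where a missing Connection header means the server will close the connection.

```csharp
using System;

// Hypothetical helper illustrating the Keep-Alive negotiation.
public static class KeepAliveHelper
{
    // Builds a minimal HTTP/1.0 GET request with the chosen Connection header.
    public static string BuildRequest(Uri uri, bool keepAlive)
    {
        return "GET " + uri.PathAndQuery + " HTTP/1.0\r\n" +
               "Host: " + uri.Host + "\r\n" +
               "Connection: " + (keepAlive ? "Keep-Alive" : "Close") + "\r\n" +
               "\r\n";
    }

    // Reads the server's decision from the raw response header.
    public static bool ServerKeepsAlive(string responseHeader)
    {
        string[] lines = responseHeader.Split(
            new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string line in lines)
        {
            int colon = line.IndexOf(':');
            if (colon <= 0)
                continue; // the status line has no "name: value" form
            string name = line.Substring(0, colon).Trim();
            // Only the Connection header counts here, not Proxy-Connection.
            if (!name.Equals("Connection", StringComparison.OrdinalIgnoreCase))
                continue;
            string value = line.Substring(colon + 1).Trim();
            return value.Equals("keep-alive", StringComparison.OrdinalIgnoreCase);
        }
        // Assumption: with HTTP/1.0, no Connection header means "close".
        return false;
    }
}
```

A caller would reuse the socket for the next request only when `ServerKeepsAlive` returns true, and close it otherwise.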
- WebRequest and WebResponse problems:
When I started writing the code for this article, I was using the WebRequest and WebResponse classes, as in the following code:
WebRequest request = WebRequest.Create(uri);
WebResponse response = request.GetResponse();
Stream streamIn = response.GetResponseStream();
BinaryReader reader = new BinaryReader(streamIn, TextEncoding);
byte[] RecvBuffer = new byte[10240];
int nBytes, nTotalBytes = 0;
while((nBytes = reader.Read(RecvBuffer, 0, 10240)) > 0)
{
    nTotalBytes += nBytes;
    ...
}
reader.Close();
streamIn.Close();
response.Close();
This code works well, but it has a very serious problem: the WebRequest method GetResponse appeared to block all other threads until the retrieved response was closed, as in the last line of the previous code. I noticed that only one thread was ever downloading while the others were waiting in GetResponse. To solve this serious problem, I implemented my own two classes, MyWebRequest and MyWebResponse.
MyWebRequest and MyWebResponse use the Socket class to manage connections. They are similar to WebRequest and WebResponse, but they support concurrent responses. In addition, MyWebRequest has a built-in flag, KeepAlive, to support Keep-Alive connections.
So, my new code looks like this:
request = MyWebRequest.Create(uri, request, KeepAlive);
MyWebResponse response = request.GetResponse();
byte[] RecvBuffer = new byte[10240];
int nBytes, nTotalBytes = 0;
while((nBytes = response.socket.Receive(RecvBuffer, 0,
    10240, SocketFlags.None)) > 0)
{
    nTotalBytes += nBytes;
    ...
    if(response.KeepAlive && nTotalBytes >= response.ContentLength
        && response.ContentLength > 0)
        break;
}
if(response.KeepAlive == false)
    response.Close();
Just replace GetResponseStream with direct access to the socket member of the MyWebResponse class. To make that work, I used a simple trick so that the next socket read starts right after the reply header: the header is read one byte at a time until the terminating blank line is found, as in the following code:
Header = "";
byte[] bytes = new byte[10];
while(socket.Receive(bytes, 0, 1, SocketFlags.None) > 0)
{
    Header += Encoding.ASCII.GetString(bytes, 0, 1);
    if(bytes[0] == '\n' && Header.EndsWith("\r\n\r\n"))
        break;
}
So, the user of the MyWebResponse class just continues receiving from the first position of the page.
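Once the header text has been collected, the download loop needs values such as ContentLength from it. The article does not show how MyWebResponse does this, so the following is only a sketch under that assumption; the class `ParsedHeader` and its members are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical parser for the raw header string read from the socket.
public class ParsedHeader
{
    public int StatusCode;
    public int ContentLength = -1; // -1 when the server sent no Content-Length
    public Dictionary<string, string> Fields =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

    public static ParsedHeader Parse(string header)
    {
        var result = new ParsedHeader();
        string[] lines = header.Split(
            new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries);
        if (lines.Length == 0)
            throw new FormatException("Empty response header");

        // Status line, for example "HTTP/1.0 200 OK".
        string[] status = lines[0].Split(' ');
        if (status.Length < 2 || !int.TryParse(status[1], out result.StatusCode))
            throw new FormatException("Malformed status line: " + lines[0]);

        // Remaining lines have the form "Name: value".
        for (int i = 1; i < lines.Length; i++)
        {
            int colon = lines[i].IndexOf(':');
            if (colon <= 0)
                continue;
            string name = lines[i].Substring(0, colon).Trim();
            string value = lines[i].Substring(colon + 1).Trim();
            result.Fields[name] = value; // a repeated header keeps its last value
        }

        string length;
        int parsed;
        if (result.Fields.TryGetValue("Content-Length", out length) &&
            int.TryParse(length, out parsed))
            result.ContentLength = parsed;
        return result;
    }
}
```

With the length known, a Keep-Alive download can stop as soon as the announced number of body bytes has arrived instead of waiting for the server to close the socket.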
- Thread management:
The number of threads in the crawler is user defined through the settings. Its default value is 10 threads, but it can be changed from the Settings tab, Connections. The crawler code handles this change using the property ThreadCount, as in the following code:
private int nThreadCount;
private int ThreadCount
{
    get { return nThreadCount; }
    set
    {
        Monitor.Enter(this.listViewThreads);
        try
        {
            for(int nIndex = 0; nIndex < value; nIndex ++)
            {
                if(threadsRun[nIndex] == null ||
                    threadsRun[nIndex].ThreadState != ThreadState.Suspended)
                {
                    threadsRun[nIndex] = new Thread(new
                        ThreadStart(ThreadRunFunction));
                    threadsRun[nIndex].Name = nIndex.ToString();
                    threadsRun[nIndex].Start();
                    if(nIndex == this.listViewThreads.Items.Count)
                    {
                        ListViewItem item =
                            this.listViewThreads.Items.Add(
                                (nIndex+1).ToString(), 0);
                        string[] subItems = { "", "", "", "0", "0%" };
                        item.SubItems.AddRange(subItems);
                    }
                }
                else if(threadsRun[nIndex].ThreadState ==
                    ThreadState.Suspended)
                {
                    ListViewItem item = this.listViewThreads.Items[nIndex];
                    item.ImageIndex = 1;
                    item.SubItems[2].Text = "Resume";
                    threadsRun[nIndex].Resume();
                }
            }
            nThreadCount = value;
        }
        catch(Exception)
        {
        }
        Monitor.Exit(this.listViewThreads);
    }
}
If ThreadCount is increased by the user, the code creates new threads or resumes suspended threads. Otherwise, the system leaves the job of suspending the extra working threads to the threads themselves, as follows. Each working thread has a name equal to its index in the thread array. If the thread's name value is greater than ThreadCount, it finishes its current job and then goes into suspension mode.
- Crawling depth:
This is the depth to which the crawler follows links during navigation. Each URL has a depth equal to its parent's depth plus one, with depth 0 for the first URL entered by the user. URLs fetched from any page are inserted at the end of the URL queue, which gives "first in, first out" processing. Any thread can insert into the queue at any time, as shown in the following code:
void EnqueueUri(MyUri uri)
{
    Monitor.Enter(queueURLS);
    try
    {
        queueURLS.Enqueue(uri);
    }
    catch(Exception)
    {
    }
    Monitor.Exit(queueURLS);
}
And each thread can retrieve the first URL in the queue to request it, as shown in the following code:
MyUri DequeueUri()
{
    Monitor.Enter(queueURLS);
    MyUri uri = null;
    try
    {
        uri = (MyUri)queueURLS.Dequeue();
    }
    catch(Exception)
    {
    }
    Monitor.Exit(queueURLS);
    return uri;
}
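To make the depth rule concrete, here is a minimal sketch of a queue that records each URL's depth and refuses entries beyond a limit. The classes `CrawlItem` and `CrawlQueue` are hypothetical and stand in for MyUri and queueURLS. The duplicate check is an addition of this sketch and is not described in the article. The `lock` statement is used as shorthand for Monitor.Enter and Monitor.Exit with the exit guaranteed even if an exception is thrown.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical queue entry: a URL plus its distance from the start URL.
public class CrawlItem
{
    public readonly Uri Uri;
    public readonly int Depth;

    public CrawlItem(Uri uri, int depth)
    {
        Uri = uri;
        Depth = depth;
    }
}

// Hypothetical thread-safe FIFO queue with a depth limit.
public class CrawlQueue
{
    private readonly Queue<CrawlItem> queue = new Queue<CrawlItem>();
    private readonly HashSet<string> seen = new HashSet<string>();
    private readonly object sync = new object();
    private readonly int maxDepth;

    public CrawlQueue(int maxDepth)
    {
        this.maxDepth = maxDepth;
    }

    // Adds a URL found at the given depth. Returns false when the URL is
    // deeper than the limit or was already queued (an assumption of this sketch).
    public bool Enqueue(Uri uri, int depth)
    {
        if (depth > maxDepth)
            return false;
        lock (sync)
        {
            if (!seen.Add(uri.AbsoluteUri))
                return false;
            queue.Enqueue(new CrawlItem(uri, depth));
            return true;
        }
    }

    // Returns the oldest queued item, or null when the queue is empty.
    public CrawlItem Dequeue()
    {
        lock (sync)
        {
            return queue.Count > 0 ? queue.Dequeue() : null;
        }
    }
}
```

A worker thread would dequeue an item, download and parse it, and then call `Enqueue(childUri, item.Depth + 1)` for every reference it finds.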
References
- Web crawler from Wikipedia, the free encyclopedia.
- RFC766.
Thanks to...
Thanks god!
|
Can anybody tell me how to write a simple parser? I am new to C#. I want to write a parser that can parse a website so that I can save specific information in my database. I am using Visual Studio 2005 and SQL Server 2005. I tried it but failed.
My code is given below; any help would be greatly appreciated. This code runs when we click the Start button after entering the URL of the website to be parsed. Thanks.
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Text;
using System.Windows.Forms;
using System.IO;
using System.Net;
using System.Data.SqlClient;

SqlConnection conn = null;

public parser()
{
    InitializeComponent();
}

private void parser_Load(object sender, EventArgs e)
{
    try
    {
        conn = new SqlConnection();
        conn.ConnectionString = "Data Source=localhost;User Id=sa; Password=indian; Initial Catalog=pizzahut";
        conn.Open();
    }
    catch(Exception exp)
    {
        exp.ToString();
    }
}

private String getMethodResponse(string URL)
{
    string responseFromServer = null;
    try
    {
        WebRequest request = WebRequest.Create(URL);
        Stream dataStream = null;
        WebResponse response = request.GetResponse();
        dataStream = response.GetResponseStream();
        // Open the stream using a StreamReader for easy access.
        StreamReader reader = new StreamReader(dataStream);
        // Read the content.
        responseFromServer = reader.ReadToEnd();
        // Display the content.
        Console.WriteLine(responseFromServer);
        // Clean up the streams.
        reader.Close();
        dataStream.Close();
        response.Close();
    }
    catch (Exception exp)
    {
        exp.ToString();
    }
    return responseFromServer;
}

private void writeToFile(string dataToWrite)
{
    System.IO.StreamWriter file = new System.IO.StreamWriter("test.html");
    file.WriteLine(dataToWrite);
    file.Close();
}

private void btnStart_Click(object sender, EventArgs e)
{
    string responseFromServer = null;
    txtURL.Enabled = false;
    if (txtURL.Text != "")
    {
        responseFromServer = getMethodResponse(txtURL.Text);
    }
    else
    {
        MessageBox.Show("You must enter a URL.", "URL Entry Error", MessageBoxButtons.OK, MessageBoxIcon.Exclamation);
    }
    if (responseFromServer != null)
    {
        // writeToFile(responseFromServer);
        startParser(responseFromServer);
    }
    txtURL.Enabled = true;
}

private void startParser(string data)
{
    string str = data;
    insertIntoDatabase();
    while (str.IndexOf("class=\"tfontb slideshow\">") != -1)
    {
        str = str.Substring(data.IndexOf("class=\"tfontb slideshow\">") + "class=\"tfontb slideshow\">".Length);
        // parse all fields to be insert in database
        // Insert into database
    }
}

private void insertIntoDatabase()
{
    string query = "INSERT INTO [pizzahut].[dbo].[add_detail] ([address],[area] ,[city],[contactnumber],[delivery],[dinin],[carryout] ,[amexrewards],[birthdaybang])";
    query += " VALUES ";
    query += " ('B-62, National Apt','Dwarka','New Delhi','997176',1 ,1 ,0 ,0,1 )";
    SqlCommand cmd = new SqlCommand(query, conn);
    cmd.ExecuteNonQuery();
}
}
}
|
I am a newbie and I admire your article. But I have a problem: although you list all the source code, I spent a lot of time reading your article.
We can't know in advance how many hyperlinks one page includes, so how can I distribute the hyperlinks among the multiple threads?
Or is there another solution?
|
Sir, I'm afraid I can't help you much, as I have forgotten everything related to the article; I left C# 5 years ago. But I'll try.
Suppose that the thread that parses a page puts the found URLs in a URL server, and each URL takes a serial ID. Then a thread manager can control which thread takes which URL. If you have N threads and the current URL number is m, we can choose the suitable thread like this:
nThreadIndex = m % N;
This gives load balancing, as the first N URLs will be distributed across all threads.
Tell me if that answers your question.
Thanks, Hatem
|
At work a proxy is used to connect to the internet, and the tool is unable to connect to the URL. Do we have to configure the tool? I looked, but there were no options. The error is "No such host is known" when we type cnn.com.
|
The problem with HttpWebRequest mentioned in this article can be solved. It is true that by default only two connections are allowed. However, you can open more simultaneous connections by raising ServicePointManager.DefaultConnectionLimit. It is a little obscure.
|
Yes, you are right. But the article was written a long time ago, when I think it was undocumented.
Thanks
|
Hello everyone! Could anyone please tell me which part of the code should I change to get for each visited page the links that this page contains?
I would like to populate a database but it really doesn't matter for now. I could start by writing them to a text file.
All you have to decide is what to do with the time that is given to you.-
|
Hi Hatem,
I downloaded your Net Crawler - good work! Would it be possible to rewrite the app with the following functionality?
- To crawl only the first page of a website
- And additionally to crawl the disclaimer page of the website
Thanks Jan
|
Cross-thread operation not valid: Control 'comboBoxWeb' accessed from a thread other than the thread it was created on.
I get the above message when I try to build this under VS 2008. Any suggestions?
Saeed Darya
|
This happens because Mustafa uses the worker threads to access the members of the form. (The form lives in a separate UI thread.) It is possible to disable the exception (check MSDN). Some people might say the worker threads should not generally interact with the UI controls, but fixing that may require large changes to the code base.
|
Hi, if you have a cross-thread problem: in VS2005 or later you are forced to fix it. If you want the code to work as it did in VS2003, you can add this line to the main class constructor:
CheckForIllegalCrossThreadCalls = false;
Good luck, Mehreeizad
|
Hello, great post. Just a small problem that I had: when downloading a Flash file (".swf"), the loop that searches for the end of the header runs infinitely. Thanks again for your effort.
jinji
|
I mean, I want to download the ASP forms after I log in to the web site. I don't know how to keep the session in the program. Could you help me with it? Thanks in advance.
Jami
|
Some websites return "The remote server returned an error: (407) Proxy Authentication Required." How can I solve it? Thanks!
|
Hi, the 407 error relates to proxy authentication (your proxy); you need to add your credentials before opening any connection.
System.Net.NetworkCredential nc = new System.Net.NetworkCredential("userName", "password", "domain");
System.Net.WebProxy wp = new System.Net.WebProxy("webProxyAdress", 1080);
wp.Credentials = nc;
System.Net.WebRequest myWebRequest = System.Net.WebRequest.Create("http://someadress");
myWebRequest.Proxy = wp;
//start request jobs
//.
//.
//.
//end jobs
If it doesn't work, it means that you need to configure Windows for a proxy: proxycfg -i, which will read the parameters from the IE properties.
Otherwise, I guess that you have to connect to the proxy first using sockets, authenticate, and make your web requests using sockets.
Am I wrong?
:: YOU make history ::
|
Hello Hatem,
Can you provide a simple example based on your source, "A Simple Crawler Using C# Sockets", that just shows how to make a multithreaded C# application that downloads web files/links and takes the URLs from a text file (comma separated, for example)? (No parser included.)
Thank you, YA
|
Hi,
Is it possible to use the same socket for more than one URL if the host is the same?
For example:
http://www.freshersworld.com/
http://www.freshersworld.com/imgnew/stylz.css
http://www.freshersworld.com/imgnew/logo.gif
http://www.freshersworld.com/imgnew/h0.gif
http://www.freshersworld.com/imgnew/s1.gif
I am processing each request as an individual GET.
But when I tried, my connection was automatically closed after processing the first request, even though the server's response header for Connection is keep-alive.
Please Help.
Thanks
Sakthi