Download source
Suggestions have been incorporated. Kindly Suggest, Vote, Comment to improve it
Introduction
*All the code examples are for learning purpose. Any misuse is not encouraged.
* Project with Source Code of most of the Examples has been added.
Web Scraping involves obtaining information of interest from the webpages. I tried to make a step by step guide starting from basic of webscraping using WebBrowser
to a little bit advance topics like performing login and maintaining sessions via HTTPWebRequest
. This is the first release of the article and there may be errors/mistakes. I welcome all the suggestions and would try to include them ASAP.
I have used the tutorial based Step by Step approach and web scrapping work starts from the first line of the tutorial. I have taken Example/Task
Oriented method
to keep it interested, 2-3 examples are followed by 2-3 Tasks to keep the learner motivated. I am assuming the users have basic knowledge of C# and Visual Studio Programming Environment.
Contents
The Contents I have covered are:
- WebBrowser
- WebBrowser Download Event
- Navigating To Olx's first Page
- Accessing All Adds Shown at ...
- Yahoo Signin Form Filling & Submission
- Modifying WebBrowser Headers
- Saving All Images of a WebPage
- Solving Captcha Using API ...
- Setting Proxy For WebBrowser
- Regular Expressions
- Finding a Number in Text
- Regex Operators
- Finding Words in a Sentence
- Numbers of Format ddd-ddddd
- Finding Email Addresses in Text
- Finding IP Addresses in Text
- A Regex Utility
- WebClient
- Downloading HTML as String
- Downloading & Saving an Image
- Blocking Mode of WebClient
- Non - Blocking Mode of WebClient
- Read / Write Streams
- Query String for WebClient
- Uploading File to URL
- Little more about WebClient ...
- BackGroundWorker
- Running Time Consuming Function
- Work Completion Report
- Updating the Progress
- Stopping the Worker
- Multi Threaded App to Download Images ...
- HttpWebRequest/HttpWebResponse
- HTTP Request Headers
- How the Sessions Work
- HTTP Response Headers
- Mozilla Live HTTP Headers
- User Agent Strings
- Getting Facebook Login Page HTML ...
- Performing Login by HTTP requests
- Custom HTTPWebRequest for Login...
- Understanding HTML Form Get/POST
- Getting Form Hidden Fields
- Preparing HTTP POST Data
- Picture Upload by HTTP to Facebook ...
And a lot of relevant Tasks to keep the learner motivated to explore his innovation.
WebBrowser Control
Top This control provides a built in full browser as a control. It enables the user to navigate Web pages inside your form.
Example: WebBrowser Download Event
Add WebBrowser Control to the Form. Make it Dock in Parent Container
Double click the WebBowser Control to Add WebDocumentCompleted Event
Navigate Function is used to navigate to the given address
Document completed event is fired once the document is completed loaded
Now Run the program
There are many solutions available to solve the above problem, like counting i-frames and then counting number of times the Document Completed Event Fire. This is pretty much complex and the easiest one is to Maintain History.
-
Add a List<string>
hist to the program, and modify the Document Competed Event as below:
Example: Navigate to OLX’s 1st Add’s Page
Before making any web scrapper, click bot etc,
understanding of that website’s layout is necessary. After that following is important
Finding Fields of Interest
-
Narrowing Down the Text of Interest
Finding tags with ids near the interested tags
First Install Mozilla Firefox 15. Navigate to
http://www.olx.com/cars-cat-378
Write click on the First Add Link and click on the Inspect Element.
You will see something like the image below
To Visit the 1st Add, we needs its link address.
The anchor tag highlighted in above picture has no id, so if we use
GetElementByTag(“a”)
function, we will get a list of all the anchor tags, which
will include links of other pages of olx, help, contact us etc.(so its not good option)
So, try to Find the nearest Tag which has ID.
On the Tags Bar, Keep Selecting Tags toward the Left until you find some tag with ID
Once you reach the div tag with id the-list, you will see it is the container for all the Adds Links
So all the anchor tags in div#the-list, are links to the individual add pages
Following is the code to get it programmatically.
HtmlElement he = webBrowser1.Document.GetElementById("the-list");
HtmlElementCollection hec = he.GetElementsByTagName("a");
We want to navigate to 1st adds page
webBrowser1.Navigate(hec[0].GetAttribute("href"));
Example: Navigate to All the Adds shown on
http://www.olx.com/cars-cat-378
You have seen how to Navigate to the 1st Add.
To navigate to all the Pages, We need to store all the Adds Links in a List, so that later on we can visit those Adds
To Do This Make A List That Stores href values of all the Add’s Links
List<string>
urls = new List<string>();
Modify the Document Completed Event to add all the links to the URLs
HtmlElement he = webBrowser1.Document.GetElementById("the-list");
HtmlElementCollection hec = he.GetElementsByTagName("a");
foreach(HtmlElement a in hec)
{
string href = a.GetAttribute("href");
if(href != "http://www.olx.com/cars-cat-378")
{
if(!urls.Contains(href))
urls.Add(href);
}
}
Why we are checking href != "http://www.olx.com/cars-cat-378" ?
Because each individual Add Block contains a Link to the page on which
it is being shown (that means to make accurate scrapper, you need to understand well what all is there and where is it).
All the links are stored in urls list, now we need to make the browser automatically
navigate to all of these
if(urls.Count > 0)
{
string u = urls[0];
urls.RemoveAt(0);
webBrowser1.Navigate(u);
this.Text = "Links Remaining" + urls.Count.ToString();
}
else
{ MessageBox.Show("Complete"); }
Task 1: Modify The Above Code, make it browse next pages
Like: http://www.olx.com/cars-cat-378-p-2
http://www.olx.com/cars-cat-378-p-3
http://www.olx.com/cars-cat-378-p-4
and so on
Task 2: On Each Add Page, scrape owner name and Number(if given)
Task 3: Make The App, which scrapes specified number of individual adds from the given url of olx categorey.
Example: Yahoo Signin Form Filling and Submission
Navigate to http://mail.yahoo.com/
Check the ID of the username and password textboxes (Use Inspect Element)
Make a Button Click Event in the app
htmlElement hu = webBrowser1.Document.GetElementById("username");
hu.Focus();
hu.SetAttribute("Value","userName");
HtmlElement hp = webBrowser1.Document.GetElementById("passwd");
hp.Focus();
hp.SetAttribute("Value", "password");
For Sign in Button Click (actually we need to submit the form, so find Signin Form ID. Get its Element, and invoke submit function on it
HtmlElement hf = webBrowser1.Document.GetElementById("login_form");
hf.InvokeMember("submit");
Task 1: Findout how to Select Value of Dropdown List, CheckBox, Radio Button. You can try Filling Yahoo Signup Page
Task 2: Perform Click on the Hyperlink
Example: In WebBrowser Control We can Add/Change the Headers. The Most important Header’s are Referrer and User-Agent.
User Agent header tells the Web Server about the Browser From which the Request was sent
Referrer Tells the Web Server, that From which web page the user was sent to the current web page
To Change User-Agent Header
webBrowser1.Navigate("url", "_blank", null, "Referrer: sample user agent");
Task1: Browse to webBrowser1.Navigate("logme.mobi");
To see HTTPHeaders, then try modifying your User-Agent and Referrer
You can get a Complete List of User Agent Strings at
http://www.useragentstring.com/pages/useragentstring.phpTask2: Vist www.google.com in C# app, with some Apple, Linux browser User Agent. Get Google Search Results non-javscript page.
Example: Saving All the Images of the Web Page
Add Reference to using mshtml;
You can use Yahoo Sign up Page for Practice
IHTMLDocument2 doc = (IHTMLDocument2)webBrowser1.Document.DomDocument;
IHTMLControlRange imgRange = (IHTMLControlRange)((HTMLBody)doc.body).createControlRange();
foreach (IHTMLImgElement img in doc.images) {
imgRange.add((IHTMLControlElement)img);
imgRange.execCommand("Copy", false, null);
try{
using(Bitmap bmp = (Bitmap)Clipboard.GetDataObject().GetData(DataFormats.Bitmap))
bmp.Save(img.nameProp + ".jpg");
}
catch (System.Exception ex)
{
MessageBox.Show(ex.Message);
}
}
The Above code will save all the images of the Webpage in current directory
Task: Find Pattern in the captcha name, modify the code to only save captcha
Example: Solving Captcha using DeathByCaptcha Api
Add Reference to using DeathByCaptcha;
Following code solves the captcha.
Client client = (Client)new SocketClient(capUser, capPwd);
try
{
Captcha captcha = client.Decode(path + capName, 50);
if (null != captcha)
{
MessageBox.Show(captcha.Text);
}
else
{
}
}
catch(DeathByCaptcha.Exception ex) {
MessageBox.Show(ex.Message);
}
Study How to Report that Captcha.Text was wrong
Example: Setting Prxoy For WebBrowser
using Microsoft.Win32;
RegistryKey reg = Registry.CurrentUser.OpenSubKey(
"Software\\Microsoft\\Windows\\CurrentVersion\\InternetSettings", true);
registry.SetValue("ProxyEnable", 1);
registry.SetValue("ProxyServer", "192.168.1.1:9876");
Regex
Top A concise and flexible means of matching strings in the text
In C#, Regex, Match, MatchCollection classes are used for finding string patterns. These Clasess are in following Namespace.
using System.Text.RegularExpressions;
Example 1: Finding a Number in a text
Following are the Regex Operators:
[xyz]
| A character set. Matches any one of the enclosed characters. For example, "[abc]" matches the "a" in "plain".
|
[^xyz]
|
A negative character set. Matches any character not enclosed. For example, "[^abc]" matches the "p" in "plain".
|
[a-z]
| A range of characters. Matches any character in the specified range. For
example, "[a-z]" matches any lowercase alphabetic character in the range "a" through "z".
|
[^m-z]
| A negative range characters. Matches any character not in the specified
range. For example, "[m-z]" matches any character not in the range "m" through "z".
|
*
| Matches the preceding character zero or more times. For example, "zo*" matches either "z" or "zoo".
|
+
| Matches the preceding character one or more times. For example, "zo+" matches "zoo" but not "z".
|
?
| Matches the preceding character zero or one time. For example, "a?ve?" matches the "ve" in "never".
|
.
| Matches any single character except a newline character.
|
Example 2: Finding Words in a sentence
The last word sentence is not in the match list, as it didnt have space after it.
{n}
| n is a non-negative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob," but matches the first two o's in "foooood".
|
{n,}
| n is a non-negative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob" and matches all the o's in "foooood." "o{1,}" is equivalent to "o+". "o{0,}" is equivalent to "o*".
|
{n,m}
| m and n are non-negative integers. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood." "o{0,1}" is equivalent to "o?".
|
Example: Matching Telephone Number of the format ddd-ddddd. Where d means digit
Example: Finding email in the text
TheRegex of a normal email can be
"\b[A-Za-z0-9_]+@[A-Za-z0-9_]+\.[A-Za-z0-9]{2,4}\b"
Where \b defines a blank space [A-Za-z0-9_]+ defines username which may include repitition of anything from A-Z, 0-9, a-z and _(uderscore) @[A-Za-z0-9_]+ defines hosting company name, for example yahoo \. Defines dot(.) [A-Za-z0-9]{2,4} Matches Top Level doamin, like com, net, edu etc
Example: Regex to find IP address in the text
A basic regex to find ip address can be "\b[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+\b" But this wil match something of this type also 1925.68541.268.1 (that mean any number of deigits with 3 dots – and its not valid ip address)
An Other can be
"\b[0-9]{1-3}\.[0-9]{1-3}\.[0-9]{1-3}\.[0-9]{1-3}\b"
Now this wil not match a string whihc has more that 3 digits with dots. But it may matches 999.999.999.999 which is again invalid address
So a regex can be as complex as following
"b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\."+ "(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\."+"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\."+"(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b"
There is always a trade of between complexity and accuracy. So depending upon you input text you may give one dimension more important over other.
Task 1: Anchor Tags of HTML are of vital importance in scrapping. The value of the link is placed in the href attribute as shown below. Write regex to Find href value
Answer 1: The Regex to this can be as simple as
Regex r = new Regex("href=\"[^\"]+\"");
\” is used to define double quote(“) as double quote is a special character so it needs to be written with slash(\). Later on u can remove href=” and “ from the value of the match.
Task 2: Pins at Pinterest.com has following format of web addresses.
"/pin/125678645821924781/"
"/pin/63894888434527857/"
"/pin/25825397833321410/"
Write
Regex to find pins addresses from the html of the www.pinterest.com
Task 3: Thumbnail images of the Ads are shown at
http://www.olx.com/cars-cat-378
Write a Regex to Match Images Links
Task 4: Make Following Utility.
Load Input Text File
Press Load Button
Write Regex
Press Execute
All The Matches are show in Multiple Line Text Box with One Match Value Per Line
Top This class is found under using System.Net;. It provides various funtions to download files from the internet. It can be used to download HTML source of the webpages as string, as file. It supports downloading files as data bytes.
This class is very helpful in scrapping, as it lets the coder download only the html file where as using webbrowser for scrapping is simple, but not an efficient/speedy way.
Example: Dowloading Yahoo.com html source as string
Create a Button and TextBox on the form
In Button Click Event add the following code, and press the Button at run
WebClient wc = new WebClient();
textBox1.Text = wc.DownloadString("http://www.yahoo.com");
Here we are making Webclient Variable and then using its DownloadString method to download the html of the given url.
The downloaded html is shown in the textbox1
Benefits of using Webclient
Its easy to use
Supports Various Methods for file and string downloading
Efficient, uses much less bandwidth as compared WebBrowser
Once HTML source is downloaded, u can use Regex or 3rd party HTML Parsers to get required info from the HTML source.
Example 2: Downloading and Saving an Image
Add the Following Code to Button Click Event, and press the Button at run
WebClient wc = new WebClient();
wc.DownloadFile("http://www.dotnetperls.com/one.png", "one.png");
The image one.png will be downloaded and stored in the current directory
The 1st argument is the url of the image and 2nd is the name of the image
The same way, WebClient class provides methods for string and file uploading, but we wil use HTTPWebRequest class for that
wc.DownloadData()Method provides downloading the data as bytes. This is useful where differnent encoding is used like UTF8 etc
Example: Blocking Mode of Webclient
Adds Following code to the Button Click and press button at run
WebClient wc = new WebClient();
wc.DownloadData("http://www.olx.com");
Just after pressing buttion, try to move the Form, and Form will go to Not responding
Why is it so? The downloading string, file or data from internet is time consuming and webclient class is performaing download operation on the same Thread as the UI is. This causes UI to go Unresponsive
This mean, no other task can be performed by the App, once WebClient is downloading. This is blocking Mode. Solution to this problem is using WebClient in Non – Blocking Mode
Example: Non– blocking Mode of webclient
wc.DownloadStringAsync()isused to perform the download operation on a separate thread. This causes
the UI to remain responsive and can App can do other task meanwhile the downloading is performed
wc.DownloadStringAsync(new Uri("http://www.yahoo.com"));
This Downloads a String from resource, without blocking the calling thread.
To Perform Asynchronous Download operation, user needs to define Download Completed Event, so that calling thread can be informed once Downloading is Complete
In above case, we need to add DownloadStringCompleted Event.
Following Piece of code will cause the Webclient to Asynchronous Dwonload string and will fire Download Completed event on completion
WebClient wc = new WebClient();
wc.DownloadStringCompleted+=new DownloadStringCompletedEventHandler(wc_DownloadStringCompleted);
wc.DownloadStringAsync(new Uri("http://www.yahoo.com"));
-
The Downloaded string is passed as argument to the Download Complete Event and can be accessed by following way
void wc_DownloadStringCompleted(object sender, DownloadStringCompletedEventArgs e)
{
string html = e.Result;
textBox1.Text = html;
}
The Download Completed Event is Fired at Calling Thread, so u can easily Access UI elements
Example: Read / Write Streams
Webclient class provides various Blocing and Non Blocking Methods to Access the Stream for direct Read and Write Operations
Following Piece of code obtains read Stream in blocking mode
WebClient wc = new WebClient();
StreamReader sr = new StreamReader(wc.OpenRead("http://www.yahoo.com"));
The Same Way Write stream can be obtained for Write Related IO operations
Example: QueryString for Webclinet
Gets or sets a collection of query name/value pairs associated with the request.
Query String is helpful in sending the parametres to the url by url posting mothed
Search Result Page of Google has following format of Address
https://www.google.com.pk/search?q=search+phrase
In above url, 1 is a parameter and search+phras is its value
Following Example Shows how to use Query String for sending parameters and their values to a URL
string uriString = "http://www.google.com/search";
WebClient wc = new WebClient();
NameValueCollection myQSC = new NameValueCollection();
myQSC.Add("q", "Search Phrase");
wc.QueryString = myQSC;
wc.DownloadFile(uriString, "searchresult.htm");
NameValueCollection class is under System.Collections.Specialized
Example: Uplading File To the URL
String uriString = "FileUploadPagePath";
WebClient myWebClient = new WebClient();
string fileName = "File Path";
byte[] responseArray = myWebClient.UploadFile(uriString, fileName);
textBox1.Text = "Response Received. " + System.Text.Encoding.ASCII.GetString(responseArray);
Example: Additional Info for WebClient
Setting Proxy
wc.Proxy = new WebProxy("ip:port");
Adding Custom Headers
wc.Headers.Add(HttpRequestHeader.UserAgent, "user-agent");
Obtaining Respose Headers
WebHeaderCollection whc = wc.ResponseHeaders;
Task 1: Add Refrrer Header.
Task 2: Read Response Code and Status from Response Header
Task 3: What is BaseAddress of the WebClient
Task 4: Use WebClient.QueryString to Do Search on Google
Task 5: Use WebClient.Upload to upload some File
BackGroundWorker
Top This class provides an easy way to run time-consuming operations on a background thread. The BackgroundWorker class enables you to check the state of the operation and it lets you cancel the operation.
Example: Running Time Consuming Function on BackGroundWorker
For This Example We Are assuming Following Function a time consuming, and user need to run this function for various times which causes the UI to go unresponsive
private void HeavyFunction()
{
System.Threading.Thread.Sleep(1000);
}
Make a Form a shown Below with Start, Stop Button and Status Text. Add a BackGroundWorker from the Componnets to the Form
Create an event handler for the background worker's DoWork event. The DoWork event handler is where you run the time-consuming operation on the background thread. You can make this Event By Double Clicking in the Event Pane for BackGroundWorker
Any values that are passed to the background operation are passed in the Argument property of the DoWorkEventArgs object that is passed to the event handler.
Let's Call the HeavyFunction 5 times in the backgroundWorker1_DoWork Event
private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
for(int i = 0; i < 5; i++)
HeavyFunction();
}
To Start the BackGroundWorker Work, we need to call RunWorkerAsync() Function of the backgroundWorker1. Call it in the Start Button Click Event
private void Start_Click(object sender, EventArgs e)
{
backgroundWorker1.RunWorkerAsync();
}
Once, the Start Button will be Clicked, the BackgroundWorker will start working, But UI will remain responsive.
You have sucessfully learnt how to put Time Consuming Functions on Easily Maneged Separate Thread
RunWorkerCompleted Event is Fired Once The Work is Complete
The Event is Called on the Calle Thread(Thread From which the BackGroundWorker.RunWorkerAsync() was called). In our case, its UI thread
To be notified, About BackGroundWorker Completion, add the Event RunWorkerCompleted
RunWorkerCompleted Event Will be Fired on UI Thread, so we can easily Access All the UI Elements
private void backgroundWorker1_RunWorkerCompleted(object sender, RunWorkerCompletedEventArgs e)
{
Status.Text = "Work Complete";
}
Now After 5 seconds of Pressing Start Button, the Status Label Text Will be set to Work Complete
While Performing some Time Consuming Function on the BackGroundWorker, we may want to update the progress to the user. For example in a scenario of downloading several files, we may want to update UI to show how many files have been completed
To perform such update, ReportProgress Function is called which raises the PogressChanged Event on the calle Thread.
To Call Report Progrees, First you need to Add Progress Changed Event and set the WorkReportProgress Property to True
In Report Progress Method 2 arguments can be passed, int ProgresssPercentage and object UserState. These Both arguments are available in ProgressChangedEventArgs ProgressPercentage and UserState Properties
To Report Progress, Change the BackGroundWorker DoWork Event as Following
private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
for (int i = 0; i < 100; i++)
{
HeavyFunction();
backgroundWorker1.ReportProgress(i, " Heavy Function Done");
}
}
To update UI in ProgressChanged Event, modify it as following
private void backgroundWorker1_ProgressChanged(object sender, ProgressChangedEventArgs e)
{
Status.Text = e.ProgressPercentage.ToString() + (string)e.UserState;
}
Now Once you press the Start Button, the status will be updated with arguments passed in ReportProgress Method
When the BackGroundWorker will finish working, the RunWorkerCompleted Event will be fired, so the status will be updated to Work Complete.
To Stop the BackGroundWorker During the Work, We need to Set the Property WorkerSupportsCancellation to True
At any time during the Work, we can Stop the BackgroundWorker by calling CancelAsync() Function. Modify the Stop button Click Event as Following
private void Stop_Click(object sender, EventArgs e)
{
backgroundWorker1.CancelAsync();
}
Modfiy the DoWork Event as Following to Stop if Cancelling is Pending
private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
for (int i = 0; i < 1; i++)
{
if (backgroundWorker1.CancellationPending)
{
e.Cancel = true;
break;
}
HeavyFunction();
backgroundWorker1.ReportProgress(i, "Heavy Function Done");
}
}
To Update the UI with accurate info, u can modfiy BackGroundWorkerCompleted Event as following to show that either BackGroundWorker was stopped or it completed the Work
private void backgroundWorker1_RunWorkerCompleted(object sender, RunWorkerCompletedEventArgs e)
{
if(e.Cancelled)
Status.Text = "Work Stooped";
else
Status.Text = "Work Complete";
}
If BackGroundWorker is Busy in some Task and user again press the Start Button, this is going to cause an error and throw and Exception. The IsBusy Propert tells either worker is busy or not, So Before Calling RunWorkerAsync() function, one must check that either BackGroundWorker is Busy in Work or Not. Following Code Does so
private void Start_Click(object sender, EventArgs e)
{
if (!backgroundWorker1.IsBusy)
backgroundWorker1.RunWorkerAsync();
else
MessageBox.Show("Busy in Work - Press Stop");
}
You can send Non UI objects as argument to the RunWorkerAsync function and then access it in the DoWork Event
Example: Make Multi Threaded App to download images from pinterest.com
Design the UI as shown Blow
Add a one BackGroundWorker, name it backGroundWorker1, add DoWork, ProgressChange and RunWorkCompleted Events. Set WorkerReportsProgress and WorkerSupportsCancellation Properties to True
Program Logic: We are going to use backGroundWorker1 to download html source of the http://www.pinterest.com using WebClient, then we will use regex to find urls of all the images and add it to a List<string> urls. The BackGroundWorkers equal to the number of threads set by the user will be created at run time, each of these backgroundworkers will take one url from List<string> urls, and download that image using WebClient.
Add Following Code for the backgroundworker1 Events
private void backgroundWorker1_DoWork(object sender, DoWorkEventArgs e)
{
WebClient wc = new WebClient();
string html = wc.DownloadString("http://www.pinterest.com");
Regex reg = new Regex("src=\"http://[^/]+/upload/[^\"]+");
MatchCollection mc = reg.Matches(html);
backgroundWorker1.ReportProgress(0, mc.Count.ToString() + "Images Found");
System.Threading.Thread.Sleep(2000);
lock(urls)
{
foreach (Match m in mc)
{
urls.Add(m.Value.Replace("src=\"",""));
}
}
}
private void backgroundWorker1_ProgressChanged(object sender, ProgressChangedEventArgs e)
{
Status.Text = (string)e.UserState;
}
In DoWork, we have just downloaded the html, used regex to get images links, and added it in the List<string> urls
Now we need to make workers for downloading images. We will do this once user press that Start Button, Add Following Code to Start Button Click Event
private void Start_Click(object sender, EventArgs e)
{
int maxThrds;
if(!int.TryParse(NoOfThreads.Text, out maxThrds))
{
MessageBox.Show("Enter Correct Number of Threads");
return;
}
if(maxThrds <= 0)
{
MessageBox.Show("Enter 1 or more Threads");
return;
}
if (!backgroundWorker1.IsBusy)
{
for(int i = 0; i < maxThrds; i++)
{
BackgroundWorker bgw = new BackgroundWorker();
bgw.WorkerReportsProgress = true;
bgw.WorkerSupportsCancellation = true;
bgw.DoWork += new DoWorkEventHandler(bgw_DoWork);
bgw.ProgressChanged += new ProgressChangedEventHandler(bgw_ProgressChanged);
bgw.RunWorkerCompleted += new RunWorkerCompletedEventHandler(bgw_RunWorkerCompleted);
bgw.RunWorkerAsync();
}
backgroundWorker1.RunWorkerAsync();
}
else
{
MessageBox.Show("Busy in Work");
}
}
First we are checking for correct input, then we are making backgroundworkers at the run time.
Once all properties of the run time threads are set, we are calling bgw.RunWorkerAsync for each worker.
- Following is the Code for DoWork Event of the RunTime made Workers
private void bgw_DoWork(object sender, DoWorkEventArgs e)
{
BackgroundWorker bgw = (BackgroundWorker)sender;
while(true)
{
string imgLink = "";
lock(urls)
{
if(urls.Count > 0)
{
imgLink = urls[0];
urls.RemoveAt(0);
count++;
}
else
{
System.Threading.Thread.Sleep(500);
}
}
if (imgLink != "")
{
string filename = imgLink.Substring(imgLink.LastIndexOf("/") + 1);
WebClient wc = new WebClient();
wc.Headers.Add(HttpRequestHeader.Referer, "Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20100101 firefox/15.0.1");
wc.DownloadFile(imgLink, filename);
bgw.ReportProgress(0, count.ToString() + "Images Downloaded");
}
}
}
In 1st line, we are casting sender to the BackGroundWorker object, so that we can ReportProgress for it. Then we have put all the code in a loop, in each iteration we are removing one url from List and then putting it on Download
If there is no link in List<string> urls, we have put the thread to sleep for 500ms
Since many threads will be accessing List<string> urls, so have put it in lock.
Task 1: How to Stop Run Time Created Workers
Task 2: Modify backgroundworker1 to collect user defined number of images. For example 30, 100, 220(for more than 50, u have to scrape page 2,3,4 ....
Hint for Task 1: Following Options can be used
Option 1: You can maintain a List of Run Tim Created Workers and then call CancelAsync()
for each worker in the List. Then modify the code of each run time Worker to break The Loop if CancellationPending
Option 2: Declare a Global Variable int rnd, assign it some random value in the start Button Click Event and pass it to BackGroundWorker DoWork Event as Argumet.
if(!backgroundWorker1.IsBusy)
{
rnd = new Random().Next(0, 99999);
for (int i = 0; i < maxThrds; i++)
{
BackgroundWorker bgw = new BackgroundWorker();
bgw.WorkerReportsProgress = true;
bgw.WorkerSupportsCancellation = true;
bgw.DoWork += new DoWorkEventHandler(bgw_DoWork);
bgw.ProgressChanged += new ProgressChangedEventHandler(bgw_ProgressChanged);
bgw.RunWorkerCompleted += new RunWorkerCompletedEventHandler(bgw_RunWorkerCompleted);
bgw.RunWorkerAsync(rnd);
}
backgroundWorker1.RunWorkerAsync();
}
Cast the rnd value to a local int variable, Modify the DoWork Event to work until rnd is not changed
void bgw_DoWork(object sender, DoWorkEventArgs e)
{
int chk = (int)e.Argument;
while (chk == rnd)
{
}
}
In Stop Button Click Event, Assign some new value to rnd, which will cause all the runtime created workers to break from loop
private void Stop_Click(object sender, EventArgs e)
{
rnd = new Random().Next(0, 9999);
}
Task 3: Think of some more options
HTTPWebRequest / HTTPWebResponse
Top
Before Starting this, we need to understand a bit about main HTTP Headers and install few ADD On’s which help us in determining Layout, Packets and Altering the Packets for a Website.
In an any of your Browser, Go to http://logme.mobi . You will get something like following
This is the Data of your HTTP Headers, which your browser sent to the web server of the http://logme.mobi In this, the User – Agent (it defines which browser is used for browsing) and Connection Headers are important.
Now Just Referesh the Page, and u will get something like following
This Header defines the Cookie, what is Cookie? A Cookie is a small piece of information stored as a text file on your computer that a web server uses when you browse certain web sites.
To maintain sessions, the cookie header is very important.
How the Sessions Work? Once user request login page, few cookies are issued by the server, then user submits login info along with the cookies, in case of successful login, server issues a new set of cookies, which identifies the user as authentiated user to the server. Then for further requests to the server, these newly issued set of cookies is used. This way a session is maintained. At any time, if u clean the Cookies Header, u will be redirected to the Login Page.
Install Mozilla Firefox 15.0 and then install the Live HTTP Headers add on for it. You can get it from here.
Run the LiveHTTPHeader, and Referesh the Page http://logme.mobi. You will see something like following
The LiveHTTPHeader Add on shows the HTTPHeaders of all the Requests and Responses once you dosome browsing using Mozilla Firefox. This Tool is Helpful in determining the website’s HTTP Packet formats, specially it helps in knowing what all data is being posted once some POST Action is performed
Now install the Add on Tamper Data from here. This Tool is helpful in modifying the content of HTTP Headers while browsing the web, this tool is of great use in determining that what fields and headers are compulsory for performing some HTTP POST Request and what all stuff we can skip out from a perticular post request. Once u run it, it will look like following
Click on the Start Tamper Button. In address bar type useragentstring.com/ press Enter, As soon u press Enter Following Window Will open, asking for u to Tamper Data, Submit Request or Abort Request
Click on Tamper Data, Then Following Window will open up
In the User Agent Field, Enter Following and Press OK
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1
Now submit all the subsequent requests and once the page will be loaded u will see website showing ur browser as Chrome where as you are using FireFox
The same way, you can alter the POST method parameters.
HTTPWebRequest Class in .Net: This class is under System.Net namespace and it provides methods and
properties to make HTTP Request to a web server.
Example 1: Downloading HTML of the Facebook login page
To make an object of this class, WebRequest.Create function is used HttpWebRequest
myReq = (HttpWebRequest)WebRequest.Create(url);
Open LiveHTTPHeaders, in browser, and browse to http://www.facebook.com
The picture above, shows the HTTP Request Headers, lets make this in C#
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("https://www.facebook.com/");
request.UserAgent = "Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/15.0";
request.Accept = "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
request.Headers.Add("Accept-Language: en-us,en;q=0.5");
request.Headers.Add("Accept-Encoding: gzip, deflate");
request.KeepAlive = true;
First line, creates an HTTPWebRequest Object to the given url, then we are adding the User-Agent and Accept Header to the HTTP packet by using properties. Not all of the Herders are Accessible via properties so user may need to Add Headers by Adding it to Headers Collection. Then we are adding Accept-Language and Accept-Encoding Headers by Adding it to Headers Collection.
Next important Stuff is Adding the Cookie Container to the HTTPWebRequest Object, as we want to keep record of the Cookies sent by the server in response to the request. If no Cookie container is Added then we can not Access the Cookies in the Response Header.
request.CookieContainer = new CookieContainer();
Declare a CookieCollection variable Globally, all the received Cookies will be added in this Collection so that we can use received cookies for subsequent requests. If each time you use new Cookie Container,
then its not possible to maintain session.
CookieCollectioncookies = new CookieCollection();
Now we are done with making required HTTPWebRequest Object. Before making HTTPWebResponse object, lets see what response we got in LiveHTTPHeader
for the request which we sent by browser
The 1st line shows the Code and Status. Then we are interested in Cookies Only. We a got 4-5 Cookies, which wil stored in browser Cookies foleder and will be sent with the next request. In case we perform login, then these Cookies wil be sent with the HTTP Request which will be generated for login, and then for successful login, server will issue some additional Cookies(those Cookies wil contain info which wil make us authenticated users for subsequent Requests)
Now lets make the HTTPWebResponse Object, its very simple
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Once Response is Received, next thing is adding the Received Cookies to the globally defined CookieCollection. But Before that Lets see what all Cookies we received.
Add the following Code after above line to see recevied Cookies
string txt = "Cookies Count=" + response.Cookies.Count.ToString() + "\n";
foreach (Cookie c in response.Cookies)
{
txt += c.ToString() + "\n";
}
MessageBox.Show(txt);
cookies.Add(response.Cookies);
This will show your cookies in a MessageBox
Now the Response is recieved, Next Step can be downloding the Data from Stream, it can be HTML source code, some other file or may be nothing at all depending upon the url to which u made the request. In our case it is HTML source of the Facebook Login page.
StreamReader loginPage = new StreamReader(response.GetResponseStream());
string html = loginPage.ReadToEnd();
This html source can be used to get some info by Regex or using some 3rd party HTML Parsing Library or stored in an html file as offline page.
Example 2: Performing
Login to FaceBook
Before diong login by C#, lets perform login in mozilla and analyze the HTTP Header by LiveHTTPHeaders. Start LiveHTTPHeader, Browse to http://www.facebook.com , enter username and password, click Login button. The HTTP Web Request sent by the Browser will look somthing like this
The First Line is the URL to which your username and password being sent(later in Example 3, we will see how to find this url). Second line tells the HTTP method and version used, which is POST and 1.1 respectively.
Then all the fields are just like normall HTTP Header as we saw in Example 1. The important stuff starts from Cookie Header, in Example 1, once we browse to http://www.facebook.com, there was no Cookie Header where as we received some Cookies in the Response Header, now when we click on the Login Button, the previously received set of Cookies is being sent in this Cookie Header.
Next Header shows Content Type, there are two major content types used to POST data, application/x-www-form-urlencoded and multipart/form-data. You can find more info about these here
Next Header shows Content Length and in last line Content is being shown. You will see your email address and password in this line. Actually last Line shows the data which is being sent to the server by HTTP Post method.
There are several other values also, later in Example, we will see what are these values and from where to obtain these Values ! ! !
Lets examine the Response Header for the above Request.
The Response Header shows a lot of Cookies, these are the Cookies which are issued by the server on successful login, now for any subsequect request, the browser will send these Cookies to the server and in this way session wil be maintained
Got to Tools->Clear Recent History and delete the Cookies, then try to browse to your facebook profile page, and u will see that u will be redirected to facebook login page.
Now lets create the same login Request header as we saw in above screen shot and test that either we are able to successfully log in or not
string getUrl = "https://www.facebook.com/login.php?login_attempt=1";
string postData = "lsd=AVo_jqIy&email=YourEmailAddress&pass=YourPassword&default_persistent=0& charset_test=%E2%82%AC%2C%C2%B4%2C%E2%82%AC%2C%C2%B4%2C%E6%B0%B4%2C%D0%94%2C%D0%84&timezone=-300&lgnrnd=072342_0iYK&lgnjs=1348842228&locale=en_US";
HttpWebRequest getRequest = (HttpWebRequest)WebRequest.Create(getUrl);
getRequest.UserAgent = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.2 (KHTML, like Gecko) Firefox 15.0.0.1";
getRequest.CookieContainer = new CookieContainer();
getRequest.CookieContainer.Add(cookies);
getRequest.Method = WebRequestMethods.Http.Post;
getRequest.ProtocolVersion = HttpVersion.Version11;
getRequest.AllowAutoRedirect = false;
getRequest.ContentType = "application/x-www-form-urlencoded";
getRequest.Referer = "https://www.facebook.com";
getRequest.KeepAlive = true;
The getUrl is assigned to the address to which data will be posted, postData variable is copy of the Content from above HTTP Request Packet. Then we have created an HTTPWebRequest Object, and set its User-Agent Header
The Cookies which we received in Response to the Request for http://www.facebook.com are added to the HTTPWebRequest object, if we dont add these Cookies, then instead of enterteaining our request for login, Server will redirect us to Login page. Next we are setting HTTP Method to Post and Version to 1.1(used for HTTPS).
Setting the AllowAutoRedirect Property to false for requests in which we try to login is very important, if this property is set to true, then the HTTPWebRequest object will follow the Redirection Responses. And
during the redirections, you may lost access to the Cookies which server sent in response to Login Request.
Now Lets send the Login Info to the Server.
byte[] byteArray = Encoding.ASCII.GetBytes(postData);
getRequest.ContentLength = byteArray.Length;
Stream newStream = getRequest.GetRequestStream();
newStream.Write(byteArray, 0, byteArray.Length);
newStream.Close();
Data is written to stream, now lets get the Response and see what all Cookies we Receive
HttpWebResponse getResponse = (HttpWebResponse)getRequest.GetResponse();
string txt = "Cookies Count=" + getResponse.Cookies.Count.ToString() + "\n";
foreach (Cookie c in getResponse.Cookies) {
txt += c.ToString() + "\n";
}
MessageBox.Show(txt);
We successfully logged into the system and received 9 Cookies, the snapshot above shows very little info about the received Cookies, you can get more info by accessing the properties of the Cookies
Add the received Cookies to globally defined CookieCollection so that it can be used in subsequent requests
How to Check Login was Successfull or Not? Normally Cookies Count is an easy way to determine that Login was successfuly or not, to bemore sure, you can try getting HTML of Home Page, if you r not redirected to Login Page, that means u r successfully logged in.
Example 3: Custom HTTPWebRequest for Login
In Last example, we just replayed the HTTP Packet which mozilla Browser generated. Now let see from where the POST Url and PostData fields were obtained. Log Off from FaceBook, and open the Login Page. Right Click on Email textBox and click on Inspect Element.
Following HTML pane will appear in the bottom. Click on the Form Element
Here you can see the action filed in the highlighted area, this filed tells the url on which data is to be posted.
Below the highlighted area u can see few input fields, in Example 2, postData u saw many fields other than the email and password, so basically these fields were being sent to the server along with the email and password. These are part of the login Form, and these must be sent to the server along with the login info. Facebook changes the values of these fields frequently, so you cant hardcode these field’s values in the software/app.
Now we will see how to obtain these values from the facebook login page source code.
You can use Regex, string manipulation or some 3rd party HTML Parsing Library to obtain these fields and their values. I am using HTML Agility Pack to get the Login form tag and its all child input tags,
and finally preparing the postData
string email="youremail";
string passwd="yourpassword"; string postData = "";
A.HtmlDocument doc = new A.HtmlDocument();
doc.LoadHtml(fb_html);
node = doc.GetElementbyId("login_form");
node = node.ParentNode;
int i = 0; foreach(A.HtmlNode h in node.Elements("input"); {
if(i>0)
{
postData += "&";
}
if(i == 1)
{
postData += "email=" + email + "&";
postData += "pass=" + passwd + "&";
}
postData += (h.GetAttributeValue("name", "") + "=" +
h.GetAttributeValue("value", ""));
i++;
}
Now you can post the Data in same way as we did in Example 2. Once the Successful Login Cookies are recevied, Add it to Globally defined CookieCollection, then for any subsequent request, send these Cookies with the HTTPWebRequest.
Example 4: Uploading Pic to the Profile
We are going to upload picture to the profile using mobile version of the facebook, and leaving the upload to normall facebook as a task for the user.
To upload pictures/files, multipart/form-data is used as Content Type.
First lets examine the HTTP traffic for uploading the picture by LiveHTTPHeaders
Login to http://m.facebook.com/, and upload a picture.
You will see HTTP Request like following in LiveHTTPHeaders
By now you must be familiarize with the above HTTP Request Headers, the only thing
different is the way of posting the data, instead of using application/x-www-form-urlencoded, we are using multipart/form-data. You can also observe the layout of the postData (just below the ContenType)
Now Lets Examine from where these all fields like fb_dtsg, characterset etc came. Right Click on Upload Photo form on upload page and select Inspect element.
You can see all the fields are here under the form tag for photo upload. Again you can use Regex, string manipulation or HTMLAgilityPack to get
the name and values of these fields. But 1st you need to get the HTML of this page
HttpWebRequest req = (HttpWebRequest)WebRequest.Create("http://m.facebook.com/upload.php");
req.CookieContainer = new CookieContainer();
req.CookieContainer.Add(cookies);
req.AllowAutoRedirect=true;
req.UserAgent = "Mozilla/2.0 (Windows NT 6.1) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/5.0.874.121";
HttpWebResponse resp = (HttpWebResponse)req.GetResponse();
StreamReader sr = new StreamReader(resp.GetResponseStream());
string uploadHTML = sr.ReadToEnd();
Let's get all these fields and add to a dictionary Collection. This class is available in
System.Collections
. Make a dictionary variable nvc
Dictionary<string, string> nvc = new Dictionary<string, string>();
As you can see there is no ID for the photoupload form, so 1st use string
manipulation to get the form tag html and then use HTMLAgilityPack to easily get all input tags.
uploadHTML = uploadHTML.Substring(uploadHTML.IndexOf("<form"));
uploadHTML = uploadHTML.Replace("<form","<formid=\"myform\" ");
uploadHTML = uploadHTML.Remove(uploadHTML.IndexOf("/form>") + 6);
A.HtmlDocument doc = new A.HtmlDocument();
doc.LoadHtml(html);
A.HtmlNode node = doc.GetElementbyId("myform");
node = node.ParentNode;
foreach (A.HtmlNode h in node.Elements("input"))
{
string key = h.GetAttributeValue("name", "");
if (key != "")
nvc.Add(key, h.GetAttributeValue("value",""));
}
We will use Following Function to upload photo
HttpUploadFile("http://upload.facebook.com/mobile_upload.php",
"file1", "filename", @"filePath", "image/jpeg", nvc);
The Details of the passes arguments is as following
Action URL of the Upload Form
Name of the input tag for the File to upload
Name of the file
Path to the file on your computer
File type, in this case its image with extension jpeg
A dictionary containing all the input tags name and values
Following the complete piece of code for the HTTPUploadFile
function
public void HttpUploadFile(string url,string paramName, string filename,
string filepath, string contentType, Dictionary<string,string> nvc)
{
string boundary = "---------------------------" + DateTime.Now.Ticks.ToString("x");
byte[] boundarybytes = System.Text.Encoding.ASCII.GetBytes("\r\n--" + boundary + "\r\n");
HttpWebRequest wr = (HttpWebRequest)WebRequest.Create(url);
wr.ContentType = "multipart/form-data; boundary=" + boundary;
wr.KeepAlive = true;
wr.CookieContainer = new CookieContainer();
wr.CookieContainer.Add(cookies);
wr.Method = WebRequestMethods.Http.Post;
wr.UserAgent = "Mozilla/2.0 (Windows NT 6.1) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/5.0.874.121";
wr.AllowWriteStreamBuffering = true;
wr.ProtocolVersion = HttpVersion.Version11;
wr.AllowAutoRedirect = true;
wr.Referer = "Referer: http://m.facebook.com/upload.php";
Stream rs = wr.GetRequestStream();
string formdataTemplate = "Content-Disposition:
form-data; name=\"{0}\"\r\n\r\n{1}";
foreach (string key in nvc.Keys)
{
rs.Write(boundarybytes, 0, boundarybytes.Length);
string formitem = string.Format(formdataTemplate, key, nvc[key]);
byte[] formitembytes = System.Text.Encoding.UTF8.GetBytes(formitem);
rs.Write(formitembytes, 0, formitembytes.Length);
}
rs.Write(boundarybytes,0, boundarybytes.Length);
string headerTemplate = "Content-Disposition: form-data; " +
"name=\"{0}\"; filename=\"{1}\"\r\nContent-Type:{2}\r\n\r\n";
string header = string.Format(headerTemplate, paramName, filename, contentType);
byte[]headerbytes = System.Text.Encoding.UTF8.GetBytes(header);
rs.Write(headerbytes, 0, headerbytes.Length);
FileStream fileStream = new FileStream(filepath, FileMode.Open, FileAccess.Read);
byte[] buffer = new byte[4096];
int bytesRead = 0;
while((bytesRead = fileStream.Read(buffer, 0, buffer.Length)) != 0)
{
rs.Write(buffer, 0,bytesRead);
}
fileStream.Close();
byte[] trailer = System.Text.Encoding.ASCII.GetBytes("\r\n--" + boundary + "--\r\n");
rs.Write(trailer, 0, trailer.Length);
rs.Close();
HttpWebResponse wresp = (HttpWebResponse)wr.GetResponse();
cookies.Add(wresp.Cookies);
StreamReader sr = new StreamReader(wresp.GetResponseStream());
string sourceCode = sr.ReadToEnd();
StreamWriter sw = new StreamWriter("upload.html");
sw.Write(sourceCode);
sw.Close();
Task 1: Make Wall Posting Software for www.tagged.com
Task 2: Investigate some site which uses AJAX, to see how to use HTTPWebRequest, HTTPWebResponse for it
Task 3: Perform Login at some sites, using login cookies, view login protected pages