Semi generated crawler

Nicolas Dorier, 22 Jun 2012
Leverage the Visual Studio Web Test Framework for your crawling needs...

Leverage VS Web Test Framework to create crawlers for you.



Introduction

Well, I have created lots of crawlers in my life... more than I want to. So to keep my
passion alive, I create new ways of doing the same stuff in less time!

In this article you will learn how to leverage the Web Test Framework and recorder
of Visual Studio Ultimate (the Professional and Premium editions do not have it) to create website crawlers for you.

Maybe you have read my article about browser automation.

If you did, I will use the same use case example.

If you have not, here is a reminder: at the end of this article you will have an
application that will vote for my article !

The process is quick and simple :


The coffee step is the hardest one.

Use case : Vote 5 for this article automatically

Given the number of votes I got for my browser automation article, I decided to use the same trick to boost my
votes here.


This project will vote for me ! I am an egocentric person.

So with a classic crawler you might say :

  • GET request to http://www.codeproject.com/Articles/338036/BrowserAutomationCrawler
  • POST request with login parameter and password parameter to action "submitLogin"
  • Save the cookie
  • POST vote=5, articleId=MyArticleId to action=Vote, with the cookie

It might work, except that I completely made up the parameters and action names, so
you would need Fiddler to fine-tune the requests correctly. (And depending on
the website, it can be very, very hard, especially with AJAX stuff.)
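For comparison, here is a minimal sketch of what the hand-written side of that list might look like in C#. The class name ClassicCrawlerSketch is hypothetical, and any field names you pass in (vote, comment, and so on) would be as made up as the ones in the list above; they are not CodeProject's real parameters. The sketch only builds the form body, a cookie-aware client and a POST message. Nothing is sent.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Text;

public static class ClassicCrawlerSketch
{
    // Builds an application/x-www-form-urlencoded body from name/value pairs.
    // Spaces are escaped as %20 rather than '+'.
    public static string BuildFormBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        var body = new StringBuilder();
        foreach (var field in fields)
        {
            if (body.Length > 0)
                body.Append('&');
            body.Append(Uri.EscapeDataString(field.Key));
            body.Append('=');
            body.Append(Uri.EscapeDataString(field.Value));
        }
        return body.ToString();
    }

    // Creates a client that keeps cookies between requests (the "save the cookie" step).
    public static HttpClient CreateClient(CookieContainer cookies)
    {
        var handler = new HttpClientHandler { CookieContainer = cookies, UseCookies = true };
        return new HttpClient(handler);
    }

    // Wraps a form body in a POST request; the URL is whatever Fiddler reveals.
    public static HttpRequestMessage CreatePost(string url, string formBody)
    {
        return new HttpRequestMessage(HttpMethod.Post, url)
        {
            Content = new StringContent(formBody, Encoding.UTF8, "application/x-www-form-urlencoded")
        };
    }
}
```

Even with helpers like these, you still have to discover the real URLs and field names yourself, which is exactly the tedious part the recorder takes over.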

Another way to do the same thing is to say :

  • Go to http://www.codeproject.com/Articles/338036/BrowserAutomationCrawler
  • If "logout" is present, then you are already logged in;
    otherwise, wait for me to click on sign in (so I can manually fill in email and password)
  • Then click the option vote 5
  • Fill the comment textbox with "5 for me, great Nicolas, thanks for your work! :)"
  • Click on vote button

That's the way I did it in my article on browser automation.

Now I have a new way :

  • Open IE with the Web Test recorder
  • Click click click
  • Generate code
  • Modify the login/password part of the code to use a custom property of the crawler
    class.

So first, you need to create a new console project that will hold the code:
GeneratedCrawler.Sample.

Then create a new test project. I called it GeneratedCrawler.Tests.


In this test project, right click on the project/Add/New test, then select a new
Web Performance Test, I called it CodeProjectVoter.


Then a browser session opens in IE...

Browse to CodeProject, go to this article's link, log in, and vote 5 for me with a
nice comment...

If your browser is already logged in to CodeProject when it is launched, you need to
log out first, stop the recording, and start again from the beginning.

If you are already logged in, the recording will not contain the requests we need to log
in to CodeProject successfully.

And don't forget to click on Vote !!


You can see I used another article... That's because of the chicken and egg
problem: creating my article on the project of my article... When I noticed, my
mind stack-overflowed.

Then you can stop the recording and generate the code.


Now here is the trick... The generated code depends heavily on tons of DLLs
and on MSTest.


This code is not easy to run in your own project without using MSTest.exe directly...


So I decompiled the MSTest dlls, checked what was going on, and created a project with the same classes, cutting all the dependencies.

LightWebTestFramework is born.

So first, copy the generated code from the test project to the console project created earlier : GeneratedCrawler.Sample.

Obviously, the code does not compile since GeneratedCrawler.Sample does not reference any MSTest assembly... reference the LightWebTestFramework instead.

Then use the LightWebTestFramework namespaces instead of Microsoft's.


Now call the crawler in your code.

class Program
{
	static void Main(string[] args)
	{
		new CodeProjectVoterCoded().Execute();
	}
}

But wait... we forgot to specify the login and password as properties of the crawler.

public class CodeProjectVoterCoded : WebTest
{
	public CodeProjectVoterCoded()
	{
		this.PreAuthenticate = true;
	}
 
	public string Login
	{
		get;
		set;
	}
	public string Password
	{
		get;
		set;
	}

	// ... the generated GetRequestEnumerator() follows (see the full code further down)
}

Update Program.cs...

class Program
{
	static void Main(string[] args)
	{
		new CodeProjectVoterCoded()
			{
				Login = "slashene@gmail.com",
				Password = "blabla"
			}.Execute();
	}
}

Then just find in CodeProjectVoterCoded where the login/password you used during the recording step appear,
and use your properties instead.


You can then generalize for other properties at will... 

Clean up the dependent requests you don't need (like the ScoreCard Research requests)... and then you have your full crawler.

static void Main(string[] args)
{
	new CodeProjectVoterCoded()
		{
			Login = "slashene@gmail.com",
			Password = "blabla",
			ArticleId = 409009,
			Vote = 5,
			Comment = "5 for me, great work Nicolas !!!"
		}.Execute();
} 

Here is the full code of the final entirely generated then customized crawler : 

namespace GeneratedCrawler.Tests
{
	using System;
	using System.Collections.Generic;
	using System.Text;
	//using Microsoft.VisualStudio.TestTools.WebTesting;
	//using Microsoft.VisualStudio.TestTools.WebTesting.Rules;
	using LightWebTestFramework;
	using LightWebTestFramework.Rules;
	using System.Web;


	public class CodeProjectVoterCoded : WebTest
	{
		public CodeProjectVoterCoded()
		{
			this.PreAuthenticate = true;
		}


		public string Login
		{
			get;
			set;
		}
		public string Password
		{
			get;
			set;
		}
		public int ArticleId
		{
			get;
			set;
		}
		public int Vote
		{
			get;
			set;
		}
		public string Comment
		{
			get;
			set;
		}
		public override IEnumerator<WebTestRequest> GetRequestEnumerator()
		{
			// Initialize validation rules that apply to all requests in the WebTest
			if((this.Context.ValidationLevel >= ValidationLevel.Low))
			{
				ValidateResponseUrl validationRule1 = new ValidateResponseUrl();
				this.ValidateResponse += new EventHandler<ValidationEventArgs>(validationRule1.Validate);
			}
			if((this.Context.ValidationLevel >= ValidationLevel.Low))
			{
				ValidationRuleResponseTimeGoal validationRule2 = new ValidationRuleResponseTimeGoal();
				validationRule2.Tolerance = 0D;
				this.ValidateResponseOnPageComplete += new EventHandler<ValidationEventArgs>(validationRule2.Validate);
			}

			WebTestRequest request1 = new WebTestRequest("http://www.codeproject.com/");
			WebTestRequest request1Dependent1 = new WebTestRequest("http://b.scorecardresearch.com/b");
			request1Dependent1.ThinkTime = 24;
			request1Dependent1.QueryStringParameters.Add("c1", "2", false, false);
			request1Dependent1.QueryStringParameters.Add("c2", "13507173", false, false);
			request1Dependent1.QueryStringParameters.Add("ns__t", "1340383504029", false, false);
			request1Dependent1.QueryStringParameters.Add("ns_c", "utf-8", false, false);
			request1Dependent1.QueryStringParameters.Add("c8", "CodeProject%20-%20Your%20Development%20Resource", false, false);
			request1Dependent1.QueryStringParameters.Add("c7", "http%3A%2F%2Fwww.codeproject.com%2F", false, false);
			request1Dependent1.QueryStringParameters.Add("c9", "", false, false);
			request1.DependentRequests.Add(request1Dependent1);
			ExtractHiddenFields extractionRule1 = new ExtractHiddenFields();
			extractionRule1.Required = true;
			extractionRule1.HtmlDecode = true;
			extractionRule1.ContextParameterName = "1";
			request1.ExtractValues += new EventHandler<ExtractionEventArgs>(extractionRule1.Extract);
			yield return request1;
			request1 = null;

			WebTestRequest request2 = new WebTestRequest("http://www.codeproject.com/script/Membership/LogOn.aspx");
			request2.Method = "POST";
			request2.ExpectedResponseUrl = "http://www.codeproject.com/";
			request2.QueryStringParameters.Add("rp", "%2f", false, false);
			FormPostHttpBody request2Body = new FormPostHttpBody();
			request2Body.FormPostParameters.Add("FormName", this.Context["$HIDDEN1.FormName"].ToString());
			request2Body.FormPostParameters.Add("Email", Login);
			request2Body.FormPostParameters.Add("Password", Password);
			request2.Body = request2Body;
			yield return request2;
			request2 = null;

			WebTestRequest request4 = new WebTestRequest("http://www.codeproject.com/Script/Ratings/Ajax/RateItem.aspx");
			request4.QueryStringParameters.Add("obid", ArticleId.ToString(), false, false);
			request4.QueryStringParameters.Add("obtid", "2", false, false);
			request4.QueryStringParameters.Add("obstid", "1", false, false);
			request4.QueryStringParameters.Add("rvv", Vote.ToString(), false, false);
			request4.QueryStringParameters.Add("rvc", HttpUtility.UrlEncode(Comment), false, false);
			yield return request4;
			request4 = null;
		}
	}
} 
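
To make the last request less opaque, the sketch below composes by hand the URL that request4 presumably ends up requesting, assuming the framework appends QueryStringParameters to the address as an ordinary query string. RateUrlSketch is a hypothetical helper, not part of the generated code, and it uses Uri.EscapeDataString instead of HttpUtility.UrlEncode to avoid the System.Web dependency, so spaces come out as %20 rather than +.

```csharp
using System;

public static class RateUrlSketch
{
    private const string RateItemUrl =
        "http://www.codeproject.com/Script/Ratings/Ajax/RateItem.aspx";

    // Mirrors the parameters added to request4: obid, obtid, obstid, rvv, rvc.
    public static string Build(int articleId, int vote, string comment)
    {
        return RateItemUrl
            + "?obid=" + articleId
            + "&obtid=2"
            + "&obstid=1"
            + "&rvv=" + vote
            + "&rvc=" + Uri.EscapeDataString(comment);
    }
}
```

Calling RateUrlSketch.Build(409009, 5, "great work") gives an address you can paste into Fiddler to compare against what was recorded.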

If the login or password is wrong, you will get no exception or crash; it will just not work. I leave that to you. :)
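
If you want to take that up, one possible starting point is the "logout" trick mentioned earlier: a page served to a logged-in user is assumed to contain the word "logout" somewhere. The sketch below is only that check, on a plain string; LoginCheck is a hypothetical helper, and how you get the response body out of the framework is not shown here.

```csharp
using System;

public static class LoginCheck
{
    // Assumption: a page served to a logged-in user contains "logout" in its HTML.
    public static bool LooksLoggedIn(string html)
    {
        return !string.IsNullOrEmpty(html)
            && html.IndexOf("logout", StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Turns the silent failure into an exception.
    public static void EnsureLoggedIn(string html)
    {
        if (!LooksLoggedIn(html))
            throw new InvalidOperationException(
                "Login appears to have failed: no \"logout\" marker in the response.");
    }
}
```

Calling LoginCheck.EnsureLoggedIn on the body returned after the login request would make a bad password fail loudly instead of silently skipping the vote.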

If you need any information for your own needs, don't forget that the generated crawler is a child class of WebTest, and it comes with some cool properties.

Conclusion 

With great power comes great responsibility. I hope you will use crawling for good!

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Nicolas Dorier
Software Developer Freelance
France
I am a trainer and a curious developer.

CEO of AO-IS. We created a tool to make IaaS on Azure easier: IaaS Management Studio.

If you are interested in working with me, for fun coding stuff, for freelance stuff, or in using our cloud training infrastructure freely for a kickass presentation for the dev community: this way :)

Last Updated 23 Jun 2012
Article Copyright 2012 by Nicolas Dorier