WebScrapper Parser in C#

John Kenedy S.Kom

4.80/5 (4 votes)

Oct 23, 2012

CPOL

5 min read

14971

454

Simple Method Parser with C# with unlimited recursion support

Download source - 397.9 KB

Introduction

Ever need a simple parser but yet powerful enough to execute your syntax? You want to download a file from webpage and do a quick tag searching and list out all the tags using only one statement? WebScapper is the Library you need, but currently it is still in alpha stage (still developing further).

string str = "SetResult('VariableName', Download('http://www.google.com'));
TagMatch(GetResult('VariableName'), '<a', '</a>')"; 
Scrapper scr = new Scrapper(); 
string[] results = scr.Multiple(str);

Using the above code, you can automatically download a URL and search all the links. The library built in several searching mechanisms including Regex. Notice the SetResult('VariableName', ...) and GetResult('VariableName'), this syntax will save the result in a variable and get the result of a variable stored previously.

Another example, we try to scrap http://www.anime-access.com to login as a registered user and get the email address of that user, which can be accomplished by using the below syntax:

string str = "TagMatch(Upload('http://anime-access.com/login', 'POST', 
	'username=<user>&password=<pass>', ''), '<div style=\"padding-top: 2px; 
	font-size: 12px;\">', '</div>')";
Scrapper scr = new Scrapper();
string[] results = scr.Single(str);

Change the <user> to your username and change the <pass> to the password and execute the above code and you will see the Email address of the registered user is scrapped. Note that of course you need to have an account in that particular site in order to be successfully logged on.

Library support:

POST or GET to a page
Download a page
Searching Tag using built in search
Regex search
Set and Retrieve Variable

Also notice that the library uses single quote as string terminator.

Background

Web scrapping is a technique to get data from the web. And there is certainly coding needed in order to get whatever we want after downloading the page. Using code, we usually find the string we want using regex, which results in the need to compile our code to become .exe or .dll so that it can be executed. There is a need for me to have a simple parser that is capable of retrieving the syntax from database or configuration files and executing it without having to write another piece of DLL or EXE code.

Using the Code

To use the source code, download it and include WebScrapper.dll into your project and instantiate the class call Scapper. Remember to include the namespace or use the fully qualified name.

string str = "SetResult('LINK1,LINK2', 
	TagMatch(Return(TagMatch(Download('http://www.google.com\'), '<a', '</a>'), '5,6'), 
	'onclick=gbar.logger.il(1,{t:5}); class=gbzt id=gb_5 href=\"', '\"'));
	Download(GetResult('LINK1'))";
WebScrapper.Scrapper scr = new WebScrapper.Scrapper();
string[] results = scr.Multiple(str);

The above code will download google.com and search for links.
Get the result in index 5 and 6 and filter if it matches 'onclick=...' and '"' which result index 5 will match and return a link while result 6 will not match and return empty string.
Set filtered result in index 5 to variable LINK1 and filtered result in index 6 to variable LINK2.
Download the link set in LINK1.

Upload to 22 File Hosting Services using WebScrapper

Download the latest package in the attached source code which includes capability to upload to 22 File Hosting services using WebScrapper (updated 17 January 2013).

The 22 File Hosting services are:

FileCloud.WS
FileSwap.com
SendMyWay.com
SendSpace.com
Tusfiles.net
Uppit.com
Bitshare.com
Depositfiles.com
FileDefend.com
HipFile.com
ShareBeast.com
Upafile.com
UploadCore.com
UptoBox.com
YourFileLink.com
BayFiles.com
FileCloud.IO
FileFactory.com
HotFile.com
Mediafire.com
Rapidgator.com
Slingfile.com

To upload file, instantiate the included class and call the Upload method, for example:

WebScrapper.Uploader.DepositFilesWSUploader dfws = 
			new WebScrapper.Uploader.DepositFilesWSUploader();
string[] links = dfws.Upload("d:\\anyfile.txt");

The links[0] will contain the link to download the file uploaded from the File Hosting services.

Point of Interest

As you notice, the parser supports unlimited recursion, you can call a method inside of a method parameter such as:

string[] listofTagMatches = scr.Multiple(@"Download(RegexSingle(
  Download(""http://www.google.com""), 
  ""(?<Protocol>\w+):\/\/(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*"")))");

It is a bit confusing but is worth it, since you can just store the syntax string in database for each website and execute the string for each website so that you do not have to write a bunch of DLLs to support multiple websites for your program.

The separation of concern meaning that the compact syntax serves only for one single purpose, which is Web Scrapping, which we have no control over the content since the content belongs to other entity. We need a strong research and testing so that the assumption is made that certain searches return a certain result.

Many developers when doing Web Scrapping assume a lot of things, because there is no definite way that a website will stay as it is, the company behind the site, or entity behind the site, might do renovation, or works, upgrading or maintenance that cause the web changes. If we specifically write a .NET assembly such as DLL or EXE to get content based on our research, our DLL or EXE is easily outdated once the website changes, thus we have to analyze the website again and update our DLL or EXE code and do recompilation and publish our code to our user or website. It is a tedious cycle that often happens.

By using WebScrapper, parsing of web is done using a single syntax which is a single string consisting of recursive statements. A string can be stored in database or configuration files, which makes it easy to modify without the need to recompile any code. When the target website changes, the developer only needs to update the scrapping syntax and the scrapping works again!

Benefit of WebScrapper

Single syntax in a string, thus can be stored in database or configuration files. Updating of Single Syntax is easy.
No need to compile the syntax, as it is being interpreted on the fly.
One instance of the Class uses one single WebClient control that maintains the Cookies state, thus downloading multiple page will keep the Cookies intact.
Support Regex
Built in string finder
Source code is available for anyone to modify

History

WebScrapper version Alpha 1.0 by John Kenedy
WebScrapper version Alpha 1.5 by John Kenedy

Changing string terminator to single quote
Add POST to page feature
Add Saving Variable and Retrieving Variable

WebScrapper version 1.0 by John Kenedy (includes lots of samples to upload to File Hosting Services)

Add new syntax (Replace, Base64Encode, Base64Decode, UploadFileNameless, etc.)
Include 22 uploads to File Hosting services WebScrapper syntax, which can let you easily upload to 22 File Host using a single method call