
WebScrapper Parser in C#

16 Jan 2013 · CPOL · 5 min read
A simple method parser in C# with unlimited recursion support

Introduction

Ever needed a parser that is simple yet powerful enough to execute your own syntax? Want to download a web page, run a quick tag search, and list all the matching tags with a single statement? WebScrapper is the library you need, though it is currently still in the alpha stage (development is ongoing).

C#
string str = "SetResult('VariableName', Download('http://www.google.com'));
TagMatch(GetResult('VariableName'), '<a', '</a>')"; 
Scrapper scr = new Scrapper(); 
string[] results = scr.Multiple(str);  

Using the above code, you can download a URL and list all of its links automatically. The library has several built-in search mechanisms, including Regex. Notice the SetResult('VariableName', ...) and GetResult('VariableName') calls: this syntax saves a result into a named variable and retrieves a previously stored result.
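
Assuming Multiple returns one array entry per matched tag, as the example above suggests, listing the scraped tags is then a simple loop:

C#
// Each matched <a>...</a> tag comes back as one entry in the results array.
foreach (string tag in results)
    Console.WriteLine(tag);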

As another example, we scrape http://www.anime-access.com, logging in as a registered user and fetching that user's email address, which can be accomplished with the syntax below:

C#
string str = "TagMatch(Upload('http://anime-access.com/login', 'POST', 
	'username=<user>&password=<pass>', ''), '<div style=\"padding-top: 2px; 
	font-size: 12px;\">', '</div>')";
Scrapper scr = new Scrapper();
string[] results = scr.Single(str); 

Change <user> to your username and <pass> to your password, execute the code above, and the registered user's email address is scraped. Note that, of course, you need an account on that particular site in order to log in successfully.

The library supports:

  1. POST or GET requests to a page
  2. Downloading a page
  3. Tag searching using the built-in search
  4. Regex search
  5. Setting and retrieving variables

Also notice that the library uses single quotes as string delimiters.

Background

Web scraping is a technique for extracting data from the web, and some coding is always needed to pull out exactly what we want after downloading a page. In code, we usually find the target string with Regex, which means compiling our code into an .exe or .dll before it can be executed. I needed a simple parser capable of retrieving its syntax from a database or configuration file and executing it, without having to write yet another DLL or EXE.

Using the Code

To use the source code, download it, reference WebScrapper.dll in your project, and instantiate the class called Scrapper. Remember to import the namespace or use the fully qualified name.
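
For instance, a minimal sketch with a using directive (the Download statement is just the simplest syntax from the earlier examples); the sample below uses the fully qualified name instead:

C#
using WebScrapper;

Scrapper scr = new Scrapper();
string[] results = scr.Single("Download('http://www.google.com')");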

C#
string str = "SetResult('LINK1,LINK2', 
	TagMatch(Return(TagMatch(Download('http://www.google.com\'), '<a', '</a>'), '5,6'), 
	'onclick=gbar.logger.il(1,{t:5}); class=gbzt id=gb_5 href=\"', '\"'));
	Download(GetResult('LINK1'))";
WebScrapper.Scrapper scr = new WebScrapper.Scrapper();
string[] results = scr.Multiple(str);   
  1. The above code downloads google.com and searches it for links.
  2. It takes the results at indexes 5 and 6 and filters each against 'onclick=...' and '"'; the result at index 5 matches and returns a link, while the result at index 6 does not match and returns an empty string.
  3. The filtered result at index 5 is stored in the variable LINK1 and the filtered result at index 6 in LINK2.
  4. Finally, the link stored in LINK1 is downloaded.
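
If you also need the second filtered result, GetResult should retrieve it in a later call, assuming stored variables persist on the same Scrapper instance (a hedged sketch, not verified against the library):

C#
// Assumes SetResult variables live on the Scrapper instance between calls.
string[] link2 = scr.Single("GetResult('LINK2')");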

Upload to 22 File Hosting Services using WebScrapper

Download the latest package from the attached source code, which includes the capability to upload to 22 file hosting services using WebScrapper (updated 17 January 2013).

The 22 File Hosting services are:

  1. FileCloud.WS
  2. FileSwap.com
  3. SendMyWay.com
  4. SendSpace.com
  5. Tusfiles.net
  6. Uppit.com
  7. Bitshare.com
  8. Depositfiles.com
  9. FileDefend.com
  10. HipFile.com
  11. ShareBeast.com
  12. Upafile.com
  13. UploadCore.com
  14. UptoBox.com
  15. YourFileLink.com
  16. BayFiles.com
  17. FileCloud.IO
  18. FileFactory.com
  19. HotFile.com
  20. Mediafire.com
  21. Rapidgator.com
  22. Slingfile.com

To upload a file, instantiate the included uploader class and call its Upload method, for example:

C#
WebScrapper.Uploader.DepositFilesWSUploader dfws = 
			new WebScrapper.Uploader.DepositFilesWSUploader();
string[] links = dfws.Upload("d:\\anyfile.txt");   

links[0] will contain the download link for the file uploaded to the file hosting service.
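
Since the article does not document error behavior, a guarded usage sketch might look like this (treating a null or empty array as a failed upload is an assumption):

C#
string[] links = dfws.Upload(@"d:\anyfile.txt");

// links[0] is the download link; a null/empty result is treated here as failure
// (an assumption, as the error behavior is not documented).
if (links != null && links.Length > 0)
    Console.WriteLine("Download link: " + links[0]);
else
    Console.WriteLine("Upload failed or returned no links.");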

Point of Interest

As you may have noticed, the parser supports unlimited recursion: you can call a method inside a method parameter, such as:

C#
string[] listofTagMatches = scr.Multiple(@"Download(RegexSingle(
  Download('http://www.google.com'),
  '(?<Protocol>\w+):\/\/(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=%&=\-@/$,]*'))");

It is a bit confusing, but worth it: you can store the syntax string in a database, one per website, and execute the appropriate string for each website, so you do not have to write a bunch of DLLs to support multiple websites in your program.

The separation of concerns here means the compact syntax serves one single purpose, web scraping, where we have no control over the content because it belongs to another entity. Solid research and testing are needed before we can assume that a given search will return a given result.

Developers doing web scraping tend to assume a lot, because there is no guarantee a website will stay as it is: the company or entity behind the site may redesign, upgrade, or perform maintenance that changes its pages. If we write a .NET assembly (a DLL or EXE) that extracts content based on our research, that assembly becomes outdated as soon as the website changes; we then have to analyze the website again, update the code, recompile, and publish it to our users or website. It is a tedious cycle that happens often.

With WebScrapper, the parsing of a web page is expressed as a single syntax string consisting of recursive statements. That string can be stored in a database or configuration file, making it easy to modify without recompiling any code. When the target website changes, the developer only needs to update the scraping syntax and the scraping works again!
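
For example, a minimal sketch that loads the syntax from a text file and executes it (the file name is hypothetical), so the syntax can be updated without recompiling:

C#
using System.IO;

// 'scrape-syntax.txt' is a hypothetical file holding the scraping syntax string.
string syntax = File.ReadAllText("scrape-syntax.txt");
WebScrapper.Scrapper scr = new WebScrapper.Scrapper();
string[] results = scr.Multiple(syntax);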

Benefits of WebScrapper

  1. The whole syntax is a single string, so it can be stored in a database or configuration file and is easy to update.
  2. No need to compile the syntax; it is interpreted on the fly.
  3. One instance of the class uses a single WebClient that maintains cookie state, so downloading multiple pages keeps the cookies intact (see the sketch after this list).
  4. Supports Regex.
  5. Built-in string finder.
  6. Source code is available for anyone to modify.
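
Regarding point 3, here is a sketch of what cookie persistence enables: a login POST followed by a download of a members-only page in a single call. The URLs and form fields are placeholders, following the Upload/Download syntax shown earlier:

C#
WebScrapper.Scrapper scr = new WebScrapper.Scrapper();
string[] results = scr.Multiple(
    // The login response sets cookies on the shared WebClient...
    "Upload('http://example.com/login', 'POST', 'username=<user>&password=<pass>', '');" +
    // ...so the next Download runs as the logged-in user.
    "Download('http://example.com/members')");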

History

  1. WebScrapper version Alpha 1.0 by John Kenedy
  2. WebScrapper version Alpha 1.5 by John Kenedy
    • Changed the string delimiter to a single quote
    • Added the POST-to-page feature
    • Added saving and retrieving of variables
  3. WebScrapper version 1.0 by John Kenedy (includes lots of samples for uploading to file hosting services)
    • Added new syntax (Replace, Base64Encode, Base64Decode, UploadFileNameless, etc.)
    • Included WebScrapper syntax for the 22 file hosting services, letting you upload to 22 file hosts with a single method call
This article was originally posted at http://innosia.com/Home/Article/WEBSCRAPPER

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior)
Singapore
I write code mostly in C#, VB.NET, PHP and Assembly.

Comments and Discussions

 
General: My vote of 4
nemopeti, 16-Jan-13 23:32

It's nice to store the syntax string in a database. But I miss a StringBuilder-like extension which would help build the syntax string.

