Licence GPL3
First Posted 17 Sep 2007

HTML Parser / HTML Data Extractor / HTML Extractor

By Ashutosh Bhawasinka | 17 Sep 2007 | Article
Extracts any part of the HTML from the HTML data

Introduction

This is a simple class for extracting any part of an HTML-formatted page. The data you want may sit deep inside the HTML body, for example in a particular row and column of a table, where pulling it out with an existing XML or HTML parser becomes difficult. This class makes that kind of extraction easy.

Using the Code

To use this class, include the file HTMLSearchResult.cs in your project.
To extract HTML data, create an object of the HTMLSearchResult class and call its GetTagData method, which is overloaded for different cases:

public HTMLSearchResult GetTagData(string sFileData, string sSearchTag, int nOccurance)
public HTMLSearchResult GetTagData(string sSearchTag)
public HTMLSearchResult GetTagData(string sSearchTag, int nOccurance)
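
To make the idea of "the Nth occurrence of a tag, searched inside the previous result" concrete, here is a minimal sketch of such a lookup. It is not the code in HTMLSearchResult.cs: the class TagSlice and its members Find, Inner and Attributes are hypothetical names used only for illustration, and the sketch assumes well-formed markup in which a tag is never nested inside a tag of the same name.

```csharp
using System;

// Hypothetical stand-in for illustration only; NOT the article's HTMLSearchResult.
public sealed class TagSlice
{
    public string Attributes { get; }   // raw attribute text of the matched tag
    public string Inner { get; }        // text between the opening and closing tag

    public TagSlice(string attributes, string inner)
    {
        Attributes = attributes;
        Inner = inner;
    }

    // Finds prefix (e.g. "<td" or "</td") only when it is followed by '>' or
    // whitespace, so that "<t" does not match "<table". Returns -1 if absent.
    private int IndexOfTag(string prefix, int from)
    {
        while (true)
        {
            int i = Inner.IndexOf(prefix, from, StringComparison.OrdinalIgnoreCase);
            if (i < 0) return -1;
            int end = i + prefix.Length;
            if (end < Inner.Length && (Inner[end] == '>' || char.IsWhiteSpace(Inner[end])))
                return i;
            from = end;
        }
    }

    // Returns the Nth (1-based) occurrence of <tag ...>...</tag> inside Inner,
    // or null when there are fewer than N occurrences.
    public TagSlice Find(string tag, int occurrence = 1)
    {
        int pos = 0;
        for (int seen = 1; ; seen++)
        {
            int open = IndexOfTag("<" + tag, pos);
            if (open < 0) return null;
            int nameEnd = open + 1 + tag.Length;
            int openEnd = Inner.IndexOf('>', nameEnd);
            if (openEnd < 0) return null;
            int close = IndexOfTag("</" + tag, openEnd + 1);
            if (close < 0) return null;
            if (seen == occurrence)
                return new TagSlice(
                    Inner.Substring(nameEnd, openEnd - nameEnd).Trim(),
                    Inner.Substring(openEnd + 1, close - openEnd - 1));
            pos = close + 2 + tag.Length;   // continue after this element
        }
    }
}
```

Because each Find call searches only inside the previous result, calls can be chained: starting from new TagSlice("", pageData), the chain Find("html").Find("body").Find("table").Find("tr", 2).Find("td") narrows the search one level at a time. A missing tag yields null in this sketch, so a real caller would check each step before continuing.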

Suppose you have the HTML page shown below:

Screenshot - htmlparser.jpg

Say this page is located on the Web at http://test.test.com. To extract the first column of the second row of the table, the code is:

//requires: using System.Net;
WebClient wc = new WebClient();
string PageData;
//You can get this data from any source, not necessarily from the Web
PageData = wc.DownloadString("http://test.test.com");

//My code comes now
HTMLSearchResult searcher = new HTMLSearchResult(); //create a simple searcher object
HTMLSearchResult result; //create a temporary object
result = searcher.GetTagData(PageData, "html", 1).GetTagData("body").
            GetTagData("table").GetTagData("tr",2).GetTagData("td");

Console.WriteLine("The tag data is :{0}", result.TAGData);

//output
Row 2, Col 1

... or if you have a Web page named test.html in the current directory:

//requires: using System.IO;
string PageData;
using (StreamReader file = new StreamReader(@"test.html")) //update the path here.
{
    PageData = file.ReadToEnd();
}
HTMLSearchResult searcher = new HTMLSearchResult();
HTMLSearchResult result;
result = searcher.GetTagData(PageData, "html", 1).GetTagData("body").
            GetTagData("table").GetTagData("tr", 2).GetTagData("td");

Console.WriteLine("The tag data is :{0}", result.TAGData);

To extract data, follow these steps:

  1. Read all the HTML code into a string. It doesn't matter whether you read it from disk or from the Web.
  2. Create an HTMLSearchResult object with the default constructor.
  3. The first method called on this object must be the GetTagData overload that takes 3 arguments:
    1. HTML page data
    2. Top level tag you want to extract
    3. Nth occurrence of it (which usually is 1 for HTML)
  4. GetTagData returns another HTMLSearchResult object, so calls can be chained. Besides TAGData, the result has another useful property, TAGAttribute, which holds the attributes of the tag last requested on the object.
    For a tag element like <td colspan=2>My data</td>, TAGData returns "My data", while TAGAttribute holds the attribute text (colspan=2).
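
Step 1 can be sketched as a small helper that accepts either a Web address or a file path and returns the page as a string. The class name PageLoader and the method Load are hypothetical and not part of HTMLSearchResult.cs; the sketch uses WebClient to match the examples above, although newer .NET versions mark WebClient as obsolete in favour of HttpClient.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper for illustration; not part of the article's download.
public static class PageLoader
{
    // Returns the raw HTML for either an http(s) URL or a local file path.
    public static string Load(string source)
    {
        Uri uri;
        bool isWeb = Uri.TryCreate(source, UriKind.Absolute, out uri)
                     && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);

        if (isWeb)
        {
            // Dispose the client once the download has finished.
            using (WebClient wc = new WebClient())
            {
                return wc.DownloadString(uri);
            }
        }

        // Anything that is not an http(s) URL is treated as a file path.
        return File.ReadAllText(source);
    }
}
```

The string returned by PageLoader.Load("test.html") or PageLoader.Load("http://test.test.com") can then be passed as the first argument of the three-argument GetTagData call described in step 3.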

Limitations

This parser currently can't retrieve data correctly from a "script" tag whose content includes if/for blocks that use "<" or ">". This limitation will be removed in the next version.
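
One plausible way this kind of failure arises is shown in the sketch below. It is an assumption about the general class of problem, not a description of how HTMLSearchResult.cs works internally: any scanner that treats every "<" as the start of a tag and the next ">" as its end will mistake comparison operators inside a script for markup. The names NaiveScanner and Tags are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of a naive tag scanner; not the article's parser.
public static class NaiveScanner
{
    // Collects the text between every '<' and the next '>' as if it were a tag.
    public static List<string> Tags(string html)
    {
        var tags = new List<string>();
        int pos = 0;
        while (true)
        {
            int lt = html.IndexOf('<', pos);
            if (lt < 0) break;
            int gt = html.IndexOf('>', lt + 1);
            if (gt < 0) break;
            tags.Add(html.Substring(lt + 1, gt - lt - 1));
            pos = gt + 1;
        }
        return tags;
    }
}
```

For the input <script>if (a<b) x(); if (c>d) y();</script> this scanner reports three "tags": script, then the bogus b) x(); if (c, then /script. The middle entry is script code that was misread as a tag, which is why the content of such a script block cannot be recovered reliably by this style of scanning.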

History

  • 17th September, 2007: Initial post

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)

About the Author

Ashutosh Bhawasinka



India

I am a Computer Engineer, currently working for a telecom company. I am interested in C#, Visual C++, MAPI, COM/DCOM, Windows administration and Networking.
I love programming!!!
 
Visit my website www.ashusoft.com


Visit my blogs at abhawasinka.blogspot.com



Last Updated 17 Sep 2007
Article Copyright 2007 by Ashutosh Bhawasinka