Introduction
HTMLSearchResult is a small class for extracting any part of an HTML-formatted page. The data you want may sit deep inside the HTML body, for example in a particular row or column of a table. Extracting data from such positions with an existing XML or HTML parser can be difficult. This class makes that kind of extraction easy.
Using the Code
To use this class, include the file HTMLSearchResult.cs in your project.
To extract HTML data, create an object of the HTMLSearchResult class and call its GetTagData method. GetTagData is overloaded for different cases:
public HTMLSearchResult GetTagData(string sFileData, string sSearchTag, int nOccurance)
public HTMLSearchResult GetTagData(string sSearchTag)
public HTMLSearchResult GetTagData(string sSearchTag, int nOccurance)
Suppose a site serves an HTML page that contains a table, and that the page is located on the Web at http://test.test.com. To extract the first column of the second row in the table, the code has to be...
// Requires: using System.Net;
WebClient wc = new WebClient();
string PageData = wc.DownloadString("http://test.test.com");
HTMLSearchResult searcher = new HTMLSearchResult();
HTMLSearchResult result = searcher.GetTagData(PageData, "html", 1)
    .GetTagData("body")
    .GetTagData("table")
    .GetTagData("tr", 2)
    .GetTagData("td");
Console.WriteLine("The tag data is :{0}", result.TAGData);
Row 2, Col 1
... or if you have a Web page named test.html in the current directory:
// Requires: using System.IO;
string PageData;
using (StreamReader file = new StreamReader(@"test.html"))
    PageData = file.ReadToEnd();
HTMLSearchResult searcher = new HTMLSearchResult();
HTMLSearchResult result = searcher.GetTagData(PageData, "html", 1)
    .GetTagData("body")
    .GetTagData("table")
    .GetTagData("tr", 2)
    .GetTagData("td");
Console.WriteLine("The tag data is :{0}", result.TAGData);
Steps that need to be followed to extract the data are as follows:
- Read all the HTML code into a string; it doesn't matter how you read it, either from disk or Web.
- Create an HTMLSearchResult object with the default constructor.
- The first method that must be called on this object is the GetTagData overload which takes 3 arguments:
  - HTML page data
  - Top level tag you want to extract
  - Nth occurrence of it (which usually is 1 for HTML)
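The first step, reading the page into a string, can be wrapped in a small helper. The following is only a sketch: PageLoader and Load are hypothetical names that are not part of HTMLSearchResult.cs, and the helper assumes that anything which is not an absolute http or https URL is a local file path.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper for illustration; not part of HTMLSearchResult.cs.
public static class PageLoader
{
    // Returns the page text from an http(s) URL or from a local file path.
    public static string Load(string source)
    {
        Uri uri;
        bool isWeb = Uri.TryCreate(source, UriKind.Absolute, out uri)
                     && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
        if (isWeb)
        {
            // WebClient matches the example in this article; newer code may prefer HttpClient.
            using (var wc = new WebClient())
            {
                return wc.DownloadString(uri);
            }
        }
        // Anything else is treated as a path on disk.
        return File.ReadAllText(source);
    }

    public static void Main()
    {
        // Write a small page to a temporary file and read it back.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "<html><title>t</title></html>");
        Console.WriteLine(Load(path));
        File.Delete(path);
    }
}
```

The returned string can then be passed as the first argument of the three-argument GetTagData call.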
Each GetTagData call again returns an HTMLSearchResult object. Besides TAGData, this object has another useful property, TAGAttribute, which contains the attribute of the tag last requested on that object.
For a tag element like <td colspan=2>My data</td>, TAGData will return "My data", while TAGAttribute will return the attribute, colspan=2.
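To make the idea of chained lookups concrete, here is a minimal, self-contained sketch of how such a search could work. It is not the source of HTMLSearchResult: TagSketch, FindIn, Find, Data and Attribute are hypothetical names chosen for illustration, and the sketch assumes well-formed markup with no nesting of the same tag (for example, no table inside a table).

```csharp
using System;

// Hypothetical stand-in for illustration only; not the HTMLSearchResult source.
public sealed class TagSketch
{
    public string Data { get; }       // text between the opening and closing tag
    public string Attribute { get; }  // raw attribute text of the opening tag

    private TagSketch(string data, string attribute)
    {
        Data = data;
        Attribute = attribute;
    }

    // Returns the nth (1-based) occurrence of <tag ...>...</tag> in html, or null.
    public static TagSketch FindIn(string html, string tag, int occurrence = 1)
    {
        int pos = 0;
        int seen = 0;
        while (true)
        {
            int open = html.IndexOf("<" + tag, pos, StringComparison.OrdinalIgnoreCase);
            if (open < 0) return null;
            int nameEnd = open + tag.Length + 1;
            // Skip longer tag names that merely start with the requested one.
            if (nameEnd >= html.Length ||
                (html[nameEnd] != '>' && !char.IsWhiteSpace(html[nameEnd])))
            {
                pos = nameEnd;
                continue;
            }
            int openEnd = html.IndexOf('>', nameEnd);
            if (openEnd < 0) return null;
            int close = html.IndexOf("</" + tag + ">", openEnd, StringComparison.OrdinalIgnoreCase);
            if (close < 0) return null;
            seen++;
            if (seen == occurrence)
            {
                return new TagSketch(
                    html.Substring(openEnd + 1, close - openEnd - 1),
                    html.Substring(nameEnd, openEnd - nameEnd).Trim());
            }
            pos = close; // continue searching after this element
        }
    }

    // Chained lookup: search inside the data of the element found so far.
    public TagSketch Find(string tag, int occurrence = 1)
    {
        return FindIn(Data, tag, occurrence);
    }
}

public static class Demo
{
    public static void Main()
    {
        string html = "<html><body><table>" +
                      "<tr><td>Row 1, Col 1</td></tr>" +
                      "<tr><td colspan=2>My data</td></tr>" +
                      "</table></body></html>";
        TagSketch cell = TagSketch.FindIn(html, "table").Find("tr", 2).Find("td");
        Console.WriteLine(cell.Data);      // prints: My data
        Console.WriteLine(cell.Attribute); // prints: colspan=2
    }
}
```

Because each lookup returns an object that carries the element's inner text, the next call only has to search inside that text, which is what makes the chained style possible.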
Limitations
This parser currently can't retrieve data correctly from a "script" tag if the script contains any if/for blocks that use "<" or ">". This limitation will be removed in the next version.
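Until then, one possible workaround is to pull script bodies out with a regular expression instead of the parser. The sketch below is independent of HTMLSearchResult; ScriptBodies and Nth are hypothetical names. It assumes the script text itself never contains the literal "</script>" and that no attribute value in the opening tag contains ">".

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical workaround for illustration; not part of HTMLSearchResult.cs.
public static class ScriptBodies
{
    // Lazily matches <script ...> ... </script>; Singleline lets '.' span newlines.
    private static readonly Regex ScriptBlock = new Regex(
        @"<script\b[^>]*>(.*?)</script\s*>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the body of the nth (1-based) script block, or null if there is none.
    public static string Nth(string html, int occurrence)
    {
        MatchCollection matches = ScriptBlock.Matches(html);
        if (occurrence < 1 || occurrence > matches.Count) return null;
        return matches[occurrence - 1].Groups[1].Value;
    }

    public static void Main()
    {
        string html = "<html><head><script type=\"text/javascript\">" +
                      "if (a < b && c > d) { run(); }" +
                      "</script></head><body></body></html>";
        Console.WriteLine(Nth(html, 1)); // prints: if (a < b && c > d) { run(); }
    }
}
```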
History
- 17th September, 2007: Initial post