
html2struct Class Library

thorssig, 4 Apr 2014
html2struct parses HTML code into a simple tree-like structure of objects and provides a little tool-set for extracting data from it.

Introduction

html2struct is intended as an aid when data-mining from external HTML sources.

It makes it easy to extract data from HTML files based on tag structure and attributes, without having to rely on other content that may change and cause the extraction to fail.

It parses HTML code into a simple tree-like structure of objects and provides a little tool-set to extract data from it. It is a light-weight parser that does not rely on resource hungry external stuff like browsers or DOM objects. It just creates a simple tree made of htmlTag objects.

It does NOT generate HTML, run scripts or fetch any external references.

It makes no attempt to enforce HTML document standards and does not require the input to conform to them, for example by having <HTML> or <BODY> tags. This makes it easy to parse any segment of HTML code into a structure, which, as far as I know, sets this solution apart from the other HTML/XML parsers I've seen so far.

In theory it should parse other markup languages as well, like XHTML, XML, SGML and other variants. Currently this is mostly untested territory, but I've tried it on a few RSS sources where it parses XML just fine, and in time I hope to make this parser capable of handling all markup languages similar to HTML.

Background

I have been developing a search engine that specializes in mini-ads/classifieds: it collects them from different sources and allows people to search them. I like to call it a kind of localized mini-Google, and today I index up to 2,000 advertisements from 20 different sources each day. This project requires a lot of data-mining from different HTML pages, each presenting its data in its own way, to extract uniform data which people can then search.

I'm a big fan of regular expressions and have been using them to isolate the data from those HTML sources until now. After struggling with them for months and fine-tuning ridiculously complex expressions, I came to the conclusion that it was too hard to define a "correct" expression for ever-changing data sources.

I have repeatedly found my search engine mining data incorrectly after someone made a minor change to their HTML code, and these changes can be notoriously hard to debug. They include adding or removing HTML tags and adding, removing or swapping the order of attributes in an element; even adding a single space somewhere could easily cause a problem. As much as I tried to anticipate those changes I found it impossible, and the expressions repeatedly ceased to match.
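To make the failure mode concrete, here is a minimal sketch of a hand-written pattern breaking on exactly these kinds of changes. The pattern, the class name RegexFragilityDemo and the sample markup are made up for illustration; they are not taken from the search engine itself.

```csharp
using System.Text.RegularExpressions;

public static class RegexFragilityDemo
{
    // Hypothetical pattern: expects 'class' before 'href', separated by single spaces.
    public static readonly Regex LinkPattern =
        new Regex("<a class=\"ad\" href=\"([^\"]*)\">", RegexOptions.IgnoreCase);

    // Returns the captured href, or "" when the pattern does not match.
    public static string ExtractHref(string html)
    {
        Match m = LinkPattern.Match(html);
        return m.Success ? m.Groups[1].Value : "";
    }
}
```

ExtractHref returns "/item/1" for `<a class="ad" href="/item/1">`, but an empty string once the two attributes are swapped or a second space is added after `<a`, even though a browser would render all three variants identically.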

This problem called for a different approach. I wanted to be able to parse HTML code regardless of its casing, order of elements/attributes, white-space or compliance with specific HTML standards.

After a bit of searching I decided to make my own parser, since all the existing solutions I found seemed to include a full-blown browser or a DOM object generator to do the parsing, and tended to reject the HTML code as a whole if it did not comply with some particular standard.

Finally, I decided to share this with the open source community. This is the first time I have done so in an official manner, something I have wanted to do for a long time. I hope you'll find this class useful, and I certainly hope I'm not reinventing the wheel.

Definition

This library consists of two classes: the main wrapper, called htmlStruct, and htmlTag, which represents the tags themselves. As the Class View demonstrates, the structure is quite simple.

A word on attributes: The htmlStruct wrapper has only two attributes:

  • AllTags - holds all parsed elements in an HTML document
  • InnerTags - represents the tree-structure and tends to hold top-level elements such as <HTML> and <HEADER>. It is the list intended for navigating down the tree.

htmlTag is the class you extract data from. It has a few attributes for navigation and data extraction.

  • Tag holds the name of the current tag of course.
  • Attributes provides a Dictionary type access to attributes, such as 'src' and 'href', defined with the current tag.
  • Html holds the HTML source used to create the tag. In case of <TEXT> it holds the text.
  • LineNr holds the line number in the HTML source where the tag was parsed, which is useful for debugging.
  • InnerTags holds the tags that were found within the opening/closing of the current tag, which then can have their inner tags, etc.
  • NextTag, PreviousTag and ParentTag are intended for navigation from a current tag that has been isolated with a search function.
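The shape described above can be sketched as a small stand-in class. TagNode and TreeDump are hypothetical names, and the member types (a string dictionary for Attributes, a list for InnerTags) are assumptions; the real htmlTag may declare them differently. NextTag and PreviousTag are left out to keep the sketch short.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in mirroring the attributes listed above; not the library's htmlTag.
public class TagNode
{
    public string Tag = "";                       // tag name, e.g. "<DIV>" (format assumed)
    public Dictionary<string, string> Attributes = new Dictionary<string, string>();
    public string Html = "";                      // source of the tag, or the text for <TEXT>
    public int LineNr;                            // line in the HTML source
    public List<TagNode> InnerTags = new List<TagNode>();
    public TagNode ParentTag;                     // null for top-level tags (assumed)
}

public static class TreeDump
{
    // Walks down InnerTags recursively and renders one indented line per tag.
    public static string Dump(TagNode node, int depth = 0)
    {
        var sb = new StringBuilder();
        sb.Append(new string(' ', depth * 2))
          .Append(node.Tag)
          .Append(" (line ").Append(node.LineNr).Append(")\n");
        foreach (TagNode child in node.InnerTags)
            sb.Append(Dump(child, depth + 1));
        return sb.ToString();
    }
}
```

A dump like this is a handy way to see which tags ended up nested inside which while you work out your search parameters.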

A word on functions: As a rule of thumb, functions in the wrapper operate on AllTags and search the whole document, while functions in htmlTag operate recursively on InnerTags and do not search outside the scope of the current tag.

  • Parse() - takes an HTML document as a string, populates the attributes and generates the tree structure.
  • Search() - returns a list of tags that match all search criteria based on tag name, attribute or value.
  • FirstTag() - works the same as Search() except that it returns a single tag.
  • FirstHtml() - returns the first tag whose Html attribute matches a regular expression.
  • ToText() - Extracts text from current and its inner tags. If it runs into <BR> or <P> tags they get treated as newlines.
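The behaviour described for ToText() could look roughly like this. It is only a sketch under assumptions: TextTag is a hypothetical stand-in for htmlTag, tag names are assumed to be stored with angle brackets, and the newline is assumed to be emitted where the <BR> or <P> tag opens. The library's actual implementation may differ on all three points.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for htmlTag with only the members this sketch needs.
public class TextTag
{
    public string Tag = "";
    public string Html = "";
    public List<TextTag> InnerTags = new List<TextTag>();

    public TextTag(string tag, string html = "") { Tag = tag; Html = html; }
}

public static class ToTextSketch
{
    // Collects text from the tag itself and, recursively, from its inner tags.
    public static string ToText(TextTag tag)
    {
        var sb = new StringBuilder();
        Collect(tag, sb);
        return sb.ToString();
    }

    static void Collect(TextTag tag, StringBuilder sb)
    {
        string name = tag.Tag.ToUpperInvariant();
        if (name == "<TEXT>") sb.Append(tag.Html);                    // text lives in Html
        else if (name == "<BR>" || name == "<P>") sb.Append('\n');    // treated as newline
        foreach (TextTag inner in tag.InnerTags) Collect(inner, sb);
    }
}
```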

A word on search criteria: All Search() and FirstTag() functions accept the same three search parameters: name of tag, attribute and value. Each of them is treated as a case-insensitive regular expression (the examples below pass an empty string for the parameters they do not use). The search returns the tags for which all given expressions match. If a tag name is given, it returns tags whose names match. If an attribute is given, it returns tags that have an attribute whose name matches. If a value is given, it returns tags that have any attribute whose value matches. If both attribute and value are given, it returns tags that have an attribute whose name matches and whose value matches too (hmm, getting kinky...).
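These matching rules can be sketched in a few lines. This is an illustration, not the library's code: SketchTag and SearchSketch are hypothetical names, tag names are assumed to be stored with angle brackets as in the search strings, and an empty pattern is assumed to mean "no constraint".

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical stand-in for htmlTag; not the library's actual class.
public class SketchTag
{
    public string Tag = "";
    public Dictionary<string, string> Attributes = new Dictionary<string, string>();
    public List<SketchTag> InnerTags = new List<SketchTag>();
}

public static class SearchSketch
{
    // Case-insensitive regex test; an empty pattern accepts everything (assumed).
    static bool Like(string input, string pattern)
    {
        return pattern == "" || Regex.IsMatch(input, pattern, RegexOptions.IgnoreCase);
    }

    public static bool Matches(SketchTag tag, string name, string attribute, string value)
    {
        if (!Like(tag.Tag, name)) return false;
        if (attribute == "" && value == "") return true;
        // When both are given, the same attribute must satisfy both patterns.
        return tag.Attributes.Any(a => Like(a.Key, attribute) && Like(a.Value, value));
    }

    // Recursive search that stays inside the scope of 'root', like the htmlTag functions.
    public static List<SketchTag> Search(SketchTag root, string name, string attribute, string value)
    {
        var found = new List<SketchTag>();
        foreach (SketchTag child in root.InnerTags)
        {
            if (Matches(child, name, attribute, value)) found.Add(child);
            found.AddRange(Search(child, name, attribute, value));
        }
        return found;
    }
}
```

Because the patterns are unanchored regular expressions, a value such as "mailto:" matches any attribute value that merely contains it, which is what makes partial matches like the e-mail example further down work.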

A word on <TEXT>/<COMMENT>/<SCRIPT>: To keep things simple I decided to represent text and comments, which actually appear between tags in HTML, as tags too. This allows you to easily search for <TEXT> or <COMMENT> tags using the search functions. Also, when the parser runs into scripts it just creates the <SCRIPT> tag and puts the code in the Html attribute.

Note that I do not bother with creating closing tags as objects since they are not necessary to represent the structure per se.

Using the class

Operating the main class.

When using this solution you will find that extracting data from HTML becomes ridiculously simple 8)

//
// Operating the main class.
//
htmlStruct tree = new htmlStruct(strHTML);

// And if you intend to re-use the wrapper just do:
tree.Parse(strHTML);

Quick examples of how I find myself using the classes:

I like to define a temporary tag (t) which I then use when extracting data. This prevents "Object reference not set to an instance of an object." errors, allows for sequential searches, and helps with debugging.

htmlTag t;

// Attempt to find a <H3> Tag and get the text contained in it
string sTitle = (t = tree.FirstTag("<H3>", "", "")) != null ? t.ToText() : "";

// If there is any risk of ambiguity I like to combine searches to find one tag and then another tag. Here I use a combination of t and the && (and then) operator to finally select data if found.
string sUrl = (t = tree.FirstTag("<H3>", "", "")) != null && (t = t.FirstTag("<A>", "href", "")) != null ? t.Attributes["href"] : ""; // Looove that one ;)

// Locate a Tag within BODY element using HTML source
htmlTag tag = (t = tree.FirstTag("<BODY>", "", "")) != null && (t = t.FirstHtml("<div class=\"details\">")) != null ? t : null;

// Isolating src of an image is of course a breeze
string strImage = (t = tree.FirstTag("<DIV>", "class", "picture")) != null && (t = t.FirstTag("<IMG>", "src", "")) != null ? t.Attributes["src"] : "";

// How to get all text and comment elements in a document
List<htmlTag> textAndComments = tree.Search("<TEXT>|<COMMENT>", "", "");

// How to get all references in a document
List<htmlTag> references = tree.Search("", "href|src", "");

// How to isolate an email address
string strEmail = (t = tree.FirstTag("<DIV>", "class", "classified_client_info")) != null && (t = t.FirstTag("<A>", "href", "mailto:")) != null ? t.Attributes["href"] : "";

// How to extract a list of divs with varying classes, e.g. search results with light/dark entries
List<htmlTag> entries = (t = tree.FirstTag("<DIV>", "class", "some-listing")) != null ? t.Search("<DIV>", "class", "entry( grey)?") : null;

Conclusion

Regular expressions, as powerful as they are, are not ideal for data mining. They tend to get big and very complex, very quickly. The slightest variation in the code, such as adding a single space, can easily cause an expression to stop matching, and such failures can be notoriously hard to debug.

After a bit of messing around with html2struct I find it quite tolerant to changing HTML code. I find it easy to re-use existing code on new HTML sources with just minor changes to search parameters.

html2struct does not care about changing order of tags or attributes, adding or removing of HTML elements as long as you don't rely on them directly. It does not even care about structural changes unless they remove the tags/attributes you explicitly search for.

html2struct handles data-mining much better than regular expressions alone. In fact they do not even compare to this approach and I kinda regret not doing this before...

Known Issues

It's a good idea to keep in mind that when dealing with HTML code we are dealing with pure unchecked user input. There is no saying what kind of crap people may insert into the code, wittingly or unwittingly. I have debugged this solution far enough to be able to use it without problems, but there are undoubtedly numerous issues that are going to surface now that I have decided to share it.

  • Nested comments and scripts don't get handled correctly. E.g "<!-- rem <!-- more rem --> -->" will cause the parser to skip the last "-->" from the comment and insert it as <TEXT> tag afterwards. Here I run into issues with regular expressions dealing with nested/recursive patterns.
  • Unnamed tags such as "<<em>desperately</em> important>" cause the parser to ignore the opening tag, continue as normal, but finish off with a <TEXT> tag with "important>" as Html.
  • Currently NextTag and PreviousTag point at tags in the order they were discovered during parsing. As a consequence NextTag of a parent element points at its first InnerTag instead of pointing to the next tag that came after it on the same level. PreviousTag also points at the last tag regardless of whether it is a child element of a parent tag that came before the current tag on the same level. Guess we can call this depth-first navigation instead of breadth-first navigation. I have not quite decided whether I should change this so I'll wait for some social pressure.
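Until that is decided, same-level navigation can be derived from ParentTag and InnerTags. The following is a minimal sketch of the idea: NavTag, SiblingNavigation and the Add helper are hypothetical, and it assumes that ParentTag is null for top-level tags, in which case you would have to look in the wrapper's InnerTags instead.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for htmlTag with only the navigation members.
public class NavTag
{
    public string Tag = "";
    public NavTag ParentTag;
    public List<NavTag> InnerTags = new List<NavTag>();

    // Helper for building a small tree by hand; not part of the library.
    public NavTag Add(NavTag child)
    {
        child.ParentTag = this;
        InnerTags.Add(child);
        return child;
    }
}

public static class SiblingNavigation
{
    // Next tag on the same level, or null if there is none.
    public static NavTag NextSibling(NavTag tag)
    {
        return Step(tag, +1);
    }

    // Previous tag on the same level, or null if there is none.
    public static NavTag PreviousSibling(NavTag tag)
    {
        return Step(tag, -1);
    }

    static NavTag Step(NavTag tag, int offset)
    {
        if (tag == null || tag.ParentTag == null) return null;
        List<NavTag> siblings = tag.ParentTag.InnerTags;
        int i = siblings.IndexOf(tag);
        int j = i + offset;
        return (i >= 0 && j >= 0 && j < siblings.Count) ? siblings[j] : null;
    }
}
```

With a <BODY> holding two <DIV> tags, NextSibling of the first <DIV> is the second <DIV>, regardless of how many tags are nested inside the first one.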

History

  • 2-7 Apr: Have been finalizing article and fixing minor issues. Apologies to the editors for all the minor fixes.
  • 4 Apr: Ran into a <![DATA[...]]> element which was not recognized while testing various sources. Fixed that and republished library as version 5.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

thorssig

Iceland
Hi,
 
I am a highly productive, experienced and open-minded problem solver, and my co-workers have described me as being as productive as a whole department.
 
I have worked in the City of London as well as for various government bureaus in Iceland, such as the Customs Bureau and the National Archives.
 
If you find this solution useful and would like to support me, you could donate to my PayPal account. This would be greatly appreciated since I'm completely broke.
 
I am currently unemployed and looking for a job. I am sorry to say that I have been unemployed for 6 years now, basically after running into the Danish crown; I haven't had a single job since then and I desperately need an income. If anyone would like to hire me for a job, they can approach me at freelancer.com or contact me directly.

Last Updated 4 Apr 2014
Article Copyright 2014 by thorssig