Constructing a Generic MarkupParser to Handle HTML, XML, etc.

Ben Fair

Nov 6, 2008

CPOL

This article discusses the construction of a generic markup parsing engine in C#.NET 2.0 as well as a set of objects for working with the markup.

Download source - 41.63 KB

Introduction

This article discusses a markup parsing engine that can be used to work with markup data (such as HTML or XML) when a DOM is not available.

Background

Recently, I was working on a project in which I was interacting with a Web site in code using HttpWebRequest and HttpWebResponse objects. I needed to simulate form submissions, and to do that I needed to set the form values in the request object. I decided to use the HTML text returned by the HttpWebResponse to collect the names of the form controls; I could then set values for them as needed and push them into my request. It seemed like a rather straightforward approach, so I began looking at how I could load the HTML text from the response into some kind of DOM that I could use. Although I thought it would be a simple thing to do, I could not find framework classes that supported this. The XML DOM in the .NET Framework would not work because the HTML text was not well formed (it was not XHTML), so the XML document objects threw exceptions instead of loading it. Searches on the internet turned up wacky COM implementations that almost gave me nightmares. So, next I looked for home-grown HTML parsers. I found a few free HTML parsers on the internet, but none seemed to support the kind of interaction I wanted.

I wanted to have the HTML text loaded into an Object Model that would allow intuitive access to the hierarchy of the markup tags without having to jump through a million hoops. As I thought about this, I came up with an idea of how to parse the text myself and also do it in a generic way that would support any markup format. The idea is to use a set of regular expressions to identify markup tags within a given string. I chose to define a markup tag as a single...

<....>

... occurrence. This means that an opening tag like...

<html>

... would be a markup tag occurrence as well as a closing tag like:

</html>

So, the idea was to use the regular expressions to identify the individual tags and only store the index into the original text of where the tag's text begins along with the length of the text. In this manner, I could create a map of all the tags in the text. Also, only one copy of the text would actually exist, and I could reference a specific tag by doing a substring against the original text using the index and length for the tag. From there, I saw that I would need to take a tag and do a similar analysis to determine its attributes. Again, taking an individual tag's text and feeding it into a regular expression, I was able to identify the tag's attributes.
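As a rough illustration of the index-and-length idea, here is a minimal sketch. The TagSpan and TagMapper names and the simplified pattern are my assumptions for illustration; they are not the classes or the regular expression used in the downloadable source. The pattern is deliberately naive and will misread a ">" inside a quoted attribute value or a comment.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical type: records where a tag sits in the source text.
public struct TagSpan
{
    public int Index;
    public int Length;

    public TagSpan(int index, int length)
    {
        Index = index;
        Length = length;
    }

    // Recover the tag's text on demand from the single shared copy.
    public string GetText(string source)
    {
        return source.Substring(Index, Length);
    }
}

public static class TagMapper
{
    // Simplified pattern (an assumption): "<", one or more non-">" characters, ">".
    private static readonly Regex TagPattern = new Regex("<[^>]+>");

    // Build a linear map of every tag occurrence; no tag text is copied.
    public static List<TagSpan> Map(string source)
    {
        List<TagSpan> spans = new List<TagSpan>();
        foreach (Match m in TagPattern.Matches(source))
        {
            spans.Add(new TagSpan(m.Index, m.Length));
        }
        return spans;
    }
}
```

Calling TagMapper.Map on a string yields one span per opening or closing tag, and GetText on a span returns that tag's text by substringing the original string.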

I defined an attribute as a...

name=value

... occurrence within an opening tag. For example:

<input id="myinputcontrol">

Here, id="myinputcontrol" is the attribute text, 'id' is the name, and 'myinputcontrol' is the value. These also were stored as an index of where the attribute's text begins and the length of its text. I defined an inline tag as a tag that ends with:

/>

and a comment tag as a tag that begins with:

<!--
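The definitions so far can be expressed as a small classification routine. This is only a sketch: the TagKind enum and TagClassifier class are hypothetical names, and the actual library may make these decisions differently (for example, inside its regular expressions).

```csharp
using System;

// Hypothetical names; they mirror the definitions in the text above.
public enum TagKind { Opening, Closing, Inline, Comment }

public static class TagClassifier
{
    public static TagKind Classify(string tagText)
    {
        // Check for a comment first so that "<!-- ... />" is not mistaken for an inline tag.
        if (tagText.StartsWith("<!--", StringComparison.Ordinal)) return TagKind.Comment;
        if (tagText.StartsWith("</", StringComparison.Ordinal)) return TagKind.Closing;
        if (tagText.EndsWith("/>", StringComparison.Ordinal)) return TagKind.Inline;
        return TagKind.Opening;
    }
}
```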

Last, I defined a badly formed tag as an opening tag that is not an inline tag and has no corresponding closing tag. For example, here badTag is a badly formed tag:

<goodTag>
    <badTag>
</goodTag>

This is quite common in HTML and browsers handle it without problems.

Using the Code

The work for the parsing of the text is done in the ParseMarkup() function of the MarkupDocument class. The parsing consists of the following steps:

1. Identification of tags in the raw text - In this step, the raw text is fed through the regular expressions to identify the individual tags. When an individual tag is identified, a MarkupTag object is created for it. At this point, the tags are stored in a linear collection with no hierarchy.
2. Correction of faulty inline tags - In this step, opening tags that are identified as badly formed tags are flagged as faulty inline tags.
3. Construction of the document hierarchy - The linear collection of tags is analyzed to determine the nesting levels and construct the parent-child relationships between the tags.
4. Association of opening tags with their closing tags - At this point, the hierarchy of the document exists and we can now create a connection between an opening tag object and its closing tag object based on the nesting level of the tags and a case-insensitive text comparison.
5. Validation of the document - Validation simply checks that every opening tag is either flagged as a faulty inline tag or has a closing tag associated with it.
6. Removal of closing tags from the document's RootTags collection as well as from all children - In essence, this makes a closing tag accessible only from the ClosingTag property of its opening tag. In general, if you are working with a markup document, you aren't much concerned with the closing tags.
7. Clearing of the internal cache - Through testing, I found that large markup documents benefit from caching of strings in the markup tag objects to prevent unnecessary, repetitive substringing of the underlying text. The cache is essentially a local copy of the substring result that is created at first use. After the parsing is complete, the local copies of the strings are cleared.
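To give a feel for how a stack helps with the matching of opening and closing tags, here is a minimal sketch that finds opening tags with no corresponding closing tag. It is not the code in ParseMarkup(): it folds the flagging and association steps into a single pass, works on a plain list of tag strings, and the TagMatcher, NameOf, and FindUnclosed names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class TagMatcher
{
    // Extract the tag name: "<input id='x'>" -> "input", "</html>" -> "html".
    public static string NameOf(string tagText)
    {
        string inner = tagText.Trim('<', '>', '/').Trim();
        int space = inner.IndexOfAny(new char[] { ' ', '\t', '\r', '\n' });
        return space < 0 ? inner : inner.Substring(0, space);
    }

    private static bool SameName(string tagA, string tagB)
    {
        return string.Equals(NameOf(tagA), NameOf(tagB), StringComparison.OrdinalIgnoreCase);
    }

    // Returns the indexes (into tags) of opening tags that are never closed.
    public static List<int> FindUnclosed(IList<string> tags)
    {
        List<int> unclosed = new List<int>();
        Stack<int> open = new Stack<int>(); // indexes of currently open tags

        for (int i = 0; i < tags.Count; i++)
        {
            string tag = tags[i];
            if (tag.StartsWith("<!--", StringComparison.Ordinal)) continue; // comment
            if (tag.EndsWith("/>", StringComparison.Ordinal)) continue;     // inline tag
            if (!tag.StartsWith("</", StringComparison.Ordinal))
            {
                open.Push(i); // opening tag
                continue;
            }

            // Closing tag: look down the stack for an open tag with the same name.
            bool found = false;
            foreach (int j in open)
            {
                if (SameName(tags[j], tag)) { found = true; break; }
            }
            if (!found) continue; // stray closing tag: ignored in this sketch

            // Tags opened after the match were never closed.
            while (!SameName(tags[open.Peek()], tag))
            {
                unclosed.Add(open.Pop());
            }
            open.Pop(); // the matching opening tag
        }

        unclosed.AddRange(open); // anything still open at the end of the document
        unclosed.Sort();
        return unclosed;
    }
}
```

For the goodTag/badTag example shown earlier, FindUnclosed returns the index of the badTag entry only, because goodTag is paired with its closing tag.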

The MarkupParser uses three classes to represent the markup text in an object oriented manner:

MarkupDocument
MarkupTag
MarkupAttribute

The MarkupDocument class, as its name implies, represents the document as a whole. It contains a RootTags property, which is a MarkupTag[] holding the tags at the root level of the document. The MarkupTag class represents an individual tag in the document, and it has a Children property that is also an array of MarkupTag objects representing the tags that are embedded within it in the document hierarchy. Last, each MarkupTag has an Attributes property, which is an array of MarkupAttribute objects that represent the attributes of the tag. Each MarkupAttribute has Name and Value properties that supply the name and value of the attribute as strings. When an attribute value is quoted, the quotes are removed, so you will always get the text inside the quotes. For example:

<product id="1" />

Here, the 'id' attribute's Value property will return 1 rather than "1".
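The quote handling can be sketched with a regular expression that captures the value without its surrounding quotes. The pattern and the AttributeParser class below are my own illustration, not the expression used by MarkupAttribute; for brevity, the sketch returns strings instead of storing an index and length. Note that it would treat the trailing slash in an unquoted value such as id=1/> as part of the value.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AttributeParser
{
    // Group 1: name. Group 2: double-quoted value. Group 3: single-quoted value.
    // Group 4: unquoted value. The quotes themselves sit outside the capture groups.
    private static readonly Regex AttributePattern = new Regex(
        @"([\w:.-]+)\s*=\s*(?:""([^""]*)""|'([^']*)'|([^\s""'>]+))");

    public static List<KeyValuePair<string, string>> Parse(string tagText)
    {
        List<KeyValuePair<string, string>> result = new List<KeyValuePair<string, string>>();
        foreach (Match m in AttributePattern.Matches(tagText))
        {
            string value;
            if (m.Groups[2].Success) value = m.Groups[2].Value;
            else if (m.Groups[3].Success) value = m.Groups[3].Value;
            else value = m.Groups[4].Value;

            result.Add(new KeyValuePair<string, string>(m.Groups[1].Value, value));
        }
        return result;
    }
}
```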

The markup text is parsed automatically when a MarkupDocument is instantiated: the constructor requires a string containing the markup text to load and parse. Once the constructor has executed, the text has been parsed and the document has been completely filled with MarkupTag objects; it's ready for use! Here's an example of parsing some HTML text from a string:

string htmlText = GetHTMLText();
MarkupDocument doc = new MarkupDocument(htmlText);

To access the inner text of a tag at the root of the document named 'html':

string innerText = doc["html"][0].InnerText;

Note the use of the array index [0] following the reference to the tag name html. This is necessary because the string indexer returns a MarkupTag[] matching the supplied name (note that the MarkupTag class also has a string indexer that indexes into its Children property). With markup documents, it is generally valid to allow multiple tags of the same name. For example, the following is valid markup...

<products>
    <product id="1" name="lamp" />
    <product id="2" name="pillow" />
</products>

... even though there are multiple product tags defined. Cases where a tag is limited to a single occurrence are specific to markup implementations, such as HTML with its html, head, body, etc. tags. For this reason, I added an HTMLDocument class that is a wrapper around the MarkupDocument class and provides Head, Body, and Form properties that give access directly to their respective tags. It can be used like this:

string htmlText = GetHTMLText();
HTMLDocument doc = HTMLDocument.Load(htmlText);
string innerText = doc.Head.InnerText;

Notice here that htmlText is provided to the HTMLDocument class through a static method rather than a constructor. This is because some validation must be done to ensure that the text is an HTML document (an html tag must be at the root of the document, and it can have only one head tag and one body tag).

Also, all string comparisons are case-insensitive by default. This is desirable as an opening tag and closing tag can generally have a different case. This also means that when accessing a tag via the string indexer, you don't need to worry about the case. So,...

string innerText = doc["html"][0].InnerText;

... and...

string innerText = doc["HTML"][0].InnerText;

... are the same.
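A string indexer with these semantics might look roughly like the following sketch. The Node class is a hypothetical stand-in, not the article's MarkupTag, and it assumes an ordinal case-insensitive comparison. In this sketch a name with no matches yields an empty array, so indexing [0] on the result would throw; I make no claim here about what the library does in that case.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a tag node; not the article's MarkupTag class.
public class Node
{
    public readonly string Name;
    private readonly List<Node> children = new List<Node>();

    public Node(string name)
    {
        Name = name;
    }

    // Returns this node so that calls can be chained.
    public Node Add(Node child)
    {
        children.Add(child);
        return this;
    }

    // String indexer: every direct child whose name matches, ignoring case.
    public Node[] this[string name]
    {
        get
        {
            return children.FindAll(delegate(Node n)
            {
                return string.Equals(n.Name, name, StringComparison.OrdinalIgnoreCase);
            }).ToArray();
        }
    }
}
```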

When testing performance, I found that loading a complex XML document that was roughly 1 MB and contained 65,000+ tags took about 6 seconds. A typical HTML document of roughly 1.5 KB took less than a second. You can adjust the options on the MarkupDocument constructor to see how they affect performance. In particular, setting fixBadlyFormedInlineTags to false can improve performance considerably because that step of the parsing process is skipped; this is only advisable if you are certain the markup is well formed. Setting caseSensitiveComparisons to true may also help, as case-sensitive comparisons should generally be faster; again, this is only advisable if you are certain that opening and closing tags in the document have matching case.

Points of Interest

When you call a constructor or static creation method that does not take the useCaching parameter, caching is determined automatically from the size of the text: it is used when the raw text is larger than 4K characters.
There are static 'Known Inline' members in the MarkupDocument class that account for tags that would otherwise be erroneously flagged as faulty inline tags. The only one currently in place is the <?xml ... ?> tag used by XML, since it never has a corresponding closing tag and is a well-known standard. You can add others to the KnownInlineTags static property as needed.
The generic Queue and Stack classes in the .NET Framework were invaluable in the parsing process, and it was a good refresher in using a stack.
XML's CDATA tag is currently not supported as the main regular expression that identifies the tags in the document text does not account for it.
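
The caching behavior described in the first point above can be sketched as a lazily filled substring. This is an illustration only: the CachedSpan class and its members are hypothetical, and I am assuming that "4K characters" means 4096.

```csharp
using System;

// Hypothetical sketch of a lazily cached substring; names are illustrative.
public class CachedSpan
{
    // Assumption: "4K characters" is taken to mean 4096.
    public const int CachingThreshold = 4096;

    private readonly string source;
    private readonly int index;
    private readonly int length;
    private readonly bool useCaching;
    private string cached; // null until first use, and again after ClearCache()

    // No useCaching argument: decide from the size of the raw text.
    public CachedSpan(string source, int index, int length)
        : this(source, index, length, source.Length > CachingThreshold)
    {
    }

    public CachedSpan(string source, int index, int length, bool useCaching)
    {
        this.source = source;
        this.index = index;
        this.length = length;
        this.useCaching = useCaching;
    }

    public bool UseCaching { get { return useCaching; } }
    public bool IsCached { get { return cached != null; } }

    public string Text
    {
        get
        {
            if (!useCaching) return source.Substring(index, length);
            if (cached == null) cached = source.Substring(index, length); // created at first use
            return cached;
        }
    }

    // Called once parsing is complete to release the local copy.
    public void ClearCache()
    {
        cached = null;
    }
}
```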

History

11/07/2008 - Minor grammatical corrections; added note about XML CDATA tag
11/06/2008 - Initial publication