This article discusses a markup parsing engine that can be used to work with markup data (such as HTML or XML) when a DOM is not available.
Recently, I was working on a project in which I was interacting with a Web site in code using HttpWebResponse objects. I needed to simulate form submissions, and to do that, I had to set the form values in the Request object. I decided to use the HTML text returned by the HttpWebResponse to collect the names of the form controls; I could then set values for them as needed and push that into my Request object. It seemed like a straightforward approach, so I began looking at how I could load the HTML text from the WebResponse into some kind of DOM that I could use. Although I thought it would be a simple thing to do, I could not find framework classes that supported this. The XML DOM in the .NET Framework would not work, as the HTML text was not well-formed XML (i.e., not XHTML), and the XML document objects in the .NET Framework would throw exceptions rather than load it. Searches on the Internet turned up wacky COM implementations that almost gave me nightmares. Next, I looked for home-grown HTML parsers. I found a few free ones on the Internet, but none seemed to support the kind of interaction I wanted.
I wanted to have the HTML text loaded into an object model that would allow intuitive access to the hierarchy of the markup tags without having to jump through a million hoops. As I thought about this, I came up with an idea of how to parse the text myself, and to do it in a generic way that would support any markup format. The idea is to use a set of regular expressions to identify markup tags within a given string. I chose to define a markup tag as a single <...> occurrence. This means that an opening tag like <tag> would be a markup tag occurrence, as well as a closing tag like </tag>.
So, the idea was to use the regular expressions to identify the individual tags and only store the index into the original text of where the tag's text begins along with the length of the text. In this manner, I could create a map of all the tags in the text. Also, only one copy of the text would actually exist, and I could reference a specific tag by doing a substring against the original text using the index and length for the tag. From there, I saw that I would need to take a tag and do a similar analysis to determine its attributes. Again, taking an individual tag's text and feeding it into a regular expression, I was able to identify the tag's attributes.
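The library described here is written in C#, but the core idea is language-neutral. As an illustration only (the names and the deliberately simplified pattern below are hypothetical, not the library's actual regular expressions), here is a minimal Python sketch of mapping tags by storing just each tag's index and length against a single copy of the original text:

```python
import re

# Simplified tag pattern: anything between '<' and '>'. A real parser,
# like the one in the article, uses a more careful set of expressions.
TAG_RE = re.compile(r"<[^>]+>")

def map_tags(text):
    """Return (index, length) pairs for every tag occurrence in text."""
    return [(m.start(), m.end() - m.start()) for m in TAG_RE.finditer(text)]

def tag_text(text, index, length):
    """Recover a tag's text by substringing the original document."""
    return text[index:index + length]

html = '<html><body id="main">Hello</body></html>'
tags = map_tags(html)
# Only one copy of the document exists; each tag is just an (index, length)
# pair pointing back into it.
first = tag_text(html, *tags[0])   # '<html>'
```

The same substring trick then feeds an individual tag's text into a second regular expression to pick out its attributes.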
I defined an attribute as a name="value" occurrence within an opening tag. For example, in <input type="text" id="myinputcontrol">, id="myinputcontrol" is the attribute text, 'id' is the name, and 'myinputcontrol' is the value. These also were stored as an index of where the attribute's text begins and the length of its text. I defined an inline tag as a tag that ends with '/>' (e.g., <br />), and a comment tag as a tag that begins with '<!--'. Last, I defined a badly formed tag as an opening tag that is not an inline tag and has no corresponding closing tag. For example, here badTag is a badly formed tag:

<p><badTag>Some text</p>

This is quite common in HTML, and browsers handle it without problems.
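To make these definitions concrete, here is a small Python sketch (again hypothetical and simplified, not the article's actual C# expressions) that extracts name/value attribute pairs from a tag's text, strips the surrounding quotes from quoted values, and classifies inline and comment tags:

```python
import re

# Simplified attribute pattern: name="value", name='value', or name=value.
ATTR_RE = re.compile(r'([\w:-]+)\s*=\s*("([^"]*)"|\'([^\']*)\'|(\S+))')

def parse_attributes(tag_text):
    """Extract (name, value) pairs from an opening tag's text.
    Surrounding quotes are stripped from quoted values."""
    attrs = []
    for m in ATTR_RE.finditer(tag_text):
        name = m.group(1)
        # Whichever alternative matched (double-quoted, single-quoted,
        # or bare) supplies the unquoted value.
        value = m.group(3) or m.group(4) or m.group(5) or ""
        attrs.append((name, value))
    return attrs

def is_inline(tag_text):
    """An inline tag ends with '/>' (e.g. <br />)."""
    return tag_text.rstrip().endswith("/>")

def is_comment(tag_text):
    """A comment tag begins with '<!--'."""
    return tag_text.lstrip().startswith("<!--")

parse_attributes('<input type="text" id="myinputcontrol" />')
# -> [('type', 'text'), ('id', 'myinputcontrol')]
```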
Using the Code
The work for the parsing of the text is done in the
ParseMarkup() function of the
MarkupDocument class. The parsing consists of the following steps:
- Identification of tags in the raw text - In this step, the raw text is fed through the regular expressions to identify the individual tags. When an individual tag is identified, a
MarkupTag object is created for it. At this point, the tags are stored in a linear collection with no hierarchy.
- Correction of faulty inline tags - In this step, opening tags that are identified as badly formed tags are flagged.
- Construction of the document hierarchy - The linear collection of tags is analyzed to determine the nesting levels and construct the parent-child relationships between the tags.
- Association of opening tags with their closing tags - At this point, the hierarchy of the document exists and we can now create a connection between an opening tag object and its closing tag object based on the nesting level of the tags and a case-insensitive text comparison.
- Validation of the document - Every opening tag should either be flagged as a faulty inline tag or have a closing tag associated with it.
- Removal of closing tags from the document's
RootTags collection as well as from all children - In essence, this makes the closing tags accessible only from the
ClosingTag property of its opening tag. In general, if you are working with a markup document, you aren't much concerned with the closing tags.
- Clearing of the internal cache - Through testing, I found that large markup documents benefit from caching strings in the markup tag objects to prevent unnecessary, repetitive substringing of the underlying text. The cache is essentially a local copy of the substring result that is created at first use. After the parsing is complete, these local copies are cleared.
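The hierarchy-building and tag-association steps above are classic stack work. As a rough Python sketch of the technique (not the library's C# implementation; the names and the simplified regular expression are mine), a stack pairs each closing tag with the most recent matching open, case-insensitively, while self-closing inline tags are added as children without being pushed:

```python
import re

# Simplified: captures the tag name and whether the tag self-closes.
TAG_RE = re.compile(r"</?([\w:-]+)[^>]*?(/?)>")

class Tag:
    def __init__(self, name):
        self.name = name
        self.children = []
        self.closing_tag = None   # set when the matching close is found

def build_tree(text):
    """Build a tag hierarchy with a stack, pairing opening tags with
    their closing tags case-insensitively (steps 3 and 4 above)."""
    root = Tag("#document")
    stack = [root]
    for m in TAG_RE.finditer(text):
        name, self_closing = m.group(1), m.group(2) == "/"
        if m.group(0).startswith("</"):            # a closing tag
            if stack[-1].name.lower() == name.lower():
                stack[-1].closing_tag = m.group(0)
                stack.pop()
        else:                                      # an opening tag
            tag = Tag(name)
            stack[-1].children.append(tag)
            if not self_closing:                   # inline tags never nest
                stack.append(tag)
    return root

doc = build_tree("<HTML><body><br/><p>hi</p></body></html>")
```

An opening tag still on the stack when its close never arrives is exactly the "badly formed tag" case the validation step flags.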
MarkupParser uses three classes to represent the markup text in an object-oriented manner. The MarkupDocument class, as its name implies, represents the document as a whole. It contains a RootTags property, which is an array of the MarkupTag objects at the root level of the document. The MarkupTag class represents an individual tag in the document, and it has a Children property that is also an array of MarkupTag objects, representing the tags that are embedded within it in the document hierarchy. Last, each MarkupTag has an Attributes property, which is an array of MarkupAttribute objects that represent the attributes of the tag. Each MarkupAttribute has a Name property and a Value property that supply the name and value of the attribute as strings. When an attribute value is quoted, the quotes are removed, so you will always get the text inside the quotes. For example:
<product id="1" />
Here, the 'id' attribute's Value property will return 1 rather than "1".
The parsing of the markup text is done automatically when a MarkupDocument is instantiated, as a required constructor parameter is a string of the markup text to load and parse. Once the constructor has executed, the text has been parsed and the document has been completely filled with MarkupTag objects; it's ready for use! Here's an example of parsing some HTML text from a string:
string htmlText = GetHTMLText();
MarkupDocument doc = new MarkupDocument(htmlText);
To access the inner text of a tag at the root of the document named 'html':

string innerText = doc["html"][0].InnerText;
Note the use of the array index [0] following the reference to the tag name html. This is necessary because the string indexer returns an array of MarkupTag objects matching the supplied name (note that the MarkupTag class also has a string indexer that indexes into its Children property). With markup documents, it is generally valid to allow multiple tags of the same name. For example, the following is valid markup...
<product id="1" name="lamp" />
<product id="2" name="pillow" />
... even though there are multiple product tags defined. Cases where a tag is limited to a single occurrence are specific to markup implementations, such as HTML with its html, head, body, etc. tags. For this reason, I added an HTMLDocument class that is a wrapper around the MarkupDocument class and provides Head, Body, and Form properties that give access directly to their respective tags. It can be used like this:
string htmlText = GetHTMLText();
HTMLDocument doc = HTMLDocument.Load(htmlText);
string innerText = doc.Head.InnerText;
Notice here that htmlText is provided to the HTMLDocument class through a static method rather than a constructor. This is because some validation must be done to ensure the text is a valid HTML document: an html tag must be at the root of the document, and it can only have one head and one body tag.
By default, string comparisons are case-insensitive. This is desirable, as an opening tag and its closing tag can have different case. It also means that when accessing a tag via the string indexer, you don't need to worry about the case. So,...
string innerText = doc["html"][0].InnerText;
string innerText = doc["HTML"][0].InnerText;
... are the same.
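The case-insensitive, multi-match lookup can be sketched in a few lines of Python (an illustration of the behavior described above, not the library's C# indexer; the function name and dictionary shape are mine):

```python
def find_by_name(tags, name):
    """Return every tag whose name matches, ignoring case - mirroring
    the string indexer's behavior described above."""
    return [t for t in tags if t["name"].lower() == name.lower()]

# Tags are represented as plain dicts here for brevity.
products = [
    {"name": "product", "id": "1"},
    {"name": "PRODUCT", "id": "2"},
    {"name": "review",  "id": "3"},
]
matches = find_by_name(products, "Product")
# Both product tags are returned regardless of case; take [0] for the first.
```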
When testing the performance, I found that loading a complex XML document that was roughly 1 MB and contained 65,000+ tags took about 6 seconds. Likewise, a typical HTML document that was roughly 1.5 KB took less than a second. You can use the MarkupDocument constructor and manipulate some of the options to see how they affect performance. In particular, the fixBadlyFormedInlineTags option can yield a big performance increase if it is false, as that step in the parsing process will be skipped; this, of course, is only safe if you are certain the markup is well formed. The caseSensitiveComparisons parameter may also provide a performance gain if it is true, as case-sensitive comparisons generally perform better; likewise, this is only safe if you are certain that opening and closing tags in the document have matching case.
Points of Interest
- When the useCaching parameter is not supplied to the constructor or static creation method, caching is determined automatically based on the size of the text: caching is used when the raw text is larger than 4K characters.
- There are static 'Known Inline' members in the MarkupDocument class that are used to account for tags that may erroneously be flagged as faulty inline tags. The only one currently in place is the <?xml ... ?> declaration used by XML, since it never has a corresponding closing tag and is a well-known standard. You can add others to the KnownInlineTags static property as needed.
- The generic Stack class in the .NET Framework was invaluable in the parsing process, and it was a good refresher in using a stack.
- The CDATA tag is currently not supported, as the main regular expression that identifies the tags in the document text does not account for it.
- 11/07/2008 - Minor grammatical corrections; added note about XML
- 11/06/2008 - Initial publication