An Elementary HTML Parser

16 May 2002
A class to parse HTML - a part of the ScreenSaver competition

[Figure: sample image, HTMLParser.gif]

Introduction

Around the second week of May, Christian came up with a great feature for the screensaver: parsing the HTML contained in some people's names.  Since he was busy adding other features, I took up the task.

My first thought was to use regular expressions with named references to pick out the tags, then run them over the captured group again and again until nothing but text was left in the group.

This, as it turns out, is a lot of work for something the .NET Framework already does.  First, a little background on the type of HTML that we will be able to parse by the end of the article.

Structure of an HTML fragment

A fragment consists of zero or more tags and zero or more strings.  It is perfectly valid for a fragment to consist of an empty string.

A single tag has three parts:  opening, value, and closing.  The opening part of the tag consists of the tag name, and zero or more quoted-attributes.  The value part of the tag can contain more tags, text, or nothing.  The closing part of the tag consists of just the name prefixed with a forward slash (/name).

We will not allow one tag to start inside another tag but end outside of it.  So this is not valid: <b><i>f</b>oo</i>.  Notice that the 'b' tag ends before the 'i' tag.

Here is an example of a valid HTML fragment.

<font color='blue'>M<b>y</b> <i>name</i></font> 
<b>i</b>s....

Here is the breakdown of the above fragment:

[Figure: tree output of the HTML]

The tree nodes that are in quotes are the actual values that get output in the text.  In this example the only tag that has an attribute is the Font tag, and you'll notice that the attribute is named "color" with the value of "blue".

Parsing the HTML fragment

To produce the resulting formatted document, you start at the top and work your way down, following each node as you come to it and building up styles as you go.  When you reach some plain text, you take the current format and package it with the text, storing it for later use.  When you reach a node, you cache the current format, so that when you go into the next child node you don't carry over formatting from the first child node.

So how does one actually go about doing this?  The fragment structure I have described above is really an XML fragment, so we can use .NET's XML parser to do the dirty work and we can just step through the resulting "tree".

There is one small problem that we need to fix before we can let .NET do the work for us.  A valid XML document has exactly one root tag, but our HTML fragment doesn't require everything to be contained in one tag.  The remedy is to place the fragment inside a single tag, such as <html> and </html>.
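As a minimal sketch of that remedy (the class and method names here are hypothetical and not taken from the article's source), the fragment can be wrapped in a root tag and handed to XmlDocument:

```csharp
using System.Xml;

public static class FragmentLoader
{
    // Wraps the fragment in a single <html> root element so that it
    // becomes a well-formed XML document, then lets XmlDocument parse it.
    // Returns the <html> element; its children are the fragment's nodes.
    public static XmlNode LoadFragment(string fragment)
    {
        XmlDocument doc = new XmlDocument();
        // LoadXml throws an XmlException if the wrapped text is not
        // well-formed XML (for example, improperly nested tags).
        doc.LoadXml("<html>" + fragment + "</html>");
        return doc.DocumentElement;
    }
}
```

Calling LoadFragment("<b>i</b>s....") should yield an html element with two children: a b element and a #text node.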

Formatting structure

Our simple parser supports only a handful of formats: text color, bold, italic, underline, superscript, and subscript.  Adding additional formatting is just a matter of looking for more tags and attributes.

Since formatting is built up as we go, I have created a class called TextStyle, which is responsible for keeping track of the current format.  It has two constructors: one that creates a default TextStyle with no formatting, and another that copies a TextStyle.
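The real class is in the source download; the following is only a rough sketch of what such a class might look like. Apart from ForeColor, which appears in the code later in the article, every member name and default value here is an assumption.

```csharp
using System.Drawing;

// Hypothetical sketch of a TextStyle-like class. Only ForeColor is taken
// from the article; the other members and all defaults are assumptions.
public class TextStyle
{
    public Color ForeColor = Color.Black;  // assumed default color
    public bool Bold;
    public bool Italic;
    public bool Underline;
    public bool Superscript;
    public bool Subscript;
    public string Text = "";               // the text this style applies to

    // Default style: no formatting.
    public TextStyle() { }

    // Copy constructor: starts from the formatting of an existing style
    // so that a child node can add to it without changing the original.
    public TextStyle(TextStyle other)
    {
        ForeColor = other.ForeColor;
        Bold = other.Bold;
        Italic = other.Italic;
        Underline = other.Underline;
        Superscript = other.Superscript;
        Subscript = other.Subscript;
    }
}
```

The copy constructor is what makes the "cache the current format" step cheap: each node copies its parent's style and modifies only the copy.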

Parsing the HTML (XML) document

The code uses recursion to parse each node as it comes up.  The framework creates an XML node for each run of plain text; that node is named #text.

Rather than copy and paste all of the code here, I will refer you to the source; I will describe the key parts of the code.

Determining font color

    XmlNode node; // The node we are working on
    switch(node.Name.ToLower())
    {
    case "font":
        if( node.Attributes != null )
        {
            foreach( XmlAttribute attribute 
                in node.Attributes)
            {
                switch(attribute.Name.ToLower())
                {
                    case "color":
                        // ParseColor is actually 
                        // inline in the code but
                        // I broke it out so it
                        // won't scroll the page
                        ParseColor(attribute, style);
                        break;
                }
            }
        }
        break;
    }
 
private void ParseColor(XmlAttribute attribute, TextStyle style)
{
    string value = attribute.Value;
    if( value.Length == 0 )
        return; // empty attribute: keep the current color

    if( value[0] != '#' )
        style.ForeColor = Color.FromName(value);
    else
    {
        try
        {
            // "#RRGGBB": each pair of hex digits is one channel
            int r, g, b;
            r = Int32.Parse(value.Substring(1,2), 
                System.Globalization.NumberStyles.HexNumber);
            g = Int32.Parse(value.Substring(3,2), 
                System.Globalization.NumberStyles.HexNumber);
            b = Int32.Parse(value.Substring(5,2), 
                System.Globalization.NumberStyles.HexNumber);
            style.ForeColor = Color.FromArgb(r, g, b);
        }
        catch
        {
            // Invalid color value: keep the current color
        }
    }
}

In ParseColor you'll notice that I'm not using the Convert class to convert the hexadecimal RGB values to Int32s.  This is because I need to specify the HexNumber NumberStyle in the Parse() call so that it parses the hexadecimal digits instead of throwing an exception.  I wrapped the code in a try/catch block just in case the color value isn't valid.

That's the only interesting bit of code I see in the parsing (it's also the only one that is more than 2 lines long).

As you are building up styles, eventually you will run across a node named "#text".  This node contains the text that sits inside all of the tags open at that point.  With that knowledge in hand, we can take the text it contains, pair it with the current formatting, and add it to the list of formatted text we have already created.
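The walk described above can be sketched as follows. This is not the article's actual code: the Run class is a cut-down, hypothetical stand-in for TextStyle, and only the b, i, and u tags are handled to keep the example short.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class FragmentWalker
{
    // Minimal stand-in for the article's TextStyle (hypothetical members).
    public class Run
    {
        public bool Bold, Italic, Underline;
        public string Text = "";

        public Run() { }

        public Run(Run other)
        {
            Bold = other.Bold;
            Italic = other.Italic;
            Underline = other.Underline;
        }
    }

    // Parses the fragment and returns the formatted runs in document order.
    public static List<Run> Parse(string fragment)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml("<html>" + fragment + "</html>");
        List<Run> runs = new List<Run>();
        Walk(doc.DocumentElement, new Run(), runs);
        return runs;
    }

    private static void Walk(XmlNode node, Run inherited, List<Run> runs)
    {
        if (node.Name == "#text")
        {
            // Plain text: package it with the formatting built up so far.
            Run run = new Run(inherited);
            run.Text = node.Value;
            runs.Add(run);
            return;
        }

        // Copy before modifying, so that sibling nodes do not pick up
        // the formatting added by this tag.
        Run current = new Run(inherited);
        switch (node.Name.ToLower())
        {
            case "b": current.Bold = true; break;
            case "i": current.Italic = true; break;
            case "u": current.Underline = true; break;
        }

        foreach (XmlNode child in node.ChildNodes)
            Walk(child, current, runs);
    }
}
```

For the input "<b>M<i>y</i></b> name" this should produce three runs: "M" (bold), "y" (bold and italic), and " name" (no formatting).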

Displaying the formatted text

Displaying the text is relatively easy, assuming the enumerator for your collection retrieves items in the same order they were added in.  Using the foreach looping method you can iterate through each TextStyle and format the text as needed.

Common problems

Unquoted-attributes

Depending on who wrote the HTML, the attributes may or may not be quoted.  This is a problem because XML requires them to be quoted.  This results in having to pre-parse the HTML to ensure that all attributes are quoted.

It turns out that regular expressions work quite well for this job.  I have to use a two-step process in quoting the attributes because I couldn't find a way to use the named references in the string that is used to replace the selected attribute name/value pair.

The regular expression (RE) for finding the attribute name/value pairs follows:

\<(?<tagName>[a-zA-Z]*) (?<attributeName>[a-zA-Z]*)( )*=( )*(?<attributeValue>[#a-zA-Z0-9]*)?>

There is a bug in the RE above (and in the code that does the actual replacement): it only works on tags that have one attribute.  As soon as I figure out how to fix it, I will post the updated code and explain it here.  Since the parser only handles one tag that has attributes, this isn't likely to be a problem.  Because of the bug, I'm going to wait until I fix it before describing the second part of quoting unquoted attributes.
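One possible way around the single-attribute limitation, offered as an untested-in-the-parser sketch rather than the author's fix, is to match each opening tag first and then quote every unquoted name/value pair inside it with a MatchEvaluator. The class and method names are hypothetical, and the value character class is the same one used in the RE above, so values containing other characters (such as % or /) are not handled.

```csharp
using System.Text.RegularExpressions;

public static class AttributeQuoter
{
    // Matches an opening tag. Assumes attribute values contain no '<' or '>'.
    private static readonly Regex Tag = new Regex(@"<[a-zA-Z][^<>]*>");

    // Either an already-quoted value (left untouched) or an unquoted
    // name=value pair, using the character classes from the article's RE.
    private static readonly Regex Attribute = new Regex(
        @"(?<quoted>""[^""]*""|'[^']*')|(?<name>[a-zA-Z]+)\s*=\s*(?<value>[#a-zA-Z0-9]+)");

    // Returns the HTML with every unquoted attribute value wrapped in
    // single quotes; text outside of opening tags is not modified.
    public static string QuoteAttributes(string html)
    {
        return Tag.Replace(html, tag =>
            Attribute.Replace(tag.Value, m =>
            {
                if (m.Groups["quoted"].Success)
                    return m.Value; // already quoted: keep as-is
                return m.Groups["name"].Value + "='" + m.Groups["value"].Value + "'";
            }));
    }
}
```

Matching quoted values as their own alternative keeps the second alternative from firing on an equals sign that happens to sit inside a quoted value. As an aside, .NET replacement strings also accept named groups in the form ${name}, which may make the two-step process unnecessary in simpler cases.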

Malformed XML

In the event that the HTML is malformed, the parser will throw an exception.  Fear not though, I have supplied a method that removes all XML-like tags from a string that is passed in.  It uses a very simple RE to find a tag, then replaces it with an empty string.  <.*?> is the RE for finding a tag.
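A minimal sketch of both halves, using the RE quoted above; the class and method names are hypothetical and the article's own method may differ:

```csharp
using System.Text.RegularExpressions;
using System.Xml;

public static class TagStripper
{
    // Removes anything that looks like a tag, using the article's RE.
    public static string StripTags(string text)
    {
        return Regex.Replace(text, "<.*?>", "");
    }

    // Returns true if the fragment, wrapped in a root tag, parses as XML.
    public static bool IsWellFormed(string fragment)
    {
        try
        {
            new XmlDocument().LoadXml("<html>" + fragment + "</html>");
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

A caller could check IsWellFormed first and fall back to StripTags when it returns false. Note that, by default, `.` does not match a newline, so a tag that is split across lines is left in place unless RegexOptions.Singleline is passed; a literal '<' in the text can also cause more than a tag to be removed.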

Conclusion

There are a few improvements I would like to make to the code.  Mike Dunn suggested fixing up the HTML so that it becomes valid XML before trying to parse it as an XML document.  It doesn't seem like it would be too difficult to do, but time doesn't allow me to implement it.  Maybe I'll get to it next week.

The demo program parses the HTML, then displays the resulting collection of TextStyles that it built up.  If there is an exception while parsing, it will just strip the HTML tags from the text and stick the result into a default TextStyle.

License

This article, along with any associated source code and files, is licensed under The BSD License.


Written By
Software Developer (Senior) InfoPlanIT, LLC
United States
James has been programming in C/C++ since 1998, and grew fond of databases in 1999. His latest interest has been in C# and .NET where he has been having fun writing code starting when .NET v1.0 was in its first beta.

He is currently a senior developer and consultant for InfoPlanIT, a small international consulting company that focuses on custom solutions and business intelligence applications.

He was previously employed by ComponentOne where he was a Product Manager for the ActiveReports, Data Dynamics Reports, and ActiveAnalysis products.

Code contained in articles where he is the sole author is licensed via the new BSD license.
