[Figure: sample image, HTMLParser.gif]

Introduction

Around the second week of May, Christian came up with a great feature for the screensaver: parsing the HTML contained in some people's names.  Since he was busy adding other features, I took up the task.

My first thought was to use regular expressions with named groups to pick out the tags, then run the expression again on the captured group, repeating until nothing but text was left in the group.

This, as it turns out, is a lot of work for something the .NET Framework already does.  First, a little background on the type of HTML that we will be parsing by the end of the article.

Structure of an HTML fragment

A fragment consists of zero or more tags and zero or more strings.  It is perfectly valid for a fragment to consist of an empty string.

A single tag has three parts:  opening, value, and closing.  The opening part of the tag consists of the tag name and zero or more quoted attributes.  The value part of the tag can contain more tags, text, or nothing.  The closing part of the tag consists of just the name prefixed with a forward slash (/name).

We will not allow one tag to start inside another tag but end outside of it.  So this is not valid: <b><i>f</b>oo</i>.  Notice that the 'b' tag ends before the 'i' tag.

Here is an example of a valid HTML fragment.

<font color='blue'>M<b>y</b> <i>name</i></font> <b>i</b>s....

Here is the breakdown of the above fragment:

[Figure: tree output of the HTML]

The tree nodes that are in quotes are the actual values that get output in the text.  In this example the only tag that has an attribute is the Font tag, and you'll notice that the attribute is named "color" with the value of "blue".

Parsing the HTML fragment

To produce the formatted document, you start at the top of the tree and work your way down, following each node as you come to it and building up styles as you go.  When you reach some plain text, you package the current format with the text and store it for later use.  When you reach a tag node, you cache the current format, so that the next child node doesn't inherit formatting from the first child node.

So how does one actually go about doing this?  The fragment structure I have described above is really an XML fragment, so we can use .NET's XML parser to do the dirty work and we can just step through the resulting "tree".

There is one small problem that we need to fix before we can let .NET do the work for us.  A valid XML document has exactly one root tag, but our HTML fragment doesn't require everything to be contained in one tag.  The remedy is to place the fragment inside a tag, such as <html> and </html>.
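The wrapping step can be sketched roughly as follows. This is a minimal illustration, not the code from the download: the FragmentLoader class and its Load method are hypothetical names, and turning on PreserveWhitespace is my assumption so that a lone space between two tags is kept as a node (by default XmlDocument drops whitespace-only text when loading).

```csharp
using System;
using System.Xml;

// Hypothetical helper, not part of the article's source.
public static class FragmentLoader
{
    // Wraps the fragment in a single root element so that it becomes a
    // well-formed XML document, then loads it into a DOM tree.
    public static XmlDocument Load(string fragment)
    {
        XmlDocument doc = new XmlDocument();

        // Assumption: keep whitespace-only text between tags.
        // With the default (false) those nodes are not created.
        doc.PreserveWhitespace = true;

        // LoadXml throws XmlException if the wrapped text is not well formed.
        doc.LoadXml("<html>" + fragment + "</html>");
        return doc;
    }
}
```

Calling FragmentLoader.Load("<b>x</b>y") gives a document whose root element is named html and has two children: an element named b and a node named #text.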

Formatting structure

Our simple parser supports only a handful of formats: text color, bold, italic, underline, superscript, and subscript.  Adding more formatting is just a matter of looking for more tags and attributes.

Since formatting is built up as we go, I have created a class called TextStyle, which is responsible for keeping track of the current format.  It has two constructors: one that creates a default TextStyle with no formatting, and another that copies a TextStyle.
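To make the two constructors concrete, here is a rough sketch of what such a class could look like. It is an illustration under assumptions, not the class from the download: the TextFormat flags enum and the field names are hypothetical, chosen so that all six formats fit in one value.

```csharp
using System;
using System.Drawing;

// Hypothetical flags; the real class may store its formats differently.
[Flags]
public enum TextFormat
{
    None = 0,
    Bold = 1,
    Italic = 2,
    Underline = 4,
    Superscript = 8,
    Subscript = 16
}

public class TextStyle
{
    public string Text = "";
    public Color ForeColor = Color.Empty;   // Empty means "no color set"
    public TextFormat Format = TextFormat.None;

    // Default style: no text and no formatting.
    public TextStyle()
    {
    }

    // Copy constructor: a child node starts from its parent's format and
    // can change its own copy without touching the parent's.
    public TextStyle(TextStyle other)
    {
        Text = other.Text;
        ForeColor = other.ForeColor;
        Format = other.Format;
    }
}
```

The copy constructor is what keeps formatting from leaking between sibling nodes: each child modifies a copy, so the parent's style is unchanged when the next child is visited.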

Parsing the HTML (XML) document

The code uses recursion to parse each node as it comes up.  Plain text gets an XML node of its own, created by the framework and named #text.

Rather than copy and paste all of the code here, I will refer you to the source; I will describe the key parts of the code.

Determining font color

    XmlNode node; // The node we are working on

    switch(node.Name.ToLower())
    {
    case "font":
        if( node.Attributes != null )
        {
            foreach( XmlAttribute attribute 
                in node.Attributes)
            {
                switch(attribute.Name.ToLower())
                {
                    case "color":
                        // ParseColor is actually inline in the code,
                        // but I broke it out so it won't scroll the page
                        ParseColor(attribute, style);
                        break;
                }
            }
        }
        break;
    }

    private void ParseColor(XmlAttribute attribute, TextStyle style)
    {
        // An empty value has no first character to test; keep the current color
        if( attribute.Value.Length == 0 )
            return;

        if( attribute.Value[0] != '#' )
            style.ForeColor = Color.FromName(attribute.Value);
        else
        {
            try
            {
                // "#RRGGBB": two hexadecimal digits per channel
                int r, g, b;
                r = Int32.Parse(attribute.Value.Substring(1,2), 
                    System.Globalization.NumberStyles.HexNumber);
                g = Int32.Parse(attribute.Value.Substring(3,2), 
                    System.Globalization.NumberStyles.HexNumber);
                b = Int32.Parse(attribute.Value.Substring(5,2), 
                    System.Globalization.NumberStyles.HexNumber);
                style.ForeColor = Color.FromArgb(r, g, b);
            }
            catch
            {
                // Not a valid color value; keep the current color
            }
        }
    }

In ParseColor you'll notice that I'm not using the single-argument Convert.ToInt32() to turn the hexadecimal RGB values into Int32s.  That call expects a decimal string, so instead I specify the HexNumber NumberStyle for the Parse() call so that it converts the hexadecimal number rather than throwing an exception (Convert.ToInt32(value, 16) would also do the job).  I wrapped the code in a try/catch block just in case the color value isn't valid.

That's the only interesting bit of code I see in the parsing (it's also the only one that is more than 2 lines long).

As you are building up styles, eventually you will run across a node named "#text".  This node contains the text that is inside all of the tags at that point.  With that knowledge in hand, we can take the text it contains, together with the current formatting, and add it to the list of formatted text we have already created.
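The walk described above can be sketched as follows. This is a simplified illustration rather than the code from the download: StyledRun and FragmentWalker are hypothetical names, only the b, i and u tags are handled, and PreserveWhitespace is switched on (my assumption) so that a space between two tags produces a run of its own.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical stand-in for the article's TextStyle.
public class StyledRun
{
    public string Text = "";
    public bool Bold;
    public bool Italic;
    public bool Underline;
}

public static class FragmentWalker
{
    public static List<StyledRun> Parse(string fragment)
    {
        XmlDocument doc = new XmlDocument();
        doc.PreserveWhitespace = true;  // assumption: keep spaces between tags
        doc.LoadXml("<html>" + fragment + "</html>");

        List<StyledRun> runs = new List<StyledRun>();
        Walk(doc.DocumentElement, new StyledRun(), runs);
        return runs;
    }

    // Visits the children of 'node' in document order. 'current' is the
    // format inherited from the enclosing tags and is never modified here.
    private static void Walk(XmlNode node, StyledRun current, List<StyledRun> runs)
    {
        foreach (XmlNode child in node.ChildNodes)
        {
            switch (child.NodeType)
            {
                case XmlNodeType.Text:
                case XmlNodeType.Whitespace:
                case XmlNodeType.SignificantWhitespace:
                    // Package the text with a copy of the current format.
                    runs.Add(new StyledRun {
                        Text = child.Value ?? "",
                        Bold = current.Bold,
                        Italic = current.Italic,
                        Underline = current.Underline });
                    break;

                case XmlNodeType.Element:
                    // Copy the format so siblings are not affected.
                    StyledRun next = new StyledRun {
                        Bold = current.Bold,
                        Italic = current.Italic,
                        Underline = current.Underline };
                    switch (child.Name.ToLower())
                    {
                        case "b": next.Bold = true; break;
                        case "i": next.Italic = true; break;
                        case "u": next.Underline = true; break;
                    }
                    Walk(child, next, runs);
                    break;
            }
        }
    }
}
```

With this sketch, FragmentWalker.Parse("<b>first</b> <i>second</i>") yields three runs: a bold "first", an unformatted single space, and an italic "second".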

Displaying the formatted text

Displaying the text is relatively easy, assuming the enumerator for your collection returns items in the order they were added.  Using a foreach loop, you can iterate through the TextStyles and format each piece of text as needed.

Common problems

Unquoted attributes

Depending on who wrote the HTML, the attributes may or may not be quoted.  This is a problem because XML requires them to be quoted.  This results in having to pre-parse the HTML to ensure that all attributes are quoted.

It turns out that regular expressions work quite well for this job.  I have to use a two-step process to quote the attributes because I couldn't find a way to use the named groups in the replacement string for the selected attribute name/value pair.

The regular expression (RE) for finding the attribute name/value pairs follows:

\<(?<tagName>[a-zA-Z]*) (?<attributeName>[a-zA-Z]*)( )*=( )*(?<attributeValue>[#a-zA-Z0-9]*)?>

There is a bug in the RE above (and in the code that does the actual replacement): it only works on tags that have one attribute.  As soon as I figure out how to fix it, I will post the updated code and explain it here.  Since the parser only handles one tag that has attributes, this isn't likely to be a problem.  Because of the bug, I'm going to wait until I fix it before describing the second part of quoting unquoted attributes.
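One possible way around the single-attribute limitation is sketched below. It is not the article's fix: AttributeQuoter and both patterns are my own assumptions. The idea is to match each opening tag first, then quote every unquoted name=value pair inside it with a MatchEvaluator, skipping over values that are already quoted. (For simpler cases, .NET replacement strings can also refer to named groups with the ${name} syntax.)

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the article's source.
public static class AttributeQuoter
{
    // An opening tag: '<', a letter, then anything up to the next '>'.
    // Assumption: attribute values never contain '<' or '>'.
    private static readonly Regex OpeningTag = new Regex(@"<[a-zA-Z][^<>]*>");

    // Either an already-quoted value (left alone), or name = value
    // where the value has no quotes around it.
    private static readonly Regex Attribute = new Regex(
        @"(?<quoted>""[^""]*""|'[^']*')|(?<name>[a-zA-Z]+)\s*=\s*(?<value>[^\s""'<>]+)");

    public static string QuoteAttributes(string html)
    {
        // Only text inside opening tags is rewritten; plain text such as
        // "a=b" outside of a tag is never touched.
        return OpeningTag.Replace(html,
            tag => Attribute.Replace(tag.Value, QuoteOne));
    }

    private static string QuoteOne(Match m)
    {
        if (m.Groups["quoted"].Success)
            return m.Value;     // already quoted: keep as-is

        return m.Groups["name"].Value + "=\"" + m.Groups["value"].Value + "\"";
    }
}
```

For example, QuoteAttributes("<font color=blue size=3>x</font>") returns <font color="blue" size="3">x</font>. One known rough edge: in a self-closing tag written without a space, such as <br clear=all/>, the slash is treated as part of the value.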

Malformed XML

In the event that the HTML is malformed, the parser will throw an exception.  Fear not, though: I have supplied a method that removes all XML-like tags from a string passed to it.  It uses a very simple RE to find a tag, then replaces it with an empty string.  <.*?> is the RE for finding a tag.
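A minimal sketch of that fallback follows; TagStripper and its method names are hypothetical, not taken from the download. Note that it swaps in the pattern <[^>]*>, as suggested on the message board, instead of <.*?>: by default the dot does not match a newline, so <.*?> misses a tag that is split across two lines.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper, not part of the article's source.
public static class TagStripper
{
    // '<', then any characters except '>', then '>'.
    private static readonly Regex AnyTag = new Regex("<[^>]*>");

    // Removes everything that looks like a tag and keeps the text.
    public static string StripTags(string html)
    {
        return AnyTag.Replace(html, "");
    }

    // True when the fragment, wrapped in a root tag, loads as XML.
    public static bool IsWellFormed(string fragment)
    {
        try
        {
            new XmlDocument().LoadXml("<html>" + fragment + "</html>");
            return true;
        }
        catch (XmlException)
        {
            return false;   // malformed: the caller can fall back to StripTags
        }
    }
}
```

A caller might test IsWellFormed first and, when it returns false, display StripTags(fragment) as plain unformatted text.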

Conclusion

There are a few improvements I would like to make to the code.  Mike Dunn suggested fixing up the HTML so that it becomes valid XML before trying to parse it as an XML document.  It doesn't seem like it would be too difficult to do, but time doesn't allow me to implement it.  Maybe I'll get to it next week.

The demo program parses the HTML, then displays the resulting collection of TextStyles that it built up.  If there is an Exception while parsing, it will just strip the HTML from the text and stick the result into a default TextStyle.

More advanced HTML parser
Member 3769670
1:05 28 Mar '08  
Great work indeed!

If a more advanced HTML parser is needed I would recommend the Acrux one that also supports stuff such as XPath queries and more!

http://www.acruxsoftware.net/products.html
Incorrect RE for tag removal.
ToAoM
13:11 2 Apr '05  
Please note that your Regular expression for removing tags is incorrect.

<.*?> should be: <[^>]*> instead.

This has to do with the fact that a regular expression parser should return the longest, left-most value it can find and thus the original expression will under certain conditions delete more than just tags. The second expression does not suffer from this.
Space between tags
Martin Andert
1:37 12 Nov '03  
Nice work.

I have tested your app with the following string:

<b>first</b> <i>second</i>

This gives back:

Text: "first"
ForeColor: Color [Empty]
FontStyle: Bold

Text: "second"
ForeColor: Color [Empty]
FontStyle: Italic


Where is the space between the two TextStyles gone?

TIA, Martin
Nice work
Colin Davies
16:49 19 Sep '03  
Nice work and a good example. :)
Especially the malformed HTML :)

Regardz
Colin J Davies


Re: Nice work
James T. Johnson
17:19 19 Sep '03  
Colin Davies wrote:
Nice work and a good example.
Especially the malformed HTML


:D

James

"then when you go to bed...wait, you dont do that do you....ok....when you plug into the 'hive mind' to charge yourself, ill hack into your head"
Nnamdi Onyeyiri over MSN

The problem with HTML...
Greg Ennis
14:58 19 Sep '03  
is that virtually nobody writes HTML that an XML parser will accept.. and yet the browser renders it anyway. What I am looking for is a true HTML parser, something that the IE DOM would load but a standard XML parser would choke on...

Re: The problem with HTML...
James T. Johnson
17:16 19 Sep '03  
Greg Ennis wrote:
The problem with HTML is that virtually nobody writes HTML that an XML parser will accept

I have to agree, I was originally going to write some regular expressions to pull each tag out so I could process it. But the entire purpose of this code was to allow the screensaver that Christian Graus and I worked on to correctly show other CPian names. With that purpose in mind, Christian's suggestion to use the XML classes worked well.

With the RE support built into the .NET framework it should be possible to convert from my use of the XML parser to one using REs.

There are rumors that version 2.0 of the .NET Framework will include a managed WebBrowser object; unfortunately, that is at least a year away :(

James

vb.net version exist?
Anonymous
4:09 29 Jan '03  
Thanks.
Have you solved the bug?

Invalid tag nesting?
Kymermosst
12:38 10 Sep '02  
>> We will not allow one tag to start inside another tag, but end outside of it. So this is not valid: <b><i>f</b>oo</i>. Notice that the 'b' tag ends before the 'i' tag.

Is there any way to change this behavior? While I'm sure it would not come up much, a string like abcdefghijkl would not work properly without some slicing.
Re: Invalid tag nesting?
James T. Johnson
14:22 10 Sep '02  
Not without a lot of processing.

I cheat by letting the XML classes do the parsing for me (which is why tags have to be fully nested within each other). Then I take the resulting XML document and break it down into the TextStyles.

If anyone would like to add this feature I'll gladly add a link to that article or include the code in mine.

James
"And we are all men; apart from the females." - Colin Davies
fixed
Mark Mucha
15:53 2 Sep '02  
:D
The following code "fixes" the bug with the RE.
The problem was that your RE was only checking for tags with one attribute.  In reality, we need to pull all tags, then pull their respective attributes separately.

Hope this helps

Mark


         
private void button1_Click_1(object sender, System.EventArgs e)
{
    System.Net.WebClient client = new System.Net.WebClient();
    Byte[] pageData = client.DownloadData(txtURL.Text);
    string pageHtml = Encoding.ASCII.GetString(pageData);
    // find tags with attributes
    const string QT_REG = "\\<(?<tagName>[a-zA-Z]*) (?<attributes>[a-zA-Z0-9#= \"\']*)>";
    // match attributes that don't have quotes
    const string QUOTE_ATTR_RE = @"(?<attrname>[a-zA-Z]+)( )*=( )*(?<attrvalue>[a-zA-Z0-9#]+)";
    Regex quoteIt = new Regex(QT_REG, RegexOptions.Compiled | RegexOptions.Singleline | RegexOptions.IgnoreCase);
    Regex fixquotes = new Regex(QUOTE_ATTR_RE, RegexOptions.Compiled | RegexOptions.Singleline | RegexOptions.IgnoreCase);
    int startAt = 0;
    Match m = null;
    Match am = null;
    MatchCollection attmatch;
    string val = "";
    string newval = "";
    string oldattribs = "";
    // Match() never returns null; a failed search is reported through Success
    while((m = quoteIt.Match(pageHtml, startAt)).Success)
    {
        string attrs = m.Groups["attributes"].Value;

        txtHTML.Text += "name=" + m.Groups["tagName"].Value + "\r\n";
        txtHTML.Text += "\tattrs=>" + attrs + "\r\n";
        oldattribs = attrs;
        attmatch = fixquotes.Matches(attrs);
        // quote each unquoted attribute found in this tag
        for(int i = 0; i < attmatch.Count; ++i)
        {
            am = attmatch[i];
            val = am.Value;
            newval = fixquotes.Replace(val, am.Groups["attrname"].Value + "=\"" + am.Groups["attrvalue"].Value + "\"");
            if(val != "")
                attrs = attrs.Replace(val, newval);
        }

        if(oldattribs != "")
            pageHtml = pageHtml.Replace(oldattribs, attrs);
        if(m.Index <= 0)
            MessageBox.Show("index < 0" + m.Index);
        // continue searching after the tag we just handled
        startAt = m.Index + m.Length + 1;
        if(startAt >= pageHtml.Length)
            break;
    }
    pageHtml = pageHtml.Replace("\n", "\r\n");
    txtHTML.Text = txtHTML.Text + pageHtml;
}

So typically Hasakian!
Nish - Native CPian
1:14 18 May '02  
Is that a new word?

"Hasakian" => means "nice and simple"

;)

Nish


Regards,
Nish
Native CPian.
Born and brought up on CP.
With the CP blood in him.

Re: So typically Hasakian!
James T. Johnson
3:13 21 May '02  
Nish - Native CPian wrote:
Is that a new word?

I believe so.

Nish - Native CPian wrote:
means "nice and simple"

Good, I met my goal :)

James

Simplicity Rules!
Re: So typically Hasakian!
Nish - Native CPian
4:37 21 May '02  
James T. Johnson wrote:
Good, I met my goal

Yes, you did :)

Nish




Last Updated 17 May 2002 | Copyright © CodeProject, 1999-2010