Another C# Legacy HTML parser using Tag processing.

By Ruxo Zheng

A class library that parses HTML by processing its tags.
C# (C# 3.0, C#), HTML, .NET (.NET, .NET 3.5), Dev

Posted: 23 Feb 2008
Updated: 26 Feb 2008

Introduction

This article presents a simple HTML parser library that I developed for an automation application. The parser mainly detects tag syntax, and it can collect a matched tag pair together with its contents as a group. I considered using a parser generator such as ANTLR, but I was in a hurry and did not have time to learn its syntax, so I ended up writing the parser myself.

The parser is intended to be used with HTML content retrieved by the .NET WebResponse class. I have therefore also developed a tool, named NativeWebSurf, that downloads HTML content through WebResponse and uses my parser to parse it into an HTML structure.

Using the code

The library and the tool are written for .NET 3.5 with Visual Studio 2008. There are two archive files: NativeWebSurf and NativeWebSurf_1.0.1. The former makes little use of LINQ or extension methods, while the latter uses LINQ more heavily. (I eventually found it to be quite a convenient language feature :). If you would like to convert the code to .NET 2.0, start from the former archive.

The NativeWebSurf solution contains three projects.

  • NativeWebSurf: The main application, which uses the parser from RZLib.
  • RZLib: The class library that contains the parser source code. This project uses the C5 Generic Collection Library, which is not included. Please download it from the link provided.
  • RZLib.UnitTest: A unit test module for RZLib. It uses NUnit 2.4.3.

The parser class is RZ.Web.HtmlParser. To create a parser object, pass the HTML text to its constructor.

using RZ.Web;

namespace TestLib
{
    class Program
    {
        static void Main()
        {
            // The constructor takes the HTML text to parse.
            HtmlParser parser = new HtmlParser(
                "<html><body>any HTML/text here...</body></html>");
        }
    }
}

HtmlParser.CurrentContent holds the content object that the parser has just read. Content objects are instances of the content classes, whose names all start with HtmlContent. The content class hierarchy is as follows:

  • HtmlContent
    • HtmlContentText
    • HtmlContentTag
      • HtmlContentHeadTag
        • HtmlContentOpenTag
        • HtmlContentCompleteTag
        • HtmlContentBlock
      • HtmlContentCloseTag

Class names in italics are abstract classes.

HtmlContentText holds all text that is not part of a tag.

HtmlContentOpenTag, HtmlContentCloseTag, and HtmlContentCompleteTag hold the information for an open tag, a close tag, and a complete tag (e.g. <br />), respectively.
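
To make the hierarchy concrete, here is a minimal self-contained sketch of how calling code might dispatch on the kind of content it receives. The classes are simplified stand-ins that only reuse the names listed above; they are not RZLib's actual definitions. The Text member, the choice of which classes are abstract, and the ContentDescriber helper are assumptions made for illustration, and HtmlContentBlock is left out for brevity.

```csharp
// Simplified stand-ins that reuse the class names from the hierarchy.
// Their members, and which classes are abstract, are assumptions.
abstract class HtmlContent { }

class HtmlContentText : HtmlContent
{
    public readonly string Text;
    public HtmlContentText(string text) { Text = text; }
}

abstract class HtmlContentTag : HtmlContent
{
    public readonly string TagName;
    protected HtmlContentTag(string tagName) { TagName = tagName; }
}

abstract class HtmlContentHeadTag : HtmlContentTag
{
    protected HtmlContentHeadTag(string tagName) : base(tagName) { }
}

class HtmlContentOpenTag : HtmlContentHeadTag
{
    public HtmlContentOpenTag(string tagName) : base(tagName) { }
}

class HtmlContentCompleteTag : HtmlContentHeadTag
{
    public HtmlContentCompleteTag(string tagName) : base(tagName) { }
}

class HtmlContentCloseTag : HtmlContentTag
{
    public HtmlContentCloseTag(string tagName) : base(tagName) { }
}

// Hypothetical helper: describes a content object by its runtime type.
static class ContentDescriber
{
    public static string Describe(HtmlContent content)
    {
        HtmlContentText text = content as HtmlContentText;
        if (text != null) return "text: " + text.Text;

        HtmlContentTag tag = content as HtmlContentTag;
        if (tag == null) return "unknown content";

        // Test the concrete tag types; TagName comes from the shared base class.
        if (tag is HtmlContentOpenTag) return "open tag <" + tag.TagName + ">";
        if (tag is HtmlContentCompleteTag) return "complete tag <" + tag.TagName + " />";
        if (tag is HtmlContentCloseTag) return "close tag </" + tag.TagName + ">";
        return "other tag <" + tag.TagName + ">";
    }
}
```

With the real library, the same kind of type test would presumably be applied to HtmlParser.CurrentContent after each move.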

Immediately after the parser is created, its CurrentContent is null. Calling any of the parser's public methods moves it to a valid content object.

FetchNextContent() moves CurrentContent to the next content.

MoveToHeadTag() moves CurrentContent to the next open or complete tag.

MoveToTag() moves CurrentContent to the next open or complete tag that has the specified name (passed as its parameter) or that satisfies the specified predicate.

With lambda expressions in C# 3, the predicate becomes more compact than an anonymous method, so MoveToTag() can be used like this:

parser.MoveToTag( tag => tag.TagName == "meta" && tag.Attributes["name"] == "Rating" );
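
For comparison, the following self-contained sketch writes the same kind of predicate twice, once as a C# 2.0 anonymous method and once as a C# 3 lambda. SketchTag, PredicateDemo, and HasAttribute are hypothetical stand-ins, not RZLib types, and FirstMatch is only an assumed illustration of how a predicate-based search such as MoveToTag() could work internally.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a parsed tag; not RZLib's real class.
class SketchTag
{
    public string TagName;
    public Dictionary<string, string> Attributes = new Dictionary<string, string>();
}

static class PredicateDemo
{
    // Returns the first tag accepted by the predicate, or null if none matches.
    public static SketchTag FirstMatch(IEnumerable<SketchTag> tags, Predicate<SketchTag> match)
    {
        foreach (SketchTag tag in tags)
        {
            if (match(tag)) return tag;
        }
        return null;
    }

    // Safe attribute test: a missing attribute simply does not match.
    static bool HasAttribute(SketchTag tag, string name, string value)
    {
        string actual;
        return tag.Attributes.TryGetValue(name, out actual) && actual == value;
    }

    // C# 2.0 style: anonymous method.
    public static readonly Predicate<SketchTag> IsRatingMetaOld =
        delegate(SketchTag tag)
        {
            return tag.TagName == "meta" && HasAttribute(tag, "name", "Rating");
        };

    // C# 3 style: lambda expression with the same behavior.
    public static readonly Predicate<SketchTag> IsRatingMeta =
        tag => tag.TagName == "meta" && HasAttribute(tag, "name", "Rating");
}
```

Both delegates can be passed to FirstMatch interchangeably; the lambda only removes the delegate keyword, the parameter type, and the return statement.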

GrabCurrentTag() can be used only when CurrentContent is an open or complete tag. It finds the matching close tag and puts the whole content into an HtmlContentBlock, which has a tree structure.

Example:

HtmlParser parser = new HtmlParser(

@"<html>
    <head>
        <title>abc</title>
    </head>
    <body>any text here...
    </body>
</html>"
           ); 

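// Call MoveToTag() outside Debug.Assert so that it still runs in release builds.
Boolean found = parser.MoveToTag("head");  // move to the <head> tag.
Debug.Assert( found );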

Boolean hasCloseTag;  // if it is false, it means parser cannot find its close tag.
HtmlContentBlock headBlock = parser.GrabCurrentTag(out hasCloseTag);

Debug.Assert( hasCloseTag );   // since we have </head>

Debug.Assert( headBlock.TagName == "head" );
Debug.Assert( headBlock.Attributes.Count == 0 );  // no attributes in <head>
Debug.Assert( headBlock.Count == 3 );  // there are 3 contents inside headBlock
Debug.Assert( headBlock[0] is HtmlContentText );  // \r\n between <head> and <title>
Debug.Assert( headBlock[1] is HtmlContentBlock ); // block of <title>
Debug.Assert( headBlock[2] is HtmlContentText );  // \r\n between </title> and </head>

HtmlContentBlock titleBlock = (HtmlContentBlock) headBlock[1];

Debug.Assert( titleBlock.Count == 1 );
Debug.Assert( titleBlock[0].ToString() == "abc" );

Debug.Assert( parser.CurrentContent is HtmlContentCloseTag );  // it is at </head>

The parser currently reads forward only. To move backward through the content, first grab it into a block.

HtmlContentBlock

An HtmlContentBlock is an object that collects only two types of content: HtmlContentText and HtmlContentBlock.

The Count property gives the number of child items (text or block) in the block, and each item can be retrieved through the indexer.

However, using its iterator with LINQ (or foreach) is probably easier.

It provides two iterators. The default one enumerates only the child HtmlContentBlock items; the other, IterateChild(), enumerates every child HtmlContent. (Note that the default iterator was changed from IEnumerable<HtmlContent> to IEnumerable<HtmlContentBlock>, and IterateChild() was added for HtmlContent enumeration, because I believe people scan for blocks more often than for text.)
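
The two-iterator design can be sketched with stand-in types as follows. DemoContent, DemoText, and DemoBlock are hypothetical classes written for this illustration; only the names Count and IterateChild() and the idea of a block-only default iterator come from the article, and the real RZLib implementation may differ.

```csharp
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for HtmlContent, HtmlContentText and HtmlContentBlock.
abstract class DemoContent { }

class DemoText : DemoContent
{
    public readonly string Text;
    public DemoText(string text) { Text = text; }
}

class DemoBlock : DemoContent, IEnumerable<DemoBlock>
{
    public readonly string TagName;
    private readonly List<DemoContent> children = new List<DemoContent>();

    public DemoBlock(string tagName, params DemoContent[] content)
    {
        TagName = tagName;
        children.AddRange(content);
    }

    // Number of direct children, counting both text and blocks.
    public int Count { get { return children.Count; } }

    public DemoContent this[int index] { get { return children[index]; } }

    // Enumerates every direct child, text and blocks alike.
    public IEnumerable<DemoContent> IterateChild() { return children; }

    // Default iterator: only the direct children that are blocks.
    public IEnumerator<DemoBlock> GetEnumerator()
    {
        return children.OfType<DemoBlock>().GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }
}
```

Because the default iterator yields blocks only, a query such as head.Where(b => b.TagName == "title") skips the whitespace text nodes without any type checks, while head.IterateChild() still exposes them when needed.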

Another way to search for a tag is the FindTag() method, which can likewise search by either name or predicate. It returns an array of Int32 as an index, which can be passed to the indexer to get the content and can also be used as the start position for the next search.
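
One way to picture an Int32 array used as an index into a tree is as a path holding one child position per nesting level. The sketch below assumes that interpretation; PathNode and FindPath are hypothetical names, the tree holds blocks only, and resuming a search from a previous index is not modeled.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical block-only tree node used to illustrate path-style indexes.
class PathNode
{
    public readonly string TagName;
    public readonly List<PathNode> Children = new List<PathNode>();

    public PathNode(string tagName, params PathNode[] children)
    {
        TagName = tagName;
        Children.AddRange(children);
    }

    // Indexer taking a path: follows one child index per nesting level.
    public PathNode this[int[] path]
    {
        get
        {
            PathNode node = this;
            foreach (int i in path) node = node.Children[i];
            return node;
        }
    }

    // Depth-first search below this node.
    // Returns the path to the first match, or null if nothing matches.
    public int[] FindPath(Predicate<PathNode> match)
    {
        List<int> path = new List<int>();
        return Search(this, match, path) ? path.ToArray() : null;
    }

    private static bool Search(PathNode node, Predicate<PathNode> match, List<int> path)
    {
        for (int i = 0; i < node.Children.Count; i++)
        {
            path.Add(i);
            PathNode child = node.Children[i];
            if (match(child) || Search(child, match, path)) return true;
            path.RemoveAt(path.Count - 1);  // backtrack
        }
        return false;
    }
}
```

For a tree built as html(head(title), body(p, p)), searching for "title" yields the path {0, 0}, and passing that path back to the indexer returns the title node.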

Issues

Script Support

The parser handles JavaScript inside script tags in a simple way, but it does not handle other script languages. It does not understand full JavaScript syntax, but it recognizes JavaScript strings and comments, and it treats all script code as plain text.
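
One plausible use of recognizing strings and comments is to avoid mistaking a </script> that appears inside a string literal or a comment for the real close tag. The following sketch shows that idea only; ScriptScanner and FindScriptEnd are hypothetical names, this is not the code in RZLib, and regular expression literals and template strings are deliberately not handled.

```csharp
using System;

static class ScriptScanner
{
    // Returns the index of the first "</script" that lies outside JavaScript
    // strings and comments, searching from 'start', or -1 if there is none.
    public static int FindScriptEnd(string text, int start)
    {
        int i = start;
        while (i < text.Length)
        {
            char c = text[i];
            bool hasNext = i + 1 < text.Length;

            if (c == '"' || c == '\'')
            {
                i = SkipString(text, i);
            }
            else if (c == '/' && hasNext && text[i + 1] == '/')
            {
                // Line comment: skip to the character after the line break.
                int eol = text.IndexOf('\n', i);
                i = eol < 0 ? text.Length : eol + 1;
            }
            else if (c == '/' && hasNext && text[i + 1] == '*')
            {
                // Block comment: skip past the closing "*/".
                int end = text.IndexOf("*/", i + 2, StringComparison.Ordinal);
                i = end < 0 ? text.Length : end + 2;
            }
            else if (c == '<' && string.Compare(text, i, "</script", 0, 8,
                                                StringComparison.OrdinalIgnoreCase) == 0)
            {
                return i;
            }
            else
            {
                i++;
            }
        }
        return -1;
    }

    // Skips a quoted string that opens at 'open'; honors backslash escapes.
    // Returns the index just after the closing quote, or text.Length if unterminated.
    private static int SkipString(string text, int open)
    {
        char quote = text[open];
        int i = open + 1;
        while (i < text.Length)
        {
            if (text[i] == '\\') i += 2;
            else if (text[i] == quote) return i + 1;
            else i++;
        }
        return text.Length;
    }
}
```

Everything between the open script tag and the returned index can then be kept as a single text content, which matches the article's description of treating script code as plain text.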

Bugs

There is at least one case that causes the parser to fail, and when it fails it throws HtmlParserException. The case I know of is a malformed tag such as:

This is <b>HTML <i>text</i</b>.

The parser expects a well-formed close tag, but here </i is missing its closing >. I have marked the point in HtmlLegacyParser.cs where you can handle this case with TODO:.
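
As one possible recovery strategy (an assumption for illustration, not what the library does at its TODO: marker), a close tag could be treated as implicitly ended when another < appears before its >. CloseTagReader and ReadCloseTag below are hypothetical names for this self-contained sketch.

```csharp
using System;

static class CloseTagReader
{
    // Reads a close tag that begins at text[start] with "</".
    // Returns the tag name; 'next' receives the index where parsing should resume.
    public static string ReadCloseTag(string text, int start, out int next)
    {
        if (start + 1 >= text.Length || text[start] != '<' || text[start + 1] != '/')
            throw new ArgumentException("No close tag at the given position.", "start");

        // Scan the name until the tag ends normally ('>') or another tag begins ('<').
        int nameStart = start + 2;
        int i = nameStart;
        while (i < text.Length && text[i] != '>' && text[i] != '<')
            i++;

        string name = text.Substring(nameStart, i - nameStart).Trim();

        // A '>' is consumed. A '<' or the end of input means the '>' is missing,
        // so resume at that position and let the next tag be parsed normally.
        next = (i < text.Length && text[i] == '>') ? i + 1 : i;
        return name;
    }
}
```

Applied to the example above, reading at </i returns "i" and resumes at </b>, which is then read as a normal close tag instead of raising an exception.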

Some unrelated issue...

While developing NativeWebSurf, I noticed that the Cookie object returned from WebResponse always contains a path, even when the Set-Cookie response header does not specify one. I don't know whether this is a framework bug, but it could cause trouble because we can never tell whether the path really came from the web server. This may not be the right place to ask, but if anyone knows something, I'd be grateful if you shared it :))

End

Finally, I hope this library will be useful to others too. If you fix a bug or add a feature, please share it with me too :)

History

2008-02-27 Add a few features and refactor the lib.

  • NativeWebSurf
    + Support Cookie cache (with all paths in same host).
    + Support HTML content charset.
    * Fix code to support new HtmlContentBlock iterator.
  • RZLib
    • HtmlContentBlock
      * Change default iterator to return HtmlContentBlock instead of HtmlContent.
      + Indexer with Int32[] index.
      + IterateChild() for HtmlContent enumeration.
      + FindTag() with predicate.
    • HtmlParser
      + MoveToTag() with predicate.
Update article.

2008-02-24 Start Article.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Ruxo Zheng


C/C++ and C# programmer.
Occupation: Software Developer (Senior)
Location: Thailand

Copyright 2008 by Ruxo Zheng