This article presents a simple HTML parser library that I developed for an automated application. The parser mainly detects tag syntax and can collect a tag pair as a group. I tried to use a parser generator such as ANTLR, but I was in a hurry and had no time to study its syntax, so I ended up writing the parser myself.
The parser is intended to be used with HTML content retrieved by the .NET WebResponse class, so I have also developed a tool named NativeWebSurf that downloads HTML content through WebResponse and uses my parser to parse it into an HTML structure.
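As a rough sketch of the download step, the snippet below fetches a page with WebRequest/WebResponse and returns the body as a string. The HtmlDownloader class, its method names, and the fixed UTF-8 decoding are assumptions made for illustration; this is not the NativeWebSurf code.

```csharp
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper, not part of NativeWebSurf or RZLib.
public static class HtmlDownloader
{
    // Reads a whole stream as text using the given encoding.
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        using (StreamReader reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    // Downloads the resource at the URL and decodes it as UTF-8.
    // Assumption: a real tool would honor the charset declared by the response.
    public static string Download(string url)
    {
        WebRequest request = WebRequest.Create(url);
        using (WebResponse response = request.GetResponse())
        using (Stream body = response.GetResponseStream())
        {
            return ReadAll(body, Encoding.UTF8);
        }
    }
}
```

The returned string could then be passed to the parser's constructor.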
Using the Code
The library and the tool are written for .NET 3.5 with Visual Studio 2008. There are now two archive files: NativeWebSurf and NativeWebSurf_1.0.1. The former does not use LINQ or extension methods heavily, while the latter applies LINQ more. (I finally found it to be a rather convenient language feature.) Anyone who would like to convert the code to .NET 2.0 should start from the former archive.
The NativeWebSurf solution contains three projects:
NativeWebSurf: The main application, which uses the parser from RZLib.
RZLib: A class library that contains the parser source code. This project uses the C5 Generic Collection Library, which is not included; please download it from the link provided.
RZLib.UnitTest: A unit test module for RZLib. It uses NUnit 2.4.3.
The parser class is RZ.Web.HtmlParser. To create a parser object, pass the HTML text to its constructor.
HtmlParser parser = new HtmlParser(
"<html><body>any HTML/text here...</body></html>");
HtmlParser.CurrentContent represents the content object that the parser has just read. Content objects are instances of the content classes, whose names start with HtmlContent. The content class hierarchy is as follows:
Bold class name is an
HtmlContentText keeps all text that is not considered tag content.
The tag content classes (HtmlContentCloseTag and HtmlContentCompleteTag among them) keep the information of open tags, close tags, and complete tags (e.g. <br />).
When the parser has just been created, its CurrentContent is null. Calling any of the parser's public methods changes it to a valid object.
FetchNextContent() moves CurrentContent to the next content.
MoveToHeadTag() moves CurrentContent to the next open/complete tag.
MoveToTag() moves CurrentContent to the open/complete tag with the specified name (passed as its parameter) or matching the specified predicate.
Lambda expressions in C# 3 make the predicate more compact than an anonymous method, so MoveToTag() can now be used like this:
parser.MoveToTag( tag => tag.TagName == "meta" && tag.Attributes["name"] == "Rating" );
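To make the comparison concrete, here is a minimal sketch of the same predicate written both as a C# 2 anonymous method and as a C# 3 lambda. SampleTag is a hypothetical stand-in for the library's tag type, and its Attributes member is assumed to be a plain Dictionary; because a Dictionary indexer throws KeyNotFoundException for a missing key, the sketch guards the lookup. Whether the library's own Attributes indexer needs such a guard is not stated in this article.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the library's tag type; only the two members
// used in the example above are modeled.
public class SampleTag
{
    public string TagName;
    public Dictionary<string, string> Attributes = new Dictionary<string, string>();
}

public static class PredicateDemo
{
    // C# 2 style: anonymous method.
    public static readonly Predicate<SampleTag> IsRatingMetaOld =
        delegate(SampleTag tag)
        {
            string name;
            return tag.TagName == "meta"
                && tag.Attributes.TryGetValue("name", out name)
                && name == "Rating";
        };

    // C# 3 style: lambda expression with the same logic.
    public static readonly Predicate<SampleTag> IsRatingMeta =
        tag => tag.TagName == "meta"
            && tag.Attributes.ContainsKey("name")
            && tag.Attributes["name"] == "Rating";
}
```

Both delegates have the same type, so either could be handed to any method that accepts a Predicate of that tag type.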
GrabCurrentTag() can be used only when CurrentContent is at an open/complete tag. It matches the end tag and puts the whole content into an HtmlContentBlock, which has a tree structure.
HtmlParser parser = new HtmlParser(
<body>any text here...
Debug.Assert( parser.MoveToTag("head") );
Boolean hasCloseTag;
HtmlContentBlock headBlock = parser.GrabCurrentTag(out hasCloseTag);
Debug.Assert( hasCloseTag );
Debug.Assert( headBlock.TagName == "head" );
Debug.Assert( headBlock.Attributes.Count == 0 );
Debug.Assert( headBlock.Count == 3 );
Debug.Assert( headBlock[0] is HtmlContentText );
Debug.Assert( headBlock[1] is HtmlContentBlock );
Debug.Assert( headBlock[2] is HtmlContentText );
HtmlContentBlock titleBlock = (HtmlContentBlock) headBlock[1];
Debug.Assert( titleBlock.Count == 1 );
Debug.Assert( titleBlock.ToString() == "abc" );
Debug.Assert( parser.CurrentContent is HtmlContentCloseTag );
The current parser reads forward only. To read backward, first parse the HTML into a block.
HtmlContentBlock is an object that collects only two types of content: text (HtmlContentText) and nested blocks (HtmlContentBlock).
Its Count property gives the number of child items (text or block) in it, and an item can be retrieved through the indexer.
However, using its iterator with LINQ (or foreach) is probably easier.
It provides two iterators: the first one, which is the default, enumerates HtmlContentBlock items; the other enumerates HtmlContent items. (Note that the default iterator has been changed to IEnumerable<HtmlContentBlock>, and the IterateChild() method was added for HtmlContent enumeration instead, because I believe people scan for blocks more often than for text.)
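The two-iterator design can be sketched roughly as follows. Only the name IterateChild() comes from the library; Node, TextNode, and BlockNode are hypothetical stand-ins, and the real classes may be implemented quite differently.

```csharp
using System.Collections;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the content classes.
public abstract class Node { }

public class TextNode : Node
{
    public string Text;
    public TextNode(string text) { Text = text; }
}

public class BlockNode : Node, IEnumerable<BlockNode>
{
    public string TagName;
    private readonly List<Node> children = new List<Node>();

    public BlockNode(string tagName, params Node[] items)
    {
        TagName = tagName;
        children.AddRange(items);
    }

    // Enumerates every child, text and blocks alike.
    public IEnumerable<Node> IterateChild()
    {
        return children;
    }

    // Default iterator: child blocks only, text children are skipped.
    public IEnumerator<BlockNode> GetEnumerator()
    {
        return children.OfType<BlockNode>().GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```

With this shape, a query such as head.Where(b => b.TagName == "title") only ever sees child blocks, while head.IterateChild() still exposes the text between them.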
Another way to search for a tag is the FindTag() method, which can also find by either name or predicate. It returns an array of Int32 as an index; the index can be used to get the content through the indexer, and it can also serve as the start position for the next search.
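One way to picture an Int32 array used as an index is as a path of child positions through the tree. The sketch below is written under that assumption, which the article does not confirm; TreeNode, FindPath(), and Resolve() are hypothetical names, and resuming a search from a previous index is left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical tree type used only to illustrate an index path.
public class TreeNode
{
    public string Name;
    public List<TreeNode> Children = new List<TreeNode>();

    public TreeNode(string name, params TreeNode[] children)
    {
        Name = name;
        Children.AddRange(children);
    }

    public TreeNode this[int index]
    {
        get { return Children[index]; }
    }

    // Depth-first search; returns the child positions leading from this
    // node to the first matching descendant, or null when nothing matches.
    public int[] FindPath(Predicate<TreeNode> match)
    {
        List<int> path = new List<int>();
        return Search(this, match, path) ? path.ToArray() : null;
    }

    private static bool Search(TreeNode node, Predicate<TreeNode> match, List<int> path)
    {
        for (int i = 0; i < node.Children.Count; i++)
        {
            TreeNode child = node.Children[i];
            path.Add(i);
            if (match(child) || Search(child, match, path)) return true;
            path.RemoveAt(path.Count - 1); // backtrack
        }
        return false;
    }

    // Follows an index path using the indexer at each level.
    public TreeNode Resolve(int[] path)
    {
        TreeNode current = this;
        foreach (int index in path) current = current[index];
        return current;
    }
}
```

Returning positions rather than the node itself keeps the result usable as a bookmark into the tree.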
There is at least one case that causes the parser to fail; when it fails, it throws HtmlParserException. The case I know of is an invalid tag form such as:
This is <b>HTML <i>text</i</b>.
The parser expects a well-formed close tag, but </i is missing its closing >. I have marked the point in HtmlLegacyParser.cs where you can handle this case with a TODO: comment.
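The kind of check that rejects such input can be sketched as below. This is not the code from HtmlLegacyParser.cs: CloseTagReader and ReadCloseTag() are hypothetical names, and FormatException stands in for the library's HtmlParserException.

```csharp
using System;

// Hypothetical helper illustrating a well-formedness check for close tags.
public static class CloseTagReader
{
    // Reads a close tag such as "</i>" that starts at 'start' and returns
    // its name. Throws FormatException when the tag is not terminated by
    // '>' before the next '<' (or before the end of the text).
    public static string ReadCloseTag(string html, int start)
    {
        if (string.CompareOrdinal(html, start, "</", 0, 2) != 0)
            throw new ArgumentException("No close tag at the given position.");

        int end = html.IndexOf('>', start + 2);
        int nextOpen = html.IndexOf('<', start + 2);
        if (end < 0 || (nextOpen >= 0 && nextOpen < end))
            throw new FormatException("Malformed close tag at position " + start + ".");

        return html.Substring(start + 2, end - start - 2).Trim();
    }
}
```

A more forgiving parser could treat the next < as an implicit end of the tag at the point where this sketch throws.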
Some Unrelated Issue...
While developing NativeWebSurf, I noticed that the Cookie object returned from WebResponse always contains a path, even when the Set-Cookie response header does not specify one. I don't know whether this is a framework bug, but it could cause trouble because we can never tell whether the path really came from the Web server. This may not be the right place to ask, but if anyone knows something, I would be grateful if you could share it.
I hope this library is useful for others too. If you fix a bug or add a feature, please share it with me.
- 2008-02-27: Updated article
Added a few features and refactored the lib.
+ Supports Cookie cache (with all paths in same host)
+ Supports HTML content charset
* Fixed code to support new
* Changed default iterator to return HtmlContentBlock instead of HtmlContent
+ Indexer with
FindTag() with predicate
MoveToTag() with predicate
- 2008-02-24: Posted initial article