Introduction

This article presents a simple HTML parser library that I developed for an automated application. The parser mainly detects tag syntax, and it can collect a tag pair as a group. I tried to use a parser generator such as ANTLR, but I was in a hurry and had no time to study its syntax, so I ended up writing the parser myself. The parser was intended to be used with HTML content retrieved by the .NET

Using the Code

The library and the tool are written in .NET 3.5 with Visual Studio 2008. There are two archive files: NativeWebSurf and NativeWebSurf_1.0.1. The former does not use LINQ or extension methods intensively; in the latter I applied LINQ more, as I finally found it to be a somewhat convenient language feature. If you would like to convert the code to .NET 2.0, start from the former archive. The
The parser class is

    using RZ.Web;

    namespace TestLib
    {
        class Program
        {
            static void Main()
            {
                // Construct the parser directly from the HTML text.
                HtmlParser parser = new HtmlParser(
                    "<html><body>any HTML/text here...</body></html>");
            }
        }
    }
Bold class name is an
When the parser is just created, its
With lambda expressions in C# 3, the predicate is more compact than the equivalent anonymous method; now

    parser.MoveToTag(tag => tag.TagName == "meta" && tag.Attributes["name"] == "Rating");
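To make the comparison with anonymous methods concrete, here is a minimal, self-contained sketch that writes the same predicate in both styles. It does not use the library: `DemoTag`, `PredicateDemo`, and `IndexOfFirst` are hypothetical stand-ins invented for illustration, and the real tag type in RZ.Web may look different.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a parsed tag; NOT a type from the RZ.Web library.
public class DemoTag
{
    public string TagName;
    public Dictionary<string, string> Attributes = new Dictionary<string, string>();

    public DemoTag(string tagName, string attrName, string attrValue)
    {
        TagName = tagName;
        Attributes[attrName] = attrValue;
    }
}

public static class PredicateDemo
{
    // Hypothetical helper: index of the first tag the predicate accepts, or -1.
    public static int IndexOfFirst(IList<DemoTag> tags, Predicate<DemoTag> match)
    {
        for (int i = 0; i < tags.Count; i++)
        {
            if (match(tags[i])) return i;
        }
        return -1;
    }

    // Two sample <meta> tags; every tag here has a "name" attribute,
    // so the dictionary lookups in the predicates below cannot throw.
    public static IList<DemoTag> SampleTags()
    {
        return new List<DemoTag>
        {
            new DemoTag("meta", "name", "Author"),
            new DemoTag("meta", "name", "Rating"),
        };
    }

    // C# 2 style: the predicate as an anonymous method.
    public static int FindWithAnonymousMethod(IList<DemoTag> tags)
    {
        return IndexOfFirst(tags, delegate(DemoTag tag)
        {
            return tag.TagName == "meta" && tag.Attributes["name"] == "Rating";
        });
    }

    // C# 3 style: the same predicate as a lambda expression.
    public static int FindWithLambda(IList<DemoTag> tags)
    {
        return IndexOfFirst(tags,
            tag => tag.TagName == "meta" && tag.Attributes["name"] == "Rating");
    }
}
```

Both methods select the same tag; the lambda simply drops the `delegate` keyword, the parameter type, and the `return` statement.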
Example:

    // Needs: using System.Diagnostics; using RZ.Web;
    HtmlParser parser = new HtmlParser(
    @"<html>
    <head>
    <title>abc</title>
    </head>
    <body>any text here...
    </body>
    </html>"
    );

    Debug.Assert( parser.MoveToTag("head") );   // move to the <head> tag

    Boolean hasCloseTag;    // false means the parser cannot find the close tag
    HtmlContentBlock headBlock = parser.GrabCurrentTag(out hasCloseTag);

    Debug.Assert( hasCloseTag );                        // since we have </head>
    Debug.Assert( headBlock.TagName == "head" );
    Debug.Assert( headBlock.Attributes.Count == 0 );    // no attributes in <head>
    Debug.Assert( headBlock.Count == 3 );               // there are 3 contents inside headBlock
    Debug.Assert( headBlock[0] is HtmlContentText );    // \r\n between <head> and <title>
    Debug.Assert( headBlock[1] is HtmlContentBlock );   // block of <title>
    Debug.Assert( headBlock[2] is HtmlContentText );    // \r\n between </title> and </head>

    HtmlContentBlock titleBlock = (HtmlContentBlock) headBlock[1];
    Debug.Assert( titleBlock.Count == 1 );
    Debug.Assert( titleBlock[0].ToString() == "abc" );

    Debug.Assert( parser.CurrentContent is HtmlContentCloseTag );   // it is at </head>
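The grouping behaviour that the example above exercises can be sketched independently of the library. What follows is only an illustration of the idea of collecting a tag pair as a group while reading forward; it is not the library's implementation. `Token`, `TokenKind`, `Block`, and `BlockGrouper` are hypothetical names, and the sketch assumes the HTML has already been split into tokens.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical token model; these are NOT the library's classes.
public enum TokenKind { OpenTag, CloseTag, Text }

public class Token
{
    public TokenKind Kind;
    public string Value;    // tag name for tags, raw text for Text
    public Token(TokenKind kind, string value) { Kind = kind; Value = value; }
}

// A block groups an open tag with everything up to its matching close tag.
public class Block
{
    public string TagName;
    public List<object> Children = new List<object>();  // string (text) or Block
    public Block(string tagName) { TagName = tagName; }
}

public static class BlockGrouper
{
    // Reads forward from tokens[pos], which must be an open tag.
    // On return, pos is at the matching close tag when hasCloseTag is true;
    // otherwise it is at the foreign close tag that ended the block, or at
    // tokens.Count. Simplification: void elements such as <br> are treated
    // like any other unclosed tag.
    public static Block Grab(IList<Token> tokens, ref int pos, out bool hasCloseTag)
    {
        Block block = new Block(tokens[pos].Value);
        pos++;
        while (pos < tokens.Count)
        {
            Token t = tokens[pos];
            if (t.Kind == TokenKind.CloseTag)
            {
                // Either our own close tag, or one that belongs to an ancestor.
                hasCloseTag = (t.Value == block.TagName);
                return block;
            }
            if (t.Kind == TokenKind.OpenTag)
            {
                bool childClosed;
                block.Children.Add(Grab(tokens, ref pos, out childClosed));
                if (childClosed) pos++;     // step past the child's close tag
            }
            else
            {
                block.Children.Add(t.Value);
                pos++;
            }
        }
        hasCloseTag = false;                // ran out of input
        return block;
    }

    // Tokens for: <head>\r\n<title>abc</title>\r\n</head>
    public static IList<Token> HeadSample()
    {
        return new List<Token>
        {
            new Token(TokenKind.OpenTag, "head"),
            new Token(TokenKind.Text, "\r\n"),
            new Token(TokenKind.OpenTag, "title"),
            new Token(TokenKind.Text, "abc"),
            new Token(TokenKind.CloseTag, "title"),
            new Token(TokenKind.Text, "\r\n"),
            new Token(TokenKind.CloseTag, "head"),
        };
    }
}
```

Calling `Grab` on `HeadSample()` with `pos` starting at 0 yields a `head` block with three children (text, a `title` block, text) and leaves `pos` on the `</head>` token, which matches the shape asserted in the example above.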
The parser currently reads forward only. To read backward, parse the HTML into a block first.

HtmlContentBlock

The idea of
However, using its iterator with LINQ (or
It provides two versions of iterators: the first one, which is the default, is for
Another way to search for a tag is through the

Issues

Script Support

This parser handles a JavaScript tag by simple means, but not other script languages. It does not understand full JavaScript syntax, but it can recognise JS strings and comments, and it treats all JavaScript code as normal text.

Bugs

There is at least one case that can cause the parser to fail, and when it fails, it throws

    This is <b>HTML <i>text</i</b>.
The parser expects a well-formed close tag, but </i is not one. I have marked the point in HtmlLegacyParser.cs where you can handle this case with TODO:.

Some Unrelated Issue...

While I have been developing

Finally

I hope this library is useful to others too. If you fix a bug or add a feature, please share it with me.

History