Introduction

This article presents a simple HTML parser library that I developed for an automated application. The parser mainly detects tag syntax, and it can collect a tag pair as a group. I tried to use a parser generator such as ANTLR, but I was in a hurry and had no time to study its syntax, so I ended up writing the parser myself. The parser was intended to be used with HTML content retrieved by .NET.

Using the code

The library and the tool are written in .NET 3.5 with Visual Studio 2008. There are currently two archive files: NativeWebSurf and NativeWebSurf_1.0.1. The former does not make heavy use of LINQ or extension methods, while the latter uses LINQ more. (I finally found it to be a rather convenient language feature :). Anyone who would like to convert the code to .NET 2.0 should start from the former archive.

The NativeWebSurf solution contains 3 projects.
The parser class is HtmlParser:

using RZ.Web;

namespace TestLib
{
    class Program
    {
        HtmlParser parser = new HtmlParser(
            "<html><body>any HTML/text here...</body></html>");
    }
}
A class name shown in italics is an abstract class.
When the parser is just created, its
With lambda expressions in C# 3, the predicate becomes more compact than an anonymous method:

parser.MoveToTag( tag => tag.TagName == "meta" && tag.Attributes["name"] == "Rating" );
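To make the comparison with anonymous methods concrete, here is a minimal, self-contained sketch that writes the same predicate both ways. TagInfo, Attr and PredicateDemo are hypothetical stand-ins invented for this illustration and are not the library's types; the actual parameter type that MoveToTag accepts is not shown in this article, so Predicate<TagInfo> is only an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the parser's tag object (not the RZ.Web type).
class TagInfo
{
    public string TagName;
    public Dictionary<string, string> Attributes = new Dictionary<string, string>();
}

static class PredicateDemo
{
    // Hypothetical helper: returns null instead of throwing when the attribute is missing.
    static string Attr(TagInfo tag, string key)
    {
        string value;
        return tag.Attributes.TryGetValue(key, out value) ? value : null;
    }

    // C# 2 style: anonymous method.
    public static readonly Predicate<TagInfo> IsRatingMetaOld = delegate(TagInfo tag)
    {
        return tag.TagName == "meta" && Attr(tag, "name") == "Rating";
    };

    // C# 3 style: the same test as a lambda expression.
    public static readonly Predicate<TagInfo> IsRatingMetaNew =
        tag => tag.TagName == "meta" && Attr(tag, "name") == "Rating";

    static void Main()
    {
        TagInfo title = new TagInfo { TagName = "title" };
        TagInfo meta = new TagInfo { TagName = "meta" };
        meta.Attributes["name"] = "Rating";

        List<TagInfo> tags = new List<TagInfo>();
        tags.Add(title);
        tags.Add(meta);

        // List<T>.Find takes a Predicate<T>, so both forms are interchangeable here.
        Console.WriteLine(object.ReferenceEquals(tags.Find(IsRatingMetaOld), meta));
        Console.WriteLine(object.ReferenceEquals(tags.Find(IsRatingMetaNew), meta));
    }
}
```

Both delegates behave identically; the lambda only drops the delegate keyword, the parameter type and the return statement, which is why it reads better when passed inline to a method.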
Example:

HtmlParser parser = new HtmlParser(
@"<html>
<head>
<title>abc</title>
</head>
<body>any text here...
</body>
</html>"
);
Boolean found = parser.MoveToTag("head"); // move to the <head> tag; called outside Debug.Assert so it also runs in release builds
Debug.Assert( found );
Boolean hasCloseTag; // false means the parser could not find the matching close tag.
HtmlContentBlock headBlock = parser.GrabCurrentTag(out hasCloseTag);
Debug.Assert( hasCloseTag ); // since we have </head>
Debug.Assert( headBlock.TagName == "head" );
Debug.Assert( headBlock.Attributes.Count == 0 ); // no attributes in <head>
Debug.Assert( headBlock.Count == 3 ); // there are 3 contents inside headBlock
Debug.Assert( headBlock[0] is HtmlContentText ); // \r\n between <head> and <title>
Debug.Assert( headBlock[1] is HtmlContentBlock ); // block of <title>
Debug.Assert( headBlock[2] is HtmlContentText ); // \r\n between </title> and </head>
HtmlContentBlock titleBlock = (HtmlContentBlock) headBlock[1];
Debug.Assert( titleBlock.Count == 1 );
Debug.Assert( titleBlock[0].ToString() == "abc" );
Debug.Assert( parser.CurrentContent is HtmlContentCloseTag ); // it is at </head>
The current parser reads forward only. To read backward, parse the HTML into a block first.

HtmlContentBlock

The idea of
However, using its iterator with LINQ (or
It provides two versions of iterators: the first one, which is the default, is for
Another way to search for a tag is through

Issues

Script Support

This parser handles JavaScript script tags, but no other script languages, and only in a simple way. It does not understand the full JavaScript syntax, but it recognizes JavaScript strings and comments, and it treats all script code as normal text.

Bugs

There is at least one case that makes the parser fail; when it fails, it throws HtmlParserException. The case I know of is a malformed tag such as:

This is <b>HTML <i>text</i</b>.

The parser expects a well-formed close tag, but </i is not one. I have marked the code with TODO: in

Some unrelated issue...

While I have been developing

End

Finally, I hope this can be a useful library for others too. If you fix a bug or add a feature, please share it with me too :)

History

2008-02-27 Add a few features and refactor the lib.
2008-02-24 Start Article.