Introduction
The idea for this HTML parser is partly based on the MIL HTML Parser (http://www.codeproject.com/dotnet/apmilhtml.asp). Instead of Substring() and index-based string operations, its tokenizer uses a fast, forward-only reader (StringReader). The output is a DOM tree of a given HTML document, allowing a developer to navigate the document in a methodical way.

Background
The library was written to parse HTML documents and extract certain information from them, such as links to other sites, images, etc. Regular expressions might work for some documents as well, but they have limitations inside script blocks or comments. XML parsers are not appropriate for HTML, because HTML is not required to be valid XML (only XHTML is). Relying heavily on string operations requires a lot of index calculations and does not perform or scale well on larger documents, because strings are immutable (every operation creates a new string that later has to be garbage-collected), so I decided to take a stream-based approach.

While the internal tokenizer reads the next character from the stream until the end, it has to maintain a state, e.g. whether it is inside quotes or script code, in order to recognize tokens and their ends correctly ("</" inside a quote is fine; elsewhere it marks an end tag). The decision about further processing and the next state is therefore based on the current state and the current character. In certain cases it is also necessary to read ahead one or more characters (for example, to recognize script blocks that are not enclosed in comment markers).

Using the code
Using the code is simple: just pass the HTML document to the parser (the example here reads it from a web response stream). As a result you get a DOM collection of the nodes. The DOM contains element and text nodes with their attributes, so it is easy to work with them further. There are no third-party dependencies; just build the DLL and add it as a reference.

// Get some HTML from a web page
// Requires: using System.IO; using System.Net;
WebRequest request = WebRequest.Create("http://www.codeproject.com/index.asp");
HtmlDocument document;
// The using blocks close the response and the stream even if parsing throws
using (WebResponse response = request.GetResponse())
using (Stream stream = response.GetResponseStream())
{
    // Pass the stream to the document/parser and get a DOM back
    document = HtmlDocument.Create(stream);
}
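
The state handling described under Background can be illustrated with a small sketch. This is not the library's actual tokenizer: the class TagScanner, the method FindEndTags, and the State enum are hypothetical names made up for this example, and it only covers quotes inside tags (script blocks and comments are left out). It shows the two ideas from the text: the next step depends on the current state plus the current character, and a one-character read-ahead (Peek) tells "</" apart from "<".

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical sketch of a state-driven scanner; not the library's tokenizer.
public static class TagScanner
{
    private enum State { Text, InTag, InQuote }

    // Returns the names of end tags that appear outside quoted attribute values.
    public static List<string> FindEndTags(string html)
    {
        var endTags = new List<string>();
        var state = State.Text;
        char quote = '\0';

        using (var reader = new StringReader(html))
        {
            int next;
            while ((next = reader.Read()) != -1)
            {
                char c = (char)next;
                switch (state)
                {
                    case State.Text:
                        if (c == '<')
                        {
                            // Read ahead one character to tell "</" from "<".
                            if (reader.Peek() == '/')
                            {
                                reader.Read(); // consume the '/'
                                endTags.Add(ReadName(reader));
                            }
                            state = State.InTag;
                        }
                        break;

                    case State.InTag:
                        if (c == '"' || c == '\'')
                        {
                            quote = c; // remember which quote opened the value
                            state = State.InQuote;
                        }
                        else if (c == '>')
                        {
                            state = State.Text;
                        }
                        break;

                    case State.InQuote:
                        // Everything up to the matching quote is attribute text, even "</".
                        if (c == quote)
                        {
                            state = State.InTag;
                        }
                        break;
                }
            }
        }
        return endTags;
    }

    // Reads consecutive letters and digits as a tag name.
    private static string ReadName(StringReader reader)
    {
        var name = new StringBuilder();
        while (reader.Peek() != -1 && char.IsLetterOrDigit((char)reader.Peek()))
        {
            name.Append((char)reader.Read());
        }
        return name.ToString();
    }
}
```

For example, TagScanner.FindEndTags("<a title=\"x</b>y\">link</a>") returns a list with the single entry "a": the "</b>" sits inside a quoted attribute value and is skipped, while the final "</a>" is recognized as an end tag.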

Points of Interest
The parsed DOM can be written back out as HTML, but this output option is pretty basic and surely has room for improvement.

History