Introduction
The idea for this HTML parser is partly based on the MIL HTML Parser. Instead of SubString() and index-based string operations, the tokenizer uses a fast, forward-only reader (StringReader). The output is a DOM tree of the given HTML document, which lets a developer navigate the document in a methodical way.
Background
The library was written to parse HTML documents and extract certain information from them, such as links to other sites, images, etc. Regular expressions can do this for some documents, but they run into limitations inside script blocks and comments. XML parsers are not appropriate for HTML, because HTML is not required to be valid XML (only XHTML is). Relying heavily on string operations requires a lot of index calculations and does not perform or scale well on larger documents: strings are immutable, so every operation creates a new string that later has to be garbage collected. I therefore decided to take a stream-based approach.
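To illustrate the limitation of regular expressions mentioned above, here is a minimal sketch. It is not part of the library; the class name RegexLinkDemo, the method FindHrefs, and the sample markup are made up for this example. A simple href pattern has no notion of context, so it also reports links that sit inside a comment or a script string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the parser library.
public static class RegexLinkDemo
{
    // Returns the value of every double-quoted href="..." found in the text,
    // regardless of where in the document it appears.
    public static List<string> FindHrefs(string html)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(html, "href\\s*=\\s*\"([^\"]*)\"",
                                          RegexOptions.IgnoreCase))
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }

    public static void Main()
    {
        string html =
            "<a href=\"real.html\">ok</a>" +
            "<!-- <a href=\"commented.html\">old</a> -->" +
            "<script>var s = '<a href=\"script.html\">x</a>';</script>";

        // Only real.html is an actual link in the rendered page,
        // but the pattern matches all three occurrences.
        foreach (string href in FindHrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

A more elaborate pattern could exclude some of these cases, but it would effectively have to re-implement the context tracking that a tokenizer does anyway.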
The internal tokenizer reads one character at a time from the stream until the end. To recognize tokens and where they end, it has to keep track of a state, e.g. whether it is currently inside quotes or inside script code: "</" inside quotes is ordinary content, whereas anywhere else it marks an end tag. The next processing step and state are therefore decided from the current state and the current character. In certain cases it is also necessary to read ahead one or more characters, for example to recognize script blocks that are not enclosed in comment markers.
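The state handling described above can be sketched roughly as follows. This is a simplified illustration and not the library's actual tokenizer: the names TokenizerSketch, State, Tokenize, IsTagStart and Flush are assumptions made for this example, and script blocks and comments are deliberately left out. It shows the two ideas from the text: the next action depends on the current state plus the current character, and StringReader.Peek() provides a one-character look-ahead.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical sketch, not the library's real tokenizer.
public static class TokenizerSketch
{
    private enum State { Text, Tag, Quoted }

    // Splits HTML into raw tokens: complete tags and the text runs between them.
    public static List<string> Tokenize(string html)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        State state = State.Text;
        char quote = '\0';

        using (var reader = new StringReader(html))
        {
            int next;
            while ((next = reader.Read()) != -1)
            {
                char c = (char)next;
                switch (state)
                {
                    case State.Text:
                        // Look ahead one character: "<" only opens a tag when
                        // followed by a letter, "/" or "!".
                        if (c == '<' && IsTagStart(reader.Peek()))
                        {
                            Flush(tokens, current);
                            state = State.Tag;
                        }
                        current.Append(c);
                        break;

                    case State.Tag:
                        current.Append(c);
                        if (c == '"' || c == '\'')
                        {
                            quote = c;          // remember which quote closes the value
                            state = State.Quoted;
                        }
                        else if (c == '>')
                        {
                            Flush(tokens, current);
                            state = State.Text;
                        }
                        break;

                    case State.Quoted:
                        // Inside quotes, "<", "/" and ">" are ordinary characters.
                        current.Append(c);
                        if (c == quote) state = State.Tag;
                        break;
                }
            }
        }
        Flush(tokens, current);
        return tokens;
    }

    private static bool IsTagStart(int peeked)
    {
        return peeked != -1
            && (char.IsLetter((char)peeked) || peeked == '/' || peeked == '!');
    }

    // Emits the collected characters as one token and resets the buffer.
    private static void Flush(List<string> tokens, StringBuilder current)
    {
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
            current.Length = 0;
        }
    }

    public static void Main()
    {
        foreach (string token in Tokenize("<a title=\"x</a>\">hi</a>"))
        {
            Console.WriteLine(token);
        }
    }
}
```

For the input in Main, the sketch yields three tokens: the start tag including its quoted "</a>", the text "hi", and the real end tag. A production tokenizer would need further states, for instance for comments and script content.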
Using the Code
Using the code is simple: just pass the HTML document to the parser, either as a string or directly as a Stream or StreamReader. As a result, you get a DOM collection of the nodes. The DOM contains elements (with their attributes) and text nodes, so it is easy to process them further. There are no third-party dependencies; just build the DLL and add it as a reference. The following example downloads a page and prints every anchor element that has an href attribute:
// Requires: using System.IO; using System.Net;
WebRequest request = WebRequest.Create("http://www.codeproject.com/index.asp");
WebResponse response = request.GetResponse();
Stream stream = response.GetResponseStream();

// Parse the response stream into a document tree
HtmlDocument document = HtmlDocument.Create(stream);
stream.Close();
response.Close();

// Walk the parsed nodes and print each <a> element that has an href
foreach (HtmlNode node in document.nodes)
{
    // Text nodes are skipped: the cast yields null for them
    HtmlElement element = node as HtmlElement;
    if (null != element)
    {
        if ((element.Name.ToLower() == "a")
            && element.Attributes.Contains("href"))
        {
            System.Console.WriteLine(element.ToString());
        }
    }
}
Points of Interest
The parsed DOM can be written back out as HTML, but this output option is pretty basic and certainly has room for improvement.
History
- 2007/08/30 - Initial release
- 2007/08/31 - The parser now accepts a Stream or StreamReader directly as input