Licence CPOL
First Posted 29 Aug 2007

Stream Based HTML Parser

By | 29 Aug 2007 | Article
Stream based HTML parser using pure .NET/C#

Introduction

The idea for this HTML parser is partly based on the MIL HTML Parser. Instead of Substring() and index-based string operations, it uses a fast, forward-only reader (StringReader) for the tokenizer. The output is a DOM tree of the given HTML document, which lets a developer navigate the document in a methodical way.

Background

The library was written to parse HTML documents and extract certain information from them, such as links to other sites, images, etc. Regular expressions might work for some documents as well, but they run into limitations within script blocks or comments. XML parsers are not appropriate for HTML, because HTML is not required to be valid XML (only XHTML is). Relying heavily on string operations requires a lot of index calculations and does not perform or scale well on larger documents, because strings are immutable (every operation creates a new string that later has to be garbage collected), so I decided to take a stream-based approach.
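To make the comment limitation concrete, here is a minimal sketch. It is not part of the library; the class name RegexLinkDemo and the pattern are assumptions chosen for illustration only. A naive href pattern has no notion of where a comment starts or ends, so it also reports links that are commented out.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the parser library.
public static class RegexLinkDemo
{
    // Naive pattern: any href="..." value, with no knowledge of comments or scripts.
    private static readonly Regex HrefPattern =
        new Regex("href\\s*=\\s*\"([^\"]*)\"", RegexOptions.IgnoreCase);

    // Returns every href value the pattern finds, in document order.
    public static List<string> FindLinks(string html)
    {
        var links = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            links.Add(match.Groups[1].Value);
        }
        return links;
    }
}
```

Calling RegexLinkDemo.FindLinks on a document that contains one live link and one link inside <!-- ... --> returns both values, although only the first one is actually part of the page.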

While the internal tokenizer reads one character at a time from the stream until the end, it has to track state, e.g. whether it is inside quotes or script code, in order to recognize tokens and where they end correctly ("</" inside a quoted string is plain text; elsewhere it marks an end tag). The next processing step and state are therefore decided from the current state and the current character. In certain cases it is also necessary to read ahead one or more characters, for example to recognize script blocks that are not enclosed in comment markers.
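The state handling described above can be sketched roughly as follows. This is a simplified illustration and not the library's actual tokenizer: the names TokenizerSketch, State and StartsTag are assumptions, and comments and script blocks are deliberately left out. It shows only the two ideas from the text: the action depends on the current state plus the current character, and a one-character read-ahead (Peek) decides whether "<" really starts a tag.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical sketch for illustration; not the library's tokenizer.
public static class TokenizerSketch
{
    private enum State { Text, Tag, Quoted }

    // Splits the input into text runs and raw tags ("<...>"),
    // reading one character at a time from a StringReader.
    public static List<string> Tokenize(string html)
    {
        var tokens = new List<string>();
        var buffer = new StringBuilder();
        var state = State.Text;
        char quote = '\0';

        using (var reader = new StringReader(html))
        {
            int next;
            while ((next = reader.Read()) != -1)
            {
                char c = (char)next;
                switch (state)
                {
                    case State.Text:
                        // Read ahead one character: "<" followed by a space is just text.
                        if (c == '<' && StartsTag(reader.Peek()))
                        {
                            Flush(tokens, buffer); // end the pending text run
                            state = State.Tag;
                        }
                        buffer.Append(c);
                        break;

                    case State.Tag:
                        buffer.Append(c);
                        if (c == '"' || c == '\'')
                        {
                            quote = c; // remember which quote closes the value
                            state = State.Quoted;
                        }
                        else if (c == '>')
                        {
                            Flush(tokens, buffer); // tag is complete
                            state = State.Text;
                        }
                        break;

                    case State.Quoted:
                        // Inside a quoted value, ">" and "</" are ordinary characters.
                        buffer.Append(c);
                        if (c == quote) state = State.Tag;
                        break;
                }
            }
        }
        Flush(tokens, buffer); // whatever is left at end of input
        return tokens;
    }

    private static bool StartsTag(int peeked)
    {
        return peeked != -1
            && (char.IsLetter((char)peeked) || peeked == '/' || peeked == '!');
    }

    private static void Flush(List<string> tokens, StringBuilder buffer)
    {
        if (buffer.Length > 0)
        {
            tokens.Add(buffer.ToString());
            buffer.Length = 0;
        }
    }
}
```

For the input <a title="x > y">1 < 2</a> this sketch yields three tokens: the complete start tag (the ">" inside the quotes does not end it), the text run 1 < 2 (the "<" followed by a space is not treated as a tag), and the end tag. A real tokenizer needs further states for comments and script content, which is where reading ahead more than one character comes in.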

Using the Code

Using the code is simple: just pass the HTML document to the parser (the example below hands over the response stream directly). As a result, you get a DOM collection of the nodes. The DOM contains element and text nodes, the elements with their attributes, so it is easy to work with them further. There are no third-party dependencies; just build the DLL and add it as a reference.

using System;
using System.IO;
using System.Net;

// Get some HTML from a Web page
WebRequest request = WebRequest.Create("http://www.codeproject.com/index.asp");
HtmlDocument document;

// The using blocks close the stream and the response even if parsing throws
using (WebResponse response = request.GetResponse())
using (Stream stream = response.GetResponseStream())
{
    // Pass the stream to the document/parser and get a DOM back
    document = HtmlDocument.Create(stream);
}

// Do whatever you want to do, e.g. list all links
foreach (HtmlNode node in document.nodes)
{
    // Text nodes are not elements, so the cast yields null for them
    HtmlElement element = node as HtmlElement;
    if (element != null
        && string.Equals(element.Name, "a", StringComparison.OrdinalIgnoreCase)
        && element.Attributes.Contains("href"))
    {
        Console.WriteLine(element.ToString());
    }
}

Points of Interest

The parsed DOM can be written back out as HTML, but this output option is pretty basic and surely has room for improvement.

History

  • 2007/08/30 - Initial release
  • 2007/08/31 - The parser now accepts a Stream or StreamReader directly as input

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

xidar

Web Developer

Norway


Last Updated 30 Aug 2007
Article Copyright 2007 by xidar