Stream based HTML Parser

By xidar

Stream based HTML parser using pure .NET/C#
HTML, C# 2.0, C# 3.0, C#, Windows, .NET, .NET 3.0, .NET 2.0, Visual Studio, Dev

Posted: 29 Aug 2007
Updated: 29 Aug 2007
Views: 9,155

Introduction

The idea for this HTML parser is partly based on the MIL HTML Parser (http://www.codeproject.com/dotnet/apmilhtml.asp). Instead of Substring() and index-based string operations, its tokenizer uses a fast, forward-only reader (StringReader). The output is a DOM tree of the given HTML document, which lets a developer navigate the document in a methodical way.

Background

The library was written to parse HTML documents and extract certain information from them, such as links to other sites, images, etc. Regular expressions might work for some documents as well, but they have limitations within script blocks or comments. XML parsers are not appropriate for HTML, because HTML is not required to be valid XML (only XHTML is). Relying heavily on string operations requires a lot of index calculations and will not perform or scale well on larger documents: strings are immutable, so each operation creates new strings that must later be garbage collected. So I decided to take a stream-based approach.
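
To illustrate the comment limitation mentioned above, here is a small sketch that runs a naive href pattern over a document. The class name RegexLinkDemo, the method FindHrefs and the pattern itself are made up for this example and are not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Illustration only: a naive regular expression for href attributes.
public static class RegexLinkDemo
{
    // Returns every double-quoted href value the pattern finds,
    // regardless of whether it sits inside an HTML comment.
    public static List<string> FindHrefs(string html)
    {
        List<string> result = new List<string>();
        MatchCollection matches = Regex.Matches(
            html, "href\\s*=\\s*\"([^\"]*)\"", RegexOptions.IgnoreCase);
        foreach (Match m in matches)
        {
            result.Add(m.Groups[1].Value);
        }
        return result;
    }
}
```

For the input `<!-- <a href="old.html">old</a> --><a href="new.html">new</a>` this returns both old.html and new.html, although the first link is commented out. The pattern has no notion of the context a match appears in, which is what a stateful tokenizer provides.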

While the internal tokenizer takes the next character from the stream until the end is reached, it has to manage a state, e.g. whether it is inside quotes or script code, in order to recognize tokens and their ends correctly ("</" inside a quote is fine; elsewhere it marks an end tag). The decision about further processing and the next state is therefore based on the current state and the current character. In certain cases it is also necessary to read ahead one or more characters, for example to recognize script blocks that are not enclosed in comment markers.
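
As a rough illustration of this kind of state handling, here is a minimal sketch of a forward-only tokenizer. It is not the library's tokenizer: the names SketchTokenizer, Tokenize, StartsTag and State are invented for this example, and it only distinguishes text, tags and quoted attribute values, ignoring comments and script blocks entirely. It uses TextReader.Read() to consume characters and TextReader.Peek() for a one-character look-ahead.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical sketch; not the tokenizer shipped with the library.
public static class SketchTokenizer
{
    private enum State { Text, Tag, Quoted }

    // Look-ahead check: '<' only opens a tag if a letter, '/' or '!' follows.
    private static bool StartsTag(int next)
    {
        return next != -1
            && (char.IsLetter((char)next) || next == '/' || next == '!');
    }

    // Splits the input into text tokens and tag tokens ("<...>"),
    // reading one character at a time and never going backwards.
    public static List<string> Tokenize(TextReader reader)
    {
        List<string> tokens = new List<string>();
        StringBuilder current = new StringBuilder();
        State state = State.Text;
        char quote = '\0';
        int c;

        while ((c = reader.Read()) != -1)
        {
            char ch = (char)c;
            switch (state)
            {
                case State.Text:
                    if (ch == '<' && StartsTag(reader.Peek()))
                    {
                        // A tag begins: flush any pending text first.
                        if (current.Length > 0)
                        {
                            tokens.Add(current.ToString());
                            current.Length = 0;
                        }
                        state = State.Tag;
                    }
                    current.Append(ch);
                    break;

                case State.Tag:
                    current.Append(ch);
                    if (ch == '"' || ch == '\'')
                    {
                        // Remember which quote opened the value.
                        quote = ch;
                        state = State.Quoted;
                    }
                    else if (ch == '>')
                    {
                        tokens.Add(current.ToString());
                        current.Length = 0;
                        state = State.Text;
                    }
                    break;

                case State.Quoted:
                    // Inside quotes, '>' and '<' are ordinary characters.
                    current.Append(ch);
                    if (ch == quote) state = State.Tag;
                    break;
            }
        }

        if (current.Length > 0) tokens.Add(current.ToString());
        return tokens;
    }
}
```

Calling `SketchTokenizer.Tokenize(new StringReader("<a href=\"x>y\">1 < 2</a>"))` yields three tokens: `<a href="x>y">`, `1 < 2` and `</a>`. The `>` inside the quoted value does not end the tag, and the lone `<` followed by a space stays part of the text because of the look-ahead.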

Using the code

Using the code is simple: just pass the HTML document to the parser as a Stream or StreamReader. As a result you get a DOM collection of the nodes. The DOM contains elements and text nodes, and elements carry their attributes, so it is easy to process them further. There are no third-party dependencies; just build the DLL and add it as a reference.

    // Requires: using System; using System.IO; using System.Net;

    // Get some HTML from a web page
    WebRequest request = WebRequest.Create("http://www.codeproject.com/index.asp");
    HtmlDocument document;
    using (WebResponse response = request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    {
        // Pass the stream to the document/parser and get a DOM back
        document = HtmlDocument.Create(stream);
    }

    // Do whatever you want with the DOM, e.g. list all links
    foreach (HtmlNode node in document.nodes)
    {
        // Text nodes are skipped; only elements can be links
        HtmlElement element = node as HtmlElement;
        if (element != null
            && string.Equals(element.Name, "a", StringComparison.OrdinalIgnoreCase)
            && element.Attributes.Contains("href"))
        {
            System.Console.WriteLine(element.ToString());
        }
    }
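
The loop above only visits the nodes of the collection it iterates. If elements hold their own child nodes (an assumption here, since the article does not show that part of the library's API), links nested deeper in the tree need a recursive walk. The sketch below shows the idea with stand-in types; SketchNode, SketchText, SketchElement and LinkCollector are invented for this example and are not the library's classes.

```csharp
using System;
using System.Collections.Generic;

// Stand-in types for illustration only; not the library's classes.
public abstract class SketchNode { }

public class SketchText : SketchNode
{
    public string Text;
    public SketchText(string text) { Text = text; }
}

public class SketchElement : SketchNode
{
    public string Name;
    // Attribute names are compared case-insensitively.
    public Dictionary<string, string> Attributes =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
    public List<SketchNode> Children = new List<SketchNode>();
    public SketchElement(string name) { Name = name; }
}

public static class LinkCollector
{
    // Depth-first walk that collects the href value of every <a> element.
    public static void CollectLinks(IEnumerable<SketchNode> nodes, List<string> links)
    {
        foreach (SketchNode node in nodes)
        {
            SketchElement element = node as SketchElement;
            if (element == null) continue; // text nodes have no children

            string href;
            if (string.Equals(element.Name, "a", StringComparison.OrdinalIgnoreCase)
                && element.Attributes.TryGetValue("href", out href))
            {
                links.Add(href);
            }

            // Descend into the element's children.
            CollectLinks(element.Children, links);
        }
    }
}
```

The walk keeps document order, so links come back in the order they appear in the tree. With the real library, the same pattern would apply to whatever child collection its element type exposes.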

Points of Interest

The parsed DOM can be written back out as HTML, but this output option is pretty basic and surely has room for improvement.

History

  • 2007/08/31 The parser now takes a Stream or StreamReader directly as input
  • 2007/08/30 Initial release

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the article's discussion board.

About the Author

xidar

Occupation: Web Developer
Location: Norway

Copyright 2007 by xidar