Licence Public Domain
First Posted 11 Jun 2009

AfterWork HTML Parser in C#

By Aleksey Bykov | 11 Jun 2009
Actually, this is more of a lexical analyzer, but still very applicable for reading HTML and building a DOM tree.


Introduction

This is a tribute to MIL HTML Parser, which I used a couple of times and which turned out to be unable to read some of the HTML pages out there.

Background

This is an HTML lexical analyzer, which is one step away from a decent HTML parser. The only real difference is that this analyzer produces a flat sequence of HTML tokens and doesn't build an HTML tree structure.

This thing is well-trained to handle many cases of loosely formatted HTML pages (which are pretty common on the Internet).
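To make "a sequence of tokens" concrete, here is a minimal sketch of a toy lexer. It is not the AfterWork code: the `Token` and `MiniLexer` types and the `TEXT` token id are made up for illustration, and only the `TAG_STARTS`, `NAME`, and `TAG_ENDS` ids are borrowed from the sample output shown in this article. Attributes, comments, and closing tags are deliberately left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical token type for this sketch; not the article's own token class.
public sealed class Token
{
    public string Id { get; }
    public string Value { get; }
    public Token(string id, string value) { Id = id; Value = value; }
    public override string ToString() => "[" + Id + ":" + Value + "]";
}

public static class MiniLexer
{
    // Splits markup into a flat list of tokens: '<', tag name, '>' and text runs.
    // No tree is built; that would be the job of a parser on top of this.
    public static List<Token> Tokenize(string html)
    {
        var tokens = new List<Token>();
        int i = 0;
        while (i < html.Length)
        {
            char c = html[i];
            if (c == '<')
            {
                tokens.Add(new Token("TAG_STARTS", "<"));
                i++;
                int start = i;
                while (i < html.Length && char.IsLetterOrDigit(html[i])) i++;
                if (i > start) tokens.Add(new Token("NAME", html.Substring(start, i - start)));
            }
            else if (c == '>')
            {
                tokens.Add(new Token("TAG_ENDS", ">"));
                i++;
            }
            else
            {
                // Everything up to the next angle bracket is one text run.
                int start = i;
                while (i < html.Length && html[i] != '<' && html[i] != '>') i++;
                tokens.Add(new Token("TEXT", html.Substring(start, i - start)));
            }
        }
        return tokens;
    }
}

public static class Program
{
    public static void Main()
    {
        foreach (Token t in MiniLexer.Tokenize("<html><head>Hi"))
            Console.Write(t);
        Console.WriteLine();
        // prints [TAG_STARTS:<][NAME:html][TAG_ENDS:>][TAG_STARTS:<][NAME:head][TAG_ENDS:>][TEXT:Hi]
    }
}
```

Note that the unclosed `<html>` and `<head>` cause no trouble here: a lexer only labels what it sees, and deciding what the missing close tags mean is left to whatever consumes the tokens.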

Using the code

Here are a couple of examples that give a quick introduction to how this thing works.

In the following example, we take a page from eBay and see the sequence of HTML tokens produced by the lexer:

// Requires: using System.Diagnostics; using System.IO; using System.Net;
public void DemoExampleTest()
{
    // Download the page; the response and the reader are disposed when done.
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.ebay.com/");
    string html;
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (StreamReader streamReader = new StreamReader(response.GetResponseStream()))
    {
        html = streamReader.ReadToEnd();
    }
    // Run the lexer over the page.
    HtmlReader reader = new HtmlReader(html);
    IndentBuilder tracker = new IndentBuilder();
    HtmlReader.Read(reader, tracker);
    // Print the token sequence.
    Trace.WriteLine(reader.Builder.ToString());
    // Trace.WriteLine(tracker.ToString());  // << == UNCOMMENT WITH CAUTION!!!
}

/* -- This is what you are likely to see  ---
[TAG_STARTS:<][DTD:!DOCTYPE][WHITESPACE: ][DTD_TOP:html][WHITESPACE: ][DTD_AVAIL:PUBLIC]
[WHITESPACE: ][DTD_FPI:"-//W3C//DTD HTML 4.01 Transitional//EN"][WHITESPACE: ]
[DTD_URL:"http://www.w3.org/TR/html4/loose.dtd"][TAG_ENDS:>][TAG_STARTS:<][NAME:html]
[TAG_ENDS:>][TAG_STARTS:<][NAME:head][TAG_ENDS:>][TAG_STARTS:<][NAME:meta][WHITESPACE: ]
[ATTR:http-equiv][ASSIGN:=][QUOTED_VALUE:"Content-Type"][WHITESPACE: ][ATTR:content]
[ASSIGN:=][QUOTED_VALUE:"text/html; charset=UTF-8"][TAG_ENDS:>][TAG_STARTS:<][NAME:link] ...
--------------------------------------------*/

In the next example, we work with the same page but adjust the HTML grammar being used. Take a look at the handler of the TokenChaning event: this is a very good place to put your code for building an HTML tree structure.

// Requires: using System.Diagnostics; using System.IO; using System.Net;
public void PracticalExampleTest()
{
    // Download the page; the response and the reader are disposed when done.
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.ebay.com/");
    string html;
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (StreamReader streamReader = new StreamReader(response.GetResponseStream()))
    {
        html = streamReader.ReadToEnd();
    }
    // Adjust the grammar before creating the reader.
    HtmlGrammarOptions options = new HtmlGrammarOptions();
    options.HandleCharacterReferences = true;
    options.DecomposeCharacterReference = true;
    options.HandleUnfinishedTags = true;
    HtmlGrammar grammar = new HtmlGrammar(options);
    HtmlReader reader = new HtmlReader(html, grammar);
    // Log each token's id and value; a tree builder would hook in here.
    reader.Builder.TokenChaning += delegate(TokenChangingArgs args)
    {
        if (args.HasBefore)
        {
            Trace.WriteLine(args.Before.Id + ": " + args.Before.Value);
        }
    };
    HtmlReader.Read(reader, null);
}
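As a rough idea of what tree-building code driven from such a handler could look like, here is a minimal sketch of a stack-based builder that tolerates unclosed and stray tags. Everything in it (`HtmlNode`, `TreeBuilder`, `OpenElement`, `CloseElement`, `AddText`, the short list of void elements) is hypothetical and not part of AfterWork. How token ids map to these calls depends on the grammar; the sample output above does not show closing-tag tokens, so that mapping is left out rather than guessed.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical node type for this sketch.
public sealed class HtmlNode
{
    public string Name { get; }
    public string Text { get; }
    public List<HtmlNode> Children { get; } = new List<HtmlNode>();
    public HtmlNode(string name, string text = null) { Name = name; Text = text; }
}

public sealed class TreeBuilder
{
    // A few elements that never have children (an incomplete, illustrative list).
    private static readonly HashSet<string> VoidElements = new HashSet<string>(
        new[] { "br", "hr", "img", "input", "link", "meta" }, StringComparer.OrdinalIgnoreCase);

    // Stack of currently open elements; the root always stays at the bottom.
    private readonly Stack<HtmlNode> open = new Stack<HtmlNode>();

    public HtmlNode Root { get; } = new HtmlNode("#root");

    public TreeBuilder() { open.Push(Root); }

    public void OpenElement(string name)
    {
        var node = new HtmlNode(name);
        open.Peek().Children.Add(node);
        if (!VoidElements.Contains(name)) open.Push(node);
    }

    public void AddText(string text)
    {
        open.Peek().Children.Add(new HtmlNode("#text", text));
    }

    // Closes the nearest open element with this name, implicitly closing
    // anything opened after it. A close tag with no matching open is ignored.
    public void CloseElement(string name)
    {
        bool isOpen = false;
        foreach (HtmlNode n in open)
        {
            if (string.Equals(n.Name, name, StringComparison.OrdinalIgnoreCase)) { isOpen = true; break; }
        }
        if (!isOpen) return;
        while (open.Count > 1)
        {
            HtmlNode top = open.Pop();
            if (string.Equals(top.Name, name, StringComparison.OrdinalIgnoreCase)) break;
        }
    }

    // Compact text form of the tree, e.g. ul(li("one" br)).
    public static string Dump(HtmlNode node)
    {
        if (node.Text != null) return "\"" + node.Text + "\"";
        var sb = new StringBuilder(node.Name);
        if (node.Children.Count > 0)
        {
            sb.Append('(');
            for (int i = 0; i < node.Children.Count; i++)
            {
                if (i > 0) sb.Append(' ');
                sb.Append(Dump(node.Children[i]));
            }
            sb.Append(')');
        }
        return sb.ToString();
    }
}

public static class Program
{
    public static void Main()
    {
        var b = new TreeBuilder();
        b.OpenElement("ul");
        b.OpenElement("li");
        b.AddText("one");
        b.OpenElement("br");
        b.CloseElement("ul");   // the <li> was never closed
        b.OpenElement("p");
        b.CloseElement("div");  // stray close tag, ignored
        Console.WriteLine(TreeBuilder.Dump(b.Root));
        // prints #root(ul(li("one" br)) p)
    }
}
```

The design choice here is to recover silently instead of failing: an unclosed `<li>` is closed when its parent closes, and a close tag that matches nothing is dropped. Real browsers apply far more elaborate recovery rules, so treat this only as a starting point.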

Points of interest

I personally believe that there must be a decent free HTML parser in the public domain. It took me several days to realize that there is nothing on the Internet that can be used out of the box in a C# .NET project for this purpose. Well, let's try to make a difference.

History

This is actually a prototype of a parser I've been working on lately, which means it needs some additional work to be finished. But it is already reasonably functional and reads roughly 98% of the HTML you are likely to see. There are a few cases where it can suddenly stop; if that happens, have a quick look at the HtmlGrammar class, where all missing links and states can easily be added.
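To illustrate what a "missing link" means in a table-driven lexer, here is a small sketch. It does not reflect how HtmlGrammar is actually organized (have a look at the source for that); the `TinyGrammar` class and the `State` and `Input` enums are invented for this example. The lexer stops at the first character for which the current state has no transition, and adding that one transition lets it continue.

```csharp
using System;
using System.Collections.Generic;

public enum State { Text, TagOpen, TagName }
public enum Input { LessThan, GreaterThan, Letter, Other }

// Hypothetical table-driven grammar: a set of (state, input) -> state links.
public sealed class TinyGrammar
{
    private readonly Dictionary<(State, Input), State> transitions =
        new Dictionary<(State, Input), State>();

    public void Add(State from, Input on, State to) { transitions[(from, on)] = to; }

    private static Input Classify(char c)
    {
        if (c == '<') return Input.LessThan;
        if (c == '>') return Input.GreaterThan;
        return char.IsLetter(c) ? Input.Letter : Input.Other;
    }

    // Returns the index of the first character that has no transition
    // from the current state, or -1 if the whole input was consumed.
    public int Run(string input)
    {
        State state = State.Text;
        for (int i = 0; i < input.Length; i++)
        {
            if (!transitions.TryGetValue((state, Classify(input[i])), out State next)) return i;
            state = next;
        }
        return -1;
    }
}

public static class Program
{
    public static void Main()
    {
        var g = new TinyGrammar();
        g.Add(State.Text, Input.Letter, State.Text);
        g.Add(State.Text, Input.Other, State.Text);
        g.Add(State.Text, Input.LessThan, State.TagOpen);
        g.Add(State.TagOpen, Input.Letter, State.TagName);
        g.Add(State.TagName, Input.Letter, State.TagName);
        g.Add(State.TagName, Input.GreaterThan, State.Text);

        Console.WriteLine(g.Run("<b>x"));   // prints -1: fully consumed
        Console.WriteLine(g.Run("a < b"));  // prints 3: no link for a space right after '<'

        // Add the missing link: treat a lone '<' as ordinary text.
        g.Add(State.TagOpen, Input.Other, State.Text);
        Console.WriteLine(g.Run("a < b"));  // prints -1
    }
}
```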

See also

There is also some useful stuff you might be interested in:

License

This article, along with any associated source code and files, is licensed under a Public Domain dedication.

About the Author

Aleksey Bykov



United States
C# .NET developer since 2002

Comments and Discussions

 
General: HTML Agility kit (Ismail Mayat, 6:57 11 Jun '09)
Question: Did you look at SGMLReader? (HightechRider, 6:43 11 Jun '09)

Last Updated 11 Jun 2009
Article Copyright 2009 by Aleksey Bykov