MIL HTML Parser

30 Mar 2004
A non-well-formed HTML parser for .NET

Introduction

This library produces a DOM tree of a given HTML document, allowing the developer to navigate and change the document in a methodical way. In addition to basic HTML production, the library can also produce XHTML documents, as it includes an HTML 4 entity encoder. This release includes a demonstration application in VB.NET showing how to use the library. I hope that it is all fairly self-explanatory.
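
To make the idea of a tree built from non-well-formed HTML concrete, here is a minimal sketch in Python using only the standard library's html.parser module. It is not this library's API: the names Node, TreeBuilder, parse_html and find_all are hypothetical, and the recovery rule (an end tag closes the nearest open element of the same name, stray end tags are ignored) is an assumption chosen for illustration.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML 4.
VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}


class Node:
    """Hypothetical element node: tag name, attributes, and children."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # child Node objects or plain text strings

    def find_all(self, tag):
        """Depth-first search for descendant elements with the given tag."""
        found = []
        for child in self.children:
            if isinstance(child, Node):
                if child.tag == tag:
                    found.append(child)
                found.extend(child.find_all(tag))
        return found


class TreeBuilder(HTMLParser):
    """Builds a Node tree while tolerating unclosed and stray tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # <tag/> adds a node but never opens it.
        self.stack[-1].children.append(Node(tag, attrs))

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(data)


def parse_html(text):
    builder = TreeBuilder()
    builder.feed(text)
    builder.close()
    return builder.root
```

With this sketch, parse_html('<p>One<p>Two <b>bold</p><br/>') returns a tree even though neither paragraph nor the bold element is closed properly, and find_all("p") can then be used to walk it.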

Background

This library was written to avoid having to convert a document into XML prior to reading, whilst preserving its distinct HTML qualities. This gets around some deployment issues I had on different platforms.

Using the code

The simplest way to use the code is to add it to your solution as a C# class library. There are no third-party dependencies, so it is just a matter of adding the source files. Alternatively, you can build the DLL and add it as a reference.

Points of Interest

The XHTML production is fairly basic: there is no built-in DTD checking. So far, I have had no problems with the generated output, but I'm keen to get DTD checking sorted.
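
As an illustration of what an HTML 4 entity encoder does when producing XHTML, the following Python sketch replaces markup characters and non-ASCII characters with entity references. It is an assumption about the general technique, not the library's own implementation: the function name encode_entities is hypothetical, and the lookup table is Python's html.entities.codepoint2name, which covers the HTML 4 named entities. Characters without a named entity fall back to a numeric reference.

```python
from html.entities import codepoint2name


def encode_entities(text):
    """Replace markup and non-ASCII characters with entity references."""
    out = []
    for ch in text:
        code = ord(ch)
        if ch in '&<>"':
            # Markup-significant characters always get their named entity.
            out.append("&" + codepoint2name[code] + ";")
        elif code > 127:
            name = codepoint2name.get(code)
            if name:
                out.append("&" + name + ";")  # HTML 4 named entity
            else:
                out.append("&#%d;" % code)    # numeric fallback
        else:
            out.append(ch)
    return "".join(out)
```

For example, encode_entities("café & <b>") yields "caf&eacute; &amp; &lt;b&gt;".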

History

  • 1.4
  • 1.3
    • Bugfix: <!DOCTYPE...> and <!...> now treated as comments
    • Bugfix: Malformed or incomplete attribute values causing infinite loop fixed
  • 1.2
    • Bugfix: <tag/> now handled properly
    • Bugfix: Parse errors of scripts
    • Bugfix: Parse errors of styles
    • HTML 4 entity encoding
    • DOM tree navigation
    • Basic node searching
    • HTML production
    • XHTML production (as per http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd)
    • Added some component model stuff & comments
    • Hid the parser
  • 1.1
    • Initial release

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.



Written By
United Kingdom

Comments and Discussions

Re: Sweet Work
Jacob Slusser, 27-Mar-04 11:38
SgmlReader
Jonathan de Halleux, 22-Mar-04 3:11
Suggestions
Stephane Rodriguez., 22-Mar-04 2:53
Re: Suggestions
GriffonRL, 22-Mar-04 10:06
Re: Suggestions
Stephane Rodriguez., 22-Mar-04 18:59
Re: Suggestions
GriffonRL, 22-Mar-04 20:46
Re: Suggestions
Stephane Rodriguez., 22-Mar-04 22:13
Re: Suggestions
GriffonRL, 22-Mar-04 22:38
Stephane,

Stephane Rodriguez. wrote:
You can't have looked for very long

Actually I did: I have also played quite a lot with the .NET networking classes, but compared with a library like URLmon, you have work to do before you reach the same ease of use and the same quality when faced with sites and URLs that are sometimes a bit irregular.
But I am sure a very good library is achievable, with SSL, authentication, cookies and all the rest, and easy to use. All that's left is to get started :-D !


Stephane Rodriguez. wrote:
In fact, I have developed it. A screenshot of LongSleeves is here.

Interesting, but how far along are you with this project?


Stephane Rodriguez. wrote:
Mozilla has a wrapper Ax (mozctlx.dll)

I know this one. Unfortunately, it is the only one. I was delighted to find it, but I soon realised that the IE API was not fully implemented and that some important functions I need were missing. Maybe they have updated it very recently... However, I would be very glad to see something similar with a .NET wrapper, even without the same API as IE.


Stephane Rodriguez. wrote:
Don't look too closely at the source code in this article though; I think it's not worth it until a major rewrite. To me, a really strong HTML parser is one that can read HTML as well as XML without changing a single line of code, and that provides both a DOM model (read everything, store everything in memory) and an event-driven model (only the latest elements and attributes are known). Maybe SgmlReader (linked by Jonathan above) should be given a look. At least SgmlReader is written by Microsoft's Chris Lovett, one of the fathers of MSXML.

I took a look at it. It looks fantastic but seems to have two drawbacks:
First, it is quite slow in terms of throughput (converted pages per second) if you plan to use it on a large number of HTML pages. I only saw this in preliminary tests, and benchmarks may vary with hardware and application architecture, but performance is important for one of my applications.
Second, it doesn't handle malformed HTML the way IE does. It expects well-formed HTML, so it may fail on a lot of badly designed pages. If your application targets such pages, you have no choice but to parse them anyway. However, the source code is available, so such an enhancement is possible.

Thanks,


R. LOPES
Just programmer.
Re: Suggestions
Stephane Rodriguez., 22-Mar-04 23:10
Re: Suggestions
Anonymous, 22-Mar-04 20:49
Already exists
Rui Dias Lopes, 22-Mar-04 0:06