First Posted 21 Mar 2004

MIL HTML Parser

By Member 987427 | 30 Mar 2004
A non-well-formed HTML parser for .NET

Introduction

This library produces a DOM (document object model) tree of a given HTML document, allowing the developer to navigate and change the document in a methodical way. In addition to basic HTML production, the library can also produce XHTML documents, as it includes an HTML 4 entity encoder. This release includes a demonstration application in VB.NET showing how to use the library. I hope that it is all fairly self-explanatory.
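To make the idea of a tree built from non-well-formed HTML concrete, here is a minimal sketch in Python. It is not the library's C# API, which this article does not list; `Node`, `TreeBuilder`, `parse` and `find_all` are hypothetical names chosen for illustration, and the tolerant tokenising is delegated to Python's standard `html.parser` module. It shows the three things described above: building a tree from sloppy markup, searching it, and changing a node.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML 4.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "area", "base", "col", "param"}


class Node:
    """One element in the tree; text is stored as plain strings in children."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []

    def find_all(self, tag):
        """Depth-first search for descendant elements with the given tag name."""
        found = []
        for child in self.children:
            if isinstance(child, Node):
                if child.tag == tag:
                    found.append(child)
                found.extend(child.find_all(tag))
        return found


class TreeBuilder(HTMLParser):
    """Builds a Node tree and tolerates unclosed or stray tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the newly opened element

    def handle_startendtag(self, tag, attrs):
        # <tag/> is added as a child but never becomes the open element.
        self.current.children.append(Node(tag, attrs, self.current))

    def handle_endtag(self, tag):
        # Walk up to the nearest matching open element; ignore stray closers.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text
            self.current.children.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


# Unclosed <p>, unquoted upper-case attribute, stray </div>: all tolerated.
doc = parse("<p>One<br><p>Two <a HREF=x.html>link</a></div>")
link = doc.find_all("a")[0]
link.attrs["href"] = "y.html"  # change the document through the tree
```

The sketch makes no attempt at HTML's implied end tags, so the second `<p>` ends up nested inside the first; a real parser has to decide how much of that recovery logic it wants to take on.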

Background

This library was written to avoid having to convert a document into XML prior to reading, whilst preserving the distinct HTML qualities. This gets round some deployment issues I had with different platforms.

Using the code

The simplest way to use the code is to add it into your solution as a C# class library. There are no third-party dependencies so it is just a matter of adding the source files in. Alternatively, you can build the DLL and add it as a reference.

Points of Interest

The XHTML production is fairly basic: there is no built-in DTD checking. So far I have had no problems with the generated output, but I am keen to get DTD checking added.
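Since XHTML output depends on text being escaped consistently, the entity-encoding step is worth illustrating. The following is a minimal sketch in Python, not the library's C# implementation; `encode_entities` is a hypothetical name, and the entity table comes from Python's standard `html.entities.codepoint2name`, which maps code points to HTML 4 entity names. Falling back to numeric references for other non-ASCII characters is an assumption of this sketch, not something the article states about the library.

```python
from html.entities import codepoint2name


def encode_entities(text):
    """Replace characters that have an HTML 4 named entity; use numeric
    character references for any other non-ASCII character."""
    out = []
    for ch in text:
        code = ord(ch)
        if code in codepoint2name:
            out.append("&" + codepoint2name[code] + ";")
        elif code > 127:
            out.append("&#%d;" % code)  # no HTML 4 name: numeric reference
        else:
            out.append(ch)
    return "".join(out)


print(encode_entities("caf\u00e9 & <b>"))  # prints: caf&eacute; &amp; &lt;b&gt;
```

An encoder like this should only be applied to text and attribute values, never to markup that has already been serialised, otherwise the tags themselves get escaped.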

History

  • 1.4
  • 1.3
    • Bugfix: <!DOCTYPE...> and <!...> now treated as comments
    • Bugfix: Malformed or incomplete attribute values no longer cause an infinite loop
  • 1.2
    • Bugfix: <tag/> now handled properly
    • Bugfix: Parse errors of scripts
    • Bugfix: Parse errors of styles
    • HTML 4 entity encoding
    • DOM tree navigation
    • Basic node searching
    • HTML production
    • XHTML production (as per http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd)
    • Added some component model stuff & comments
    • Hid the parser
  • 1.1
    • Initial release

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Member 987427
United Kingdom


Last Updated 31 Mar 2004
Article Copyright 2004 by Member 987427