MIL HTML Parser

30 Mar 2004
A non-well-formed HTML parser for .NET

Introduction

This library builds a DOM (document object model) tree from a given HTML document, allowing the developer to navigate and change the document in a methodical way. Beyond plain HTML output, it can also produce XHTML documents, as it includes an HTML 4 entity encoder. This release includes a demonstration application in VB.NET showing how to use the library. I hope that it is all fairly self-explanatory.
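
The entity-encoding step can be pictured roughly as follows. This is only a minimal sketch and not the library's own encoder: the `EntitySketch` class, its `Encode` method and the eight-entry table are assumptions made for illustration, whereas the full HTML 4 table defines around 250 named entities. Characters that have no entry in the table fall back to numeric character references.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not part of the MIL HTML Parser API.
public static class EntitySketch
{
    // A small subset of the HTML 4 named entities, keyed by character.
    private static readonly Dictionary<char, string> Named = new Dictionary<char, string>
    {
        { '&', "amp" }, { '<', "lt" }, { '>', "gt" }, { '"', "quot" },
        { '\u00A0', "nbsp" }, { '\u00A3', "pound" }, { '\u00A9', "copy" }, { '\u00E9', "eacute" }
    };

    public static string Encode(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            string name;
            if (Named.TryGetValue(c, out name))
            {
                // Known character: emit the named entity, e.g. &amp;
                sb.Append('&').Append(name).Append(';');
            }
            else if (char.IsHighSurrogate(c) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                // Surrogate pair: emit one numeric reference for the whole code point.
                sb.Append("&#").Append(char.ConvertToUtf32(c, text[i + 1])).Append(';');
                i++;
            }
            else if (c > 127)
            {
                // Any other non-ASCII character: decimal numeric reference.
                sb.Append("&#").Append((int)c).Append(';');
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

With this table, `EntitySketch.Encode("Café & <b>")` yields `Caf&eacute; &amp; &lt;b&gt;`, and a character outside the table such as the euro sign becomes `&#8364;`.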

Background

I wrote this library to avoid having to convert a document into XML before reading it, while preserving its distinctly HTML qualities. It also gets round some deployment issues I had on different platforms.

Using the code

The simplest way to use the code is to add it into your solution as a C# class library. There are no third-party dependencies so it is just a matter of adding the source files in. Alternatively, you can build the DLL and add it as a reference.

Points of Interest

The XHTML production is fairly basic: there is no built-in DTD checking. So far I have had no problems with the generated output, but I am keen to get DTD checking sorted.
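
To make "fairly basic" concrete, here is a rough sketch of what XHTML-style output from a node tree involves when no DTD is consulted. The `Node`, `TextNode`, `Element` and `XhtmlSketch` types are hypothetical and are not the library's own classes; the sketch simply lower-cases names, quotes every attribute, escapes markup characters and self-closes a fixed list of empty elements.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical node types for illustration; the library's own classes may differ.
public abstract class Node { }

public sealed class TextNode : Node
{
    public string Text;
    public TextNode(string text) { Text = text; }
}

public sealed class Element : Node
{
    public string Name;
    public List<KeyValuePair<string, string>> Attributes = new List<KeyValuePair<string, string>>();
    public List<Node> Children = new List<Node>();
    public Element(string name) { Name = name; }
}

public static class XhtmlSketch
{
    // Elements that HTML 4 declares empty; these are written as <name />.
    private static readonly HashSet<string> Empty = new HashSet<string>
    {
        "area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"
    };

    public static string Write(Node node)
    {
        var sb = new StringBuilder();
        Write(node, sb);
        return sb.ToString();
    }

    private static void Write(Node node, StringBuilder sb)
    {
        var text = node as TextNode;
        if (text != null) { sb.Append(Escape(text.Text)); return; }

        var el = (Element)node;
        string name = el.Name.ToLowerInvariant();   // XHTML names are lower case
        sb.Append('<').Append(name);
        foreach (var attr in el.Attributes)
        {
            string key = attr.Key.ToLowerInvariant();
            // A minimised attribute (null value) is expanded to key="key".
            sb.Append(' ').Append(key).Append("=\"").Append(Escape(attr.Value ?? key)).Append('"');
        }
        if (Empty.Contains(name)) { sb.Append(" />"); return; }

        sb.Append('>');
        foreach (Node child in el.Children) Write(child, sb);
        sb.Append("</").Append(name).Append('>');
    }

    private static string Escape(string s)
    {
        // Ampersand first, so the other replacements are not double-escaped.
        return s.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;").Replace("\"", "&quot;");
    }
}
```

For example, a `P` element holding the text `Fish & chips` followed by an `IMG` child with `SRC` set to `a.png` comes out as `<p>Fish &amp; chips<img src="a.png" /></p>`. Nothing here checks that the nesting is valid against the strict DTD, which is the gap described above.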

History

  • 1.4
  • 1.3
    • Bugfix: <!DOCTYPE...> and <!...> now treated as comments
    • Bugfix: Malformed or incomplete attribute values causing infinite loop fixed
  • 1.2
    • Bugfix: <tag/> now handled properly
    • Bugfix: Parse errors of scripts
    • Bugfix: Parse errors of styles
    • HTML 4 entity encoding
    • DOM tree navigation
    • Basic node searching
    • HTML production
    • XHTML production (as per http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd)
    • Added some component model stuff & comments
    • Hid the parser
  • 1.1
    • Initial release

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

About the Author

Member 987427

United Kingdom
No Biography provided

Comments and Discussions

 
Question: Whitespace between text and anchor
Mike11424-Oct-12 15:59
Bug: Bug fix for pages with different encoding
Member 846475913-Dec-11 6:40
Bug: Bug fix in returning multi empty spaces between words
Member 846475912-Dec-11 5:31
Bug: Error : Input string was not in a correct format.
Member 846475912-Dec-11 1:51
Question: thank you
Member 84647596-Dec-11 7:09
General: My vote of 5
jp73125-Nov-10 1:45
News: Works very good for Google!
jp73125-Nov-10 1:45
General: Does not remove whitespaces
evald8018-Jan-10 1:25
General: Good man good
niks0412-Jan-10 19:31
Question: Can I get MIL HTML parser Algorithm.
Hasibul Haque26-May-09 10:21
General: congratulations
vukovicg13-May-09 5:03
just wanted to say this is the most useful piece of code I have found on the web recently. and brilliantly written too!
General: Simply amazing!
the Asocial Ape13-May-09 4:57
General: Lowercased href
exxellence12-Nov-08 2:04
General: feature missing
zeltera17-Aug-08 5:34
General: Re: feature missing
smitsc7-Oct-08 9:11
General: DOCTYPE breaks the parser
benblo14-May-08 6:18
Question: Is it a bug?
huyhk27-Feb-08 21:57
General: Re: Is it a bug?
Natural Cause26-Mar-08 23:34
General: Re: Is it a bug?
NoodleNoggin981-Jul-09 11:03
General: Re: Is it a bug? [modified]
Member 458246616-Jul-09 13:21
General: Re: Is it a bug?
Jeremy Falcon8-Jul-09 5:50
General: Suggestions for new interface methods
Berend Engelbrecht26-Feb-08 10:10
General: Found a bug
stavinski14-Jan-08 11:09
Answer: Re: Found a bug - me too, and the solution
Berend Engelbrecht25-Feb-08 22:04
Question: HTML marked
maingaosuong25-Sep-07 17:19

Last Updated 31 Mar 2004
Article Copyright 2004 by Member 987427
Everything else Copyright © CodeProject, 1999-2015