MIL HTML Parser

By Member 987427, 30 Mar 2004

Introduction

This library produces a DOM tree of a given HTML document, allowing the developer to navigate and change the document in a methodical way. In addition to basic HTML production, the library can also produce XHTML documents, as it includes an HTML 4 entity encoder. Included in this release is a demonstration application in VB.NET showing how to use the library. I hope that it is all fairly self-explanatory.
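
As a rough illustration of what entity encoding involves when producing XHTML, here is a minimal sketch. It is not the library's own encoder: the class name EntityEncoderSketch and the very small entity table are assumptions made for this example, and the full HTML 4 entity set is far larger.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical illustration only; not the library's own encoder.
public static class EntityEncoderSketch
{
    // A small, assumed subset of the HTML 4 named entities.
    private static readonly Dictionary<char, string> Named = new Dictionary<char, string>
    {
        { '&', "amp" }, { '<', "lt" }, { '>', "gt" }, { '"', "quot" },
        { '\u00A0', "nbsp" }, { '\u00A9', "copy" }, { '\u00E9', "eacute" }
    };

    public static string Encode(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            string name;
            if (Named.TryGetValue(c, out name))
                sb.Append('&').Append(name).Append(';');     // named entity, e.g. &amp;
            else if (c > 127)
                sb.Append("&#").Append((int)c).Append(';');  // decimal numeric reference
            else
                sb.Append(c);                                // plain ASCII passes through
        }
        // Note: characters outside the Basic Multilingual Plane (surrogate
        // pairs) are not handled correctly by this sketch.
        return sb.ToString();
    }

    public static void Main()
    {
        // Prints: Caf&eacute; &amp; &lt;bar&gt;
        Console.WriteLine(Encode("Caf\u00E9 & <bar>"));
    }
}
```

Named entities are tried first and numeric references are used as a fallback, so the output stays readable for common characters while remaining pure ASCII.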

Background

This library was written to avoid having to convert a document into XML prior to reading, whilst preserving the distinct HTML qualities. This gets round some deployment issues I had with different platforms.

Using the code

The simplest way to use the code is to add it into your solution as a C# class library. There are no third-party dependencies so it is just a matter of adding the source files in. Alternatively, you can build the DLL and add it as a reference.

Points of Interest

The XHTML production is fairly basic - there is no built-in DTD checking. So far, I have had no problems with the generated output, but I'm keen on getting DTD checking sorted.

History

  • 1.4
  • 1.3
    • Bugfix: <!DOCTYPE...> and <!...> now treated as comments
    • Bugfix: Malformed or incomplete attribute values causing infinite loop fixed
  • 1.2
    • Bugfix: <tag/> now handled properly
    • Bugfix: Parse errors of scripts
    • Bugfix: Parse errors of styles
    • HTML 4 entity encoding
    • DOM tree navigation
    • Basic node searching
    • HTML production
    • XHTML production (as per http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd)
    • Added some component model stuff & comments
    • Hid the parser
  • 1.1
    • Initial release

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Member 987427

United Kingdom
No Biography provided

Comments and Discussions

 
Suggestions for new interface methods (Berend Engelbrecht, 26-Feb-08 9:10)
I have successfully used your library in a project to query books by ISBN and collect catalogue data on them from various sites. Your HTML parser was by far the best of the five or so parser libraries that I tried, but I still missed some features in the API. I made some changes to my copy of the source; perhaps you would be willing to consider them?
 
My changes were:
1- change abstract class HtmlEncoder from internal to public, so that I can decode any HTML text fragment myself.
2- bugfix in decoding the &#xNN; hexadecimal HTML escape in HtmlEncoder.cs (see message earlier today)
3- Introduce extended matching methods for attribute values:
      public enum SearchMethod
      {
         ExactMatch, // default
         ValueBeginsWith, // uses .StartsWith to match beginning of attribute value
         ValueContains // uses .IndexOf to match any part of a value
      }
 
I made an extra overload to FindByAttributeNameValue that has a searchMethod parameter to incorporate this.
 
Usage example: Consider an amazon.com "product overview" for a book. Authors are contained in A elements where the href attribute contains the substring "&field-author=". Having the SearchMethod parameter allows me to directly find only the nodes that I need:
 
HtmlNodeCollection nc = htmlDoc.FindByAttributeNameValue("href", "&field-author=", true, SearchMethod.ValueContains);
 

4- added an extra method FindByNameAttributeNameValue to match both node name and an attribute name/value pair. The example above can be made more efficient by also specifying the node name a:
 
HtmlNodeCollection nc = htmlDoc.FindByNameAttributeNameValue("a", "href", "&field-author=", true, SearchMethod.ValueContains);
 

This will return the same collection, but significantly faster, because it no longer has to iterate through every attribute of each node in the HTML document, only through the small subset of a nodes.
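
The matching described in points 3 and 4 might be sketched as follows. This is a self-contained illustration under stated assumptions, not the library's code: the Node class is a hypothetical stand-in for the library's element type, the search runs over a flat list rather than a document tree, the sample URLs are made up, and the boolean parameter from the calls above is left out because the post does not explain it.

```csharp
using System;
using System.Collections.Generic;

public enum SearchMethod { ExactMatch, ValueBeginsWith, ValueContains }

// Hypothetical stand-in for an element node; not the library's own type.
public class Node
{
    public string Name;
    public Dictionary<string, string> Attributes = new Dictionary<string, string>();

    public Node(string name, string attr, string value)
    {
        Name = name;
        Attributes[attr] = value;
    }
}

public static class SearchSketch
{
    // Compares an attribute value against the wanted text using the chosen method.
    public static bool Matches(string actual, string wanted, SearchMethod method)
    {
        switch (method)
        {
            case SearchMethod.ValueBeginsWith:
                return actual.StartsWith(wanted, StringComparison.Ordinal);
            case SearchMethod.ValueContains:
                return actual.IndexOf(wanted, StringComparison.Ordinal) >= 0;
            default:
                return actual == wanted; // ExactMatch
        }
    }

    public static List<Node> FindByNameAttributeNameValue(
        IEnumerable<Node> nodes, string name, string attr, string value, SearchMethod method)
    {
        var result = new List<Node>();
        foreach (Node n in nodes)
        {
            // Cheap name check first, so attributes of other elements are never examined.
            if (!string.Equals(n.Name, name, StringComparison.OrdinalIgnoreCase)) continue;

            string actual;
            if (n.Attributes.TryGetValue(attr, out actual) && Matches(actual, value, method))
                result.Add(n);
        }
        return result;
    }

    public static void Main()
    {
        // Made-up sample data for illustration.
        var nodes = new List<Node>
        {
            new Node("a", "href", "/s?ie=UTF8&field-author=Jane%20Doe"),
            new Node("a", "href", "/help"),
            new Node("img", "src", "cover.jpg")
        };
        var hits = FindByNameAttributeNameValue(
            nodes, "a", "href", "&field-author=", SearchMethod.ValueContains);
        Console.WriteLine(hits.Count); // prints 1
    }
}
```

Filtering on the node name before touching any attributes is what produces the speed-up: elements with a different name are rejected with a single string comparison.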
 
Best regards,
 
Berend Engelbrecht

Last Updated 31 Mar 2004
Article Copyright 2004 by Member 987427