MIL HTML Parser

By Member 987427, 30 Mar 2004
 

Introduction

This library produces a document (DOM) tree from a given HTML document, allowing the developer to navigate and change the document in a methodical way. In addition to basic HTML production, the library can also produce XHTML documents, as it includes an HTML 4 entity encoder. This release includes a demonstration application in VB.NET showing how to use the library. I hope that it is all fairly self-explanatory.

Background

This library was written to avoid having to convert a document into XML prior to reading, whilst preserving the distinct HTML qualities. This gets round some deployment issues I had with different platforms.

Using the code

The simplest way to use the code is to add it into your solution as a C# class library. There are no third-party dependencies so it is just a matter of adding the source files in. Alternatively, you can build the DLL and add it as a reference.

Points of Interest

The XHTML production is fairly basic: there is no built-in DTD checking. So far I have had no problems with the generated output, but I am keen to get DTD checking sorted.

History

  • 1.4
  • 1.3
    • Bugfix: <!DOCTYPE...> and <!...> now treated as comments
    • Bugfix: Malformed or incomplete attribute values causing infinite loop fixed
  • 1.2
    • Bugfix: <tag/> now handled properly
    • Bugfix: Parse errors of scripts
    • Bugfix: Parse errors of styles
    • HTML 4 entity encoding
    • DOM tree navigation
    • Basic node searching
    • HTML production
    • XHTML production (as per http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd)
    • Added some component model stuff & comments
    • Hid the parser
  • 1.1
    • Initial release

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

About the Author

Member 987427
United Kingdom

Comments and Discussions

 
General: Suggestions for new interface methods (Berend Engelbrecht, 26 Feb '08 - 9:10)
I have successfully used your library in a project that queries books by ISBN and collects catalogue data on them from various sites. Your HTML parser was by far the best of the five or so parser libraries that I tried, but I still missed some features in the API. I made some changes to my copy of the source; perhaps you would be willing to consider them?
 
My changes were:
1- Change the abstract class HtmlEncoder from internal to public, so that I can decode any HTML text fragment myself.
2- Bugfix for decoding the &#xNN; hexadecimal HTML escape in HtmlEncoder.cs (see message earlier today).
3- Introduce extended matching methods for attribute values:
      public enum SearchMethod
      {
         ExactMatch, // default
         ValueBeginsWith, // uses .StartsWith to match beginning of attribute value
         ValueContains // uses .IndexOf to match any part of a value
      }
 
I made an extra overload to FindByAttributeNameValue that has a searchMethod parameter to incorporate this.
 
Usage example: Consider an amazon.com "product overview" for a book. Authors are contained in A elements where the href attribute contains the substring "&field-author=". Having the SearchMethod parameter allows me to directly find only the nodes that I need:
 
HtmlNodeCollection nc = htmlDoc.FindByAttributeNameValue("href", "&field-author=", true, SearchMethod.ValueContains);
 

4- added an extra method FindByNameAttributeNameValue to match both node name and an attribute name/value pair. The example above can be made more efficient by also specifying the node name a:
 
HtmlNodeCollection nc = htmlDoc.FindByNameAttributeNameValue("a", "href", "&field-author=", true, SearchMethod.ValueContains);
 

This returns the same collection, but significantly faster, because it no longer has to iterate through every attribute of each node in the HTML document, only through the small subset of a nodes.
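
The three matching modes described above can be sketched in isolation as follows. This is a minimal illustration only: the AttributeMatcher class and its Matches method are hypothetical names, not part of the library or of the modified source, and ordinal (case-sensitive) comparison is an assumption.

```csharp
using System;

public enum SearchMethod
{
    ExactMatch,       // whole attribute value must equal the search text
    ValueBeginsWith,  // attribute value must start with the search text
    ValueContains     // search text may appear anywhere in the value
}

// Hypothetical helper, not part of the MIL HTML Parser API.
public static class AttributeMatcher
{
    // Returns true when 'value' satisfies 'search' under the given method.
    // Comparison is ordinal and case-sensitive (an assumption of this sketch).
    public static bool Matches(string value, string search, SearchMethod method)
    {
        if (value == null || search == null)
        {
            return false;
        }

        switch (method)
        {
            case SearchMethod.ValueBeginsWith:
                return value.StartsWith(search, StringComparison.Ordinal);
            case SearchMethod.ValueContains:
                return value.IndexOf(search, StringComparison.Ordinal) >= 0;
            default:
                return string.Equals(value, search, StringComparison.Ordinal);
        }
    }
}
```

A search routine would call Matches once per candidate attribute; restricting the candidates to nodes with a given name first is what makes the combined name-plus-attribute search cheaper.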
 
Best regards,
 
Berend Engelbrecht
General: Found a bug (stavinski, 14 Jan '08 - 10:09)
I was calling HtmlDocument.Create(...) on HTML returned from the MSN search site and kept getting a FormatException. I managed to trace it to this call:
 
int v = int.Parse( token.ToString().Substring(2,token.Length-3) );
 
This is line 831 in HtmlEncoder. The expression token.ToString().Substring(2, token.Length-3) produced the value "xB7", because the input uses a hexadecimal character entity, "&#xB7;". I think some logic needs to be added to check for a hex entity as opposed to a decimal one.
 
Thanks,
Mike
Answer: Re: Found a bug - me too, and the solution (Berend Engelbrecht, 25 Feb '08 - 21:04)
Since I had to parse a web site that used &#xA0; for non-breaking spaces everywhere, I took the liberty of fixing it in my copy. I would welcome it if my fix (or similar code) were included in the standard version:
if (token[1] == '#')
{
    // Berend: also support hex notation
    try
    {
        if (token[2] == 'x')
        {
            int v = int.Parse(token.ToString().Substring(3).Split(';')[0], System.Globalization.NumberStyles.HexNumber);
            output.Append((char)v);
        }
        else
        {
            int v = int.Parse(token.ToString().Substring(2, token.Length - 3));
            output.Append((char)v);
        }
    }
    catch (Exception ex)
    {
        Trace.Write(ex);
    }
}
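
For readers who want to try the idea outside the library, here is a standalone sketch of decoding a single numeric character reference in either decimal or hexadecimal form. It is not the library's code: the NumericEntity class and Decode method are hypothetical names, and returning the token unchanged on malformed input (rather than tracing an exception) is a design choice of this sketch.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of HtmlEncoder.
public static class NumericEntity
{
    // Decodes one reference such as "&#183;" or "&#xB7;".
    // Returns the token unchanged when it is not a valid numeric reference.
    public static string Decode(string token)
    {
        if (token == null || token.Length < 4 ||
            !token.StartsWith("&#", StringComparison.Ordinal) ||
            !token.EndsWith(";", StringComparison.Ordinal))
        {
            return token;
        }

        // Text between "&#" and ";", for example "183" or "xB7".
        string digits = token.Substring(2, token.Length - 3);
        int code;
        bool ok;
        if (digits[0] == 'x' || digits[0] == 'X')
        {
            ok = int.TryParse(digits.Substring(1), NumberStyles.AllowHexSpecifier,
                              CultureInfo.InvariantCulture, out code);
        }
        else
        {
            ok = int.TryParse(digits, NumberStyles.None,
                              CultureInfo.InvariantCulture, out code);
        }

        // Reject values outside Unicode and the surrogate range,
        // which char.ConvertFromUtf32 would refuse.
        if (!ok || code < 0 || code > 0x10FFFF || (code >= 0xD800 && code <= 0xDFFF))
        {
            return token;
        }
        return char.ConvertFromUtf32(code);
    }
}
```

Using TryParse avoids the FormatException reported above, and accepting both 'x' and 'X' covers pages that write the hex prefix in upper case.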

Question: HTML marked (maingaosuong, 25 Sep '07 - 16:19)
Hi all,
I'm implementing a WinForms app around an HTML parser.
In my app, the user enters a URL (such as www.amazon.com) and the app shows that page in a web browser control.
I want to let users select an area on that page and have a label control show all the text in the selected area. How can I do that?
In other words, how can I determine which HTML tags in the page enclose the selected text?
EX:
HTML:
<html>
<body>
selected text


none selected text
</body>
</html>
 
Page:
selected text
none selected text
 
When I drag the mouse to enclose "selected text", I want to determine that the table with id=1 is selected, and "selected text" should be shown in a label control.

Please share your ideas.
Thanks in advance.

 
mns

General: regarding extracting tags (rama jayapal, 29 Mar '07 - 3:49)
Can anyone help me solve my problem?

I have developed a web application in which I parse the contents of a web page using the MIL HTML parser, so I now have the document in HTML form.

I need to use the parser's types, such as htmldocument, htmlelement, htmlnode and htmlattributes.

I am really new to the .NET environment, and I need to know how to find tags such as <input type="hidden" ...>.

I need to separate out the input tags first and then read their attributes, such as type="submit" or type="hidden", name="", and so on.

Has anybody done this before, or can anybody give me an idea of how to write a recursive function that separates the input tags from the document?

Please help, I am running short of time.

Thanks
 
Rama
General: Re: regarding extracting tags (James S.F. Hsieh, 29 Mar '07 - 22:09)
Maybe the following program, written for DOL HTML Parser (http://www.codeproject.com/useritems/DOL_HTML_Parser.asp), meets your requirement.
Good luck.
 
// Open the HTML file "xxx.htm"
DHtmlGeneralParser parser = new DHtmlGeneralParser();
DHtmlDocument htmlDoc = new DHtmlDocument(parser);
htmlDoc.Load(@"..\xxx.htm");

DHtmlNodeCollection result = new DHtmlNodeCollection();

// Find all tags matching the pattern <input type="..."> in the whole document
// function: void FindByNameAttribute
// (
//     DHtmlNodeCollection result, // collection that receives the matches
//     string name,                // tag name to find
//     string attributeName,       // attribute name to find
//     bool searchChildren         // whether to search child nodes recursively
// )
htmlDoc.Nodes.FindByNameAttribute(result, "input", "type", true);
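
Independently of either library, the recursive function asked about in this thread can be sketched against a stand-in tree. Everything here is hypothetical: the Node class is a minimal placeholder and not the node type of MIL HTML Parser or DOL HTML Parser, and TagFinder is an illustrative name.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a parsed element; not a library type.
public class Node
{
    public string Name;
    public Dictionary<string, string> Attributes =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
    public List<Node> Children = new List<Node>();

    public Node(string name)
    {
        Name = name;
    }
}

public static class TagFinder
{
    // Depth-first walk: adds every element named 'tagName' to 'result'.
    public static void Collect(Node node, string tagName, List<Node> result)
    {
        if (string.Equals(node.Name, tagName, StringComparison.OrdinalIgnoreCase))
        {
            result.Add(node);
        }
        foreach (Node child in node.Children)
        {
            Collect(child, tagName, result);
        }
    }

    // Keeps only the nodes whose attribute 'name' equals 'value' (ignoring case).
    public static List<Node> WithAttribute(List<Node> nodes, string name, string value)
    {
        List<Node> matches = new List<Node>();
        foreach (Node n in nodes)
        {
            string actual;
            if (n.Attributes.TryGetValue(name, out actual) &&
                string.Equals(actual, value, StringComparison.OrdinalIgnoreCase))
            {
                matches.Add(n);
            }
        }
        return matches;
    }
}
```

Calling Collect(root, "input", result) gathers all input elements first; WithAttribute(result, "type", "hidden") then narrows them to the hidden ones, which mirrors the two-step approach described in the question.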
General: You can try it. :) (James S.F. Hsieh, 27 Mar '07 - 22:36)
The MIL HTML Parser is a useful library for me, but the project is no longer maintained.
I created a project, "DOL HTML Parser", based on the MIL HTML Parser on CodeProject, and I hope it can help everyone. :)

A parser for non-well-formed HTML and a CSS resolver.
URL: http://www.codeproject.com/useritems/DOL_HTML_Parser.asp
General: misses some IMG nodes (encapsul, 10 Mar '07 - 7:53)
Great code! For the most part it works, but more often than not it does not pick up an IMG node I am looking for. I have tweaked the source HTML docs a bit and can usually get it to work, but I haven't nailed down the cause.

Are there any requirements or restrictions for the source HTML/XHTML?
 
thanks!
Question: Strings instead of Streams? (Yuvi Panda, 26 Sep '06 - 6:15)
I just thought of asking, why use a String instead of a Stream?
 
Yuvi Panda T
15 Year old Microsoft Student Partner
Blogs at : http://yuvipanda.blogspot.com

Answer: Re: Strings instead of Streams? (Lennard Fonteijn, 1 Dec '06 - 13:16)
I don't understand why you find that so hard when it's so easy!
 
Try this:
 
Dim mDocument As MIL.Html.HtmlDocument
Dim html As String = "Your HTML here" ' instead of a StreamReader result
mDocument = HtmlDocument.Create(html, False)
 
Then just do whatever you want with mDocument; it only takes a bit of fiddling with the demo project...

Though I find it a rather stupid question for a Microsoft Partner, since streams that have been read are actually Strings...

So instead of:
Dim html As String = Stream.ReadToEnd

you just replace Stream.ReadToEnd with your string...
 
Or am I misunderstanding a question here?
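
For callers who do start from a stream, the conversion to a string is short. The sketch below uses only System.IO; the HtmlSource class and ReadAll method are hypothetical names, and the choice of encoding is left to the caller as an assumption.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper, not part of the MIL HTML Parser.
public static class HtmlSource
{
    // Reads the whole stream as text using the given encoding.
    // Disposing the reader also closes the underlying stream.
    public static string ReadAll(Stream stream, Encoding encoding)
    {
        using (StreamReader reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }
}
```

The returned string can then be handed to HtmlDocument.Create in the same way as the literal string in the snippet above. Reading everything up front keeps the parser simple, at the cost of holding the whole document in memory.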
 
You mess with the best, you die like the rest... well... kinda???

Last Updated 31 Mar 2004
Article Copyright 2004 by Member 987427
Everything else Copyright © CodeProject, 1999-2013