
MIL HTML Parser

By , 30 Mar 2004
 

Introduction

This library produces a DOM tree of a given HTML document, allowing the developer to navigate and change the document in a methodical way. In addition to basic HTML production, the library can also be used to produce XHTML documents, as it includes an HTML 4 entity encoder. Included in this release is a demonstration application in VB.NET showing how to use the library. I hope that it is all fairly self-explanatory.

Background

This library was written to avoid having to convert a document into XML prior to reading, whilst preserving the distinct HTML qualities. This gets round some deployment issues I had with different platforms.

Using the code

The simplest way to use the code is to add it into your solution as a C# class library. There are no third-party dependencies so it is just a matter of adding the source files in. Alternatively, you can build the DLL and add it as a reference.

Points of Interest

The XHTML production is fairly basic: there is no built-in DTD checking. So far I have had no problems with the generated output, but I am keen to get DTD checking sorted.

History

  • 1.4
  • 1.3
    • Bugfix: <!DOCTYPE...> and <!...> now treated as comments
    • Bugfix: Malformed or incomplete attribute values causing infinite loop fixed
  • 1.2
    • Bugfix: <tag/> now handled properly
    • Bugfix: Parse errors of scripts
    • Bugfix: Parse errors of styles
    • HTML 4 entity encoding
    • DOM tree navigation
    • Basic node searching
    • HTML production
    • XHTML production (as per http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd)
    • Added some component model stuff & comments
    • Hid the parser
  • 1.1
    • Initial release

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Member 987427
United Kingdom


Comments and Discussions

 
QuestionWhitespace between text and anchormemberMike11424 Oct '12 - 14:59 
When I call HtmlDocument.Create with wantSpaces = false, the parser removes even the necessary white space between text and an anchor. For example, the original HTML
<p>Go to <a href="www.google.com">Google</a></p>
is changed to
<p>Go to<a href="www.google.com">Google</a></p>
which produces a bad result:
Go toGoogle
Is there any workaround?
BugBug fix for pages with different encodingmemberMember 846475913 Dec '11 - 5:40 
There was an error when opening pages that contain hexadecimal Unicode character references, for example in the following string:
 
People say it&#x2019;s a life-changing experience. And there&#x2019;s no doubt that when you make the incredible journey to India and Nepal, you&#x2019;ll return with a new outlook on life and endless extraordinary memories. Fascinating and intriguing, this is a world that&#x2019;s far removed from our western way of living.Our wonderfully designed tour encompasses the very best that Northern India has
 
The &#x2019; sequences are hexadecimal numbers representing characters. For a full list of these codes follow this link: http://code.cside.com/3rdpage/us/unicode/converter.html[^]

The error is at line 828 of HtmlEncoder, in the DecodeValue function. Change those lines to the following:
if (token[1] == '#')
{
    if (token[2] == 'x')
    {
        // The value is a hex code
        int v = Convert.ToInt32(token.ToString().Substring(3, token.Length - 4), 16);
        output.Append((char)v);
    }
    else
    {
        int v = int.Parse(token.ToString().Substring(2, token.Length - 3));
        output.Append((char)v);
    }
}


Good luck
Mesut Talebi
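
For readers applying this fix, here is a minimal self-contained sketch of decoding a single numeric character reference using only standard .NET calls. The class and method names (NumericEntity, Decode) are illustrative assumptions and are not part of the library. Unlike the (char) cast above, it also accepts an uppercase 'X', handles code points above U+FFFF, and returns malformed input unchanged instead of throwing.

```csharp
using System;
using System.Globalization;

// Illustrative helper, not part of the MIL HTML Parser source.
public static class NumericEntity
{
    // Decodes one numeric character reference such as "&#8217;" or "&#x2019;".
    // Returns the token unchanged when it is not a well-formed numeric reference.
    public static string Decode(string token)
    {
        if (token == null || token.Length < 4 ||
            !token.StartsWith("&#", StringComparison.Ordinal) ||
            !token.EndsWith(";", StringComparison.Ordinal))
        {
            return token;
        }

        // Strip the leading "&#" and the trailing ";".
        string body = token.Substring(2, token.Length - 3);
        bool isHex = body[0] == 'x' || body[0] == 'X';
        string digits = isHex ? body.Substring(1) : body;

        int codePoint;
        bool parsed = int.TryParse(
            digits,
            isHex ? NumberStyles.AllowHexSpecifier : NumberStyles.None,
            CultureInfo.InvariantCulture,
            out codePoint);

        // Reject negative values, values beyond Unicode, and lone surrogates,
        // all of which char.ConvertFromUtf32 would refuse with an exception.
        if (!parsed || codePoint < 0 || codePoint > 0x10FFFF ||
            (codePoint >= 0xD800 && codePoint <= 0xDFFF))
        {
            return token;
        }

        // ConvertFromUtf32 returns a surrogate pair for code points above U+FFFF.
        return char.ConvertFromUtf32(codePoint);
    }
}
```

Returning the original token on failure keeps the document text intact when a page contains a broken reference.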

BugBug fix in returning multi empty spaces between wordsmemberMember 846475912 Dec '11 - 4:31 
When we have a token like this:
"Local \n                  Business List"
the function
private string RemoveWhitespace(string input)
returns the output as:
"Local                  Business List"
To solve this problem, change the function as below:

private string RemoveWhitespace(string input)
{
    string output = input.Replace("\r", "");
    output = output.Replace("\n", "");
    output = output.Replace("\t", " ");
    // Collapse runs of spaces to one space. Replacing "  " with ""
    // would delete runs of even length entirely.
    while (output.Contains("  "))
        output = output.Replace("  ", " ");
    output = output.Trim();
    return output;
}

Just add the loop before output = output.Trim();

The resulting value will be: "Local Business List"
 
Smile | :)
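
The same normalization can be sketched with a regular expression. This is an alternative under stated assumptions, not the library's code: WhitespaceNormalizer and Collapse are hypothetical names, and unlike RemoveWhitespace above, a line break between two words becomes a single space rather than being deleted.

```csharp
using System.Text.RegularExpressions;

// Illustrative helper, not part of the MIL HTML Parser source.
public static class WhitespaceNormalizer
{
    // Replaces every run of whitespace (spaces, tabs, CR, LF) with a single
    // space and trims the ends.
    public static string Collapse(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return string.Empty;
        }
        return Regex.Replace(input, @"\s+", " ").Trim();
    }
}
```

Because a whole run is matched at once, the result does not depend on whether the run has an odd or even number of characters.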
BugError : Input string was not in a correct format.memberMember 846475912 Dec '11 - 0:51 
First of all, thank you for this great program.
There is an error on some pages; the error message is the subject of this post.
For example, on this site:
http://travelshop.telegraph.co.uk/

Please take a look at this problem.
I also have some suggestions:
The program removes comments from the HTML content, but sometimes we need to analyse comments. Also, the inner text of scripts and styles is treated as plain text; it would be better if it were added to the tree as script and style nodes.

Thank you
Questionthank youmemberMember 84647596 Dec '11 - 6:09 
Thank you for this great code,
but it needs to be improved. For example, if I have a node I should be able to get the collection of children for that node, and go through nested children without needing to implement a recursive function.
Mesut Talebi
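
A non-recursive walk over nested children can be sketched with an explicit stack. The Node class below is a hypothetical stand-in for a parsed node, not the library's own type; only the traversal technique is the point.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for a parsed node; not the library's own type.
public sealed class Node
{
    public string Name { get; }
    public List<Node> Children { get; } = new List<Node>();

    public Node(string name)
    {
        Name = name;
    }
}

public static class TreeWalker
{
    // Yields every descendant of root in document order, without recursion.
    public static IEnumerable<Node> Descendants(Node root)
    {
        var pending = new Stack<Node>();

        // Push children in reverse so the first child is popped first.
        for (int i = root.Children.Count - 1; i >= 0; i--)
        {
            pending.Push(root.Children[i]);
        }

        while (pending.Count > 0)
        {
            Node current = pending.Pop();
            yield return current;

            for (int i = current.Children.Count - 1; i >= 0; i--)
            {
                pending.Push(current.Children[i]);
            }
        }
    }
}
```

The caller can then write a plain foreach over TreeWalker.Descendants(node), and deeply nested documents cannot overflow the call stack.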

GeneralMy vote of 5memberjp73125 Nov '10 - 0:45 
works perfect!
NewsWorks very good for Google!memberjp73125 Nov '10 - 0:45 
I have recently been using your code to parse Google results and it works perfectly!

Previously, and still occasionally, I have used an online HTML parser that also works very well. I think its code could be very similar to yours, but written in PHP.
GeneralDoes not remove whitespacesmemberevald8018 Jan '10 - 0:25 
Hello,
 
As the subject says, it seems that it does not remove whitespace from the HTML file or, to be more precise, it removes only the white space between tags but not inside a tag!

Can you make it also remove the white space inside tags?
 
thank you
GeneralGood man goodmemberniks0412 Jan '10 - 18:31 
Thumbs Up | :thumbsup:
QuestionCan I get MIL HTML parser Algorithm.memberHasibul Haque26 May '09 - 9:21 
Dear Sir,
Thank you very much for sharing this kind of project. Can I get the algorithm behind it, or any idea of how it works?
 
Hope you will help me.
 
Thanking You
Hasibul
Generalcongratulationsmembervukovicg13 May '09 - 4:03 
just wanted to say this is the most useful piece of code I have found on the web recently. and brilliantly written too!
GeneralSimply amazing!memberthe Asocial Ape13 May '09 - 3:57 
This is real programming. You've made an awesome tool that anyone would be well served to add to their kit!
 

Congratulations, and thank you for sharing this!
 
"A strike line of programmers? That'd be reeeeal hard to break."
-my old, anti-union boss

GeneralLowercased hrefmemberexxellence12 Nov '08 - 1:04 
In version 1.4 all XHTML tags are lowercased; however, all attributes are also lowercased, including the value of the HREF attribute of the A tag. This shouldn't be done because URLs and query strings are case-sensitive, for instance YouTube URLs.
I changed this in HTMLattribute.cs, in the XHTML property, so that HREF attribute values aren't lowercased.
 
Just to let you know.
Generalfeature missingmemberzeltera17 Aug '08 - 4:34 
I cannot select the attributes of a node... Is this a missing feature, or did I just not see it?
thanks for this great parser.
 

GeneralRe: feature missingmembersmitsc7 Oct '08 - 8:11 
since you're a recent user maybe you could show me a simple example of how to make this work ??
GeneralDOCTYPE breaks the parsermemberbenblo14 May '08 - 5:18 
I have a document that starts with a comment and doctype :
 
<!-- #BeginTemplate "/Templates/manual-scriptref-page.dwt" --><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<body>bla</body>
</html><!-- #EndTemplate -->

 
The parser breaks... renders 3 roots: the html tag as a text element, then the head and body.
After a few trials, turns out a doc starting with a comment gets parsed ok, but starting with a doctype doesn't.
 
... just wanted to let the author know, seems odd since the history mentions a doctype bugfix: is the zip up-to-date?
QuestionIs it a bug?memberhuyhk27 Feb '08 - 20:57 
I parse this HTML string
 
<a target='_blank' class=team href=/data/league/team.php?t=610&s=413 class=team>Mariehamn</a>
 
And got an HTMLElement object with 8 attributes
target="_blank"
class="team"
href=""
null
data=null
league=null
team.php?t="t=610&s=413"
class="team"
 
I think the problem is from href attribute.
GeneralRe: Is it a bug?memberNatural Cause26 Mar '08 - 22:34 
It could be your poorly constructed HTML...
 
You don't wrap any of the values in double quotes, you have class specified twice...
GeneralRe: Is it a bug?memberNoodleNoggin981 Jul '09 - 10:03 
hmmmm...i'm not too sure here but i don't think that mal-formed, sloppy, redundant, ill-written, newbie html code is their fault. any parser can handle 'good' code. a parser ought to account for code of this quality without throwing an error.
 
Smile | :)
GeneralRe: Is it a bug? [modified]memberMember 458246616 Jul '09 - 12:21 
You are correct, this is clearly a bug. I found it in the GetTokens method within the HtmlParser.cs file. I have a fix if anyone needs it.
 
I changed the code starting at line # 711 to read as follows:
 
while ((i < input.Length) && (input.Substring(i, 1).IndexOfAny(" \r\n\t>".ToCharArray()) == -1))
{
    i++;
}
int dataLength = (i - value_start);
// Guard against reaching the end of the input without finding a terminator
if ((i < input.Length) && input.Substring(i, 1).Equals(">") &&
    input.Substring(i - 1, 1).Equals("/"))
{
    // if it's a proper end of tag '/>', don't include the '/' in the data
    dataLength--;
}
tokens.Add(input.Substring(value_start, dataLength));
 
The original search list of " \r\n\t/>" was getting a false end-of-data trigger on the forward slashes in the href path. I considered removing the \r\n too, because I didn't think it was valid to trigger the end of data on those characters, but I wasn't sure, so I left them in for now.
 
modified on Thursday, July 16, 2009 6:35 PM
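
The idea behind this fix can be sketched as a standalone function that reads one unquoted attribute value. The names UnquotedValue and Read are hypothetical and the code is a simplified illustration, not the GetTokens method itself.

```csharp
// Illustrative helper, not part of the MIL HTML Parser source.
public static class UnquotedValue
{
    // Reads an unquoted attribute value beginning at index start.
    // next receives the index of the character that ended the value
    // (or input.Length if the input ran out first).
    public static string Read(string input, int start, out int next)
    {
        int i = start;

        // A '/' does not end the value: it is common inside URL paths.
        while (i < input.Length && " \r\n\t>".IndexOf(input[i]) == -1)
        {
            i++;
        }

        int length = i - start;

        // For a self-closing tag ("/>"), leave the '/' out of the value.
        if (i < input.Length && input[i] == '>' && length > 0 && input[i - 1] == '/')
        {
            length--;
        }

        next = i;
        return input.Substring(start, length);
    }
}
```

One trade-off of this approach: an unquoted value that really ends in a slash directly before '>' (for example href=/dir/>) loses that final slash.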

GeneralRe: Is it a bug?memberJeremy Falcon8 Jul '09 - 4:50 
Natural Cause wrote:
You don't wrap any of the values in double quotes

 
Actually that's perfectly valid HTML. It's invalid in XHTML, and although I prefer quoting, it's still valid.
 
Jeremy Falcon
jeremyfalcon.com[^]

GeneralSuggestions for new interface methodsmemberBerend Engelbrecht26 Feb '08 - 9:10 
I have successfully used your library in a project to query books by ISBN and collect catalogue data on them from various sites. Your HTML parser was by far the best of the five or so parser libraries that I tried, but I still missed some features in the API. I made some changes to my copy of the source; perhaps you would be willing to consider them?
 
My changes were:
1- change abstract class HtmlEncoder from internal to public, so that I can decode any html text fragment myself.
2- bugfix in decoding the &#xNN; hexadecimal HTML escape in HtmlEncoder.cs (see message earlier today)
3- Introduce extended matching methods for attribute values:
      public enum SearchMethod
      {
         ExactMatch, // default
         ValueBeginsWith, // uses .StartsWith to match beginning of attribute value
         ValueContains // uses .IndexOf to match any part of a value
      }
 
I made an extra overload to FindByAttributeNameValue that has a searchMethod parameter to incorporate this.
 
Usage example: Consider an amazon.com "product overview" for a book. Authors are contained in A elements where the href attribute contains the substring "&field-author=". Having the SearchMethod parameter allows me to directly find only the nodes that I need:
 
HtmlNodeCollection nc = htmlDoc.FindByAttributeNameValue("href", "&field-author=", true, SearchMethod.ValueContains);
 

4- added an extra method FindByNameAttributeNameValue to match both node name and an attribute name/value pair. The example above can be made more efficient by also specifying the node name a:
 
HtmlNodeCollection nc = htmlDoc.FindByNameAttributeNameValue("a", "href", "&field-author=", true, SearchMethod.ValueContains);
 

This will return the same collection, but significantly faster, because it no longer has to iterate through every attribute of each node in the HTML document, only through the small subset of a nodes.
 
Best regards,
 
Berend Engelbrecht
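
The three matching modes described in point 3 can be sketched as a single comparison function. Only the SearchMethod enum comes from the post; AttributeMatcher, Matches, and the ignoreCase parameter are assumptions made for illustration, and the sketch does not claim to show what the boolean argument in the calls above means.

```csharp
using System;

public enum SearchMethod
{
    ExactMatch,      // default
    ValueBeginsWith, // uses StartsWith to match the beginning of the value
    ValueContains    // uses IndexOf to match any part of the value
}

// Illustrative helper, not part of the MIL HTML Parser source.
public static class AttributeMatcher
{
    // Returns true when the attribute value matches the wanted text
    // under the given search method.
    public static bool Matches(string actual, string wanted, bool ignoreCase, SearchMethod method)
    {
        if (actual == null || wanted == null)
        {
            return false;
        }

        StringComparison comparison = ignoreCase
            ? StringComparison.OrdinalIgnoreCase
            : StringComparison.Ordinal;

        switch (method)
        {
            case SearchMethod.ValueBeginsWith:
                return actual.StartsWith(wanted, comparison);
            case SearchMethod.ValueContains:
                return actual.IndexOf(wanted, comparison) >= 0;
            default:
                return string.Equals(actual, wanted, comparison);
        }
    }
}
```

Ordinal comparison is used here because attribute values such as URLs are not natural-language text, so culture-sensitive rules would only add surprises.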
GeneralFound a bugmemberstavinski14 Jan '08 - 10:09 
I was using HtmlDocument.Create(...) against HTML returned from the MSN search site and kept getting a FormatException. I managed to trace it to this call:

int v = int.Parse( token.ToString().Substring(2,token.Length-3) );

At line 831 in HtmlEncoder, token.ToString().Substring(2, token.Length-3) resulted in the value "xB7", because the page uses the hexadecimal character entity "&#xB7;". I think some logic needs to be added to check for hex entities as opposed to decimal ones.
 
Thanks,
Mike
AnswerRe: Found a bug - me too, and the solutionmemberBerend Engelbrecht25 Feb '08 - 21:04 
Since I had to parse a web site that used &#xA0; for nonbreaking spaces everywhere, I took the liberty of fixing it in my copy. I would welcome my fix (or similar code) being included in the standard version:
if (token[1] == '#')
{
    // Berend: also support hex notation
    try
    {
        if (token[2] == 'x')
        {
            int v = int.Parse(token.ToString().Substring(3).Split(';')[0], System.Globalization.NumberStyles.HexNumber);
            output.Append((char)v);
        }
        else
        {
            int v = int.Parse(token.ToString().Substring(2, token.Length - 3));
            output.Append((char)v);
        }
    }
    catch (Exception ex)
    {
        Trace.Write(ex);
    }
}

QuestionHTML markedmembermaingaosuong25 Sep '07 - 16:19 
Hi all
I'm implementing a WinForms app around an HTML parser.
In my app, the user enters a URL (such as www.amazon.com) and the app shows the requested page in a web browser control.
I want to let users select an area on that page and have a label control show all the text in the selected area. How can I do that?
In other words: how can I determine which HTML tags (in that page) enclose the selected text?
EX:
HTML:
<html>
<body>
selected text


none selected text
</body>
</html>
 
Page:
selected text
none selected text
 
When I drag the mouse to enclose "selected text", I want to determine that the table with id=1 is selected, and "selected text" should be shown in a label control.

Please share your ideas.
Thanks in advance.

 
mns

Last Updated 31 Mar 2004
Article Copyright 2004 by Member 987427