|
When I call HtmlDocument.Create with wantSpaces = false, the parser removes even the necessary whitespace between text and an anchor. For example, the original HTML
<p>Go to <a href="www.google.com">Google</a></p> is changed to
<p>Go to<a href="www.google.com">Google</a></p>, which renders incorrectly as
Go toGoogle. Is there any workaround?
|
|
|
|
|
There was an error when opening pages that contain hexadecimal Unicode character references, for example in the following string:
People say it’s a life-changing experience. And there’s no doubt that when you make the incredible journey to India and Nepal, you’ll return with a new outlook on life and endless extraordinary memories. Fascinating and intriguing, this is a world that’s far removed from our western way of living.Our wonderfully designed tour encompasses the very best that Northern India has
The bolded strings are the hex numbers representing some characters (such as &#x2019; for ’). For a full list of these codes, follow this link: http://code.cside.com/3rdpage/us/unicode/converter.html
The error is at line 828 of HtmlEncoder, in the DecodeValue function. Change those lines to the following:
if (token[1] == '#')
{
    if (token[2] == 'x')
    {
        // Hexadecimal reference such as &#x2019;
        int v = Convert.ToInt32(token.ToString().Substring(3, token.Length - 4), 16);
        output.Append((char)v);
    }
    else
    {
        // Decimal reference such as &#8217;
        int v = int.Parse(token.ToString().Substring(2, token.Length - 3));
        output.Append((char)v);
    }
}
Good luck
Mesut Talebi
|
|
|
|
|
When we have a token like this:
"Local \n Business List"
the function
private string RemoveWhitespace(string input)
returns the output with two spaces in the middle:
"Local  Business List"
To solve this problem, change the function as follows:
private string RemoveWhitespace(string input)
{
    string output = input.Replace("\r", "");
    output = output.Replace("\n", "");
    output = output.Replace("\t", " ");
    // Collapse each double space into a single space.
    output = output.Replace("  ", " ");
    output = output.Trim();
    return output;
}
That is, just add the line output = output.Replace("  ", " "); before output = output.Trim();
The resulting value will be: "Local Business List"
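One way to collapse whitespace runs of any length in a single pass is a regular expression. The following is only a minimal sketch, not the library's code: the class name WhitespaceSketch and the method Collapse are made up for illustration. It deliberately matches only space, tab, CR and LF, so that a non-breaking space (U+00A0) is left alone.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the parser library.
public static class WhitespaceSketch
{
    // Replaces every run of spaces, tabs and line breaks with one space,
    // then removes plain spaces from both ends.
    public static string Collapse(string input)
    {
        string collapsed = Regex.Replace(input, "[ \\t\\r\\n]+", " ");
        return collapsed.Trim(' ');
    }
}
```

Note that trimming the ends of every text node also removes a space that separates the text from a neighbouring inline element, so a caller may want to skip the Trim step for text that sits directly next to a tag such as an anchor.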
|
|
|
|
|
First of all, thank you for this great program.
There is an error on some pages; the error message is the subject of this message.
For example, on this site:
http://travelshop.telegraph.co.uk/
Please take a look at this problem.
I also have some suggestions:
The program removes comments from the HTML content, but sometimes we need to analyze the comments. Also, the inner text of scripts and styles is treated as plain text; it would be better if it were added to the tree as script and style nodes.
Thank you
|
|
|
|
|
Hi,
Today I faced the same problem. Here are the steps to fix it:
1. Open the HtmlEncoder.cs file and go to line number 830.
2. Replace the line
int v = int.Parse(token.ToString().Substring(2, token.Length - 3));
with these lines:
int v;
if (token[2] == 'x')
    v = int.Parse(token.ToString().Substring(3, token.Length - 4), System.Globalization.NumberStyles.HexNumber);
else
    v = int.Parse(token.ToString().Substring(2, token.Length - 3));
That's it.
Wish you all the best !
|
|
|
|
|
Thank you for this great code,
but it needs to be improved. For example, if I have a node, I should be able to get the collection of children for that node and walk through the nested children without having to implement a recursive function.
Mesut Talebi
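A walk over nested children without recursion can be sketched with an explicit stack. The Node class below is a hypothetical stand-in, since the library's own node type and its child collection are not shown in this thread; only the traversal technique is the point.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for the library's node type.
public class Node
{
    public string Name;
    public List<Node> Children = new List<Node>();
    public Node(string name) { Name = name; }
}

public static class NodeWalker
{
    // Yields the root and every descendant in document order (pre-order),
    // using an explicit stack instead of recursive calls.
    public static IEnumerable<Node> Descendants(Node root)
    {
        var stack = new Stack<Node>();
        stack.Push(root);
        while (stack.Count > 0)
        {
            Node current = stack.Pop();
            yield return current;
            // Push children in reverse so that the first child is popped first.
            for (int i = current.Children.Count - 1; i >= 0; i--)
            {
                stack.Push(current.Children[i]);
            }
        }
    }
}
```

A caller can then write foreach (Node n in NodeWalker.Descendants(root)) and filter by name or attribute inside the loop.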
|
|
|
|
|
|
I have recently been using your code to parse Google results and it works perfectly!
Previously, and still sometimes now, I used an online HTML parser that also works very well. I think its code could be very similar to yours, but written in PHP.
|
|
|
|
|
Hello,
As the subject says, it seems that it does not remove whitespace from the HTML file. To be more precise, it removes only the whitespace between tags, not inside a tag!
Can you make it remove the whitespace inside tags as well?
thank you
|
|
|
|
|
|
Dear Sir,
Thank you very much for sharing this project. Can I get a description of the algorithm, or any idea of how it works?
Hope you will help me.
Thanking You
Hasibul
|
|
|
|
|
just wanted to say this is the most useful piece of code I have found on the web recently. and brilliantly written too!
|
|
|
|
|
This is real programming. You've made an awesome tool that anyone would be well served to add to their kit!
Congratulations, and thank you for sharing this!
"A strike line of programmers? That'd be reeeeal hard to break."
-my old, anti-union boss
|
|
|
|
|
In version 1.4 all XHTML tags are lowercased, but all attribute values are lowercased as well, including the HREF attribute of the A tag. This shouldn't be done, because URLs and query strings are case-sensitive, for instance YouTube URLs.
I changed this in HTMLattribute.cs, in the XHTML property, so that HREF attribute values are no longer lowercased.
Just to let you know.
|
|
|
|
|
I cannot select a node's attributes... Is this a missing feature, or did I just not see it?
Thanks for this great parser.
|
|
|
|
|
Since you're a recent user, maybe you could show me a simple example of how to make this work?
|
|
|
|
|
I have a document that starts with a comment and a doctype:
<!-- #BeginTemplate "/Templates/manual-scriptref-page.dwt" --><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<body>bla</body>
</html><!-- #EndTemplate -->
The parser breaks... renders 3 roots: the html tag as a text element, then the head and body.
After a few trials, turns out a doc starting with a comment gets parsed ok, but starting with a doctype doesn't.
... just wanted to let the author know, seems odd since the history mentions a doctype bugfix: is the zip up-to-date?
|
|
|
|
|
I parsed this HTML string:
<a target='_blank' class=team href=/data/league/team.php?t=610&s=413 class=team>Mariehamn</a>
and got an HTMLElement object with 8 attributes:
target="_blank"
class="team"
href=""
null
data=null
league=null
team.php?t="t=610&s=413"
class="team"
I think the problem comes from the href attribute.
|
|
|
|
|
It could be your poorly constructed HTML...
You don't wrap any of the values in double quotes, you have class specified twice...
|
|
|
|
|
hmmmm...i'm not too sure here but i don't think that mal-formed, sloppy, redundant, ill-written, newbie html code is their fault. any parser can handle 'good' code. a parser ought to account for code of this quality without throwing an error.
|
|
|
|
|
You are correct, this is clearly a bug. I found it in the GetTokens method within the HtmlParser.cs file. I have a fix if anyone needs it.
I changed the code starting at line # 711 to read as follows:
<code>
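// Scan to the end of an unquoted attribute value. '/' is deliberately not a
// terminator, because it occurs inside href paths.
while ((i < input.Length) && (input.Substring(i, 1).IndexOfAny(" \r\n\t>".ToCharArray()) == -1))
{
    i++;
}
int dataLength = (i - value_start);
// Drop a trailing '/' that belongs to a self-closing "/>".
// The bounds checks avoid an exception when the input ends inside the value.
if ((i < input.Length) && (dataLength > 0) &&
    input.Substring(i, 1).Equals(">") &&
    input.Substring(i - 1, 1).Equals("/"))
{
    dataLength--;
}
tokens.Add(input.Substring(value_start, dataLength));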
</code>
The original search list of " \r\n\t/>" produced a false end-of-data trigger on the forward slashes in the href path. I considered removing the \r\n too, because I didn't think it was valid to end the data on those characters, but I wasn't sure, so I left them in for now.
modified on Thursday, July 16, 2009 6:35 PM
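The same scanning rule can be tried in isolation. This is a self-contained sketch under assumptions, not the code from HtmlParser.cs: the class UnquotedValue and its Read method are made-up names, and the only rule taken from the post above is that whitespace or '>' ends an unquoted value while '/' does not.

```csharp
// Hypothetical helper that mirrors the scanning rule described above.
public static class UnquotedValue
{
    // Reads an unquoted attribute value that begins at 'start'.
    // 'next' receives the index of the character that ended the value
    // (or input.Length if the input ran out).
    public static string Read(string input, int start, out int next)
    {
        int i = start;
        while (i < input.Length && " \r\n\t>".IndexOf(input[i]) == -1)
        {
            i++;
        }
        int length = i - start;
        // A '/' directly before '>' closes a self-closing tag and is not part of the value.
        if (i < input.Length && input[i] == '>' && length > 0 && input[i - 1] == '/')
        {
            length--;
        }
        next = i;
        return input.Substring(start, length);
    }
}
```

One consequence of this rule is that a value which really ends in a slash immediately before '>' (for example href=/foo/>) loses that final slash; quoting the value avoids the ambiguity.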
|
|
|
|
|
Natural Cause wrote: You don't wrap any of the values in double quotes
Actually that's perfectly valid HTML. It's invalid in XHTML, and although I prefer quoting, it's still valid.
|
|
|
|
|
I have successfully used your library in a project to query books by ISBN and collect catalogue data on them from various sites. Your HTML parser was by far the best of the five or so parser libraries that I tried, but I still missed some features in the API. I made some changes to my copy of the source; perhaps you would be willing to consider them?
My changes were:
1- change abstract class HtmlEncoder from internal to public, so that I can decode any html text fragment myself.
2- bugfix in decoding the &#xNN; hexadecimal HTML escape in HtmlEncoder.cs (see message earlier today)
3- Introduce extended matching methods for attribute values:
public enum SearchMethod
{
    ExactMatch,      // default
    ValueBeginsWith, // uses .StartsWith to match beginning of attribute value
    ValueContains    // uses .IndexOf to match any part of a value
}
I made an extra overload to FindByAttributeNameValue that has a searchMethod parameter to incorporate this.
Usage example: Consider an amazon.com "product overview" for a book. Authors are contained in A elements where the href attribute contains the substring "&field-author=". Having the SearchMethod parameter allows me to directly find only the nodes that I need:
HtmlNodeCollection nc = htmlDoc.FindByAttributeNameValue("href", "&field-author=", true, SearchMethod.ValueContains);
4- added an extra method FindByNameAttributeNameValue to match both node name and an attribute name/value pair. The example above can be made more efficient by also specifying the node name a:
HtmlNodeCollection nc = htmlDoc.FindByNameAttributeNameValue("a", "href", "&field-author=", true, SearchMethod.ValueContains);
This will return the same collection, but significantly faster, because it no longer has to iterate through every attribute of each node in the HTML document, only through the small subset of a nodes.
Best regards,
Berend Engelbrecht
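The comparison behind such a searchMethod parameter could look roughly like the sketch below. It is not the code from the modified library: AttributeMatcher and Matches are hypothetical names, the enum is repeated only to keep the example self-contained, and the bool parameter is assumed to mean "ignore case", which the post above does not confirm.

```csharp
using System;

public enum SearchMethod
{
    ExactMatch,      // default
    ValueBeginsWith, // match the beginning of the attribute value
    ValueContains    // match any part of the attribute value
}

// Hypothetical helper showing one way to apply a SearchMethod.
public static class AttributeMatcher
{
    public static bool Matches(string attributeValue, string searchValue,
                               bool ignoreCase, SearchMethod method)
    {
        if (attributeValue == null || searchValue == null)
        {
            return false;
        }
        // Assumption: the bool selects case-insensitive comparison.
        StringComparison cmp = ignoreCase
            ? StringComparison.OrdinalIgnoreCase
            : StringComparison.Ordinal;
        switch (method)
        {
            case SearchMethod.ValueBeginsWith:
                return attributeValue.StartsWith(searchValue, cmp);
            case SearchMethod.ValueContains:
                return attributeValue.IndexOf(searchValue, cmp) >= 0;
            default:
                return string.Equals(attributeValue, searchValue, cmp);
        }
    }
}
```

Ordinal comparison is used here because attribute values such as URLs are not natural-language text, so culture-specific rules would only add surprises.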
|
|
|
|
|
I was using HtmlDocument.Create(...) against HTML returned from the MSN search site and kept getting a FormatException. I managed to trace it to this call:
int v = int.Parse( token.ToString().Substring(2,token.Length-3) );
at line 831 in HtmlEncoder. The expression token.ToString().Substring(2, token.Length-3) resulted in the value "xB7", because the page uses the hexadecimal character entity "&#xB7;" (·). I think some logic needs to be added to check for a hex entity as opposed to a decimal one.
Thanks,
Mike
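A check of that kind can be sketched as a small stand-alone decoder. This is an illustration under assumptions rather than the HtmlEncoder code: NumericEntity and TryDecode are hypothetical names, and the token is assumed to be the complete reference including "&#" and ";". It uses TryParse so that a malformed reference is reported instead of throwing a FormatException.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of the parser library.
public static class NumericEntity
{
    // Decodes one numeric character reference such as "&#183;" or "&#xB7;".
    // Returns false and leaves 'decoded' equal to the input when the token is malformed.
    public static bool TryDecode(string token, out string decoded)
    {
        decoded = token;
        if (token == null || token.Length < 4 ||
            !token.StartsWith("&#", StringComparison.Ordinal) ||
            !token.EndsWith(";", StringComparison.Ordinal))
        {
            return false;
        }

        bool isHex = token[2] == 'x' || token[2] == 'X';
        int start = isHex ? 3 : 2;
        string digits = token.Substring(start, token.Length - start - 1);

        int codePoint;
        bool ok = isHex
            ? int.TryParse(digits, NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture, out codePoint)
            : int.TryParse(digits, NumberStyles.None, CultureInfo.InvariantCulture, out codePoint);

        // Reject negative values, values beyond the Unicode range, and lone surrogates.
        if (!ok || codePoint < 0 || codePoint > 0x10FFFF ||
            (codePoint >= 0xD800 && codePoint <= 0xDFFF))
        {
            return false;
        }

        // ConvertFromUtf32 also handles code points above U+FFFF, which a (char) cast cannot.
        decoded = char.ConvertFromUtf32(codePoint);
        return true;
    }
}
```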
|
|
|
|
|
Since I had to parse a web site that used &#xA0; for non-breaking spaces everywhere, I took the liberty of fixing it in my copy. I would welcome it if my fix (or similar code) were included in the standard version:
if (token[1] == '#')
{
    try
    {
        if (token[2] == 'x')
        {
            int v = int.Parse(token.ToString().Substring(3).Split(';')[0], System.Globalization.NumberStyles.HexNumber);
            output.Append((char)v);
        }
        else
        {
            int v = int.Parse(token.ToString().Substring(2, token.Length - 3));
            output.Append((char)v);
        }
    }
    catch (Exception ex)
    {
        Trace.Write(ex);
    }
}
|
|
|
|