|
Hello,
As the subject says, it seems that it does not remove whitespace from the HTML file; or, to be more precise, it removes only the whitespace between tags but not inside a tag!
Can you make it also remove the whitespace inside tags?
Thank you
|
|
|
|
|
|
Dear Sir,
Thank you very much for sharing this kind of project. Can I get a description of the algorithm, or any idea of how it works?
I hope you will help me.
Thanking you,
Hasibul
|
|
|
|
|
Just wanted to say this is the most useful piece of code I have found on the web recently, and brilliantly written too!
|
|
|
|
|
This is real programming. You've made an awesome tool that anyone would be well served to add to their kit!
Congratulations, and thank you for sharing this!
"A strike line of programmers? That'd be reeeeal hard to break."
-my old, anti-union boss
|
|
|
|
|
In version 1.4 all XHTML tags are lowercased; however, all attribute values are lowercased as well, including the HREF attribute of the A tag. This shouldn't be done, because URLs and query strings are case-sensitive (YouTube URLs, for instance).
I changed this in HTMLattribute.cs, in the XHTML property, so that HREF attribute values aren't lowercased.
Just to let you know.
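For anyone wanting to try the same change, here is a minimal sketch of the idea. It is not the actual code from HTMLattribute.cs; the `AttributeFormatter` class and its `ToXhtml` method are hypothetical names made up for illustration. It lowercases only the attribute name and leaves the value as written.

```csharp
using System;

public static class AttributeFormatter
{
    // Hypothetical helper (not part of the parser): lowercases the attribute
    // name for XHTML output but keeps the value exactly as written, because
    // URLs and query strings can be case-sensitive.
    public static string ToXhtml(string name, string value)
    {
        string safeName = name.ToLowerInvariant();
        // Only the double quote is escaped, so the value cannot end the attribute early.
        string safeValue = (value ?? string.Empty).Replace("\"", "&quot;");
        return safeName + "=\"" + safeValue + "\"";
    }
}
```

With this, `AttributeFormatter.ToXhtml("HREF", "http://example.com/Watch?v=AbC")` keeps the mixed-case path and query string intact.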
|
|
|
|
|
I cannot select a node's attributes... Is this a missing feature, or did I just not see it?
thanks for this great parser.
|
|
|
|
|
Since you're a recent user, maybe you could show me a simple example of how to make this work?
|
|
|
|
|
I have a document that starts with a comment and a doctype:
<!-- #BeginTemplate "/Templates/manual-scriptref-page.dwt" --><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<body>bla</body>
</html><!-- #EndTemplate -->
The parser breaks... it renders 3 roots: the html tag as a text element, then the head and body.
After a few trials, it turns out that a doc starting with a comment gets parsed OK, but one starting with a doctype doesn't.
... just wanted to let the author know. It seems odd, since the history mentions a doctype bugfix: is the zip up to date?
|
|
|
|
|
I parse this HTML string
<a target='_blank' class=team href=/data/league/team.php?t=610&s=413 class=team>Mariehamn</a>
And got an HTMLElement object with 8 attributes
target="_blank"
class="team"
href=""
null
data=null
league=null
team.php?t="t=610&s=413"
class="team"
I think the problem comes from the href attribute.
|
|
|
|
|
It could be your poorly constructed HTML...
You don't wrap any of the values in double quotes, you have class specified twice...
|
|
|
|
|
hmmmm...i'm not too sure here but i don't think that mal-formed, sloppy, redundant, ill-written, newbie html code is their fault. any parser can handle 'good' code. a parser ought to account for code of this quality without throwing an error.
|
|
|
|
|
You are correct, this is clearly a bug. I found it in the GetTokens method within the HtmlParser.cs file. I have a fix if anyone needs it.
I changed the code starting at line # 711 to read as follows:
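<code>
// Scan to the end of the unquoted value; a '/' no longer ends it.
while ((i < input.Length) && (input.Substring(i, 1).IndexOfAny(" \r\n\t>".ToCharArray()) == -1))
{
    i++;
}
int dataLength = (i - value_start);
// Drop a trailing '/' that belongs to a self-closing "/>".
// The bounds checks avoid an exception when the input ends without a '>'.
if ((i < input.Length) && (dataLength > 0) &&
    input.Substring(i, 1).Equals(">") &&
    input.Substring(i - 1, 1).Equals("/"))
{
    dataLength--;
}
tokens.Add(input.Substring(value_start, dataLength));
</code>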
The original search list of " \r\n\t/>" was getting a false end-of-data trigger on the forward slashes in the href path. I considered removing the \r\n too, because I didn't think it was valid to trigger the end of data on those characters, but I wasn't sure, so I left them in for now.
modified on Thursday, July 16, 2009 6:35 PM
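For readers without the source at hand, the same idea can be sketched as a standalone helper. This is only an illustration under my own assumptions, not the GetTokens code itself; `UnquotedValueReader` and `Read` are hypothetical names.

```csharp
using System;

public static class UnquotedValueReader
{
    // Hypothetical helper: returns the unquoted attribute value that begins at
    // 'start' and advances 'start' to the first character after the value.
    public static string Read(string input, ref int start)
    {
        int valueStart = start;
        int i = start;
        // A '/' is part of the value (for example in a path);
        // only whitespace or '>' ends it.
        while (i < input.Length && " \r\n\t>".IndexOf(input[i]) == -1)
        {
            i++;
        }
        int length = i - valueStart;
        // A '/' directly before '>' is treated as the end of a self-closing tag.
        if (i < input.Length && length > 0 && input[i] == '>' && input[i - 1] == '/')
        {
            length--;
        }
        start = valueStart + length;
        return input.Substring(valueStart, length);
    }
}
```

Called with the position just after `href=` in the anchor tag from the original report, it returns the whole path and query string as one value instead of splitting on the slashes.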
|
|
|
|
|
Natural Cause wrote: You don't wrap any of the values in double quotes
Actually that's perfectly valid HTML. It's invalid in XHTML, and although I prefer quoting, it's still valid.
|
|
|
|
|
I have successfully used your library in a project to query books by ISBN and collect catalogue data on them from various sites. Your HTML parser was by far the best of the five or so parser libraries that I tried, but I still missed some features in the API. I made some changes to my copy of the source; perhaps you would be willing to consider them?
My changes were:
1- change abstract class HtmlEncoder from internal to public, so that I can decode any html text fragment myself.
2- bugfix in decoding the &#xNN; hexadecimal HTML escape in HtmlEncoder.cs (see message earlier today)
3- Introduce extended matching methods for attribute values:
public enum SearchMethod
{
ExactMatch, // default
ValueBeginsWith, // uses .StartsWith to match beginning of attribute value
ValueContains // uses .IndexOf to match any part of a value
}
I made an extra overload to FindByAttributeNameValue that has a searchMethod parameter to incorporate this.
Usage example: Consider an amazon.com "product overview" for a book. Authors are contained in A elements where the href attribute contains the substring "&field-author=". Having the SearchMethod parameter allows me to directly find only the nodes that I need:
HtmlNodeCollection nc = htmlDoc.FindByAttributeNameValue("href", "&field-author=", true, SearchMethod.ValueContains);
4- added an extra method FindByNameAttributeNameValue to match both node name and an attribute name/value pair. The example above can be made more efficient by also specifying the node name a:
HtmlNodeCollection nc = htmlDoc.FindByNameAttributeNameValue("a", "href", "&field-author=", true, SearchMethod.ValueContains);
This will return the same collection, but significantly faster, because it no longer has to iterate through every attribute of each node in the HTML document, only through the small subset of a nodes.
Best regards,
Berend Engelbrecht
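The comparison behind the three search methods could be sketched roughly as follows. This is an illustration only, not the code from the modified library: `AttributeMatcher` and `Matches` are hypothetical names, and the case-sensitive (ordinal) comparison is an assumption.

```csharp
using System;

public enum SearchMethod
{
    ExactMatch,       // default
    ValueBeginsWith,  // matches the beginning of the attribute value
    ValueContains     // matches any part of the attribute value
}

public static class AttributeMatcher
{
    // Hypothetical helper: decides whether an attribute value satisfies the
    // search value under the given method. Case-sensitive by assumption.
    public static bool Matches(string attributeValue, string searchValue, SearchMethod method)
    {
        if (attributeValue == null || searchValue == null)
        {
            return false;
        }
        switch (method)
        {
            case SearchMethod.ValueBeginsWith:
                return attributeValue.StartsWith(searchValue, StringComparison.Ordinal);
            case SearchMethod.ValueContains:
                return attributeValue.IndexOf(searchValue, StringComparison.Ordinal) >= 0;
            default:
                return string.Equals(attributeValue, searchValue, StringComparison.Ordinal);
        }
    }
}
```

A search routine would call `Matches` once per candidate attribute, so adding a further method later only means adding one more case.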
|
|
|
|
|
I was using HtmlDocument.Create(...) against HTML returned from the MSN search site and kept getting a FormatException. I managed to trace it to this call:
int v = int.Parse( token.ToString().Substring(2,token.Length-3) );
This is line 831 in HtmlEncoder. The token.ToString().Substring(2, token.Length-3) call resulted in the value "xB7", because the page uses a hexadecimal character entity, "&#xB7;" (·). I think some logic needs to be added to check for hex entities as opposed to decimal ones.
Thanks,
Mike
|
|
|
|
|
Since I had to parse a web site that used &#xA0; for non-breaking spaces everywhere, I took the liberty of fixing it in my copy. I would welcome having my fix (or similar code) included in the standard version:
if (token[1] == '#')
{
    try
    {
        if (token[2] == 'x')
        {
            int v = int.Parse(token.ToString().Substring(3).Split(';')[0], System.Globalization.NumberStyles.HexNumber);
            output.Append((char)v);
        }
        else
        {
            int v = int.Parse(token.ToString().Substring(2, token.Length - 3));
            output.Append((char)v);
        }
    }
    catch (Exception ex)
    {
        Trace.Write(ex);
    }
}
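A variant that avoids the try/catch could look something like the sketch below. It is a standalone illustration and not the HtmlEncoder code: `NumericEntity` and `TryDecode` are hypothetical names, and it assumes the token holds exactly one complete reference such as "&#183;" or "&#xB7;".

```csharp
using System;
using System.Globalization;

public static class NumericEntity
{
    // Hypothetical helper: decodes one numeric character reference
    // (decimal "&#183;" or hexadecimal "&#xB7;").
    // Returns false and an empty string when the token is not a valid one.
    public static bool TryDecode(string token, out string decoded)
    {
        decoded = string.Empty;
        if (token == null || token.Length < 4 ||
            !token.StartsWith("&#", StringComparison.Ordinal) ||
            !token.EndsWith(";", StringComparison.Ordinal))
        {
            return false;
        }

        // Everything between "&#" and ";".
        string digits = token.Substring(2, token.Length - 3);
        bool isHex = digits[0] == 'x' || digits[0] == 'X';
        if (isHex)
        {
            digits = digits.Substring(1);
        }

        NumberStyles style = isHex ? NumberStyles.AllowHexSpecifier : NumberStyles.None;
        int codePoint;
        if (!int.TryParse(digits, style, CultureInfo.InvariantCulture, out codePoint))
        {
            return false;
        }
        // Reject values that are not valid Unicode scalar values.
        if (codePoint <= 0 || codePoint > 0x10FFFF ||
            (codePoint >= 0xD800 && codePoint <= 0xDFFF))
        {
            return false;
        }

        decoded = char.ConvertFromUtf32(codePoint);
        return true;
    }
}
```

Using `char.ConvertFromUtf32` rather than a `(char)` cast means code points above U+FFFF come out as a proper surrogate pair.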
|
|
|
|
|
Hi all
I'm implementing a WinForms app around an HTML parser.
In my app, the user enters a URL (such as www.amazon.com) and the app shows the requested page in a web browser control.
I want to let users select an area on that page and have a label control show all the text in the selected area. How can I do that?
In other words: how can I determine which HTML tags (in that page) enclose the selected text?
EX:
HTML:
Page:
selected text
none selected text
When I drag the mouse to enclose "selected text", I want to determine that the table with id=1 is selected, and "selected text" should be shown in a label control.
Please share your ideas.
Thanks in advance.
mns
|
|
|
|
|
Can anyone solve my problem?
I have developed a web application in which I parse the contents of a web page using
the MIL HTML parser.
I now have the document in HTML format.
I need to use the parser's classes, like
htmldocument
htmlelement
htmlnode
htmlattributes
I am really new to this .NET environment, and now I need to know
how to find the tags with
I need to separate the input tags first and then find their attributes, like type="submit,hidden", name="", etc.
Has anybody done this before, or can anybody give me an idea of how to write the recursive function to separate the input tags from the document?
Please help, I am running short of time.
Thanks,
Rama
|
|
|
|
|
Maybe the following program, written for DOL HTML Parser (http://www.codeproject.com/useritems/DOL_HTML_Parser.asp), can meet your requirement.
Good Luck
// Open HTML file "xxx.htm"
DHtmlGeneralParser parser = new DHtmlGeneralParser();
DHtmlDocument htmlDoc = new DHtmlDocument(parser);
htmlDoc.Load(@"..\xxx.htm");
DHtmlNodeCollection result = new DHtmlNodeCollection();
// Find all tags matching this pattern in the whole HTML document
// function: void FindByNameAttribute
// (
//     DHtmlNodeCollection result, // a collection to collect the result
//     string name,                // tag name which you want to find
//     string attributeName,       // attribute name which you want to find
//     bool searchChildren         // whether to search child nodes recursively
// )
htmlDoc.Nodes.FindByNameAttribute(result, "input", "type", true);
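As for the recursive function that was asked about, the general shape is a depth-first walk over the node tree. The sketch below is only an illustration: `SimpleNode` and `NodeSearch` are hypothetical stand-ins made up here and are not types from MIL HTML Parser or DOL HTML Parser.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a parsed element; not a type from either library.
public class SimpleNode
{
    public string Name;
    public Dictionary<string, string> Attributes =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
    public List<SimpleNode> Children = new List<SimpleNode>();

    public SimpleNode(string name)
    {
        Name = name;
    }
}

public static class NodeSearch
{
    // Depth-first walk: collects every node with the given tag name that
    // also carries the given attribute, then recurses into its children.
    public static void FindByNameAttribute(SimpleNode node, string name,
        string attributeName, List<SimpleNode> result)
    {
        if (string.Equals(node.Name, name, StringComparison.OrdinalIgnoreCase) &&
            node.Attributes.ContainsKey(attributeName))
        {
            result.Add(node);
        }
        foreach (SimpleNode child in node.Children)
        {
            FindByNameAttribute(child, name, attributeName, result);
        }
    }
}
```

Once the input nodes are collected, reading `node.Attributes["type"]` on each one gives values such as "submit" or "hidden".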
|
|
|
|
|
The MIL HTML Parser is a useful library for me, but the project is no longer maintained.
I created a project, "DOLS HTML Parser", based on MIL HTML Parser on CodeProject, and I hope it can help everyone.
It is a parser for non-well-formed HTML plus a CSS resolver.
The URL: http://www.codeproject.com/useritems/DOL_HTML_Parser.asp
|
|
|
|
|
Great code! For the most part it works, but more often than not it does not pick up on an IMG node I am looking for. I have tweaked the source HTML docs a bit and usually get it to work, but haven't nailed down the cause.
Is there a requirement or restrictions for the source HTML/ XHTML?
thanks!
|
|
|
|
|
I just thought of asking, why use a String instead of a Stream?
|
|
|
|
|
I don't understand why you find that so hard, when it's so easy!
Try this:
Dim mDocument As MIL.Html.HtmlDocument
Dim html As String = "Your HTML thingies here" ' instead of a StreamReader result
mDocument = HtmlDocument.Create(html, False)
Then just do whatever you want with mDocument; it only takes a bit of fiddling with the demo project...
Though I find it a rather stupid question for a Microsoft Partner, since what you read from a stream is actually a String...
So instead of:
Dim html As String = Stream.ReadToEnd
you just replace Stream.ReadToEnd with your string...
Or am I misunderstanding the question here?
You mess with the best, you die like the rest... well... kinda???
|
|
|
|
|
Streams you read from are *not* strings. The ReadToEnd method is not the best way to use streams, especially if you are parsing...
Strings are immutable, meaning that every time I do an operation on them, a new one is created. So, if I am doing 25 operations on a 60 KB string, that'll allocate some 1.5 MB! And that's just garbage, waiting to be GCed...
So, in my app, I process about a MB of HTML a second (since it's a scraper); a string-based solution would just not scale...
Streams scale: if you do a Read with a 25-byte buffer, you only allocate 25 bytes, so it scales...
Yes, building a stream-based parser is harder, but I've already found one: HTMLAgilityPack, which is quite fast, and stream-based...
Thanks anyway...
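To make the buffer point concrete, here is a minimal sketch of scanning input through a small reusable buffer instead of one big string. It is only an illustration and not how HTMLAgilityPack or the MIL parser works internally; `ChunkedScanner` and `CountTagOpeners` are hypothetical names.

```csharp
using System;
using System.IO;

public static class ChunkedScanner
{
    // Counts '<' characters by reading through a small reusable buffer,
    // so memory use stays fixed no matter how large the document is.
    public static int CountTagOpeners(TextReader reader, int bufferSize)
    {
        char[] buffer = new char[bufferSize];
        int count = 0;
        int read;
        // Read returns 0 once the end of the input has been reached.
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < read; i++)
            {
                if (buffer[i] == '<')
                {
                    count++;
                }
            }
        }
        return count;
    }
}
```

The same loop works with a `StreamReader` over a network or file stream, since both derive from `TextReader`.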
|
|
|
|