Hi all,
I'm implementing a WinForms app that does HTML parsing.
In my app, the user enters a URL (such as www.amazon.com) and the app shows the page in a WebBrowser control.
I want to let users select an area on that page and have a label control show all the text in the selected area. How can I do that?
In other words: how can I determine which HTML tags (in that page) enclose the selected text?
For example, given this HTML:
<table id="1"><tr><td>selected text</td></tr></table>
none selected text
When I drag the mouse to enclose "selected text", I want to determine that the table with id=1 is selected, and "selected text" should be shown in a label control.
Please share your ideas.
Thanks in advance.
mns
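One way to approach the selection question above, sketched in C# (an assumption-laden sketch, not a tested solution: it presumes the page is hosted in a WinForms WebBrowser control named `webBrowser1` with a label named `label1`, and it uses the MSHTML DOM via the Microsoft.mshtml interop assembly):

```csharp
using mshtml; // add a reference to the Microsoft.mshtml interop assembly

// After the user makes a selection in webBrowser1:
IHTMLDocument2 doc = (IHTMLDocument2)webBrowser1.Document.DomDocument;
IHTMLTxtRange range = (IHTMLTxtRange)doc.selection.createRange();

// parentElement() returns the innermost element enclosing the selection.
IHTMLElement parent = range.parentElement();
label1.Text = range.text;               // the selected text
string enclosingTag = parent.tagName;   // e.g. "TD"
string enclosingId = parent.id;         // e.g. "1"
```

If the immediate parent is a TD, walking `parent.parentElement` upward will reach the enclosing TABLE and its id.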
---
Can anyone solve my problem?
I have developed a web application where I parse the contents of a web page using the MIL HTML parser.
I now have the document in HTML form, and I need to use the parser's types, such as:
HtmlDocument
HtmlElement
HtmlNode
HtmlAttribute
I am really new to the .NET environment, and I need to find the tags of interest.
I need to separate out the input tags first and then read their attributes, such as type="submit", type="hidden", name="", etc.
Has anybody done this before, or can anybody give me an idea of how to write a recursive function that separates the input tags from the document?
Please help; I am running short of time.
Thanks,
Rama
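A minimal recursive sketch against the MIL parser, assuming (based on the samples in this thread) that `HtmlElement` exposes `Name`, `Attributes`, and child `Nodes`; the exact member names may differ in your version:

```csharp
using System.Collections.Generic;
using MIL.Html;

static class InputFinder
{
    // Recursively collect every <input> element in the parsed tree.
    public static void FindInputs(HtmlNodeCollection nodes, List<HtmlElement> results)
    {
        foreach (HtmlNode node in nodes)
        {
            HtmlElement element = node as HtmlElement;
            if (element == null)
                continue; // text nodes have no tag name or children

            if (element.Name.ToLower() == "input")
                results.Add(element);

            if (element.Nodes != null)
                FindInputs(element.Nodes, results); // recurse into children
        }
    }
}
```

Once you have the list, each element's `Attributes` collection should give you `type`, `name`, and so on.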
---
Maybe the following program, using the DOL HTML Parser (http://www.codeproject.com/useritems/DOL_HTML_Parser.asp), can meet your requirement.
Good luck.
// Open HTML file "xxx.htm"
DHtmlGeneralParser parser = new DHtmlGeneralParser();
DHtmlDocument htmlDoc = new DHtmlDocument(parser);
htmlDoc.Load(@"..\xxx.htm");
DHtmlNodeCollection result = new DHtmlNodeCollection();
// Find all tags matching this pattern in the whole HTML document.
// Signature: void FindByNameAttribute
// (
//     DHtmlNodeCollection result, // a collection to collect the results
//     string name,                // tag name you want to find
//     string attributeName,       // attribute name you want to find
//     bool searchChildren         // whether to search children recursively
// )
htmlDoc.Nodes.FindByNameAttribute(result, "input", "type", true);
---
The MIL HTML Parser is a useful library for me, but the project is no longer maintained.
I created a project, the DOL HTML Parser, based on the MIL HTML Parser on CodeProject, and I hope it can help everyone.
It is a parser for non-well-formed HTML with a CSS resolver.
URL: http://www.codeproject.com/useritems/DOL_HTML_Parser.asp
---
Great code! For the most part it works, but more often than not it does not pick up an IMG node I am looking for. I have tweaked the source HTML docs a bit and can usually get it to work, but I haven't nailed down the cause.
Are there any requirements or restrictions on the source HTML/XHTML?
thanks!
---
I just thought of asking: why use a String instead of a Stream?
---
I don't understand why you find this so hard; it's easy!
Try this:
Dim mDocument As MIL.Html.HtmlDocument
Dim html As String = "Your HTML goes here, instead of a stream-reading result"
mDocument = HtmlDocument.Create(html, False)
Then just do whatever you want with mDocument; it only takes a bit of fiddling with the demo project.
Though I find it a rather odd question from a Microsoft Partner, since the streams you read end up as Strings anyway...
So instead of:
Dim html As String = Stream.ReadToEnd
you just replace Stream.ReadToEnd with your string.
Or am I misunderstanding the question here?
You mess with the best, you die like the rest... well... kinda???
---
Read streams are *not* strings, and the ReadToEnd method is not the best way to use streams, especially if you are parsing.
Strings are immutable, meaning that every time I do an operation on one, a new string is created. So if I do 25 operations on a 60 KB string, that allocates roughly 1.5 MB, and it's all garbage waiting to be GCed.
In my app I process about a megabyte of HTML a second (it's a scraper), so a string-based solution simply would not scale.
Streams scale: if you do a Read with a 25-byte buffer, you only allocate 25 bytes.
Yes, building a stream-based parser is harder, but I've already found one: the HTML Agility Pack, which is quite fast and stream-based.
Thanks anyway...
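The chunked-read pattern described above looks roughly like this (a sketch; the file name, buffer size, and per-chunk processing are placeholders):

```csharp
using System.IO;
using System.Text;

char[] buffer = new char[4096]; // one fixed allocation, reused for every chunk
using (StreamReader reader = new StreamReader("page.html", Encoding.UTF8))
{
    int read;
    while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        // Feed buffer[0..read) to the tokenizer here, instead of
        // appending it to an ever-growing string.
    }
}
```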
---
I strongly suggest moving your project somewhere with more exposure. I had to google hard to find it, and at this point it's much more useful than Microsoft's wrapper of the IE control. Such libraries matter precisely because the cost of keeping them up to date for a small project or for research purposes (my case) is prohibitive.
I suggest GotDotNet Workspaces or the shiny new CodePlex.
By the way, great work.
---
I think it is very good, but it can't fix some HTML errors the way IE does. For example, I have some HTML here:
Hi
Hello
Does it fix this error the way IE does, so that the text "Hello" becomes a subnode of the enclosing element?
Has anyone fixed this error?
Please help me.
---
Spaces between words get removed if a formatting tag sits between the words. For example:
Dear <b>Ally</b>
gets displayed as:
DearAlly
---
HtmlDocument.Create() has an overload with a "wantSpaces" parameter.
Set it to true and your spaces will be preserved.
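If that overload matches the `Create(html, False)` call shown elsewhere in this thread, the fix would just be (the parameter position is an assumption):

```csharp
// Pass wantSpaces = true so whitespace around inline tags is preserved.
MIL.Html.HtmlDocument doc = MIL.Html.HtmlDocument.Create(html, true);
```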
---
I found that some CJK (Chinese, Japanese, Korean) characters get eaten.
It took me a long time to find the reason.
Finally I found that it is simply because of line 315:
Dim sr As System.IO.StreamReader = System.IO.File.OpenText(OpenHtmlFileDialog.FileName)
It works correctly when changed to:
Dim sr As System.IO.StreamReader = New System.IO.StreamReader(OpenHtmlFileDialog.FileName, System.Text.Encoding.Default)
Thanks a lot for your work.
It has been a great help to me.
---
Hello!
I have found a minor error in the stripping of <!...> sections. If there is something meaningful immediately after such a section, its first two characters are removed.
The error lies in the file HtmlParser.cs, in the RemoveSGMLComments function. The repaired version of the while loop (at the beginning of that function) is at the end of this message.
Greetings,
jmas8109
Source code section:
--------------------
while( i < input.Length )
{
    if( i + 2 < input.Length && input.Substring( i , 2 ).Equals( "<!" ) )
    {
        i = input.IndexOf( ">" , i );
        if( i == -1 )
        {
            break;
        }
        i += 1; // originally there was i += 3 (which is a bug)
    }
    // ...
}
---
This has made my day. There are 5 billion articles and samples about translating XML->HTML, etc. but no one wants to touch HTML->something freaking usable.
Thanks!
---
This is a good article. I give it 5 stars.
We can't stop asking "WHY!!"
---
The FilterIndex returned by the common file dialog is 1-based, yielding indices 1 and 2, but the code using the FilterIndex checks for 0 or 1. The result is that XHTML is always exported.
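A sketch of the off-by-one (the filter string and the export helpers are illustrative; `FilterIndex` in Windows Forms really is 1-based):

```csharp
using System.Windows.Forms;

SaveFileDialog dialog = new SaveFileDialog();
dialog.Filter = "HTML (*.html)|*.html|XHTML (*.xhtml)|*.xhtml";

if (dialog.ShowDialog() == DialogResult.OK)
{
    // Buggy: comparing against 0 is never true, so the HTML branch is dead.
    // FilterIndex is 1 for the first filter, 2 for the second.
    if (dialog.FilterIndex == 1)
        ExportHtml();   // hypothetical export helpers
    else
        ExportXhtml();
}
```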
---
I found some code that confused me.
I'm not sure whether this code is correct or incorrect.
In HtmlParser.cs:
Code line 272:
original:   if( i + 4 < input.Length && input.Substring( i , 4 ).Equals( "<!--" ) )
suggestion: if( i + 3 < input.Length && input.Substring( i , 4 ).Equals( "<!--" ) )
Code line 344:
original:   if( i + 2 < input.Length && input.Substring( i , 2 ).Equals( "<!" ) )
suggestion: if( i + 1 < input.Length && input.Substring( i , 2 ).Equals( "<!" ) )
Code line 352:
original:   i += 3;
suggestion: i += 1;
Code line 542:
original:   if( i+2 < input.Length && input.Substring( i , 2 ).Equals( "<" ) )
suggestion: if( i+1 < input.Length && input.Substring( i , 2 ).Equals( "</" ) )
Code line 588:
original:   if( i+1<input.Length && input.Substring( i , 1 ).Equals( "/>" ) )
suggestion: if( i+1<input.Length && input.Substring( i , 2 ).Equals( "/>" ) )
Code line 711:
In some cases the value of an attribute can be "images/about_logo.gif", so we can't use "/" to decide the end of the value string. (This case can be found at http://www.google.com/intl/en/about.html.)
original:   while( i<input.Length && input.Substring( i , 1 ).IndexOfAny( " \r\n\t/>".ToCharArray() ) == -1 )
suggestion: while( i < input.Length && input.Substring( i , 1 ).IndexOfAny( " \r\n\t>".ToCharArray() ) == -1 && !( i + 1 < input.Length && input.Substring( i , 2 ).Equals( "/>" ) ) ) ++i;
I hope this information can help the project. =^_^=
---
I just downloaded the code and noticed that when parsing '', the parser thinks the node has two attributes: one attribute named 'width' with '10' as its value, and another whose name is blank and whose value is null.
Vincent
---
If you have a page with the following meta tag for special characters:
<meta http-equiv=Content-Type content="text/html; charset=ISO-8859-1">
and the page later uses them, e.g. “like these slanted quotes”, the parser removes the slanted-quote characters completely.
I would be happy to fix this, but I am at a loss as to where to look; I don't see an obvious place in the parser code for it.
--S
---
My bad: the problem was reading the file, not the parser.
To read an HTML/ASP file containing special characters where the file was not created as a Unicode/UTF-8 file (e.g. the file was created in Visual InterDev), use the Encoding.Default parameter when creating the StreamReader, e.g.:
FileStream fs = File.OpenRead(origFname);
StreamReader sro = new StreamReader(fs,
    System.Text.Encoding.Default);
textBox1.Text = sro.ReadToEnd();
sro.Close();
Thanks to Andy for helping me track this down.
--S
---
Very nice library. One minor issue: it doesn't seem to handle server-side includes or server script blocks correctly, e.g.
<!-- #INCLUDE FILE="..\Includes\stdvars.asp" -->
<%
Dim x, y
'blah
if (x > y) blah...
%>
<html>
...etc.
becomes
<% Dim x, y 'blah if (x />
followed by a text node; the SSI is dropped completely.
Any chance of this being fixed soon, or should I attempt code surgery?
thanks!
--S
---
Thanks to Andy for a link to the current version, which handles #INCLUDEs already, and for instructions on how to modify the code for server-side script block handling.
There is an immediate work-around with the current version: replace "<%" and "%>" with "<!-- %" and "% -->" before parsing, and the parser pulls out the server-side script as a comment block, which is good enough for my purposes.
I am taking the liberty of posting a link to the current code: http://powney.demon.co.uk/milhtml.html
--S
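The work-around described above amounts to two string replacements before parsing (a sketch; the file name is a placeholder and the `Create` call mirrors the samples elsewhere in this thread):

```csharp
// Wrap ASP script blocks in SGML comments so the parser keeps them
// as comment nodes instead of mangling them.
string source = System.IO.File.ReadAllText("page.asp");
source = source.Replace("<%", "<!-- %").Replace("%>", "% -->");
MIL.Html.HtmlDocument doc = MIL.Html.HtmlDocument.Create(source, false);
```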
---
Try parsing the HTML at www.dn.se: your parser fails to correctly identify the <html> tag that follows directly after the <!DOCTYPE> tag. I get "html>" as an HtmlText node.
---
There's a bug that results in a parse error on some <a href=...> tags. In fact it causes a lot of other things to fail as well, but I discovered it by following a link from Google.
Google's main page contains this link:
<a href=/advanced_search?hl=en>Advanced Search</a>
The parser gets this totally wrong, ending up thinking there are three attributes:
attribute name "href" maps to value "";
attribute name "" maps to value "advanced_search?hl";
attribute name "hl" maps to value "en".
(Yes, that middle attribute name really is an empty string.)
Based on my cursory examination of the code, I fear this might be hard to fix. I think the root problem is that the code assumes it can tokenize HTML independently of the parsing phase. This example (from one of the world's most popular web sites) shows, I believe, that attribute values must be tokenized differently from other things, which means the tokenization is context-sensitive. I hope I am wrong.
I wish you the best of success,
Marshall
PS: I have some ideas for how you might repair this. I'd be willing to spend a few minutes corresponding with you off-line if you are interested.
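One way to see the context sensitivity described above: once the scanner has consumed `name=`, everything up to the next whitespace or `>` belongs to the value, `/` and `=` included. A self-contained sketch of such a scanner (my own illustration, not the article's code; quoted values are omitted for brevity):

```csharp
using System;
using System.Collections.Generic;

static class AttributeScanner
{
    // Parse the attribute portion of a tag, e.g. "href=/advanced_search?hl=en".
    // Unquoted values run until whitespace or '>', so '/' and '=' stay inside.
    public static Dictionary<string, string> Parse(string s)
    {
        var attrs = new Dictionary<string, string>();
        int i = 0;
        while (i < s.Length)
        {
            while (i < s.Length && char.IsWhiteSpace(s[i])) i++;
            int nameStart = i;
            while (i < s.Length && s[i] != '=' && !char.IsWhiteSpace(s[i])) i++;
            if (i == nameStart) break;
            string name = s.Substring(nameStart, i - nameStart);
            string value = "";
            if (i < s.Length && s[i] == '=')
            {
                i++; // consume '='
                int valueStart = i;
                while (i < s.Length && !char.IsWhiteSpace(s[i]) && s[i] != '>') i++;
                value = s.Substring(valueStart, i - valueStart);
            }
            attrs[name] = value;
        }
        return attrs;
    }
}

// AttributeScanner.Parse("href=/advanced_search?hl=en") yields exactly one
// attribute: href -> /advanced_search?hl=en
```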