Hello,
I'm trying to remove whitespace text and comments from an HTML file, so I wrote this code, but it fails with an error.
Any idea how to fix it?
Thank you
public void RemoveWhAndComments(DHtmlNode node)
{
    DHtmlText text = node as DOL.DHtml.DHtmlParser.Node.DHtmlText;
    if (text != null)
    {
        if (text.IsWhiteSpace)
        {
            node.Parent.Nodes.RemoveAt(text.NodeID);
        }
        return;
    }
    DHtmlElement element = node as DOL.DHtml.DHtmlParser.Node.DHtmlElement;
    if (element != null)
    {
        for (int i = 0; i < element.Nodes.Count; i++)
        {
            RemoveWhAndComments(element.Nodes[i]);
        }
        return;
    }
}
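One likely cause, sketched here without access to the library source: the loop removes children from a collection while iterating forward over it, and it passes NodeID to RemoveAt, which may not be the child's index within its parent. The sketch below shows the usual fix, iterating backwards and removing by loop index. The types Node, TextNode, CommentNode and ElementNode are hypothetical stand-ins, not the library's classes.

```csharp
using System.Collections.Generic;

// Hypothetical stand-ins for the library's node types; names are illustrative only.
public abstract class Node { }

public sealed class TextNode : Node
{
    public string Text;
    public TextNode(string text) { Text = text; }
    public bool IsWhiteSpace { get { return string.IsNullOrWhiteSpace(Text); } }
}

public sealed class CommentNode : Node { }

public sealed class ElementNode : Node
{
    public readonly List<Node> Nodes = new List<Node>();
}

public static class TreeCleaner
{
    // Removes whitespace-only text nodes and comments below 'element'.
    public static void RemoveWhitespaceAndComments(ElementNode element)
    {
        // Walk backwards so RemoveAt(i) never shifts a child that has not been visited yet.
        for (int i = element.Nodes.Count - 1; i >= 0; i--)
        {
            Node child = element.Nodes[i];
            TextNode text = child as TextNode;
            if (child is CommentNode || (text != null && text.IsWhiteSpace))
            {
                element.Nodes.RemoveAt(i);
                continue;
            }
            ElementNode childElement = child as ElementNode;
            if (childElement != null)
            {
                RemoveWhitespaceAndComments(childElement);
            }
        }
    }
}
```

The parent does the removing here, so the index is always the one the loop is currently at and no separate node identifier is needed.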

Hello James,
you originally licensed this work under the GPL. Later you explicitly permitted its use in commercial applications and the modification of the source code (see the posts below). Is the project public domain, or is it under a license such as BSD or LGPL? I would like to modify your library, so I want to know the exact license.
Kind regards, Tom

When I hover over or select text in the HTML, the left frame shows the node for that text. By moving to the parent or child node, I can get the DOM XPath.
Many thanks.

Is there a WebBrowser control in this parser?
_____________________________
Don't download it, make it.
Visual Basic /C#

That's just what I needed. Thanks very much!

Is there an innerHTML property available?

Take a look at the InnerText and TransformHtml methods.

Hi all,
I'm implementing a WinForms app built around an HTML parser.
In my app, the user enters a URL (such as www.amazon.com) and the app shows the page in a web browser control.
I want to let users select an area on that page and have a label control show all the text in the selected area. How can I do that?
In other words: how can I determine which HTML tags in the page enclose the selected text?
Example:
HTML:
<html>
<body>
</body>
</html>
Page:
selected text
none selected text
When I drag the mouse around "selected text", I want to determine that the table with id=1 is selected, and "selected text" should be shown in a label control.
Please share your ideas.
Thanks in advance.
mns

Can we modify this code to parse .css files?
Ashish

I am trying to write unit tests for my library, which is based on yours, but I always get an error: assertion failed. Am I doing something wrong? Or do you perhaps know why the error occurs?
By the way, why are some classes marked as sealed?

1. I tried adding unit tests to my library through a Visual Studio unit test project, and it works well.
Most failed assertions come from argument checks that fail.
For example:
[TestMethod()]
public void ComapctWSCStringTest()
{
    string str = null;
    // string str = "Some test string you need to assign.";
    string expected = null;
    // string expected = "You need to assign the expected return value.";
    string actual;
    actual = DOL.DHtml.DHtmlTextProcessor.ComapctWSCString(str);
    Assert.AreEqual(expected, actual, "DOL.DHtml.DHtmlTextProcessor.ComapctWSCString unexpected return value");
}
2. I did not expect the sealed classes to be extended by inheritance, because the creation of their instances is fixed in class DHtmlGeneralParser. I understand that you may have reasons to extend them. Hence, I will remove the "sealed" keyword and implement the abstract factory or factory method pattern in class DHtmlGeneralParser to meet your requirement.
-- modified at 20:43 Wednesday 1st August, 2007
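For readers unfamiliar with the factory method pattern mentioned in point 2, here is a minimal sketch of the idea. All names (ElementSketch, ParserSketch, TrackedElement, TrackingParser) are hypothetical and do not reflect the library's actual classes or how the change will eventually be implemented.

```csharp
// Hypothetical element type; the real library's element class is not shown here.
public class ElementSketch
{
    public readonly string Name;
    public ElementSketch(string name) { Name = name; }
}

public class ParserSketch
{
    // Factory method: subclasses override this to supply their own element type.
    protected virtual ElementSketch CreateElement(string name)
    {
        return new ElementSketch(name);
    }

    // The parsing code never calls "new" directly, so it works with any element subclass.
    public ElementSketch ParseTag(string tagName)
    {
        return CreateElement(tagName.ToLowerInvariant());
    }
}

// A user-defined element type, possible once the base class is no longer sealed.
public class TrackedElement : ElementSketch
{
    public TrackedElement(string name) : base(name) { }
}

public class TrackingParser : ParserSketch
{
    protected override ElementSketch CreateElement(string name)
    {
        return new TrackedElement(name);
    }
}
```

With this shape, new TrackingParser().ParseTag("DIV") yields a TrackedElement without any change to the parsing code itself.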

Because the 'DHtmlTextProcessor' class modifies a static StringBuilder instance (m_builder) in a few of its methods, you can very easily run into race conditions when using this library in multiple threads. Please keep in mind that I am not asking for these classes to be thread-safe, but rather suggesting that you re-work the DHtmlTextProcessor code so that 2 (or more) DHtmlDocument instances can be created (parsing can occur) on different threads at the same time. Currently, parsing is only safe on one thread at a time. You could synchronize access to m_builder by locking it in each method of DHtmlTextProcessor, however this would cause undue performance overhead. I think you are better off creating a new StringBuilder instance in each method that needs one.
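To illustrate the last suggestion, here is a minimal sketch of a text-processing method that allocates its own StringBuilder on each call. The class and method names (TextProcessorSketch, CompactWhitespace) are made up for this example and are not the library's code; the point is only that no builder is shared between calls.

```csharp
using System.Text;

public static class TextProcessorSketch
{
    // Collapses each run of whitespace into a single space.
    public static string CompactWhitespace(string input)
    {
        if (input == null) return null;

        // Local builder: nothing is shared between calls, so concurrent calls cannot interfere.
        StringBuilder builder = new StringBuilder(input.Length);
        bool previousWasSpace = false;
        foreach (char c in input)
        {
            if (char.IsWhiteSpace(c))
            {
                if (!previousWasSpace) builder.Append(' ');
                previousWasSpace = true;
            }
            else
            {
                builder.Append(c);
                previousWasSpace = false;
            }
        }
        return builder.ToString();
    }
}
```

Sizing the builder to the input length up front avoids regrowth while appending, which keeps the cost of the extra allocation small.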

I think you are right. I used a single StringBuilder instance because I was concerned about the cost of allocating memory in each method, since DHtmlTextProcessor is the main performance bottleneck of this library ;P. But I modified it to create a new StringBuilder instance in each method that needs one, and the performance is OK. So I will update that. Thank you for your suggestion.

This is great work. One thing that looks to be broken is support for a colon in an attribute name which may be there for namespaces (<html xmlns:vml="urn:schemas-microsoft-com:vml" ...>). On a related note, and I'm not sure what the HTML specs are on this, but most browsers will accept a period in the attribute name (

As far as I know, the concept of a namespace is not defined in the HTML spec or for any element of the HTML DTD, but it is in XHTML, because XHTML is a kind of XML document. Many browsers accept a colon or period in an element name or attribute name, but they may ignore it when rendering to the screen. Section 3.2, HTML Lexical Syntax, of HTML 2.0 (RFC 1866) seems to permit it, so I retain it when parsing.

It is perfect!
Do you mind if I use the CSS parser as a library in commercial applications?

That's OK.

Thank you very much!
And can I modify some of the code for my own purposes?

Sure.
If you can, please give me some suggestions for this library. Thanks!

Thank you!
I want to add some code to support the following features.
1. At-rules in CSS2 can also be parsed.
For example: @import "test.css" -> get the name of the CSS file;
@media print{...} -> parse into a selector that has a media type.
2. CSS values can also be parsed.
For example:
font-family:'ヒラギノ角ゴ Pro W3','Hiragino Kaku Gothic Pro', 'ＭＳ Ｐゴシック', 'MS PGothic', Osaka, sans-serif;
-> The list of (string) values can be parsed out. Values such as lengths, URIs, integers, colors and so on can also be parsed out.
3. Invalid tokens are ignored.
For example: H3, H4 & H5 {color: red }
-> Ignore the whole line, and do not set the color of H3 to red
background: "red" -> Ignore the whole line
-- modified at 2:47 Monday 16th July, 2007
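As a rough illustration of point 2, the sketch below splits a font-family value into its individual names, treating commas inside quotes as part of a name. CssValueSketch and SplitFontFamily are hypothetical names, not part of the library, and the sketch deliberately ignores CSS escape sequences.

```csharp
using System.Collections.Generic;
using System.Text;

public static class CssValueSketch
{
    // Splits a font-family value on commas that are outside quotes and strips the quotes.
    public static List<string> SplitFontFamily(string value)
    {
        List<string> names = new List<string>();
        StringBuilder current = new StringBuilder();
        char quote = '\0'; // '\0' means "not inside a quoted string"

        foreach (char c in value)
        {
            if (quote != '\0')
            {
                // Inside quotes: everything except the closing quote belongs to the name.
                if (c == quote) quote = '\0';
                else current.Append(c);
            }
            else if (c == '\'' || c == '"')
            {
                quote = c;
            }
            else if (c == ',')
            {
                AddName(names, current);
            }
            else if (c != ';')
            {
                current.Append(c);
            }
        }
        AddName(names, current);
        return names;
    }

    // Stores the trimmed name (if any) and resets the buffer for the next one.
    private static void AddName(List<string> names, StringBuilder current)
    {
        string name = current.ToString().Trim();
        if (name.Length > 0) names.Add(name);
        current.Length = 0;
    }
}
```

A real CSS value parser would tokenize first and then classify each token (string, identifier, length, URI, color); this sketch only covers the comma-separated string list.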

Parsers are a popular area of study these days.
Trying your demo made me very interested,
and I would like to try your library.
Do you have any more detailed documentation
(e.g. class diagrams, method usage and so on)?
Could you point me to it?
Thank you very much!!

Hi! I've tried your HTML parser and can say that it has
good performance and an excellent interface (except for the comments: there are none).
But I had a problem parsing the home page of http://www.mail.ru - it looks like
it gets stuck in an infinite loop. Maybe you can take a look at this site and find the problem.
Do you mind if I use it in my open-source project (a web browser)?

Thanks!
I tried many libraries before I settled on yours. Some of them had great performance but crashed even on a missing closing tag, and some had an awful class structure. So your library is great!
Maybe you could throw an exception like DHtmlStuctureException? It would be cool.

Can anyone solve my problem?
I have developed a web application in which I parse the contents of a web page using the
MILHTML parser.
I now have the document in HTML format.
I need to use the parser's types such as
htmldocument
htmlelement
htmlnode
htmlattributes
I am really new to the .NET environment, and now I need to know
how to find the tags of the form <input type=hidden....">
I need to separate out the input tags first and then find their attributes, like type="submit,hidden" name="" etc.
Has anybody done this before, or can anybody give me an idea of how to write the recursive function that separates the input tags from the document?
Please help, I am running short of time.
Thanks
Rama

Maybe the following program meets your requirement.
Good luck
// Open the HTML file "xxx.htm"
DHtmlGeneralParser parser = new DHtmlGeneralParser();
DHtmlDocument htmlDoc = new DHtmlDocument(parser);
htmlDoc.Load(@"..\xxx.htm");
DHtmlNodeCollection result = new DHtmlNodeCollection();
// Find all tags matching the pattern <input type="oooo"> in the whole HTML document
// function: void FindByNameAttribute
// (
//     DHtmlNodeCollection result, // a collection that receives the result
//     string name,                // tag name that you want to find
//     string attributeName,       // attribute name that you want to find
//     bool searchChildren         // whether to search children recursively
// )
htmlDoc.Nodes.FindByNameAttribute(result, "input", "type", true);
-- modified at 11:48 Friday 30th March, 2007

Hi, thanks, that was good, but can you give me slightly more specific code for this?
Let me explain my part of the project.
I have developed a web application, and I wrote the code below to retrieve the HTML content:
//string html = URL;
// System.Net.HttpWebRequest webrequest = (HttpWebRequest)System.Net.WebRequest.Create(html);
// System.Net.HttpWebResponse webresponse = (HttpWebResponse)webrequest.GetResponse();
// StreamReader webstream = new StreamReader(webresponse.GetResponseStream(), Encoding.ASCII);
// webrequest.Method = "GET";
// string strml = webstream.ReadToEnd();
Now the string strml holds the HTML content, and I need to extract the input tags from it.
Can you give me an idea of how to do this?
Thanks
Rama

// Based on your code, the following code shows how to extract the input tags from "strml".
string html = URL;
System.Net.HttpWebRequest webrequest = (HttpWebRequest)System.Net.WebRequest.Create(html);
// set the method before the request is sent
webrequest.Method = "GET";
System.Net.HttpWebResponse webresponse = (HttpWebResponse)webrequest.GetResponse();
StreamReader webstream = new StreamReader(webresponse.GetResponseStream(), Encoding.ASCII);
string strml = webstream.ReadToEnd();
// transform "strml" into an HTML document tree
DHtmlGeneralParser parser = new DHtmlGeneralParser();
DHtmlDocument htmlDoc = new DHtmlDocument(parser);
htmlDoc.LoadHtml(strml);
DHtmlNodeCollection result = new DHtmlNodeCollection();
// extract every "input" tag from the HTML document tree
htmlDoc.Nodes.FindByName(result, "input", true);
// or extract only the "input" tags that have a "type" attribute
htmlDoc.Nodes.FindByNameAttribute(result, "input", "type", true);
// The "result" collection now contains the HTML node(s) that match your condition.
// visit the result
foreach (DHtmlElement inputTag in result)
{
    // do your job here :-D
}
James
-- modified at 11:48 Friday 30th March, 2007

It really worked for me,
but the line of code
DHtmlGeneralParser parser = DHtmlGeneralParser();
gave me an error like "DHtmlGeneralParser is a type and hence invalid in the given context".
I removed that line
and now it works.
Thanks for your help,
and your article was wonderful.
Thanks
Rama

Sorry, it's my fault.
I missed the keyword new in
DHtmlGeneralParser parser = new DHtmlGeneralParser(); .
I hope the library can help you more.

Anyway, I appreciate your help.
Thanks
Rama

Could you try uploading again? The file seems to be corrupted.

Sorry, I have fixed the ZIP file now. I apologize for that.