|
|
I've had the same problem today, and I think I've found a quick fix: when searching for the end of the value, instead of searching for />, search just for >, and afterwards, if it was a />, roll back the index.
regards.
line 718 in htmlparser.cs
//***** original:
//while (i < input.Length && input.Substring(i, 1).IndexOfAny(" \r\n\t/>".ToCharArray()) == -1)
//{
// i++;
//}
//***** new:
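//do not search for /; if the value ends in "/>" we fix it up below
while (i < input.Length && input.Substring(i, 1).IndexOfAny(" \r\n\t>".ToCharArray()) == -1)
{
    i++;
}
//bounds-checked: Substring(i-1,2) would throw when i is 0 or has reached input.Length
if (i > 0 && i < input.Length && input[i] == '>' && input[i - 1] == '/')
    i--;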
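For anyone who wants to try the idea outside htmlparser.cs, here is a minimal, self-contained sketch of the same scan. The class and method names (AttributeScanner, FindValueEnd) are made up for illustration and are not part of the article's code; it assumes the value is unquoted and starts at the given index.

```csharp
using System;

static class AttributeScanner
{
    // Hypothetical helper, not part of htmlparser.cs: returns the index just
    // past an unquoted attribute value that begins at 'start'.
    public static int FindValueEnd(string input, int start)
    {
        int i = start;
        // Stop at whitespace or '>'; a '/' is allowed inside the value.
        while (i < input.Length && " \r\n\t>".IndexOf(input[i]) == -1)
        {
            i++;
        }
        // If we stopped on the '>' of "/>", give the '/' back to the tag,
        // but never step back before the start of the value.
        if (i > start && i < input.Length && input[i] == '>' && input[i - 1] == '/')
        {
            i--;
        }
        return i;
    }
}
```

One consequence of this approach: an unquoted value that really ends in a slash right before the closing > (for example href=http://x.org/>) loses that last slash, because the scan cannot tell it apart from a self-closing tag.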
|
|
|
|
|
Do you know if there is a VB/VB.NET version of this HTML parser?
Regards,
unruledboy@hotmail.com
|
|
|
|
|
Reviewing the code of HtmlEncoder.cs: you can use a hack to avoid parsing each entity (&xxx;) by hand, whether it exists today or is added in the future.
Replacing lines 828 to 1603 (a big "if" block!) with the following code*:
string encodedLiteral = System.Web.HttpUtility.HtmlDecode(token.ToString());
output.Append(encodedLiteral);
will avoid the "manual" parsing.
I think it is better to replace the whole DecodeValue and EncodeValue functions with the equivalent HtmlDecode and HtmlEncode functions of the HttpUtility class.
Check the HttpUtility class; it has a lot of nice functions!
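As a rough sketch of what that replacement could look like: the wrapper class name (EntityCodec) and the method signatures are assumptions made for illustration, since the real DecodeValue and EncodeValue in HtmlEncoder.cs may take different parameters. Only HttpUtility.HtmlDecode and HttpUtility.HtmlEncode are the actual framework calls.

```csharp
using System;
using System.Web; // HttpUtility; .NET Framework projects need a reference to System.Web

static class EntityCodec
{
    // Hypothetical stand-in for the article's DecodeValue:
    // turns entities such as &lt; or &#169; back into characters.
    public static string DecodeValue(string value)
    {
        return HttpUtility.HtmlDecode(value);
    }

    // Hypothetical stand-in for the article's EncodeValue:
    // escapes characters that are special in HTML.
    public static string EncodeValue(string value)
    {
        return HttpUtility.HtmlEncode(value);
    }
}
```

With this, EntityCodec.DecodeValue("a &lt; b") gives back "a < b" without any per-entity "if" block to maintain.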
Congratulations on the code...
PS:
*(you need to add a reference to System.Web to the project for the HttpUtility class)
Giralt
|
|
|
|
|
This is a great article. Your tip makes it perfect.
Thanks to all you guys.
|
|
|
|
|
Well, use your own version, because the HttpUtility class has some bugs with full-width/half-width characters, which show up when you use a character encoding other than UTF-8.
|
|
|
|
|
It does not properly parse double-byte characters such as Simplified Chinese.
It simply encodes these characters into something like "кϢ".
|
|
|
|
|
Nice piece of work.
I was about to make one myself, but this is great.;)
Is there a license on this code?
Aaron Eldreth
TheCollective4.com
|
|
|
|
|
I recently published a very lightweight HTML parser written in Java that also specialises in analysing badly formed HTML and reproducing it verbatim with any desired changes.
I wanted to run it through the JLCA to produce a .NET version but found I need VS.NET 2003 (to convert the Java collection classes), which I am only getting in a couple of weeks' time. If anyone is interested in this, you can monitor the package on the SourceForge project page to receive announcements. I'm not sure how efficient the code that JLCA produces is, but a simple library like this is a good candidate for finding out.
http://sourceforge.net/projects/jerichohtml/
The approach differs from this one in that it does NOT produce a DOM object in memory. Each method call analyses the source text directly (but using internal caching for efficiency), which allows you to see an exact representation of the document, even with overlapping elements.
|
|
|
|
|
As Jonathan points out in his posting, SGMLReader posted by Microsoft on www.gotdotnet.com creates well formed HTML. It also does it without generating a DOM. Because it inherits from the Reader class, you can use it wherever you would use a reader.
Regards
Bill Seddon
|
|
|
|
|
Have you tried running it through ikvmc?
|
|
|
|
|
I've been waiting for someone to try this for a while. Good work! It's especially difficult to parse information that's not well-formed.
With Microsoft and others being committed to the future of Virtual Execution Systems (i.e. .NET), there will be a strong need in the future for software written purely on the CLI. Yup, that's right folks. All those wonderful C/C++ libraries will need to be rewritten. In the not too distant future, the .NET runtime will not sit on top of the Windows/Linux/Mac operating system, it will BE the operating system.
.NET programmers shouldn't be afraid to "reinvent the wheel" because the wheel still needs a lot of work.
|
|
|
|
|
You are not forced to follow anything MS proposes.
If things really go the way you are predicting here, there is always the possibility of switching to ReactOS.
It is pure Win32 and will provide compatibility with Windows NT if MS really does drop it (however, I strongly doubt this).
Martin Fuchs
martin-fuchs@gmx.net
|
|
|
|
|
Whoa... I'm not trying to start any debates about legacy windows code versus the .Net initiative, Microsoft’s market share, or Windows versus Linux versus Mac. I love each platform and I’m not biased in any way. I’m simply speaking as an experienced programmer and someone who knows the economics of computers, not just the hobby of them.
Obviously I’m enthusiastic about the platform. In a more appropriate forum I would be happy to explain why most of the industry sees Virtual Execution Systems as the future, but that is not appropriate for this thread.
I’m just giving kudos to the man’s hard work on this project and reassuring all .Net programmers that their hard work will be noticed.
|
|
|
|
|
|
- Congrats. Apart from the points below, I wouldn't say you are trying to reinvent the wheel. Quite the opposite: I think that if this goes on, it could provide a good alternative to IE interop, which is a good thing. The HTML parser is already a great service, and a simplified renderer would make your article a lot more attractive (hint!). Removing a heavy dependency like IE is always a benefit.
- Why multiple passes? For the sake of simplicity, wouldn't a single pass be simpler?
- I think something is missing: the semantics of non-closing tags like <br> and <p>. Bear in mind that those tags have totally different semantics internally, since <br> has no end-tag counterpart while <p> may or may not have one, and in addition sequences like <p>...<p> must be treated like <p>...</p><p>.
- Since you are using .NET, you could even add JScript interpreting, thanks to JScript.NET support.
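To make the third point concrete, here is a small sketch of how the <br> and <p> rules could be applied to a flat list of tag tokens. Everything in it (the ImpliedEndTags class, the Normalize method, the token format "p", "/p", "br", and the short list of end-tag-less elements) is an assumption for illustration, not the article's parser, and it only handles a <p> that is the innermost open element.

```csharp
using System;
using System.Collections.Generic;

static class ImpliedEndTags
{
    // Elements that never have an end tag (deliberately short, illustrative list).
    static readonly HashSet<string> VoidElements =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "br", "hr", "img", "input", "meta", "link" };

    // Takes tag tokens such as "p", "/p", "br" and returns the same sequence
    // with the missing "/p" tokens inserted.
    public static List<string> Normalize(IEnumerable<string> tags)
    {
        var output = new List<string>();
        var open = new Stack<string>();
        foreach (string tag in tags)
        {
            if (tag.StartsWith("/", StringComparison.Ordinal))
            {
                // Explicit end tag: pop it if it matches the innermost open element.
                string name = tag.Substring(1);
                if (open.Count > 0 && string.Equals(open.Peek(), name, StringComparison.OrdinalIgnoreCase))
                    open.Pop();
                output.Add(tag);
            }
            else if (VoidElements.Contains(tag))
            {
                // <br> and friends are emitted but never pushed: nothing to close later.
                output.Add(tag);
            }
            else
            {
                // A new <p> implicitly closes a <p> that is still open.
                if (string.Equals(tag, "p", StringComparison.OrdinalIgnoreCase) &&
                    open.Count > 0 &&
                    string.Equals(open.Peek(), "p", StringComparison.OrdinalIgnoreCase))
                {
                    open.Pop();
                    output.Add("/p");
                }
                open.Push(tag);
                output.Add(tag);
            }
        }
        return output;
    }
}
```

For example, the tokens p, br, p, /p come back as p, br, /p, p, /p. A real parser would need the full set of optional-end-tag rules (li, td, option and so on); this only shows the shape of the bookkeeping.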
|
|
|
|
|
Hello,
I agree.
JScript interpreting and mapping to the DOM and Navigator objects would be nice, so we could get rid of MSHTML.
On the other hand, if only someone could come up with something robust to load URLs (and manage cookies). I found some URLs that only IE was able to load when parsing an HTML document.
For the renderer I don't know. This is so much work !
Nice work,
R. LOPES
Just programmer.
|
|
|
|
|
GriffonRL wrote:
On the other hand, if only someone could come up with something robust to load URLs (and manage cookies).
urlmon.dll provides all the facilities for that, including URLDownloadToFile or something like that (there is even an article on CodeProject about it).
Downloading a URL may be harder than expected because of a combination of session ID, cookie, or both. You have to build the appropriate HTTP cookie header and query string yourself.
GriffonRL wrote:
For the renderer I don't know. This is so much work !
While a full-fledged renderer is indeed a lot of work (and fun!), it's possible to come up with one that manages pretty much anything except tables and pictures in just a week. And that is useful already. In addition, make sure to look up a little undocumented HTML renderer known as htmllite.dll (used in the VS.NET setup, for instance).
|
|
|
|
|
Bonjour Stephane,
I'm not sure about URLmon.dll. I currently use it in a C# project, but honestly I am now looking for a pure C# library. I am more and more in favor of 100% managed code. I'm fed up with interop and so on...
For the renderer I did a basic one in Java a long time ago when HTML was the only king. Today I would maybe try again in C#...
But my question is why you never came up with a complete replacement for IE in C#/.NET. You know a lot about browsers and IE in particular. You should be able to produce a nice library or control.
Don't tell me you don't have the time for that. That is too common an excuse.
By the way, why has nobody come up with a wrapper around Mozilla or tried to port some part of Mozilla to .NET? That project is full of good stuff and teaching material for apprentice browser makers.
I should probably browse the source again. This could be a good inspiration to write an HTML parser and other things.
Regards,
R. LOPES
Just programmer.
|
|
|
|
|
GriffonRL wrote:
I'm not sure about URLmon.dll. I currently use it in a C# project, but honestly I am now looking for a pure C# library. I am more and more in favor of 100% managed code. I'm fed up with interop and so on...
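You can't have looked for very long:
// requires: using System.IO; using System.Net; using System.Windows.Forms;
HttpWebRequest h = (HttpWebRequest)WebRequest.Create("http://weblogs.asp.net/heatherleigh/contact.aspx");
// dispose the response and the reader so the connection is released
using (WebResponse response = h.GetResponse())
using (StreamReader sr = new StreamReader(response.GetResponseStream()))
{
    MessageBox.Show(sr.ReadToEnd());
}
And that does the trick!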
GriffonRL wrote:
You should be able to produce a nice library or control.
Actually, I have developed one. A screenshot of LongSleeves is here.
GriffonRL wrote:
By the way, why has nobody come up with a wrapper around Mozilla or tried to port some part of Mozilla to .NET?
Switching back to English, since this may be useful to others. Mozilla has an ActiveX wrapper (mozctlx.dll) that not only mimics the IE renderer, but also exposes the same API, with the appropriate events and all that stuff. It's not that it doesn't exist; it's just that on sites like codeproject.com you get fewer links to non-MS components than to MS ones. I wonder why, by the way, when you know how full IE is of bugs of many different natures and severities.
GriffonRL wrote:
I should probably browse the source again. This could be a good inspiration to write an HTML parser and other things.
Don't spend too much time on the source code in this article though; I think it's not worth it until a major rewrite. To me, a really strong HTML parser is one that can read HTML as well as XML without changing a single line of code, and that provides both a DOM model (read everything, store everything in memory) and an event-driven model (only the latest elements and attributes are known). Maybe SgmlReader (linked by Jonathan above) should be given a look. At least SgmlReader is written by Microsoft's Chris Lovett, one of the fathers of MSXML.
|
|
|
|
|
Stephane,
Stephane Rodriguez. wrote:
You can't have looked for very long
Actually I did. I have played quite a bit with the .NET networking classes too, but compared with a library like URLmon you have some work to do before you reach the same ease of use and the same robustness against sites and URLs that are sometimes a bit quirky.
But I am sure one can end up with a very good library, with SSL, authentication, cookies and all the rest, that is easy to use. All that's left is to get started!
Stephane Rodriguez. wrote:
Actually, I have developed one. A screenshot of LongSleeves is here.
Interesting, but where are you with this project?
Stephane Rodriguez. wrote:
Mozilla has an ActiveX wrapper (mozctlx.dll)
I know this one. Unfortunately it is the only one. I was delighted to find it, but I soon realised that the IE API was not fully implemented and was missing some important functions I need. Maybe they updated it very recently... However, I would be very glad to see something similar with a .NET wrapper, even without the same API as IE.
Stephane Rodriguez. wrote:
Don't spend too much time on the source code in this article though; I think it's not worth it until a major rewrite. To me, a really strong HTML parser is one that can read HTML as well as XML without changing a single line of code, and that provides both a DOM model (read everything, store everything in memory) and an event-driven model (only the latest elements and attributes are known). Maybe SgmlReader (linked by Jonathan above) should be given a look. At least SgmlReader is written by Microsoft's Chris Lovett, one of the fathers of MSXML.
I took a look at it. It looks fantastic but seems to have 2 drawbacks:
First, it is quite slow in terms of throughput (converted pages per second) if you plan to use it on a large number of HTML pages (benchmarks may vary with hardware and application architecture). However, I only saw that in preliminary tests. Performance is important for one of my applications.
Second, it doesn't handle malformed HTML like IE does. It expects well formed HTML, so you might fail on a lot of badly designed pages. But if your applications target such pages you have no choice but to be able to parse them anyway. However the source code is available for such an enhancement.
Thanks,
R. LOPES
Just programmer.
|
|
|
|
|
GriffonRL wrote:
But I am sure one can end up with a very good library, with SSL, authentication, cookies and all the rest, that is easy to use
It's funny you say that, because it might suggest that HttpWebRequest is missing something, which is not the case. There is a CookieContainer, what you need to handle challenge-response is there, and the SSL part is handled underneath, as it is for any https:// URL for that matter. So I don't see what is missing. Oh yes, maybe an HTML parser...
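A minimal sketch of the cookie part, assuming you keep one CookieContainer per session and attach it to every request. The SessionClient class and CreateRequest method are hypothetical names; HttpWebRequest.CookieContainer and AllowAutoRedirect are the real framework properties. The request is only built here, not sent.

```csharp
using System;
using System.Net;

static class SessionClient
{
    // Hypothetical helper: builds (but does not send) a request that shares
    // the given cookie jar with every other request of the same session.
    public static HttpWebRequest CreateRequest(string url, CookieContainer cookies)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        // Cookies stored by earlier responses are sent back automatically,
        // and Set-Cookie headers on the response are stored in the same jar.
        request.CookieContainer = cookies;
        request.AllowAutoRedirect = true;
        return request;
    }
}
```

Reusing the same CookieContainer instance across calls is what carries a session ID from a login page to the pages behind it, so there is no need to build the Cookie header by hand.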
GriffonRL wrote:
Interesting, but where are you with this project?
I have put it on hold, because other strategic products have taken priority (work).
GriffonRL wrote:
I was delighted to find it, but I soon realised that the IE API was not fully implemented and was missing some important functions I need.
I think the debate could grow and grow forever, since we are actually talking about a large set of underlying layers of code including:
- an http client
- an http client that handles challenge response, ssl, cookies, ...
- an html parser, with/without malformed tolerance
- an html dom
- an html renderer
- an interactive UI on top of the html renderer, just like IE5.5 edit mode
- ...
I guess that if you really need all of that in your app, your best bet is to take IE. I am not sure, though, that most web-connected apps require much more than simple round-tripping of HTML. And when people try using IE and expect to use it like a stateful component, as in the rich-client world, maybe it's time to show them that they have probably made some not-so-good choices. When you know about the lack of support for the simple "back button" scenario, it's funny to see how many people are still building apps on that... But hey, LH is supposed to sweep all that tinkering away.
GriffonRL wrote:
Second, it doesn't handle malformed HTML like IE does. It expects well formed HTML, so you might fail on a lot of badly designed pages
I didn't know. I still believe the source code can be the basis for improvement. That's pretty much open source, after all.
|
|
|
|
|
QHTM is a great renderer if you don't want to load IE to render a little bit of HTML onto an HDC:
http://www.gipsysoft.com/qhtm/
|
|
|
|
|
You can use mshtml.dll's ActiveX component to do that. It has its problems, but it's better than reinventing the wheel...
is this a sig?
|
|
|
|
|