MIL HTML Parser

Member 987427


30 Mar 2004, 1 min read


A non-well-formed HTML parser for .NET

Library and demonstration application - 31.3 Kb

Introduction

This library builds a DOM (document object model) tree from a given HTML document, allowing the developer to navigate and change the document in a methodical way. In addition to basic HTML production, the library can also produce XHTML documents, as it includes an HTML 4 entity encoder. Included in this release is a demonstration application in VB.NET showing how to use the library. I hope that it is all fairly self-explanatory.
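
To make the idea concrete, here is a minimal, language-neutral sketch in Python of the same technique: parse HTML that is not well-formed into a tree, search it, and write it back out as XHTML with entities encoded. It does not use this library or mirror its API. The names `Node`, `TreeBuilder`, `parse` and `to_xhtml` are hypothetical, the void-tag list is deliberately incomplete, and non-ASCII characters are written as numeric character references rather than HTML 4 named entities.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have content (an incomplete, illustrative list).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """Hypothetical tree node: an element, or a text node when tag is None."""

    def __init__(self, tag=None, attrs=None, text=""):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.text = text
        self.children = []

    def find_all(self, tag):
        """Depth-first search for descendant elements with the given tag name."""
        found = []
        for child in self.children:
            if child.tag == tag:
                found.append(child)
            found.extend(child.find_all(tag))
        return found


class TreeBuilder(HTMLParser):
    """Builds a Node tree while tolerating unclosed and stray tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open element;
        # an end tag with no matching open element is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(Node(text=data))


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def encode(text, quote=True):
    """Escape markup characters, then turn non-ASCII into numeric references."""
    return escape(text, quote=quote).encode("ascii", "xmlcharrefreplace").decode("ascii")


def to_xhtml(node):
    if node.tag is None:
        return encode(node.text, quote=False)
    inner = "".join(to_xhtml(child) for child in node.children)
    if node.tag == "#document":
        return inner
    # A valueless attribute (e.g. "disabled") is written as name="name".
    attrs = "".join(
        ' {}="{}"'.format(k, encode(k if v is None else v))
        for k, v in node.attrs.items()
    )
    if not node.children:
        return "<{}{} />".format(node.tag, attrs)
    return "<{}{}>{}</{}>".format(node.tag, attrs, inner, node.tag)


root = parse("<p>Caf\u00e9 & <b>bold<br>text</p>")
print(to_xhtml(root))
```

The input has an unclosed `<b>`, a bare `&` and an unterminated `<br>`. The sketch still builds a tree from it, and the serializer closes every element and encodes the special characters on the way out.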

Background

This library was written to avoid having to convert a document into XML prior to reading, whilst preserving the distinct HTML qualities. This gets round some deployment issues I had with different platforms.

Using the code

The simplest way to use the code is to add it into your solution as a C# class library. There are no third-party dependencies so it is just a matter of adding the source files in. Alternatively, you can build the DLL and add it as a reference.

Points of Interest

The XHTML production is fairly basic: there is no built-in DTD checking. So far, I have had no problems with the generated output, but I'm keen to get DTD checking sorted.
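
Until DTD checking exists, one weaker safeguard is to confirm that generated output is at least well-formed XML. The sketch below uses Python's standard library for illustration and is not part of this library. The function name `is_well_formed` is hypothetical, and the check says nothing about validity against the XHTML DTDs.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xhtml):
    """Return True if the markup parses as XML.

    This checks well-formedness only; it does not validate against a DTD.
    """
    try:
        ET.fromstring(xhtml)
    except ET.ParseError:
        return False
    return True


good = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>ok<br /></p></body></html>'
bad = "<p>unclosed<br></p>"
print(is_well_formed(good), is_well_formed(bad))
```

A check like this catches unclosed or mismatched tags in the output, but it would not notice, for example, an element used where the DTD does not allow it.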

History

1.4
- Bugfix: XHTML elements are in the proper case
- Bugfix: XHTML <html> will default to xmlns="http://www.w3.org/1999/xhtml"
1.3
- Bugfix: <!DOCTYPE...> and <!...> now treated as comments
- Bugfix: Malformed or incomplete attribute values causing infinite loop fixed
1.2
- Bugfix: <tag/> now handled properly
- Bugfix: Parse errors of scripts
- Bugfix: Parse errors of styles
- HTML 4 entity encoding
- DOM tree navigation
- Basic node searching
- HTML production
- XHTML production (as per http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd)
- Added some component model stuff & comments
- Hid the parser
1.1
- Initial release

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


Written By

Member 987427

United Kingdom
