
A non-well-formed HTML Parser and CSS Resolver

By James S.F. Hsieh, 20 Jul 2007
 

Download DOLS_HTML.zip - 364.6 KB (10:52, 07/21/2007, GMT +8)

demo:
Screenshot - demo.jpg

The demo program is deliberately simple: it only demonstrates the functions of the library.
It is similar to the demo program of the MIL HTML Parser (http://www.codeproject.com/dotnet/apmilhtml.asp).

Introduction

This library builds a DOM-like tree from a given non-well-formed HTML document,
allowing the developer to read, compose, and modify the tree in a methodical way.
The library is based on the MIL HTML Parser. I have tried to improve codepage
encoding handling, tolerance of missing tags, the CSS Resolver, and efficiency.
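
To make "tolerance of missing tags" concrete, here is a minimal sketch of one common recovery strategy: keep a stack of open elements, ignore closing tags that have no opener, and implicitly close anything still open when an outer element is closed. This is not the library's implementation. SketchNode, TolerantSketch, and the short list of void tags are hypothetical names chosen for illustration, and the sketch handles tags only (text, attributes, and comments are ignored).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical node type for illustration only; not the library's DHtmlElement.
class SketchNode
{
    public readonly string Tag;
    public readonly List<SketchNode> Children = new List<SketchNode>();
    public SketchNode(string tag) { Tag = tag; }
}

static class TolerantSketch
{
    // A few elements that never have a closing tag (assumed, incomplete list).
    static readonly HashSet<string> VoidTags =
        new HashSet<string> { "br", "hr", "img", "input", "link", "meta" };

    // Builds a tree from tags only, recovering from unclosed or stray
    // closing tags instead of throwing.
    public static SketchNode Parse(string html)
    {
        var root = new SketchNode("#root");
        var open = new Stack<SketchNode>();
        open.Push(root);

        foreach (Match m in Regex.Matches(html, @"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>"))
        {
            bool isClose = m.Groups[1].Value == "/";
            string tag = m.Groups[2].Value.ToLowerInvariant();

            if (!isClose)
            {
                var node = new SketchNode(tag);
                open.Peek().Children.Add(node);
                // Void elements are attached but never left open.
                if (!VoidTags.Contains(tag)) open.Push(node);
                continue;
            }

            // Closing tag: check whether a matching element is open at all.
            bool found = false;
            foreach (SketchNode n in open)
            {
                if (n.Tag == tag) { found = true; break; }
            }

            // A stray closing tag with no opener is simply ignored.
            if (!found) continue;

            // Implicitly close every element opened after the matching one,
            // then close the matching element itself.
            while (open.Pop().Tag != tag) { }
        }
        return root;
    }
}
```

For the input `<div><b>bold</div><p></p>`, where `</b>` is missing, this sketch produces two top-level nodes: a div containing b, followed by p. The unclosed b is closed implicitly when `</div>` is seen, so it does not swallow the rest of the document.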

Background

This library was written to avoid having to convert non-well-formed HTML
into XML before reading it, while preserving the distinct qualities of HTML.

Using the code

// Open HTML file "Google News.htm"
DOL.DHtml.DHtmlParser.DHtmlGeneralParser parser =
  new DOL.DHtml.DHtmlParser.DHtmlGeneralParser();
DOL.DHtml.DHtmlParser.DHtmlDocument htmlDoc =
  new DOL.DHtml.DHtmlParser.DHtmlDocument(parser);
htmlDoc.Load(@"..\Google News.htm");


// You can modify the HTML tree through htmlDoc.Nodes

htmlDoc.Save(@"..\Rebuild.htm");


// Dump information about the HTML tree to the IDE's debug output window
// (StringBuilder requires "using System.Text;")
StringBuilder builder = new StringBuilder();
htmlDoc.Dump(builder, "");
System.Diagnostics.Debug.Write("\n" + builder.ToString());

Debug Output information
├Object DHtmlDocument Dump :
│ DHtmlNode number: 6
│ Deep dump in the following:
│ │
│ ├Object DHtmlComment Dump :
│ │ Node ID: 1
│ │ Comment content:
================================================
DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
================================================
│ │
│ ├Object DHtmlText Dump :
│ │ Node ID: 2
│ │ Text content is white space
│ │
│ ├Object DHtmlComment Dump :
│ │ Node ID: 3
│ │ Comment content:
================================================
saved from url=(0033)http://www.google.com/news?ned=us
================================================
│ │
│ ├Object DHtmlText Dump :
│ │ Node ID: 4
│ │ Text content is white space
│ │
│ ├Object DHtmlElement Dump :
│ │ Node ID: 5
│ │ HTML Tag: <html>
│ │ DHtmlNode number: 3
│ │ Child Object deep dump in the following:
│ │ │
│ │ ├Object DHtmlElement Dump :
│ │ │ Node ID: 6
│ │ │ HTML Tag: <head>
│ │ │ DHtmlNode number: 30
│ │ │ Child Object deep dump in the following:
│ │ │ │
│ │ │ ├Object DHtmlElement Dump :
│ │ │ │ Node ID: 7
│ │ │ │ HTML Tag: <title>
│ │ │ │ DHtmlNode number: 1
│ │ │ │ Child Object deep dump in the following:
│ │ │ │ │
│ │ │ │ ├Object DHtmlText Dump :
│ │ │ │ │ Node ID: 8
│ │ │ │ │ Text content: "Google News"

Structural diagram

[Diagram: HTML Parser]

[Diagram: CSS Resolver]

History

  • 2007/07/21 Modified DHtmlTextProcessor so that each method that needs a StringBuilder creates its own instance
  • 2007/05/13 Added structural diagram
  • 2007/05/01 Improved tolerance of attribute structure errors
  • 2007/04/29 Fixed a bug related to missing tags
  • 2007/03/28 Updated demo program (added CSS Resolver demo)
  • 2007/03/27 Fixed a bug in the initialization of DHtmlElement
  • 2007/03/26
    1. New demo program
    2. Added support for the "Visitor pattern" in the node hierarchy
  • 2007/03/22 Initial release
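
For readers unfamiliar with the Visitor pattern mentioned in the 2007/03/26 entry, the following is a minimal sketch of how a visitor can walk a node hierarchy. Every type here (INodeVisitor, Node, TextNode, ElementNode, TextCollector) is a hypothetical stand-in written for this example. The library's own visitor interface and node classes may look quite different, so check the downloaded source for the real names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// All types below are hypothetical stand-ins, not the library's classes.
interface INodeVisitor
{
    void Visit(TextNode text);
    void Visit(ElementNode element);
}

abstract class Node
{
    // Each concrete node dispatches to the matching Visit overload.
    public abstract void Accept(INodeVisitor visitor);
}

class TextNode : Node
{
    public readonly string Text;
    public TextNode(string text) { Text = text; }
    public override void Accept(INodeVisitor visitor) { visitor.Visit(this); }
}

class ElementNode : Node
{
    public readonly string Tag;
    public readonly List<Node> Children = new List<Node>();
    public ElementNode(string tag, params Node[] children)
    {
        Tag = tag;
        Children.AddRange(children);
    }
    public override void Accept(INodeVisitor visitor) { visitor.Visit(this); }
}

// Collects all text in document order. The node classes need no
// knowledge of this operation; it lives entirely in the visitor.
class TextCollector : INodeVisitor
{
    // Each collector owns its StringBuilder, so separate collectors
    // never share mutable state.
    private readonly StringBuilder _text = new StringBuilder();
    public string Result { get { return _text.ToString(); } }

    public void Visit(TextNode text) { _text.Append(text.Text); }

    public void Visit(ElementNode element)
    {
        foreach (Node child in element.Children)
            child.Accept(this);
    }
}
```

With these stand-in types, a tree built as `new ElementNode("html", new ElementNode("title", new TextNode("Google News")))` can be passed a `TextCollector` through `Accept`, after which `Result` holds the concatenated text. New operations (counting tags, rewriting links, dumping) can then be added as further visitors without touching the node classes.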

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

James S.F. Hsieh
Web Developer
United States
James S.F. Hsieh (Nomad Libra) works as an engineer for "Corel Intervideo", a company located in Taiwan.
He received his master's degree from the Graduate Institute of Network Learning Technology, National Central University, Taiwan, in 2006.
His research interests are semantic Web services, intelligent software agents, machine learning, algorithms, software
engineering, and multimedia programming.



Last Updated 20 Jul 2007
Article Copyright 2007 by James S.F. Hsieh
Everything else Copyright © CodeProject, 1999-2013