
A non-well-formed HTML Parser and CSS Resolver

20 Jul 2007
A non-well-formed HTML parser and CSS resolver built in pure .NET C#

Download DOLS_HTML.zip - 364.6 KB (10:52, 07/21/2007, GMT +8)

demo:
Screenshot - demo.jpg

The demo program is deliberately simple: it only demonstrates the functions of the library.
It is similar to the demo program of the MIL HTML Parser (http://www.codeproject.com/dotnet/apmilhtml.asp).

Introduction

This library builds a DOM-like tree from a given non-well-formed HTML document,
allowing the developer to read, compose, and modify the tree in a methodical way.
The library is based on the MIL HTML Parser, and I have tried to improve code page
encoding handling, tolerance of missing tags, the CSS resolver, and efficiency.
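
To make "tolerance of missing tags" concrete, here is a minimal, self-contained sketch of one common way a parser can stay tolerant: keep a stack of open elements, let a closing tag implicitly close any unclosed children, and ignore closing tags that match nothing. This is an illustration only, not the library's actual algorithm; TolerantNode, TolerantParser and the short VoidTags list are hypothetical names made up for this sketch, and comments, scripts and attributes are deliberately not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical node type for this sketch; NOT a class from the library.
class TolerantNode
{
    public string Name;   // tag name, "#text" for text, "#root" for the root
    public string Text;   // text content, only set on "#text" nodes
    public List<TolerantNode> Children = new List<TolerantNode>();
}

// Hypothetical parser for this sketch; it ignores attributes and comments.
static class TolerantParser
{
    // Elements that never have a closing tag (abbreviated list, an assumption).
    static readonly HashSet<string> VoidTags =
        new HashSet<string> { "br", "hr", "img", "input", "link", "meta" };

    // Matches an opening or closing tag, or a run of text.
    static readonly Regex Token = new Regex(
        @"<(?<close>/?)(?<name>[A-Za-z][A-Za-z0-9]*)[^>]*>|(?<text>[^<]+)");

    public static TolerantNode Parse(string html)
    {
        var root = new TolerantNode { Name = "#root" };
        var open = new List<TolerantNode> { root }; // stack of open elements

        foreach (Match m in Token.Matches(html))
        {
            TolerantNode current = open[open.Count - 1];
            if (m.Groups["text"].Success)
            {
                current.Children.Add(
                    new TolerantNode { Name = "#text", Text = m.Value });
                continue;
            }

            string name = m.Groups["name"].Value.ToLowerInvariant();
            if (m.Groups["close"].Value == "/")
            {
                // Find the nearest open element with this name.
                int i = open.Count - 1;
                while (i > 0 && open[i].Name != name) i--;
                // Found: close it and every unclosed element inside it.
                // Not found (i == 0): a stray closing tag, silently ignored.
                if (i > 0) open.RemoveRange(i, open.Count - i);
            }
            else
            {
                var element = new TolerantNode { Name = name };
                current.Children.Add(element);
                if (!VoidTags.Contains(name)) open.Add(element);
            }
        }
        return root; // elements still open at end of input are implicitly closed
    }
}
```

For example, TolerantParser.Parse("<div><b>bold</div>after") yields a root with two children: a div element that contains the unclosed b element, followed by the text "after". The missing closing tag for b is absorbed when the div closes.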

Background

This library was written to avoid having to convert a non-well-formed HTML
document into XML before reading it, while preserving its distinct HTML qualities.

Using the code

// Requires: using System.Text; (for StringBuilder)

// Open the HTML file "Google News.htm"
DOL.DHtml.DHtmlParser.DHtmlGeneralParser parser =
  new DOL.DHtml.DHtmlParser.DHtmlGeneralParser();
DOL.DHtml.DHtmlParser.DHtmlDocument htmlDoc =
  new DOL.DHtml.DHtmlParser.DHtmlDocument(parser);
htmlDoc.Load(@"..\Google News.htm");

// You can modify the HTML tree through htmlDoc.Nodes

// Write the tree back out as HTML
htmlDoc.Save(@"..\Rebuild.htm");

// Dump information about the HTML tree to the IDE's debug output window
StringBuilder builder = new StringBuilder();
htmlDoc.Dump(builder, "");
System.Diagnostics.Debug.Write("\n" + builder.ToString());

Debug Output information
├Object DHtmlDocument Dump :
│ DHtmlNode number: 6
│ Deep dump in the following:
│ │
│ ├Object DHtmlComment Dump :
│ │ Node ID: 1
│ │ Comment content:
================================================
DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
================================================
│ │
│ ├Object DHtmlText Dump :
│ │ Node ID: 2
│ │ Text content is white space
│ │
│ ├Object DHtmlComment Dump :
│ │ Node ID: 3
│ │ Comment content:
================================================
saved from url=(0033)http://www.google.com/news?ned=us
================================================
│ │
│ ├Object DHtmlText Dump :
│ │ Node ID: 4
│ │ Text content is white space
│ │
│ ├Object DHtmlElement Dump :
│ │ Node ID: 5
│ │ HTML Tag: <html>
│ │ DHtmlNode number: 3
│ │ Child Object deep dump in the following:
│ │ │
│ │ ├Object DHtmlElement Dump :
│ │ │ Node ID: 6
│ │ │ HTML Tag: <head>
│ │ │ DHtmlNode number: 30
│ │ │ Child Object deep dump in the following:
│ │ │ │
│ │ │ ├Object DHtmlElement Dump :
│ │ │ │ Node ID: 7
│ │ │ │ HTML Tag: <title>
│ │ │ │ DHtmlNode number: 1
│ │ │ │ Child Object deep dump in the following:
│ │ │ │ │
│ │ │ │ ├Object DHtmlText Dump :
│ │ │ │ │ Node ID: 8
│ │ │ │ │ Text content: "Google News"

Structural diagram

HTML Parser
Image 2

CSS Resolver
Image 3

History

  • 2007/07/21 Modified DHtmlTextProcessor to create a new StringBuilder instance in each method that needs one
  • 2007/05/13 Added structural diagram
  • 2007/05/01 Improved tolerance of attribute structure errors
  • 2007/04/29 Fixed a bug in the handling of missing tags
  • 2007/03/28 Updated demo program (added CSS Resolver demo)
  • 2007/03/27 Fixed a bug in the initialization of DHtmlElement
  • 2007/03/26
    1. New demo program
    2. Added support for the "Visitor Pattern" in the node hierarchy
  • 2007/03/22 Initial release

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.



Written By
Web Developer
United States
James S.F. Hsieh (Nomad Libra) works as an engineer for "Corel Intervideo", a company located in Taiwan.
He received his master's degree from the Graduate Institute of Network Learning Technology, National Central University, Taiwan, in 2006.
His research interests are semantic Web services, intelligent software agents, machine learning, algorithms, software
engineering, and multimedia programming.

Comments and Discussions

 
General: Bug when removing elements
evald80, 18-Jan-10 0:20
Hello,
I'm trying to remove text and comments from an HTML file, so I wrote this code, but it fails with an error.
Any idea how to fix it?
Thank you
public void RemoveWhAndComments(DHtmlNode node)
{
    DHtmlText text = node as DOL.DHtml.DHtmlParser.Node.DHtmlText;
    if (text != null)
    {
        if (text.IsWhiteSpace)
        {
            node.Parent.Nodes.RemoveAt(text.NodeID);
        }
        return;
    }

    DHtmlElement element = node as DOL.DHtml.DHtmlParser.Node.DHtmlElement;
    if (element != null)
    {
        for (int i = 0; i < element.Nodes.Count; i++)
        {
            RemoveWhAndComments(element.Nodes[i]);
        }
        return;
    }
}
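
A possible explanation for the error in the snippet above, offered as an assumption rather than a confirmed diagnosis: RemoveAt expects a position in the parent's child list, but in the debug dump shown in the article the node IDs run sequentially through the whole document (1 to 8), so NodeID is probably not that position. In addition, removing a child while the parent loop counts forward shifts the remaining children and skips one of them. The sketch below shows the usual fix, iterating backwards and removing by loop index. It uses stand-in classes (SketchNode, SketchText, SketchComment, SketchCleaner) that are hypothetical and only mimic the member names seen in the snippet; they are not the library's types.

```csharp
using System;
using System.Collections.Generic;

// Stand-in node types for illustration only; NOT the library's classes.
class SketchNode
{
    public List<SketchNode> Nodes = new List<SketchNode>();

    // Appends a child and returns this node so calls can be chained.
    public SketchNode Add(SketchNode child)
    {
        Nodes.Add(child);
        return this;
    }
}

class SketchText : SketchNode
{
    public string Text;
    public SketchText(string text) { Text = text; }
    public bool IsWhiteSpace { get { return Text.Trim().Length == 0; } }
}

class SketchComment : SketchNode { }

static class SketchCleaner
{
    // Removes whitespace-only text nodes and comment nodes below 'node'.
    public static void RemoveWhAndComments(SketchNode node)
    {
        // Walk backwards so RemoveAt(i) never shifts an index
        // that still has to be visited.
        for (int i = node.Nodes.Count - 1; i >= 0; i--)
        {
            SketchNode child = node.Nodes[i];
            SketchText text = child as SketchText;
            bool removable = child is SketchComment
                || (text != null && text.IsWhiteSpace);

            if (removable)
                node.Nodes.RemoveAt(i); // list position, not a node ID
            else
                RemoveWhAndComments(child); // keep the child, clean below it
        }
    }
}
```

The removal decision is made in the parent's loop, where the child's index is known, instead of inside the recursive call on the child. If the library's node collection exposes an indexer, Count and RemoveAt as the snippet suggests, the same loop shape should carry over to the real DHtmlElement, DHtmlText and DHtmlComment types.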

Question: Licence Type or Public Domain?
Thomas Maierhofer (Tom), 20-Jun-09 22:07
Question: could you rebuild it like IEDevToolBar?
samsong, 2-Mar-09 7:53
Question: Is there a control?
User 4518973, 26-Jul-08 10:24
Answer: Re: Is there a control?
Thomas Maierhofer (Tom), 20-Jun-09 23:18
General: great work!
rafi_mail, 28-Jun-08 13:16
General: Need innerHTML Property
fperugini, 17-Jan-08 9:01
General: Re: Need innerHTML Property
Thomas Maierhofer (Tom), 22-Jun-09 0:09
Question: HTML marked
maingaosuong, 25-Sep-07 16:16
Question: What About files with .css extension
asalpekar00987, 9-Aug-07 1:42
Question: how can i write unittests?
lak-b, 29-Jul-07 0:15
Answer: Re: how can i write unittests? [modified]
James S.F. Hsieh, 1-Aug-07 14:34
General: Bug when using in multiple threads
Jon Okie, 19-Jul-07 6:16
General: Re: Bug when using in multiple threads
James S.F. Hsieh, 20-Jul-07 16:50
General: Handling colon in attribute names
Jon Okie, 18-Jul-07 9:50
General: Re: Handling colon in attribute names
James S.F. Hsieh, 18-Jul-07 14:35
Question: Hello,Can I use the css parser in commercial applications?
weiniannianwei, 9-Jul-07 0:06
Answer: Re: Hello,Can I use the css parser in commercial applications?
James S.F. Hsieh, 15-Jul-07 7:07
General: Re: Hello,Can I use the css parser in commercial applications?
weiniannianwei, 15-Jul-07 15:39
General: Re: Hello,Can I use the css parser in commercial applications?
James S.F. Hsieh, 15-Jul-07 17:30
General: Re: Hello,Can I use the css parser in commercial applications? [modified]
weiniannianwei, 15-Jul-07 20:06
General: Re: Hello,Can I use the css parser in commercial applications?
James S.F. Hsieh, 16-Jul-07 8:01
General: Re: Hello,Can I use the css parser in commercial applications?
weiniannianwei, 16-Jul-07 14:55
General: hi ~ the interest to your library
gavintom, 12-May-07 4:45
General: Re: hi ~ the interest to your library
James S.F. Hsieh, 13-May-07 6:21
