Posted 16 Nov 2004


12 Jan 2005
An article on parsing and displaying HTML.


For one of my projects, I needed to create PDF documents with rich text in a component to be used from ASP code. To give the ASP programmers an easy interface, I wrote a simple HTML parser together with my PDF creator. Only the PDF creator survived, so sharing the HTML parser here prevents it from falling into the dark hole of forgotten cyberspace.

When I had done the basic HTML parsing, I couldn't stop. I began parsing tables and images, and ended up loading pages from the Internet and comparing them with how they looked in Internet Explorer. There are still a lot of things to be done with this parser, but it already has more HTML support than the two other HTML parsers I found on Code Project: A Simple HTML drawing class and XHTMLStatic - An extra-lean custom control to display HTML.


The basic idea is to break down the HTML into its smallest parts, which are words, images and compounds. Each compound (a TD, a PRE, or a P or DIV with alignment) is treated as a completely new HTML document, and nested tables are recursed into at any depth.

Once the HTML is broken down into its smallest parts, no further parsing is needed and the text sizes are not recalculated, so only the positions of the parts have to be recalculated when the display size changes.
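The split between measuring once and positioning repeatedly can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the actual CHTMLViewer code: the `Part` struct and the `LayoutParts` function are hypothetical names, every part is treated as a box of fixed size, and a single space width separates the parts.

```cpp
#include <cassert>
#include <vector>

// Hypothetical "smallest part": a word or an image, measured once at parse time.
struct Part {
    int width;   // measured once, never recalculated
    int height;  // measured once, never recalculated
    int x;       // position, recalculated whenever the display size changes
    int y;
};

// Re-positions already measured parts for a new display width.
// Returns the total height of the laid-out document.
int LayoutParts(std::vector<Part> &parts, int displayWidth, int spaceWidth)
{
    assert(displayWidth > 0 && spaceWidth >= 0);
    int x = 0, y = 0, lineHeight = 0;
    for (Part &p : parts) {
        // Wrap to a new line when the part does not fit,
        // unless it is the first part on the line.
        if (x > 0 && x + p.width > displayWidth) {
            x = 0;
            y += lineHeight;
            lineHeight = 0;
        }
        p.x = x;
        p.y = y;
        x += p.width + spaceWidth;
        if (p.height > lineHeight) lineHeight = p.height;
    }
    return y + lineHeight;
}
```

Calling `LayoutParts` again with a different `displayWidth` only touches `x` and `y`; the measured widths and heights are reused as they are.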

TABLE is tricky for many reasons:

  • The content of a TD should not be broken if there is enough space.
  • The content of a TD should be broken if there is not enough space.
  • The TD with the largest minimum width defines the minimum width of the whole column.
  • Each TD can have a specified width, which sets the width of the whole column if no other TD in the column has a greater minimum width.
  • A TD can span more than one column. All the rules above must still be applied. The colspan can be greater than the actual number of columns.
  • A width can be specified for the whole table, and if it is greater than the smallest width of all columns, the columns must be expanded.
  • If the specified width of the table is smaller than the smallest width of all columns, the table width is expanded. (I have found out that IE shrinks the columns instead...)
  • Heights must be treated in a similar way as widths.
  • A TD can span more than one row.
  • Cellpadding, border thickness etc...
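A few of the width rules above can be sketched in code. This is a minimal sketch under stated assumptions, not the algorithm used in CHTMLViewer: the `Cell` struct and `ColumnWidths` function are hypothetical, colspan, cellpadding and borders are ignored, and extra table width is simply spread evenly over the columns.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical description of one TD for this sketch.
struct Cell {
    int column;          // zero-based column index
    int minWidth;        // narrowest width at which the content still fits
    int specifiedWidth;  // value of the width attribute, 0 if not given
};

// Computes one width per column. tableWidth is the specified table width,
// or 0 if none was given.
std::vector<int> ColumnWidths(const std::vector<Cell> &cells,
                              int columnCount, int tableWidth)
{
    assert(columnCount > 0);
    std::vector<int> widths(columnCount, 0);
    for (const Cell &c : cells) {
        assert(c.column >= 0 && c.column < columnCount);
        // A specified width only wins if no minimum width in the column is greater.
        int wanted = std::max(c.minWidth, c.specifiedWidth);
        widths[c.column] = std::max(widths[c.column], wanted);
    }

    int total = 0;
    for (int w : widths) total += w;

    // A wider specified table width expands the columns.
    // A narrower one is ignored, so the table grows instead of the columns shrinking.
    if (tableWidth > total) {
        int extra = tableWidth - total;
        int share = extra / columnCount;
        for (int &w : widths) w += share;
        widths.back() += extra - share * columnCount;  // remainder goes to the last column
    }
    return widths;
}
```

For example, two cells in column 0 with minimum widths 50 and 30, where the second one specifies a width of 80, give a column width of 80.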

There are also a lot of undocumented features in Internet Explorer:

  • <P> does have an end tag but is never nested; there is one level only.
  • If the sum of the specified widths of the <TD>s is more than the specified width of the <TABLE>, the <TD>s are shrunk if possible. CHTMLViewer supports this, but the same applies to heights, which is not supported.
  • <DIV>, <TD>, <TR>, <P> do not require end tags.
  • A <HR> breaks a <P>.
  • Plus a lot more...

The following table shows which tags and properties are supported. Anything not listed is not supported.

<BR>: Line break
    clear = all clears all aligned tables and images
<I> </I>: Italic, no properties
<B> </B> <STRONG> </STRONG>: Bold, no properties
<U> </U>: Underline, no properties
<A> </A>: Anchor
    href = URL
<SMALL> </SMALL>: Small text, no properties
<BIG> </BIG>: Big text, no properties
<CENTER> </CENTER>: Centered content, no properties
<P> </P>: Paragraph
    align = alignment, can be left, right, justify or center
<FONT> </FONT>: Font
    face = font name
    size = font size
    color = font color
<IMG> </IMG>: Image
    src = source URL
    width = image width
    height = image height
    align = floating alignment, can be left or right
<H1> </H1> <H2> </H2> <H3> </H3>: Headings, no properties
<DIV> </DIV>: Divider
    bgcolor = fill color
    width = divider width
    border = divider border
    align = divider align, can be center, justify or right, left is default
<HR>: Horizontal line, no properties
<TABLE> </TABLE>: Table
    bgcolor = fill color
    width = table width, can be percentage
    height = table height
    border = table border
    cellpadding = cell padding
    cellspacing = cell spacing
    background = background image
    align = floating alignment, can be left or right. center is treated as a <center> tag
<TR>: Table row
    bgcolor = background color
    align = row alignment, can be center, left, right or justify
<TD> </TD>: Table data
    bgcolor = fill color
    width = TD width
    height = TD height
    colspan = column span
    rowspan = row span
    align = TD align, can be center, justify or right, left is default
    valign = TD vertical align, can be center or bottom, top is default
    background = URL to background image
    nowrap nobreak = no wrap
<PRE> </PRE>: Unparsed text, no properties
<INPUT>: Input control
    type = control type, can be password, button, submit, radio, checkbox, hidden, image or text, which is default
    size = size of text box
    maxlength = maximum number of characters in text box
    value = value of control
    name = name of control
    src = source of image control
    width = width of image control
    height = height of image control
<SELECT>: Dropdown box control
    name = name of control
<OPTION>: Dropdown box item
    value = value of item, if not the title
<SUB> </SUB>: Subscript text, no properties
<SUP> </SUP>: Superscript text, no properties
<STRIKE> </STRIKE>: Strikethrough text, no properties
<TEXTAREA> </TEXTAREA>: Multiline text box
    name = control name
    rows = visible rows
    cols = columns
<FORM> </FORM>: Form
    method = method, get or post
    action = URL to post data to
<UL> </UL>: Bullet list, no properties
<OL> </OL>: Numbered list, no properties
<LI>: List item in a numbered or bullet list, no properties
<BODY>: Document body
    bgcolor = background color
    background = URL to background image
<TITLE>: Document title to be displayed in window header

Using the code

The class to use is CHTMLViewer. It makes callbacks to CHTMLProgress, which the user of CHTMLViewer has to implement. CHTMLProgress receives notifications of parsing status, title, cursor and links. It is also used for fetching images referenced by URLs.

Relative URLs are handled by providing the class with the URL of the current page.
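To illustrate what handling relative URLs involves, here is a simplified sketch of resolving an image reference against the URL of the current page. It is not the code used in CHTMLViewer: `ResolveUrl` is a hypothetical helper, the example.com addresses are placeholders, and cases such as ".." segments, query strings and fragments are deliberately ignored.

```cpp
#include <cassert>
#include <string>

// Resolves ref against base (the URL of the current page).
// Simplified: no "..", no query strings, no fragments.
std::string ResolveUrl(const std::string &base, const std::string &ref)
{
    assert(!base.empty());

    // Already absolute: the reference carries its own protocol.
    if (ref.find("://") != std::string::npos) return ref;

    // Find where "protocol://host" ends and the path begins in the base URL.
    std::string::size_type schemeEnd = base.find("://");
    std::string::size_type hostStart =
        (schemeEnd == std::string::npos) ? 0 : schemeEnd + 3;
    std::string::size_type pathStart = base.find('/', hostStart);

    if (!ref.empty() && ref[0] == '/') {
        // Root-relative: keep only "protocol://host" from the base.
        if (pathStart == std::string::npos) return base + ref;
        return base.substr(0, pathStart) + ref;
    }

    // Document-relative: replace the file name of the base with the reference.
    if (pathStart == std::string::npos) return base + "/" + ref;
    std::string::size_type lastSlash = base.rfind('/');
    return base.substr(0, lastSlash + 1) + ref;
}
```

With the base `http://example.com/dir/page.htm`, the reference `img/a.gif` resolves to `http://example.com/dir/img/a.gif` and `/top.gif` resolves to `http://example.com/top.gif`.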

// Step 1: Create the class and provide it with a pointer 
//         to your CHTMLProgress implementation:
CHTMLViewer *pHTMLViewer = new CHTMLViewer(g_pProgress);

// Step 2: Provide the class with the HTML, the default font and
//         font-size, the default font color and current URL.

// Step 3: After SetHTML has been called, or whenever the size
//         of the display is changed, the positions of the items in
//         the document have to be recalculated.
//         Pass a RECT to the function with the dimensions of the
//         display. The RECT will receive the dimensions of the whole
//         document:

// Step 4: Draw the HTML on a device context. Pass the display size 
//         and scroll positions:

These are the functions in CHTMLProgress that you have to implement:

// Display processing. If using blocking image load, the page is
// done when nPos==nSize
virtual void Reading(BOOL bImage, int nPos, int nSize) = 0;
// This function is called from CHTMLViewer::OnMouseClick when the 
// mouse position matches a link. Remember to offset the mouse position
// according to the scroll position!
virtual void LinkClicked(char *szLink) = 0;
// This function is called from CHTMLViewer::OnMouseOver to inform if
// the mouse is over a link
virtual void SetCursor(BOOL bLink,char *szLink) = 0;
// Set the title of the main window to the value of szTitle
virtual void SetTitle(char *szTitle) = 0;
// You have to implement how data is passed to CHTMLViewer.
// This function is only called when images need to be downloaded
virtual char *ReadHTTPData(char *szUrl,int *pnSize) = 0;
// An image has finished downloading.
// Call CHTMLViewer::CalcPositions to recalculate
// positions of HTML objects and redraw entire screen.
// When using non-blocking image
// load, the page is done when nDone==nCount
virtual void ImageLoaded(int nDone, int nCount) = 0;
// Control functions, return -1 if no control support.
// Enables you to have a list of forms.
// LinkClicked with a link named "Sumit(#Number)"
// is called when a link submits a form,
// the #Number will be the value returned from
// this function
virtual int CreateForm(char *szAction, char *szMethod) = 0;
// Create controls. The returned value will be passed
// to the MoveControl function when
// the controls are positioned
virtual int CreateTextBox(int nSize, int nMaxChar,SIZE scSize,char *szText, 
    BOOL bMultiline,char *szName,int nForm,BOOL bPassword) = 0;
virtual int CreateButton(SIZE scSize,char *szText,char *szName,int nForm) = 0;
virtual int CreateRadio(SIZE scSize,char *szValue,char *szName,int nForm) = 0;
virtual int CreateCheckbox(SIZE scSize,char *szValue,char *szName,int nForm) = 0;
virtual int CreateListbox(SIZE scSize,CHTMLStringTable &stOptions,int nSelected,
    char *szName,int nForm) = 0;
// Positioning of controls
virtual void MoveControl(int nControlID, POINT pPos) = 0;

// Updates an area of the screen. This function
// is called when an animated GIF changes
// frame or when the mouse pointer is over a link.
virtual void UpdateScreen(RECT &r) = 0;
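The reminder above about offsetting the mouse position by the scroll position can be illustrated with a small hit test. This is a sketch with hypothetical names (`LinkArea`, `FindLink`), not part of CHTMLViewer. It assumes link rectangles are stored in document coordinates, while mouse messages arrive in window (client) coordinates.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical link rectangle in document coordinates.
struct LinkArea {
    int left, top, right, bottom;
    std::string url;
};

// Returns the index of the link under the mouse, or -1 if there is none.
// mouseX/mouseY are client coordinates, scrollX/scrollY the scroll position.
int FindLink(const std::vector<LinkArea> &links,
             int mouseX, int mouseY, int scrollX, int scrollY)
{
    assert(scrollX >= 0 && scrollY >= 0);
    // Convert from client coordinates to document coordinates.
    int docX = mouseX + scrollX;
    int docY = mouseY + scrollY;
    for (std::size_t i = 0; i < links.size(); ++i) {
        const LinkArea &a = links[i];
        if (docX >= a.left && docX < a.right &&
            docY >= a.top && docY < a.bottom)
            return static_cast<int>(i);
    }
    return -1;
}
```

Without the offset, a link located 500 pixels down the document would never be hit once the view is scrolled, because the mouse coordinates always stay within the window.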

The Demo

The demo project includes an implementation of CHTMLProgress with scrolling. To open a file from disk, you must specify the URL with its protocol, for example: file://c:\folder1\file1.htm. To copy selected text, right-click and select Copy in the floating menu.

The demo exe is compiled with CxImage to allow PNG images, transparency and animations. The code has a #define to switch CxImage on and off. The animations do not work for all images; either there are bugs in CxImage or I don't use it the proper way.

The demo also shows what I think is a cool use of my CHTMLViewer: fully formatted tooltips. Or maybe they are just annoying.


Many HTML features are still not implemented in my classes, and they would be needed before this could become an independent open source browser, which I think would be awesome to have. Feel free to implement them and optimize the code, and keep me posted if you do.

Here is a list of things that I don't know how to handle. Can anyone help?

  1. Styles... I have found code that translates HTML 3.2 to HTML with CSS, but not the other way. Is there any open source code that does this?
  2. The calculation of positions can be slow if the HTML contains deeply nested tables, so some optimization could improve performance.
  3. MAP/Area for images.
  4. JavaScript... hehe...
  5. CxImage doesn't handle animations correctly.


History

  • 2004.11.16
    • Published.
  • 2004.11.18
    • Added support for text selection and copying.
    • Added support for <PRE> tags.
  • 2004.11.30
    • Integrated CxImage.
    • Added support for many more tags and properties.
    • Added support for forms and posting data.
  • 2004.12.03
    • Added support for animations.
    • Added support for non-blocking download of images.
  • 2004.12.17
    • Demo compiled with newest version of CxImage which makes most animations work (still not all).
    • Added support for blocking download of images (if wanted).
    • Small background images were very time-consuming to draw on screen, therefore the memory image is enlarged for small images.
  • 2005.01.10
    • Added alignment justify.
    • Added support for aligned tables and images.


This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


Written By
Software Developer (Senior)
Sweden
