CHTMLViewer

Karl Runmo

Nov 16, 2004

An article on parsing and displaying HTML.

Download demo and source - 295 Kb

Introduction

For one of my projects, I needed to create PDF documents with rich text in a component to be used from ASP code. To give the ASP programmers an easy interface, I wrote a simple HTML parser to go with my PDF creator. Only the PDF creator survived in that project, so sharing the HTML parser here keeps it from falling into the dark hole of forgotten cyberspace.

Once the basic HTML parsing was done, I couldn't stop. I went on to parse tables and images, and ended up loading pages from the Internet and comparing them with how they looked in Internet Explorer. There is still a lot to be done with this parser, but it already supports more HTML than the two other HTML parsers I found on Code Project: A Simple HTML drawing class and XHTMLStatic - An extra-lean custom control to display HTML.

Background

The basic idea is to break the HTML down into its smallest parts, which are words, images and compounds. Each compound (that is, a TD, a PRE, or a P or DIV with alignment) is treated as a completely new HTML document, and nested tables are recursed into at any depth.

Once the HTML is broken down into its smallest parts, no further parsing is needed and the text sizes are never recalculated, so only the positions of the parts have to be recalculated when the display size changes.
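
To illustrate why a resize only needs a repositioning pass, here is a minimal sketch. It is not the code from CHTMLViewer: `Part` and `LayoutParts` are hypothetical names, and the greedy left-to-right wrapping is an assumed simplification that ignores alignment, floats and compounds.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for one parsed part (a word or an image).
// width/height are measured once at parse time; x/y are rewritten
// on every layout pass.
struct Part {
    int width;
    int height;
    int x;
    int y;
};

// Greedy line breaking: place parts left to right and start a new line
// when the next part would cross displayWidth.
// Returns the total height of the laid-out document.
int LayoutParts(std::vector<Part>& parts, int displayWidth, int spacing)
{
    assert(displayWidth > 0 && spacing >= 0);
    int x = 0, y = 0, lineHeight = 0;
    for (Part& p : parts) {
        // Wrap only if something is already on the line, so a part that
        // is wider than the display still gets a line of its own.
        if (x > 0 && x + p.width > displayWidth) {
            x = 0;
            y += lineHeight;
            lineHeight = 0;
        }
        p.x = x;
        p.y = y;
        x += p.width + spacing;
        if (p.height > lineHeight) lineHeight = p.height;
    }
    return y + lineHeight;
}
```

Calling `LayoutParts` again with a new `displayWidth` moves the parts without measuring any text, which is the property described above.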

TABLE is tricky for many reasons:

The content of a TD should not be broken if there is enough space.
The content of a TD should be broken if there is not enough space.
The TD with the largest minimum width defines the minimum width of the whole column.
Each TD can have a specified width, which sets the width of the whole column if no other TD in the column has a greater minimum width.
A TD can span more than one column. All the rules above must still be applied. The colspan can be greater than the actual number of columns.
A width can be specified for the whole table, and if it is greater than the combined minimum width of all columns, the columns must be expanded.
If the specified width of the table is smaller than the combined minimum width of all columns, the table width is expanded. (I have found out that IE shrinks the columns instead...)
Heights must be treated in a similar way to widths.
A TD can span more than one row.
Cellpadding, border thickness and so on must be accounted for as well.
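
The width rules for cells without spans can be sketched roughly as follows. This is an illustration under stated assumptions, not the algorithm used in CHTMLViewer: `Cell` and `ColumnWidths` are hypothetical names, colspan, padding and borders are left out, and sharing the extra space evenly between columns is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one TD without colspan.
struct Cell {
    std::size_t column;  // zero-based column index
    int minWidth;        // widest unbreakable part inside the cell
    int specifiedWidth;  // width attribute in pixels, 0 if absent
};

// Returns one width per column. tableWidth is the specified table
// width in pixels, or 0 if the table has none.
std::vector<int> ColumnWidths(const std::vector<Cell>& cells,
                              std::size_t columnCount, int tableWidth)
{
    assert(columnCount > 0);
    std::vector<int> widths(columnCount, 0);
    for (const Cell& c : cells) {
        assert(c.column < columnCount);
        // A specified width only wins if it exceeds every minimum width
        // in the column, which taking the maximum guarantees.
        int wanted = std::max(c.minWidth, c.specifiedWidth);
        widths[c.column] = std::max(widths[c.column], wanted);
    }
    int total = 0;
    for (int w : widths) total += w;
    // A wider table expands its columns; a narrower specified width is
    // ignored, so the table grows to the sum of the column widths.
    int extra = tableWidth - total;
    if (extra > 0) {
        int count = static_cast<int>(columnCount);
        int share = extra / count;
        int rest = extra % count;  // leftover pixels go to the first columns
        for (int i = 0; i < count; ++i)
            widths[static_cast<std::size_t>(i)] += share + (i < rest ? 1 : 0);
    }
    return widths;
}
```

Handling colspan would need a second pass that spreads a spanning cell's minimum width over the columns it covers, which is where most of the trickiness listed above comes from.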

There are also a lot of undocumented features in Internet Explorer:

<P> has an end tag but is never nested; only one level is open at a time.
If the sum of the specified widths of the <TD>s is greater than the specified width of the <TABLE>, the <TD>s are shrunk if possible. This is supported by CHTMLViewer, but the same applies to heights, which is not supported.
<DIV>, <TD>, <TR>, <P> do not require end tags.
A <HR> breaks a <P>.
Plus a lot more...
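
Two of the quirks above (a `<P>` is never nested, and an `<HR>` breaks a `<P>`) can be sketched as a pass that inserts the implied end tags. This is only an illustration, not the parser's actual code: `CloseParagraphs` is a hypothetical helper, tokens are assumed to be bare upper-case tag names, and dropping a stray `/P` is an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Inserts the "/P" end tags implied by the quirks above. Tokens are
// bare tag names such as "P", "/P" or "HR"; all others pass through.
std::vector<std::string> CloseParagraphs(const std::vector<std::string>& tokens)
{
    std::vector<std::string> out;
    bool paragraphOpen = false;
    for (const std::string& t : tokens) {
        assert(!t.empty());
        // A new <P> or an <HR> ends the paragraph that is still open.
        if (paragraphOpen && (t == "P" || t == "HR")) {
            out.push_back("/P");
            paragraphOpen = false;
        }
        if (t == "/P") {
            if (!paragraphOpen) continue;  // stray end tag: dropped (assumption)
            paragraphOpen = false;
        } else if (t == "P") {
            paragraphOpen = true;
        }
        out.push_back(t);
    }
    if (paragraphOpen) out.push_back("/P");  // end of input closes the paragraph
    return out;
}
```

With this normalization, the token sequence `P B P HR /P` becomes `P B /P P /P HR`, so later stages only ever see one open paragraph at a time.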

The following table lists the supported tags and properties; anything not listed is not supported.

Tag	Properties
`<BR>`	Line break. `clear=all` clears all aligned tables and images
`<I> </I>`	Italic, no properties
`<B> </B> <STRONG> </STRONG>`	Bold, no properties
`<U> </U>`	Underline, no properties
`<A> </A>`	Anchor. `href` = URL
`<SMALL> </SMALL>`	Small text, no properties
`<BIG> </BIG>`	Big text, no properties
`<CENTER> </CENTER>`	Centered content, no properties
`<P> </P>`	Paragraph. `align` = alignment: `left`, `right`, `justify` or `center`
`<FONT> </FONT>`	Font. `face` = font name; `size` = font size; `color` = font color
`<IMG> </IMG>`	Image. `src` = source URL; `width` = image width; `height` = image height; `align` = floating alignment, can be `left` or `right`
`<H1> </H1> <H2> </H2> <H3> </H3>`	Headings, no properties
`<DIV> </DIV>`	Divider. `bgcolor` = fill color; `width` = divider width; `border` = divider border; `align` = divider alignment, can be `center`, `justify` or `right` (`left` is the default)
`<HR>`	Horizontal rule, no properties
`<TABLE> </TABLE>`	Table. `bgcolor` = fill color; `width` = table width, can be a percentage; `height` = table height; `border` = table border; `cellpadding` = cell padding; `cellspacing` = cell spacing; `background` = background image; `align` = floating alignment, can be `left` or `right` (`center` is treated as a `<center>` tag)
`<TR>`	Table row. `bgcolor` = background color; `align` = row alignment: `center`, `left`, `right` or `justify`
`<TD> </TD>`	Table data. `bgcolor` = fill color; `width` = `TD` width; `height` = `TD` height; `colspan` = column span; `rowspan` = row span; `align` = `TD` alignment, can be `center`, `justify` or `right` (`left` is the default); `valign` = `TD` vertical alignment, can be `center` or `bottom` (`top` is the default); `background` = URL of background image; `nowrap`, `nobreak` = no wrapping
`<PRE> </PRE>`	Unparsed, no properties
`<INPUT>`	Input control. `type` = control type, can be `password`, `button`, `submit`, `radio`, `checkbox`, `hidden`, `image` or `text` (the default); `size` = size of text box; `maxlength` = maximum number of characters in text box; `value` = value of control; `name` = name of control; `src` = source of image control; `width` = width of image control; `height` = height of image control
`<SELECT>`	Dropdown box control. `name` = name of control
`<OPTION>`	Dropdown box item. `value` = value of item, if not the title
`<SUB> </SUB>`	Subscript text, no properties
`<SUP> </SUP>`	Superscript text, no properties
`<STRIKE> </STRIKE>`	Strikethrough text, no properties
`<TEXTAREA> </TEXTAREA>`	Multiline text box. `name` = control name; `rows` = visible rows; `cols` = columns
`<FORM> </FORM>`	Form. `method` = method, `get` or `post`; `action` = URL to post data to
`<UL> </UL>`	Bullet list, no properties
`<OL> </OL>`	Numbered list, no properties
`<LI>`	List item in a numbered or bullet list, no properties
`<BODY>`	Document body. `bgcolor` = background color; `background` = URL of background image
`<TITLE>`	Document title to be displayed in window header

Using the code

The class to use is CHTMLViewer. It makes callbacks to CHTMLProgress, which the user of CHTMLViewer has to implement. CHTMLProgress receives notifications about parsing status, title, cursor and links. It is also used for fetching images referenced by URLs.

Relative URLs are handled by providing the class with the URL of the current page.
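
As a rough idea of what resolving a relative URL against the page URL involves, here is a minimal sketch. `ResolveUrl` is a hypothetical helper, not a function of CHTMLViewer, and it deliberately ignores `../` segments, query strings and fragments.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Resolves ref against the URL of the current page (base).
std::string ResolveUrl(const std::string& base, const std::string& ref)
{
    assert(!base.empty());
    // A reference that carries its own protocol is already absolute.
    if (ref.find("://") != std::string::npos) return ref;

    // Locate the first '/' after the host name, if there is one.
    std::size_t schemeEnd = base.find("://");
    std::size_t hostStart = (schemeEnd == std::string::npos) ? 0 : schemeEnd + 3;
    std::size_t pathStart = base.find('/', hostStart);

    if (!ref.empty() && ref[0] == '/') {
        // Root-relative: keep only protocol and host of the base.
        std::string root = (pathStart == std::string::npos)
                               ? base
                               : base.substr(0, pathStart);
        return root + ref;
    }
    // Document-relative: replace everything after the last '/' of the base.
    if (pathStart == std::string::npos) return base + "/" + ref;
    std::size_t lastSlash = base.rfind('/');
    return base.substr(0, lastSlash + 1) + ref;
}
```

For a page at `http://example.com/dir/page.htm`, the reference `img/a.gif` resolves to `http://example.com/dir/img/a.gif` and `/top.gif` resolves to `http://example.com/top.gif`.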

//
// Step 1: Create the class and provide it with a pointer 
//         to your CHTMLProgress implementation:
//
CHTMLViewer *pHTMLViewer = new CHTMLViewer(g_pProgress);

//
// Step 2: Provide the class with the HTML, the default font and
//         font-size, the default font color and current URL.
//
pHTMLViewer->SetHTML(szHTMLData,"Verdana",15,RGB(0,0,0),szUrl);

//
// Step 3: After SetHTML has been called, or whenever the size
//         of the display changes, the positions of the items in
//         the document have to be recalculated.
//         Pass a RECT to the function with the dimensions of the
//         display. The RECT will receive the dimensions of the whole
//         document:
RECT r;
GetClientRect(hWnd,&r); // hWnd is assumed to be the window you draw in
pHTMLViewer->CalcPositions(r);

//
// Step 4: Draw the HTML on a device context. Pass the display size 
//         and scroll positions:
pHTMLViewer->Draw(hDC,r,nXScrollPos,nYScrollPos);

These are the functions in CHTMLProgress that you have to implement:

//
// Display progress. If using blocking image load, the page is
// done when nPos==nSize
virtual void Reading(BOOL bImage, int nPos, int nSize) = 0;
//
// This function is called from CHTMLViewer::OnMouseClick when the 
// mouse position matches a link. Remember to offset the mouse position
// according to the scroll position!
virtual void LinkClicked(char *szLink) = 0;
//
// This function is called from CHTMLViewer::OnMouseOver to inform if
// the mouse is over a link
virtual void SetCursor(BOOL bLink,char *szLink) = 0;
//
// Set the title of the main window to the value of szTitle
virtual void SetTitle(char *szTitle) = 0;
//
// You have to implement how data is passed to CHTMLViewer.
// This function is only called when images need to be downloaded
virtual char *ReadHTTPData(char *szUrl,int *pnSize) = 0;
//
// An image has finished downloading.
// Call CHTMLViewer::CalcPositions to recalculate
// positions of HTML objects and redraw the entire screen.
// When using non-blocking image
// load, the page is done when nDone==nCount
virtual void ImageLoaded(int nDone, int nCount) = 0;
//////////////////////////////////////////////////////////////////////////
// Control functions, return -1 if no control support.
//
// Enables you to have a list of forms.
// LinkClicked with a link named "Sumit(#Number)"
// is called when a link submits a form,
// the #Number will be the value returned from
// this function
virtual int CreateForm(char *szAction, char *szMethod) = 0;
// 
// Create controls. The returned value will be passed
// to the MoveControl function when
// the controls are positioned
virtual int CreateTextBox(int nSize, int nMaxChar,SIZE scSize,char *szText, 
    BOOL bMultiline,char *szName,int nForm,BOOL bPassword) = 0;
virtual int CreateButton(SIZE scSize,char *szText,char *szName,int nForm) = 0;
virtual int CreateRadio(SIZE scSize,char *szValue,char *szName,int nForm) = 0;
virtual int CreateCheckbox(SIZE scSize,char *szValue,char *szName,int nForm) = 0;
virtual int CreateListbox(SIZE scSize,CHTMLStringTable &stOptions,int nSelected,
    char *szName,int nForm) = 0;
//
// Positioning of controls
virtual void MoveControl(int nControlID, POINT pPos) = 0;

//
// Updates an area of the screen. This function
// is called when an animated GIF changes
// frame or when the mouse pointer is over a link.
virtual void UpdateScreen(RECT &r) = 0;
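
The reminder about offsetting the mouse position by the scroll position can be illustrated with a small hit test. This is a sketch only: `LinkArea` and `FindLink` are hypothetical names, and how CHTMLViewer stores its link rectangles internally is not shown in this article.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical link rectangle in document coordinates
// (right and bottom are exclusive).
struct LinkArea {
    int left, top, right, bottom;
    std::string url;
};

// Converts a mouse position in window coordinates to document
// coordinates by adding the scroll offsets, then looks for a link
// under that point. Returns nullptr if there is none.
const LinkArea* FindLink(const std::vector<LinkArea>& links,
                         int mouseX, int mouseY, int xScroll, int yScroll)
{
    assert(xScroll >= 0 && yScroll >= 0);
    int docX = mouseX + xScroll;
    int docY = mouseY + yScroll;
    for (const LinkArea& l : links) {
        if (docX >= l.left && docX < l.right &&
            docY >= l.top && docY < l.bottom)
            return &l;
    }
    return nullptr;
}
```

Without the offset, a link 500 pixels down the page would never be hit once the view has been scrolled to it, because the mouse coordinates stay relative to the window.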

The Demo

The demo project includes an implementation of CHTMLProgress with scrolling. To open a file from disk, you must specify the URL including the protocol, for example: file://c:\folder1\file1.htm. To copy selected text, right-click and select Copy in the context menu.

The demo exe is compiled with CxImage to allow PNG images, transparency and animations. The code has a #define to switch CxImage on and off. The animations do not work for all images; either there are bugs in CxImage or I don't use it the proper way.

The demo also shows what I think is a cool use of CHTMLViewer: fully formatted tooltips. Or maybe they are just annoying.

Remarks

Many HTML features still have to be implemented in my classes before they could serve as an independent open source browser, which I think would be awesome to have. Feel free to implement and optimize them, and keep me posted if you do.

Here is a list of things that I don't know how to handle. Can anyone help?

Styles... I have found code that translates HTML 3.2 to HTML with CSS, but not the other way around. Is there any open source code that does this?
The calculation of positions can be slow if the HTML contains deeply nested tables, so some optimization could improve performance.
MAP/Area for images.
JavaScript... hehe...
CxImage doesn't handle animations correctly.

History

2004.11.16
- Published.
2004.11.18
- Added support for text selection and copying.
- Added support for <PRE> tags.
2004.11.30
- Integrated CxImage.
- Added support for many more tags and properties.
- Added support for forms and posting data.
2004.12.03
- Added support for animations.
- Added support for non-blocking download of images.
2004.12.17
- Demo compiled with newest version of CxImage which makes most animations work (still not all).
- Added support for blocking download of images (if wanted).
- Small background images were very time-consuming to draw on screen, so the memory image is now enlarged for small images.
2005.01.10
- Added alignment justify.
- Added support for aligned tables and images.