CHTMLViewer






Nov 16, 2004
An article on parsing and displaying HTML.
Introduction
For one of my projects, I needed to create PDF documents with rich text in a component to be used from ASP code. To give the ASP programmers an easy interface, I wrote a simple HTML parser together with my PDF creator. Only the PDF creator survived, so sharing the HTML parser here keeps it from falling into the dark hole of forgotten cyberspace.
Once I had the basic HTML parsing done, I couldn't stop. I began parsing tables and images, and ended up loading pages from the Internet and comparing them with how they looked in Internet Explorer. There is still a lot to be done with this parser, but it already has more HTML support than the two other HTML parsers I found on Code Project: A Simple HTML drawing class and XHTMLStatic - An extra-lean custom control to display HTML.
Background
The basic idea is to break down the HTML into its smallest parts, which are words, images and compounds. Each compound, i.e., TD, PRE and P or DIV with alignment, is treated as a totally new HTML document, and nested tables are recursed down at any level.
Once the HTML is broken down into its smallest parts, no further parsing is needed, and the sizes of the texts are not recalculated. Only the positions of the parts have to be recalculated if the display size changes.
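The idea of measuring once and repositioning on every resize can be sketched as follows. This is a minimal illustration, not the code in CHTMLViewer: the `Item` struct, the `LayoutItems` function and the fixed `spaceWidth` are all assumptions made for this example.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical item type: one word (or image) whose size was measured
// once during parsing. Only x and y change when the display is resized.
struct Item {
    std::string text;
    int width;   // measured once, never recalculated
    int height;  // measured once, never recalculated
    int x;
    int y;
};

// Flow the items left to right, wrapping to a new line when the next
// item would not fit. Returns the total height of the laid-out content.
int LayoutItems(std::vector<Item>& items, int displayWidth, int spaceWidth) {
    assert(displayWidth > 0);
    int x = 0, y = 0, lineHeight = 0;
    for (Item& item : items) {
        // Wrap only if something is already on the line; an item wider
        // than the display still gets a line of its own.
        if (x > 0 && x + item.width > displayWidth) {
            x = 0;
            y += lineHeight;
            lineHeight = 0;
        }
        item.x = x;
        item.y = y;
        x += item.width + spaceWidth;
        if (item.height > lineHeight) lineHeight = item.height;
    }
    return y + lineHeight;
}
```

Calling `LayoutItems` again with a different `displayWidth` only rewrites `x` and `y`; the stored widths and heights are reused as they are.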
TABLE is tricky for many reasons:
- The content of a TD should not be broken if there is enough space.
- The content of a TD should be broken if there is not enough space.
- The TD which has the biggest smallest width defines the smallest width for the whole column.
- Each TD can have a specified width which will set the width of the whole column if no other TD in the column has a greater smallest width.
- TD can span over more than one column. All the rules above must still be applied. The rowspan can be greater than the actual number of columns.
- A width can be specified for the whole table, and if it is greater than the smallest width of all columns, the columns must be expanded.
- If the specified width of the table is smaller than the smallest width of all columns, the table width is expanded. (I have found out that IE shrinks the columns instead...)
- Heights must be treated in a similar way as widths.
- TD can span over more than one row.
- Cellpadding, border thickness etc...
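The width rules above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the algorithm used in CHTMLViewer: the `Cell` struct and `ColumnWidths` function are hypothetical, column spans, padding and borders are left out, and extra table width is shared evenly between columns, which the article does not specify.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cell description for this sketch.
struct Cell {
    int minWidth;        // smallest width the content can be broken down to
    int specifiedWidth;  // width attribute in pixels, 0 if absent
};

// rows[r][c] is the cell in row r, column c. Every row is assumed to
// have the same number of cells (no colspan in this sketch).
std::vector<int> ColumnWidths(const std::vector<std::vector<Cell>>& rows,
                              int tableWidth /* 0 if not specified */) {
    assert(!rows.empty());
    const std::size_t columns = rows[0].size();
    std::vector<int> widths(columns, 0);

    for (const auto& row : rows) {
        assert(row.size() == columns);
        for (std::size_t c = 0; c < columns; ++c) {
            // The biggest smallest width wins; a specified width only
            // matters if it is greater than that.
            widths[c] = std::max({widths[c], row[c].minWidth, row[c].specifiedWidth});
        }
    }

    int total = 0;
    for (int w : widths) total += w;

    // A wider specified table width expands the columns. A narrower one
    // is ignored, so the table itself ends up wider than specified.
    if (tableWidth > total && columns > 0) {
        int extra = tableWidth - total;
        for (std::size_t c = 0; c < columns; ++c) {
            int share = extra / static_cast<int>(columns - c);
            widths[c] += share;
            extra -= share;
        }
    }
    return widths;
}
```

Heights could be handled with the same function by swapping rows and columns, which matches the remark that heights must be treated in a similar way as widths.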
There are also a lot of undocumented features in Internet Explorer:
- <P> does have an end tag but is never nested, one level only.
- If the sum of the specified widths of the <TD>s is more than the specified width of the <TABLE>, the <TD>s are shrunk if possible. This is supported by CHTMLViewer, but the same applies to heights, which is not supported.
- <DIV>, <TD>, <TR>, <P> do not require end tags.
- A <HR> breaks a <P>.
- Plus a lot more...
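The paragraph behaviour in the list above (one level only, no required end tag, broken by <HR>) can be sketched as a pass over a token stream. This is an illustration only: the function name `CloseImplicitParagraphs` and the token format (upper-case tag names without angle brackets) are assumptions, and dropping a stray end tag is a choice made for this sketch, not something the article states.

```cpp
#include <string>
#include <vector>

// Takes tokens such as "P", "/P", "HR" (any other string is passed
// through untouched) and returns the stream with the implied "/P"
// tokens made explicit.
std::vector<std::string> CloseImplicitParagraphs(const std::vector<std::string>& tokens) {
    std::vector<std::string> out;
    bool paragraphOpen = false;
    for (const std::string& t : tokens) {
        if (t == "P" || t == "HR") {
            // Paragraphs never nest, and a rule breaks an open paragraph.
            if (paragraphOpen) {
                out.push_back("/P");
                paragraphOpen = false;
            }
            if (t == "P") paragraphOpen = true;
            out.push_back(t);
        } else if (t == "/P") {
            // A stray "/P" with no open paragraph is dropped (assumption).
            if (paragraphOpen) {
                out.push_back(t);
                paragraphOpen = false;
            }
        } else {
            out.push_back(t);
        }
    }
    if (paragraphOpen) out.push_back("/P");
    return out;
}
```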
The following table shows which tags and properties are supported. Anything not listed is not supported.
Tag | Properties |
<BR> | Break. clear = all: clears all aligned tables and images. |
<I> </I> | Italic, no properties |
<B> </B> <STRONG> </STRONG> | Bold, no properties |
<U> </U> | Underline, no properties |
<A> </A> | Anchor. href = URL |
<SMALL> </SMALL> | Small text, no properties |
<BIG> </BIG> | Big text, no properties |
<CENTER> </CENTER> | Centered content, no properties |
<P> </P> | Paragraph. align = alignment: left, right, justify or center |
<FONT> </FONT> | Font. face = font name; size = font size; color = font color |
<IMG> </IMG> | Image. src = source URL; width = image width; height = image height; align = floating alignment, can be left or right |
<H1> </H1> <H2> </H2> <H3> </H3> | Headings, no properties |
<DIV> </DIV> | Divider. bgcolor = fill color; width = divider width; border = divider border; align = divider align, can be center, justify or right, left is default |
<HR> | Horizontal line, no properties |
<TABLE> </TABLE> | Table. bgcolor = fill color; width = table width, can be percentage; height = table height; border = table border; cellpadding = cell padding; cellspacing = cell spacing; background = background image; align = floating alignment, can be left or right, center is treated as a <center> tag |
<TR> | Table row. bgcolor = background color; align = row alignment: center, left, right or justify |
<TD> </TD> | Table data. bgcolor = fill color; width = TD width; height = TD height; colspan = column span; rowspan = row span; align = TD align, can be center, justify or right, left is default; valign = TD vertical align, can be center or bottom, top is default; background = URL to background image; nowrap, nobreak = no wrap |
<PRE> </PRE> | Unparsed, no properties |
<INPUT> | Input control. type = control type, can be password, button, submit, radio, checkbox, hidden, image or text, which is default; size = size of text box; maxlength = maximum number of characters in text box; value = value of control; name = name of control; src = source of image control; width = width of image control; height = height of image control |
<SELECT> | Dropdown box control. name = name of control |
<OPTION> | Dropdown box item. value = value of item if not title |
<SUB> </SUB> | Sub-text, no properties |
<SUP> </SUP> | Sup-text, no properties |
<STRIKE> </STRIKE> | Strike-text, no properties |
<TEXTAREA> </TEXTAREA> | Multiline text box. name = control name; rows = visible rows; cols = columns |
<FORM> </FORM> | Form. method = method, get or post; action = URL to post data to |
<UL> </UL> | Bullet list, no properties |
<OL> </OL> | Number list, no properties |
<LI> | List item in a number or bullet list, no properties |
<BODY> | Document body. bgcolor = background color; background = URL to background image |
<TITLE> | Document title to be displayed in window header |
Using the code
The class to use is CHTMLViewer. It makes callbacks to CHTMLProgress, which the user of CHTMLViewer has to implement. CHTMLProgress receives notifications of parsing status, title, cursor and links. It is also used for fetching images referenced by URLs.
Relative URLs are handled by providing the class with the URL of the current page.
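How a relative URL can be combined with the URL of the current page is sketched below. This is a simplified illustration, not the code inside CHTMLViewer: the function name `ResolveUrl` is an assumption, and the sketch deliberately ignores "../" segments, query strings and fragments.

```cpp
#include <string>

// Resolve `relative` against `base`. Covers three cases only: absolute
// URLs, host-relative paths ("/img/a.gif") and document-relative paths
// ("img/a.gif").
std::string ResolveUrl(const std::string& base, const std::string& relative) {
    // Already absolute: nothing to resolve.
    if (relative.find("://") != std::string::npos) return relative;

    const std::string::size_type schemeEnd = base.find("://");
    if (schemeEnd == std::string::npos) return relative;  // unusable base

    const std::string::size_type hostStart = schemeEnd + 3;
    const std::string::size_type pathStart = base.find('/', hostStart);

    if (!relative.empty() && relative[0] == '/') {
        // Host-relative: keep scheme and host, replace the whole path.
        return base.substr(0, pathStart) + relative;
    }

    // Document-relative: keep everything up to and including the last '/'.
    if (pathStart == std::string::npos) return base + "/" + relative;
    const std::string::size_type lastSlash = base.rfind('/');
    return base.substr(0, lastSlash + 1) + relative;
}
```

With an illustrative base of `http://example.com/docs/index.htm`, the relative URL `img/logo.gif` resolves inside `/docs/`, while `/img/logo.gif` resolves from the root of the host.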
//
// Step 1: Create the class and provide it with a pointer
//         to your CHTMLProgress implementation:
//
CHTMLViewer *pHTMLViewer = new CHTMLViewer(g_pProgress);
//
// Step 2: Provide the class with the HTML, the default font and
//         font size, the default font color and the current URL.
//
pHTMLViewer->SetHTML(szHTMLData, "Verdana", 15, RGB(0,0,0), szUrl);
//
// Step 3: When SetHTML has been called, or whenever the size
//         of the display is changed, the positions of the items in
//         the document have to be recalculated.
//         Pass a RECT to the function with the dimensions of the
//         display. The RECT will receive the dimensions of the whole
//         document:
RECT r; // fill in the display dimensions before the call
pHTMLViewer->CalcPositions(r);
//
// Step 4: Draw the HTML on a device context. Pass the display size
//         and scroll positions:
pHTMLViewer->Draw(hDC, r, nXScrollPos, nYScrollPos);
These are the functions in CHTMLProgress that you have to implement:
//
// Display processing. If using blocking image load, the page is
// done when nPos==nSize
virtual void Reading(BOOL bImage, int nPos, int nSize) = 0;
//
// This function is called from CHTMLViewer::OnMouseClick when the
// mouse position matches a link. Remember to offset the mouse position
// according to the scroll position!
virtual void LinkClicked(char *szLink) = 0;
//
// This function is called from CHTMLViewer::OnMouseOver to inform if
// the mouse is over a link
virtual void SetCursor(BOOL bLink, char *szLink) = 0;
//
// Set the title of the main window to the value of szTitle
virtual void SetTitle(char *szTitle) = 0;
//
// You have to implement how data is passed to CHTMLViewer.
// This function is only called when images need to be downloaded
virtual char *ReadHTTPData(char *szUrl, int *pnSize) = 0;
//
// An image has finished downloading.
// Call CHTMLViewer::CalcPositions to recalculate
// positions of HTML objects and redraw the entire screen.
// When using non-blocking image
// load, the page is done when nDone==nCount
virtual void ImageLoaded(int nDone, int nCount) = 0;
//////////////////////////////////////////////////////////////////////////
// Control functions, return -1 if no control support.
//
// Enables you to have a list of forms.
// LinkClicked with a link named "Sumit(#Number)"
// is called when a link submits a form,
// the #Number will be the value returned from
// this function
virtual int CreateForm(char *szAction, char *szMethod) = 0;
//
// Create controls. The returned value will be passed
// to the MoveControl function when
// the controls are positioned
virtual int CreateTextBox(int nSize, int nMaxChar, SIZE scSize, char *szText,
                          BOOL bMultiline, char *szName, int nForm, BOOL bPassword) = 0;
virtual int CreateButton(SIZE scSize, char *szText, char *szName, int nForm) = 0;
virtual int CreateRadio(SIZE scSize, char *szValue, char *szName, int nForm) = 0;
virtual int CreateCheckbox(SIZE scSize, char *szValue, char *szName, int nForm) = 0;
virtual int CreateListbox(SIZE scSize, CHTMLStringTable &stOptions, int nSelected,
                          char *szName, int nForm) = 0;
//
// Positioning of controls
virtual void MoveControl(int nControlID, POINT pPos) = 0;
//
// Updates an area of the screen. This function
// is called when an animated GIF changes
// frame or when the mouse pointer is over a link.
virtual void UpdateScreen(RECT &r) = 0;
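The reminder about offsetting the mouse position by the scroll position can be illustrated with a small hit test. This is a sketch under assumptions: `LinkArea` and `HitsLink` are hypothetical names, and it assumes links are stored in document coordinates, so a window position has to be shifted by the scroll offsets before it is compared.

```cpp
// Stand-in for a link's rectangle in document coordinates (hypothetical).
struct LinkArea {
    int left, top, right, bottom;
};

// Convert a mouse position in window coordinates to document coordinates
// by adding the scroll offsets, then test it against the link area.
bool HitsLink(const LinkArea& link, int mouseX, int mouseY, int scrollX, int scrollY) {
    const int docX = mouseX + scrollX;
    const int docY = mouseY + scrollY;
    return docX >= link.left && docX < link.right &&
           docY >= link.top && docY < link.bottom;
}
```

A link 500 pixels down the document is only hit by a click near the top of the window when the view has been scrolled down by about 500 pixels; without the offset the same click would miss it.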
The Demo
The demo project includes an implementation of CHTMLProgress with scrolling. To open a file from disk, you must specify the URL with its protocol, for example: file://c:\folder1\file1.htm. To copy selected text, right-click and select Copy in the floating menu.
The demo exe is compiled with CxImage to allow PNG images, transparency and animations. The code has a #define to switch CxImage on and off. The animations do not work for all images; either there are bugs in CxImage or I don't use it the proper way.
The demo also shows what I think is a cool use of my CHTMLViewer: fully formatted tooltips. Or maybe they are just annoying.
Remarks
Many HTML features are still missing from my classes before they can form an independent open source browser, which I think would be awesome to have. Feel free to implement and optimize them, and keep me posted if you do so.
Here is a list of things that I don't know how to handle. Can anyone help?
- Styles... I have found code that translates HTML 3.2 to HTML with CSS, but not the other way. Is there any open source code that does this?
- The calculation of positions can be slow if the HTML contains deeply nested tables, so some optimization could improve performance.
- MAP/AREA for images.
- JavaScript... hehe...
- CxImage doesn't handle animations correctly.
History
- 2004.11.16
  - Published.
- 2004.11.18
  - Added support for text selection and copying.
  - Added support for <PRE> tags.
- 2004.11.30
  - Integrated CxImage.
  - Added support for a lot more tags and properties.
  - Added support for forms and posting data.
- 2004.12.03
  - Added support for animations.
  - Added support for non-blocking download of images.
- 2004.12.17
  - Demo compiled with newest version of CxImage, which makes most animations work (still not all).
  - Added support for blocking download of images (if wanted).
  - Small background images were very time-consuming to draw on screen, therefore the memory image is enlarged for small images.
- 2005.01.10
  - Added alignment justify.
  - Added support for aligned tables and images.