HtmlHelp library and example viewer

11 Aug 2004, CPOL
A class library for reading compiled HTML help (chm) files and a sample viewer application using this library.
This file describes the format of the dump file. The data stored in the dump file is dynamic and depends on
the flags you have set in the DumpingInfo class.

Using the dump file can speed up the loading process, especially if the CHM file contains a large sitemap table
of contents/index file (~800KB may take up to 10 sec on 1.8GHz CPUs). Switching from the sitemap format
to the dump file can cut the load time by over 90% (the same TOC is read from the dump in <0.5 sec on 1.8GHz CPUs).
If your CHM contains a large binary index/TOC, using the dump file approximately halves the load time.
If you use dumping for files with a large number of index/TOC entries, it may improve performance further to add
the #URLSTR and #STRINGS files to the dump (using the dumping flags).

Examples:
Test done on: Intel Xeon 3.06GHz HT, 1GB Ram, WinXP, SCSI-HDs
DumpingFlags set: DumpingFlags.DumpBinaryTOC | DumpingFlags.DumpTextTOC | 
				  DumpingFlags.DumpTextIndex | DumpingFlags.DumpBinaryIndex | 
				  DumpingFlags.DumpUrlStr | DumpingFlags.DumpStrings
				  
Dump compression: DumpCompression.Medium

DirectX9 SDK CHM (binary index and binary TOC):
	Read time without dump:   --- HtmlHelp file read in 00:00:02.1874784
	Write time of dump data*: --- Dump written in 00:00:01.0937360        (dump file size: ~780KB)

	Read time with dump:      --- HtmlHelp file read in 00:00:00.7499904  (dump file size: ~780KB)
	Net read time of dump:    --- Dump read in 00:00:00.5781176           (dump file size: ~780KB)


CHM with a ~900KB sitemap TOC (binary index and text-based TOC):
	Read time without dump:   --- HtmlHelp file read in 00:00:05.7811760  (slow RegEx parsing :( )
	Write time of dump data*: --- Dump written in 00:00:00.5156184        (dump file size: ~300KB)

	Read time with dump:      --- HtmlHelp file read in 00:00:00.3437456  (dump file size: ~300KB)
	Net read time of dump:    --- Dump read in 00:00:00.2187472           (dump file size: ~300KB)

* the write time of the dump is not included in the "Read time without dump" timespan.

I think the two examples above show how data dumping can speed up the loading process.
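
To put the measurements above into relative terms, here is a small Python sketch that computes the load-time reduction
from the "Read time without dump" and "Read time with dump" values. The helper names are made up for this illustration;
only the timings come from the tests above.

```python
def seconds(timespan):
    """Convert an "HH:MM:SS.fffffff" timespan string to seconds."""
    hours, minutes, secs = timespan.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(secs)


def reduction_percent(without_dump, with_dump):
    """Relative load-time reduction in percent."""
    before, after = seconds(without_dump), seconds(with_dump)
    return (before - after) / before * 100.0


# Timings taken from the two test runs listed above
directx = reduction_percent("00:00:02.1874784", "00:00:00.7499904")
sitemap = reduction_percent("00:00:05.7811760", "00:00:00.3437456")

print(f"DirectX9 SDK CHM (binary TOC): {directx:.1f}% less load time")
print(f"CHM with sitemap TOC:          {sitemap:.1f}% less load time")
```

For these two runs the reduction is roughly 66% for the binary TOC and roughly 94% for the sitemap TOC.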

Another advantage of dump files is that you save a few MB of memory, because only the necessary fields
are stored in and loaded from the dump file. In addition, the size of each string is known when reading the dump,
so the .NET Framework can allocate the string instance with the right initial size.

The dumpfile starts with the following header:

BYTE		size of signature text (n)
BYTEs		n bytes which form the signature string
DWORD		compression level

compression level		description
------------------------------------------------------------------
	0					None, no compression stream is used
	1					Minimum compression
	2					Medium compression
	3					Maximum compression
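
A minimal Python sketch of reading this header, assuming the DWORD is a little-endian 32-bit value and the
signature length fits in the single BYTE described above. The function name and the "DEMO" signature are
placeholders for this illustration; the real signature text is not stated in this file.

```python
import io
import struct

COMPRESSION_LEVELS = {0: "None", 1: "Minimum", 2: "Medium", 3: "Maximum"}


def read_dump_header(stream):
    """Read the uncompressed header: signature bytes and compression level."""
    (sig_len,) = struct.unpack("<B", stream.read(1))
    signature = stream.read(sig_len)
    (level,) = struct.unpack("<I", stream.read(4))  # assumed little-endian
    if level not in COMPRESSION_LEVELS:
        raise ValueError(f"unknown compression level {level}")
    return signature, level


# Demo with a made-up signature and compression level 2 (Medium)
sample = bytes([4]) + b"DEMO" + struct.pack("<I", 2)
signature, level = read_dump_header(io.BytesIO(sample))
print(signature, COMPRESSION_LEVELS[level])
```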


Depending on the compression level, the following block differs (see ChmDecoding\DumpingInfo.cs, line 312
for writing and line 369 for reading).
If compression level = 0 (no compression), the following format can be read 1:1 from the
dump file. If compression level > 0, you have to create an InflaterInputStream instance
from the ICSharpCode.SharpZipLib library and attach it to the current file stream.
Reading through this inflater stream yields the data described below (the inflater
performs the decompression):
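
Outside of .NET the same step can be sketched as follows. This is an assumption-laden example: it assumes the
compressed block is a zlib-wrapped deflate stream, so that Python's zlib module can read it. If the data turns out
to be raw deflate, zlib.decompress(rest, wbits=-15) would be needed instead. The function name open_payload is
made up for this sketch.

```python
import io
import struct
import zlib


def open_payload(stream):
    """Skip the header and return a stream over the (decompressed) payload."""
    (sig_len,) = struct.unpack("<B", stream.read(1))
    stream.read(sig_len)                            # signature, not checked here
    (level,) = struct.unpack("<I", stream.read(4))  # compression level
    rest = stream.read()
    if level > 0:
        rest = zlib.decompress(rest)                # assumes zlib framing
    return io.BytesIO(rest)


# Demo: the same payload stored uncompressed (level 0) and compressed (level 2)
body = b"payload bytes"
plain = bytes([4]) + b"DEMO" + struct.pack("<I", 0) + body
packed = bytes([4]) + b"DEMO" + struct.pack("<I", 2) + zlib.compress(body)

plain_result = open_payload(io.BytesIO(plain)).read()
packed_result = open_payload(io.BytesIO(packed)).read()
print(plain_result == packed_result)
```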

BYTE			size of signature text (n) (same as in the header above!)
BYTEs			n bytes which form the signature string (same as above)
BYTE			size of timestamp string (m)
BYTEs			m bytes which form the date/time of the last write access to the CHM
				The string has the following format: "dd.MM.yyyy HH:mm:ss.ffffff"
DWORD			ANSI codepage of the encoding used
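
The fields above could be read like this in Python. The sketch assumes single-byte length prefixes as documented,
a little-endian DWORD, and UTF-8 compatible string bytes (the string encoding is not stated in this file). The
function names and the sample values are illustrative only.

```python
import io
import struct
from datetime import datetime


def read_string(stream, encoding="utf-8"):
    """Read a BYTE length prefix followed by that many string bytes."""
    (n,) = struct.unpack("<B", stream.read(1))
    return stream.read(n).decode(encoding)


def read_preamble(stream):
    """Read signature, CHM last-write timestamp and ANSI codepage."""
    signature = read_string(stream)
    stamp = read_string(stream)
    # "dd.MM.yyyy HH:mm:ss.ffffff" expressed as a strptime pattern
    last_write = datetime.strptime(stamp, "%d.%m.%Y %H:%M:%S.%f")
    (codepage,) = struct.unpack("<I", stream.read(4))
    return signature, last_write, codepage


def pack_string(text):
    raw = text.encode("utf-8")
    return bytes([len(raw)]) + raw


# Demo with made-up values
sample = (pack_string("DEMO")
          + pack_string("11.08.2004 13:45:30.123456")
          + struct.pack("<I", 1252))
signature, last_write, codepage = read_preamble(io.BytesIO(sample))
print(signature, last_write.isoformat(), codepage)
```

Comparing last_write with the current last-write time of the CHM file is one way to detect a stale dump.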

BYTE			boolean flag (0=false, 1=true) specifying if the dump should contain the data
				of the #STRINGS file.
	
	if the previous flag returns TRUE
		BYTE		boolean flag (0=false, 1=true) specifying if the #STRINGS data is supported by the CHM
		
		if the previous flag returns TRUE the dump of #STRINGS follows
			DWORD	number of dictionary pairs (key, value) (cnt)
			
			for each dictionary entry (cnt times)
				DWORD	offset of the string entry in the #STRINGS file (=key)
				BYTE	size of following string (n)
				BYTEs	n bytes which form the value string
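
The two nested flags and the dictionary can be read as sketched below. Note that the second flag only exists in
the stream if the first one is true. The helper names are hypothetical, and the sketch assumes little-endian
DWORDs and UTF-8 compatible string bytes.

```python
import io
import struct


def read_byte(stream):
    return struct.unpack("<B", stream.read(1))[0]


def read_bool(stream):
    return read_byte(stream) != 0


def read_dword(stream):
    return struct.unpack("<i", stream.read(4))[0]


def read_string(stream, encoding="utf-8"):
    return stream.read(read_byte(stream)).decode(encoding)


def read_strings_section(stream):
    """Return the #STRINGS dictionary (offset -> string), or None if absent."""
    if not read_bool(stream):   # dump contains #STRINGS data?
        return None
    if not read_bool(stream):   # CHM supports #STRINGS?
        return None
    table = {}
    for _ in range(read_dword(stream)):
        offset = read_dword(stream)          # key
        table[offset] = read_string(stream)  # value
    return table


def pack_string(text):
    raw = text.encode("utf-8")
    return bytes([len(raw)]) + raw


# Demo: both flags true, two entries
sample = (bytes([1, 1]) + struct.pack("<i", 2)
          + struct.pack("<i", 1) + pack_string("Welcome")
          + struct.pack("<i", 9) + pack_string("Index"))
strings = read_strings_section(io.BytesIO(sample))
skipped = read_strings_section(io.BytesIO(bytes([0])))
print(strings, skipped)
```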
		
		
BYTE			boolean flag (0=false, 1=true) specifying if the dump should contain the data
				of the #URLSTR file.
	
	if the previous flag returns TRUE
		BYTE		boolean flag (0=false, 1=true) specifying if the #URLSTR data is supported by the CHM
		
		if the previous flag returns TRUE the dump of #URLSTR follows		
			DWORD	number of dictionary pairs for urls (key, value) (cnt)
			
			for each dictionary entry (cnt times)
				DWORD	offset of the string entry in the #STRINGS file (=key)
				BYTE	size of following string (n)
				BYTEs	n bytes which form the value string
			
			DWORD	number of dictionary pairs for frame names (key, value) (cnt)
			
			for each dictionary entry (cnt times)
				DWORD	offset of the string entry in the #STRINGS file (=key)
				BYTE	size of following string (n)
				BYTEs	n bytes which form the value string
		
BYTE			boolean flag (0=false, 1=true) specifying if the dump should contain the data
				of the #URLTBL file.
	
	if the previous flag returns TRUE
		BYTE		boolean flag (0=false, 1=true) specifying if the #URLTBL data is supported by the CHM
		
		if the previous flag returns TRUE the dump of #URLTBL follows
			DWORD	number of urltable entries (cnt)
			
			for each urltable entry (cnt times)
				DWORD	offset into urlstr file
				DWORD	offset of this entry
				DWORD	index into topics file
				DWORD	offset into urlstr file
		
BYTE			boolean flag (0=false, 1=true) specifying if the dump should contain the data
				of the #TOPICS file.
	
	if the previous flag returns TRUE
		BYTE		boolean flag (0=false, 1=true) specifying if the #TOPICS data is supported by the CHM
		
		if the previous flag returns TRUE the dump of #TOPICS follows
			DWORD	number of topic entries (cnt)
			
			for each topic entry (cnt times)
				DWORD	offset of the entry
				DWORD	offset into tocidx file (binary toc)
				DWORD	offset into strings for the topic title
				DWORD	offset into urltable
				DWORD	visibility mode
				DWORD	unknown mode
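
Because every topic entry consists of six DWORDs, the entries can be unpacked as fixed-size records. A sketch,
assuming signed little-endian DWORDs; the field names are my own labels for the six values listed above.

```python
import io
import struct
from collections import namedtuple

TopicEntry = namedtuple(
    "TopicEntry",
    "entry_offset tocidx_offset title_offset urltable_offset "
    "visibility_mode unknown_mode",
)
TOPIC_RECORD = struct.Struct("<6i")  # six DWORDs per entry


def read_bool(stream):
    return struct.unpack("<B", stream.read(1))[0] != 0


def read_topics_section(stream):
    """Return the list of topic entries, or an empty list if absent."""
    # The second flag is only present when the first one is true.
    if not read_bool(stream) or not read_bool(stream):
        return []
    (count,) = struct.unpack("<i", stream.read(4))
    return [
        TopicEntry(*TOPIC_RECORD.unpack(stream.read(TOPIC_RECORD.size)))
        for _ in range(count)
    ]


# Demo with two made-up entries
sample = (bytes([1, 1]) + struct.pack("<i", 2)
          + TOPIC_RECORD.pack(0, 0, 10, 0, 6, 0)
          + TOPIC_RECORD.pack(16, 4096, 24, 12, 6, 0))
topics = read_topics_section(io.BytesIO(sample))
print(topics[1])
```

The #URLTBL entries (four DWORDs each) can be handled the same way with a "<4i" record.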
		
BYTE			boolean flag (0=false, 1=true) specifying if the dump should contain the data
				of the $FIftiMain (Full-text search) file.
	
	if the previous flag returns TRUE
		BYTE		boolean flag (0=false, 1=true) specifying if the $FIftiMain data is supported by the CHM
		
		if the previous flag returns TRUE the dump of $FIftiMain follows
			Header of full-text engine
				DWORD	number of index files
				DWORD	root offset
				DWORD	page count
				DWORD	depth of the tree
				DWORD	scale for document index
				DWORD	root for document index
				DWORD	scale for code count
				DWORD	root for code count
				DWORD	scale for location codes
				DWORD	root for location codes
				DWORD	size of the index/leaf nodes
				DWORD	length of longest word
				DWORD	total number of words
				DWORD	total number of unique words
			End of header
			
			DWORD	number of bytes following (binary full-text index) (n)
			BYTEs	n Bytes which represent the binary full-text index byte array
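
A sketch of reading the full-text block: 14 header DWORDs, followed by a DWORD length and the raw index bytes.
It assumes signed little-endian DWORDs; the field names are shorthand for the descriptions above and are not
taken from the library.

```python
import io
import struct

HEADER_FIELDS = (
    "index_file_count", "root_offset", "page_count", "tree_depth",
    "doc_index_scale", "doc_index_root",
    "code_count_scale", "code_count_root",
    "loc_codes_scale", "loc_codes_root",
    "node_size", "longest_word_length", "total_words", "unique_words",
)
HEADER = struct.Struct(f"<{len(HEADER_FIELDS)}i")


def read_bool(stream):
    return struct.unpack("<B", stream.read(1))[0] != 0


def read_fulltext_section(stream):
    """Return (header dict, index bytes), or None if the block is absent."""
    if not read_bool(stream) or not read_bool(stream):
        return None
    header = dict(zip(HEADER_FIELDS, HEADER.unpack(stream.read(HEADER.size))))
    (size,) = struct.unpack("<i", stream.read(4))
    index_bytes = stream.read(size)
    if len(index_bytes) != size:
        raise ValueError("truncated full-text index")
    return header, index_bytes


# Demo with made-up header values 1..14 and a 3-byte index
sample = (bytes([1, 1]) + HEADER.pack(*range(1, 15))
          + struct.pack("<i", 3) + b"\xaa\xbb\xcc")
header, index_bytes = read_fulltext_section(io.BytesIO(sample))
print(header["unique_words"], index_bytes)
```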
		
BYTE			boolean flag (0=false, 1=true) specifying if a table of contents is in the dump file

	if previous flag returns TRUE
		DWORD		number of TOC items (n)
		
		A: for each toc item (n times)
			DWORD		toc mode (0 = text based toc, 1 = binary toc)
			DWORD		offset into topics file
			BYTE		size of the following string
			BYTEs		string which forms the name of the toc item
			
			if toc mode = text based and topics offset < 0
				BYTE	size of the following string
				BYTEs	string which forms the local of the toc item
			
			DWORD		image index of this toc item
			BYTE		size of the following string
			BYTEs		string which forms a merge link (e.g. xy.chm::/toc.hhc)
			DWORD		number of information type associations (o)
			
			for each information type association (o times)
				BYTE	size of following string
				BYTEs	string which forms the information type string
				
			DWORD		number of child toc items (m)
			
			for each child toc item (m times) 
				each child item is a toc item, so repeat reading from mark (A:)
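
Since each child is again a TOC item, the natural way to read this block is recursion. The following Python
sketch mirrors the layout from mark (A:); it assumes signed little-endian DWORDs (the topics offset can be
negative) and UTF-8 compatible strings. All function names and dictionary keys are illustrative.

```python
import io
import struct


def read_byte(stream):
    return struct.unpack("<B", stream.read(1))[0]


def read_dword(stream):
    return struct.unpack("<i", stream.read(4))[0]  # signed: offset may be < 0


def read_string(stream, encoding="utf-8"):
    return stream.read(read_byte(stream)).decode(encoding)


def read_toc_item(stream):
    item = {"mode": read_dword(stream)}            # 0 = text based, 1 = binary
    item["topics_offset"] = read_dword(stream)
    item["name"] = read_string(stream)
    if item["mode"] == 0 and item["topics_offset"] < 0:
        item["local"] = read_string(stream)
    item["image_index"] = read_dword(stream)
    item["merge_link"] = read_string(stream)
    item["info_types"] = [read_string(stream) for _ in range(read_dword(stream))]
    # Children use the same layout, so recurse (mark A:)
    item["children"] = [read_toc_item(stream) for _ in range(read_dword(stream))]
    return item


def read_toc(stream):
    if read_byte(stream) == 0:                     # no TOC in the dump
        return []
    return [read_toc_item(stream) for _ in range(read_dword(stream))]


def dword(value):
    return struct.pack("<i", value)


def string(text):
    raw = text.encode("utf-8")
    return bytes([len(raw)]) + raw


# Demo: one text-based root item with one binary child
child = (dword(1) + dword(16) + string("Child")
         + dword(1) + string("") + dword(0) + dword(0))
root = (dword(0) + dword(-1) + string("Root") + string("root.htm")
        + dword(0) + string("") + dword(0) + dword(1) + child)
toc = read_toc(io.BytesIO(bytes([1]) + dword(1) + root))
print(toc[0]["name"], "->", toc[0]["children"][0]["name"])
```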
			
BYTE			boolean flag (0=false, 1=true) specifying if an index is in the dump file

	if previous flag returns TRUE	
		DWORD		number of index items (ALinks) (n)
		
		for each index item (n times)
			BYTE		size of following string
			BYTEs		bytes which form the keyword string
			BYTE		boolean flag, true if the index item is a see also keyword
			DWORD		index indent
			DWORD		number of information type associations (o)
			
			for each information type association (o times)
				BYTE	size of following string
				BYTEs	string which forms the information type string
				
			DWORD		number of see-also strings (cnt)
			
			for each see-also string (cnt times)
				BYTE	size of the string
				BYTEs	bytes forming a see-also keyword
			
			DWORD		number of topic entries (cnt)
			
			for each topic entry (cnt times)
				DWORD	topic mode (0...text-based, 1...binary)
				
				if topic mode = 0
					BYTE	size of following string
					BYTEs	bytes forming the title string
					BYTE	size of following string
					BYTEs	bytes forming the topic local
				
				if topic mode = 1
					DWORD	offset into topics file
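
An index item can be read along the same lines. The only branch is the topic mode: text-based topics carry a
title and a local, binary topics carry an offset into the topics file. This is a sketch under the same
assumptions as before (little-endian DWORDs, UTF-8 compatible strings, made-up names).

```python
import io
import struct


def read_byte(stream):
    return struct.unpack("<B", stream.read(1))[0]


def read_dword(stream):
    return struct.unpack("<i", stream.read(4))[0]


def read_string(stream, encoding="utf-8"):
    return stream.read(read_byte(stream)).decode(encoding)


def read_index_item(stream):
    item = {"keyword": read_string(stream)}
    item["is_see_also"] = read_byte(stream) != 0
    item["indent"] = read_dword(stream)
    item["info_types"] = [read_string(stream) for _ in range(read_dword(stream))]
    item["see_also"] = [read_string(stream) for _ in range(read_dword(stream))]
    item["topics"] = []
    for _ in range(read_dword(stream)):
        if read_dword(stream) == 0:                # text-based topic
            title = read_string(stream)
            local = read_string(stream)
            item["topics"].append({"title": title, "local": local})
        else:                                      # binary topic
            item["topics"].append({"topics_offset": read_dword(stream)})
    return item


def read_index(stream):
    if read_byte(stream) == 0:                     # no index in the dump
        return []
    return [read_index_item(stream) for _ in range(read_dword(stream))]


def dword(value):
    return struct.pack("<i", value)


def string(text):
    raw = text.encode("utf-8")
    return bytes([len(raw)]) + raw


# Demo: one keyword with a text-based and a binary topic
entry = (string("alpha") + bytes([0]) + dword(0) + dword(0) + dword(0)
         + dword(2)
         + dword(0) + string("Alpha topic") + string("alpha.htm")
         + dword(1) + dword(32))
index = read_index(io.BytesIO(bytes([1]) + dword(1) + entry))
print(index[0]["keyword"], index[0]["topics"])
```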
					
DWORD	number of information types in dump (n)

for each information type (n times)
	DWORD	information type mode
	BYTE	size of following string
	BYTEs	string which represents the name of the information type
	BYTE	size of following string
	BYTEs	string which represents the description of the information type
	
DWORD	number of categories in dump (n)

for each category (n times)
	BYTE	size of following string
	BYTEs	string which represents the name of the category
	BYTE	size of following string
	BYTEs	string which represents the description of the category
	DWORD	number of information types assigned to this category (m)
	
	for each information type (m times)
		BYTE	size of following string
		BYTEs	string which represents the name of the information type
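
The two trailing lists have no leading flags, so they can be read back to back. A short sketch with hypothetical
function names, assuming little-endian DWORDs and UTF-8 compatible strings:

```python
import io
import struct


def read_dword(stream):
    return struct.unpack("<i", stream.read(4))[0]


def read_string(stream, encoding="utf-8"):
    (n,) = struct.unpack("<B", stream.read(1))
    return stream.read(n).decode(encoding)


def read_information_types(stream):
    types = []
    for _ in range(read_dword(stream)):
        mode = read_dword(stream)
        name = read_string(stream)
        description = read_string(stream)
        types.append({"mode": mode, "name": name, "description": description})
    return types


def read_categories(stream):
    categories = []
    for _ in range(read_dword(stream)):
        name = read_string(stream)
        description = read_string(stream)
        assigned = [read_string(stream) for _ in range(read_dword(stream))]
        categories.append(
            {"name": name, "description": description, "info_types": assigned}
        )
    return categories


def dword(value):
    return struct.pack("<i", value)


def string(text):
    raw = text.encode("utf-8")
    return bytes([len(raw)]) + raw


# Demo: one information type, one category that references it
sample = (dword(1) + dword(0) + string("Beginner") + string("Intro topics")
          + dword(1) + string("Level") + string("Skill level")
          + dword(1) + string("Beginner"))
stream = io.BytesIO(sample)
info_types = read_information_types(stream)
categories = read_categories(stream)
print(info_types, categories)
```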
		

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)