Word2CHM, Convert a Word Document to a CHM File







Introduction
Word2CHM is an open source C# program that converts a Microsoft Word document (in Word 2000/2003 format) to a CHM document. It requires HTML Help Workshop and Microsoft Word 2003.
This is a screen snapshot.

Background
Many people write customer help documents in Microsoft Word, because Word is well suited to documents that combine text, images and tables.
But many customers do not want to read help documents in Microsoft Word format; they prefer the CHM format. So it is useful to convert a Microsoft Word document to a CHM document. This is why I built Word2CHM.
Word2CHM
In Word2CHM, converting a Microsoft Word document to a CHM document takes three steps. The first is to convert the Microsoft Word document to a single HTML file, the second is to split that single HTML file into multiple HTML files, and the third is to compile the multiple HTML files into a single CHM file.
First, Convert Microsoft Word Document to a Single HTML File
Microsoft Word supports OLE Automation, so a C# program can host a Microsoft Word application, open a Word binary document, and save it as an HTML file.
The following sample C# code hosts a Microsoft Word application.
private bool SaveWordToHtml(string docFileName, string htmlFileName)
{
    // check that the source document exists
    if (System.IO.File.Exists(docFileName) == false)
    {
        this.Alert("File '" + docFileName + "' does not exist!");
        return false;
    }
    // check that the output directory exists
    string dir = System.IO.Path.GetDirectoryName(htmlFileName);
    if (System.IO.Directory.Exists(dir) == false)
    {
        this.Alert("Directory '" + dir + "' does not exist!");
        return false;
    }
    object trueValue = true;
    object falseValue = false;
    object missValue = System.Reflection.Missing.Value;
    object fileNameValue = docFileName;
    // create the Word application instance
    Microsoft.Office.Interop.Word.Application app =
        new Microsoft.Office.Interop.Word.ApplicationClass();
    // make the Word application visible, so that if something goes wrong
    // and this program quits, the user can close Word by hand
    app.Visible = true;
    // open the document (the third argument opens it read-only)
    Microsoft.Office.Interop.Word.Document doc = app.Documents.Open(
        ref fileNameValue,
        ref missValue,
        ref trueValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue);
    // save as a filtered HTML file
    object htmlFileNameValue = htmlFileName;
    object format = Microsoft.Office.Interop.Word.WdSaveFormat.wdFormatFilteredHTML;
    doc.SaveAs(
        ref htmlFileNameValue,
        ref format,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue);
    // close the document, quit Word and release the COM references
    doc.Close(ref falseValue, ref missValue, ref missValue);
    app.Quit(ref falseValue, ref missValue, ref missValue);
    System.Runtime.InteropServices.Marshal.ReleaseComObject(doc);
    System.Runtime.InteropServices.Marshal.ReleaseComObject(app);
    return true;
}
In this C# source code, it is important to call the ReleaseComObject function. By calling ReleaseComObject, a program can release the resources used by the Word application.
Many programs that host the Microsoft Word application (or the Excel application) call the application's Quit function when they no longer need it. But sometimes the Word process stays alive afterwards, which can lead to a very serious resource leak. Using ReleaseComObject can reduce this risk.
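One way to make sure the release always happens, even when opening or saving throws an exception, is to put it in a finally block. The following is only a minimal sketch of that pattern, not part of Word2CHM: the ComCleanup class and its SafeRelease and Run methods are hypothetical names introduced here, and the Word-specific calls are left out so that the sketch stays self-contained.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper, not part of Word2CHM.
public static class ComCleanup
{
    // Releases obj if it is a COM wrapper.
    // Returns true when a release was performed, false otherwise.
    public static bool SafeRelease(object obj)
    {
        if (obj == null || !Marshal.IsComObject(obj))
        {
            return false;
        }
        Marshal.ReleaseComObject(obj);
        return true;
    }

    // Runs work and then releases the given objects in reverse order,
    // whether or not work threw an exception.
    public static void Run(Action work, params object[] comObjects)
    {
        try
        {
            work();
        }
        finally
        {
            for (int i = comObjects.Length - 1; i >= 0; i--)
            {
                SafeRelease(comObjects[i]);
            }
        }
    }
}
```

In SaveWordToHtml, the same idea would mean moving doc.Close, app.Quit and the two ReleaseComObject calls into a finally block, releasing the document before the application.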
Second, Split a Single HTML File into Multiple HTML Files
The HTML file generated by the Word application includes all the content of the Word document. For example, a Word document contains the following content:

When I save this document as a filtered HTML file, the HTML source code is as follows:
<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=gb2312">
<meta name=Generator content="Microsoft Word 11 (filtered)">
<title>Header1</title>
<style>
some style code
</style>
</head>
<body lang=ZH-CN style='text-justify-trim:punctuation'>
<div class=Section1 style='layout-grid:15.6pt'>
<h1><span lang=EN-US>Header1</span></h1>
<p class=MsoNormal><span lang=EN-US>Content1</span></p>
<h2><span lang=EN-US>Header2</span></h2>
<p class=MsoNormal><span lang=EN-US>Content2</span></p>
</div>
</body>
</html>
In this HTML source code, a single div tag contains all the content. Word2CHM needs to split this HTML file into two files.
File0.html
<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=gb2312">
<meta name=Generator content="Microsoft Word 11 (filtered)">
<title>Header1</title>
<style>
--------------
</style>
</head>
<body>
<h1>Header</h1><hr />
<p class=MsoNormal><span lang=EN-US>Content1</span></p>
<hr /><h1>Footer</h1>
</body>
</html>
File1.html
<html>
<head>
<meta http-equiv=Content-Type content="text/html; charset=gb2312">
<meta name=Generator content="Microsoft Word 11 (filtered)">
<title>Header1</title>
<style>
--------------
</style>
</head>
<body>
<h1>Header</h1><hr />
<p class=MsoNormal><span lang=EN-US>Content2</span></p>
<hr /><h1>Footer</h1>
</body>
</html>
Here, the program adds the HTML source “<h1>Header</h1><hr />” in front of the HTML content, and adds “<hr /><h1>Footer</h1>” after it. These additional HTML fragments serve as the header and footer. Word2CHM uses the following C# code to split the HTML file.
int index2 = strBody.IndexOf(">");
int index3 = strBody.IndexOf("</h" + Nativelevel + ">");
// read the text between <hN> and </hN> as the topic title
string strTitle = strBody.Substring(index2 + 1, index3 - index2 - 1);
// strip nested tags (such as <span>) from the title
while (strTitle.IndexOf("<") >= 0)
{
    int index4 = strTitle.IndexOf("<");
    int index5 = strTitle.IndexOf(">", index4);
    strTitle = strTitle.Remove(index4, index5 - index4 + 1);
}
// skip past the closing </hN> tag, which is 5 characters long
strBody = strBody.Substring(index3 + 5);
// the topic content runs up to the next "<h" or to the end of the body
// (note that "<h" also matches tags such as <hr>)
index = strBody.IndexOf("<h");
if (index == -1)
{
    index = strBody.Length;
}
// read the topic content
string strContent = strBody.Substring(0, index);
Using this C# code, Word2CHM splits the HTML file at the heading tags H1, H2, H3 and so on (Hn), and sets each HTML document's title to the text inside the Hn tag.
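The same splitting idea can be sketched as a self-contained function. This is an illustration rather than the code Word2CHM uses: it swaps the IndexOf scanning for a regular expression that matches only h1 to h6, so a tag such as <hr /> is not mistaken for a heading, and the Topic and HeadingSplitter names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container for one split topic.
public sealed class Topic
{
    public int Level;       // 1 for <h1>, 2 for <h2>, ...
    public string Title;    // heading text with nested tags removed
    public string Content;  // HTML between this heading and the next one
}

public static class HeadingSplitter
{
    // Matches <h1>...</h1> through <h6>...</h6>. The backreference \1
    // requires the closing tag to have the same level as the opening tag.
    private static readonly Regex Heading = new Regex(
        @"<h([1-6])\b[^>]*>(.*?)</h\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any tag, used to strip <span> and similar from titles.
    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    public static List<Topic> Split(string body)
    {
        var topics = new List<Topic>();
        MatchCollection matches = Heading.Matches(body);
        for (int i = 0; i < matches.Count; i++)
        {
            Match m = matches[i];
            // Content starts after this heading and ends at the next one.
            int start = m.Index + m.Length;
            int end = (i + 1 < matches.Count) ? matches[i + 1].Index : body.Length;
            topics.Add(new Topic
            {
                Level = int.Parse(m.Groups[1].Value),
                Title = AnyTag.Replace(m.Groups[2].Value, "").Trim(),
                Content = body.Substring(start, end - start).Trim()
            });
        }
        return topics;
    }
}
```

Applied to the body of the sample document above, this yields one topic titled Header1 and one titled Header2, each carrying its own paragraph as content.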
Third, Compile Multiple HTML Files into a Single CHM File
Word2CHM cannot compile multiple HTML files into a single CHM file by itself. It calls HTML Help Workshop, a Microsoft product, to generate the CHM file. HTML Help Workshop compiles multiple HTML files into a CHM file and saves its settings in a help project file with the extension .hhp. Word2CHM uses the following C# code to generate the HHP file.
using (System.IO.StreamWriter myWriter = new System.IO.StreamWriter(
    strHHP,
    false,
    System.Text.Encoding.GetEncoding(936)))
{
    myWriter.WriteLine("[OPTIONS]");
    myWriter.WriteLine("Compiled file=" + System.IO.Path.GetFileName(strCHM));
    myWriter.WriteLine("Contents file=" + System.IO.Path.GetFileName(strHHC));
    myWriter.WriteLine("Default topic=" + this.DefaultTopic);
    myWriter.WriteLine("Default Window=main");
    myWriter.WriteLine("Display compile progress=yes");
    myWriter.WriteLine("Full-text search=" + (this.FullTextSearch ? "Yes" : "No"));
    myWriter.WriteLine("Binary TOC=" + (this.BinaryToc ? "Yes" : "No"));
    myWriter.WriteLine("Auto Index=" + (this.AutoIndex ? "Yes" : "No"));
    myWriter.WriteLine("Binary Index=" + (this.BinaryIndex ? "Yes" : "No"));
    myWriter.WriteLine("Title=" + this.Title);
    myWriter.WriteLine("[FILES]");
    foreach (CHMNode node in nodes)
    {
        if (HasContent(node.Local))
        {
            if (myFiles.Contains(node.Local) == false)
            {
                myFiles.Add(node.Local);
            }
        }
    }
    foreach (string fileName in myFiles)
    {
        myWriter.WriteLine(fileName);
    }
}
Word2CHM also generates an HHC file to describe the topic structure of the CHM file. The HHC file is built as an XML tree; Word2CHM uses the following C# code to generate the HHC XML content.
System.Xml.XmlDocument doc = RootElement.OwnerDocument;
System.Xml.XmlElement ulElement = doc.CreateElement("UL");
RootElement.AppendChild(ulElement);
foreach (CHMNode node in nodes)
{
    System.Xml.XmlElement liElement = doc.CreateElement("LI");
    ulElement.AppendChild(liElement);
    System.Xml.XmlElement objElement = doc.CreateElement("OBJECT");
    liElement.AppendChild(objElement);
    objElement.SetAttribute("type", "text/sitemap");
    AddParamElement(objElement, "Name", node.Name);
    if (HasContent(node.Local))
    {
        AddParamElement(objElement, "Local", node.Local.Replace('\\', '/'));
    }
    if (HasContent(node.ImageNumber))
    {
        AddParamElement(objElement, "ImageNumber", node.ImageNumber);
    }
    if (node.Nodes.Count > 0)
    {
        ToHHCXMLElement(node.Nodes, ulElement);
    }
}
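The AddParamElement helper is called above but not shown. Since HHC sitemap entries store their settings as param elements with name and value attributes, it plausibly looks something like the sketch below. This is an assumption for illustration, not the author's code: the HhcSketch class, the body of AddParamElement and the BuildSingleTopic method are all hypothetical.

```csharp
using System;
using System.Xml;

// Hypothetical sketch, not part of Word2CHM.
public static class HhcSketch
{
    // Assumed shape of the helper: appends <param name="..." value="..." />
    // to the given OBJECT element.
    public static void AddParamElement(XmlElement objElement, string name, string value)
    {
        XmlElement param = objElement.OwnerDocument.CreateElement("param");
        param.SetAttribute("name", name);
        param.SetAttribute("value", value);
        objElement.AppendChild(param);
    }

    // Builds the UL/LI/OBJECT tree for a single topic and returns it as text.
    public static string BuildSingleTopic(string title, string local)
    {
        var doc = new XmlDocument();
        XmlElement ul = doc.CreateElement("UL");
        doc.AppendChild(ul);
        XmlElement li = doc.CreateElement("LI");
        ul.AppendChild(li);
        XmlElement obj = doc.CreateElement("OBJECT");
        li.AppendChild(obj);
        obj.SetAttribute("type", "text/sitemap");
        AddParamElement(obj, "Name", title);
        // HHC paths use forward slashes, as in the code above.
        AddParamElement(obj, "Local", local.Replace('\\', '/'));
        return ul.OuterXml;
    }
}
```

Calling HhcSketch.BuildSingleTopic("Header1", "topics\\File0.html") returns one UL element holding a single LI entry whose OBJECT carries the Name and Local parameters.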
After generating the HHP file and the HHC file, Word2CHM calls hhc.exe with the HHP file to generate the CHM file. hhc.exe is usually found in the directory “C:\Program Files\HTML Help Workshop”. The following C# code generates the CHM file.
ProcessStartInfo start = new ProcessStartInfo(compilerExeFileName, "\"" + strHHP + "\"");
// run the compiler without a console window and capture its output
start.UseShellExecute = false;
start.CreateNoWindow = true;
start.RedirectStandardOutput = true;
start.WindowStyle = System.Diagnostics.ProcessWindowStyle.Minimized;
System.Diagnostics.Process proc = System.Diagnostics.Process.Start(start);
proc.PriorityClass = System.Diagnostics.ProcessPriorityClass.BelowNormal;
// ReadToEnd returns once the compiler closes its output stream
this.strOutputText = proc.StandardOutput.ReadToEnd();
proc.WaitForExit();
With these three steps, Word2CHM converts a Word document to a CHM file.
History
- 17th November, 2010: Initial post