Word2CHM, Convert a Word Document to a CHM File

17 Nov 2010
word2chm.gif

Introduction

Word2CHM is an open-source C# program that converts a Microsoft Word document (in Word 2000/2003 format) to a CHM document. It requires HTML Help Workshop and Microsoft Word 2003.

This is a screen snapshot.

word2chm-snapshot.jpg

Background

Many people write customer help documents in Microsoft Word, because Word is well suited to documents that combine text, images, and tables.

However, many customers do not want to read help documents in Word format and prefer the CHM format instead. Converting a Word document to a CHM document is therefore useful, and that is why I built Word2CHM.

Word2CHM

Word2CHM converts a Microsoft Word document to a CHM document in three steps. First, it converts the Word document to a single HTML file. Second, it splits that HTML file into multiple HTML files. Third, it compiles those HTML files into a single CHM file.

First, Convert a Microsoft Word Document to a Single HTML File

Microsoft Word supports OLE Automation, so a C# program can host the Word application, open a binary Word document, and save it as an HTML file.

The following sample C# code hosts the Word application.

private bool SaveWordToHtml(string docFileName, string htmlFileName)
{
    // check doc file name
    if (System.IO.File.Exists(docFileName) == false )
    {
        this.Alert("File '" + docFileName + "' not exist!");
        return false;
    }
    // check output directory
    string dir = System.IO.Path.GetDirectoryName(htmlFileName);
    if (System.IO.Directory.Exists(dir) == false )
    {
        this.Alert("Directory '" + dir + "' not exist!");
        return false;
    }

    object trueValue = true;
    object falseValue = false;
    object missValue = System.Reflection.Missing.Value;
    object fileNameValue = docFileName;

    // create word application instance
    Microsoft.Office.Interop.Word.Application app = 
        new Microsoft.Office.Interop.Word.ApplicationClass();
    // make the Word application visible, so that if something goes wrong
    // and this program quits, the user can close Word manually
    app.Visible = true;
    // open document
    Microsoft.Office.Interop.Word.Document doc = app.Documents.Open(
        ref fileNameValue,
        ref missValue,
        ref trueValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue);
    // save a html file
    object htmlFileNameValue = htmlFileName;
    object format = Microsoft.Office.Interop.Word.WdSaveFormat.wdFormatFilteredHTML;
    doc.SaveAs(
        ref htmlFileNameValue , 
        ref format,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue);

    // close document and release resource
    doc.Close(ref falseValue, ref missValue, ref missValue);
    app.Quit(ref falseValue, ref missValue, ref missValue);

    System.Runtime.InteropServices.Marshal.ReleaseComObject(doc);
    System.Runtime.InteropServices.Marshal.ReleaseComObject(app);

    return true;
}

In this code, the calls to ReleaseComObject are important. They release the COM references that the program holds on the Word document and application objects.

Many programs that host the Word application (or the Excel application) call the application's Quit function once they no longer need it. Sometimes the Word process nevertheless stays alive, which is a serious resource leak. Calling ReleaseComObject reduces this risk.
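The listing above only reaches its cleanup calls when nothing throws in between. A minimal sketch of one way to make the cleanup unconditional follows. The class and method names (`ComCleanup`, `Release`, `RunThenCleanup`) are hypothetical and not part of Word2CHM; only `Marshal.IsComObject` and `Marshal.ReleaseComObject` are real framework calls.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper, not taken from the Word2CHM sources.
public static class ComCleanup
{
    // Releases the COM wrapper if the object is one;
    // null and ordinary .NET objects are ignored.
    public static void Release(object comObject)
    {
        if (comObject != null && Marshal.IsComObject(comObject))
        {
            Marshal.ReleaseComObject(comObject);
        }
    }

    // Runs 'work' and always runs 'cleanup' afterwards,
    // even when 'work' throws an exception.
    public static void RunThenCleanup(Action work, Action cleanup)
    {
        try
        {
            work();
        }
        finally
        {
            cleanup();
        }
    }
}
```

In SaveWordToHtml, the Open and SaveAs calls would go into the `work` part, and the Close, Quit, and release calls into the `cleanup` part, so that Word is closed even when saving fails.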

Second, Split a Single HTML File into Multiple HTML Files

The HTML file generated by Word contains all the content of the Word document. For example, suppose a Word document contains the following content:

doc-sample.jpg

Saving this document as a filtered HTML file produces the following HTML source:

<html>
    <head>
        <meta http-equiv=Content-Type content="text/html; charset=gb2312">
        <meta name=Generator content="Microsoft Word 11 (filtered)">
        <title>Header1</title>
        <style>
         some style code
        </style>
    </head>
    <body lang=ZH-CN style='text-justify-trim:punctuation'>
        <div class=Section1 style='layout-grid:15.6pt'>
            <h1><span lang=EN-US>Header1</span></h1>
            <p class=MsoNormal><span lang=EN-US>Content1</span></p>
            <h2><span lang=EN-US>Header2</span></h2>
            <p class=MsoNormal><span lang=EN-US>Content2</span></p>
        </div>
    </body>
</html>

In this HTML source, a single div tag contains all the content. Word2CHM needs to split this HTML file into two files.

File0.html
<html>
    <head>
        <meta http-equiv=Content-Type content="text/html; charset=gb2312">
        <meta name=Generator content="Microsoft Word 11 (filtered)">
        <title>Header1</title>
    <style>
     --------------
    </style>
    </head>
    <body>
        <h1>Header</h1><hr />
        
        <p class=MsoNormal><span lang=EN-US>Content1</span></p>
        
        <hr /><h1>Footer</h1>
    </body>
</html>

File1.html
<html>
    <head>
        <meta http-equiv=Content-Type content="text/html; charset=gb2312">
        <meta name=Generator content="Microsoft Word 11 (filtered)">
        <title>Header1</title>
    <style>
     --------------
    </style>
    </head>
    <body>
        <h1>Header</h1><hr />
        
        <p class=MsoNormal><span lang=EN-US>Content2</span></p>
        
        <hr /><h1>Footer</h1>
    </body>
</html>

Here, the program inserts the HTML “<h1>Header</h1><hr />” before the topic content and “<hr /><h1>Footer</h1>” after it. These additions serve as the page header and footer. Word2CHM uses the following C# code to split the HTML file.

int index2 = strBody.IndexOf(">");
int index3 = strBody.IndexOf("</h" + Nativelevel + ">");
// read the text between <hN> and </hN> as the topic title
string strTitle = strBody.Substring(index2 + 1, index3 - index2 - 1);
while (strTitle.IndexOf("<") >= 0)
{
    int index4 = strTitle.IndexOf("<");
    int index5 = strTitle.IndexOf(">", index4);
    strTitle = strTitle.Remove(index4, index5 - index4 + 1);
}
strBody = strBody.Substring(index3 + 5);
index = strBody.IndexOf("<h");
if (index == -1)
{
    index = strBody.Length;
}
//read topic content
string strContent = strBody.Substring(0, index);

With this code, Word2CHM splits the HTML file at the heading tags H1, H2, H3, and so on (Hn), and uses the text inside each heading tag as the title of the corresponding HTML document.
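The fragment above depends on variables from its surrounding loop, so it cannot be run on its own. The following self-contained sketch shows the same idea with regular expressions instead of IndexOf. This is a swapped-in technique, not the author's implementation, and the names `Topic` and `HeadingSplitter` are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical types for illustration; not part of Word2CHM.
public sealed class Topic
{
    public int Level;       // 1 for <h1>, 2 for <h2>, ...
    public string Title;    // heading text with inner tags removed
    public string Content;  // HTML between this heading and the next
}

public static class HeadingSplitter
{
    // Matches <h1>...</h1> through <h6>...</h6>, with optional attributes.
    private static readonly Regex Heading = new Regex(
        @"<h([1-6])\b[^>]*>(.*?)</h\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Matches any single tag, used to strip markup such as <span> from titles.
    private static readonly Regex Tag = new Regex(@"<[^>]+>");

    public static List<Topic> Split(string body)
    {
        var topics = new List<Topic>();
        MatchCollection matches = Heading.Matches(body);
        for (int i = 0; i < matches.Count; i++)
        {
            Match m = matches[i];
            // The content runs from the end of this heading
            // to the start of the next heading, or to the end of the body.
            int start = m.Index + m.Length;
            int end = (i + 1 < matches.Count) ? matches[i + 1].Index : body.Length;

            var topic = new Topic();
            topic.Level = int.Parse(m.Groups[1].Value);
            topic.Title = Tag.Replace(m.Groups[2].Value, "").Trim();
            topic.Content = body.Substring(start, end - start).Trim();
            topics.Add(topic);
        }
        return topics;
    }
}
```

Applied to the inner HTML of the div in the sample document, this yields two topics titled Header1 and Header2. Unlike the IndexOf version, it does not throw when a heading contains a "<" without a matching ">".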

Third, Compile Multiple HTML Files into a Single CHM File

Word2CHM cannot compile multiple HTML files into a CHM file by itself. It calls HTML Help Workshop, a Microsoft product, to generate the CHM file. HTML Help Workshop reads its settings from a help project file with the extension .hhp. Word2CHM uses the following C# code to generate the HHP file.

using (System.IO.StreamWriter myWriter = new System.IO.StreamWriter(
           strHHP,
           false,
           System.Text.Encoding.GetEncoding(936)))
{
    myWriter.WriteLine("[OPTIONS]");
    myWriter.WriteLine("Compiled file=" + System.IO.Path.GetFileName(strCHM));
    myWriter.WriteLine("Contents file=" + System.IO.Path.GetFileName(strHHC));
    myWriter.WriteLine("Default topic=" + this.DefaultTopic);
    myWriter.WriteLine("Default Window=main");
    myWriter.WriteLine("Display compile progress=yes");
    myWriter.WriteLine("Full-text search=" + (this.FullTextSearch ? "Yes" : "No"));
    myWriter.WriteLine("Binary TOC=" + (this.BinaryToc ? "Yes" : "No"));
    myWriter.WriteLine("Auto Index=" + (this.AutoIndex ? "Yes" : "No"));
    myWriter.WriteLine("Binary Index=" + (this.BinaryIndex ? "Yes" : "No"));
    myWriter.WriteLine("Title=" + this.Title);
    myWriter.WriteLine("[FILES]");
    foreach (CHMNode node in nodes)
    {
        if (HasContent(node.Local))
        {
            if (myFiles.Contains(node.Local) == false)
            {
                myFiles.Add(node.Local);
            }
        }
    }
    foreach (string fileName in myFiles)
    {
        myWriter.WriteLine(fileName);
    }
}
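To make the layout of the project file easier to inspect, the same text can be built in memory first. The sketch below is a reduced, hypothetical variant (`HhpSketch.Build` is not part of Word2CHM) that writes only some of the options shown above and uses a HashSet to skip empty and duplicate file names. The case-insensitive comparison is an assumption based on Windows file names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not part of Word2CHM.
public static class HhpSketch
{
    public static string Build(
        string chmFile,
        string hhcFile,
        string defaultTopic,
        string title,
        IEnumerable<string> topicFiles)
    {
        var sb = new StringBuilder();
        sb.AppendLine("[OPTIONS]");
        sb.AppendLine("Compiled file=" + chmFile);
        sb.AppendLine("Contents file=" + hhcFile);
        sb.AppendLine("Default topic=" + defaultTopic);
        sb.AppendLine("Title=" + title);
        sb.AppendLine("[FILES]");

        // List each topic file once, in first-seen order.
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (string file in topicFiles)
        {
            if (!string.IsNullOrEmpty(file) && seen.Add(file))
            {
                sb.AppendLine(file);
            }
        }
        return sb.ToString();
    }
}
```

The returned string can then be written with a StreamWriter in the required encoding, as in the listing above.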

Word2CHM also generates an HHC file, which describes the topic structure (the table of contents) of the CHM file. The HHC file uses HTML-style markup, and Word2CHM builds it as XML with the following C# code.

System.Xml.XmlDocument doc = RootElement.OwnerDocument;
System.Xml.XmlElement ulElement = doc.CreateElement("UL");
RootElement.AppendChild(ulElement);
foreach (CHMNode node in nodes)
{
    System.Xml.XmlElement liElement = doc.CreateElement("LI");
    ulElement.AppendChild(liElement);
    System.Xml.XmlElement objElement = doc.CreateElement("OBJECT");
    liElement.AppendChild(objElement);
    objElement.SetAttribute("type", "text/sitemap");
    AddParamElement(objElement, "Name", node.Name);
    if (HasContent(node.Local))
    {
        AddParamElement(objElement, "Local", node.Local.Replace('\\', '/'));
    }
    if (HasContent(node.ImageNumber))
    {
        AddParamElement(objElement, "ImageNumber", node.ImageNumber);
    }
    if (node.Nodes.Count > 0)
    {
        ToHHCXMLElement(node.Nodes, ulElement);
    }
}
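The listing above calls AddParamElement, which the article does not show. A plausible minimal version is sketched below, together with a single-topic example of the resulting markup. Both methods are assumptions for illustration, based on the param elements that HHC sitemap entries use, and are not the author's code.

```csharp
using System.Xml;

// Hypothetical stand-ins for illustration; not taken from Word2CHM.
public static class HhcSketch
{
    // Appends <param name="..." value="..." /> to the given OBJECT element.
    public static void AddParamElement(XmlElement parent, string name, string value)
    {
        XmlElement param = parent.OwnerDocument.CreateElement("param");
        param.SetAttribute("name", name);
        param.SetAttribute("value", value);
        parent.AppendChild(param);
    }

    // Builds the UL/LI/OBJECT structure for one topic and returns it as text.
    public static string BuildSingleTopic(string title, string local)
    {
        var doc = new XmlDocument();
        XmlElement ul = doc.CreateElement("UL");
        doc.AppendChild(ul);

        XmlElement li = doc.CreateElement("LI");
        ul.AppendChild(li);

        XmlElement obj = doc.CreateElement("OBJECT");
        li.AppendChild(obj);
        obj.SetAttribute("type", "text/sitemap");

        AddParamElement(obj, "Name", title);
        // HHC entries use forward slashes in file paths.
        AddParamElement(obj, "Local", local.Replace('\\', '/'));
        return doc.OuterXml;
    }
}
```

For example, `HhcSketch.BuildSingleTopic("Header1", "topics\\File0.html")` produces a UL element with one LI entry whose Local parameter is topics/File0.html.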

After generating the HHP and HHC files, Word2CHM calls HHC.exe (the HTML Help compiler) on the HHP file to generate the CHM file. HHC.exe is usually found in the directory “C:\Program Files\HTML Help Workshop”. The following C# code runs the compiler.

ProcessStartInfo start = new ProcessStartInfo(compilerExeFileName, "\"" + strHHP + "\"");
start.UseShellExecute = false;
start.CreateNoWindow = true;
start.RedirectStandardOutput = true;
start.WindowStyle = System.Diagnostics.ProcessWindowStyle.Minimized;
System.Diagnostics.Process proc = System.Diagnostics.Process.Start(start);
proc.PriorityClass = System.Diagnostics.ProcessPriorityClass.BelowNormal;
this.strOutputText = proc.StandardOutput.ReadToEnd();
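The listing above reads the compiler output but does not wait for the process to exit or dispose of it. A slightly more defensive sketch follows. `CompilerRunner`, `Quote`, and `Run` are hypothetical names, and the sketch deliberately does not interpret the exit code, because the meaning of the values returned by HHC.exe is not stated in this article.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for illustration; not part of Word2CHM.
public static class CompilerRunner
{
    // Wraps a path in double quotes so that spaces survive argument parsing.
    public static string Quote(string path)
    {
        return "\"" + path + "\"";
    }

    // Starts the program, captures its standard output, waits for it to exit,
    // and reports the exit code to the caller.
    public static string Run(string exeFileName, string arguments, out int exitCode)
    {
        var start = new ProcessStartInfo(exeFileName, arguments);
        start.UseShellExecute = false;        // required for output redirection
        start.CreateNoWindow = true;
        start.RedirectStandardOutput = true;

        using (Process proc = Process.Start(start))
        {
            // Read the output before waiting, so a full pipe cannot block the child.
            string output = proc.StandardOutput.ReadToEnd();
            proc.WaitForExit();
            exitCode = proc.ExitCode;
            return output;
        }
    }
}
```

A call would look like `CompilerRunner.Run(compilerExeFileName, CompilerRunner.Quote(strHHP), out code)`. The priority setting from the original listing is left out here, because setting PriorityClass on a process that has already exited throws an exception.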

Once these three steps are complete, the Word document has been converted to a CHM file.

History

  • 17th November, 2010: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

yuan yong fu
Web Developer, duchang soft
China
yuan yong fu of duchang soft, from China, is a 2008 Microsoft MVP who works with GDI+ and XML/XSLT. Site: http://www.cnblogs.com/xdesigner/

Comments and Discussions

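My vote of 5 (litaooo, 18-Dec-10 3:31)
Very nice. I have recently been looking for write-ups or tools on converting Word to CHM, and after reading your article the approach is much clearer to me. (I used to think the chapters had to be split by heading inside the Word document itself; I did not realize that the HTML generated by Word is already divided up by tags such as H1.)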

Last Updated 17 Nov 2010
Article Copyright 2010 by yuan yong fu