Word2CHM, Convert a Word Document to a CHM File

17 Nov 2010, CPOL, 3 min read
word2chm.gif

Introduction

Word2CHM is an open-source C# program that converts a Microsoft Word document (in Word 2000/2003 format) to a CHM document. It requires HTML Help Workshop and Microsoft Word 2003.

Here is a screenshot of the program.

word2chm-snapshot.jpg

Background

Many people write customer help documents in Microsoft Word, because it is well suited to documents that combine text, images, and tables.

However, many customers do not want to read help documents in Microsoft Word format; they prefer the CHM format. It is therefore useful to convert a Microsoft Word document to a CHM document, which is why I built Word2CHM.

Word2CHM

Word2CHM converts a Microsoft Word document to a CHM document in three steps. The first converts the Word document to a single HTML file, the second splits that HTML file into multiple HTML files, and the third compiles those HTML files into a single CHM file.

First, Convert Microsoft Word Document to a Single HTML File

The Microsoft Word application supports OLE Automation, so a C# program can host a Word application instance, open a binary Word document, and save it as an HTML file.

The following sample C# code hosts a Microsoft Word application.

C#
private bool SaveWordToHtml(string docFileName, string htmlFileName)
{
    // check doc file name
    if (System.IO.File.Exists(docFileName) == false )
    {
        this.Alert("File '" + docFileName + "' not exist!");
        return false;
    }
    // check output directory
    string dir = System.IO.Path.GetDirectoryName(htmlFileName);
    if (System.IO.Directory.Exists(dir) == false )
    {
        this.Alert("Directory '" + dir + "' not exist!");
        return false;
    }

    object trueValue = true;
    object falseValue = false;
    object missValue = System.Reflection.Missing.Value;
    object fileNameValue = docFileName;

    // create word application instance
    Microsoft.Office.Interop.Word.Application app = 
        new Microsoft.Office.Interop.Word.ApplicationClass();
    // make the Word application visible, so that if something goes wrong
    // and this program quits, the user can close Word by hand
    app.Visible = true;
    // open document
    Microsoft.Office.Interop.Word.Document doc = app.Documents.Open(
        ref fileNameValue,
        ref missValue,
        ref trueValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue);
    // save a html file
    object htmlFileNameValue = htmlFileName;
    object format = Microsoft.Office.Interop.Word.WdSaveFormat.wdFormatFilteredHTML;
    doc.SaveAs(
        ref htmlFileNameValue , 
        ref format,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue,
        ref missValue);

    // close document and release resource
    doc.Close(ref falseValue, ref missValue, ref missValue);
    app.Quit(ref falseValue, ref missValue, ref missValue);

    System.Runtime.InteropServices.Marshal.ReleaseComObject(doc);
    System.Runtime.InteropServices.Marshal.ReleaseComObject(app);

    return true;
}

In this C# source code, it is important to call ReleaseComObject. With ReleaseComObject, the program releases its references to the Word COM objects, so that the resources used by the Word application can be freed.

Many programs that host the Microsoft Word application (or the Excel application) call the application's Quit function when they no longer need it. Sometimes, however, the Word process stays alive afterwards, which can cause a serious resource leak. Calling ReleaseComObject reduces this risk.
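SaveWordToHtml above only reaches its Quit and ReleaseComObject calls when nothing throws in between. A minimal sketch of a guard that runs the cleanup in every case is shown here. ComCleanup, Release and RunWithCleanup are hypothetical names and are not part of Word2CHM; only the Marshal calls are real .NET APIs.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper, not part of Word2CHM.
public static class ComCleanup
{
    // Releases the COM wrapper behind 'comObject'.
    // Returns false for null or for ordinary .NET objects, which need no release.
    public static bool Release(object comObject)
    {
        if (comObject == null || !Marshal.IsComObject(comObject))
        {
            return false;
        }
        Marshal.ReleaseComObject(comObject);
        return true;
    }

    // Runs 'work' and then always runs 'cleanup', even when 'work' throws.
    // The exception from 'work' still propagates to the caller.
    public static void RunWithCleanup(Action work, Action cleanup)
    {
        try
        {
            work();
        }
        finally
        {
            cleanup();
        }
    }
}
```

In SaveWordToHtml, the Open and SaveAs calls would go into the work delegate, while doc.Close, app.Quit and the two release calls would go into the cleanup delegate, so that a failed save is less likely to leave a Word process behind.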

Second, Split a Single HTML File into Multiple HTML Files

The HTML file generated by the Word application contains all of the content of the Word document. For example, suppose a Word document contains the following content:

doc-sample.jpg

When I save this document as a filtered HTML file, the HTML source code is as follows:

HTML
<html>
    <head>
        <meta http-equiv=Content-Type content="text/html; charset=gb2312">
        <meta name=Generator content="Microsoft Word 11 (filtered)">
        <title>Header1</title>
        <style>
         some style code
        </style>
    </head>
    <body lang=ZH-CN style='text-justify-trim:punctuation'>
        <div class=Section1 style='layout-grid:15.6pt'>
            <h1><span lang=EN-US>Header1</span></h1>
            <p class=MsoNormal><span lang=EN-US>Content1</span></p>
            <h2><span lang=EN-US>Header2</span></h2>
            <p class=MsoNormal><span lang=EN-US>Content2</span></p>
        </div>
    </body>
</html>

In this HTML source code, a single div tag contains all of the content. Word2CHM needs to split this HTML file into two files.

File0.html
HTML
<html>
    <head>
        <meta http-equiv=Content-Type content="text/html; charset=gb2312">
        <meta name=Generator content="Microsoft Word 11 (filtered)">
        <title>Header1</title>
    <style>
     --------------
    </style>
    </head>
    <body>
        <h1>Header</h1><hr />
        
        <p class=MsoNormal><span lang=EN-US>Content1</span></p>
        
        <hr /><h1>Footer</h1>
    </body>
</html>
File1.html
HTML
<html>
    <head>
        <meta http-equiv=Content-Type content="text/html; charset=gb2312">
        <meta name=Generator content="Microsoft Word 11 (filtered)">
        <title>Header2</title>
    <style>
     --------------
    </style>
    </head>
    <body>
        <h1>Header</h1><hr />
        
        <p class=MsoNormal><span lang=EN-US>Content2</span></p>
        
        <hr /><h1>Footer</h1>
    </body>
</html>
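Both files above wrap one topic's content in the same page shell. A minimal sketch of how such a page might be assembled follows. TopicPage and Build are hypothetical names, the style text is passed in as a plain string, and the real Word2CHM template may differ.

```csharp
using System.Net;
using System.Text;

// Hypothetical page builder, not taken from Word2CHM.
public static class TopicPage
{
    // Wraps one topic's HTML content in a page with the fixed header and footer.
    public static string Build(string title, string styleCss, string content)
    {
        StringBuilder sb = new StringBuilder();
        sb.AppendLine("<html>");
        sb.AppendLine("<head>");
        sb.AppendLine("<meta http-equiv=Content-Type content=\"text/html; charset=gb2312\">");
        // Encode the title so that characters such as '<' or '&' stay valid HTML.
        sb.AppendLine("<title>" + WebUtility.HtmlEncode(title) + "</title>");
        sb.AppendLine("<style>" + styleCss + "</style>");
        sb.AppendLine("</head>");
        sb.AppendLine("<body>");
        sb.AppendLine("<h1>Header</h1><hr />");
        sb.AppendLine(content);
        sb.AppendLine("<hr /><h1>Footer</h1>");
        sb.AppendLine("</body>");
        sb.AppendLine("</html>");
        return sb.ToString();
    }
}
```

A call such as TopicPage.Build("Header2", css, contentHtml) would then produce a page shaped like File1.html.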

Here, the program adds the HTML source “<h1>Header</h1><hr />” in front of the HTML content and adds “<hr /><h1>Footer</h1>” after it. These additions serve as the header and footer. Word2CHM uses the following C# code to split the HTML file.

C#
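// strBody is assumed to start at an <hN ...> tag; Nativelevel is the heading level N
int index2 = strBody.IndexOf(">");
int index3 = strBody.IndexOf("</h" + Nativelevel + ">");
// read the text between <hN> and </hN> as the topic title
string strTitle = strBody.Substring(index2 + 1, index3 - index2 - 1);
// strip nested tags (such as <span>) from the title
while (strTitle.IndexOf("<") >= 0)
{
    int index4 = strTitle.IndexOf("<");
    int index5 = strTitle.IndexOf(">", index4);
    strTitle = strTitle.Remove(index4, index5 - index4 + 1);
}
// skip past the closing tag "</hN>" (5 characters for a single-digit level)
strBody = strBody.Substring(index3 + 5);
// the topic content runs up to the next tag that starts with "<h",
// or to the end of the body if there is none
index = strBody.IndexOf("<h");
if (index == -1)
{
    index = strBody.Length;
}
// read topic content
string strContent = strBody.Substring(0, index);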

With this C# code, Word2CHM splits the HTML file at the heading tags H1, H2, H3, and so on (Hn), and sets each HTML document’s title to the text inside its Hn tag.

Third, Compile Multiple HTML Files into a Single CHM File
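The fragment above depends on surrounding variables that are not shown. As a self-contained illustration of the same idea, here is a sketch that splits a body at its heading tags with a regular expression instead of IndexOf. It is not Word2CHM's implementation; HeadingSplitter, Topic and Split are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical splitter, not taken from Word2CHM.
public static class HeadingSplitter
{
    public sealed class Topic
    {
        public int Level;       // 1 for <h1>, 2 for <h2>, ...
        public string Title;    // heading text with nested tags removed
        public string Content;  // HTML between this heading and the next one
    }

    // Matches <h1>..<h6> with optional attributes; group 1 is the level,
    // group 2 is the inner HTML of the heading.
    private static readonly Regex Heading = new Regex(
        @"<h([1-6])\b[^>]*>(.*?)</h\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static List<Topic> Split(string body)
    {
        List<Topic> topics = new List<Topic>();
        MatchCollection matches = Heading.Matches(body);
        for (int i = 0; i < matches.Count; i++)
        {
            Match m = matches[i];
            int start = m.Index + m.Length;
            // Content ends where the next heading begins, or at the end of the body.
            int end = (i + 1 < matches.Count) ? matches[i + 1].Index : body.Length;

            Topic topic = new Topic();
            topic.Level = int.Parse(m.Groups[1].Value);
            topic.Title = Regex.Replace(m.Groups[2].Value, "<[^>]+>", "").Trim();
            topic.Content = body.Substring(start, end - start).Trim();
            topics.Add(topic);
        }
        return topics;
    }
}
```

Because the pattern requires a digit after "<h", tags such as <hr /> are not treated as headings. Text that appears before the first heading is ignored by this sketch.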

Word2CHM cannot compile multiple HTML files into a single CHM file by itself. It calls HTML Help Workshop, a Microsoft product, to generate the CHM file. HTML Help Workshop compiles multiple HTML files into a CHM file and reads its settings from a help project file with the extension .hhp. Word2CHM uses the following C# code to generate the HHP file.

C#
using (System.IO.StreamWriter myWriter = new System.IO.StreamWriter(
           strHHP,
           false,
           System.Text.Encoding.GetEncoding(936)))
{
    myWriter.WriteLine("[OPTIONS]");
    myWriter.WriteLine("Compiled file=" + System.IO.Path.GetFileName(strCHM));
    myWriter.WriteLine("Contents file=" + System.IO.Path.GetFileName(strHHC));
    myWriter.WriteLine("Default topic=" + this.DefaultTopic);
    myWriter.WriteLine("Default Window=main");
    myWriter.WriteLine("Display compile progress=yes");
    myWriter.WriteLine("Full-text search=" + (this.FullTextSearch ? "Yes" : "No"));
    myWriter.WriteLine("Binary TOC=" + (this.BinaryToc ? "Yes" : "No"));
    myWriter.WriteLine("Auto Index=" + (this.AutoIndex ? "Yes" : "No"));
    myWriter.WriteLine("Binary Index=" + (this.BinaryIndex ? "Yes" : "No"));
    myWriter.WriteLine("Title=" + this.Title);
    myWriter.WriteLine("[FILES]");
    foreach (CHMNode node in nodes)
    {
        if (HasContent(node.Local))
        {
            if (myFiles.Contains(node.Local) == false)
            {
                myFiles.Add(node.Local);
            }
        }
    }
    foreach (string fileName in myFiles)
    {
        myWriter.WriteLine(fileName);
    }
}
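The snippet above uses CHMNode, HasContent and myFiles without showing their definitions. The following sketch shows one way they might look, purely as an assumption. The field names are taken from how the article's code uses them, while HhpFiles and CollectFiles are hypothetical, and the real classes in Word2CHM may differ.

```csharp
using System;
using System.Collections.Generic;

// Assumed shape of a topic node, inferred from the members used in the article.
public class CHMNode
{
    public string Name;
    public string Local;        // relative path of the topic's HTML file
    public string ImageNumber;
    public List<CHMNode> Nodes = new List<CHMNode>();
}

// Hypothetical helper, not taken from Word2CHM.
public static class HhpFiles
{
    // Assumed meaning of HasContent: the string is neither null nor empty.
    public static bool HasContent(string text)
    {
        return !string.IsNullOrEmpty(text);
    }

    // Collects each distinct file name once, in first-seen order,
    // including the files of child nodes.
    public static List<string> CollectFiles(IEnumerable<CHMNode> nodes)
    {
        List<string> files = new List<string>();
        HashSet<string> seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        Collect(nodes, files, seen);
        return files;
    }

    private static void Collect(IEnumerable<CHMNode> nodes, List<string> files, HashSet<string> seen)
    {
        foreach (CHMNode node in nodes)
        {
            if (HasContent(node.Local) && seen.Add(node.Local))
            {
                files.Add(node.Local);
            }
            Collect(node.Nodes, files, seen);
        }
    }
}
```

The HashSet replaces the repeated Contains check on a list and compares file names without regard to case, which matches how Windows treats paths. Whether the original code walks child nodes is not shown in the article, so the recursion here is an assumption.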

Word2CHM also generates an HHC file, which describes the topic structure of the CHM file. The HHC file is an HTML-style sitemap; Word2CHM builds it as an XML document, using the following C# code.

C#
System.Xml.XmlDocument doc = RootElement.OwnerDocument;
System.Xml.XmlElement ulElement = doc.CreateElement("UL");
RootElement.AppendChild(ulElement);
foreach (CHMNode node in nodes)
{
    System.Xml.XmlElement liElement = doc.CreateElement("LI");
    ulElement.AppendChild(liElement);
    System.Xml.XmlElement objElement = doc.CreateElement("OBJECT");
    liElement.AppendChild(objElement);
    objElement.SetAttribute("type", "text/sitemap");
    AddParamElement(objElement, "Name", node.Name);
    if (HasContent(node.Local))
    {
        AddParamElement(objElement, "Local", node.Local.Replace('\\', '/'));
    }
    if (HasContent(node.ImageNumber))
    {
        AddParamElement(objElement, "ImageNumber", node.ImageNumber);
    }
    if (node.Nodes.Count > 0)
    {
        ToHHCXMLElement(node.Nodes, ulElement);
    }
}
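AddParamElement is called above but not shown. A plausible sketch follows, assuming it appends a param element with name and value attributes to the OBJECT element. HhcXml and BuildSingleEntry are hypothetical names, and the actual helper in Word2CHM may differ.

```csharp
using System.Xml;

// Hypothetical helpers, not taken from Word2CHM.
public static class HhcXml
{
    // Appends <param name="..." value="..." /> to the given OBJECT element.
    public static XmlElement AddParamElement(XmlElement objElement, string name, string value)
    {
        XmlElement param = objElement.OwnerDocument.CreateElement("param");
        param.SetAttribute("name", name);
        param.SetAttribute("value", value);
        objElement.AppendChild(param);
        return param;
    }

    // Builds the UL/LI/OBJECT structure for a single topic entry.
    public static XmlElement BuildSingleEntry(XmlDocument doc, string title, string local)
    {
        XmlElement ul = doc.CreateElement("UL");
        XmlElement li = doc.CreateElement("LI");
        ul.AppendChild(li);
        XmlElement obj = doc.CreateElement("OBJECT");
        li.AppendChild(obj);
        obj.SetAttribute("type", "text/sitemap");
        AddParamElement(obj, "Name", title);
        // Use forward slashes in the path, as the article's code does.
        AddParamElement(obj, "Local", local.Replace('\\', '/'));
        return ul;
    }
}
```

Reading the OuterXml property of the returned element gives the markup for that one entry, which is a quick way to inspect what would be written to the HHC file.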

After generating the HHP file and the HHC file, Word2CHM calls HHC.exe with the HHP file to generate the CHM file. HHC.exe is usually located in the directory “C:\Program Files\HTML Help Workshop”. The following C# code runs the compiler.

C#
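// run the help compiler on the HHP file and capture its console output
System.Diagnostics.ProcessStartInfo start =
    new System.Diagnostics.ProcessStartInfo(compilerExeFileName, "\"" + strHHP + "\"");
start.UseShellExecute = false; // required in order to redirect standard output
start.CreateNoWindow = true;
start.RedirectStandardOutput = true;
start.WindowStyle = System.Diagnostics.ProcessWindowStyle.Minimized;
System.Diagnostics.Process proc = System.Diagnostics.Process.Start(start);
proc.PriorityClass = System.Diagnostics.ProcessPriorityClass.BelowNormal;
// read all of the output first, then wait for the compiler to exit
this.strOutputText = proc.StandardOutput.ReadToEnd();
proc.WaitForExit();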
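The compilerExeFileName value used above has to point at HHC.exe. A minimal sketch of how the compiler might be located is shown here, assuming it sits in an "HTML Help Workshop" folder under one of the Program Files directories. HelpCompilerLocator, DefaultDirectories and Find are hypothetical names and are not part of Word2CHM.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical locator, not taken from Word2CHM.
public static class HelpCompilerLocator
{
    // Candidate install folders; the folder name follows the article's example path.
    public static IEnumerable<string> DefaultDirectories()
    {
        yield return Path.Combine(
            Environment.GetFolderPath(Environment.SpecialFolder.ProgramFiles),
            "HTML Help Workshop");
        yield return Path.Combine(
            Environment.GetFolderPath(Environment.SpecialFolder.ProgramFilesX86),
            "HTML Help Workshop");
    }

    // Returns the full path of the first hhc.exe found, or null if none exists.
    public static string Find(IEnumerable<string> directories)
    {
        foreach (string dir in directories)
        {
            if (string.IsNullOrEmpty(dir))
            {
                continue;
            }
            string candidate = Path.Combine(dir, "hhc.exe");
            if (File.Exists(candidate))
            {
                return candidate;
            }
        }
        return null;
    }
}
```

A caller could use HelpCompilerLocator.Find(HelpCompilerLocator.DefaultDirectories()) and report a clear error when the result is null, instead of failing later in Process.Start.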

After completing these three steps, Word2CHM can convert a Word document to a CHM file.

History

  • 17th November, 2010: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Web Developer duchang soft
China
Yuan Yong Fu of duchang soft, from China. 2008 Microsoft MVP. Works with GDI+ and XML/XSLT. Site: http://www.cnblogs.com/xdesigner/

Comments and Discussions

 
Found a Small Problem
aaroncampf, 14-Mar-11 9:19
Everything Works even on a Word 2010 File

BUT it has a problem Here


public void LoadWordHtml(string fileName)

index = strHtml.IndexOf(">", index); <---

But if you just move to the next line of code it works just fine!

Still really COOL!