
Converting Microsoft Word files into HTML Help Files

19 Dec 2007
An Automated Method of Converting Microsoft Word files into HTML Help Files

Introduction

This article discusses a technique for converting a single Microsoft Word file into multiple HTML Help files. Specifically, it examines the Word2Help sample program from Aspose (http://www.aspose.com/Community/Files/51/aspose.words/entry94733.aspx). More generally, it also discusses some of the functionality of the Aspose.Words product, which underpins the Word2Help sample.

This method does not access any of the Microsoft Word APIs, so Word does not need to be installed on the computer running this program. Instead, it relies exclusively on the Aspose.Words product to perform its conversions. Aspose.Words provides an easy mechanism to manipulate Microsoft Word files.

The code and sample data files can be found here: http://www.aspose.com/Community/Files/51/aspose.words/category1176.aspx.

Word2Help from a User's Perspective

Before we walk through the technical specifics, it is important to understand what Word2Help does. In a nutshell, the utility breaks a Microsoft Word document into several component HTML documents, splitting it at every paragraph that is styled with a Heading style.

Our purpose in this section is to highlight Word2Help's functionality from the user's perspective. This section is therefore not an exhaustive explanation of the Word2Help feature set. It serves as a primer for our technical discussion.

Let's walk through a very simple example of converting a single Word document into several HTML documents. Note that the utility supports batch conversion of multiple Word files. However, our discussion will focus only on converting a single file.

Please note that the included ZIP contains not only the program and code, but also sample files to test our solution. Unzip the included file to C:\Temp or wherever you desire. We have unzipped the contents to C:\Temp. In doing so, "Sample Documentation.doc" and several other files are now found in C:\Temp\Word2Help\Doc (Figure 1). This folder represents the input sample data for our example.

Screenshot - DocFolder-Directory.png
Figure 1: Input Files - C:\Temp\Word2Help\Doc

We now wish to execute Word2Help against this input folder. The following command tells Word2Help to convert all Word files in C:\Temp\Word2Help\Doc and store their output in C:\Temp\Word2Help\Html. It also specifies a /fix parameter that we will ignore for now.

Word2Help.exe /src:C:\Temp\Word2Help\Doc /out:C:\Temp\Word2Help\Html /fix:http://www.aspose.com/Products/Aspose.Words/Api
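Because the values in this command contain colons of their own (a drive letter, a URL scheme), a parser for this style of switch has to split each argument at the first colon only. The following is a minimal sketch of that idea. It is not the parser from Word2Help; the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class CommandLineSketch
{
    // Parses switches of the form /name:value into a dictionary.
    // Only the first ':' separates the name from the value, so values such as
    // C:\Temp or http://... keep their own colons.
    public static Dictionary<string, string> Parse(string[] args)
    {
        var options = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (string arg in args)
        {
            if (!arg.StartsWith("/", StringComparison.Ordinal))
                continue; // not a switch; ignored in this sketch

            int colon = arg.IndexOf(':');
            if (colon < 0)
                continue; // switch without a value; ignored in this sketch

            string name = arg.Substring(1, colon - 1);
            string value = arg.Substring(colon + 1);
            options[name] = value;
        }
        return options;
    }

    public static void Main()
    {
        string[] args =
        {
            @"/src:C:\Temp\Word2Help\Doc",
            @"/out:C:\Temp\Word2Help\Html",
            "/fix:http://www.aspose.com/Products/Aspose.Words/Api"
        };

        foreach (KeyValuePair<string, string> option in Parse(args))
            Console.WriteLine(option.Key + " = " + option.Value);
    }
}
```

The dictionary is case-insensitive so that /SRC: and /src: are treated alike; that is a design choice of the sketch, not documented behavior of the utility.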

Make sure that the included Aspose.Words.dll file (or newer version) is either in the GAC or in the same folder as the Word2Help.exe file. Further, it should be noted that without a full license you may see some extraneous text in the output file or during Word2Help processing stating you are using the Evaluation version of the product. Of course, if you have obtained a valid license for the product you will not be bothered with that message.

When the above command line runs, it will generate a number of output files (Figure 2). You will note that the output contains three types of files: (1) XML, (2) HTML and (3) Graphic files.

Screenshot - HtmlFolder-Directory.png
Figure 2: Output Files -- C:\Temp\Word2Help\Html

The XML file represents an index of all the HTML files generated. The text of C:\Temp\Word2Help\HTML\content.xml is shown below in Figure 3.

<?xml version="1.0" encoding="utf-8" standalone="yes" ?> 
<content dir="C:\temp\Html">
  <book name="Aspose.Words Getting Started" href="AsposeWordsGettingStarted.html">
    <book name="Aspose.Words Features" href="AsposeWordsFeatures.html">
      <book name="File Formats and Conversions">
        <item name="High-Quality Conversions" href="HighQualityConversions.html" /> 
        <item name="Microsoft Word (DOC)" href="MicrosoftWordDOC.html" /> 
      </book>
    </book>
  </book>
</content>

Figure 3: Content.xml

Based on our discussion so far, you will notice that the XML hierarchy follows directly from the Heading styles applied within the Microsoft Word document. For instance, "Aspose.Words Getting Started" is the only text in the document with the Heading 1 style applied. "Aspose.Words Features" is the only text with the Heading 2 style. Similarly, "File Formats and Conversions" has the Heading 3 style applied. Finally, "High-Quality Conversions" and "Microsoft Word (DOC)" have the Heading 4 style applied to them.
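The href values in Figure 3 look like the heading text with everything except letters and digits removed, followed by ".html". The sketch below works under that assumption. The rule is inferred from the sample output only, and the helper name is hypothetical, so the real utility may well handle cases such as duplicate headings differently.

```csharp
using System;
using System.Text;

public static class TopicFileNameSketch
{
    // Hypothetical helper: keeps only letters and digits from the heading text
    // and appends ".html". Inferred from the sample output, not from Word2Help.
    public static string FromHeading(string heading)
    {
        var name = new StringBuilder();
        foreach (char c in heading)
        {
            if (char.IsLetterOrDigit(c))
                name.Append(c);
        }
        return name.ToString() + ".html";
    }

    public static void Main()
    {
        Console.WriteLine(FromHeading("Aspose.Words Getting Started"));
        Console.WriteLine(FromHeading("High-Quality Conversions"));
        Console.WriteLine(FromHeading("Microsoft Word (DOC)"));
    }
}
```

For the three headings used here, the sketch reproduces the file names listed in Figure 3.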

Now that we've explained the index and the rules for carving up the Microsoft Word document into HTML files, let's finally look at the conversions. Figure 4 shows the Microsoft Word 2007 view of the document, while Figure 5 shows "AsposeWordsGettingStarted.html", which represents the top part of the document.

In comparing the two, you will notice they look virtually identical. You will also notice that a header and footer were added to the HTML document. In fact, the header.html and banner.html files shown in Figure 1 are prepended to the generated HTML document, and footer.html is appended to it.

Screenshot - SourceWordDocument_Page1.png

Figure 4: Sample Documentation.Doc in Microsoft Word

Screenshot - HTMLWordDocument_GettingStarted.png

Figure 5: Section of Sample Documentation.Doc output as HTML

Overview of Classes

Word2Help is a rather simple program because all the heavy lifting is done by Aspose.Words. Word2Help consists of only five classes, as shown in Figure 6.

Screenshot - ClassDiagram.png

Figure 6: Word2Help Class Diagram

The Starter class is the startup object for the application. Thus, the entry point into Word2Help is Starter.Main.

The TopicCollection class loads the Microsoft Word document(s), breaks them up into topics and then saves the HTML and XML files.

The Topic class represents a single topic that will be saved as an HTML file. The Hyperlink class provides functionality to manipulate hyperlinks. Finally, the RegularExpressions class provides a Regex for Title, Head and Body.

The Code

The Starter class represents the startup object for the application. It parses the command line parameters, provides basic exception handling and starts the work through the TopicCollection class. The most notable part of the Starter class is the code shown in Figure 7.

TopicCollection topics = new TopicCollection(srcDir, fixUrl);
topics.AddFromDir(srcDir);
topics.WriteHtml(outDir);
topics.WriteContentXml(outDir);

Figure 7: Starter Class Notable Code

The tasks executed in Figure 7 are as follows:

• TopicCollection Constructor

1. Reads the HTML header, banner and footer into memory as strings.

• AddFromDir Method

2. Loads each Microsoft Word file into an Aspose.Words.Document object.

3. For each Aspose.Words.Document object, inserts a Microsoft Word section break immediately before any text styled with a "Heading" style.

4. Loads each Microsoft Word section into its own Aspose.Words.Document and stores it as a Topic object.

• WriteHtml Method

5. For each Topic, saves the Microsoft Word section as HTML while also adding the header, banner and footer.

• WriteContentXml Method

6. Finally, writes the Content.xml file.

Because the TopicCollection constructor is so straightforward, we won't discuss it further. In the following sections we will describe the AddFromDir, WriteHtml and WriteContentXml methods.

AddFromDir

AddFromDir is essentially the heart of the program. The method ultimately creates several instances of the Aspose.Words.Document class as a means to manipulate the input Microsoft Word file(s). The Aspose.Words.Document class is also used to convert a Word file or fragment to another format, such as HTML.

The Aspose.Words.Document class is documented here: http://www.aspose.com/Products/Aspose.Words/Api/Aspose.Words.Document.html. It represents a Word document. The class supports loading and saving Word documents in the DOC, DOCX, RTF, XML and HTML formats. Additionally, documents can be saved in the TXT and PDF formats. As we already mentioned, we use this functionality to convert Microsoft Word data into an HTML file.

Figure 8 details the more significant parts of the TopicCollection class. It contains the definitions for AddFromDir, AddFromFile, InsertTopicSections and AddTopics.

The AddFromDir method calls AddFromFile for every Microsoft Word file in the input directory. AddFromFile instantiates a new Aspose.Words.Document object for every input Word document file. AddFromFile then calls InsertTopicSections and AddTopics.

/// <summary>
/// Processes all DOC files found in the specified directory.
/// Loads and splits into topics.
/// </summary>
public void AddFromDir(string dirName)
{
    foreach (string filename in Directory.GetFiles(dirName, "*.doc"))
        AddFromFile(filename);
}

/// <summary>
/// Processes a specified DOC file. Loads and splits into topics.
/// </summary>
public void AddFromFile(string fileName)
{
    Document doc = new Document(fileName);
    InsertTopicSections(doc);
    AddTopics(doc);
}

/// <summary>
/// Inserts section breaks that delimit the topics.
/// </summary>
/// <param name="doc">The document where to insert the section breaks.</param>
private static void InsertTopicSections(Document doc)
{
    DocumentBuilder builder = new DocumentBuilder(doc);

    NodeCollection paras = doc.GetChildNodes(NodeType.Paragraph, true, false);
    ArrayList topicStartParas = new ArrayList();

    foreach (Paragraph para in paras)
    {
        StyleIdentifier style = para.ParagraphFormat.StyleIdentifier;
        if ((style >= StyleIdentifier.Heading1) && (style <= MaxTopicHeading) &&
            (para.HasChildNodes))
        {
            // Select heading paragraphs that must become topic starts.
            // We can't modify them in this loop, we have to remember them in an array first.
            topicStartParas.Add(para);
        }
        else if ((style > MaxTopicHeading) && (style <= StyleIdentifier.Heading9))
        {
            // Pull up headings. For example: if Heading 1-4 become topics, then I want Headings 5+ 
            // to become Headings 4+. Maybe I want to pull up even higher?
            para.ParagraphFormat.StyleIdentifier = (StyleIdentifier)((int)style - 1);
        }
    }

    foreach (Paragraph para in topicStartParas)
    {
        Section section = para.ParentSection;

        // Insert section break if the paragraph is not at the beginning of a section already.
        if (para != section.Body.FirstParagraph)
        {
            builder.MoveTo(para.FirstChild);
            builder.InsertBreak(BreakType.SectionBreakNewPage);

            // This is the paragraph that was inserted at the end of the now old section.
            // We don't really need the extra paragraph, we just needed the section.
            section.Body.LastParagraph.Remove();
        }
    }
}

/// <summary>
/// Goes through the sections in the document and adds them as topics to the collection.
/// </summary>
private void AddTopics(Document doc)
{
    foreach (Section section in doc.Sections)
    {
        try
        {
            Topic topic = new Topic(section, mFixUrl);
            mTopics.Add(topic);
        }
        catch (Exception e)
        {
            // If one topic fails, we continue with others.
            Console.WriteLine(e.Message);
        }
    }
}

Figure 8: TopicCollection Snippets

InsertTopicSections iterates through all the paragraph nodes in an Aspose.Words.Document object. The Aspose.Words.Document object allows the developer to traverse a Microsoft Word document in a fashion similar to an XmlDocument. By that, I mean the class provides a hierarchy of elements and a number of API constructs patterned after Microsoft's XmlDocument class. This is evident in the call to GetChildNodes, where NodeType.Paragraph is passed as the first parameter. In this case, all the document's paragraphs are returned in an enumerable collection. In turn, objects returned from this method can have children of their own, which can also be enumerated.

The collection of paragraphs is enumerated to search for any paragraph that has a "Heading" style applied. The style of the paragraph can easily be determined by using the Paragraph's ParagraphFormat property.
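The two range checks in Figure 8 only work because the heading styles are consecutive values of an enumeration. The sketch below isolates that logic using a hypothetical Style enum as a stand-in for the library's StyleIdentifier. The cut-off of Heading 4 is an assumption taken from the code comment in Figure 8, since the article does not show the actual value of MaxTopicHeading.

```csharp
using System;

public static class HeadingRulesSketch
{
    // Hypothetical stand-in for the library's style enumeration; the only
    // property relied on is that heading levels are consecutive integers.
    public enum Style
    {
        Normal = 0,
        Heading1, Heading2, Heading3, Heading4, Heading5,
        Heading6, Heading7, Heading8, Heading9
    }

    // Assumed cut-off: Headings 1 to 4 start topics.
    public const Style MaxTopicHeading = Style.Heading4;

    // A paragraph starts a topic if it is a heading within the cut-off
    // and is not empty.
    public static bool StartsTopic(Style style, bool hasContent)
    {
        return style >= Style.Heading1 && style <= MaxTopicHeading && hasContent;
    }

    // Headings deeper than the cut-off are promoted by one level;
    // every other style is returned unchanged.
    public static Style PullUp(Style style)
    {
        if (style > MaxTopicHeading && style <= Style.Heading9)
            return (Style)((int)style - 1);
        return style;
    }

    public static void Main()
    {
        Console.WriteLine(StartsTopic(Style.Heading2, true));
        Console.WriteLine(StartsTopic(Style.Heading2, false));
        Console.WriteLine(PullUp(Style.Heading5));
        Console.WriteLine(PullUp(Style.Normal));
    }
}
```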

InsertTopicSections also instantiates an Aspose.Words.DocumentBuilder object. This is a companion class to Aspose.Words.Document. Aspose.Words.DocumentBuilder provides a mechanism to "build" a document. More specifically, Aspose.Words.DocumentBuilder provides the means to insert content and formatting, while Aspose.Words.Document provides the overall management of the document.

In InsertTopicSections, the Aspose.Words.DocumentBuilder object takes an Aspose.Words.Document parameter in its constructor. Thus, any changes we make with DocumentBuilder ultimately get reflected back into the Aspose.Words.Document object.

In our case, we want to use the builder to insert a section break at each paragraph that has a "Heading" style applied. In the first part of the method, an ArrayList is built of the paragraphs that have a "Heading" style applied to them. Then, in the last part of the method, we enumerate these paragraphs and insert a section break at each one. Thanks to Aspose, this requires only two method calls (Figure 9). The first method call moves the "cursor" to the paragraph in question. The second method call inserts a break at the "cursor" position.

builder.MoveTo(para.FirstChild);
builder.InsertBreak(BreakType.SectionBreakNewPage);

Figure 9: Moving the Cursor and Inserting a Section Break
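The collect-first, split-later pattern used by InsertTopicSections can be illustrated without the Aspose.Words object model. The sketch below is a toy model in which paragraphs are plain strings and a leading '#' marks a heading; both conventions are assumptions made purely for illustration. It mirrors the rule that no break is inserted when the heading already opens a section.

```csharp
using System;
using System.Collections.Generic;

public static class SplitAtHeadingsSketch
{
    // Paragraphs are modelled as plain strings; a leading '#' marks a heading.
    // This is a toy model, not the Aspose.Words object model.
    public static List<List<string>> Split(List<string> paragraphs)
    {
        // Pass 1: remember where topics start, without touching the list.
        var topicStarts = new List<int>();
        for (int i = 0; i < paragraphs.Count; i++)
        {
            if (paragraphs[i].StartsWith("#", StringComparison.Ordinal))
                topicStarts.Add(i);
        }

        // Pass 2: cut a new section at every remembered position, unless the
        // heading already opens a section (nothing has been collected yet).
        var sections = new List<List<string>>();
        var current = new List<string>();
        int next = 0;
        for (int i = 0; i < paragraphs.Count; i++)
        {
            bool isTopicStart = next < topicStarts.Count && topicStarts[next] == i;
            if (isTopicStart)
            {
                next++;
                if (current.Count > 0)
                {
                    sections.Add(current);
                    current = new List<string>();
                }
            }
            current.Add(paragraphs[i]);
        }
        if (current.Count > 0)
            sections.Add(current);
        return sections;
    }

    public static void Main()
    {
        var paragraphs = new List<string> { "# Getting Started", "Intro text", "# Features", "Feature text" };
        foreach (List<string> section in Split(paragraphs))
            Console.WriteLine(string.Join(" | ", section));
    }
}
```

Collecting the positions first matters in C# because changing a collection while a foreach loop is enumerating it generally fails at run time, which is the same constraint the comment in Figure 8 refers to.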

Once the Aspose.Words.Document objects have been appropriately sliced into sections, we call AddTopics. Essentially, each section is imported into a new Aspose.Words.Document. In later parts of the program, these objects are saved as HTML through that object's native method calls.

Each new Aspose.Words.Document that represents a section is wrapped within an instance of the Topic class. The Topic class also modifies any hyperlinks embedded in the Microsoft Word document. Recall that when we executed Word2Help, we used a fix parameter of http://www.aspose.com/Products/Aspose.Words/Api. The Topic class rewrites any hyperlink such as "http://www.aspose.com/Products/Aspose.Words/Api/Aspose.Words.Body.html" into the relative link "Aspose.Words.Body.html".
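The effect of the fix parameter on a single URL can be sketched as a prefix strip. This is only an illustration of the behavior described above; the method name is hypothetical, and the case-insensitive comparison and trailing-slash handling are assumptions rather than documented behavior of the Topic or Hyperlink classes.

```csharp
using System;

public static class HyperlinkFixSketch
{
    // Turns an absolute URL under fixUrl into a relative one.
    // URLs that do not start with fixUrl are returned unchanged.
    public static string MakeRelative(string url, string fixUrl)
    {
        // Normalize the prefix so that it always ends with exactly one '/'.
        string prefix = fixUrl.TrimEnd('/') + "/";

        if (url.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            return url.Substring(prefix.Length);

        return url;
    }

    public static void Main()
    {
        const string fixUrl = "http://www.aspose.com/Products/Aspose.Words/Api";

        Console.WriteLine(MakeRelative(
            "http://www.aspose.com/Products/Aspose.Words/Api/Aspose.Words.Body.html", fixUrl));
        Console.WriteLine(MakeRelative("http://example.com/Other.html", fixUrl));
    }
}
```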

WriteHtml

Once we have the input Microsoft Word files sliced and diced, outputting them to HTML requires very little work. The TopicCollection.WriteHtml method enumerates all the Topic instances and, in turn, calls each object's Topic.WriteHtml method. The code for Topic.WriteHtml is shown in Figure 10.

public void WriteHtml(string htmlHeader, string htmlBanner, string htmlFooter, string outDir)
{
    string htmlPath = Path.Combine(outDir, this.FileName);

    // Export to HTML.
    // Don't save to memory stream! This will make images go to the TEMP folder and we don't want that.
    mTopicDoc.Save(htmlPath, SaveFormat.Html);

    // We need to modify the HTML string, so read the HTML back.
    string html;
    using (StreamReader reader = new StreamReader(htmlPath))
        html = reader.ReadToEnd();

    // Builds the HTML <head></head> element.
    string header = RegularExpressions.HtmlTitle.Replace(htmlHeader, mTitle, 1);

    // Applies the new <head></head> element instead of the original one.
    html = RegularExpressions.HtmlHead.Replace(html, header, 1);

    html = RegularExpressions.HtmlBodyDivStart.Replace(html, @" id=""nstext""", 1);

    string banner = htmlBanner.Replace("###TOPIC_NAME###", mTitle);

    // Add the standard banner right after the opening <body> tag.
    html = html.Replace("<body>", "<body>" + banner);

    // Add the standard footer right before the closing </body> tag.
    html = html.Replace("</body>", htmlFooter + "</body>");

    using (StreamWriter writer = new StreamWriter(htmlPath))
        writer.Write(html);
}

Figure 10: Topic.WriteHtml Code Snippet

Topic.WriteHtml calls the Aspose.Words.Document Save method to convert the Microsoft Word content/data to HTML. It should be noted that when a folder path is specified for the Save method, the graphics embedded in the Word file are also saved (with HTML IMG links to those extracted files) to that folder.

Because we want to prepend and append headers and footers, we load the HTML we just saved into a string. We then insert the header, banner and footer and save the HTML file again.
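The string manipulation at the core of this step can be tried in isolation. The sketch below works on in-memory strings only and uses two hypothetical regular expressions; the article does not show the patterns defined in the RegularExpressions class, so these are stand-ins. It also assumes a plain <body> tag without attributes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlPostProcessSketch
{
    // Hypothetical patterns; the ones Word2Help uses are not shown in the article.
    private static readonly Regex Head =
        new Regex(@"<head>.*?</head>", RegexOptions.IgnoreCase | RegexOptions.Singleline);
    private static readonly Regex TitleText =
        new Regex(@"(?<=<title>).*?(?=</title>)", RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string Decorate(string html, string headTemplate,
                                  string banner, string footer, string title)
    {
        // Put the topic title into the template head. A MatchEvaluator is used
        // so that a '$' in the title is not read as a regex substitution.
        string head = TitleText.Replace(headTemplate, m => title, 1);

        // Swap the generated <head> element for the template one.
        html = Head.Replace(html, m => head, 1);

        // Banner goes right after <body>, footer right before </body>.
        html = html.Replace("<body>", "<body>" + banner);
        html = html.Replace("</body>", footer + "</body>");
        return html;
    }

    public static void Main()
    {
        string html = "<html><head><title>x</title></head><body><p>Hi</p></body></html>";
        Console.WriteLine(Decorate(html,
            "<head><title></title></head>",
            "<div>###Banner###</div>",
            "<div>Footer</div>",
            "Getting Started"));
    }
}
```

Passing the replacement through a MatchEvaluator is a design choice of this sketch: with a plain replacement string, Regex.Replace would interpret sequences such as $1 in a title as group references.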

WriteContentXml

Finally, we call WriteContentXml to create a table of contents for all the HTML files we created. Because this is rather straightforward, we will not elaborate on this method.
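For readers who want a feel for how a flat list of headings can become the nested structure of Figure 3, here is one possible approach using a stack and System.Xml.Linq. It is a sketch under stated assumptions, not the Word2Help implementation: the Entry class is hypothetical, levels are assumed to start at 1, and an element is assumed to be a "book" as soon as it has children and an "item" otherwise.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public static class ContentXmlSketch
{
    // Hypothetical description of one topic: heading level, title, target file.
    public sealed class Entry
    {
        public readonly int Level;
        public readonly string Name;
        public readonly string Href; // may be null

        public Entry(int level, string name, string href)
        {
            Level = level; Name = name; Href = href;
        }
    }

    public static XElement Build(string dir, IEnumerable<Entry> entries)
    {
        var root = new XElement("content", new XAttribute("dir", dir));

        // Stack of open ancestors with their levels; the root sits at level 0.
        var open = new Stack<KeyValuePair<int, XElement>>();
        open.Push(new KeyValuePair<int, XElement>(0, root));

        foreach (Entry entry in entries)
        {
            // Close every element that is not shallower than this entry.
            while (open.Peek().Key >= entry.Level)
                open.Pop();

            XElement parent = open.Peek().Value;
            if (parent != root)
                parent.Name = "book"; // it has a child now, so it is a book

            var node = new XElement("item", new XAttribute("name", entry.Name));
            if (entry.Href != null)
                node.Add(new XAttribute("href", entry.Href));

            parent.Add(node);
            open.Push(new KeyValuePair<int, XElement>(entry.Level, node));
        }
        return root;
    }

    public static void Main()
    {
        var entries = new List<Entry>
        {
            new Entry(1, "Aspose.Words Getting Started", "AsposeWordsGettingStarted.html"),
            new Entry(2, "Aspose.Words Features", "AsposeWordsFeatures.html"),
            new Entry(3, "File Formats and Conversions", null),
            new Entry(4, "High-Quality Conversions", "HighQualityConversions.html"),
            new Entry(4, "Microsoft Word (DOC)", "MicrosoftWordDOC.html")
        };
        Console.WriteLine(Build(@"C:\temp\Html", entries));
    }
}
```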

Summary

In this article we discussed the Word2Help utility, which breaks up Microsoft Word file(s) and converts them into HTML files. A key component of this utility is the Aspose.Words product, available from www.aspose.com.

History

12/19/2007 - Posted Article

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
