Introduction
This article discusses a technique for converting a single Microsoft Word file into multiple HTML Help files. Specifically, it examines the Word2Help sample program from Aspose (http://www.aspose.com/Community/Files/51/aspose.words/entry94733.aspx). More generally, it also covers some of the functionality of the Aspose.Words product, which underpins the Word2Help sample.
This method does not call any of the Microsoft Word APIs, so Word does not need to be installed on the computer running the program. Instead, it relies exclusively on the Aspose.Words product to perform its conversions. Aspose.Words provides an easy mechanism for manipulating Microsoft Word files.
The code and sample data files can be found here: http://www.aspose.com/Community/Files/51/aspose.words/category1176.aspx.
Word2Help from a User's Perspective
Before we walk through the technical specifics, it is important to understand what Word2Help does. In a nutshell, the utility breaks a Microsoft Word document into several component HTML documents, splitting at each piece of text that is styled with a Heading style.
Our purpose in this section is to highlight Word2Help's functionality from the user's perspective. It is not meant as an exhaustive explanation of the Word2Help feature set; it is a primer for the technical discussion that follows.
Let's walk through a very simple example of converting a single Word document into several HTML documents. The utility also supports batch conversion of multiple Word files, but our discussion focuses on converting a single file.
Please note that the included ZIP contains not only the program and code, but also sample files for testing the solution. Unzip the included file to C:\Temp or wherever you like; we have used C:\Temp. After unzipping, "Sample Documentation.doc" and several other files are found in C:\Temp\Word2Help\Doc (Figure 1). This folder holds the input sample data for our example.

Figure 1: Input Files - C:\Temp\Word2Help\Doc
We now wish to execute Word2Help against this input folder. The following command tells Word2Help to convert all Word files in C:\Temp\Word2Help\Doc and store their output in C:\Temp\Word2Help\Html. It also specifies a /fix parameter that we will ignore for now.
Word2Help.exe /src:C:\Temp\Word2Help\Doc /out:C:\Temp\Word2Help\Html /fix:http://www.aspose.com/Products/Aspose.Words/Api
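The /name:value switches in this command are parsed by the Starter class, whose parsing code is not reproduced in this article. As a rough sketch only, switches of this shape could be read as follows. SwitchParser and Parse are hypothetical names used for illustration and are not part of Word2Help.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Word2Help): reads switches of the form /name:value.
public static class SwitchParser
{
    public static Dictionary<string, string> Parse(string[] args)
    {
        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (string arg in args)
        {
            if (!arg.StartsWith("/", StringComparison.Ordinal))
                continue;

            // Split at the FIRST colon only, so values such as C:\Temp or http://... stay intact.
            int colon = arg.IndexOf(':');
            if (colon < 2)
                continue; // no switch name, or no colon at all

            string name = arg.Substring(1, colon - 1);
            string value = arg.Substring(colon + 1);
            result[name] = value;
        }
        return result;
    }
}
```

Splitting at the first colon matters here because both the folder paths and the /fix URL contain colons of their own.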
Make sure that the included Aspose.Words.dll file (or a newer version) is either in the GAC or in the same folder as Word2Help.exe. Without a full license, you may see extraneous text in the output files or during Word2Help processing stating that you are using the evaluation version of the product. With a valid license for the product, that message does not appear.
When the above command line runs, it will generate a number of output files (Figure 2). You will note that the output contains three types of files: (1) XML, (2) HTML and (3) Graphic files.

Figure 2: Output Files -- C:\Temp\Word2Help\Html
The XML file is an index of all the HTML files generated. The text of C:\Temp\Word2Help\Html\content.xml is shown below in Figure 3.
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<content dir="C:\temp\Html">
  <book name="Aspose.Words Getting Started" href="AsposeWordsGettingStarted.html">
    <book name="Aspose.Words Features" href="AsposeWordsFeatures.html">
      <book name="File Formats and Conversions">
        <item name="High-Quality Conversions" href="HighQualityConversions.html" />
        <item name="Microsoft Word (DOC)" href="MicrosoftWordDOC.html" />
      </book>
    </book>
  </book>
</content>
Figure 3: Content.xml
Based on our discussion so far, you will notice that the XML hierarchy mirrors the Heading styles applied within the Microsoft Word document. For instance, "Aspose.Words Getting Started" is the only text in the document with the Heading 1 style applied, and "Aspose.Words Features" is the only text with the Heading 2 style. Similarly, "File Formats and Conversions" has the Heading 3 style applied. Finally, "High-Quality Conversions" and "Microsoft Word (DOC)" have the Heading 4 style applied.
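The href values in Figure 3 look like the heading text with spaces and punctuation removed and ".html" appended. The article does not show how Topic.FileName is actually computed, so the following is only a sketch of one rule that reproduces the names in Figure 3. TopicFileName and FromTitle are hypothetical names.

```csharp
using System.Text;

// Hypothetical helper: one possible way to derive an HTML file name from a heading.
// This is an assumption based on the names visible in Figure 3, not Word2Help's own code.
public static class TopicFileName
{
    public static string FromTitle(string title)
    {
        var sb = new StringBuilder(title.Length + 5);
        foreach (char c in title)
        {
            // Keep letters and digits; drop spaces, dots, hyphens, parentheses and so on.
            if (char.IsLetterOrDigit(c))
                sb.Append(c);
        }
        sb.Append(".html");
        return sb.ToString();
    }
}
```

A rule like this can map two different headings to the same file name, so a real implementation would also need some way to handle collisions.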
Now that we've explained the index and the rules for carving up the Microsoft Word document into HTML files, let's finally look at the conversions. Figure 4 shows the Microsoft Word 2007 view of the document, while Figure 5 shows "AsposeWordsGettingStarted.html", which represents the top part of the document.
Comparing the two, you will notice that they look virtually identical. You will also notice that a header and footer were added to the HTML document. In fact, the header.html and banner.html files shown in Figure 1 are prepended to the generated HTML document, and footer.html is appended to it.

Figure 4: Sample Documentation.Doc in Microsoft Word

Figure 5: Section of Sample Documentation.Doc output as HTML
Overview of Classes
Word2Help is a rather simple program because all the heavy lifting is done by Aspose.Words. Word2Help consists of only five classes, as shown in Figure 6.

Figure 6: Word2Help Class Diagram
The Starter class is the startup object for the application. Thus, the entry point into Word2Help is Starter.Main.
The TopicCollection class loads the Microsoft Word document(s), breaks them up into topics and then saves the HTML and XML files.
The Topic class represents a single topic that will be saved as an HTML file. The Hyperlink class provides functionality to manipulate hyperlinks. Finally, the RegularExpressions class provides a Regex for Title, Head and Body.
The Code
The Starter class represents the startup object for the application. It parses the command line parameters, provides basic exception handling and starts the work through the TopicCollection class. The most notable part of the Starter class is the code shown in Figure 7.
TopicCollection topics = new TopicCollection(srcDir, fixUrl);
topics.AddFromDir(srcDir);
topics.WriteHtml(outDir);
topics.WriteContentXml(outDir);
Figure 7: Starter Class Notable Code
The tasks executed in Figure 7 are as follows:
• TopicCollection Constructor
1. Reads the HTML header, banner and footer into memory as strings.
• AddFromDir Method
2. Loads each Microsoft Word file into an Aspose.Words.Document object.
3. For each Aspose.Words.Document object, inserts a Microsoft Word section break immediately before text styled with a "Heading" style.
4. Loads each Microsoft Word section into its own Aspose.Words.Document and stores it as a Topic instance.
• WriteHtml Method
5. For each Topic, saves the Microsoft Word section as HTML while also adding the header, banner and footer.
• WriteContentXml Method
6. Finally, writes the Content.xml file.
Because the TopicCollection constructor is so straightforward, we won't spend time discussing it. In the following sections we will describe the AddFromDir, WriteHtml and WriteContentXml methods.
AddFromDir
AddFromDir is essentially the heart of the program. The method ultimately creates several instances of the Aspose.Words.Document class as a means to manipulate the input Microsoft Word file(s). The Aspose.Words.Document class is also used as the mechanism to convert a Word file or fragment to another format, such as HTML.
The Aspose.Words.Document class is documented here: http://www.aspose.com/Products/Aspose.Words/Api/Aspose.Words.Document.html. It represents a Word document. The class supports loading and saving Word documents in DOC, DOCX, RTF, XML and HTML formats. Additionally, documents can be saved in TXT and PDF formats. As already mentioned, we use this functionality to convert Microsoft Word data into HTML files.
Figure 8 details the more significant parts of the TopicCollection class. It contains the definitions for AddFromDir, AddFromFile, InsertTopicSections and AddTopics.
The AddFromDir method calls AddFromFile for every Microsoft Word file in the input directory. AddFromFile instantiates a new Aspose.Words.Document object for each input Word file and then calls InsertTopicSections and AddTopics.
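// Requires: using System; using System.Collections; using System.IO; using Aspose.Words;

// Processes every Word file found in the input directory.
public void AddFromDir(string dirName)
{
    foreach (string filename in Directory.GetFiles(dirName, "*.doc"))
        AddFromFile(filename);
}

// Loads one Word file, splits it into sections at the headings and stores each section as a topic.
public void AddFromFile(string fileName)
{
    Document doc = new Document(fileName);
    InsertTopicSections(doc);
    AddTopics(doc);
}

private static void InsertTopicSections(Document doc)
{
    DocumentBuilder builder = new DocumentBuilder(doc);
    NodeCollection paras = doc.GetChildNodes(NodeType.Paragraph, true, false);
    ArrayList topicStartParas = new ArrayList();

    // Pass 1: collect the non-empty heading paragraphs that should start a new topic.
    foreach (Paragraph para in paras)
    {
        StyleIdentifier style = para.ParagraphFormat.StyleIdentifier;
        if ((style >= StyleIdentifier.Heading1) && (style <= MaxTopicHeading) &&
            (para.HasChildNodes))
        {
            topicStartParas.Add(para);
        }
        else if ((style > MaxTopicHeading) && (style <= StyleIdentifier.Heading9))
        {
            // Headings deeper than MaxTopicHeading do not start topics; shift them up one level.
            para.ParagraphFormat.StyleIdentifier = (StyleIdentifier)((int)style - 1);
        }
    }

    // Pass 2: insert a section break before each collected heading.
    foreach (Paragraph para in topicStartParas)
    {
        Section section = para.ParentSection;

        // A heading that already starts its section needs no break.
        if (para != section.Body.FirstParagraph)
        {
            builder.MoveTo(para.FirstChild);
            builder.InsertBreak(BreakType.SectionBreakNewPage);

            // Remove the extra paragraph left at the end of the preceding section.
            section.Body.LastParagraph.Remove();
        }
    }
}

// Wraps each section of the document in a Topic; a failing section is reported and skipped.
private void AddTopics(Document doc)
{
    foreach (Section section in doc.Sections)
    {
        try
        {
            Topic topic = new Topic(section, mFixUrl);
            mTopics.Add(topic);
        }
        catch (Exception e)
        {
            Console.WriteLine(e.Message);
        }
    }
}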
Figure 8: TopicCollection Snippets
InsertTopicSections iterates through all the paragraph nodes in an Aspose.Words.Document object. The Aspose.Words.Document object allows the developer to traverse a Microsoft Word document in a fashion similar to an XmlDocument. That is, the class exposes a hierarchy of nodes and provides a number of API constructs patterned after Microsoft's XmlDocument class. This is evident in the call to GetChildNodes, where NodeType.Paragraph is passed as the first parameter. In this case, all the document's paragraphs are returned in an enumerable collection. In turn, objects returned from this method can have children of their own, which can also be enumerated.
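For readers who have not used XmlDocument, the sketch below shows the pattern being compared to, using only the .NET base class library: a deep query for every node of one kind, returned as an enumerable collection in document order. The element names and the XmlAnalogy class are invented for illustration; no Aspose.Words call is involved.

```csharp
using System.Collections.Generic;
using System.Xml;

// Illustration of the XmlDocument pattern only; the "p" element name is made up for this example.
public static class XmlAnalogy
{
    public static List<string> ParagraphTexts(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var texts = new List<string>();
        // GetElementsByTagName searches all descendants, however deeply nested.
        foreach (XmlNode node in doc.GetElementsByTagName("p"))
            texts.Add(node.InnerText);
        return texts;
    }
}
```

The GetChildNodes call in Figure 8 plays a similar role: its second argument asks for a deep search, so paragraphs nested inside other nodes are returned as well.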
The collection of paragraphs is enumerated to search for any paragraph that has a "Heading" style applied. The style of a paragraph can easily be determined through the Paragraph's ParagraphFormat property.
InsertTopicSections also instantiates an Aspose.Words.DocumentBuilder object. This is a companion class to Aspose.Words.Document. Aspose.Words.DocumentBuilder provides a mechanism to "build" a document. More specifically, Aspose.Words.DocumentBuilder provides the means to insert content and formatting, while Aspose.Words.Document provides the overall management of the document.
In InsertTopicSections, the Aspose.Words.DocumentBuilder object takes an Aspose.Words.Document parameter in its constructor. Thus, any changes we make with DocumentBuilder ultimately get reflected back into the Aspose.Words.Document object.
In our case, we want to use the builder to insert a section break at each paragraph that has a "Heading" style applied. In the first part of the method, an ArrayList is filled with the paragraphs that have a "Heading" style applied. In the last part of the method, we enumerate these paragraphs and insert a section break at each one. Thanks to Aspose, this requires only two method calls (Figure 9). The first call moves the "cursor" to the paragraph in question. The second call inserts a break at the "cursor" position.
builder.MoveTo(para.FirstChild);
builder.InsertBreak(BreakType.SectionBreakNewPage);
Figure 9: Moving the Cursor and Inserting a Section Break
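To make the splitting rule concrete without depending on Aspose.Words, here is a sketch over a plain list of paragraphs, each carrying a heading level (0 for body text). It mirrors the logic of Figure 8: a non-empty heading at or above the cut-off level starts a new topic, and deeper headings are shifted up one level. Para, TopicSplitter and maxTopicHeading are hypothetical stand-ins, not Word2Help or Aspose.Words types.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for a Word paragraph: 0 = body text, 1..9 = Heading 1..9.
public sealed class Para
{
    public int HeadingLevel;
    public string Text;
    public Para(int headingLevel, string text) { HeadingLevel = headingLevel; Text = text; }
}

public static class TopicSplitter
{
    // Groups paragraphs into topics; each qualifying heading opens a new group.
    public static List<List<Para>> Split(IEnumerable<Para> paras, int maxTopicHeading)
    {
        var topics = new List<List<Para>>();
        List<Para> current = null;

        foreach (Para p in paras)
        {
            bool startsTopic = p.HeadingLevel >= 1
                && p.HeadingLevel <= maxTopicHeading
                && p.Text.Length > 0;

            // Content before the first heading still needs a topic to live in.
            if (startsTopic || current == null)
            {
                current = new List<Para>();
                topics.Add(current);
            }

            // Headings below the cut-off stay inside the topic, one level higher.
            Para stored = (p.HeadingLevel > maxTopicHeading && p.HeadingLevel <= 9)
                ? new Para(p.HeadingLevel - 1, p.Text)
                : p;
            current.Add(stored);
        }
        return topics;
    }
}
```

In Word2Help itself the groups are not separate lists; the section breaks inserted by DocumentBuilder do the same job inside the document.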
Once the Aspose.Words.Document objects have been sliced into sections, we call AddTopics. Essentially, each section is imported into a new Aspose.Words.Document. In later parts of the program, these objects are saved as HTML through the Document object's own Save method.
Each new Aspose.Words.Document that represents a section is wrapped in an instance of the Topic class. The Topic class also modifies any hyperlinks embedded in the Microsoft Word content. Recall that when we executed Word2Help, we used a fix parameter of http://www.aspose.com/Products/Aspose.Words/Api. The Topic class modifies any hyperlink that looks like "http://www.aspose.com/Products/Aspose.Words/Api/Aspose.Words.Body.html" by changing it into "Aspose.Words.Body.html".
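The hyperlink rewrite amounts to stripping a known URL prefix so that the link becomes relative. The Hyperlink and Topic classes that do this in Word2Help are not listed in this article, so the following is only a sketch of the string handling involved. HyperlinkFixer and Fix are hypothetical names.

```csharp
using System;

// Hypothetical helper: turns an absolute URL under fixUrl into a relative one.
public static class HyperlinkFixer
{
    public static string Fix(string url, string fixUrl)
    {
        if (string.IsNullOrEmpty(fixUrl))
            return url;

        // Normalise the prefix so that it always ends with exactly one slash.
        string prefix = fixUrl.TrimEnd('/') + "/";

        if (url.StartsWith(prefix, StringComparison.OrdinalIgnoreCase))
            return url.Substring(prefix.Length);

        // Links that point elsewhere are left untouched.
        return url;
    }
}
```

Requiring the trailing slash keeps a URL such as .../ApiOther/page.html from being treated as if it lived under .../Api.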
WriteHtml
Once we have the input Microsoft Word files sliced and diced, outputting them to HTML requires very little work. The TopicCollection.WriteHtml method enumerates all the Topic instances and, in turn, calls each object's Topic.WriteHtml method. The code for Topic.WriteHtml is shown in Figure 10.
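// Requires: using System.IO; using Aspose.Words;
public void WriteHtml(string htmlHeader, string htmlBanner, string htmlFooter, string outDir)
{
    string fileName = Path.Combine(outDir, this.FileName);

    // Let Aspose.Words convert the topic document to an HTML file on disk.
    mTopicDoc.Save(fileName, SaveFormat.Html);

    // Read the generated HTML back so that it can be post-processed.
    string html;
    using (StreamReader reader = new StreamReader(fileName))
        html = reader.ReadToEnd();

    // Fill the topic title into the header template and substitute it for the generated head.
    string header = RegularExpressions.HtmlTitle.Replace(htmlHeader, mTitle, 1);
    html = RegularExpressions.HtmlHead.Replace(html, header, 1);
    html = RegularExpressions.HtmlBodyDivStart.Replace(html, @" id=""nstext""", 1);

    // Insert the banner after the opening body tag and the footer before the closing one.
    // The <body> and </body> literals are inferred from the surrounding description.
    string banner = htmlBanner.Replace("###TOPIC_NAME###", mTitle);
    html = html.Replace("<body>", "<body>" + banner);
    html = html.Replace("</body>", htmlFooter + "</body>");

    using (StreamWriter writer = new StreamWriter(fileName))
        writer.Write(html);
}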
Figure 10: Topic.WriteHtml Code Snippet
Topic.WriteHtml calls the Aspose.Words.Document Save method to convert the Microsoft Word content/data to HTML. It should be noted that when a folder path is specified for the Save method, the graphics embedded in the Word file are also saved (with HTML IMG links to those extracted files) to that folder.
Because we want to prepend the header and banner and append the footer, we load the HTML we just saved into a string. We then do the prepending and appending and re-save the HTML file.
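As an illustration of this post-processing step in isolation, the sketch below inserts a banner after the opening body tag and a footer before the closing one, working purely on strings. It is a variation on the idea rather than Word2Help's code: HtmlWrapper and Wrap are hypothetical names, and the regular expressions are assumptions chosen to tolerate attributes on the body tag.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: wraps generated HTML with a banner and a footer.
public static class HtmlWrapper
{
    private static readonly Regex BodyOpen = new Regex(@"<body[^>]*>", RegexOptions.IgnoreCase);
    private static readonly Regex BodyClose = new Regex(@"</body\s*>", RegexOptions.IgnoreCase);

    public static string Wrap(string html, string banner, string footer, string title)
    {
        // Same placeholder convention as the banner template in Figure 10.
        string filledBanner = banner.Replace("###TOPIC_NAME###", title);

        // A MatchEvaluator is used so that '$' in the banner or footer is not
        // interpreted as a regex substitution token.
        html = BodyOpen.Replace(html, m => m.Value + filledBanner, 1);
        html = BodyClose.Replace(html, m => footer + m.Value, 1);
        return html;
    }
}
```

For example, wrapping a page whose body holds a single paragraph places the banner directly before that paragraph and the footer directly after it, leaving the rest of the document unchanged.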
WriteContentXml
Finally, we call WriteContentXml to create a table of contents for all the HTML files we created. Because this is rather straightforward, we will not elaborate on this method.
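Although the method itself is not listed here, the shape of its output is visible in Figure 3: headings with sub-headings become nested book elements and leaf headings become item elements. The sketch below shows one way such a file could be produced from a flat list of headings with XmlWriter. TocEntry and ContentXml are hypothetical names, the dir attribute is omitted, and the book-versus-item rule is an assumption read off Figure 3.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Xml;

// Hypothetical flat table-of-contents entry: heading level, title and optional file name.
public sealed class TocEntry
{
    public int Level;
    public string Name;
    public string Href; // may be null when the heading has no page of its own
    public TocEntry(int level, string name, string href) { Level = level; Name = name; Href = href; }
}

public static class ContentXml
{
    public static string Build(IList<TocEntry> entries)
    {
        var sb = new StringBuilder();
        var settings = new XmlWriterSettings { Indent = true, OmitXmlDeclaration = true };
        using (XmlWriter w = XmlWriter.Create(sb, settings))
        {
            w.WriteStartElement("content");
            var open = new Stack<int>(); // levels of the <book> elements currently open

            for (int i = 0; i < entries.Count; i++)
            {
                TocEntry e = entries[i];

                // Close every open book that is not an ancestor of this entry.
                while (open.Count > 0 && open.Peek() >= e.Level)
                {
                    w.WriteEndElement();
                    open.Pop();
                }

                // An entry followed by a deeper entry becomes a book, otherwise an item.
                bool hasChildren = i + 1 < entries.Count && entries[i + 1].Level > e.Level;
                w.WriteStartElement(hasChildren ? "book" : "item");
                w.WriteAttributeString("name", e.Name);
                if (e.Href != null)
                    w.WriteAttributeString("href", e.Href);

                if (hasChildren)
                    open.Push(e.Level);
                else
                    w.WriteEndElement();
            }

            while (open.Count > 0)
            {
                w.WriteEndElement();
                open.Pop();
            }
            w.WriteEndElement(); // </content>
        }
        return sb.ToString();
    }
}
```

Feeding in the five headings from the sample document, with levels 1, 2, 3, 4 and 4, yields three nested book elements containing two item elements, matching the nesting in Figure 3.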
Summary
In this article we discussed the Word2Help utility, which breaks up Microsoft Word file(s) and converts them into HTML files. A key component of this utility is the Aspose.Words product, available from www.aspose.com.
History
12/19/2007 - Posted Article