Word 2003 CodeProject Article Template






May 25, 2004
A Word 2003 template for CodeProject articles, and an examination of XSL transformation from Word to concise HTML
Introduction
With an XML-based file format, WordprocessingML, Word 2003 provides new opportunities for using XSL transformation to convert data and documents to and from Word. This article presents a utility template for writing CodeProject articles in Word 2003, with an XSL stylesheet for converting the native document to a concise HTML syntax representative of the CodeProject submission template. This article is not intended to serve as an introduction to XSL transformation, nor necessarily as a primer on WordprocessingML. Rather, this article offers XSL examples for transforming a Word document with single- and multi-line paragraph styles, character formatting, images, hyperlinks, and tables.
Background
I like using Word for writing articles. There are numerous features – outlining, revision tracking, and proofing tools, to name a few – to assist the writer. Historically, though, Word has had its problems as a rich-text HTML editor. Over the years, its functions for saving a document as HTML have produced notoriously complex and verbose syntax. Word 2003 offers both a full-fidelity HTML save format and a "filtered HTML" format. The former produces as garrulous a syntax as previous versions; the latter, though cleaner, still handles too many formats (such as a simple list item) using a <span> tag rather than the suitable HTML (<li>). Though I prefer an editor that generates more standard HTML, I still wish to benefit from all of Word's features. Writing CodeProject articles, based on the CodeProject submission template, is an excellent case where I want Word's power but simple HTML output, using standard heading <h2>, paragraph <p>, and list item <li> tags, among others.
Word 2003 opens the door to this possibility by offering WordprocessingML as an XML-based save format. Originally called WordML, WordprocessingML provides a complete grammar for representing a Word document as XML. With it and an appropriate XSL stylesheet, document transformation to a simpler HTML format is attainable. The template and companion XSL stylesheet described in this article serve as a utility to convert a Word 2003 document into a simpler HTML syntax for CodeProject articles.
For the reader not familiar with XML or XSL transformation, try the W3Schools tutorials on XML and XSL. For an introduction to and reference for WordprocessingML, try the following from Microsoft:
- Office 2003 XML Reference Schemas
- New XML Features of the Microsoft Office Word 2003 Object Model
Using the Template
The template includes a custom toolbar, styles in the Bob-loves-orange CodeProject colors, and some VBA code. Because of the code, security issues must be considered when using the template.
Setting Up
Copy the template CodeProject Article.dot to your local templates directory. To find this location, open Word's Tools menu, choose Options, and look under "User templates" on the File Locations tab.
A typical location for the templates folder is "driveLetter:\Documents and Settings\user\Application Data\Microsoft\Templates".
Security Issues
Depending on your security settings, you may receive a warning (or the code may be disabled entirely) when attempting to use the template. To view your security settings in Word, open the Tools menu and choose Macro, then Security.
The template is not signed, so its code may be disabled if the security level is set higher than Medium. To use the template, choose one of the following options:
- On the Trusted Publishers tab of the Security window, check the box labeled Trust all installed add-ins and templates. This allows use of the template provided it has been copied to the User Templates file location.
- Set the Security Level to Medium and when opening the template, choose to enable macros.
- Sign the template with your own security certificate, potentially including that certificate among the Trusted Publishers list. Refer to Word 2003 documentation for more information on code signing.
The First Time – Setting Options
To create a new document using the template, open the File menu and choose New… In the New Document task pane, under Templates, click On my computer…, then select the CodeProject Article icon. Upon first use, the Options dialog displays:
In the XSL Transform Stylesheet box, enter the full path of the companion XSL stylesheet, or click Browse to locate the file. This path must be set for the XSL transformation to function correctly. Check the box Open the .html file after XSL transform at your discretion.
These options are stored as custom properties in the template itself, so there are no additional registry settings or external files used.
Toolbar Functions
The XSL transformation employed here is largely based on the use of paragraph, character, and table styles. Specific style names are easy to match in the XSL stylesheet, and the template encourages the use of these styles through the functions on its custom toolbar.
| Function | Description |
| --- | --- |
| Heading 2 | Apply the Heading2 style to the selected paragraph. Heading2 renders as an <h2> tag. |
| Heading 3 | Apply the Heading3 style to the selected paragraph. Heading3 renders as an <h3> tag. |
| Code Block | Apply the pre style to the selected paragraph(s). When transformed, blocks using the pre style are rendered within <pre> tags. |
| Normal | Apply the Normal style to the selected paragraph(s). Normal paragraphs render as <p> tags. |
| BulletList | Apply the BulletList style to the selected paragraph(s). This style name is interpreted upon transformation as a <ul> list, with each paragraph rendered as an <li> item. |
| NumberList | Apply the NumberList style to the selected paragraph(s). This style name is interpreted upon transformation as an <ol> list. |
| Bold, Italic, Underline | Standard bold, italic, and underline character formatting, transformed to <b>, <i>, and <u> tags. |
| Code formatting | Character formatting for variables or class names; this style name transforms to a <code> tag. |
| Table style – Border0 | Apply the TableBorder0 table style to the selected table. Upon transformation, this renders a <table> tag with a border attribute of "0". |
| Table style – Border1 | Apply the TableBorder1 table style to the selected table. Upon transformation, this renders a <table> tag with a border attribute of "1". |
| Insert Hyperlink | Standard Word 2003 command for inserting hyperlinks, with the utility of including a new window [^] link. Destinations may be external to the document, or internal bookmarks. (Note: proofing errors within hyperlinks may interfere with how hyperlinks are rendered in the XML; see Additional Considerations for more information.) |
| Insert Download | Custom command for inserting a download file hyperlink, such as those that appear above an article. In addition to constructing the link, the DownloadList paragraph style is applied, which the transformation groups into a single list block. |
| Insert Linked Picture | Opens the standard Word Insert Picture dialog, and then ensures that the inserted picture is linked and not embedded. Upon transformation, a linked picture is rendered as an <img> tag. |
| Apply XSL Transformation | Saves the current document in its original format (typically .doc), then saves again using XSL transformation, generating a file with the same name as the original but with an .html extension. Once transformed, the document is reset so additional saves retain the original format. |
| Options | Opens the Options dialog, allowing the path to the XSL stylesheet to be set. These options are stored directly in the template as custom properties. |
The XSL Stylesheet
The file CPArticleTransform.xsl provides the XSL stylesheet used for this transformation. This file can be saved anywhere on the drive with the template; as mentioned, the template's Options dialog provides a box to enter the full stylesheet path.
Namespaces and Outer Templates
WordprocessingML incorporates a number of namespaces, which we will include as attributes in the root <xsl:stylesheet> tag.
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:w10="urn:schemas-microsoft-com:office:word"
xmlns:sl="http://schemas.microsoft.com/schemaLibrary/2003/core"
xmlns:aml="http://schemas.microsoft.com/aml/2001/core"
xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"
>
Among this listing, the following prefixes are particularly important in our transformation:
- xsl – serves as an alias for the namespace defining XSL transformation; stylesheet commands will be prefixed with xsl.
- w – alias for the WordprocessingML namespace; when matching most nodes specific to the Word document, we'll prefix using w. For example, to match a Word paragraph tag, we'll look for <w:p>.
- v – alias for the VML namespace, used by Word to represent images.
- wx – alias for the Word 2003 auxiliary namespace; section and sub-section tags will be prefixed with wx.
- aml – alias for the Annotation Markup Language namespace; bookmarks are represented as <aml:annotation> tags.
The root node of a Word document, represented through WordprocessingML, is the <w:wordDocument> element. Our template for matching this root node of the document is as follows:
<!-- =============================================================
Match the root node
============================================================= -->
<xsl:template match="/w:wordDocument">
<html>
<head>
<title>The Code Project</title>
<style>
BODY, P, TD { font-family: Verdana, Arial, Helvetica,
sans-serif;
font-size: 10pt }
H2,H3,H4,H5 { color: #ff9900; font-weight: bold; }
H2 { font-size: 13pt; }
H3 { font-size: 12pt; }
H4 { font-size: 10pt; color: black; }
PRE { BACKGROUND-COLOR: #FBEDBB;
FONT-FAMILY: "Courier New", Courier, mono;
WHITE-SPACE: pre; }
CODE { COLOR: #990000;
FONT-FAMILY: "Courier New", Courier, mono; }
</style>
<link rel="stylesheet" type="text/css"
href="http://www.codeproject.com/styles/global.css" />
</head>
<body>
<!-- skip to the w:body tag -->
<xsl:apply-templates select="w:body" />
</body>
</html>
</xsl:template>
With this template, we set up the article HTML and issue the <xsl:apply-templates> instruction to render the document body.
In WordprocessingML, a <w:body> tag serves as a container for section and sub-section nodes, represented as <wx:sect> and <wx:sub-section>. These in turn serve as containers for paragraphs, represented by the <w:p> tag. It is at the paragraph level that the heart of our processing begins, so for <w:body>, <wx:sect>, and <wx:sub-section> matches, we simply issue the <xsl:apply-templates> instruction to dive further down into the element hierarchy.
<!-- =============================================================
Match nodes that would encapsulate a paragraph <w:p> node
============================================================= -->
<!-- match the w:body node -->
<xsl:template match="w:body">
<xsl:apply-templates select="*" />
</xsl:template>
<!-- match the wx:sect node -->
<xsl:template match="wx:sect">
<xsl:apply-templates select="*" />
</xsl:template>
<!-- match the wx:sub-section node -->
<xsl:template match="wx:sub-section">
<xsl:apply-templates select="*" />
</xsl:template>
Single-line Paragraph Formatting
Once inside the body of the document, we use a template matching the tag <w:p>, which represents an individual paragraph. As the template is based on the use of styles in Word, locating heading paragraphs is a straightforward matter. Among other children, a paragraph may contain a <w:pPr> tag, short for "paragraph properties". The <w:pPr> tag may contain a <w:pStyle> tag if a paragraph style is in use. The name of the style will be found in the w:val attribute. Therefore, to match a paragraph with the Heading2 style, we can use the following XPath syntax:
w:pPr/w:pStyle[@w:val='Heading2']
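To make the match concrete, a Heading2 paragraph might appear in the saved XML roughly as follows. This is a hand-written, simplified fragment rather than captured Word output; the heading text is invented for illustration, and a real document carries additional attributes and properties.

```xml
<!-- Simplified, hand-written WordprocessingML; not verbatim Word output -->
<w:p xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <!-- paragraph properties: the style name sits in w:pStyle/@w:val -->
  <w:pPr>
    <w:pStyle w:val="Heading2" />
  </w:pPr>
  <!-- a single run holding the heading text -->
  <w:r>
    <w:t>Background</w:t>
  </w:r>
</w:p>
```

Evaluated with this <w:p> as the context node, the XPath expression selects the <w:pStyle> element, so a test on it succeeds.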
The <w:p> template looks for several different heading paragraph styles from within an <xsl:choose> tag. The <xsl:otherwise> condition applies a simple <p> tag in the output.
<xsl:template match="w:p">
<!-- seek paragraph formatting and apply tags accordingly -->
<xsl:choose>
<!-- ==========================================================
single-paragraph items
paragraph formatting that is fairly simple to handle;
these are typically heading formats that would fit in a
single line
========================================================== -->
<xsl:when test="w:pPr/w:pStyle[@w:val='Heading2']">
<h2><xsl:apply-templates select="*" /></h2>
</xsl:when>
<xsl:when test="w:pPr/w:pStyle[@w:val='Heading3']">
<h3><xsl:apply-templates select="*" /></h3>
</xsl:when>
<xsl:when test="w:pPr/w:pStyle[@w:val='Heading4']">
<h4><xsl:apply-templates select="*" /></h4>
</xsl:when>
<xsl:when test="w:pPr/w:pStyle[@w:val='Heading5']">
<h5><xsl:apply-templates select="*" /></h5>
</xsl:when>
. . .
<!-- ==========================================================
treat everything else as a regular paragraph
(e.g. the Normal style)
========================================================== -->
<xsl:otherwise>
<p>
<!-- apply horizontal align? -->
<xsl:choose>
<xsl:when test="w:pPr/w:jc/@w:val">
<xsl:attribute name="align">
<xsl:value-of select="w:pPr/w:jc/@w:val" />
</xsl:attribute>
</xsl:when>
</xsl:choose>
<!-- apply templates for content -->
<xsl:apply-templates select="*" />
</p>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
Multi-line Paragraph Formatting
A more complex situation arises when using lists or <pre> sections. In these cases, each line (ended with a carriage return) is considered a new paragraph by Word, and has its own paragraph style information. Though we can still identify each by its style name (e.g. "BulletList", or "pre"), we need to treat the multiple lines as a single group, surrounded with, say, a <ul> or <pre> container.
For these cases, we will still test for the style name as we did before. Once found, we'll test the preceding paragraph to see if it matches the same style. If it doesn't, we can assume we are beginning the multi-paragraph block. In the case of a BulletList for example, we will then apply a transform like the following:
<ul>
<xsl:apply-templates select="." mode="insideBulletList"/>
</ul>
The mode attribute here is the key to making this work. We will continue applying templates, thus continuing to match <w:p> tags. However, by specifying a mode we can change the operational <w:p> template to one specifically designed for, say, a bullet list. Recall that our original <w:p> template was defined without a mode:
<xsl:template match="w:p">
. . .
</xsl:template>
We'll define another template to match <w:p> tags, but include the mode attribute to handle paragraph processing differently inside a BulletList.
<!-- match paragraph nodes that are part of a Bullet list -->
<xsl:template match="w:p" mode="insideBulletList">
<!-- output this bullet item paragraph -->
<li><xsl:apply-templates /></li>
<!--go to next one-->
<xsl:apply-templates
select="following-sibling::*[1][self::w:p/w:pPr/w:pStyle[@w:val='BulletList']]"
mode="insideBulletList" />
</xsl:template>
A paragraph match here outputs the list item <li> tag, then applies the same template to the immediately following sibling, provided it shares the paragraph style name "BulletList". So back in the original <w:p> template, as an <xsl:when> condition in the original <xsl:choose> instruction, the following handles BulletList formatting:
. . .
<!-- ==========================================================
multi-paragraph items
paragraph formatting that is more complicated to handle;
these are typically paragraph formats that will span multiple
lines, such as a list of items or the <pre> format
========================================================== -->
<!-- match the BulletList style -->
<xsl:when test="w:pPr/w:pStyle[@w:val='BulletList']">
<xsl:choose>
<!-- if the preceding paragraph was also a BulletList style,
then it has already been handled through the
'insideBulletList' mode; ignore it here -->
<xsl:when
test="preceding-sibling::*[1][self::w:p/w:pPr/w:pStyle[@w:val='BulletList']]"/>
<!-- otherwise, start a UL tag and apply templates with the
'insideBulletList' mode -->
<xsl:otherwise>
<ul>
<xsl:apply-templates select="." mode="insideBulletList"/>
</ul>
</xsl:otherwise>
</xsl:choose>
</xsl:when>
. . .
This block reflects the pattern also used for NumberList, DownloadList, and pre paragraph styles.
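As a sketch of how the same pattern might look for the NumberList style, the minimal stylesheet below mirrors the BulletList handling. The mode name insideNumberList and the <ol> wrapper are assumptions made for illustration, not names confirmed by the article's stylesheet, and the exclude-result-prefixes attribute is added only to keep the w: namespace declaration out of the output.

```xml
<?xml version="1.0"?>
<!-- Sketch only: mode name "insideNumberList" and the <ol> wrapper are assumptions -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    exclude-result-prefixes="w">

  <xsl:template match="w:p">
    <xsl:choose>
      <xsl:when test="w:pPr/w:pStyle[@w:val='NumberList']">
        <xsl:choose>
          <!-- already emitted while processing the preceding NumberList paragraph -->
          <xsl:when
              test="preceding-sibling::*[1][self::w:p/w:pPr/w:pStyle[@w:val='NumberList']]" />
          <!-- first paragraph of the block: open the list container -->
          <xsl:otherwise>
            <ol>
              <xsl:apply-templates select="." mode="insideNumberList" />
            </ol>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:when>
      <!-- everything else is treated as a regular paragraph -->
      <xsl:otherwise>
        <p><xsl:apply-templates select="*" /></p>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- emit one list item, then hand off to the next sibling if it continues the list -->
  <xsl:template match="w:p" mode="insideNumberList">
    <li><xsl:apply-templates /></li>
    <xsl:apply-templates
        select="following-sibling::*[1][self::w:p/w:pPr/w:pStyle[@w:val='NumberList']]"
        mode="insideNumberList" />
  </xsl:template>

</xsl:stylesheet>
```

Because each list paragraph hands off only to its immediate sibling, the recursion stops at the first paragraph with a different style, which closes the container at the right place.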
Runs and Character Formatting
In WordprocessingML, the tag <w:r> identifies a run of content. These tags are children of <w:p> tags and represent containers of content with consistent character formatting. Text, linked images, and line breaks are all examples of content nested inside a <w:r> tag. The <w:r> tag may also contain a <w:rPr> tag to enclose the properties (including character formatting) of the run. As multiple character formats may be applied to a run, we must adhere to proper hierarchical nesting of formatting tags in the output. To accomplish this, we will call a recursive template when matching a <w:r> tag, and pass as a parameter the first of the child formatting tags within the <w:rPr> run property parent.
<!-- =============================================================
match run nodes <w:r> within a paragraph;
this is the common container for content such as text or
pictures and where we'll deduce character formatting
============================================================= -->
<!-- match run nodes within a paragraph-->
<xsl:template match="w:r">
<!-- =======================================================
Character formatting at this level is identified with
run property nodes <w:rPr>; for example, a run with bold
formatting will have a <w:rPr> tag with a <w:b> tag for
a child. As multiple formatting tags are possible, we
need a way to account for them all while maintaining the
necessary XML hierarchical structure. We'll accomplish
this with a recursive template that loops through
each property <w:rPr> node, surrounding the content nodes
with proper formatting tags.
The recursive template is "recurseRunProps"; we initiate
the first call to it here.
======================================================= -->
<!-- iterate through all w:rPr child tags to apply formatting;
at the end of the recursion, the run text is applied -->
<xsl:call-template name="recurseRunProps">
<xsl:with-param name="nodeCount" select="1" />
<xsl:with-param name="propNodes" select="w:rPr/*" />
<!-- run content will be any child node that isn't a <w:rPr>
node; this will include text <w:t>, picture <w:pict>,
and line breaks <w:br> -->
<xsl:with-param name="runContent" select="*[not(self::w:rPr)]" />
</xsl:call-template>
</xsl:template>
The recursive template recurseRunProps checks to see if it has been passed a valid node, and if so tries to match supported character formatting. If a supported formatting tag is caught, an <xsl:call-template> instruction is issued to execute recurseRunProps again with the next formatting child, nested within the appropriate output formatting tags. If the passed node is not a supported formatting tag, recurseRunProps is still called with the next formatting child, if any. When the recursion has ended, an <xsl:apply-templates> instruction is performed to process the inner run content.
The following shows the pattern in recurseRunProps for matching <w:b> bold formatting tags. Italic, underline, and <code> character formats follow the same pattern.
<!-- =============================================================
This is a recursive template that is called when a run tag
<w:r> is matched
parameters:
nodeCount - the index value of the formatting node (a child
of <w:rPr>) within propNodes that should be processed
propNodes - the complete list of <w:rPr> child nodes to
recursively process
runContent - the content (i.e. all nodes other than
<w:rPr>) around which formatting tags
are to be applied
============================================================= -->
<xsl:template name="recurseRunProps">
<xsl:param name="nodeCount" />
<xsl:param name="propNodes" />
<xsl:param name="runContent" />
<!-- select the <w:rPr> node to process, based on the index
nodeCount -->
<xsl:variable name="curNode" select="$propNodes[$nodeCount]" />
<!-- is this a valid node to process? -->
<xsl:choose>
<xsl:when test="$curNode">
<!-- this is a valid node; process it, and
recursively call processing for the next node;-->
<xsl:choose>
<!-- process Bold tags -->
<xsl:when test="$curNode[self::w:b]">
<b>
<xsl:call-template name="recurseRunProps">
<xsl:with-param name="propNodes" select="$propNodes" />
<xsl:with-param name="nodeCount" select="$nodeCount+1" />
<xsl:with-param name="runContent" select="$runContent" />
</xsl:call-template>
</b>
</xsl:when>
. . .
<!-- we don't recognize this run formatting tag;
ignore it and go to the next -->
<xsl:otherwise>
<xsl:call-template name="recurseRunProps">
<xsl:with-param name="propNodes" select="$propNodes" />
<xsl:with-param name="nodeCount" select="$nodeCount+1" />
<xsl:with-param name="runContent" select="$runContent" />
</xsl:call-template>
</xsl:otherwise>
</xsl:choose>
</xsl:when>
<!-- If this isn't a valid node, then we're out of nodes
to process; output the run content at this point
and end the recursion. The run content is handled
by applying templates that match run content nodes
(such as <w:t> or <w:pict>) -->
<xsl:otherwise>
<xsl:apply-templates select="$runContent" />
</xsl:otherwise>
</xsl:choose>
</xsl:template>
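As a worked example of the recursion, consider a run carrying both bold and italic formatting. The fragment below is hand-written and simplified, with invented text; it assumes the italic branch (elided above) mirrors the bold one and emits an <i> tag.

```xml
<!-- Simplified, hand-written input run; not verbatim Word output -->
<w:r xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:rPr>
    <w:b />  <!-- propNodes[1]: opens <b>, then recurses with nodeCount = 2 -->
    <w:i />  <!-- propNodes[2]: opens <i>, then recurses with nodeCount = 3 -->
  </w:rPr>
  <!-- propNodes[3] does not exist, so the run content is emitted at this depth -->
  <w:t>important</w:t>
</w:r>
<!-- Resulting nesting, under the assumption stated above:
     <b><i>important</i></b> -->
```

Each recursive call sits inside the tag opened by the previous one, which is why the closing tags come out in the correct reverse order without any extra bookkeeping.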
Run Content: Text, Line Breaks, and Images
Following the recursive application of character formatting, we process the content of a run. Each type of content is supported through its own template matching a WordprocessingML tag. Regular text, represented by a <w:t> tag, is rendered with an <xsl:value-of> instruction.
<!-- match text nodes within a run-->
<xsl:template match="w:t">
<!-- simple - just output the text content -->
<xsl:value-of select="." />
</xsl:template>
Line breaks (created in Word by pressing [Shift]+[Enter]) are also simple to address with a template matching the <w:br> tag:
<!-- match br tags in a run -->
<xsl:template match="w:br">
<!-- simple line break within a paragraph; this is entered in word with
[Shift]+[Enter] -->
<br />
</xsl:template>
Images are a little more complicated. Our output should be an <img> tag with a src attribute pointing to a file relative to the HTML document itself. To support this, we must insert linked pictures in the Word document rather than embedded pictures. Linked pictures are identified with <w:pict> tags in WordprocessingML. We can pull the linked file source name from the src attribute of the w:pict/v:shape/v:imagedata child tag. Pictures in WordprocessingML are described with VML syntax, hence the v: prefix. In VML, image dimensions are represented through a CSS style attribute. We use that to add a style attribute to the output <img> tag.
<!-- match linked picture nodes within a run -->
<xsl:template match="w:pict">
<!-- linked pictures can be handled this way; embedded pics cannot -->
<!-- output as an <img> tag -->
<img>
<!-- output the src attribute; this seems to be a file name
relative to the word document -->
<xsl:attribute name="src">
<xsl:value-of select="v:shape/v:imagedata/@src" />
</xsl:attribute>
<!-- word is using VML to store image information;
for width and height this is in css units;
capture the css style property and apply to
the html <img> tag -->
<xsl:if test="v:shape/@style">
<xsl:attribute name="style">
<xsl:value-of select="v:shape/@style" />
</xsl:attribute>
</xsl:if>
</img>
</xsl:template>
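For orientation, the input this template expects might look roughly like the fragment below. It is a hand-written simplification: the file name and dimensions are invented, and real Word output includes further attributes on these elements.

```xml
<!-- Simplified, hand-written linked-picture markup; not verbatim Word output -->
<w:pict xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
        xmlns:v="urn:schemas-microsoft-com:vml">
  <!-- the CSS-style dimensions live on v:shape/@style -->
  <v:shape style="width:300pt;height:200pt">
    <!-- the linked file name lives on v:imagedata/@src -->
    <v:imagedata src="options.gif" />
  </v:shape>
</w:pict>
```

Given this input, the template produces an <img> element whose src is options.gif and whose style is width:300pt;height:200pt.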
Hyperlinks
In WordprocessingML, a hyperlink is represented with a <w:hlink> tag. If present, a w:dest attribute indicates an external destination. Without it, a destination internal to the document is assumed. The w:bookmark attribute then contains the name of a destination bookmark.
Document bookmarks are represented by Word as a pair of <aml:annotation> tags, one with a w:type attribute of "Word.Bookmark.Start", the other with a w:type value of "Word.Bookmark.End". The .Start bookmark tag also has a w:name attribute representing the bookmark name. It is this value that will match the w:bookmark value in the <w:hlink> tag.
Whether the destination is external or internal, the <w:hlink> tag will nest its display text as inner content.
<!-- match hlink nodes within a paragraph -->
<xsl:template match="w:hlink">
<!-- get the destination, if any -->
<xsl:variable name="dest">
<xsl:value-of select="@w:dest" />
</xsl:variable>
<!-- set up the anchor tag -->
<a>
<!-- add the href attribute -->
<xsl:attribute name="href">
<xsl:choose>
<!-- if the w:bookmark attribute is present, use it -->
<xsl:when test="@w:bookmark">
<xsl:value-of select="concat($dest, '#', @w:bookmark)" />
</xsl:when>
<!-- if not, just use the destination value -->
<xsl:otherwise>
<xsl:value-of select="$dest" />
</xsl:otherwise>
</xsl:choose>
</xsl:attribute>
<!-- if there is a w:target attribute, use it too -->
<xsl:if test="@w:target">
<xsl:attribute name="target">
<xsl:value-of select="@w:target" />
</xsl:attribute>
</xsl:if>
<!-- add inner content, probably a <w:t> tag -->
<xsl:apply-templates />
</a>
</xsl:template>
<!-- match the tag that indicates the starting position
for a bookmark -->
<xsl:template match="aml:annotation">
<xsl:if test="@w:type='Word.Bookmark.Start'">
<!-- use the w:name attribute to identify the bookmark name -->
<!-- place an anchor link here with that name -->
<a>
<xsl:attribute name="name">
<xsl:value-of select="@w:name" />
</xsl:attribute>
</a>
</xsl:if>
</xsl:template>
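The fragment below sketches the three cases these templates handle: a bookmark pair, an external link, and an internal link. It is hand-written and simplified, with an invented bookmark name and link text; real Word output carries additional attributes.

```xml
<!-- Simplified, hand-written WordprocessingML; not verbatim Word output -->
<w:p xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
     xmlns:aml="http://schemas.microsoft.com/aml/2001/core">
  <!-- bookmark pair: only the Start tag carries the name;
       it becomes <a name="Summary"></a>, and the End tag produces no output -->
  <aml:annotation w:type="Word.Bookmark.Start" w:name="Summary" />
  <aml:annotation w:type="Word.Bookmark.End" />
  <!-- external link: w:dest is present, so href is the destination itself -->
  <w:hlink w:dest="http://www.codeproject.com" w:target="_blank">
    <w:r><w:t>CodeProject</w:t></w:r>
  </w:hlink>
  <!-- internal link: no w:dest, so href becomes "#Summary" -->
  <w:hlink w:bookmark="Summary">
    <w:r><w:t>Summary</w:t></w:r>
  </w:hlink>
</w:p>
```

In the internal case the dest variable is an empty string, so the concat call yields just the fragment identifier.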
Tables
Table formatting has been kept simple in this stylesheet. The template contains two table styles, TableBorder0 and TableBorder1, which are interpreted in the XSL instructions to apply either "0" or "1" for the output table border attribute.
<!-- match the outer table <w:tbl> tag -->
<xsl:template match="w:tbl">
<table>
<xsl:attribute name="border">
<!-- if it's a TableBorder0 style, set the border to 0 -->
<!-- otherwise, set the border to 1 -->
<xsl:choose>
<xsl:when test="w:tblPr/w:tblStyle/@w:val = 'TableBorder0'">0</xsl:when>
<xsl:otherwise>1</xsl:otherwise>
</xsl:choose>
</xsl:attribute>
<!-- apply templates for inner table content (i.e. <tr> tags) -->
<xsl:apply-templates />
</table>
</xsl:template>
We supply the following template to match the <w:tr> table row tags:
<!-- match the table row <w:tr> tag -->
<xsl:template match="w:tr">
<tr valign="top">
<xsl:apply-templates />
</tr>
</xsl:template>
Finally, we process individual cells within a row by matching the <w:tc> tag with a template. The formatting supported here includes background color and alignment.
<!-- match the table cell <w:tc> tag -->
<xsl:template match="w:tc">
<td>
<!-- does this table cell have a background color? -->
<xsl:choose>
<xsl:when test="w:tcPr/w:shd/@w:fill">
<!-- if so, apply it to the <td> tag as an attribute -->
<xsl:attribute name="bgColor">
<xsl:value-of select="concat('#', w:tcPr/w:shd/@w:fill)" />
</xsl:attribute>
</xsl:when>
</xsl:choose>
<!-- adjust the vertical align? -->
<xsl:choose>
<xsl:when test="w:tcPr/w:vAlign/@w:val">
<xsl:attribute name="valign">
<xsl:value-of select="w:tcPr/w:vAlign/@w:val" />
</xsl:attribute>
</xsl:when>
</xsl:choose>
<!-- apply templates for inner content -->
<xsl:apply-templates />
</td>
</xsl:template>
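To tie the three table templates together, here is a hand-written, simplified table with one row and one shaded cell. The fill color and cell text are invented, and real Word output includes more table and cell properties than shown.

```xml
<!-- Simplified, hand-written WordprocessingML table; not verbatim Word output -->
<w:tbl xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:tblPr>
    <!-- the table style name selects border="0" -->
    <w:tblStyle w:val="TableBorder0" />
  </w:tblPr>
  <w:tr>
    <w:tc>
      <w:tcPr>
        <!-- becomes bgColor="#FBEDBB" on the <td> -->
        <w:shd w:fill="FBEDBB" />
        <!-- becomes valign="top" on the <td> -->
        <w:vAlign w:val="top" />
      </w:tcPr>
      <!-- cell content is an ordinary paragraph, handled by the w:p template -->
      <w:p>
        <w:r><w:t>Cell text</w:t></w:r>
      </w:p>
    </w:tc>
  </w:tr>
</w:tbl>
```

The property elements <w:tblPr> and <w:tcPr> have no matching templates and contain no text, so they contribute nothing to the output beyond the attributes read from them explicitly.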
Additional Considerations
Word 2003 does a good job of representing a document with full fidelity in WordprocessingML – too good a job, in fact. Proofing errors for example may render as tags whether or not the options to display such errors are enabled. This can impact the XSL transformation.
Proofing Errors in Lists
When spelling or grammar errors exist at the beginning of a list item, a <w:proofErr> tag may result as a sibling tag to the list item. In our transformation, we assume that contiguous list items are siblings, and use the following-sibling and preceding-sibling XPath axes to render them. The appearance of the <w:proofErr> tag effectively interrupts the list, causing a new list to begin with the next list item. To avoid this problem, right-click those spelling and grammar errors in list items and choose to either fix or ignore them prior to transformation.
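An alternative worth trying, offered here as an untested sketch rather than part of the article's stylesheet, is to make the sibling test skip over <w:proofErr> elements. The variation below filters them out before taking the nearest sibling; it assumes the proofing tags appear only as direct siblings of the list paragraphs, as described above.

```xml
<?xml version="1.0"?>
<!-- Untested variation: ignore w:proofErr siblings when walking a bullet list -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">

  <xsl:template match="w:p" mode="insideBulletList">
    <li><xsl:apply-templates /></li>
    <!-- drop proofErr elements first, then take the nearest remaining sibling -->
    <xsl:apply-templates
        select="following-sibling::*[not(self::w:proofErr)][1][self::w:p/w:pPr/w:pStyle[@w:val='BulletList']]"
        mode="insideBulletList" />
  </xsl:template>

</xsl:stylesheet>
```

For the list to stay intact, the matching preceding-sibling test in the main <w:p> template would need the same not(self::w:proofErr) filter placed before its [1] predicate.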
Hyperlinks as Field Codes
The existence of proofing errors may cause hyperlinks to render differently as well. When there are spelling or grammar errors in the text of a link, I have seen hyperlinks represented in WordprocessingML as combinations of <w:fldChar> and <w:instrText>HYPERLINK …</w:instrText> tags, rather than as <w:hlink> tags. As with list items, to avoid this problem right-click those spelling/grammar errors and fix or ignore them prior to transformation.
Summary
This article presents an XSL stylesheet for transforming a Word 2003 document into a simple HTML syntax, at the same time offering a Word 2003 template for CodeProject article authors. By making heavy use of Word styles, and by matching specific WordprocessingML tags, common HTML may be rendered without the verbosity typical of Word's Save as HTML command. Certain transformation issues are resolved through resourceful XSL application. For example, the problem of dealing with multiple-paragraph blocks is resolved by using the mode attribute of the <xsl:apply-templates> instruction, and a recursive template is applied for proper hierarchical nesting of character formatting. With room for further development, I hope this template serves as a useful tool and XSL example for the CodeProject community.