Word 2003 CodeProject Article Template

24 May 2004
A Word 2003 template for CodeProject articles, and an examination of XSL transformation from Word to concise HTML

Image 1

Introduction

With an XML-based file format, WordprocessingML, Word 2003 provides new opportunities for using XSL transformation to convert data and documents to and from Word. This article presents a utility template for writing CodeProject articles in Word 2003, with an XSL stylesheet for converting the native document to a concise HTML syntax representative of the CodeProject submission template. This article is not intended to serve as an introduction to XSL transformation, nor necessarily as a primer on WordprocessingML. Rather, this article offers XSL examples for transforming a Word document with single- and multi-line paragraph styles, character formatting, images, hyperlinks, and tables.

Background

I like using Word for writing articles. There are numerous features – outlining, revision tracking, and proofing tools to name a few – to assist the writer. Historically though, as a rich-text HTML editor Word has had its problems. Its functions over the years to save a document as HTML have produced notoriously complex and verbose syntax. For its part, Word 2003 offers both a full-fidelity HTML save format, and a "filtered HTML" format. The former produces as garrulous a syntax as previous versions; the latter, though cleaner, still handles too many formats (such as a simple list item) using a <span> tag rather than the suitable HTML (<li>). Though I prefer an editor that generates a more standard HTML, I still wish to benefit from all of Word's features. Writing CodeProject articles, based on the CodeProject submission template[^], is an excellent case where I want Word's power but simple HTML output, using standard heading <h2>, paragraph <p>, and list item <li> tags among others.

Word 2003 opens the door to this possibility by offering WordprocessingML as an XML-based save format. Originally called WordML, WordprocessingML provides a complete grammar for representing a Word document as XML. With it and an appropriate XSL stylesheet, document transformation to a simpler HTML format is attainable. The template and companion XSL stylesheet described in this article serve as a utility to convert a Word 2003 document into a simpler HTML syntax for CodeProject articles.

For the reader not familiar with XML or XSL transformation, try the W3Schools tutorials on XML [^] and XSL [^]. For an introduction to and reference for WordprocessingML, try the following from Microsoft:

Using the Template

The template includes a custom toolbar, styles in the Bob-loves-orange CodeProject colors, and some VBA code. Because of the code, security issues must be considered when using the template.

Setting Up

Copy the template CodeProject Article.dot to your local templates directory. To find this location, open Word's Tools menu, choose Options, and look under "User Templates" on the File Locations tab.

Image 2

A typical location for the templates folder is "driveLetter:\Documents and Settings\user\Application Data\Microsoft\Templates".

Security Issues

Depending on your security settings, you may receive a warning (or the code may be disabled entirely) when attempting to use the template. To view your security settings in Word, open the Tools menu and choose Macro, then Security.

Image 3

The template is not signed, so its code may be disabled if the security level is set higher than Medium. To use the template, do one of the following:

  • On the Trusted Publishers tab of the Security window, check the box labeled Trust all installed add-ins and templates. This allows use of the template provided it has been copied to the User Templates file location.
  • Set the Security Level to Medium and when opening the template, choose to enable macros.
  • Sign the template with your own security certificate, potentially including that certificate among the Trusted Publishers list. Refer to Word 2003 documentation for more information on code signing.

The First Time – Setting Options

To create a new document using the template, open the File menu and choose New… In the New Document task pane, under Templates click On my computer…, then select the CodeProject Article icon. Upon first use, the Options dialog displays:

Image 4

In the XSL Transform Stylesheet box, enter the full path of the companion XSL stylesheet, or click Browse to locate the file. This path must be set for the XSL transformation to function correctly. Check the box Open the .html file after XSL transform at your discretion.

These options are stored as custom properties in the template itself, so there are no additional registry settings or external files used.

Toolbar Functions

The XSL transformation employed here is largely based on the use of paragraph, character, and table styles. Specific style names are easy to match in the XSL stylesheet, and the template encourages the use of these styles through the functions on its custom toolbar.

| Function | Toolbar Button | Description |
| --- | --- | --- |
| Heading 2 | Image 5 | Apply the Heading2 style to the selected paragraph. Heading2 renders as an <h2> tag upon transformation. |
| Heading 3 | Image 6 | Apply the Heading3 style to the selected paragraph. Heading3 renders as an <h3> tag upon transformation. |
| Code Block | Image 7 | Apply the pre style to the selected paragraph(s). When transformed, blocks using the pre style are rendered within <pre>…</pre> tags. |
| Normal | Image 8 | Apply the Normal style to the selected paragraph(s). Normal paragraphs render as <p> tags. |
| BulletList | Image 9 | Apply the BulletList style to the selected paragraph(s). This style name is interpreted upon transformation as a <ul> block of <li> items. |
| NumberList | Image 10 | Apply the NumberList style to the selected paragraph(s). This style name is interpreted upon transformation as an <ol> block of <li> items. |
| Bold, Italic, Underline | Image 11 | Standard bold, italic, and underline character formatting, transformed to <b>, <i>, and <u> tags. |
| Code formatting | Image 12 | Character formatting for variables or class names; this style name transforms to a <code> tag. |
| Table style – Border0 | Image 13 | Apply the TableBorder0 table style to the selected table. Upon transformation, this renders a border="0" attribute in the <table> tag. |
| Table style – Border1 | Image 14 | Apply the TableBorder1 table style to the selected table. Upon transformation, this renders a border="1" attribute in the <table> tag. |
| Insert Hyperlink | Image 15 | Standard Word 2003 command for inserting hyperlinks, with the utility of including a new window [^] link. Destinations may be external to the document, or internal bookmarks. (Note: proofing errors within hyperlinks may interfere with the rendering of hyperlinks to XML; see Additional Considerations for more information.) |
| Insert Download | Image 16 | Custom command for inserting a download file hyperlink, such as those that appear above an article. In addition to constructing the link, the DownloadList paragraph style is applied, which when transformed renders a <ul class='download'> tag. |
| Insert Linked Picture | Image 17 | Opens the standard Word Insert Picture dialog, then ensures that the inserted picture is linked rather than embedded. Upon transformation, a linked picture is rendered as an <img> tag with a src attribute set to the path of the picture relative to the document. If the picture is in the same folder as the document, src is set to the file name only; if in a sibling folder of the document, src is set to folder\picFileName.xxx. |
| Apply XSL Transformation | Image 18 | Saves the current document in its original format (typically .doc), then saves again using XSL transformation, generating a file with the same name as the original but with an .html extension. Once transformed, the document is reset so additional saves retain the original format. |
| Options | Image 19 | Opens the Options dialog, allowing the path to the XSL stylesheet to be set. These options are stored directly in the template as custom properties. |

The XSL Stylesheet

The file CPArticleTransform.xsl provides the XSL stylesheet used for this transformation. This file can be saved anywhere on the drive with the template; as mentioned, the template's Options dialog provides a box to enter the full stylesheet path.

Namespaces and Outer Templates

WordprocessingML incorporates a number of namespaces, which we will include as attributes in the root <xsl:stylesheet> tag.

XML
<xsl:stylesheet version="1.0"
         xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
         xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml" 
         xmlns:v="urn:schemas-microsoft-com:vml" 
         xmlns:w10="urn:schemas-microsoft-com:office:word" 
         xmlns:sl="http://schemas.microsoft.com/schemaLibrary/2003/core" 
         xmlns:aml="http://schemas.microsoft.com/aml/2001/core" 
         xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint" 
         xmlns:o="urn:schemas-microsoft-com:office:office" 
         xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" 
         >

Among this listing, the following prefixes are particularly important in our transformation:

  • xsl – serves as an alias for the namespace defining XSL transformation; stylesheet commands will be prefixed with xsl.
  • w – alias for the WordprocessingML namespace; when matching most nodes specific to the Word document, we'll prefix using w. For example, to match a Word paragraph tag, we'll look for <w:p>.
  • v – alias for the VML namespace, used by Word to represent images.
  • wx – alias for the Word 2003 auxiliary namespace; section and sub-section tags will be prefixed with wx.
  • aml – alias for the Annotation Markup Language namespace; bookmarks are represented as <aml:annotation> tags.

The root node of a Word document, represented through WordprocessingML, is the <w:wordDocument> element. Our template for matching this root node of the document is as follows:

XML
<!-- =============================================================
     Match the root node
     ============================================================= -->
    
    <xsl:template match="/w:wordDocument">
        <html>
            <head>
                <title>The Code Project</title>
                <style>
                    BODY, P, TD { font-family: Verdana, Arial, Helvetica, 
                                               sans-serif; 
                                  font-size: 10pt }
                    H2,H3,H4,H5 { color: #ff9900; font-weight: bold; }
                    H2 { font-size: 13pt; }
                    H3 { font-size: 12pt; }
                    H4 { font-size: 10pt; color: black; }
                    PRE { BACKGROUND-COLOR: #FBEDBB; 
                          FONT-FAMILY: "Courier New", Courier, mono; 
                          WHITE-SPACE: pre; }
                    CODE { COLOR: #990000; 
                           FONT-FAMILY: "Courier New", Courier, mono; }
                </style>
                <link rel="stylesheet" type="text/css" 
                  href="http://www.codeproject.com/styles/global.css" />
            </head>
            <body>
                <!-- skip to the w:body tag -->
                <xsl:apply-templates select="w:body" />
            </body>
        </html>
    </xsl:template>    

With this template, we set up the article HTML and issue the <xsl:apply-templates> instruction to render the document body.

In WordprocessingML, a <w:body> tag serves as a container for section and sub-section nodes, represented as <wx:sect> and <wx:sub-section>. These in turn serve as containers for paragraphs, represented by the <w:p> tag. It is at the paragraph level that the heart of our processing begins, so for <w:body>, <wx:sect>, and <wx:sub-section> matches, we simply issue the <xsl:apply-templates> instruction to dive further down into the element hierarchy.

XML
<!-- =============================================================
     Match nodes that would encapsulate a paragraph <w:p> node
     ============================================================= -->

    <!-- match the w:body node -->
    <xsl:template match="w:body">
        <xsl:apply-templates select="*" />    
    </xsl:template>    

                
    <!-- match the wx:sect node -->
    <xsl:template match="wx:sect">
        <xsl:apply-templates select="*" />    
    </xsl:template>    


    <!-- match the wx:sub-section node -->
    <xsl:template match="wx:sub-section">
        <xsl:apply-templates select="*" />    
    </xsl:template>    
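
For orientation, the nesting these pass-through templates walk can be pictured with a stripped-down document like the one below. This is a hand-written sketch, not actual Word output: a real WordprocessingML file carries many more elements (document properties, fonts, style definitions) and more namespace declarations, and the paragraph text here is invented for illustration.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Illustrative skeleton only; real Word 2003 output is far larger -->
<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <w:body>
    <wx:sect>
      <wx:sub-section>
        <!-- a heading paragraph, identified by its paragraph style -->
        <w:p>
          <w:pPr><w:pStyle w:val="Heading2"/></w:pPr>
          <w:r><w:t>Introduction</w:t></w:r>
        </w:p>
        <!-- a paragraph with no explicit style -->
        <w:p>
          <w:r><w:t>Body text under the heading.</w:t></w:r>
        </w:p>
      </wx:sub-section>
    </wx:sect>
  </w:body>
</w:wordDocument>
```

Each of the three templates above simply hands its children on, so processing reaches the <w:p> elements regardless of how deeply the sections are nested.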

Single-line Paragraph Formatting

Once inside the body of the document, we use a template matching the <w:p> tag, which represents an individual paragraph. As the template is based on the use of styles in Word, locating heading paragraphs is a straightforward matter. Among other children, a paragraph may contain a <w:pPr> tag, which stands for "paragraph properties". The <w:pPr> tag in turn contains a <w:pStyle> tag if a paragraph style is in use. The name of the style is found in the w:val attribute. Therefore, to match a paragraph with the Heading2 style, we can use the following XPath syntax:

XML
w:pPr/w:pStyle[@w:val='Heading2']

The <w:p> template looks for several different heading paragraph styles from within an <xsl:choose> tag. The <xsl:otherwise> condition applies a simple <p> tag in the output.

XML
<xsl:template match="w:p">
    <!-- seek paragraph formatting and apply tags accordingly -->
    <xsl:choose>

   <!-- ==========================================================
        single-paragraph items
        paragraph formatting that is fairly simple to handle;
        these are typically heading formats that would fit in a
        single line
        ========================================================== -->

        <xsl:when test="w:pPr/w:pStyle[@w:val='Heading2']">
            <h2><xsl:apply-templates select="*" /></h2>
        </xsl:when>

        <xsl:when test="w:pPr/w:pStyle[@w:val='Heading3']">
            <h3><xsl:apply-templates select="*" /></h3>
        </xsl:when>

        <xsl:when test="w:pPr/w:pStyle[@w:val='Heading4']">
            <h4><xsl:apply-templates select="*" /></h4>
        </xsl:when>

        <xsl:when test="w:pPr/w:pStyle[@w:val='Heading5']">
            <h5><xsl:apply-templates select="*" /></h5>
        </xsl:when>

        . . .

   <!-- ==========================================================
        treat everything else as a regular paragraph
        (e.g. the Normal style)
        ========================================================== -->

        <xsl:otherwise>
            <p>
                <!-- apply horizontal align? -->
                <xsl:choose>
                    <xsl:when test="w:pPr/w:jc/@w:val">
                        <xsl:attribute name="align">
                            <xsl:value-of select="w:pPr/w:jc/@w:val" />
                        </xsl:attribute>
                    </xsl:when>
                </xsl:choose>

                <!-- apply templates for content -->
                <xsl:apply-templates select="*" />
            </p>
        </xsl:otherwise>


    </xsl:choose>

</xsl:template>
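
To make the mapping concrete, here is a small hand-written input fragment with the result each paragraph would be expected to produce under the <w:p> template above. The text content is invented, and the expected results in the comments ignore any namespace declarations an XSLT processor may copy onto output elements.

```xml
<!-- Illustrative input only; not taken from a real Word document -->
<w:body xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">

  <!-- Heading3 style: expected result is <h3>Setting Up</h3> -->
  <w:p>
    <w:pPr><w:pStyle w:val="Heading3"/></w:pPr>
    <w:r><w:t>Setting Up</w:t></w:r>
  </w:p>

  <!-- no recognized style, but centered:
       expected result is <p align="center">Centered text</p> -->
  <w:p>
    <w:pPr><w:jc w:val="center"/></w:pPr>
    <w:r><w:t>Centered text</w:t></w:r>
  </w:p>

</w:body>
```

The second paragraph falls through to <xsl:otherwise> because it has no <w:pStyle>, and its <w:jc> value is copied verbatim into the align attribute.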

Multi-line Paragraph Formatting

A more complex situation arises when using lists or <pre> sections. In these cases, each line (ended with a carriage return) is considered a new paragraph to Word, and would have its own paragraph style information. Though we can still identify each by its style name (e.g. "BulletList", or "pre") we need to treat the multiple lines as a single group – surrounded with say a <ul> or <pre> container.

For these cases, we will still test for the style name as we did before. Once found, we'll test the preceding paragraph to see if it matches the same style. If it doesn't, we can assume we are beginning the multi-paragraph block. In the case of a BulletList for example, we will then apply a transform like the following:

XML
<ul>
  <xsl:apply-templates select="." mode="insideBulletList"/>
</ul>

The mode attribute here is the key to making this work. We will continue applying templates, thus continuing to match <w:p> tags. However, by specifying a mode we can change the operational <w:p> template to one specifically designed for, say, a bullet list. Recall that our original <w:p> template was defined without a mode:

XML
<xsl:template match="w:p">
    . . .
</xsl:template>

We'll define another template to match <w:p> tags, but include the mode attribute to handle paragraph processing differently inside a BulletList.

XML
    <!-- match paragraph nodes that are part of a Bullet list -->
    <xsl:template match="w:p" mode="insideBulletList">
        <!-- output this bullet item paragraph -->
        <li><xsl:apply-templates /></li>

        <!--go to next one-->
        <xsl:apply-templates 
select="following-sibling::*[1][self::w:p/w:pPr/w:pStyle[@w:val='BulletList']]"
             mode="insideBulletList" />

    </xsl:template>

A paragraph match here outputs the list item <li> tag, then applies the same template for any siblings that follow, provided they share the paragraph style name "BulletList". So back in the original <w:p> template, as an <xsl:when> condition in the original <xsl:choose> instruction, the following handles BulletList formatting:

XML
. . .

<!-- ==========================================================
     multi-paragraph items 
     paragraph formatting that is more complicated to handle;
     these are typically paragraph formats that will span multiple
     lines, such as a list of items or the <pre> format 
     ========================================================== -->
            
     <!-- match the BulletList style -->         
     <xsl:when test="w:pPr/w:pStyle[@w:val='BulletList']">
         <xsl:choose>
             <!-- if the preceding paragraph was also a BulletList style,
                  then it has already been handled through the
                  'insideBulletList' mode; ignore it here -->
             <xsl:when 
test="preceding-sibling::*[1][self::w:p/w:pPr/w:pStyle[@w:val='BulletList']]"/>
             
           <!-- otherwise, start a UL tag and apply templates with the 
                         'insideBulletList' mode -->
             <xsl:otherwise>
                <ul>
                  <xsl:apply-templates select="." mode="insideBulletList"/>
                </ul>
             </xsl:otherwise>
         </xsl:choose>
     </xsl:when>
  
   . . .

This block reflects the pattern also used for NumberList, DownloadList, and pre paragraph styles.
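
As a sketch of how that pattern might carry over to NumberList, the two pieces could look like the following. The mode name insideNumberList is an assumption made for this example; the article's stylesheet may use a different name.

```xml
<!-- Sketch only: the mode name 'insideNumberList' is hypothetical -->

<!-- 1. A branch inside the <xsl:choose> of the main w:p template -->
<xsl:when test="w:pPr/w:pStyle[@w:val='NumberList']">
  <xsl:choose>
    <!-- previous sibling is also a NumberList paragraph: this item was
         already emitted by the chain started at the first item -->
    <xsl:when
        test="preceding-sibling::*[1][self::w:p/w:pPr/w:pStyle[@w:val='NumberList']]" />
    <!-- first item of the block: open the <ol> and start the chain -->
    <xsl:otherwise>
      <ol>
        <xsl:apply-templates select="." mode="insideNumberList" />
      </ol>
    </xsl:otherwise>
  </xsl:choose>
</xsl:when>

<!-- 2. A top-level template that emits one item, then moves on to the
        next sibling only if it is also a NumberList paragraph -->
<xsl:template match="w:p" mode="insideNumberList">
  <li><xsl:apply-templates /></li>
  <xsl:apply-templates
      select="following-sibling::*[1][self::w:p/w:pPr/w:pStyle[@w:val='NumberList']]"
      mode="insideNumberList" />
</xsl:template>
```

Only the style name, the mode name, and the container tag differ from the BulletList version. A pre block would presumably emit the paragraph text and a line break in place of the <li> wrapper.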

Runs and Character Formatting

In WordprocessingML, the tag <w:r> identifies a run of content. These tags are children of <w:p> tags and represent containers of content with consistent character formatting. Text, linked images, and line breaks are all examples of content nested inside a <w:r> tag. The <w:r> tag may also contain a <w:rPr> tag to enclose the properties (including character formatting) of the run. As multiple character formats may be applied to a run, we must adhere to proper hierarchical nesting of formatting tags in the output. To accomplish this, we will call a recursive template when matching a <w:r> tag, and pass as a parameter the first of the child formatting tags within the <w:rPr> run property parent.

XML
<!-- =============================================================
     match run nodes <w:r> within a paragraph;
     this is the common container for content such as text or 
     pictures and where we'll deduce character formatting
     ============================================================= -->
    <!-- match run nodes within a paragraph-->
    <xsl:template match="w:r">

    <!-- =======================================================
         Character formatting at this level is identified with 
         run property nodes <w:rPr>; for example, a run with bold
         formatting will have a <w:rPr> tag with a <w:b> tag for
         a child.  As multiple formatting tags are possible, we
         need a way to account for them all while maintaining the
         necessary XML hierarchical structure.  We'll accomplish
         this with a recursive template that loops through
         each property <w:rPr> node, surrounding the content nodes
         with proper formatting tags.  
         
         The recursive template is "recurseRunProps"; we initiate
         the first call to it here.
         ======================================================= -->             
        <!-- iterate through all w:rPr child tags to apply formatting;
             at the end of the recursion, the run text is applied -->
        <xsl:call-template name="recurseRunProps">
            <xsl:with-param name="nodeCount" select="1" />
            <xsl:with-param name="propNodes" select="w:rPr/*" />
            
            <!-- run content will be any child node that isn't a <w:rPr> 
                 node; this will include text <w:t>, picture <w:pict>,
                 and line breaks <w:br> -->
            <xsl:with-param name="runContent" select="*[not(self::w:rPr)]" />   
                         
        </xsl:call-template>           
             
    </xsl:template>

The recursive template recurseRunProps checks to see if it has been passed a valid node, and if so tries to match supported character formatting. If a supported formatting tag is caught, an <xsl:call-template> instruction is issued to execute recurseRunProps again with the next formatting child, nested within the appropriate output formatting tags. If the passed node is not a supported formatting tag, recurseRunProps is still called with the next formatting child, if any. When the recursion has ended, an <xsl:apply-templates> instruction is performed to process the inner run content.

The following shows the pattern in recurseRunProps for matching <w:b> bold formatting tags. Italic, underline, and <code> character formats follow the same pattern.

XML
<!-- =============================================================
     This is a recursive template that is called when a run tag
     <w:r> is matched
     
        parameters:
             nodeCount  - the index, within propNodes, of the formatting
                          node that should be processed
             propNodes  - the complete list of <w:rPr> child nodes to
                          recursively process
             runContent - the content (i.e. all nodes other than
                          <w:rPr>) around which formatting tags
                          are to be applied            
     ============================================================= -->
    <xsl:template name="recurseRunProps">
        <xsl:param name="nodeCount" />
        <xsl:param name="propNodes" />
        <xsl:param name="runContent" />
        
        <!-- select the formatting node (a child of <w:rPr>) to process,
             based on the index nodeCount -->
        <xsl:variable name="curNode" select="$propNodes[$nodeCount]" />
        
        <!-- is this a valid node to process? -->
        <xsl:choose>
          <xsl:when test="$curNode">
            
            <!-- this is a valid node; process it, and
                 recursively call processing for the next node;-->
            <xsl:choose>
            
                <!-- process Bold tags -->
                <xsl:when test="name($curNode)='w:b' ">
                  <b>
                  <xsl:call-template name="recurseRunProps">
                    <xsl:with-param name="propNodes" select="$propNodes" />
                    <xsl:with-param name="nodeCount" select="$nodeCount+1" />
                    <xsl:with-param name="runContent" select="$runContent" />
                  </xsl:call-template>
                  </b>    
                </xsl:when>

                . . .                

                <!-- we don't recognize this run formatting tag;
                     ignore it and go to the next -->
                <xsl:otherwise>
                  <xsl:call-template name="recurseRunProps">
                    <xsl:with-param name="propNodes" select="$propNodes" />
                    <xsl:with-param name="nodeCount" select="$nodeCount+1" />
                    <xsl:with-param name="runContent" select="$runContent" />
                  </xsl:call-template>    
                </xsl:otherwise>
            </xsl:choose>
            
          </xsl:when>
          
          <!-- If this isn't a valid node, then we're out of nodes
               to process; output the run content at this point 
               and end the recursion.  The run content is handled
               by applying templates that match run content nodes
               (such as <w:t> or <w:pict>) -->

          <xsl:otherwise>
            <xsl:apply-templates select="$runContent" />
          </xsl:otherwise>            
        
        </xsl:choose>        
        
    </xsl:template>
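
As an illustration of how the other character formats might follow the bold pattern, an italic branch could look like the sketch below. It is modeled directly on the bold branch and is an assumption about the stylesheet, not a quotation from it. It would sit inside the inner <xsl:choose> of recurseRunProps, next to the bold test.

```xml
<!-- Hypothetical italic branch, mirroring the bold branch above;
     place it inside the inner <xsl:choose> of recurseRunProps -->
<xsl:when test="name($curNode)='w:i'">
  <i>
    <!-- recurse to the next formatting node, nested inside <i> -->
    <xsl:call-template name="recurseRunProps">
      <xsl:with-param name="propNodes" select="$propNodes" />
      <xsl:with-param name="nodeCount" select="$nodeCount+1" />
      <xsl:with-param name="runContent" select="$runContent" />
    </xsl:call-template>
  </i>
</xsl:when>
```

With both branches in place, a run such as <w:r><w:rPr><w:b/><w:i/></w:rPr><w:t>important</w:t></w:r> would be expected to produce <b><i>important</i></b>: each recursion level opens one tag, and the text is written only once the property list is exhausted. Note that name() compares against the prefix as written in the source document. A test such as $curNode[self::w:i] compares by namespace instead and may be more robust if the prefix ever differs.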

Run Content: Text, Line Breaks, and Images

Following the recursive application of character formatting, we process the content of a run. Each type of content is supported through its own template matching a WordprocessingML tag. Regular text, represented by a <w:t> tag, is rendered with an <xsl:value-of> instruction.

XML
<!-- match text nodes within a run-->
<xsl:template match="w:t">
    <!-- simple - just output the text content -->
    <xsl:value-of select="." />
</xsl:template>

Line breaks (created in Word by pressing [Shift]+[Enter]) are also simple to address with a template matching the <w:br> tag:

XML
<!-- match br tags in a run -->
<xsl:template match="w:br">
    <!-- simple line break within a paragraph; this is entered in word with
         [Shift]+[Enter] -->
    <br />
</xsl:template>

Images are a little more complicated. Our output should be an <img> tag with a src attribute pointing to a file relative to the html document itself. To support this, we must insert linked pictures in the Word document rather than embedded pictures. Linked pictures are identified with <w:pict> tags in WordprocessingML. We can pull the linked file source name from the src attribute of the w:pict/v:shape/v:imagedata child tag. Pictures in WordprocessingML are described with VML syntax, hence the v: prefix. In VML, image dimensions are represented through a CSS style attribute. We use that to add a style attribute to the output <img> tag.

XML
<!-- match linked picture nodes within a run -->
<xsl:template match="w:pict">
    <!-- linked pictures can be handled this way; embedded pics cannot -->
    <!-- output as an <img> tag -->
    <img>
        <!-- output the src attribute; this seems to be a file name
             relative to the word document -->

        <xsl:attribute name="src">
            <xsl:value-of select="v:shape/v:imagedata/@src" />
        </xsl:attribute>

        <!-- word is using VML to store image information;
             for width and height this is in css units;
             capture the css style property and apply to
             the html <img> tag -->

        <xsl:if test="v:shape/@style">
            <xsl:attribute name="style">
                <xsl:value-of select="v:shape/@style" />
            </xsl:attribute>
        </xsl:if>

    </img>

</xsl:template>
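
For reference, a linked picture of the kind this template handles might appear in the source roughly as follows. The fragment is hand-written for illustration: the id, dimensions, and file name are invented, and real Word output typically carries additional attributes.

```xml
<!-- Illustrative input only; attribute values are invented -->
<w:r xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
     xmlns:v="urn:schemas-microsoft-com:vml">
  <w:pict>
    <v:shape id="pic1" style="width:240pt;height:180pt">
      <v:imagedata src="images\toolbar.gif"/>
    </v:shape>
  </w:pict>
</w:r>
<!-- Expected result from the w:pict template:
     <img src="images\toolbar.gif" style="width:240pt;height:180pt"> -->
```

Because the src value is copied as-is, a backslash-separated path from Word ends up in the HTML unchanged.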

Hyperlinks

In WordprocessingML, a hyperlink is represented with a <w:hlink> tag. If present, a w:dest attribute indicates an external destination. Without it, a destination internal to the document is assumed. The w:bookmark attribute then contains the name of a destination bookmark.

Document bookmarks are represented by Word as a pair of <aml:annotation> tags, one with a w:type attribute of "Word.Bookmark.Start", the other with a w:type value of "Word.Bookmark.End". The .Start bookmark tag also has a w:name attribute representing the bookmark name. It is this value that will match the w:bookmark value in the <w:hlink> tag.

Whether the destination is external or internal, the <w:hlink> tag will nest its display text as inner content.

XML
<!-- match hlink nodes within a paragraph -->
<xsl:template match="w:hlink">
    <!-- get the destination, if any -->
    <xsl:variable name="dest">
        <xsl:value-of select="@w:dest" />
    </xsl:variable>

    <!-- set up the anchor tag -->
    <a>
      <!-- add the href attribute -->
      <xsl:attribute name="href">
        <xsl:choose>
          <!-- if the w:bookmark attribute is present, use it -->
          <xsl:when test="@w:bookmark">
            <xsl:value-of select="concat($dest, '#', @w:bookmark)" />
          </xsl:when>
          <!-- if not, just use the destination value -->
          <xsl:otherwise>
            <xsl:value-of select="$dest" />
          </xsl:otherwise>
        </xsl:choose>
      </xsl:attribute>

      <!-- if there is a w:target attribute, use it too -->
      <xsl:if test="@w:target">
        <xsl:attribute name="target">
            <xsl:value-of select="@w:target" />
        </xsl:attribute>
      </xsl:if>

      <!-- add inner content, probably a <w:t> tag -->
      <xsl:apply-templates />
    </a>
</xsl:template>

<!-- match the tag that indicates the starting position
     for a bookmark -->
<xsl:template match="aml:annotation">
    <xsl:if test="@w:type='Word.Bookmark.Start'">
        <!-- use the w:name attribute to identify the bookmark name -->
        <!-- place an anchor link here with that name -->
        <a>
            <xsl:attribute name="name">
                <xsl:value-of select="@w:name" />
            </xsl:attribute>
        </a>
    </xsl:if>
</xsl:template>
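
The fragment below sketches the three cases these templates distinguish: an external link, an internal link, and the bookmark an internal link points to. It is hand-written for illustration. The URL, bookmark name, and aml:id values are invented, and the expected results in the comments ignore any namespace declarations copied to the output.

```xml
<!-- Illustrative input only; not taken from a real Word document -->
<w:p xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
     xmlns:aml="http://schemas.microsoft.com/aml/2001/core">

  <!-- external destination, opened in a new window; expected result:
       <a href="http://www.codeproject.com/" target="_blank">CodeProject</a> -->
  <w:hlink w:dest="http://www.codeproject.com/" w:target="_blank"><w:r><w:t>CodeProject</w:t></w:r></w:hlink>

  <!-- internal destination (no w:dest); expected result:
       <a href="#Summary">the summary</a> -->
  <w:hlink w:bookmark="Summary"><w:r><w:t>the summary</w:t></w:r></w:hlink>

  <!-- bookmark start and end pair; only the start tag produces output:
       <a name="Summary"></a> -->
  <aml:annotation aml:id="0" w:type="Word.Bookmark.Start" w:name="Summary"/>
  <aml:annotation aml:id="0" w:type="Word.Bookmark.End"/>

</w:p>
```

In the internal case $dest is empty, so the concat call yields just the # sign followed by the bookmark name.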

Tables

Table formatting has been kept simple in this stylesheet. The template contains two table styles, TableBorder0 and TableBorder1, which are interpreted in the XSL instructions to apply either "0" or "1" for the output table border attribute.

XML
<!-- match the outer table <w:tbl> tag -->
<xsl:template match="w:tbl">        

  <table>

    <xsl:attribute name="border">
      <!-- if it's a TableBorder0 style, set the border to 0 -->
      <!-- otherwise, set the border to 1 -->
      <xsl:choose>
          <xsl:when test="w:tblPr/w:tblStyle/@w:val = 'TableBorder0'">0</xsl:when>

          <xsl:otherwise>1</xsl:otherwise>

      </xsl:choose>                 
    </xsl:attribute>
 
    <!-- apply templates for inner table content (i.e. <tr> tags) -->
    <xsl:apply-templates />

  </table>        

</xsl:template>

We supply the following template to match the <w:tr> table row tags:

XML
<!-- match the table row <w:tr> tag -->
<xsl:template match="w:tr">
    <tr valign="top">
        <xsl:apply-templates />
    </tr>
</xsl:template>

Finally, we process individual cells within a row by matching the <w:tc> tag with a template. The formatting supported here includes background color and alignment.

XML
<!-- match the table cell <w:tc> tag -->
<xsl:template match="w:tc">
    <td>
        <!-- does this table cell have a background color? -->
        <xsl:choose>
            <xsl:when test="w:tcPr/w:shd/@w:fill">
                <!-- if so, apply it to the <td> tag as an attribute -->
                <xsl:attribute name="bgColor">
                  <xsl:value-of select="concat('#', w:tcPr/w:shd/@w:fill)" />
                </xsl:attribute>
            </xsl:when>
        </xsl:choose>
             
        <!-- adjust the vertical align? -->
        <xsl:choose>
            <xsl:when test="w:tcPr/w:vAlign/@w:val">
                <xsl:attribute name="valign">
                    <xsl:value-of select="w:tcPr/w:vAlign/@w:val" />
                </xsl:attribute>
            </xsl:when>
        </xsl:choose>
             
        <!-- apply templates for inner content -->
        <xsl:apply-templates />

    </td>        

</xsl:template>
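
To make the three table templates concrete, here is a small hand-written WordprocessingML fragment and the HTML it should roughly produce. The cell values are made up for illustration, the `<p>` in the result assumes the paragraph template described earlier in the article, and exact whitespace depends on the processor's output settings.

```xml
<!-- Illustrative input: a one-cell table using the TableBorder1 style -->
<w:tbl xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:tblPr>
    <w:tblStyle w:val="TableBorder1" />
  </w:tblPr>
  <w:tr>
    <w:tc>
      <w:tcPr>
        <!-- w:fill carries the background color without a leading # -->
        <w:shd w:val="clear" w:color="auto" w:fill="FFFF99" />
        <w:vAlign w:val="bottom" />
      </w:tcPr>
      <w:p><w:r><w:t>Cell text</w:t></w:r></w:p>
    </w:tc>
  </w:tr>
</w:tbl>

<!-- Approximate result of the table, row, and cell templates:
<table border="1">
  <tr valign="top">
    <td bgColor="#FFFF99" valign="bottom"><p>Cell text</p></td>
  </tr>
</table>
-->
```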

Additional Considerations

Word 2003 does a good job of representing a document with full fidelity in WordprocessingML – too good a job, in fact. Proofing errors, for example, may be written out as tags whether or not the options to display such errors are enabled, and those extra tags can affect the XSL transformation.

Proofing Errors in Lists

When spelling or grammar errors exist at the beginning of a list item, a <w:proofErr> tag may appear as a sibling of the list item. Our transformation assumes that contiguous list items are adjacent siblings, and uses the following-sibling and preceding-sibling XPath axes to group them. A stray <w:proofErr> tag effectively interrupts the list, causing a new list to begin with the next list item. To avoid this problem, right-click those spelling and grammar errors in list items and choose to either fix or ignore them prior to transformation.
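
As an alternative to cleaning up the document by hand, the sibling tests themselves could be written to look past proofing markers. The following is only a sketch of that idea, not the stylesheet shipped with this article: it assumes list paragraphs can be recognized by a `<w:listPr>` element inside `<w:pPr>`, uses a hypothetical `item` mode, emits a flat `<ul>`, and takes the plain text of each paragraph instead of applying the run templates.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">

  <!-- A list paragraph opens a new <ul> only when the nearest preceding
       sibling that is NOT a <w:proofErr> is not itself a list paragraph -->
  <xsl:template match="w:p[w:pPr/w:listPr]">
    <xsl:variable name="prev"
        select="preceding-sibling::*[not(self::w:proofErr)][1]" />
    <xsl:if test="not($prev/self::w:p/w:pPr/w:listPr)">
      <ul>
        <xsl:apply-templates select="." mode="item" />
      </ul>
    </xsl:if>
  </xsl:template>

  <!-- Emit one item, then continue with the next real sibling
       if that sibling is also a list paragraph -->
  <xsl:template match="w:p" mode="item">
    <li><xsl:value-of select="." /></li>
    <xsl:variable name="next"
        select="following-sibling::*[not(self::w:proofErr)][1]" />
    <xsl:apply-templates
        select="$next[self::w:p][w:pPr/w:listPr]" mode="item" />
  </xsl:template>

  <!-- Proofing markers produce no output of their own -->
  <xsl:template match="w:proofErr" />

</xsl:stylesheet>
```

The key expression is `*[not(self::w:proofErr)][1]`: the first predicate filters out the proofing markers, and the second picks the nearest remaining sibling on the axis, so a marker sitting between two list items no longer breaks the run.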

Hyperlinks as Field Codes

The existence of proofing errors may cause hyperlinks to render differently as well. When the text of a link contains spelling or grammar errors, I have seen hyperlinks represented in WordprocessingML as combinations of <w:fldChar> and <w:instrText>HYPERLINK …</w:instrText> tags rather than as <w:hlink> tags. As with list items, to avoid this problem right-click on those spelling/grammar errors and fix or ignore them prior to transformation.
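
If fixing the document is not practical, the field-code form could in principle be handled in the stylesheet as well. The sketch below is not part of the article's stylesheet and rests on several assumptions: the field is written as sibling runs in the order begin, instruction, separate, result text, end; the URL is the first double-quoted string in a single `<w:instrText>`; and fields are not nested. Switches such as `\l` for bookmarks are not handled.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">

  <!-- The run holding the field instruction becomes the <a> element -->
  <xsl:template match="w:r[w:instrText[contains(., 'HYPERLINK')]]">
    <!-- URL: the text between the first pair of double quotes -->
    <xsl:variable name="url"
        select="substring-before(substring-after(w:instrText, '&quot;'), '&quot;')" />
    <!-- Identify the run that closes this field -->
    <xsl:variable name="endId"
        select="generate-id(following-sibling::w:r[w:fldChar/@w:fldCharType = 'end'][1])" />
    <a href="{$url}">
      <!-- Link text: later text runs whose nearest following
           'end' marker is the one that closes this field -->
      <xsl:for-each select="following-sibling::w:r[w:t]
          [generate-id(following-sibling::w:r[w:fldChar/@w:fldCharType = 'end'][1]) = $endId]">
        <xsl:value-of select="w:t" />
      </xsl:for-each>
    </a>
  </xsl:template>

  <!-- Suppress field result runs so the link text is not emitted twice.
       Note: this hides the result text of every field, not only hyperlinks -->
  <xsl:template match="w:r[w:t][preceding-sibling::w:r[w:fldChar][1]
      /w:fldChar/@w:fldCharType = 'separate']" />

</xsl:stylesheet>
```

Because both match patterns carry predicates, they take precedence over a plain `w:r` template by default priority. Any `<w:proofErr>` tags between the result runs are simply skipped, since only `<w:r>` siblings are selected.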

Summary

This article presents an XSL stylesheet for transforming a Word 2003 document into a simple HTML syntax, at the same time offering a Word 2003 template for CodeProject article authors. By making heavy use of Word styles, and by matching specific WordprocessingML tags, common HTML may be rendered without the verbosity typical of Word's Save as HTML command. Certain transformation issues are resolved through resourceful XSL application. For example, the problem of dealing with multiple-paragraph blocks is resolved by using the mode attribute of the <xsl:apply-templates> instruction, and a recursive template is applied for proper hierarchical nesting of character formatting. With room for further development, I hope this template serves as a useful tool and XSL example for the CodeProject community.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.



Written By
University of Nevada, Las Vegas
United States
With a background in education, music, application development, institutional research, data governance, and business intelligence, I work for the University of Nevada, Las Vegas helping to derive useful information from institutional data. It's an old picture, but one of my favorites.

Comments and Discussions

 
General: Re: Grouping multiple paragraph styles under one parent using xsl
gandhiaryah, 4-Jan-06 22:22
General: Great work!
Erik Westermann, 28-Sep-04 18:29
General: Re: Great work!
Mike Ellison, 28-Sep-04 19:09
General: Proofing errors
Roy Cornelissen, 29-Jul-04 3:59
General: Re: Proofing errors
Mike Ellison, 29-Jul-04 8:07
General: Re: Proofing errors
Roy Cornelissen, 29-Jul-04 20:46
General: Re: Proofing errors
Mike Ellison, 2-Aug-04 6:06
Question: how about the opposite direction - HTML->WordML
Sue Work, 29-Jun-04 6:15
Great article Mike - thanks for this!!

I have sort of the opposite situation...I have raw material in XML form where that XML contains HTML-style presentation tags (<b>, </b>, <br/> etc). I am producing WordML output from this raw material.

It is a requirement to:
(a) allow the authoring to happen in XML format
(b) to tag italic, underline, bold, line breaks and lists in that raw material

I would assume that I could create a transform that does the opposite of what your article describes ? i.e. take something like

<summary>
The most <b>important</b> thing to remember is that the author wants control over format with<br/> <ul>
<li>easy intuitive tags</li>
<li>something else</li>
</ul>
</summary>

and produce WordML output of the text.

<summary>
<w:p><w:r><w:t>The most</w:t></w:r>
<w:r>
<w:rPr><w:b/></w:rPr>
<w:t>important</w:t>
</w:r>
...
</summary>

Does anyone have any good examples of that kind of transform??? I figure if not, I can take your sample and just do the reverse of everything, but I don't want to reinvent the wheel...

Sue Work
Answer: Re: how about the opposite direction - HTML->WordML
Mike Ellison, 29-Jun-04 6:33
Answer: Re: how about the opposite direction - HTML->WordML
Mike Ellison, 29-Jun-04 6:38
General: Re: how about the opposite direction - HTML->WordML
Sue Work, 29-Jun-04 6:44
General: Re: how about the opposite direction - HTML to WordML
manchu7321-Aug-07 6:56
Answer: Re: how about the opposite direction - HTML->WordML [modified]
Dmitry Dzygin, 15-May-07 2:54
General: Thanks a million
Roy Cornelissen, 11-Jun-04 2:21
General: Re: Thanks a million
Mike Ellison, 11-Jun-04 6:27
General: You rule
WillemM, 3-Jun-04 7:14
General: Re: You rule
Mike Ellison, 3-Jun-04 7:15
General: Way to handle embedded images
wjvii, 3-Jun-04 2:39
General: Clickety
Colin Angus Mackay, 3-Jun-04 2:48
General: Re: Way to handle embedded images
Mike Ellison, 3-Jun-04 7:14
General: Excellent!
Daniel Cazzulino [XML MVP], 1-Jun-04 9:37
General: Re: Excellent!
Mike Ellison, 1-Jun-04 12:12
