Introduction
With an XML-based file format, WordprocessingML, Word 2003 provides new opportunities for using XSL transformation to convert data and documents to and from Word. This article presents a utility template for writing CodeProject articles in Word 2003, with an XSL stylesheet for converting the native document to a concise HTML syntax representative of the CodeProject submission template.
This article is not intended to serve as an introduction to XSL transformation, nor necessarily as a primer on WordprocessingML. Rather, it offers XSL examples for transforming a Word document with single- and multi-line paragraph styles, character formatting, images, hyperlinks, and tables.
Background
I like using Word for writing articles. There are numerous features – outlining, revision tracking, and proofing tools, to name a few – to assist the writer. Historically, though, Word has had its problems as a rich-text HTML editor. Its functions for saving a document as HTML have, over the years, produced notoriously complex and verbose syntax. For its part, Word 2003 offers both a full-fidelity HTML save format and a "filtered HTML" format. The former produces as garrulous a syntax as previous versions; the latter, though cleaner, still handles too many formats (such as a simple list item) using a <span> tag rather than the suitable HTML (<li>).
Though I prefer an editor that generates more standard HTML, I still wish to benefit from all of Word's features. Writing CodeProject articles, based on the CodeProject submission template, is an excellent case where I want Word's power but simple HTML output, using standard heading <h2>, paragraph <p>, and list item <li> tags, among others.
Word 2003 opens the door to this possibility by offering WordprocessingML as an XML-based save format. Originally called WordML, WordprocessingML provides a complete grammar for representing a Word document as XML. With it and an appropriate XSL stylesheet, document transformation to a simpler HTML format is attainable. The template and companion XSL stylesheet described in this article serve as a utility to convert a Word 2003 document into a simpler HTML syntax for CodeProject articles.
For the reader not familiar with XML or XSL transformation, try the W3Schools tutorials on XML and XSL. For an introduction to and reference for WordprocessingML, see Microsoft's WordprocessingML documentation.
Using the Template
The template includes a custom toolbar, styles in the Bob-loves-orange CodeProject colors, and some VBA code. Because of the code, security issues must be considered when using the template.
Setting Up
Copy the template CodeProject Article.dot to your local templates directory. To find this location in Word, click Options on the Tools menu and look under "User Templates" on the File Locations tab. A typical location for the templates folder is "driveLetter:\Documents and Settings\user\Application Data\Microsoft\Templates".
Security Issues
Depending on your security settings, you may receive a warning (or the code may be disabled entirely) when attempting to use the template. To view your security settings in Word, click Macro on the Tools menu, then Security. The template is not signed, so its code may be disabled if the security level is set higher than Medium. To use the template, make sure your settings allow its code to run (for example, a security level of Medium or lower).
The First Time – Setting Options
To create a new document using the template, click New… on the File menu. In the New Document task pane, under Templates, click On my computer…, then select the CodeProject Article icon. Upon first use, the Options dialog is displayed.
In the XSL Transform Stylesheet box, enter the full path of the companion XSL stylesheet, or click Browse to locate the file. This path must be set for the XSL transformation to function correctly. Check the box Open the .html file after XSL transform at your discretion. These options are stored as custom properties in the template itself, so no additional registry settings or external files are used.
Toolbar Functions
The XSL transformation employed here is largely based on the use of paragraph, character, and table styles. Specific style names are easy to match in the XSL stylesheet, and the template encourages the use of these styles through the functions on its custom toolbar.
The XSL Stylesheet
The file CPArticleTransform.xsl provides the XSL stylesheet used for this transformation. This file can be saved anywhere on the drive that holds the template; as mentioned, the template's Options dialog provides a box to enter the full stylesheet path.
Namespaces and Outer Templates
WordprocessingML incorporates a number of namespaces, which we will include as attributes in the root <xsl:stylesheet> tag:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:w10="urn:schemas-microsoft-com:office:word"
xmlns:sl="http://schemas.microsoft.com/schemaLibrary/2003/core"
xmlns:aml="http://schemas.microsoft.com/aml/2001/core"
xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882"
>
Among this listing, the following prefixes are particularly important in our transformation:
The root node of a Word document, represented through WordprocessingML, is the <w:wordDocument> tag. The following template matches it:
<!-- =============================================================
Match the root node
============================================================= -->
<xsl:template match="/w:wordDocument">
<html>
<head>
<title>The Code Project</title>
<style>
BODY, P, TD { font-family: Verdana, Arial, Helvetica,
sans-serif;
font-size: 10pt }
H2,H3,H4,H5 { color: #ff9900; font-weight: bold; }
H2 { font-size: 13pt; }
H3 { font-size: 12pt; }
H4 { font-size: 10pt; color: black; }
PRE { BACKGROUND-COLOR: #FBEDBB;
FONT-FAMILY: "Courier New", Courier, mono;
WHITE-SPACE: pre; }
CODE { COLOR: #990000;
FONT-FAMILY: "Courier New", Courier, mono; }
</style>
<link rel="stylesheet" type="text/css"
href="http://www.codeproject.com/styles/global.css" />
</head>
<body>
<!-- skip to the w:body tag -->
<xsl:apply-templates select="w:body" />
</body>
</html>
</xsl:template>
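With this template, we set up the article HTML and issue the <xsl:apply-templates> instruction to continue processing at the <w:body> node. In WordprocessingML, a paragraph node <w:p> may be encapsulated by other nodes, such as <w:body>, <wx:sect>, and <wx:sub-section>. The following templates match these containers and pass processing on to their children:
<!-- =============================================================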
Match nodes that would encapsulate a paragraph <w:p> node
============================================================= -->
<!-- match the w:body node -->
<xsl:template match="w:body">
<xsl:apply-templates select="*" />
</xsl:template>
<!-- match the wx:sect node -->
<xsl:template match="wx:sect">
<xsl:apply-templates select="*" />
</xsl:template>
<!-- match the wx:sub-section node -->
<xsl:template match="wx:sub-section">
<xsl:apply-templates select="*" />
</xsl:template>
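To make the nesting concrete, here is a minimal hand-written WordprocessingML document of the kind these container templates walk through. It is an illustration only, not actual Word output: a real file saved by Word 2003 carries many more elements (fonts, style definitions, document properties), and the text content is invented for the example.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<!-- Hand-written sample, not real Word output -->
<w:wordDocument
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:wx="http://schemas.microsoft.com/office/word/2003/auxHint">
  <w:body>
    <wx:sect>
      <!-- a heading and the content under it grouped in a sub-section -->
      <wx:sub-section>
        <w:p>
          <!-- paragraph properties name the paragraph style -->
          <w:pPr><w:pStyle w:val="Heading2" /></w:pPr>
          <w:r><w:t>Introduction</w:t></w:r>
        </w:p>
        <w:p>
          <!-- no w:pStyle here: treated as a regular paragraph -->
          <w:r><w:t>A paragraph in the Normal style.</w:t></w:r>
        </w:p>
      </wx:sub-section>
    </wx:sect>
  </w:body>
</w:wordDocument>
```

Each <w:p> sits three levels below <w:body>, which is why the stylesheet needs pass-through templates for the container nodes before any paragraph template can match.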
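Single-line Paragraph Formatting
Once inside the body of the document, we use a template matching the paragraph tag <w:p>. The paragraph's style is named by the <w:pStyle> node within its paragraph properties <w:pPr>, so a given style can be tested with an XPath expression such as:
w:pPr/w:pStyle[@w:val='Heading2']
The template runs these tests inside an <xsl:choose> block, emitting the matching HTML heading tag for each heading style:
<xsl:template match="w:p">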
<!-- seek paragraph formatting and apply tags accordingly -->
<xsl:choose>
<!-- ==========================================================
single-paragraph items
paragraph formatting that is fairly simple to handle;
these are typically heading formats that would fit in a
single line
========================================================== -->
<xsl:when test="w:pPr/w:pStyle[@w:val='Heading2']">
<h2><xsl:apply-templates select="*" /></h2>
</xsl:when>
<xsl:when test="w:pPr/w:pStyle[@w:val='Heading3']">
<h3><xsl:apply-templates select="*" /></h3>
</xsl:when>
<xsl:when test="w:pPr/w:pStyle[@w:val='Heading4']">
<h4><xsl:apply-templates select="*" /></h4>
</xsl:when>
<xsl:when test="w:pPr/w:pStyle[@w:val='Heading5']">
<h5><xsl:apply-templates select="*" /></h5>
</xsl:when>
. . .
<!-- ==========================================================
treat everything else as a regular paragraph
(e.g. the Normal style)
========================================================== -->
<xsl:otherwise>
<p>
<!-- apply horizontal align? -->
<xsl:choose>
<xsl:when test="w:pPr/w:jc/@w:val">
<xsl:attribute name="align">
<xsl:value-of select="w:pPr/w:jc/@w:val" />
</xsl:attribute>
</xsl:when>
</xsl:choose>
<!-- apply templates for content -->
<xsl:apply-templates select="*" />
</p>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
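One detail of the fallback branch is worth a sketch: it copies the w:jc value straight into the HTML align attribute. This works for left, center, and right, but my understanding is that Word writes "both" for justified text, which HTML does not recognize. The standalone stylesheet below is a simplified variant, not the article's stylesheet: it treats every paragraph as a regular paragraph and maps the alignment value explicitly. The mapping of "both" to "justify" is an assumption to verify against your own documents.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    exclude-result-prefixes="w">

  <!-- sketch of the fallback branch only: every w:p becomes a <p> -->
  <xsl:template match="w:p">
    <p>
      <xsl:variable name="jc" select="w:pPr/w:jc/@w:val" />
      <xsl:choose>
        <!-- assumption: "both" is Word's value for justified text -->
        <xsl:when test="$jc = 'both'">
          <xsl:attribute name="align">justify</xsl:attribute>
        </xsl:when>
        <!-- values that HTML accepts unchanged -->
        <xsl:when test="$jc = 'left' or $jc = 'center' or $jc = 'right'">
          <xsl:attribute name="align">
            <xsl:value-of select="$jc" />
          </xsl:attribute>
        </xsl:when>
        <!-- anything else: emit no align attribute at all -->
      </xsl:choose>
      <!-- apply templates for content -->
      <xsl:apply-templates select="*" />
    </p>
  </xsl:template>

</xsl:stylesheet>
```

Listing the accepted values explicitly means an unexpected w:jc value yields a plain <p> rather than an invalid attribute.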
Multi-line Paragraph Formatting
A more complex situation arises when using lists or the <pre> format, where several consecutive paragraphs must share one enclosing tag. For these cases, we will still test for the style name as we did before. Once found, we'll test the preceding paragraph to see if it matches the same style. If it doesn't, we can assume we are beginning the multi-paragraph block. In the case of a BulletList, for example, we will then apply a transform like the following:
<ul>
<xsl:apply-templates select="." mode="insideBulletList"/>
</ul>
The mode attribute keeps this call from being handled by the default paragraph template shown earlier:
<xsl:template match="w:p">
. . .
</xsl:template>
We'll define another template to match <w:p> nodes in the insideBulletList mode:
<!-- match paragraph nodes that are part of a Bullet list -->
<xsl:template match="w:p" mode="insideBulletList">
<!-- output this bullet item paragraph -->
<li><xsl:apply-templates /></li>
<!--go to next one-->
<xsl:apply-templates
select="following-sibling::*[1][self::w:p/w:pPr/w:pStyle[@w:val='BulletList']]"
mode="insideBulletList" />
</xsl:template>
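A paragraph match here outputs the list item, then applies the same mode to the immediately following sibling if that sibling is also a BulletList paragraph. The corresponding test in the default paragraph template looks like this:
. . .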
<!-- ==========================================================
multi-paragraph items
paragraph formatting that is more complicated to handle;
these are typically paragraph formats that will span multiple
lines, such as a list of items or the <pre> format
========================================================== -->
<!-- match the BulletList style -->
<xsl:when test="w:pPr/w:pStyle[@w:val='BulletList']">
<xsl:choose>
<!-- if the preceding paragraph was also a BulletList style,
then it has already been handled through the
'insideBulletList' mode; ignore it here -->
<xsl:when
test="preceding-sibling::*[1][self::w:p/w:pPr/w:pStyle[@w:val='BulletList']]"/>
<!-- otherwise, start a UL tag and apply templates with the
'insideBulletList' mode -->
<xsl:otherwise>
<ul>
<xsl:apply-templates select="." mode="insideBulletList"/>
</ul>
</xsl:otherwise>
</xsl:choose>
</xsl:when>
. . .
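The same start-of-block test carries over to other multi-paragraph styles. As a sketch only, the standalone stylesheet below applies it to the NumberList style, producing an <ol>. The mode name insideNumberList is hypothetical, chosen by analogy with insideBulletList, and every other paragraph is simply output as a <p> to keep the example short.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    exclude-result-prefixes="w">

  <xsl:template match="w:p">
    <xsl:choose>
      <xsl:when test="w:pPr/w:pStyle[@w:val='NumberList']">
        <!-- open the list only at the first paragraph of a run of
             NumberList paragraphs; the rest are emitted by the mode -->
        <xsl:if test="not(preceding-sibling::*[1][self::w:p/w:pPr/w:pStyle[@w:val='NumberList']])">
          <ol>
            <xsl:apply-templates select="." mode="insideNumberList" />
          </ol>
        </xsl:if>
      </xsl:when>
      <!-- simplified fallback for this sketch -->
      <xsl:otherwise>
        <p><xsl:apply-templates /></p>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- emit one item, then hand over to the next sibling if it is
       also a NumberList paragraph (hypothetical mode name) -->
  <xsl:template match="w:p" mode="insideNumberList">
    <li><xsl:apply-templates /></li>
    <xsl:apply-templates
        select="following-sibling::*[1][self::w:p/w:pPr/w:pStyle[@w:val='NumberList']]"
        mode="insideNumberList" />
  </xsl:template>

</xsl:stylesheet>
```

Because both tests look only at the immediately adjacent sibling, any element that Word places between two list paragraphs ends the current list and starts a new one.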
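The BulletList block above reflects the pattern also used for the NumberList, DownloadList, and pre paragraph styles.
Runs and Character Formatting
In WordprocessingML, the run tag <w:r> is the common container within a paragraph for content such as text or pictures, and it is where character formatting is applied:
<!-- =============================================================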
match run nodes <w:r> within a paragraph;
this is the common container for content such as text or
pictures and where we'll deduce character formatting
============================================================= -->
<!-- match run nodes within a paragraph-->
<xsl:template match="w:r">
<!-- =======================================================
Character formatting at this level is identified with
run property nodes <w:rPr>; for example, a run with bold
formatting will have a <w:rPr> tag with a <w:b> tag for
a child. As multiple formatting tags are possible, we
need a way to account for them all while maintaining the
necessary XML hierarchical structure. We'll accomplish
this with a recursive template that loops through
each property <w:rPr> node, surrounding the content nodes
with proper formatting tags.
The recursive template is "recurseRunProps"; we initiate
the first call to it here.
======================================================= -->
<!-- iterate through all w:rPr child tags to apply formatting;
at the end of the recursion, the run text is applied -->
<xsl:call-template name="recurseRunProps">
<xsl:with-param name="nodeCount" select="1" />
<xsl:with-param name="propNodes" select="w:rPr/*" />
<!-- run content will be any child node that isn't a <w:rPr>
node; this will include text <w:t>, picture <w:pict>,
and line breaks <w:br> -->
<xsl:with-param name="runContent" select="*[not(self::w:rPr)]" />
</xsl:call-template>
</xsl:template>
The recursive template recurseRunProps checks to see if it has been passed a valid node, and if so tries to match supported character formatting. If a supported formatting tag is caught, an HTML formatting tag is output, wrapped around a recursive call that processes the next property node. The following shows the pattern in recurseRunProps for matching bold formatting:
<!-- =============================================================
This is a recursive template that is called when a run tag
<w:r> is matched
parameters:
nodeCount - the index of the <w:rPr> child node within
propNodes that should be processed
propNodes - the complete list of <w:rPr> child nodes to
recursively process
runContent - the content (i.e. all nodes other than
<w:rPr>) around which formatting tags
are to be applied
============================================================= -->
<xsl:template name="recurseRunProps">
<xsl:param name="nodeCount" />
<xsl:param name="propNodes" />
<xsl:param name="runContent" />
<!-- select the <w:rPr> node to process, based on the index
nodeCount -->
<xsl:variable name="curNode" select="$propNodes[$nodeCount]" />
<!-- is this a valid node to process? -->
<xsl:choose>
<xsl:when test="$curNode">
<!-- this is a valid node; process it, and
recursively call processing for the next node;-->
<xsl:choose>
<!-- process Bold tags -->
<xsl:when test="$curNode[self::w:b]">
<b>
<xsl:call-template name="recurseRunProps">
<xsl:with-param name="propNodes" select="$propNodes" />
<xsl:with-param name="nodeCount" select="$nodeCount+1" />
<xsl:with-param name="runContent" select="$runContent" />
</xsl:call-template>
</b>
</xsl:when>
. . .
<!-- we don't recognize this run formatting tag;
ignore it and go to the next -->
<xsl:otherwise>
<xsl:call-template name="recurseRunProps">
<xsl:with-param name="propNodes" select="$propNodes" />
<xsl:with-param name="nodeCount" select="$nodeCount+1" />
<xsl:with-param name="runContent" select="$runContent" />
</xsl:call-template>
</xsl:otherwise>
</xsl:choose>
</xsl:when>
<!-- If this isn't a valid node, then we're out of nodes
to process; output the run content at this point
and end the recursion. The run content is handled
by applying templates that match run content nodes
(such as <w:t> or <w:pict>) -->
<xsl:otherwise>
<xsl:apply-templates select="$runContent" />
</xsl:otherwise>
</xsl:choose>
</xsl:template>
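To show how a second formatting tag slots into the recursion, the standalone sketch below handles bold and italic together. The italic branch is an assumption about what one of the elided branches might look like, not a copy of the article's stylesheet. The sketch tests the node with self::w:b and self::w:i rather than comparing name(), so that the match does not depend on the prefix used in the source document.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    exclude-result-prefixes="w">

  <!-- start the recursion at the first child of w:rPr -->
  <xsl:template match="w:r">
    <xsl:call-template name="recurseRunProps">
      <xsl:with-param name="nodeCount" select="1" />
      <xsl:with-param name="propNodes" select="w:rPr/*" />
      <xsl:with-param name="runContent" select="*[not(self::w:rPr)]" />
    </xsl:call-template>
  </xsl:template>

  <xsl:template name="recurseRunProps">
    <xsl:param name="nodeCount" />
    <xsl:param name="propNodes" />
    <xsl:param name="runContent" />
    <xsl:variable name="curNode" select="$propNodes[$nodeCount]" />
    <xsl:choose>
      <!-- out of property nodes: output the run content -->
      <xsl:when test="not($curNode)">
        <xsl:apply-templates select="$runContent" />
      </xsl:when>
      <!-- bold: wrap the rest of the recursion in <b> -->
      <xsl:when test="$curNode[self::w:b]">
        <b>
          <xsl:call-template name="recurseRunProps">
            <xsl:with-param name="nodeCount" select="$nodeCount + 1" />
            <xsl:with-param name="propNodes" select="$propNodes" />
            <xsl:with-param name="runContent" select="$runContent" />
          </xsl:call-template>
        </b>
      </xsl:when>
      <!-- italic (assumed branch): wrap the rest in <i> -->
      <xsl:when test="$curNode[self::w:i]">
        <i>
          <xsl:call-template name="recurseRunProps">
            <xsl:with-param name="nodeCount" select="$nodeCount + 1" />
            <xsl:with-param name="propNodes" select="$propNodes" />
            <xsl:with-param name="runContent" select="$runContent" />
          </xsl:call-template>
        </i>
      </xsl:when>
      <!-- unrecognized property: skip it and continue -->
      <xsl:otherwise>
        <xsl:call-template name="recurseRunProps">
          <xsl:with-param name="nodeCount" select="$nodeCount + 1" />
          <xsl:with-param name="propNodes" select="$propNodes" />
          <xsl:with-param name="runContent" select="$runContent" />
        </xsl:call-template>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

For a run whose <w:rPr> holds <w:b/> followed by <w:i/>, the first call opens <b>, the second opens <i> inside it, and the third finds no node left and outputs the text, so the tags nest in the order the properties appear.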
Run Content: Text, Line Breaks, and Images
Following the recursive application of character formatting, we process the content of a run. Each type of content is supported through its own template matching a WordprocessingML tag. Regular text, represented by a <w:t> tag, is output as-is:
<!-- match text nodes within a run-->
<xsl:template match="w:t">
<!-- simple - just output the text content -->
<xsl:value-of select="." />
</xsl:template>
Line breaks (created in Word by pressing [Shift]+[Enter]) are also simple to address with a template matching the <w:br> tag:
<!-- match br tags in a run -->
<xsl:template match="w:br">
<!-- simple line break within a paragraph; this is entered in word with
[Shift]+[Enter] -->
<br />
</xsl:template>
Images are a little more complicated. Our output should be an <img> tag, with its src and style attributes taken from the VML data inside the <w:pict> node:
<!-- match linked picture nodes within a run -->
<xsl:template match="w:pict">
<!-- linked pictures can be handled this way; embedded pics cannot -->
<!-- output as an <img> tag -->
<img>
<!-- output the src attribute; this seems to be a file name
relative to the word document -->
<xsl:attribute name="src">
<xsl:value-of select="v:shape/v:imagedata/@src" />
</xsl:attribute>
<!-- word is using VML to store image information;
for width and height this is in css units;
capture the css style property and apply to
the html <img> tag -->
<xsl:if test="v:shape/@style">
<xsl:attribute name="style">
<xsl:value-of select="v:shape/@style" />
</xsl:attribute>
</xsl:if>
</img>
</xsl:template>
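Since embedded pictures cannot be handled this way, it may help to keep them from producing a broken <img> tag. The standalone sketch below is one possible guard and is not part of the article's stylesheet. It assumes that Word 2003 references embedded picture data with a src value beginning with wordml://; check this against your own documents before relying on it.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    xmlns:v="urn:schemas-microsoft-com:vml"
    exclude-result-prefixes="w v">

  <xsl:template match="w:pict">
    <xsl:variable name="src" select="v:shape/v:imagedata/@src" />
    <xsl:choose>
      <!-- assumption: embedded pictures carry a wordml:// reference;
           leave a comment in the HTML instead of a dead image -->
      <xsl:when test="starts-with($src, 'wordml://')">
        <xsl:comment> embedded picture skipped </xsl:comment>
      </xsl:when>
      <!-- linked picture: output an <img> as before -->
      <xsl:when test="string($src)">
        <img src="{$src}">
          <xsl:if test="v:shape/@style">
            <xsl:attribute name="style">
              <xsl:value-of select="v:shape/@style" />
            </xsl:attribute>
          </xsl:if>
        </img>
      </xsl:when>
      <!-- no image data at all: output nothing -->
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

The comment left in the output makes it easy to search the generated HTML for pictures that still need to be linked by hand.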
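Hyperlinks
In WordprocessingML, a hyperlink is represented with a <w:hlink> tag, whose w:dest attribute holds the destination and whose w:bookmark attribute names a bookmark within it. Document bookmarks are represented by Word as a pair of <aml:annotation> tags, one marking the start of the bookmark and one the end. Whether the destination is external or internal, the link is output as an HTML <a> tag, and the start of each bookmark becomes a named anchor:
<!-- match hlink nodes within a paragraph -->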
<xsl:template match="w:hlink">
<!-- get the destination, if any -->
<xsl:variable name="dest">
<xsl:value-of select="@w:dest" />
</xsl:variable>
<!-- set up the anchor tag -->
<a>
<!-- add the href attribute -->
<xsl:attribute name="href">
<xsl:choose>
<!-- if the w:bookmark attribute is present, use it -->
<xsl:when test="@w:bookmark">
<xsl:value-of select="concat($dest, '#', @w:bookmark)" />
</xsl:when>
<!-- if not, just use the destination value -->
<xsl:otherwise>
<xsl:value-of select="$dest" />
</xsl:otherwise>
</xsl:choose>
</xsl:attribute>
<!-- if there is a w:target attribute, use it too -->
<xsl:if test="@w:target">
<xsl:attribute name="target">
<xsl:value-of select="@w:target" />
</xsl:attribute>
</xsl:if>
<!-- add inner content, probably a <w:t> tag -->
<xsl:apply-templates />
</a>
</xsl:template>
<!-- match the tag that indicates the starting position
for a bookmark -->
<xsl:template match="aml:annotation">
<xsl:if test="@w:type='Word.Bookmark.Start'">
<!-- use the w:name attribute to identify the bookmark name -->
<!-- place an anchor link here with that name -->
<a>
<xsl:attribute name="name">
<xsl:value-of select="@w:name" />
</xsl:attribute>
</a>
</xsl:if>
</xsl:template>
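For reference, here is a hand-written paragraph containing the kinds of nodes these two templates match. It is an illustration rather than real Word output; the bookmark name, link text, and aml:id values are invented for the example.

```xml
<w:p xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
     xmlns:aml="http://schemas.microsoft.com/aml/2001/core">

  <!-- bookmark start/end pair; only the start carries the name,
       so only the start produces an <a name="..."> anchor -->
  <aml:annotation aml:id="0" w:type="Word.Bookmark.Start" w:name="Summary" />
  <w:r><w:t>Summary</w:t></w:r>
  <aml:annotation aml:id="0" w:type="Word.Bookmark.End" />

  <!-- external link: w:dest only, so href is the destination as-is -->
  <w:hlink w:dest="http://www.codeproject.com">
    <w:r><w:t>The Code Project</w:t></w:r>
  </w:hlink>

  <!-- internal link: no w:dest, so concat() yields href="#Summary" -->
  <w:hlink w:bookmark="Summary">
    <w:r><w:t>Jump to the summary</w:t></w:r>
  </w:hlink>

</w:p>
```

The internal link works because $dest is an empty string when w:dest is absent, leaving only the '#' and the bookmark name in the href.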
Tables
Table formatting has been kept simple in this stylesheet. The template contains two table styles, TableBorder0 and TableBorder1, which are interpreted in the XSL instructions to apply either "0" or "1" for the output table's border attribute:
<!-- match the outer table <w:tbl> tag -->
<xsl:template match="w:tbl">
<table>
<xsl:attribute name="border">
<!-- if it's a TableBorder0 style, set the border to 0 -->
<!-- otherwise, set the border to 1 -->
<xsl:choose>
<xsl:when test="w:tblPr/w:tblStyle/@w:val = 'TableBorder0'">0</xsl:when>
<xsl:otherwise>1</xsl:otherwise>
</xsl:choose>
</xsl:attribute>
<!-- apply templates for inner table content (i.e. <tr> tags) -->
<xsl:apply-templates />
</table>
</xsl:template>
We supply the following template to match the <w:tr> table row tag:
<!-- match the table row <w:tr> tag -->
<xsl:template match="w:tr">
<tr valign="top">
<xsl:apply-templates />
</tr>
</xsl:template>
Finally, we process individual cells within a row by matching the <w:tc> tag:
<!-- match the table cell <w:tc> tag -->
<xsl:template match="w:tc">
<td>
<!-- does this table cell have a background color? -->
<xsl:choose>
<xsl:when test="w:tcPr/w:shd/@w:fill">
<!-- if so, apply it to the <td> tag as an attribute -->
<xsl:attribute name="bgColor">
<xsl:value-of select="concat('#', w:tcPr/w:shd/@w:fill)" />
</xsl:attribute>
</xsl:when>
</xsl:choose>
<!-- adjust the vertical align? -->
<xsl:choose>
<xsl:when test="w:tcPr/w:vAlign/@w:val">
<xsl:attribute name="valign">
<xsl:value-of select="w:tcPr/w:vAlign/@w:val" />
</xsl:attribute>
</xsl:when>
</xsl:choose>
<!-- apply templates for inner content -->
<xsl:apply-templates />
</td>
</xsl:template>
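Two values can slip through the cell template unchanged and produce attributes that HTML does not understand. The standalone sketch below is a possible refinement, not the article's code. It assumes that Word may write "auto" as the fill value when no explicit color is set, and that Word's "center" vertical alignment corresponds to "middle" in HTML; both assumptions should be checked against your own documents.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    exclude-result-prefixes="w">

  <xsl:template match="w:tc">
    <td>
      <xsl:variable name="fill" select="w:tcPr/w:shd/@w:fill" />
      <xsl:variable name="vAlign" select="w:tcPr/w:vAlign/@w:val" />

      <!-- assumption: "auto" means no explicit color, so skip it
           rather than output bgcolor="#auto" -->
      <xsl:if test="string($fill) and $fill != 'auto'">
        <xsl:attribute name="bgcolor">
          <xsl:value-of select="concat('#', $fill)" />
        </xsl:attribute>
      </xsl:if>

      <xsl:choose>
        <!-- assumption: Word's "center" maps to HTML "middle" -->
        <xsl:when test="$vAlign = 'center'">
          <xsl:attribute name="valign">middle</xsl:attribute>
        </xsl:when>
        <!-- values that HTML accepts unchanged -->
        <xsl:when test="$vAlign = 'top' or $vAlign = 'bottom'">
          <xsl:attribute name="valign">
            <xsl:value-of select="$vAlign" />
          </xsl:attribute>
        </xsl:when>
      </xsl:choose>

      <!-- apply templates for inner content -->
      <xsl:apply-templates />
    </td>
  </xsl:template>

</xsl:stylesheet>
```

When no vertical alignment is set on the cell, no valign attribute is written, so the value set on the enclosing row still applies.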
Additional Considerations
Word 2003 does a good job of representing a document with full fidelity in WordprocessingML – too good a job, in fact. Proofing errors, for example, may render as tags whether or not the options to display such errors are enabled. This can impact the XSL transformation.
Proofing Errors in Lists
When spelling or grammar errors exist at the beginning of a list item, an additional proofing tag may appear in the WordprocessingML, which can interfere with the list handling described above.
Hyperlinks as Field Codes
The existence of proofing errors may cause hyperlinks to render differently as well. I have seen hyperlinks represented in WordprocessingML as combinations of field code tags.
Summary
This article presents an XSL stylesheet for transforming a Word 2003 document into a simple HTML syntax, at the same time offering a Word 2003 template for CodeProject article authors. By making heavy use of Word styles, and by matching specific WordprocessingML tags, common HTML may be rendered without the verbosity typical of Word's Save as HTML command. Certain transformation issues are resolved through resourceful XSL application. For example, the problem of dealing with multiple-paragraph blocks is resolved by using the mode attribute of XSL templates.