Overview

This article describes a conversion tool that takes an HTML document as input and generates a Microsoft Word document for printing.

It all started when I had to work on a new information system with hundreds of computers. We decided to go for a 100% web-based application. Everything was fine until we had to print official documents from the application...

Although there are standardization efforts in progress (both at the W3C with XHTML-PRINT and at the IEEE with the Print Working Group), and although there are some good tools for printing HTML (HTML Print from Bersoft, ScriptX from MeadCo), none of these addressed my needs. I wanted to keep my web-based application and reuse the generated HTML to feed a printer...

Have you tried to print HTML documents? Have you tried to format your HTML documents for printing, with specific fonts, sizes, headers, footers, and margins? If you have, then you know that HTML is not well suited to printing - but you can use other tools to convert HTML documents into Microsoft Word format, a format that is suitable for printing... And this is what this article is about.

Features

The XHTML2RTF conversion tool:
Introduction

The XHTML2RTF conversion tool uses an XSL style sheet to convert an XHTML document into an RTF document, suitable for previewing and printing with Word (or Word Viewer).

XHTML = HTML + XML

The Extensible HyperText Markup Language (XHTML) is a family of current and future document types and modules that reproduce, subset, and extend HTML, reformulated in XML. XHTML family document types are all XML-based, and are ultimately designed to work in conjunction with XML-based user agents. XHTML is the successor of HTML, and a series of specifications has been developed for it.

The XHTML2RTF conversion tool reads XHTML documents as input. As a consequence, you have to adapt your application to produce XHTML before you can use this tool.

XSL

XSL stands for eXtensible Stylesheet Language. It is a family of recommendations for defining XML document transformation and presentation. It consists of three parts: XSLT (a language for transforming XML documents), XPath (a language for addressing parts of an XML document), and XSL Formatting Objects (a vocabulary for specifying formatting).
For more about XSL, please refer to the XSL reference pages.

The XHTML2RTF conversion tool uses XSL to transform XHTML documents (which are XML documents) into RTF documents. This is the core of the tool - everything else is just glue to build your application. Everything is in the XSL style sheet.

Microsoft XML SDK 3.0

Microsoft provides an XML SDK for processing XML and XSL documents. It is often installed with the operating system, but you can download and install the latest SDK. See the References section for more on the MSXML SDK.

The XHTML2RTF conversion tool uses XML SDK objects and methods to process XHTML and transform it into RTF. The XML SDK API is available to web applications as well as batch applications, and so is the XHTML2RTF conversion tool.

Microsoft Rich Text Format (RTF)

Microsoft created an exchange format for Word documents: Rich Text Format (RTF). Unlike the native Word format, it is documented; moreover, RTF has been around for some time (so you can view RTF documents with good old Word 97). There is also a free RTF viewer (Word 97/2000 Viewer), and even WordPad (installed with most Windows releases) can open, view, and edit RTF documents.

XHTML2RTF

The XHTML to RTF converter consists of an XSL style sheet that parses XHTML tags and generates their RTF equivalents.

Usage

From HTML to XHTML

You have to adapt your application to generate XHTML documents if you want to use the XHTML2RTF conversion tool:
Thus, you will be able to customize the RTF output for your class (it's too hard to parse an HTML style declaration within an XSL style sheet).

Spaces in HTML and RTF

In HTML, runs of whitespace are not significant - most browsers collapse them when they render the document. Microsoft Word (and RTF), on the other hand, renders every space as a visible character. Be careful when building your HTML document: do not generate extra spaces, or they will show up in your Word document.

Header and footer in HTML and RTF

The default header in the RTF document is taken from the HTML document.
The default footer in the RTF document contains the page number and the document date (the current date and time, i.e. the print date and time). You can change the footer by setting style sheet parameters.

Page break

To force a page break in your RTF document, you can use the CSS style "page-break-before:always":

This is on page 1
<p style="page-break-before:always"/>
This is on page 2

Note that other values for these CSS styles (left, right, auto...) are not supported (only "always" is).

XSL style sheet parameters

The XSL style sheet xhtml2rtf.xsl provides a set of parameters so that you can change the style sheet's default behavior:
Batch mode (WSH)

I wrote a batch program (XHTML2RTF.BAT) that relies on Windows Script Host (WSH) to call the XML DOM SDK and transform an HTML file into its RTF equivalent.

To use this component from batch, call the program XHTML2RTF.BAT with the HTML file name as parameter. The RTF file is written to standard output, so redirect it to a file:

C:\> XHTML2RTF.BAT Readme.htm > Readme.rtf
C:\> START WINWORD Readme.rtf

To pass parameters to the XHTML2RTF program, use the -p flag followed by the parameter name and value. For example:

C:\> XHTML2RTF.BAT -p page-start-number=5 -p document-protected=0 -p font-name-default='Arial' Readme.htm > Readme.rtf
C:\> START WINWORD Readme.rtf
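When the batch program is launched from another script, the -p flags can be assembled from a plain object instead of being typed by hand. The following is a minimal sketch for Node.js: buildXhtml2RtfArgs is a hypothetical helper that is not part of XHTML2RTF, and it only builds the argument list in the form shown in the example above; it does not run the conversion.

```javascript
// Hypothetical helper, not part of XHTML2RTF: builds the argument list
// for XHTML2RTF.BAT from an object of style sheet parameters.
function buildXhtml2RtfArgs(inputFile, params) {
  const args = [];
  // Each parameter becomes a "-p name=value" pair, in insertion order.
  for (const [name, value] of Object.entries(params || {})) {
    args.push("-p", `${name}=${value}`);
  }
  // The HTML file name comes last, as in the batch examples.
  args.push(inputFile);
  return args;
}

const args = buildXhtml2RtfArgs("Readme.htm", {
  "page-start-number": 5,
  "document-protected": 0,
  "font-name-default": "'Arial'",
});

// Prints the same command line as the example above.
console.log(["XHTML2RTF.BAT", ...args].join(" ") + " > Readme.rtf");
```

Keeping the parameters in an object makes it easier to share one set of defaults between several conversions.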
Web-based (ASP)

I wrote a simple ASP library to call the component from an ASP page, producing an RTF document from live, dynamic content (the results of a SQL database request, for example).

To use this component from a web page, you have to include the XHTML2RTF.inc file in your page and call the function XHTMLString2RTF:

<!--#include file="XHTML2RTF.inc"-->
// Build a minimal XHTML document as a string (demonstration only).
// Every line inside the string literal ends with a backslash continuation.
var strXHTML = " \
<html xmlns=\"http://www.w3.org/1999/xhtml\" \
xmlns:xhtml2rtf=\"http://www.lutecia.info/download/xmlns/xhtml2rtf\"> \
<head> \
<title>Hello, World! from string</title> \
</head> \
<body> \
<h1>Hello, World!</h1> \
</body> \
</html> \
";
// Convert the XHTML string to RTF.
XHTMLString2RTF(strXHTML);
Note: The real production system uses SQL requests, generates XML output, transforms it into XHTML via a first XSL style sheet, and then transforms that into an RTF document. The example above is just that - an example for demonstration purposes. Please do not generate HTML via strings on your production system ;-)

Raw RTF output

The XHTML2RTF conversion tool provides direct RTF output with no rendering in XHTML. The tool processes a special tag, <xhtml2rtf:raw>, whose content is sent to the RTF output as is.

There are many uses for this raw output - in particular, you can work around most of the current limitations of the conversion tool (as listed in the To do list section). For example, you can send the RTF code for an image, even though the conversion tool doesn't handle images yet:

<xhtml2rtf:raw class="rtf">
{\*\shppict{\pict\picw3043\pich3043\picwgoal1725\pichgoal1725\pngblip
89504e470d0a1a0a0000000d49484452000000730000007308020000002421
aab1000000017352474200aece1ce90000000467414d410000b18f0bfc61050000
...
}}
</xhtml2rtf:raw>
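The picture group above can also be generated from the image bytes, since RTF stores picture data as hexadecimal text. The following is a minimal sketch for Node.js: pngToRawRtf is a hypothetical helper that is not part of XHTML2RTF, and the size values (picw, pich, picwgoal, pichgoal) are assumed to be supplied by the caller, here copied from the example above.

```javascript
// Hypothetical helper, not part of XHTML2RTF: wraps PNG bytes in an
// <xhtml2rtf:raw> element using the same RTF control words as the example.
function pngToRawRtf(pngBytes, size) {
  // RTF stores picture data as hexadecimal text, two digits per byte.
  const hex = Buffer.from(pngBytes).toString("hex");
  // Break the hex stream into 64-character lines to keep the markup readable.
  const lines = hex.match(/.{1,64}/g) || [];
  const header =
    "{\\*\\shppict{\\pict" +
    `\\picw${size.picw}\\pich${size.pich}` +
    `\\picwgoal${size.picwgoal}\\pichgoal${size.pichgoal}\\pngblip`;
  return [
    '<xhtml2rtf:raw class="rtf">',
    header,
    ...lines,
    "}}",
    "</xhtml2rtf:raw>",
  ].join("\n");
}

// The 8-byte PNG signature stands in for a real image file here.
const signature = [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a];
console.log(
  pngToRawRtf(signature, { picw: 3043, pich: 3043, picwgoal: 1725, pichgoal: 1725 })
);
```

With a real image, the bytes would come from the PNG file (for example via fs.readFileSync), and the size values would have to match that picture.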
To find the RTF code for the image in the raw output example, I used Word to edit a document containing the picture and saved it in RTF format. I then opened the resulting file as text and copied the RTF code into the XHTML output, within the <xhtml2rtf:raw> tag.

RTF-specific fields

Some RTF-specific fields are available in the conversion tool.

Page number

You can display the current page number in your RTF document via the following tag:

<xhtml2rtf:page_number/>
Total number of pages

You can display the total number of pages in your RTF document, here next to the current page number, via the following tags:

<xhtml2rtf:page_number/> / <xhtml2rtf:total_number_of_pages/>
Samples
Implementation
To do list
References
Acknowledgements

Many thanks to 2can for the table support he added to my original source code.