Click here to Skip to main content
6,595,444 members and growing! (17,959 online)
Email Password   helpLost your password?
Enterprise Systems » Office Development » General     Intermediate License: The Code Project Open License (CPOL)

Token Replacement in Office Documents

By Rahul Singla

A new productive way of producing Office documents with dynamic content substitution.
VB (VB 7.x, VB 8.0, VB 9.0), Office, .NET 2.0, .NET 3.0, .NET 3.5VS2005, VS2008, Dev
Version:5 (See All)
Posted:22 Dec 2008
Updated:22 May 2009
Views:12,660
Bookmarked:35 times
Announcements
Loading...
 
Search    
Advanced Search
Add to IE Search
printPrint   add Share
      Discuss Discuss   Broken Article?Report  
7 votes for this article.
Popularity: 4.08 Rating: 4.83 out of 5

1

2

3
1 vote, 14.3%
4
6 votes, 85.7%
5

Introduction

A common situation is to prepare invoices, etc. from information in a database. Clients usually have extremely customized invoice formats, but the data to be filled in is basically the same. This article and its related code present a new (I have searched, but could not find anything related to this topic, so I believe it's new), innovative, and productive way of producing dynamic Office documents. I am using the word Office in this entire document as it does not relate solely to Microsoft Office. I have tested this approach on Open Office as well, and it works. The associated code enables to perform Token Replacement on both Office products.

Background

Until now, I was creating HTML on the fly, and making it available for download as a *.doc or *.xls file, exploiting Office application capabilities for processing HTML documents. But, there was a limit this approach would go to. There were obvious limitations in precise positioning in HTML and how Office applications interpret and display that. Not to mention the advanced processing capabilities of Office documents that are missing while generating dynamic HTML.

A new approach had been swirling around my mind since the time Office products embraced XML formats. We, developers, have been using token replacement for a long time for producing dynamic content. How about doing it with Office documents!!

So, I finally got a chance to implement this, when the document I had to produce exceeded HTML formatting capabilities. Now, I was trying to create a Word document, with tokens of the form [$TokenName$] in it, and replacing the tokens with actual text, programmatically. However, it was not as easy as I thought.

Word 2007 splits up tokens into multiple parts, depending upon regions to be checked for spellings. Ditto is the case with OO (OO refers to Open Office). These Office products might split up what you consider a single word into multiple runs (they use the word run) of text surrounded by their XML. That makes the scenario almost impossible. I tried to discuss my approach at Microsoft Forums and at Brian Jones' blog. However, I could not get any useful help.

So, I decided to take it up myself.

Some More Background

As I said, I would be talking of Office XML in general, not particularly related to Microsoft Office or Open Office. Consider one thing before moving forward, only Microsoft Office 2007 applications use a full XML document format. Office 2003 probably uses XML, but I have not studied the Office 2003 document format. Neither have I tested the code on Office 2003 documents.

Regarding OO, I downloaded it specifically for this project. The version was 2.4 at the time I downloaded it. So, it should work on it and later versions. However, you, as a developer, should not be worried. Read forward.

I evaluated several options before actually having to use this approach I am describing (you can say, I was not left with any other option). VSTO is Microsoft Office centric. Also, with the minimal knowledge I collected about VSTO (by asking questions on forums), I thought it would not help. Microsoft's Open XML SDK for Office was indeed a very attractive one. However, I was not in any way interested in the Open XML schemas used by Microsoft Office, and this product's documentation clearly stated that you need to have a good knowledge of the XML schemas to exploit this library.

Then, I ripped opened a Microsoft Office 2007 document (it is simply a zip file, just unzip it), and analyzed the contents. It was plain XML. And, I immediately decided to use the XML processing capabilities of .NET to process what it actually was, just plain-sane XML, not treating it as any special schema.

So, the attached code does not require anything special to be installed. You can use it easily on a Desktop application, as well as a Web application, with the only requirement of .NET 3.5 installed on the machine where the code is executing. Not even the associated Office product needs to be installed. I am treating it as pure XML, so there are no other requirements absolutely. Just prepare an Office document with your tokens in it, deploy the document with your code, and you are ready to scramble.

Now Comes the Code

It all starts with opening up an Office package (an Office file is now called an Office package technically, as it is not just a single file). A good starting point would be to open any Office document with your favorite Zip tool and analyze the contents.

I am using the SharpZipLib library for anything related to processing Zip files during this token replacement process. Once you open the Zip file, you would notice that there are multiple files. The main content file is "word/document.xml" for Microsoft Word 2007 and "content.xml" for OO Writer. As of now, the code can process only Word and Writer packages. I would add support for Excel and OO Calc as I get time. I acknowledge that the code attached is a bit immature at this time. Still, I am writing this article to discuss it with others so that you can adapt and enhance it for your scenario, if you find it useful. I am not sure when I would be able to make enhancements myself.

The overall code structure is as follows:

ClassDiagram1.gif

Straight away after opening up the Zip file, I read its contents into an XDocument (the code uses LINQ and LINQ to XML all over). Right now, the major functionality is coming from the TokenReplacementBase class. You specify the filename to open, the token start, and the token end (both strings).

OfficePackage is the helper class to enable reading and modification of the Office document packages.

Regex

The code relies heavily on the .NET System.Text.RegularExpressions for token replacement. You should be careful while choosing the token start and the token end. I am not a Regex expert, and as of now, the characters in your token delimiters should not appear in your content. I have used "[$" and "$]" in the attached sample doc files.

Immediately after constructing an object of the appropriate base class, you can start calling the replaceToken() method (in a loop probably, to process all tokens), passing in your token (including the delimiters) and the replacement value. On the first call to this method, the code parses the entire content, looking for tokens, and stores a dictionary of matches found with the token as the key. After that, and on all subsequent calls, it just consults this dictionary to perform the substitution.

As an alternative, you can create a List (of TokenReplacementInfo) (a helper class in the project), and call replaceTokens(), passing in this List just once, and it performs all the substitutions.

XmlUtil is the helper class that helps in Office XML specific text matching and replacement.

Sample

Suppose you have the following token: [$Date$].

You can replace it with the call...

t.replaceToken("[$Date$]", DateTime.Today)

... where t is an object of the concrete class for the package you are processing (Rahul.Office.MS.Word.TokenReplacement or Rahul.Office.OO.Writer.TokenReplacement).

Fine Points

  • The token replacement functionality is flexible. You can specify metadata in your token to control exactly how the token is replaced. E.g.:
    • The token:
      [$Date$<metadata><type>date
      </type><format>dd.MM.yyyy</format></metadata>]

      when replaced, would adhere to the date format you specified.

    • As another one:
      [$InvoiceTotalValue$<metadata><type>money
      </type><format>text</format></metadata>]

      would treat the replacement value as a monetary one, and automatically convert it to English (hundred thousand, five hundred six only, etc.).

    • The last metadata supported right now is for normal text.
      [$InvoiceTotalValue$<metadata>
      <type>money</type><format>text</format>
      <transform>upper</transform></metadata>

      The transforms "upper" and "lower" converts the text to uppercase or lower-case, respectively. This can be applied to text as well as monetary values whose format is text.

    Note that the metadata appears only in the Office document you have prepared. While making function calls, you still just call:

    t.replaceToken("[$Date$]", DateTime.Today)

    The code automatically determines if that token has any metadata attached in the document. The metadata tags can appear anywhere between your token.

  • The same token can appear multiple times. The default behavior is to replace all occurrences of the same token with the replacement value specified as the first call to replaceToken() with that token as the input. However, there is an overload of replaceToken() available, where you can specify the ith occurrence of the token to be replaced. Thus, the following call...
    t.replaceToken("[$Date$]", DateTime.Today, 2)

    ... would replace only the second occurrence of the token [$Date$] in the original content. (There are some finer points related to ith occurrence replacement, which are documented properly in the code attached.)

The same token can occur multiple times with different metadata. And, each token occurrence would be replaced taking into consideration the metadata specified, if any, for that occurrence only.

  • Another common situation in invoicing would be an invoice that can contain multiple products. But, you don't know how many when preparing the tokenized document at the development time. Not need to worry. Just prepare a sample row in a table. Leave a unique token in that row in any cell. Then, before calling replaceToken(), call replicateRow(), passing in that unique token and the number of times the row is to be replicated.
  • All replacements are done in memory. When you are finished, remember to call the save() function, as that replaces the content in the original Office package specified in the constructor.

Using the Code

Here are the precise steps for using the code:

  1. Prepare a Word or Writer document (not template) with tokens in it.
  2. Create a Console application. Add a reference to the attached Rahul.Office assembly
  3. Start replacing tokens. A sample code is presented below.

Sample

Here's all it takes to replace tokens:

Sub Main()
    'Copy the Tokenized file. The new copy would actually be processed.
    System.IO.File.Copy("TokenizedInvoice.docx", _
                        "ProcessedInvoice.docx", True)

    'Construct an object to Microsoft Office Word Token Replacement.
    Dim t As New TokenReplacement("ProcessedInvoice.docx", _
                                  "[$", "$]")

    'Construct a list of Token Info's to be replaced.
    Dim list As New List(Of TokenReplacementInfo)

    list.Add(New TokenReplacementInfo("[$LCNo$]", "11111"))
    list.Add(New TokenReplacementInfo("[$LCInvoiceNo$]", "22222"))

    'Pay attention here. The Date token has metadata in the Invoice.
         'You do not (should not) specify the metadata here.
    list.Add(New TokenReplacementInfo("[$LCInvoiceDate$]", DateTime.Today))

    'Notice the Tokenized document has a token called [$ReplicateRow$]. That
         'row has Product information, and suppose I have 2 prodcuts in this
         'Invoice. True indicates to remove the Row Replication token after the
         'replication process.
    t.replicateRow("[$ReplicateRow$]", True, 2)

    'I am substituting 2 products here. So, using 0 and 1 to indicate the
         'occurence of the token to be replaced, because the replicate rows would
         'have identical Tokens.
    'First good goes here. Last parameter is 0 for replacement of first
         'occurrences of all tokens specified. (-1 would have replaced all),
    'which also is the default behaviour without the third parameter.
    list.Add(New TokenReplacementInfo("[$LCGoodOrderName$]", _
             "My neighbour's car", 0))
    list.Add(New TokenReplacementInfo("[$LCGoodOrderBrand$]", _
             "Bentley", 0))
    list.Add(New TokenReplacementInfo("[$LCGoodOrderSpecification$]", _
             "Black, with leather upholstery", 0))

    'Second good goes here Last parameter is 1 for replacement of second
         'occurrences of all tokens specified.
    'Omitted for brevity

    'Check the Tokenized Invoice. There are multiple occurrences of
         '[$LCInvoiceTotalValue$] with different metadata.
    'Just one call replaces all, honoring each one's metadata individually.
    list.Add(New TokenReplacementInfo("[$LCInvoiceTotalValue$]", 10010200))
    list.Add(New TokenReplacementInfo("[$LCDeliveryType$]", "CNF"))

    'Replace all in one go.
    t.replaceTokens(list)

    'Don't forget the save the document.
    t.save()
End Sub

Advantages over Open XML SDK or Other Such Options

  1. You need to have a good understanding of Open XML schemas for using these SDKs.
  2. You are stuck with one particular product when using them.
  3. You need them installed on the target machine for use. Here, just drop the Rahul.Office assembly into the bin folder, or copy the code files to your project.

I am not trying to play down these SDKs. They are very powerful. But, I believe they are too powerful to be used in regular development, unless you have a good understanding of schemas.

Still To Be Done

I needed to deliver the functionality quickly to a client, and assembled the original code quickly. Since then, I have made some enhancements to it, that I have updated in the article.

I am using lots of regex, and probably they can be tweaked to increase performance (although I have been able to Token Replace large documents in virtually no time). There are many more features or metadata extensions that could be added. Support for Excel and Calc, at least, is desired. Currently, Replacement in Header/Footer sections is not supported (see the Comment below for details. More replacement options, the list would never end.

I will try to take time out and enhance this. But, right now, as it stands, the code should satisfy many requirements in a majority of the cases.

History

  • 22nd December 2008: Initial post
  • 22nd May, 2009: Article updated

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Rahul Singla


Member

Location: India India

Other popular Office Development articles:

Article Top
You must Sign In to use this message board.
FAQ FAQ 
 
Noise Tolerance  Layout  Per page   
 Msgs 1 to 21 of 21 (Total in Forum: 21) (Refresh)FirstPrevNext
GeneralWhy not playing with Rtf PinmemberErnest Bariq15:05 28 Jun '09  
GeneralA better approach might be ContentPlaceHolders? Pinmemberdeloford22:07 22 May '09  
GeneralRe: A better approach might be ContentPlaceHolders? PinmemberRahul Singla22:39 22 May '09  
GeneralRe: A better approach might be ContentPlaceHolders? [modified] Pinmemberdeloford4:31 23 May '09  
GeneralRe: A better approach might be ContentPlaceHolders? PinmemberRahul Singla7:20 23 May '09  
GeneralReplace in Headers and Footers PinmemberElsaWood5:46 22 May '09  
GeneralRe: Replace in Headers and Footers [modified] PinmemberRahul Singla7:10 22 May '09  
GeneralRe: Replace in Headers and Footers [modified] PinmemberRahul Singla6:49 24 May '09  
GeneralRight thing PinmemberBülent1:01 29 Jan '09  
GeneralRe: Right thing PinmemberRahul Singla18:48 29 Jan '09  
Generallist of docs Pinmemberbidulle3:55 9 Jan '09  
GeneralRe: list of docs PinmemberRahul Singla22:46 9 Jan '09  
GeneralRe: list of docs PinmemberErnest Bariq17:02 10 Jan '09  
GeneralRe: list of docs PinmemberRahul Singla19:36 10 Jan '09  
GeneralRe: list of docs PinmemberErnest Bariq17:04 10 Jan '09  
GeneralInteresting PinmemberNyarlatotep1:46 30 Dec '08  
QuestionWhy aren't you using the OpenXML SDK? PinmemberItay Sagui6:40 22 Dec '08  
AnswerRe: Why aren't you using the OpenXML SDK? PinmemberSoulStone-BR10:49 22 Dec '08  
GeneralRe: Why aren't you using the OpenXML SDK? PinmemberItay Sagui12:53 22 Dec '08  
GeneralRe: Why aren't you using the OpenXML SDK? PinmemberRahul Singla19:02 22 Dec '08  
GeneralRe: Why aren't you using the OpenXML SDK? PinmemberRahul Singla19:08 22 Dec '08  

General General    News News    Question Question    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

PermaLink | Privacy | Terms of Use
Last Updated: 22 May 2009
Editor: Deeksha Shenoy
Copyright 2008 by Rahul Singla
Everything else Copyright © CodeProject, 1999-2009
Web22 | Advertise on the Code Project