![]() |
Enterprise Systems »
Office Development »
General
Intermediate
License: The Code Project Open License (CPOL)
Token Replacement in Office DocumentsBy Rahul SinglaA new productive way of producing Office documents with dynamic content substitution. |
VB (VB 7.x, VB 8.0, VB 9.0), Office, .NET 2.0, .NET 3.0, .NET 3.5VS2005, VS2008, Dev
|
|
Advanced Search Add to IE Search |
|
|
|
||||||||||||||||
A common situation is to prepare invoices, etc. from information in a database. Clients usually have extremely customized invoice formats, but the data to be filled in is basically the same. This article and its related code present a new (I have searched, but could not find anything related to this topic, so I believe it's new), innovative, and productive way of producing dynamic Office documents. I am using the word Office in this entire document as it does not relate solely to Microsoft Office. I have tested this approach on Open Office as well, and it works. The associated code enables to perform Token Replacement on both Office products.
Until now, I was creating HTML on the fly, and making it available for download as a *.doc or *.xls file, exploiting Office application capabilities for processing HTML documents. But, there was a limit this approach would go to. There were obvious limitations in precise positioning in HTML and how Office applications interpret and display that. Not to mention the advanced processing capabilities of Office documents that are missing while generating dynamic HTML.
A new approach had been swirling around my mind since the time Office products embraced XML formats. We, developers, have been using token replacement for a long time for producing dynamic content. How about doing it with Office documents!!
So, I finally got a chance to implement this, when the document I had to produce exceeded HTML formatting capabilities. Now, I was trying to create a Word document, with tokens of the form [$TokenName$] in it, and replacing the tokens with actual text, programmatically. However, it was not as easy as I thought.
Word 2007 splits up tokens into multiple parts, depending upon regions to be checked for spellings. Ditto is the case with OO (OO refers to Open Office). These Office products might split up what you consider a single word into multiple runs (they use the word run) of text surrounded by their XML. That makes the scenario almost impossible. I tried to discuss my approach at Microsoft Forums and at Brian Jones' blog. However, I could not get any useful help.
So, I decided to take it up myself.
As I said, I would be talking of Office XML in general, not particularly related to Microsoft Office or Open Office. Consider one thing before moving forward, only Microsoft Office 2007 applications use a full XML document format. Office 2003 probably uses XML, but I have not studied the Office 2003 document format. Neither have I tested the code on Office 2003 documents.
Regarding OO, I downloaded it specifically for this project. The version was 2.4 at the time I downloaded it. So, it should work on it and later versions. However, you, as a developer, should not be worried. Read forward.
I evaluated several options before actually having to use this approach I am describing (you can say, I was not left with any other option). VSTO is Microsoft Office centric. Also, with the minimal knowledge I collected about VSTO (by asking questions on forums), I thought it would not help. Microsoft's Open XML SDK for Office was indeed a very attractive one. However, I was not in any way interested in the Open XML schemas used by Microsoft Office, and this product's documentation clearly stated that you need to have a good knowledge of the XML schemas to exploit this library.
Then, I ripped opened a Microsoft Office 2007 document (it is simply a zip file, just unzip it), and analyzed the contents. It was plain XML. And, I immediately decided to use the XML processing capabilities of .NET to process what it actually was, just plain-sane XML, not treating it as any special schema.
So, the attached code does not require anything special to be installed. You can use it easily on a Desktop application, as well as a Web application, with the only requirement of .NET 3.5 installed on the machine where the code is executing. Not even the associated Office product needs to be installed. I am treating it as pure XML, so there are no other requirements absolutely. Just prepare an Office document with your tokens in it, deploy the document with your code, and you are ready to scramble.
It all starts with opening up an Office package (an Office file is now called an Office package technically, as it is not just a single file). A good starting point would be to open any Office document with your favorite Zip tool and analyze the contents.
I am using the SharpZipLib library for anything related to processing Zip files during this token replacement process. Once you open the Zip file, you would notice that there are multiple files. The main content file is "word/document.xml" for Microsoft Word 2007 and "content.xml" for OO Writer. As of now, the code can process only Word and Writer packages. I would add support for Excel and OO Calc as I get time. I acknowledge that the code attached is a bit immature at this time. Still, I am writing this article to discuss it with others so that you can adapt and enhance it for your scenario, if you find it useful. I am not sure when I would be able to make enhancements myself.
The overall code structure is as follows:
Straight away after opening up the Zip file, I read its contents into an XDocument (the code uses LINQ and LINQ to XML all over). Right now, the major functionality is coming from the TokenReplacementBase class. You specify the filename to open, the token start, and the token end (both strings).
OfficePackage is the helper class to enable reading and modification of the Office document packages.
The code relies heavily on the .NET System.Text.RegularExpressions for token replacement. You should be careful while choosing the token start and the token end. I am not a Regex expert, and as of now, the characters in your token delimiters should not appear in your content. I have used "[$" and "$]" in the attached sample doc files.
Immediately after constructing an object of the appropriate base class, you can start calling the replaceToken() method (in a loop probably, to process all tokens), passing in your token (including the delimiters) and the replacement value. On the first call to this method, the code parses the entire content, looking for tokens, and stores a dictionary of matches found with the token as the key. After that, and on all subsequent calls, it just consults this dictionary to perform the substitution.
As an alternative, you can create a List (of TokenReplacementInfo) (a helper class in the project), and call replaceTokens(), passing in this List just once, and it performs all the substitutions.
XmlUtil is the helper class that helps in Office XML specific text matching and replacement.
Suppose you have the following token: [$Date$].
You can replace it with the call...
t.replaceToken("[$Date$]", DateTime.Today)
... where t is an object of the concrete class for the package you are processing (Rahul.Office.MS.Word.TokenReplacement or Rahul.Office.OO.Writer.TokenReplacement).
[$Date$<metadata><type>date
</type><format>dd.MM.yyyy</format></metadata>]
when replaced, would adhere to the date format you specified.
[$InvoiceTotalValue$<metadata><type>money
</type><format>text</format></metadata>]
would treat the replacement value as a monetary one, and automatically convert it to English (hundred thousand, five hundred six only, etc.).
[$InvoiceTotalValue$<metadata>
<type>money</type><format>text</format>
<transform>upper</transform></metadata>
The transforms "upper" and "lower" converts the text to uppercase or lower-case, respectively. This can be applied to text as well as monetary values whose format is text.
Note that the metadata appears only in the Office document you have prepared. While making function calls, you still just call:
t.replaceToken("[$Date$]", DateTime.Today)
The code automatically determines if that token has any metadata attached in the document. The metadata tags can appear anywhere between your token.
replaceToken() with that token as the input. However, there is an overload of replaceToken() available, where you can specify the ith occurrence of the token to be replaced. Thus, the following call...
t.replaceToken("[$Date$]", DateTime.Today, 2)
... would replace only the second occurrence of the token [$Date$] in the original content. (There are some finer points related to ith occurrence replacement, which are documented properly in the code attached.)
The same token can occur multiple times with different metadata. And, each token occurrence would be replaced taking into consideration the metadata specified, if any, for that occurrence only.
replaceToken(), call replicateRow(), passing in that unique token and the number of times the row is to be replicated. save() function, as that replaces the content in the original Office package specified in the constructor. Here are the precise steps for using the code:
Here's all it takes to replace tokens:
Sub Main()
'Copy the Tokenized file. The new copy would actually be processed.
System.IO.File.Copy("TokenizedInvoice.docx", _
"ProcessedInvoice.docx", True)
'Construct an object to Microsoft Office Word Token Replacement.
Dim t As New TokenReplacement("ProcessedInvoice.docx", _
"[$", "$]")
'Construct a list of Token Info's to be replaced.
Dim list As New List(Of TokenReplacementInfo)
list.Add(New TokenReplacementInfo("[$LCNo$]", "11111"))
list.Add(New TokenReplacementInfo("[$LCInvoiceNo$]", "22222"))
'Pay attention here. The Date token has metadata in the Invoice.
'You do not (should not) specify the metadata here.
list.Add(New TokenReplacementInfo("[$LCInvoiceDate$]", DateTime.Today))
'Notice the Tokenized document has a token called [$ReplicateRow$]. That
'row has Product information, and suppose I have 2 prodcuts in this
'Invoice. True indicates to remove the Row Replication token after the
'replication process.
t.replicateRow("[$ReplicateRow$]", True, 2)
'I am substituting 2 products here. So, using 0 and 1 to indicate the
'occurence of the token to be replaced, because the replicate rows would
'have identical Tokens.
'First good goes here. Last parameter is 0 for replacement of first
'occurrences of all tokens specified. (-1 would have replaced all),
'which also is the default behaviour without the third parameter.
list.Add(New TokenReplacementInfo("[$LCGoodOrderName$]", _
"My neighbour's car", 0))
list.Add(New TokenReplacementInfo("[$LCGoodOrderBrand$]", _
"Bentley", 0))
list.Add(New TokenReplacementInfo("[$LCGoodOrderSpecification$]", _
"Black, with leather upholstery", 0))
'Second good goes here Last parameter is 1 for replacement of second
'occurrences of all tokens specified.
'Omitted for brevity
'Check the Tokenized Invoice. There are multiple occurrences of
'[$LCInvoiceTotalValue$] with different metadata.
'Just one call replaces all, honoring each one's metadata individually.
list.Add(New TokenReplacementInfo("[$LCInvoiceTotalValue$]", 10010200))
list.Add(New TokenReplacementInfo("[$LCDeliveryType$]", "CNF"))
'Replace all in one go.
t.replaceTokens(list)
'Don't forget the save the document.
t.save()
End Sub
I am not trying to play down these SDKs. They are very powerful. But, I believe they are too powerful to be used in regular development, unless you have a good understanding of schemas.
I needed to deliver the functionality quickly to a client, and assembled the original code quickly. Since then, I have made some enhancements to it, that I have updated in the article.
I am using lots of regex, and probably they can be tweaked to increase performance (although I have been able to Token Replace large documents in virtually no time). There are many more features or metadata extensions that could be added. Support for Excel and Calc, at least, is desired. Currently, Replacement in Header/Footer sections is not supported (see the Comment below for details. More replacement options, the list would never end.
I will try to take time out and enhance this. But, right now, as it stands, the code should satisfy many requirements in a majority of the cases.
General
News
Question
Answer
Joke
Rant
Admin
|
PermaLink |
Privacy |
Terms of Use
Last Updated: 22 May 2009 Editor: Deeksha Shenoy |
Copyright 2008 by Rahul Singla Everything else Copyright © CodeProject, 1999-2009 Web22 | Advertise on the Code Project |