
Introduction

A common situation is preparing invoices and similar documents from information in a database. Clients usually have extremely customized invoice formats, but the data to be filled in is basically the same. This article and its related code present a new (I have searched, but could not find anything related to this topic, so I believe it's new), innovative, and productive way of producing dynamic Office documents. I use the word Office throughout this article in a general sense; it does not refer solely to Microsoft Office. I have tested this approach on Open Office as well, and it works. The associated code performs Token Replacement on documents from both Office products.

Background

Until now, I was creating HTML on the fly and making it available for download as a *.doc or *.xls file, exploiting the ability of Office applications to process HTML documents. But this approach only goes so far. There were obvious limitations in precise positioning in HTML and in how Office applications interpret and display it. Not to mention that the advanced processing capabilities of Office documents are unavailable when you generate dynamic HTML.

A new approach had been swirling around my mind since the time Office products embraced XML formats. We, developers, have been using token replacement for a long time for producing dynamic content. How about doing it with Office documents!!

So, I finally got a chance to implement this, when the document I had to produce exceeded HTML formatting capabilities. Now, I was trying to create a Word document, with tokens of the form [$TokenName$] in it, and replacing the tokens with actual text, programmatically. However, it was not as easy as I thought.

Word 2007 splits up tokens into multiple parts, depending upon the regions to be checked for spelling. The same is the case with OO (OO refers to Open Office). These Office products might split up what you consider a single word into multiple runs (they use the word run) of text, each surrounded by its own XML markup. That makes a plain find-and-replace on the XML almost impossible. I tried to discuss my approach at Microsoft Forums and at Brian Jones' blog. However, I could not get any useful help.

So, I decided to take it up myself.
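To make the run-splitting problem concrete, here is a minimal sketch. It is not the library's actual code; the module name, the helper ParagraphText, and the hand-written paragraph XML are illustrative assumptions. It shows a token that no single run contains, and how joining the text of all runs in a paragraph makes it visible again.

```vbnet
Imports System
Imports System.Linq
Imports System.Xml.Linq

Module SplitRunDemo
    ' Main WordprocessingML namespace used by Word 2007 documents.
    Private ReadOnly W As XNamespace = _
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

    ' Hypothetical helper: joins the text of every w:t element under a
    ' paragraph, ignoring the run boundaries between them.
    Function ParagraphText(ByVal paragraph As XElement) As String
        Return String.Concat( _
            paragraph.Descendants(W + "t").Select(Function(t) t.Value).ToArray())
    End Function

    Sub Main()
        ' A hand-written paragraph in which "[$Date$]" is split over three runs.
        Dim xml As String = _
            "<w:p xmlns:w=""http://schemas.openxmlformats.org/wordprocessingml/2006/main"">" & _
            "<w:r><w:t>[$</w:t></w:r>" & _
            "<w:r><w:t>Date</w:t></w:r>" & _
            "<w:r><w:t>$]</w:t></w:r>" & _
            "</w:p>"
        Dim p As XElement = XElement.Parse(xml)

        ' Searching run by run misses the token...
        Dim foundInOneRun As Boolean = _
            p.Descendants(W + "t").Any(Function(t) t.Value.Contains("[$Date$]"))
        Console.WriteLine(foundInOneRun)     ' prints False

        ' ...but the joined paragraph text contains it.
        Console.WriteLine(ParagraphText(p))  ' prints [$Date$]
    End Sub
End Module
```

Finding the token is only half the job. Writing the replacement back means deciding which run keeps the text and which runs are emptied, and that is where most of the real work lies.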

Some More Background

As I said, I am talking about Office XML in general, not specifically Microsoft Office or Open Office. Note one thing before moving forward: only Microsoft Office 2007 applications use a full XML document format. Office 2003 probably uses XML, but I have not studied the Office 2003 document format, nor have I tested the code on Office 2003 documents.

Regarding OO, I downloaded it specifically for this project. The version was 2.4 at the time, so the code should work on that and later versions. However, you, as a developer, should not be worried. Read on.

I evaluated several options before actually having to use this approach I am describing (you can say, I was not left with any other option). VSTO is Microsoft Office centric. Also, with the minimal knowledge I collected about VSTO (by asking questions on forums), I thought it would not help. Microsoft's Open XML SDK for Office was indeed a very attractive one. However, I was not in any way interested in the Open XML schemas used by Microsoft Office, and this product's documentation clearly stated that you need to have a good knowledge of the XML schemas to exploit this library.

Then, I ripped open a Microsoft Office 2007 document (it is simply a zip file, just unzip it) and analyzed the contents. It was plain XML. I immediately decided to use the XML processing capabilities of .NET to process it as what it actually was, just plain, sane XML, without treating it as any special schema.

So, the attached code does not require anything special to be installed. You can use it in a Desktop application as well as a Web application; the only requirement is .NET 3.5 on the machine where the code executes. Not even the associated Office product needs to be installed. I am treating the document as pure XML, so there are absolutely no other requirements. Just prepare an Office document with your tokens in it, deploy the document with your code, and you are ready to go.

Now Comes the Code

It all starts with opening up an Office package (an Office file is now called an Office package technically, as it is not just a single file). A good starting point would be to open any Office document with your favorite Zip tool and analyze the contents.

I am using the SharpZipLib library for anything related to processing Zip files during this token replacement process. Once you open the Zip file, you would notice that there are multiple files. The main content file is "word/document.xml" for Microsoft Word 2007 and "content.xml" for OO Writer. As of now, the code can process only Word and Writer packages. I would add support for Excel and OO Calc as I get time. I acknowledge that the code attached is a bit immature at this time. Still, I am writing this article to discuss it with others so that you can adapt and enhance it for your scenario, if you find it useful. I am not sure when I would be able to make enhancements myself.
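As a rough illustration of the "open the package, read the main part" step, the sketch below loads one XML part into an XDocument. Note the assumptions: it uses the framework's own System.IO.Compression classes (available from .NET 4.5) rather than SharpZipLib, which is what the attached code uses, and the module and function names are made up for this example.

```vbnet
Imports System
Imports System.IO
Imports System.IO.Compression
Imports System.Xml.Linq

Module PackageReadDemo
    ' Hypothetical helper: loads one XML part from an Office package, e.g.
    ' "word/document.xml" (Word 2007) or "content.xml" (OO Writer).
    Function LoadPart(ByVal packagePath As String, _
                      ByVal partName As String) As XDocument
        Using archive As ZipArchive = ZipFile.OpenRead(packagePath)
            Dim entry As ZipArchiveEntry = archive.GetEntry(partName)
            If entry Is Nothing Then
                Throw New FileNotFoundException( _
                    "Part not found in package: " & partName)
            End If
            Using partStream As Stream = entry.Open()
                ' The part is plain XML, so LINQ to XML can read it directly.
                Return XDocument.Load(partStream)
            End Using
        End Using
    End Function

    Sub Main(ByVal args As String())
        ' Usage (assumed): PackageReadDemo.exe TokenizedInvoice.docx
        Dim doc As XDocument = LoadPart(args(0), "word/document.xml")
        Console.WriteLine(doc.Root.Name.LocalName)
    End Sub
End Module
```

Saving works the other way round: the modified XML is written back into the same entry of the package. The sketch leaves that out to stay short.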

The overall code structure is divided into 4 components (as of the 3rd version of this library):

  1. Replacers - These denote sections of Office Documents (header, content, footer, etc.). Token Replacement can be performed on each section independently. These were added in the 2nd version of this library.

    replacers.gif

  2. Documents - These classes represent the Office Documents (.docx, etc.) themselves. Each document can choose to expose the desired sections. For example, Microsoft Word & OO Writer documents expose a header, content & footer section, where each section has a replacer associated with it that performs Token Replacement in that section. The following code snippet might help to clarify this:
    doc.header.replaceToken("[$Date$]", DateTime.Today)
    doc.content.replaceToken("[$Consignee$]", "Lorem Ipsum")

    documents.gif

  3. Interfaces - These interfaces are implemented by the Office Documents (explained below in the Add-In section).

    interfaces.gif

  4. Helper classes - These classes provide utility functions to help in token replacements.

    helpers.gif

Straight after opening the Zip file, I read its contents into an XDocument (the code uses LINQ and LINQ to XML all over). Right now, most of the actual Token Replacement functionality comes from the TokenReplacerBase class. This is the base class for all the Replacers for the different sections of an Office document.

However, you as a user need an instance of a concrete class of TokenizedDocumentBase. You get this instance by using the static factory methods of the TokenizedDocumentProxy class. You specify the filename (with its path) to open, the token start, and the token end (both strings). This proxy class was introduced in the 3rd version of this library, for reasons explained in the Add-In section below. An example code snippet should help:

Dim p As IWordProcessingTokenizedDocument = _
    TokenizedDocumentProxy.getDocumentProcessor _
    (Of IWordProcessingTokenizedDocument)( _
    SupportedExtensions.docx, _
    "ProcessedInvoice.docx", _
    "[$", "$]", _
    True)

OfficePackage is the helper class to enable reading and modification of the Office document packages. Formatter, MetadataProcessor, Currency etc. are other helper classes used for various purposes during Token Replacement.

Regex

The code relies heavily on the .NET System.Text.RegularExpressions for token replacement. You should be careful while choosing the token start and the token end. As of now, the characters in your token delimiters should not appear in your content. I have used "[$" and "$]" as the delimiters in the attached sample docx files.

Immediately after constructing an object of the appropriate Document class (by using the factory methods in TokenizedDocumentProxy), you can start calling the replaceToken() method (in a loop probably, to process all tokens), passing in your token (including the delimiters) and the replacement value. On the first call to this method, the code parses the entire content, looking for tokens, and stores a dictionary of matches found with the token as the key. After that, and on all subsequent calls, it just consults this dictionary to perform the substitution.

As an alternative, you can create a List(Of TokenReplacementInfo) (TokenReplacementInfo is a helper class in the project) and call replaceTokens(), passing in this List just once; it performs all the substitutions.

XmlUtil is the helper class that helps in Office XML specific text matching and replacement.
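The scan-once-then-look-up behaviour described in this section can be sketched roughly as follows. This is an assumption-laden illustration, not the code in XmlUtil or TokenReplacerBase: the function name IndexTokens is invented, it works on a plain string rather than Office XML, and it ignores token metadata. It does show why the delimiters must be escaped before they go into a regular expression, since "[" and "$" are regex metacharacters.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module TokenScanDemo
    ' Hypothetical helper: scans the content once and groups every match
    ' by its full token text (delimiters included).
    Function IndexTokens(ByVal content As String, _
                         ByVal tokenStart As String, _
                         ByVal tokenEnd As String) _
                         As Dictionary(Of String, List(Of Match))
        ' Escape the delimiters so they are matched literally, and use a
        ' lazy quantifier so a match stops at the first token end.
        Dim pattern As String = _
            Regex.Escape(tokenStart) & ".+?" & Regex.Escape(tokenEnd)

        Dim found As New Dictionary(Of String, List(Of Match))
        For Each m As Match In Regex.Matches(content, pattern)
            If Not found.ContainsKey(m.Value) Then
                found(m.Value) = New List(Of Match)
            End If
            found(m.Value).Add(m)
        Next
        Return found
    End Function

    Sub Main()
        Dim content As String = "Invoice [$LCNo$] dated [$Date$], due [$Date$]"
        Dim found As Dictionary(Of String, List(Of Match)) = _
            IndexTokens(content, "[$", "$]")

        Console.WriteLine(found.Count)              ' prints 2 (distinct tokens)
        Console.WriteLine(found("[$Date$]").Count)  ' prints 2 (occurrences)
    End Sub
End Module
```

Keeping a list of matches per token is what makes it possible to replace one particular occurrence of a token while leaving the others alone.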

Sample

Suppose you have the following token: [$Date$].

You can replace it with the call...

doc.content.replaceToken("[$Date$]", DateTime.Today)

... where doc is an object of the concrete class for the package you are processing (Rahul.Office.MS.Word.WordTokenizedDocument or Rahul.Office.OO.Writer.WriterTokenizedDocument).

Fine Points

The same token can occur multiple times with different metadata. And, each token occurrence would be replaced taking into consideration the metadata specified, if any, for that occurrence only.

Using the Code

Here are the precise steps for using the code:

  1. Prepare a Word or Writer document (not template) with tokens in it.
  2. Create a Console application. Add a reference to the attached Rahul.Office assembly.
  3. Start replacing tokens. A sample code is presented below.

Sample

Here's all it takes to replace tokens:

Sub Main()
    'Copy the Tokenized file. The new copy would actually be processed.
    System.IO.File.Copy("TokenizedInvoice.docx", _
                        "ProcessedInvoice.docx", True)

    'Construct an object for Microsoft Office Word Token Replacement.
    Dim p As IWordProcessingTokenizedDocument = _
        TokenizedDocumentProxy.getDocumentProcessor _
        (Of IWordProcessingTokenizedDocument) _
        (SupportedExtensions.docx, _
        "ProcessedInvoice.docx", _
        "[$", "$]", _
        True)

    'Construct a list of Token Infos to be replaced.
    Dim list As New List(Of TokenReplacementInfo)

    list.Add(New TokenReplacementInfo("[$LCNo$]", "11111"))
    list.Add(New TokenReplacementInfo("[$LCInvoiceNo$]", "22222"))

    'Pay attention here. The Date token has metadata in the Invoice.
    'You do not (should not) specify the metadata here.
    list.Add(New TokenReplacementInfo("[$LCInvoiceDate$]", DateTime.Today))

    'Notice the Tokenized document has a token called [$ReplicateRow$]. That
    'row has Product information, and suppose I have 2 products in this
    'Invoice. True indicates to remove the Row Replication token after the
    'replication process.
    p.body.replicateRow("[$ReplicateRow$]", True, 2)

    'I am substituting 2 products here. So, using 0 and 1 to indicate the
    'occurrence of the token to be replaced, because the replicated rows
    'would have identical Tokens.
    'First good goes here. Last parameter is 0 for replacement of the first
    'occurrences of all tokens specified. (-1 would have replaced all,
    'which is also the default behaviour without the third parameter.)
    list.Add(New TokenReplacementInfo("[$LCGoodOrderName$]", _
             "My neighbour's car", 0))
    list.Add(New TokenReplacementInfo("[$LCGoodOrderBrand$]", _
             "Bentley", 0))
    list.Add(New TokenReplacementInfo("[$LCGoodOrderSpecification$]", _
             "Black, with leather upholstery", 0))

    'Second good goes here. Last parameter is 1 for replacement of the
    'second occurrences of all tokens specified.
    'Omitted for brevity

    'Check the Tokenized Invoice. There are multiple occurrences of
    '[$LCInvoiceTotalValue$] with different metadata.
    'Just one call replaces all, honoring each one's metadata individually.
    list.Add(New TokenReplacementInfo("[$LCInvoiceTotalValue$]", 10010200))
    list.Add(New TokenReplacementInfo("[$LCDeliveryType$]", "CNF"))

    'Replace all in one go.
    p.body.replaceTokens(list)

    'New Feature: Html Replacement
    If TypeOf p Is IWordProcessingTokenizedDocumentExtension Then
        Dim pext As IWordProcessingTokenizedDocumentExtension = _
            CType(p, IWordProcessingTokenizedDocumentExtension)
        pext.replaceTokenWithHtml("[$HtmlToken$]", _
            "Html text that replaced a token")
    End If

    'Don't forget to save the document.
    p.save()
End Sub

Advantages over Open XML SDK or Other Such Options

  1. You need to have a good understanding of Open XML schemas for using these SDKs.
  2. You are stuck with one particular product when using them.
  3. You need them installed on the target machine for use. Here, just drop the Rahul.Office assembly into the bin folder, or copy the code files to your project.

I am not trying to play down these SDKs. They are very powerful. But, I believe they are too powerful to be used in regular development, unless you have a good understanding of schemas.

3rd Version - Add-In Architecture Introduced

The second version of this library introduced support for Token Replacement in Header and Footer sections (see this comment). Please download the code attached to this article, not the code linked from that comment; that code is obsolete and the file has been removed from Rapidshare.

Some time after releasing the second version, I had an interesting scenario where a client wanted to be able to replace a token with HTML produced dynamically. As anyone would imagine, this was a considerably complex scenario, because you cannot simply replace the Token with HTML markup. That would corrupt the Office document, because HTML is not compatible with Office markup.

I needed to provide this functionality. It was simply not possible to provide a conversion from HTML to Office markup. This would have been way too complex and outside the scope of this library. Some Googling revealed the support of VSTO for such scenarios. However, remember VSTO is a Microsoft Office centric collection of libraries for enabling processing of Microsoft Office documents from .NET code. More importantly, VSTO requires a valid copy of Microsoft Office to be installed on the machine.

So, I refactored this library for an Add-In architecture. The core support for Token Replacement, together with all the features mentioned above, comes from the core Rahul.Office.dll assembly. However, this assembly itself tries to load the Rahul.Office.MS.dll or Rahul.Office.OO.dll assemblies. These assemblies can provide extended support for Token Replacement for the corresponding Office product. If they are not found, the core assembly falls back to the Token Replacement features it provides itself.

To support this refactoring, the TokenizedDocumentBase was refactored into a set of interfaces. The ITokenizedDocument interface provides methods that all Tokenized Documents should implement. IWordProcessingTokenizedDocument interface contains methods that all Word processors (Microsoft Word, OO Writer, etc.) should provide. Both these interfaces are implemented completely by classes inside the core Rahul.Office.dll assembly.

However, another interface IWordProcessingTokenizedDocumentExtension provides extension methods that Add-In assemblies might choose to implement. Currently, it provides a single method replaceTokenWithHtml, which is implemented by the Rahul.Office.MS.dll assembly for Microsoft Word Tokenized documents.

To support this architecture, a special TokenizedDocumentProxy class has been created, with static factory methods like:

getDocumentProcessor(ByVal extension As SupportedExtensions, _
	ByVal documentPath As String, ByVal tokenStart As String, _
	ByVal tokenEnd As String, ByVal lookForDedicatedAssemblies As Boolean) _
	As ITokenizedDocument

Now, if you pass True as the last argument, it looks for the Add-In assemblies before falling back to itself in case they are not found. If you pass False as the last argument, the Add-In assemblies are not looked for. I would strongly recommend passing False, unless you need the additional features provided by the Add-In assemblies.

Some points of caution:

  1. Pass false if you don't require the extension features as Add-In assemblies are loaded dynamically through Reflection, which might impact performance.
  2. The Add-In assemblies provide features for a specific Office product. Hence, they can provide non-standard implementations not available for the other Office products.
  3. The Add-In assemblies might have their own prerequisites. For example, if you choose to download the source code with Add-Ins, you get Rahul.Office.MS.dll, which provides Microsoft Office specific extensions. It provides these features using VSTO, which requires Microsoft Office to be installed on the machine before you can use it.
    Thus, if you don't require the additional Add-In feature, you should download the source code without the Add-Ins. The only extension feature currently provided by the Add-In is the ability to replace a Token with an HTML formatted string.
    Also, note that VSTO uses Interop extensively and is hence considerably slow.
  4. If you are using the Add-In assemblies, remember they are loaded by Reflection, and should reside in the same directory as the core Rahul.Office.dll assembly.
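The "look for the Add-In next to the core assembly, otherwise fall back" behaviour can be pictured with the sketch below. It is only an illustration under stated assumptions: TryLoadAddIn is a made-up name, and how Rahul.Office.dll actually probes for and instantiates the add-in types is not shown in this article.

```vbnet
Imports System
Imports System.IO
Imports System.Reflection

Module AddInProbeDemo
    ' Hypothetical helper: tries to load an optional assembly from the given
    ' directory. Returns Nothing when the file is missing or is not a
    ' loadable .NET assembly, so the caller can fall back to core features.
    Function TryLoadAddIn(ByVal baseDirectory As String, _
                          ByVal fileName As String) As [Assembly]
        Dim candidate As String = Path.Combine(baseDirectory, fileName)
        If Not File.Exists(candidate) Then
            Return Nothing
        End If
        Try
            Return [Assembly].LoadFrom(candidate)
        Catch ex As BadImageFormatException
            Return Nothing
        Catch ex As FileLoadException
            Return Nothing
        End Try
    End Function

    Sub Main()
        ' Probe the directory the application runs from.
        Dim addIn As [Assembly] = TryLoadAddIn( _
            AppDomain.CurrentDomain.BaseDirectory, "Rahul.Office.MS.dll")

        If addIn Is Nothing Then
            Console.WriteLine("Add-In not found; using the core implementation.")
        Else
            Console.WriteLine("Loaded Add-In: " & addIn.FullName)
        End If
    End Sub
End Module
```

The file check and the load both cost time on every probe, which is one reason to pass False when the extension features are not needed.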

Still To Be Done

I needed to deliver the functionality quickly to a client, and assembled the original code quickly. Since then, I have made some enhancements to it, that I have updated in the article.

I am using lots of regex, and probably they can be tweaked to increase performance (although I have been able to Token Replace large documents in virtually no time). There are many more features or metadata extensions that could be added. Support for Excel and Calc, at least, is desired. More replacement options, the list would never end.

I will try to take time out and enhance this. But, right now, as it stands, the code should satisfy many requirements in a majority of the cases.

Also Available On My Blog

The source code for this article is also available on my blog, Token Replacement in Office documents. The article would always be kept updated together with its source code here on CodeProject. However, I have noticed it takes time for the article to be updated on CodeProject once I submit an updated version (this last version took in excess of 2 weeks to get updated). So, you can download the latest code from my blog post. Simultaneously, the updated code would always also be available here on CodeProject.

History

Question: A couple of questions
NorCan131169
3:53 12 Jan '10  
First of all, thank you for sharing your project.

I’m using Rahul.Office for an intranet resumé application to produce Word exports of the resumés. I have a couple of issues I’d like to ask you about.

First of all, when I try to open the exported Word file, Word warns me that the file has “unreadable content”. Have you come across this problem during your development or use of Rahul.Office? I should probably mention that I use Ionic.zip, instead of SharpZipLib, to zip the Word file, in case you think that might be relevant. I use Office 2007.

My second question concerns the tokens in the template file. My Word template has a table for projects people have worked on. This table is replicated using IWordProcessingTokenizedDocument.body.replicateRow() to allow for multiple projects. This works very well. However, I need to list the start date and end date for the project twice on each row. I tried putting two identical tokens in the table (that is, two [$ProjectStartDate$] tokens and two [$ProjectEndDate$] tokens), but this resulted in the start date of the second project replacing the second start date token of the first project. As a consequence, some projects ended up not having the tokens replaced because there were no more instances of the “start date” and “end date” TokenReplacementInfo.
I guess what I was hoping for (and probably what I’d need for this to work) was for identical tokens to have the same instance number if they’re in the same row. I see now that it doesn’t work that way, and I ended up renaming the second instances to [$ProjectStartDate2$] and [$ProjectEndDate2$], and adding separate TokenReplacementInfo instances in my code, using the same data. I don’t like this approach, as it ties my code directly to the template layout. Do you have any suggestions on other ways of doing this?

Thanks again for sharing your work.

Best Regards
Frode Breimo
Answer: Re: A couple of questions
Rahul Singla
6:52 12 Jan '10  
NorCan131169 wrote:
First of all, when I try to open the exported Word file, Word warns me that the file has “unreadable content”. Have you come across this problem during your development or use of Rahul.Office? I should probably mention that I use Ionic.zip, instead of SharpZipLib, to zip the Word file, in case you think that might be relevant. I use Office 2007.


Hi Frode, I have never faced this issue while using Rahul.Office. I would recommend sticking to SharpZipLib. I have no idea about Ionic.zip. Although I am not sure, the issue might be related to that. Try using SharpZipLib, and see if you still face the issue.



NorCan131169 wrote:
I see now that it doesn’t work that way, and I ended up renaming the second instances to [$ProjectStartDate2$] and [$ProjectEndDate2$], and adding separate TokenReplacementInfo instances in my code, using the same data. I don’t like this approach, as it ties my code directly to the template layout.


Yes, that is the way it is intended to work. It has been designed that way, and in my opinion is the sensible way to approach it. As clearly mentioned in the article, you need to specify the token index to replace a particular occurrence. I don't see any problem with renaming the second instance in your case.

However, the code can be easily modified to suit your needs, if you really like to have it that way.


NorCan131169 wrote:
I don’t like this approach, as it ties my code directly to the template layout.


I believe your code is already tied to the template (not its layout). The token name in the template should match in the code and in the document. You are free to place the token anywhere in the document as you like. So, it's not tying to the layout.

However, again as I said, if you insist on replacing all tokens in a particular row, you would need to alter code.
General: Re: A couple of questions
NorCan131169
10:53 12 Jan '10  
Thank you for your quick response.

I agree that the way it is implemented is a sensible way of doing it. My needs in this particular case may be a special case.

I realise my code is not tied to the actual layout of the template; that was probably a bad choice of words. I guess what I meant was that the code had to be adapted to that particular template, and by "layout" I meant the fact that it used the same data in two different places. I was hoping to create a general piece of code that just supplied all the data about a person, and then use that on multiple templates. Now the code is (somewhat) specific to this template.

I suppose it doesn't really matter that much though, since having the extra dates in there doesn't break anything when applied to other templates, it is just ignored.

Regarding the far more important "unreadable content" problem I will try SharpZipLib and see how that works.

Thank you.
General: Re: A couple of questions
Rahul Singla
18:55 12 Jan '10  
NorCan131169 wrote:
I guess what I meant was that the code had to be adapted to that particular template


Frankly, I myself needed to do things a bit differently a couple of times. However, I have kept the library fairly general for the public. What I actually did was create a base class for my own solutions that did things differently, and used that all over the solution instead of playing with the library itself.

In your case, you can create a class, say MyDocument. Now, have a function in that class replaceRowToken(token, value, count).

Make this function add "count" tokens to the list of TokenReplacementInfo.

You can now invoke it as:
replaceRowToken("[$StartDate$]", DateTime.Now, 1)
replaceRowToken("[$StartDate$]", DateTime.Now, 2)

Alternatively, you can create separate derived classes from this class, one for each template you are targeting. This is the way, I myself use my library. Create a base class for the solution, and derived classes for targeting each template, that just provide template specific replacement (to pass count to the base class, depending upon whether the table row has 1 or 2 same tokens).

You see, you would never be able to completely isolate your classes from the actual template information.

I have custom code that automatically loads class depending upon the template chosen.


NorCan131169 wrote:
I suppose it doesn't really matter that much though, since having the extra dates in there doesn't break anything when applied to other templates, it is just ignored.


Yes, if you try to replace a token that does not exist, it will not do any harm. So, this approach can also be followed for templates and having a common class for replacement.

It really depends upon the requirements, and complexity of your project.
General: Why not playing with Rtf
Ernest Bariq
15:05 28 Jun '09  
Hi, I've tried a nice lib from CodeProject called NRtfTree (NRTFTree - A class library for RTF processing in C#[^]). Why not try making another class to manage RTF, as it is a structured doc?
There is also Writing Your Own RTF Converter[^], but it's more complex.

.: Ernest Bariq :.
__________________
http://titus31.no-ip.info/jupiterZ

General: Re: Why not playing with Rtf
Rahul Singla
18:59 12 Jan '10  
For starters, I have only basic information regarding the RTF format. I have processed it at a couple of places, but never really gone deep into it.

Secondly, it does not come remotely near the power & features offered by the Open XML formats of MS & OO Office.

Thirdly, it is not XML based. The RTF format, in my opinion, is not really suited for complex processing, because of its non-standard markup.
General: A better approach might be ContentPlaceHolders?
deloford
22:07 22 May '09  
Nice article, but I think a better approach is to use ContentPlaceHolders?

There is a free tool that allows you to generate the content control xpaths.

1. Add Content Controls into the Word 2007 document
2. Add a custom XML file into the Word 2007 document (zipped file)
3. Map the content controls to elements in the XML document (using XPATH and no code)
4. Demonstrate that changing content in the Content Controls will update the custom XML file and vice versa.

http://blogs.msdn.com/modonovan/archive/2006/05/23/604704.aspx
General: Re: A better approach might be ContentPlaceHolders?
Rahul Singla
22:39 22 May '09  
I just checked that out. Yes, that is a good option. However, I still find mine a better one with respect to productivity, flexibility and extensibility. There is a comment on the bottom of that blog post. I am quoting it here:

"Is there anyway to get the formating of the word document in XML file? What I mean by that is, I followed this example step by step and got to the XML file which do contain all the data coming from MS Word content controls but it loses all the formating (Bold, underscore, links etc...) is there any way to get the formating also?"

I am already preparing to extend my approach to support replacement in Headers/Footers and to support Excel/Spreadsheet. It's just a matter of time. Anyways, thanx for pointing that out. It surely is an option to consider seriously.
General: Re: A better approach might be ContentPlaceHolders? [modified]
deloford
4:31 23 May '09  
Agreed, that is an issue if you are trying to force formatting from your data into the document.

However, in the majority of situations it is acceptable to allow data to be formatted by the user (think of most CSV mail merge situations).

If required, a simpler approach might be to use the Word Object Model to listen for ContentControlBeforeContentUpdate to override the data binding process and load your XML data as RTF (or whatever custom markup you like).

You could also explore programmatically generating BuildingBlocks (and the corresponding BuildingBlockContentControl), which can contain all sorts of Word content.

http://msdn.microsoft.com/en-us/library/bb258119.aspx[^]

Having said all that, I haven't tried these approaches, and it may be that your (neat) solution turns out to be easier in the end. But if I was starting a project on this now, I would explore these approaches, especially if I needed to retrieve user data after modification.

Here is the free tool I mentioned:
http://dbe.codeplex.com/[^]

A good starter:
http://msdn.microsoft.com/en-us/library/bb510135.aspx[^]

modified on Saturday, May 23, 2009 9:41 AM

General: Re: A better approach might be ContentPlaceHolders?
Rahul Singla
7:20 23 May '09  
I had a quick look at the URLs you pointed to. And I must say, the approaches are compelling (though they also require a considerably higher degree of understanding of the Office system & its markup).

But definitely, those are the ones I myself might turn to, if I found my approach lacking for a situation.

However, I have made it pretty clear in the article that this utility is not a substitute for the Open XML SDK, or other approaches that MS has provided for processing Office documents programmatically.
Rather, it is a quick, intuitive and productive approach for generating dynamic pre-formatted Office documents from data in the database, reports, or other such common tasks, with a major selling point of having complete abstraction from Office, its internals & markup.
General: Replace in Headers and Footers
ElsaWood
5:46 22 May '09  
Great article! Thank you so much for all your work and for sharing it with us!

Tokens in the headers and footers are not replaced. Has anybody come across this and found out a solution?
General: Re: Replace in Headers and Footers [modified]
Rahul Singla
7:10 22 May '09  
Thanx.
Well, check this out after some days. I have updated the code for some minor glitches and have asked the CodeProject team to update the article. It might take some days for the CodeProject to update it.

Regarding the Header/Footer issue, I probably should have mentioned in the article, that Header/Footer replacement is actually not supported as of now. There are multiple reasons and complexities for this. To describe a couple:
1) If you check out the Office document formats, Headers/Footers are actually stored as separate Xml files in the zip file, separate from the main Content area. This would mean considerable increase in complexity in handling Tokens spread across multiple files in the main zip file, and ensuring each one gets replaced correctly.

2) There is no intuitive way from an Office document perspective to logically combine the different areas (Header/Content/Footer) and present them as a single unit for Token replacement.

I have been thinking of providing functions like replaceInHeader() etc. But owing to my schedule, I have not been able to work forward much on this one. I really want to take time off, and make this more versatile looking at its application areas for myself, and the community.

modified on Sunday, May 24, 2009 11:38 AM

General: Re: Replace in Headers and Footers [modified]
Rahul Singla
6:49 24 May '09  
Hi ElsaWood,
I am pretty serious about the usability of this article, as an approach for Dynamic production of Office documents. And now, I have made significant enhancements to the code to progress in the direction that I have promised in the article.

This update now supports Token Replacement in Word Headers & Footers, just as you desired.

As I have mentioned in the article, I assembled this code the first time in quite some haste to support a client's requirements. For this update, I had to refactor it in many ways to make the code extensible & allow future enhancements without many breaking changes (I hope so!!!).

So, this update incorporates many breaking changes. But all of them relate mainly to the background processing in the code. If you have been following the sample provided in the download without tweaking the actual code, it should not be too difficult for you to adapt to this update.

Mainly, some classes have been renamed, and support for Headers & Footers means you now need to call functions like:
doc.header.replaceToken(...) or
doc.footer.replaceToken(...) or
doc.body.replaceToken(...)

I have uploaded the updated code to Rapidshare. It would be useful for me if you could use it and report any bugs you find (I have tested it on my end). I will update the code in the article once you confirm that it works.

http://rapidshare.com/files/236933306/OfficeTokenReplacement.zip.html

modified on Monday, May 25, 2009 1:35 AM

General: Right thing
Bülent
1:01 29 Jan '09  
Hi,

Thank you for the tutorial. It is nearly the same thing I had thought about and wanted to build.
Have you done any further work for Excel? Please contact me directly at cenk21 atsymbol gmx.de :)

I will test your code and fix the leaks. Afterwards I will send it to you so you can update your nice article.

Kind regards

Bülent Tiknas
General: Re: Right thing
Rahul Singla
18:48 29 Jan '09  
Hi Bulent,

I myself need the functionality extended for Excel. I started examining and researching the OO Spreadsheet and Office Excel XML formats.

But I had to put the work aside partway because of some engagements that will keep me occupied for nearly a couple of months. I understand that it would be hugely beneficial to provide support for Excel.

But I will not be able to do it right now. I will update and enhance this code as soon as I get time.

In the meantime, your input would be welcome. Regarding testing, I have already done some myself, and have found the regexes to work correctly in all my test cases and boundary conditions. However, I am all ears if you have any fixes.
General: list of docs
bidulle
3:55 9 Jan '09  
Hi, thank you for your code, but I have run into a huge problem...

First, I migrated your code to C# (if you want it ...)

I'm trying to merge documents and replace tokens. I first tried using AltChunk to merge all the docs and then replace the tokens, but it puts each docx inside the main docx zip file, and your regex tool can't look inside.

So I process all the sub-documents before merging them into the main one, but I get erratic timing results (because the processed doc is still held by something; I suppose it's the regex).

How can I dispose of your tool? I've tried implementing IDisposable and setting the XDocument to null, but it's the same. Sometimes I have to wait 30 seconds before moving on to the next doc!

Help me please
General: Re: list of docs
Rahul Singla
22:46 9 Jan '09  
Hi bidulle. Excuse me if I have not understood your problem exactly; I found the wording a little confusing.

I am describing your problem in my words. Please correct me if I am wrong.

You have sub-documents on which you use this tool to replace tokens. Then, you are trying to merge them to produce a single new document.

Well, I have again checked my code. My code does NOT lock the document when it is processing it. As I said, all processing is being done in memory. However, you must call the Save() method if you want to persist back the replaced content to the original file. The file is completely free after the call to Save() method. I have been able to open the processed file with MS Word immediately after the call to Save() method. I see no reason why the file should be locked up.

As an aside, remember you need to create a new TokenReplacement object for each sub-document you are processing. You pass in the document path to its constructor, so a single object is able to process only a single document.

If you still find the file to be locked, try using Unlocker (http://ccollomb.free.fr/unlocker/) to see which process has locked your document. Make sure your file is not open in Word; Word locks the file in exclusive, non-shareable mode.
General: Re: list of docs
Ernest Bariq
17:02 10 Jan '09  
Hi, it's me, bidulle, but from my home login (Ernest).

What I do is:

File.Copy(args[i], String.Format(tempFile, i.ToString()), true);

using (TokenReplacement t
    = new TokenReplacement(String.Format(tempFile, i.ToString()), "{$", "}"))
{
    t.replaceTokens(list);
    t.save();
}
// here is the problem ///
merger.DocumentMerging(argsUnit);
// end
File.Delete(String.Format(tempFile, i.ToString()));

When I try to manipulate the sub-doc in my merger, File.Open throws an exception: the file is still being held by something (the only candidate I can see is the regex), so I have to loop inside my method and wait until the sub-doc becomes available.

here is the time results for sub docs availability

Docs\temp1.docx : 1 cycles, en 0mn 0s 408ms
Docs\temp2.docx : 885 cycles, en 0mn 0s 228ms
Docs\temp3.docx : 723 cycles, en 0mn 0s 243ms
Docs\temp4.docx : 433 cycles, en 0mn 0s 207ms
Done.

That means I can enter my merger.DocumentMerging(argsUnit) method, but I have to wait inside it before I can work with the file:

while (!finish)
{
    try
    {
        using (FileStream fileStream = File.Open(sourceFile, FileMode.Open))
        {
            chunk.FeedData(fileStream);
        }
        finish = true;
    }
    catch
    {
        //System.Threading.Thread.Sleep(100);
    }
    I++;
}

.: Ernest Bariq :.
__________________
titus31.no-ip.info

General: Re: list of docs
Rahul Singla
19:36 10 Jan '09  
Hi Bidulle cum Ernest ;)

I just cannot believe I made this mistake. I am almost paranoid about releasing resources.

I found the reason for your problem, and the solution. Refer to the OfficePackage.vb file in the solution. Its saveContent() function looks as follows:


Public Shared Sub saveContent(ByVal filePath As String, ByVal entryPath As String, ByVal encoding As Text.Encoding, ByVal content As String)
    Dim zipFile As New ZipFile(filePath)

    Dim entry As ZipEntry = zipFile.GetEntry(entryPath)
    Dim c As CompressionMethod = entry.CompressionMethod

    Dim newEntry As New ZipEntry(entryPath)
    newEntry.CompressionMethod = c

    zipFile.BeginUpdate()
    zipFile.Delete(entry)

    zipFile.Add(New StringDataSource(content, encoding), entryPath, c)
    zipFile.CommitUpdate()

    zipFile.Close()
End Sub

Notice the second-to-last line of this function, zipFile.Close(). In the source code originally posted with the article, I had forgotten to Close() the opened zipFile. Closing it resolves your problem.

To be more precise, just add the following line:
zipFile.Close()
at the end of OfficePackage class' saveContent() function.
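
Since Ernest is working in C#, the same idea can be sketched with a using block, which releases the file handle even if an exception is thrown partway through. This is only an illustration under stated assumptions: it uses System.IO.Compression (.NET 4.5 or later) rather than the ZipFile class used in OfficePackage.vb, and ZipEntryWriter and ReplaceEntry are hypothetical names, not part of the article's code.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

static class ZipEntryWriter
{
    // Replaces one entry of a zip package with new text content.
    public static void ReplaceEntry(string zipPath, string entryPath,
                                    string content, Encoding encoding)
    {
        using (ZipArchive zip = ZipFile.Open(zipPath, ZipArchiveMode.Update))
        {
            // Remove the old entry if it exists, then write a fresh one.
            ZipArchiveEntry old = zip.GetEntry(entryPath);
            if (old != null)
            {
                old.Delete();
            }

            ZipArchiveEntry fresh = zip.CreateEntry(entryPath);
            using (var writer = new StreamWriter(fresh.Open(), encoding))
            {
                writer.Write(content);
            }
        } // Dispose() writes the archive and releases the file handle here.
    }
}
```

After ReplaceEntry returns, a later File.Open on the same path should no longer have to wait, because nothing in this method keeps the archive open. Passing new UTF8Encoding(false) avoids writing a byte order mark at the start of the entry.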

I will make this change to the code posted above so that others don't face this problem.
General: Re: list of docs
Ernest Bariq
17:04 10 Jan '09  
The time elapsed is the time I have to wait ...
sorry

.: Ernest Bariq :.
__________________
titus31.no-ip.info

General: Interesting
Nyarlatotep
1:46 30 Dec '08  
A long time ago I had the same need, and I decided to use the RTF format. It has the same 'problems', such as multiple word runs (plain text is interleaved with RTF command tokens). Because of this, managing repeated rows was a nightmare, and I had to use continuous section breaks during template editing to 'force' word runs to reset and restart after the section break.
Now I'm very curious about your solution: it seems very good.
Question: Why aren't you using the OpenXML SDK?
Itay Sagui
6:40 22 Dec '08  
http://msdn.microsoft.com/en-us/library/bb448854.aspx
http://www.microsoft.com/downloads/details.aspx?familyid=ad0b72fb-4a1d-4c52-bdb5-7dd7e816d046&displaylang=en

Itay Sagui  |  Tzunami Inc
Tel: +972-9-9507479  |  Mobile: +972-54-5343800  |  Email: itay@tzunami.com

Answer: Re: Why aren't you using the OpenXML SDK?
SoulStone-BR
10:49 22 Dec '08  
As the author has already said:
Advantages over Open XML SDK or other such options
1. You need to have a good understanding of Open XML schemas to use these SDKs.
2. You are stuck with one particular product when using them.
3. You need them installed on the target machine for use. Here, just drop the Rahul.Office assembly into the bin folder or copy the code files to your project.
;)
General: Re: Why aren't you using the OpenXML SDK?
Itay Sagui
12:53 22 Dec '08  
I'm afraid I don't agree with any of those points (though I must admit that I missed them when first reading the article):
1. Using the OpenXML doesn't require you to understand the schema any better than manually modifying the XML (which is what the article is all about). It even makes it easier, since you can work with an object model, and don't need to work directly with the XML
2. You are stuck with one particular product anyway - Office - since you are editing Office documents.
3. The OpenXML is a single assembly, which you can drop just as easily as any other assembly.

Anyway, that was just a question...

Itay Sagui  |  Tzunami Inc
Tel: +972-9-9507479  |  Mobile: +972-54-5343800  |  Email: itay@tzunami.com

General: Re: Why aren't you using the OpenXML SDK?
Rahul Singla
19:02 22 Dec '08  
Hi Sagui, have a look at the following blog post from Brian Jones, a Program Manager for Office:
http://blogs.msdn.com/brian_jones/archive/2008/10/06/open-xml-format-sdk-2-0.aspx

In particular, look at the following section (directly quoted from there):
What Can't the Open XML SDK do?

Before we get into the design of the SDK I want to point out a couple of key points of what the SDK will not be able to do:

* The Open XML SDK is NOT a replacement for the Office Object Model; and provides no abstraction on top of the file formats
  o You need to understand the structure of the file formats to leverage the SDK, it doesn't hide it from you
* The SDK does NOT provide functionality to convert Open XML Formats to and from other formats, like HTML or XPS
* The SDK does NOT guarantee document validity of Open XML Formats when developers use the SDK or if the developer chooses to manipulate the underlying xml directly
  o We are working on providing validation functionality in subsequent CTP releases of version 2.0 of the SDK
* The SDK does NOT provide application behaviors such as layout (ex. pagination of WordprocessingML documents) or recalculation functionality


That should settle one point: you do need an understanding of the Open XML schemas to use the SDK.
There are other, more important points that I mentioned above:
1) There's no direct Token Replacement option. You would need to write custom code extensively for that.
2) Your Token might be split into multiple runs by Office. That would make it a lot more difficult for replacement.
3) Productivity. Do you expect each & every developer to understand and open XML files for something as basic as Token replacement??
4) Intuitive. Some people would still use Open XML SDK. And indeed, this article is not a replacement of the SDK, as I have already disclaimed above. But this article is about allowing a normal developer to leverage Office products in day-to-day programming in an intuitive & productive way.
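
To make point 2 concrete, here is a rough C# sketch of how a token that Office has split across runs can still be found and replaced in the raw XML. It is not the regex used in the article's code; it assumes the [$TokenName$] token form from the article and simply tolerates XML tags between "[" and "$", inside the name, and between "$" and "]". SplitTokenDemo and Replace are hypothetical names, and the approach relies on the tags inside a match being balanced close/open pairs, which holds for simple run splits but is not guaranteed in general.

```csharp
using System.Collections.Generic;
using System.Security;
using System.Text.RegularExpressions;

static class SplitTokenDemo
{
    // "[", optional tags, "$", the name (possibly interleaved with tags),
    // "$", optional tags, "]".
    static readonly Regex Token = new Regex(
        @"\[(?:<[^>]+>)*\$(.*?)\$(?:<[^>]+>)*\]", RegexOptions.Singleline);

    // Any XML tag; used to strip markup out of the captured name.
    static readonly Regex Tag = new Regex(@"<[^>]+>");

    public static string Replace(string xml, IDictionary<string, string> values)
    {
        return Token.Replace(xml, match =>
        {
            // Drop the run markup that Office inserted inside the token name.
            string name = Tag.Replace(match.Groups[1].Value, "");

            string value;
            if (values.TryGetValue(name, out value))
            {
                // Escape the value so it stays valid XML text.
                return SecurityElement.Escape(value);
            }
            return match.Value; // unknown token: leave the XML untouched
        });
    }
}
```

For an input such as <w:r><w:t>[$Cust</w:t></w:r><w:r><w:t>omer$]</w:t></w:r>, the captured name collapses to Customer once the tags are stripped, and the whole span between "[" and "]" is replaced by the escaped value. The formatting of the first run wins, since the markup of the later runs is removed along with the token.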


Last Updated 12 Nov 2009 | Copyright © CodeProject, 1999-2010