Fill Mergefields in .docx Documents without Microsoft Word

Xavier Spileers, 31 May 2011
Utility class for filling mergefields (loose fields and tabular data) in a Microsoft Word (docx) template document, without needing Microsoft Word itself

Introduction

This application uses the Open XML SDK to find MERGEFIELDs in Microsoft Word documents and replace them with the provided data. It also supports filling tables with data. Because Microsoft Word itself isn't involved, this is a fast and stable way of generating Microsoft Word documents server-side.

The main code consists of only one class with a few methods that do all the work. I've provided a front-end to test the functionality of the class.

To be able to run the application, you must download and install the aforementioned SDK. As the SDK is written in .NET 3.5, the entire library only works in .NET 3.5 and above.

Background

For a customer project, I needed the ability to inject data from an XML file into a standardized document format. The customer still used Microsoft Office 2000 but had installed the Compatibility Pack on all his PCs.

I didn't want to use Microsoft Word through OLE automation because it was a server-side process that ran unattended. As Microsoft doesn't recommend using Microsoft Office in such scenarios, it wasn't an option. But I remembered that the new docx format is just a zipped archive of loose XML files that can be edited. After some searching on the Internet, I found the Open XML SDK, which provided a lot of help in parsing the Microsoft Word document structure. I then wrote a piece of code that fills a Microsoft Word docx file with the data from the XML file, which produced the required document.

Using this mechanism also gave me the additional advantage that the customer himself could edit the layout of the template. Although it wasn't a requirement, it saved me a lot of time afterwards.

Using the Front-End

Along with the source code, a front-end application has been provided to allow you to test the functionality.

[Image: wordfill_demo.jpg]

This application has been written using WPF and uses the datagrid from the WPF toolkit. To be able to run the testing application, you'll need to download and install the WPF toolkit from CodePlex.

Of course, before being able to test anything, you'll need a docx template. I've added a sample template to the zip file, but you can just as well provide your own (see the following chapter for details about the template).

In the main window, start by entering the full path of the template in the textbox above the grid (the Generate button stays disabled as long as this field is empty).

Add your fields and the data in the grid in the center of the window. To add tabular data, click on the 'Add Table' button and define the tablename and column names (max. 5). Click on OK and provide the data for the table. Repeat this for each table.

Finally, click on Generate. Your report should appear automatically.

The docx Template

First of all, you'll need a Microsoft Word docx document with a number of MERGEFIELDs that act as placeholders for your data. The mergefields contain the name (code) of the data that you want to add, for example:

{MERGEFIELD CAND_NAME \* MERGEFORMAT}

There are also 3 suffixes that can be used:

  • dp: Deletes the paragraph if the data field is empty or wasn't provided
  • dr: (only in tables) Deletes the row if the data field is empty or wasn't provided
  • dt: (only in tables) Deletes the whole table if the data field is empty or wasn't provided

The suffixes are added to the field name, with a preceding '#'. For example:

{MERGEFIELD CAND_NAME#dp \* MERGEFORMAT}
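
To make the suffix convention concrete, here is a minimal sketch of how a field name could be split into its data key and its suffix. The class and method names (FieldNameParser, Split) are made up for this illustration and are not taken from the library; the library's own parsing may differ.

```csharp
using System;

public static class FieldNameParser
{
    // Hypothetical helper: splits "CAND_NAME#dp" into the data key ("CAND_NAME")
    // and the suffix ("dp"). The suffix is an empty string when no '#' is present.
    public static void Split(string fieldName, out string key, out string suffix)
    {
        if (fieldName == null) throw new ArgumentNullException("fieldName");

        int hash = fieldName.IndexOf('#');
        if (hash < 0)
        {
            key = fieldName.Trim();
            suffix = string.Empty;
        }
        else
        {
            key = fieldName.Substring(0, hash).Trim();
            suffix = fieldName.Substring(hash + 1).Trim();
        }
    }
}
```

The key is what you would look up in your data; the suffix only decides what happens when that lookup yields nothing.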

If you want to add tabular data to the Word document, you must add a table to the docx template. The cells of the table contain mergefields that indicate which data fields must be placed there. These mergefields are named TBL_nameoftable_nameoffield. For example:

{MERGEFIELD TBL_LANG_NAME \* MERGEFORMAT}

The mergefield above tells the application that this cell contains the value of the NAME column in the current record of the LANG datatable. The application will add a row to the table for each record found in the datatable. (Suffixes are not supported for tabular data. Each table cell can contain only one mergefield.)
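
The naming rule can be sketched as a small parser. This is only an illustration: the names TableFieldParser and TryParse are hypothetical, and I'm assuming that the table name itself contains no underscore, so the first underscore after the TBL_ prefix separates table and column.

```csharp
using System;

public static class TableFieldParser
{
    private const string Prefix = "TBL_";

    // Hypothetical helper: returns true and fills tableName/columnName when
    // fieldName looks like "TBL_<table>_<column>"; returns false otherwise.
    // Assumption: the table name contains no underscore.
    public static bool TryParse(string fieldName, out string tableName, out string columnName)
    {
        tableName = null;
        columnName = null;
        if (string.IsNullOrEmpty(fieldName) ||
            !fieldName.StartsWith(Prefix, StringComparison.Ordinal))
            return false;

        string rest = fieldName.Substring(Prefix.Length);
        int sep = rest.IndexOf('_');
        if (sep <= 0 || sep == rest.Length - 1)
            return false;   // table or column part is missing

        tableName = rest.Substring(0, sep);
        columnName = rest.Substring(sep + 1);
        return true;
    }
}
```

With this rule, TBL_LANG_NAME maps to the table LANG and the column NAME, while a loose field such as CAND_NAME is rejected.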

Note: The application will fill loose mergefields that are placed in the header/footer of the document, but there's no support for tabular data in headers/footers.

Using the Code

There's only one (public) method that can be invoked on the FormFiller class: GetWordReport.

This method accepts 3 parameters:

  • filename: Full path of the template docx file
  • dataset: A DataSet containing the tabular data that must be added to the template. Each datatable in the dataset must be named according to the names used in the template (see above). If the template contains a field TBL_LANG_NAME, the datatable must be called 'LANG' and must contain a column 'NAME'. This parameter can be null if there's no tabular data.
  • values: This is a Dictionary of strings where the key is the fieldname and the value is the data that must be placed in the Microsoft Word document.

If all goes well, the filled-in template is returned as an array of bytes.
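
As a rough sketch, the two data arguments could be prepared as shown below. The ReportInput class, the sample values and the file paths are invented for the example. The call to GetWordReport itself is left in a comment because I'm only assuming that it can be called statically with the parameters in the order listed above; check the source for the exact signature.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class ReportInput
{
    // Builds a DataSet with one table named "LANG" holding a column "NAME",
    // which matches a template field called TBL_LANG_NAME.
    public static DataSet BuildDataSet()
    {
        DataTable lang = new DataTable("LANG");
        lang.Columns.Add("NAME", typeof(string));
        lang.Rows.Add("Dutch");    // sample data, one table row per record
        lang.Rows.Add("French");

        DataSet ds = new DataSet();
        ds.Tables.Add(lang);
        return ds;
    }

    // Builds the loose field values; the key is the field name used in the template.
    public static Dictionary<string, string> BuildValues()
    {
        Dictionary<string, string> values = new Dictionary<string, string>();
        values["CAND_NAME"] = "John Smith";   // sample data
        return values;
    }
}

// Assumed usage (hypothetical paths, signature not verified):
// byte[] report = FormFiller.GetWordReport(@"C:\templates\sample.docx",
//     ReportInput.BuildDataSet(), ReportInput.BuildValues());
// System.IO.File.WriteAllBytes(@"C:\output\report.docx", report);
```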

A Few Highlights in the Code

Opening the Template

Opening the docx file is very easy with the SDK. Only the following code is required:

using (MemoryStream stream = new MemoryStream(filebytes))
{
    // Create a Wordprocessing document object.
    using (WordprocessingDocument docx = WordprocessingDocument.Open(stream, true))
    {
        ...
    }
}

(The filebytes variable is the read-in docx template.)
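
One caveat, which ties in with the 2009-09-15 entry in the history below: a MemoryStream created directly from a byte array has a fixed capacity, so it can't grow when the filled-in document becomes larger than the template. A minimal sketch of a workaround, using only the .NET base class library (the TemplateStream class is a name I made up for the example), is to copy the bytes into an expandable stream first:

```csharp
using System;
using System.IO;

public static class TemplateStream
{
    // Copies the template bytes into a MemoryStream that can grow.
    // new MemoryStream(byte[]) wraps the array and cannot be expanded,
    // whereas the parameterless constructor creates an expandable stream.
    public static MemoryStream CreateExpandable(byte[] filebytes)
    {
        if (filebytes == null) throw new ArgumentNullException("filebytes");

        MemoryStream stream = new MemoryStream();
        stream.Write(filebytes, 0, filebytes.Length);
        stream.Position = 0;   // rewind so the document is read from the start
        return stream;
    }
}
```

The returned stream can be passed to WordprocessingDocument.Open(stream, true) in the same way as above. Once the document object has been disposed, stream.ToArray() gives you the modified package as a byte array.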

Providing a Run-object for the Data

In the OpenXML document, you can't just add text that contains plain hard returns or tabs. These must be replaced by the correct XML tags to be displayed correctly in Microsoft Word.

[Image: wordfill_xml.jpg]

The mergefields in the OpenXML are represented as SIMPLEFIELD (<fldSimple>) elements and can contain child RUN (<r>) elements. The text of the field is represented as a child TEXT (<t>) element inside the RUN element. A RUN element can also have a RUNPROPERTIES (<rPr>) element with additional layout information about the displayed text. We don't want to lose this information, because we'd like our data to keep the same layout as the mergefield has in the template.

So, if we want to replace a mergefield with our text we must make sure that:

  1. tabs and returns in our data are rendered correctly, and 
  2. the formatting of the mergefield is preserved

The FormFiller.GetRunElementForText method does exactly this:

internal static Run GetRunElementForText(string text, SimpleField placeHolder)
{
    // Copy the formatting (run properties) of the mergefield, if it has any.
    string rpr = null;
    if (placeHolder != null)
    {
        foreach (RunProperties placeholderrpr in placeHolder.Descendants<RunProperties>())
        {
            rpr = placeholderrpr.OuterXml;
            break;  // only the first RunProperties element is used
        }
    }

    Run r = new Run();
    if (!string.IsNullOrEmpty(rpr))
        r.Append(new RunProperties(rpr));

    if (string.IsNullOrEmpty(text)) return r;

    // First process line breaks; "\r\n" is included so no stray "\r" ends up in the text.
    string[] split = text.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);
    bool first = true;
    foreach (string s in split)
    {
        if (!first) r.Append(new Break());
        first = false;

        // Then process tabs.
        bool firsttab = true;
        string[] tabsplit = s.Split(new string[] { "\t" }, StringSplitOptions.None);
        foreach (string tabtext in tabsplit)
        {
            if (!firsttab) r.Append(new TabChar());

            // Preserve leading and trailing spaces, which would otherwise be dropped.
            r.Append(new Text(tabtext) { Space = SpaceProcessingModeValues.Preserve });
            firsttab = false;
        }
    }
    return r;
}

This method checks if there's a RUNPROPERTIES element in the given mergefield. If there is, the content is preserved (.OuterXml) and added to the newly instantiated RUN element. The data is inspected for tabs/returns and the correct elements are added to the data (BREAK and TABCHAR elements).

Saving the Template

Once all the fields have been filled in, the changes must be explicitly saved back into the document (it doesn't happen automatically).

docx.MainDocumentPart.Document.Save();  // save main document back in package

Processing Headers and Footers

The headers and footers aren't stored in the same XML file as the main document (each is a different 'document part' in the package). The code discussed above won't find MERGEFIELDs that are placed in a header or footer. For this, a loop over the header and footer parts is required. Below is an example of a loop over the headers of the document:

foreach (HeaderPart hpart in docx.MainDocumentPart.HeaderParts)
{
    ... // process fields
    hpart.Header.Save();    // save header back in package
}

Points of Interest

The suffixes (see above) allow you to delete paragraphs, rows and tables. If this is done while iterating over the elements, the loop suddenly stops (without throwing any error whatsoever). For example, suppose there are 10 mergefields in the document and you iterate over them using the following statement:

foreach (var field in docx.MainDocumentPart.Document.Descendants<SimpleField>())
{
    ...
}

Suppose you decide to delete element 5. For example, the following code looks up the parent PARAGRAPH (<p>) element of the mergefield and deletes it (which also deletes the field itself):

    Paragraph p = GetFirstParent<Paragraph>(field);
    if (p != null)
        p.Remove();

You'll never reach elements 6 to 10. The loop will quit without any indication that you've missed 5 elements.

To solve this, you'll notice that the code uses two loops: the first loop fills the mergefields with the data and keeps a list of the empty mergefields; the second loop deletes all the mergefields in that list.
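
The two-loop idea can be sketched as follows. Note that this is not the library's code: to keep the example self-contained, I'm using System.Xml.Linq as a stand-in for the Open XML SDK, and the element names (p, field), the name attribute and the TwoPassDelete class are all made up for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TwoPassDelete
{
    // Fills every <field name="..."> that has a value and removes the parent <p>
    // of every field without one. Returns the number of removed paragraphs.
    public static int FillAndPrune(XElement body, IDictionary<string, string> values)
    {
        List<XElement> emptyFields = new List<XElement>();

        // Pass 1: fill the fields and remember the empty ones.
        // No element is removed here, so the enumeration isn't cut short.
        foreach (XElement field in body.Descendants("field"))
        {
            string name = (string)field.Attribute("name");
            string value;
            if (name != null && values.TryGetValue(name, out value) && !string.IsNullOrEmpty(value))
                field.Value = value;
            else
                emptyFields.Add(field);
        }

        // Pass 2: iterate over the separate list and delete the parent paragraphs.
        int removed = 0;
        foreach (XElement field in emptyFields)
        {
            XElement p = field.Ancestors("p").FirstOrDefault();
            if (p != null && p.Parent != null)   // skip paragraphs that are already gone
            {
                p.Remove();
                removed++;
            }
        }
        return removed;
    }
}
```

The important part is that the second loop runs over its own list rather than over the document tree, so removing a paragraph can't disturb the iteration.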

Update provided by M. Chale

The library now supports tags for UPPER, LOWER, FirstCap and Caps. UPPER and LOWER convert the entire string to uppercase or lowercase; FirstCap capitalizes the first letter and makes everything else lowercase; Caps title-cases the text by capitalizing the first letter of every word. Note that the Caps routine is a bit naive: it only capitalizes letters that directly follow a space. The library also supports text that should appear before or after the data. This text is inserted with the same formatting as the rest of the mergefield, except when the field is blank and marked #dp.

A sample field with formatting: MERGEFIELD MYFIELD \* UPPER \b before \f after
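
To show what the four tags do to a value, here is a minimal sketch. It isn't Michael's implementation: the FieldFormat class is invented for the example, the tag names are matched exactly as spelled above, and for Caps I'm assuming that characters which don't follow a space are left untouched.

```csharp
using System;
using System.Text;

public static class FieldFormat
{
    // Applies one of the four format tags to a value.
    // Unknown tags return the value unchanged.
    public static string Apply(string tag, string value)
    {
        if (string.IsNullOrEmpty(value)) return value;

        switch (tag)
        {
            case "UPPER":
                return value.ToUpperInvariant();
            case "LOWER":
                return value.ToLowerInvariant();
            case "FirstCap":
                return char.ToUpperInvariant(value[0]) + value.Substring(1).ToLowerInvariant();
            case "Caps":
                return Caps(value);
            default:
                return value;
        }
    }

    // Naive title-casing: upper-cases the first character and every character
    // that directly follows a space. Other characters are left as they are (assumption).
    private static string Caps(string value)
    {
        StringBuilder sb = new StringBuilder(value.Length);
        bool capitalizeNext = true;
        foreach (char c in value)
        {
            sb.Append(capitalizeNext ? char.ToUpperInvariant(c) : c);
            capitalizeNext = (c == ' ');
        }
        return sb.ToString();
    }
}
```

The naivety mentioned above is visible with a hyphenated name: in "john smith-jones" the "j" after the hyphen stays lowercase, because only a space triggers capitalization.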

Thanks to Michael Chale for this update.

Update for Microsoft Word 2010

Since Microsoft Word 2010, the SimpleField element is no longer used. It has been replaced with a number of Run elements, one or more of which contain a FieldCode element with the field instruction. The library has been modified to replace these with an old-style SimpleField, so it remains backwards compatible with Microsoft Word 2007 documents.

History

  • 2009-07-29: Submitted to CodeProject
  • 2009-08-12: Mergefields in headers and footers will now also be processed
  • 2009-08-14: Small update in source: formatting of mergefields in tables is now also repeated (bold, italic, ...)
  • 2009-09-15: Updated source: MemoryStream wasn't expandable and table row properties weren't copied. Fixed both issues.
  • 2010-06-14: Michael Chale added support for formatting the fields. I've updated the solution for VS2010.
  • 2010-08-02: Updated library to work with Microsoft Word 2010 generated documents
  • 2011-05-30: Added a couple of bugfixes to the library

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Xavier Spileers
CEO TRI-S bvba, Cogenius bvba
Belgium
I have been working in IT since 1999: I started developing in PROGRESS 4GL, then VB6, and have been working with C# since 2003. I'm currently transitioning to HTML5, CSS3 and JavaScript for front-end development.
I started my own company (TRI-S) in 2007 and co-founded another one (Cogenius) in 2012.
Besides being a Microsoft Certified Professional Developer (MCPD) I'm also a Microsoft Certified Trainer (MCT) and am teaching .NET and JavaScript courses.

Article Copyright 2009 by Xavier Spileers