
Unleashing the Full Power of Regular Expressions in Microsoft Office Documents

David Saelman, 16 Jun 2008
Part 1: A method of utilizing Regular Expressions to perform power searches in Microsoft Office Documents using .NET and the Microsoft Office Primary Interop Assemblies

Prerequisites

To run the sample application, you need the Microsoft .NET Framework 2.0 or higher, and Microsoft Office 2003 or higher together with the Microsoft Office 2003 Primary Interop Assemblies (PIAs) redistributable. The PIAs are installed by a full install of Microsoft Office 2003, and they can also be downloaded free of charge from Microsoft.

For more information on how to install and use the Primary Interop Assemblies in .NET programs, please refer to this link.

I would like to emphasize that Visual Studio Tools for Office is not needed to run or modify this program.

Introduction

Regular Expressions are a powerful tool for text processing: sophisticated expressions can find all kinds of text patterns, and Regular Expression engines are integrated into many text editors. Most Regular Expression examples, however, show how to manipulate plain ASCII or Unicode text. Beyond plain text, there are millions (probably billions) of documents encoded in one of Microsoft’s many Office formats, such as Word (DOC), Rich Text Format (RTF), and Excel (XLS). While one can search Microsoft Office documents with Regular Expressions through Smart Tags, that approach is cumbersome for many document processing purposes.

In this article, I present a simple methodology for applying the power of Regular Expressions to Microsoft Word documents through the Microsoft .NET Framework. It relies on the System.Text.RegularExpressions namespace and the Microsoft Word interop assemblies. In addition, through the use of dynamically loadable assemblies, every Regular Expression match can be validated to ensure that the match is correct. For example, it is quite easy to write a Regular Expression for a numerical date of the form 02/07/2007 for February 7, 2007. Making that same expression reject invalid dates such as 04/31/2002 or 02/30/2007 is quite difficult without code that performs such checks.
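To make the date example concrete, here is a minimal sketch of a pattern that accepts anything shaped like mm/dd/yyyy. The pattern and the DatePatternDemo class are my own illustration and are not taken from the sample application; the point is only that a shape-based pattern cannot tell a real calendar date from an impossible one.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DatePatternDemo
{
    // Hypothetical pattern for mm/dd/yyyy (an assumption, not the one
    // shipped in Searches.XML). It checks ranges per field only.
    public const string UsDatePattern =
        @"\b(0[1-9]|1[0-2])/(0[1-9]|[12][0-9]|3[01])/(\d{4})\b";

    // True if the text contains something shaped like a US numerical date.
    public static bool LooksLikeUsDate(string text)
    {
        return Regex.IsMatch(text, UsDatePattern);
    }
}
```

Under this pattern both "02/07/2007" and "02/30/2007" match, because each field is checked in isolation. Rejecting the second one requires knowing how many days the month has, which is much easier to express in code than in the pattern.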

In future articles, I plan to present ways of using Regular Expressions to perform sophisticated text search and replace operations through the Microsoft Office interop assemblies and .NET technologies. I will also apply these techniques to other Microsoft Office documents, such as Excel workbooks.

Background

Support for Regular Expressions in Microsoft applications first appeared in Word 97. Using it was quite tedious because its syntax differed significantly from the Regular Expression syntax used elsewhere. Microsoft recognized the shortcomings of that implementation and reintroduced Regular Expressions as part of the Smart Tags library 2.0, which first became available with Microsoft Office 2003. Smart Tags, of which Regular Expression operations form a small part, represented a generalized, integrated way to enable users to present data from their documents. However, because Smart Tags are complicated and non-intuitive, Microsoft itself acknowledges on its MSDN Web site that, according to a poll, developers have not taken the necessary steps to develop them or to use the Microsoft .NET Framework to do so. Please refer to this MSDN article for more information: Realize the Potential of Office 2003 by Creating Smart Tags in Managed Code. The focus of this article is a simple yet powerful way of using Regular Expressions (along with validation code).

Using the Code

On startup, the program reads the XML file Searches.XML. This file contains information for all built-in Regular Expression searches. Included in this XML file are searches for URLs, IP addresses, US dates, European dates, US phone numbers, and email addresses. One can add as many search options as she or he wants to this file. Each search option can be activated by placing a check by the desired search.

Each search group contains the following information in the XML file:

  • Search Regex – The Regular Expression used in the search
  • Identifier – The search title that appears in the check list box
  • FindColor – The color used to highlight the found text in the document
  • Action – The operation used (this version only supports Find)
  • PlugInName – The name of the assembly associated with the search. If no assembly is associated, “None” is used.
  • PlugInFunction – The function called for this search block that is found in its plug-in assembly
  • Description – The description text that is displayed in the check list box
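As a rough illustration of how one such search group could be read into memory, the sketch below parses a single entry with XmlDocument. The element names (SearchRegex, Identifier, and so on), the SearchEntry class, and the ParseEntry helper are assumptions made for this example; the real Searches.XML schema and the application's own SearchStruct may differ.

```csharp
using System.Xml;

// Hypothetical in-memory form of one search group.
public class SearchEntry
{
    public string RegExpression;
    public string Identifier;
    public string FindColor;
    public string Action;
    public string PlugInName;
    public string PlugInFunction;
    public string Description;
}

public static class SearchFileSketch
{
    // Parses one search group. Element names are assumed, not confirmed.
    public static SearchEntry ParseEntry(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        XmlNode node = doc.DocumentElement;

        SearchEntry entry = new SearchEntry();
        entry.RegExpression = ReadChild(node, "SearchRegex", "");
        entry.Identifier = ReadChild(node, "Identifier", "");
        entry.FindColor = ReadChild(node, "FindColor", "");
        entry.Action = ReadChild(node, "Action", "Find");
        // "None" means no validation assembly is associated with the search
        entry.PlugInName = ReadChild(node, "PlugInName", "None");
        entry.PlugInFunction = ReadChild(node, "PlugInFunction", "");
        entry.Description = ReadChild(node, "Description", "");
        return entry;
    }

    // Returns the child element's text, or the fallback if it is absent.
    private static string ReadChild(XmlNode parent, string name, string fallback)
    {
        XmlNode child = parent.SelectSingleNode(name);
        return child == null ? fallback : child.InnerText;
    }
}
```

Defaulting a missing PlugInName to "None" keeps entries without a validator short, which matches the convention described in the list above.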

Finding the Text

MSWordRegExDemo contains methods which manipulate the Microsoft Word or RTF document using automation by way of the Microsoft Word interop assembly. All of these methods are contained in the DocumentEngine class. The two main Microsoft Word objects that are used in this application are:

Word.Application app;
Word.Document theDoc;

To open the document, we perform the following call which is triggered by the file open event in the GUI:

// Opens a Microsoft WORD or RTF document
public void OpenDocument(string documentName)
{
    object optional = Missing.Value;
    object visible = true;
    object fileName = documentName;
    if (app == null)
        app = new Word.Application();

    app.Visible = true;

    try
    {
        // have Word open the document
        theDoc = app.Documents.Open(ref fileName, ref optional,
            ref optional, ref optional, ref optional, ref optional, ref optional,
            ref optional, ref optional, ref optional, ref optional, ref visible,
            ref optional, ref optional, ref optional, ref optional);

        paraCount = theDoc.Paragraphs.Count;
    }
    catch(Exception ex)
    {
        MessageBox.Show(ex.Message + ": Error opening document");
    }
}

The first step is to convert the text of the Word document into a .NET string. Once the text is in string form, we can run a Regular Expression search on it and see whether there are any matches. See below:

// convert the text in the Microsoft Office document into a .NET string
docText = docEngine.GetRng(currentParaNum).Text;

If one or more matches occur, we take the match text and feed it through the Microsoft Word Find function. To search for text, we first need to select a range of text to convert into a string. I have chosen the paragraph as the range, which means that we loop through the document paragraph by paragraph and perform our searches on each one. For short documents, we could select the entire document as the range. If we wanted to iterate through footnotes, Word provides a footnote range. To get the range of each paragraph, the following function is used:

// returns the range of text in paragraph number
// nParagraphNumber
public Word.Range GetRng(int nParagraphNumber)
{
    try
    {
        return theDoc.Paragraphs[nParagraphNumber].Range;
    }
    catch (System.Runtime.InteropServices.COMException ex)
    {
        MessageBox.Show(ex.Message + "\nParagraph Number: " +
            nParagraphNumber.ToString() + " does not exist.");
        return null;
    }
}
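The paragraph-by-paragraph idea can be tried without Word installed. The sketch below is a plain-string stand-in, not the article's code: it assumes paragraphs end with a carriage return (the paragraph mark Word reports in Range.Text), and the ParagraphScanSketch class and FindOffsets method are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParagraphScanSketch
{
    // Scans the text one paragraph at a time and returns the absolute
    // offset of every match (paragraph start + offset within paragraph).
    public static List<int> FindOffsets(string documentText, string pattern)
    {
        List<int> offsets = new List<int>();
        Regex regex = new Regex(pattern);
        int paragraphStart = 0;

        while (paragraphStart < documentText.Length)
        {
            // A paragraph runs up to and including its '\r' mark
            int mark = documentText.IndexOf('\r', paragraphStart);
            int paragraphEnd = (mark < 0) ? documentText.Length : mark + 1;
            string paragraph = documentText.Substring(
                paragraphStart, paragraphEnd - paragraphStart);

            foreach (Match match in regex.Matches(paragraph))
                offsets.Add(paragraphStart + match.Index);

            paragraphStart = paragraphEnd;
        }
        return offsets;
    }
}
```

In a real document, string offsets may not line up exactly with Word's Range positions (fields and other non-text content can shift them), which is presumably why the application hands the matched text to Word's own Find rather than computing positions this way.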

The main function which performs the "find" of text is RegularExpressionFind.

// perform a search based on regular expressions
public void RegularExpressionFind(int paraNum, string docText,
       SearchStruct selSearchStruct, out List<HitInfo> hits)
{
    HitInfo hitInfo = new HitInfo();
    hits = new List<HitInfo>();
    System.Text.RegularExpressions.Regex r;
    Word.WdColor color = GetSearchColor(selSearchStruct.TextColor);

    r = new Regex(selSearchStruct.RegExpression);
    MatchCollection matches = r.Matches(docText);

    // no matches go on to next paragraph
    if (matches.Count == 0)
        return;

    // check if we have a validation assembly
    try
    {
        if (!LoadSearchAssembly(selSearchStruct.PlugInName,
                                selSearchStruct.PlugInFunction))
            return;
    }
    catch (Exception)
    {
        // rethrow without resetting the stack trace
        throw;
    }

    int index = 0;

    // this is the start point in the Microsoft Office document
    int startSearchPos = GetRng(paraNum).Start;

    foreach (Match match in matches)
    {
        // Perform validation check
        if (hasValidationAssembly)
        {
            Object[] objList = new Object[1];
            objList[0] = (Object)match;
            if (!Convert.ToBoolean(validationMethod.Invoke
                (assemblyInstance, objList)))
                continue;
        }
        index = docText.IndexOf(match.Value, index);

        // take the matched text at its position in the paragraph string
        string matchStr = docText.Substring(index, match.Value.Length);
        index += matchStr.Length - 1;

        // find the pattern in the Word document
        FindTextInDoc(OperationMode.DotNetRegExMode, paraNum,
        matchStr, color, startSearchPos, out  startSearchPos,
        out hitInfo.StartDocPosition);

        // add match to our hit list
        hitInfo.Text = match.Value;
        hits.Add(hitInfo);
   }
}

First, we search the imported paragraph for the Regular Expression by using the .NET Regex class.

r = new Regex(selSearchStruct.RegExpression);
MatchCollection matches = r.Matches(docText);

// no matches go on to next paragraph
if (matches.Count == 0)
    return;

If there is a match, we load the search assembly if it has not already been loaded, and perform additional validation on the match.

try
{
  if (!LoadSearchAssembly(selSearchStruct.PlugInName,
            selSearchStruct.PlugInFunction))
            return;
}

The following method dynamically loads the validation assembly for the Regular Expression, if one exists. If the assembly was previously loaded, the LoadFrom method will return it.

// loads the search assembly and the desired plug-in function
public bool LoadSearchAssembly(string plugginName, string plugInFunction)
{
    try
    {
        // if there is no validation assembly, leave
        if (plugginName.ToLower() == "none")
        {
            hasValidationAssembly = false;
            return true;
        }
        hasValidationAssembly = true;

        // Use the file name to load the assembly into the current
        // application domain.
        string plugginPath = Path.GetDirectoryName
        (Application.ExecutablePath) + @"\Plugins\" + plugginName;
        if (!File.Exists(plugginPath))
            throw new Exception("Cannot find path to assembly: " +
                                plugginName);

        Assembly a = Assembly.LoadFrom(plugginPath);
        // Get the type to use.
        Type[] types = a.GetTypes();

        // Get the method to call.
        validationMethod = types[0].GetMethod(plugInFunction);
        // Create an instance.
        assemblyInstance = Activator.CreateInstance(types[0]);

        return true;
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
        return false;
    }
}
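The reflection steps used above (find a method by name, create an instance, invoke it) can be exercised in isolation. The following is a minimal sketch under my own assumptions: SampleValidator stands in for a plug-in class, and InvokeValidator is a hypothetical helper, not part of DocumentEngine.

```csharp
using System;
using System.Reflection;

// Stand-in for a plug-in class; the name and method are hypothetical.
public class SampleValidator
{
    public bool IsShort(string text)
    {
        return text.Length <= 5;
    }
}

public static class ReflectionSketch
{
    // Looks up a public instance method by name, creates an instance of
    // the type, and invokes the method with a single argument.
    public static bool InvokeValidator(Type pluginType, string methodName,
                                       object argument)
    {
        MethodInfo method = pluginType.GetMethod(methodName);
        if (method == null)
            throw new MissingMethodException(pluginType.FullName, methodName);

        object instance = Activator.CreateInstance(pluginType);
        object result = method.Invoke(instance, new object[] { argument });
        return Convert.ToBoolean(result);
    }
}
```

Calling ReflectionSketch.InvokeValidator(typeof(SampleValidator), "IsShort", "abc") runs the validator through reflection. Checking the result of GetMethod for null gives a clearer error when the function name in the XML file is misspelled. Since an assembly may contain more than one type, selecting the plug-in type by name might also be safer than taking the first entry of GetTypes().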

Below is the assembly that validates a numerical date:

// SaelSoft -- NumericalDateValidatorClass.cs
// Purpose -- Validates dates in the form of:
// US Date Format:       (month) mm / (day) dd / (year) yyyy  or
// European Date Format: (day) dd / (month) mm / (year) yyyy
// 2008 David Saelman

using System.Text.RegularExpressions;

namespace SaelSoft.RegExPlugIn.NumericalDateValidator
{
    public class NumericalDateValidatorClass
    {
        int month = 0;
        int day = 0;
        int year = 0;
        public bool ValidateUSDate(Match matchResult)
        {
            // Groups[0] is the whole match, so three captures need Count >= 4
            if (matchResult.Groups.Count < 4)
                return false;
            int nResult = 0;

            if (int.TryParse(matchResult.Groups[1].ToString(), out nResult))
                month = nResult;
            else
                return false;
            if (int.TryParse(matchResult.Groups[2].ToString(), out nResult))
                day = nResult;
            else
                return false;

            if (int.TryParse(matchResult.Groups[3].ToString(), out nResult))
                year = nResult;
            else
                return false;

            return CommonDateValidation();
        }

        public bool ValidateEuropeanDate(Match matchResult)
        {
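            // Groups[0] is the whole match, so three captures need Count >= 4
            if (matchResult.Groups.Count < 4)
                return false;
            int nResult = 0;

            // European order: the first group is the day, the second the month
            if (int.TryParse(matchResult.Groups[1].ToString(), out nResult))
                day = nResult;
            else
                return false;
            if (int.TryParse(matchResult.Groups[2].ToString(), out nResult))
                month = nResult;
            else
                return false;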

            if (int.TryParse(matchResult.Groups[3].ToString(), out nResult))
                year = nResult;
            else
                return false;

            return CommonDateValidation();
        }

        private bool CommonDateValidation()
        {
            // verify that all 30 day months do not contain 31 days e.g. 4/31/2007
            if (day == 31 && (month == 4 || month == 6 || month == 9 || month == 11))
            {
                return false; // 31st of a month with 30 days
            }
            // February is a special case: it can never have 30 or more days
            else if (day >= 30 && month == 2)
            {
                return false; // February 30th or 31st
            }
            // check for February 29 outside a leap year
            else if (month == 2 && day == 29 && !(year % 4 == 0
                                && (year % 100 != 0 || year % 400 == 0)))
            {
                return false;
            }
            else
            {
                return true; // Valid date
            }
        }
    }
}
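As an alternative to hand-written month and leap-year rules, a validator could let the framework's calendar decide whether a date exists. This is a sketch of that different technique, not the plug-in's actual approach; CalendarCheckSketch and IsRealDate are hypothetical names, and the format strings assume two-digit months and days.

```csharp
using System;
using System.Globalization;

public static class CalendarCheckSketch
{
    // Returns true only if the text is a real calendar date in the given
    // exact format, e.g. "MM/dd/yyyy" (US) or "dd/MM/yyyy" (European).
    public static bool IsRealDate(string text, string format)
    {
        DateTime parsed;
        return DateTime.TryParseExact(text, format,
            CultureInfo.InvariantCulture, DateTimeStyles.None, out parsed);
    }
}
```

With this helper, IsRealDate("02/30/2007", "MM/dd/yyyy") and IsRealDate("04/31/2002", "MM/dd/yyyy") are both rejected, while February 29 is accepted only in leap years. The explicit rules in the plug-in keep the logic visible; the framework call trades that visibility for brevity.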

Finally, if we have a real match, we perform a search for the match string in the Word document by calling the DocumentEngine function, FindTextInDoc.

internal bool FindTextInDoc(OperationMode opMode, int currentParaNum,
         string textToFind, Word.WdColor color, int start, out int end,
         out int textStartPoint)
{
    string strFind = textToFind;
    textStartPoint = 0;

    // get the range of the current paragraph
    Word.Range rngDoc = GetRng(currentParaNum);

    // make sure we are not past the end of the range
    if (start >= rngDoc.End)
    {
        end = 0;
        return false;
    }
    rngDoc.Start = start;

    // setup Microsoft Word Find based upon
    // Regular Expression Match
    rngDoc.Find.ClearFormatting();
    rngDoc.Find.Forward = true;
    rngDoc.Find.Text = textToFind;

    // make search case sensitive
    object caseSensitive = "1";
    object missingValue = Type.Missing;

    // wild cards
    object matchWildCards = Type.Missing;

    // this is for a future version
    if (opMode == OperationMode.Word97Mode)
        matchWildCards = "1";

    // find the text in the word document
    rngDoc.Find.Execute(ref missingValue, ref caseSensitive,
        ref missingValue, ref missingValue, ref missingValue,
        ref missingValue, ref missingValue, ref missingValue,
        ref missingValue, ref missingValue, ref missingValue,
        ref missingValue, ref missingValue, ref missingValue,
        ref missingValue);

    // select text if true
    if (hilightText)
        rngDoc.Select();

    end = rngDoc.End + 1;
    textStartPoint = rngDoc.Start;

    // we found the text
    if (rngDoc.Find.Found)
    {
        rngDoc.Font.Color = color;
        // the range endpoint will change if we modified the text
        return true;
    }
    return false;
}

Points of Interest

The DocumentEngine class uses Microsoft Office events to detect when the user closes the Microsoft Word application that this program started. When the Quit event is raised, the app and document objects are set to null. They are reinitialized when the user opens a new document.

public DocumentEngine()
{
  app = new Word.Application();
  // the following line will not compile if the Microsoft
  ((Word.ApplicationEvents4_Event)app).Quit += new Microsoft.Office.
  Interop.Word.ApplicationEvents4_QuitEventHandler(App_Quit);
}

// notification that application was quit by user
private void App_Quit()
{
   app = null;
   theDoc = null;
}

This project can serve as the first step toward a complex document processing application for Microsoft Word and RTF documents. Basically, everything that can be accomplished with Regular Expressions on ASCII or Unicode files can now be done almost as easily for *.doc and *.rtf files. In my next article, I will show how, by means of dynamic assemblies, we can perform complex formatting using Regular Expressions.

For more online information on Microsoft Office Interop Assemblies, please refer to MSDN.

For Further Investigation

For those who would like to find out more about regular expressions and Microsoft Office automation, I recommend the following excellent books: Mastering Regular Expressions by Jeffrey E. F. Friedl, and Visual Studio Tools for Office: Using C# with Excel, Word, Outlook, and InfoPath by Eric Carter and Eric Lippert.

History

  • 13th June, 2008: First version
  • 14th June, 2008: Fixed the *.sln (solution files) so it is a bit tidier
  • 16th June, 2008: Added a ColorCheckedBoxList component (subclassed from CheckedListBox) so that the user can see which color corresponds to which Regular Expression match. Drag and drop functionality also added.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

David Saelman
Software Developer (Senior) CISCO Systems
United States
I started programming on a TRS-80 Model I in junior high school with the goal of writing arcade games. Since then, I have had the opportunity to work with a wide variety of technologies and environments, ranging from real-time telemetry data systems, state-of-the-art digital paint and ink systems in Hollywood, 3D computer games, and non-contact measurement acquisition devices to digital TV systems. I have worked on everything from low-level device drivers to state-of-the-art GUI apps using C#.
 
With my current job at CISCO, I have come full circle, working on embedded systems in C as well as writing advanced real-time analysis tools in C#.

Comments and Discussions

 
PIAs are just a pain in the butt... (Gishu Pillai, 23-Jun-08 20:02)
I've downloaded the PIAs on two machines (one with office 2000 and other with 2007).
the PIA redistrib is a msi that just launches and disappears.. I've tried for about half an hour... and now I just give up.
 
I'd rather save the word doc as html and then run expresso on the HTML Text. Once I nail the regexp, substitute it in some boilerplate ruby code to dump it out.. much faster than this.
 
But nice idea for an app... I've lots of friends who want this.

Article Copyright 2008 by David Saelman