Introduction

In this article, I go through how I built a simple Windows service that watches a folder for incoming PDF documents (from a scanner, for example), renames each file, and moves it to a designated folder depending on its contents. The solution uses regular expressions both to identify the document, which decides where it is moved, and to extract information that is useful for naming the file, such as invoice date or customer name.

Background

After buying a new scanner (the excellent ScanSnap IX500) for digitizing over 2000 pages of old invoices and other documents, I was faced with the problem of sorting all the scanned files. Doing it by hand would be far too boring, so I decided to solve the problem programmatically instead.

After playing around with the scanner and the software that came with it, I found that using high resolution scanning for the OCR and then scaling down the image for the actual PDF was the best way to get good OCR quality. The OCR is done by the ABBYY engine, which places a layer of transparent text over the corresponding parts of the image, creating a PDF in which you can select and copy text.

So, where ABBYY leaves off, I'm left with a "searchable PDF", which I needed to parse for my project. After investigating the open source PDF libraries, I found that Apache PDFBox suited my needs best. It so happened that there was an article here on codeproject.com (Converting-PDF-to-Text-in-C) with precompiled binaries containing everything you need to use it in a .NET project, so I went ahead and used the sample from there.

Using the code

Compiling

To be able to compile my project, you need to download the binaries from here and include the following files in your project's resource folder:

IKVM.OpenJDK.Core.dll
IKVM.OpenJDK.SwingAWT.dll
pdfbox-1.7.0.dll

Also, be sure to copy the following files to the project's bin folder (otherwise it won't run):

commons-logging.dll
fontbox-1.7.0.dll
IKVM.OpenJDK.Util.dll
IKVM.Runtime.dll

Architecture

The solution creates a Windows service that is installed with the installutil.exe command found in the corresponding .NET Framework folder. In a debug build, the code runs as usual when you press F5; in a release build, it runs as a service.

General flow

The whole idea of the project is to:

Watch a folder for new PDF files.
When a new file shows up, search the file for certain identifiers to decide what to do with it.
When an identifier is matched, select important information in the file and use this to name the file appropriately.
Move the file to a destination depending on the identifier.

Setting up the file watchers

Because of how ABBYY (OCR software I use) is set up, the file is named <prefix>_OCR.pdf after it has gone through the OCR process, and thus a FileSystemWatcher object is set up like this:

FileSystemWatcher watcher = new FileSystemWatcher(watchFolder, "*_OCR.pdf");
watcher.NotifyFilter = NotifyFilters.LastWrite | NotifyFilters.FileName | NotifyFilters.DirectoryName;
watcher.Created += new FileSystemEventHandler(OnCreated);
watcher.EnableRaisingEvents = true;

While testing, I often ended up with files in the wrong directory with the wrong file name due to poorly written identifiers. To be able to reprocess files regardless of file name, I also set up a watcher for a rematchFolder where the filter is just "*.pdf". That way you can change the configuration and drop any file into the rematch folder to run it through the processing again with the new rules.
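
Since the two watchers differ only in folder and filter, the setup can be factored into one helper. The following is a minimal sketch rather than the project's actual code; the names WatcherFactory and CreateWatcher are hypothetical.

```csharp
using System.IO;

public static class WatcherFactory
{
    // Creates a watcher that calls onCreated for each new file matching the filter.
    public static FileSystemWatcher CreateWatcher(
        string folder, string filter, FileSystemEventHandler onCreated)
    {
        var watcher = new FileSystemWatcher(folder, filter);
        watcher.NotifyFilter = NotifyFilters.LastWrite | NotifyFilters.FileName;
        watcher.Created += onCreated;
        watcher.EnableRaisingEvents = true;
        return watcher;
    }
}
```

Calling it once with watchFolder and "*_OCR.pdf" and once with rematchFolder and "*.pdf" gives both watchers. Keep the returned objects in fields of the service so they stay alive for as long as it runs.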

Configuration

Two pieces of configuration drive the service. The first is the app.config, which specifies the in folders, the no match folder, the rules configuration file, and the location of the log file.

The second is the rules configuration, which is stored in an XML file and loaded into a list (PDFTemplates) of PDFTemplate objects. The PDFTemplate class simply holds:

A list of strings in identifiers, where each string is a regular expression. ALL identifiers must match the file's text for the rule to apply.
A list of strings in contentSelectors, where each string is a regular expression containing a capturing group (written "(...)" in the expression). The first content selector that matches supplies the text used for renaming the file.
A string in fileNamePrefix, setting the prefix of the file name the rule gives the file.
A string in destinationFolder, holding the full path of the directory the rule moves the file to.

And the XML file looks accordingly:

<ArrayOfPDFTemplate xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
         xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <PDFTemplate>
    <identifiers>
      <string>[Ss]ome company</string>
    </identifiers>
    <contentSelectors>
      <string>\bInvoice date\W+(?:\w+\W+){0,20}?([0-9] *[0-9] *[0-9] *[0-9] *- *[0-9] *[0-9] *- *[0-9] *[0-9])\b</string>
      <string>([0-9] *[0-9] *[0-9] *[0-9] *- *[0-9] *[0-9] *- *[0-9] *[0-9])</string>
    </contentSelectors>
    <fileNamePrefix>Some Company</fileNamePrefix>
    <destinationFolder>C:\Sorted PDF Files\Some Company</destinationFolder>
  </PDFTemplate>

  ...
</ArrayOfPDFTemplate>
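
The ArrayOfPDFTemplate root element and the string child elements are what .NET's XmlSerializer produces for a List of objects, so the file can be read back with the same serializer. This is a sketch under that assumption; the TemplateLoader class is hypothetical, and the PDFTemplate fields are inferred from the XML above rather than copied from the project.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Field names mirror the XML element names so XmlSerializer maps them directly.
public class PDFTemplate
{
    public List<string> identifiers = new List<string>();
    public List<string> contentSelectors = new List<string>();
    public string fileNamePrefix;
    public string destinationFolder;
}

public static class TemplateLoader
{
    // Reads rules XML whose root element is <ArrayOfPDFTemplate>.
    public static List<PDFTemplate> Load(TextReader reader)
    {
        var serializer = new XmlSerializer(typeof(List<PDFTemplate>));
        return (List<PDFTemplate>)serializer.Deserialize(reader);
    }
}
```

Passing a StreamReader opened on the rules file path from app.config would fill the PDFTemplates list when the service starts.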

Running the match and renaming the file

When a file is processed, it is checked against each object in the PDFTemplates list in turn, and the first rule whose identifiers all match is applied. If no rule matches, the file is moved (but not renamed) to a designated "noMatch" folder for manual processing.
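
The rule selection described above can be condensed into a small helper. This is a sketch for illustration only; TemplateMatcher and its methods are hypothetical names, and each rule is represented here by just its list of identifier patterns.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TemplateMatcher
{
    // True when every identifier pattern is found somewhere in the text.
    // Note that an empty identifier list matches any text.
    public static bool AllIdentifiersMatch(string text, IEnumerable<string> identifiers)
    {
        return identifiers.All(id => Regex.IsMatch(text, id, RegexOptions.IgnoreCase));
    }

    // Returns the index of the first rule whose identifiers all match,
    // or -1 when none does (the "noMatch" case).
    public static int IndexOfFirstMatch(string text, IList<List<string>> identifierSets)
    {
        for (int i = 0; i < identifierSets.Count; i++)
        {
            if (AllIdentifiersMatch(text, identifierSets[i]))
                return i;
        }
        return -1;
    }
}
```

Because the first matching rule wins, more specific rules should come before more general ones in the rules file.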

The code for searching through the file for identifiers and renaming it goes like this:

...
					
//Extract all text from the PDF document
org.apache.pdfbox.pdmodel.PDDocument doc = org.apache.pdfbox.pdmodel.PDDocument.load(fullPath);
org.apache.pdfbox.util.PDFTextStripper stripper = new org.apache.pdfbox.util.PDFTextStripper();
//Close the document even if extraction throws, so the file can still be moved afterwards
try { text = stripper.getText(doc); }
finally { doc.close(); }

...
	
//Go through all identifiers, looking for a match
foreach (string identifier in thisTemplate.identifiers)
{
    if (!Regex.IsMatch(text, identifier, RegexOptions.IgnoreCase))
    {
        identifiersFound = false;
        break;
    }
}

...

//Look for a matching contentselector; the first one that matches wins
foreach (string contentSelector in thisTemplate.contentSelectors)
{
    Match thisMatch = Regex.Match(text, contentSelector, 
             RegexOptions.IgnoreCase | RegexOptions.Multiline);
    if (thisMatch.Success)
    {
        //Groups[1] is the first "(...)" capturing group of the selector
        string selection = thisMatch.Groups[1].Value;
        newFileName = newFileName + "_" + selection;
        break;
    }
}

And then there is just the matter of renaming and moving the file being processed.
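
That last step is not shown in the article, so the following is only a sketch of one way to do it. FileMover and its methods are hypothetical, and both the replacement of invalid characters and the numbering of duplicates are assumptions of mine, not behavior of the original service.

```csharp
using System.IO;

public static class FileMover
{
    // Builds "<prefix>_<selection>.pdf", replacing characters the platform
    // does not allow in file names with '-'.
    public static string BuildFileName(string prefix, string selection)
    {
        string name = string.IsNullOrEmpty(selection) ? prefix : prefix + "_" + selection;
        foreach (char c in Path.GetInvalidFileNameChars())
            name = name.Replace(c, '-');
        return name + ".pdf";
    }

    // Moves the file into destinationFolder, creating the folder when missing.
    // Appends " (2)", " (3)", ... instead of overwriting an existing file.
    public static string MoveTo(string sourcePath, string destinationFolder, string fileName)
    {
        Directory.CreateDirectory(destinationFolder);
        string target = Path.Combine(destinationFolder, fileName);
        string stem = Path.GetFileNameWithoutExtension(fileName);
        string ext = Path.GetExtension(fileName);
        for (int n = 2; File.Exists(target); n++)
            target = Path.Combine(destinationFolder, stem + " (" + n + ")" + ext);
        File.Move(sourcePath, target);
        return target;
    }
}
```

For the noMatch case, the same MoveTo call can be used with the file's original name and the noMatch folder as the destination.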

Points of Interest

This article is not so much about solving things elegantly (the code is just a hack and needs some rework for that) as about giving you a starting point if you're facing a similar situation. I tried to find existing software that would do this for me, but I drew a blank when it came to using regular expressions and setting up my own collection of rules to apply to a file.