PDF Document Parser

11 Feb 2020, GPL3
A .NET toolset for building PDF parsers
Consider PdfDocumentParser if you need to build a parser for PDF files that conform to predictable graphical layouts, such as reports, bills, forms, tickets and the like. PdfDocumentParser handles the tricky work of building parsing templates, searching, recognition and extraction, leaving you to code only the custom logic. This article is a brief overview. For details, refer to the documentation and source code.

Read documentation.

Idea

PdfDocumentParser's main approach to parsing is to find certain text or image fragments on a PDF page and then extract the text or images that are located and sized relative to those fragments.
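The anchor-relative idea can be illustrated with a small sketch. The types below (`Rect`, `AnchorGeometry`) are hypothetical and are not part of the PdfDocumentParser API. They only show how a field rectangle recorded relative to an anchor fragment might be resolved against the place where that anchor is actually found on a page. The sketch assumes the anchor is merely shifted, not scaled.

```csharp
// Hypothetical types for illustration only; not the PdfDocumentParser API.
public readonly struct Rect
{
    public double X { get; }
    public double Y { get; }
    public double Width { get; }
    public double Height { get; }

    public Rect(double x, double y, double width, double height)
    {
        X = x;
        Y = y;
        Width = width;
        Height = height;
    }
}

public static class AnchorGeometry
{
    // templateAnchor: where the anchor sat on the page used to build the template.
    // templateField:  where the field sat on that same page.
    // foundAnchor:    where the anchor was found on the page being parsed.
    // Returns the field rectangle shifted by the anchor's displacement.
    public static Rect ResolveField(Rect templateAnchor, Rect templateField, Rect foundAnchor)
    {
        double dx = foundAnchor.X - templateAnchor.X;
        double dy = foundAnchor.Y - templateAnchor.Y;
        return new Rect(templateField.X + dx, templateField.Y + dy,
                        templateField.Width, templateField.Height);
    }
}
```

With this arrangement, a field keeps its position relative to the anchor even when the whole block of content moves from page to page.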

Within this scope, PdfDocumentParser is capable of the following:

  • search/extract text represented by PDF entities
  • search/extract text obtained by OCR
  • search/compare/extract page fragments as images

PdfDocumentParser also allows checking custom conditions on a PDF page to decide which actions should be taken on it.

In addition, PdfDocumentParser provides a facility for parsing tables into arrays.
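As a rough illustration of what parsing a table into arrays can involve, the sketch below groups positioned text fragments into rows and assigns them to columns by their left edges. `TextFragment` and `TableSketch` are hypothetical names, and this is not how PdfDocumentParser itself is implemented; it is only one simple way to do it, assuming the column boundaries are known in advance.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical types for illustration only; not the PdfDocumentParser API.
public sealed class TextFragment
{
    public double X { get; }
    public double Y { get; }
    public string Text { get; }

    public TextFragment(double x, double y, string text)
    {
        X = x;
        Y = y;
        Text = text;
    }
}

public static class TableSketch
{
    // columnLefts:  left edge of every column, in ascending order.
    // rowTolerance: a fragment joins the current row if its Y differs from the
    //               row's first fragment by no more than this value.
    public static string[][] ToArray(IEnumerable<TextFragment> fragments, double[] columnLefts, double rowTolerance)
    {
        var rows = new List<List<TextFragment>>();
        foreach (var fragment in fragments.OrderBy(t => t.Y).ThenBy(t => t.X))
        {
            List<TextFragment> lastRow = rows.Count > 0 ? rows[rows.Count - 1] : null;
            if (lastRow != null && Math.Abs(fragment.Y - lastRow[0].Y) <= rowTolerance)
                lastRow.Add(fragment);
            else
                rows.Add(new List<TextFragment> { fragment });
        }

        return rows.Select(row =>
        {
            var cells = Enumerable.Repeat("", columnLefts.Length).ToArray();
            foreach (var item in row.OrderBy(t => t.X))
            {
                // The fragment goes to the last column whose left edge is not to its right.
                int column = Array.FindLastIndex(columnLefts, left => left <= item.X);
                if (column < 0)
                    column = 0;
                cells[column] = cells[column].Length == 0 ? item.Text : cells[column] + " " + item.Text;
            }
            return cells;
        }).ToArray();
    }
}
```

Fragments that fall into the same cell are joined with a space, so a multi-word cell comes out as one string.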

Template Editor

To parse a PDF document, PdfDocumentParser must be supplied with a parsing template that corresponds to the document's layout. For this purpose, PdfDocumentParser provides Template Editor, a GUI tool that makes it easy to create and debug parsing templates. Template Editor is meant to be invoked by the hosting application.

Application

An application based on PdfDocumentParser has to take care of the following main aspects:

  • provide storage and management of parsing templates
  • allow a user to create and modify templates with Template Editor
  • implement a custom algorithm of processing PDF files:
    • choose a template to be applied on a PDF page
    • process data parsed by the chosen template

Algorithm

A basic algorithm for processing a PDF file page by page could look like this:

//Pseudo-code: processing a PDF file where every page requires choosing a new template.
//Note: the classes and methods are not real and serve simplicity and clarity only.

foreach(page in pdfFile)
{
    //find the right template for the page;
    //reset first so that the previous page's template is not reused
    PdfDocumentParser.ActiveTemplate = null;
    foreach(template in templates)
    {
        PdfDocumentParser.ActiveTemplate = template;
        if(PdfDocumentParser.IsCondition(page, "RightTemplateForPage"))
            break;
        PdfDocumentParser.ActiveTemplate = null;
    }
    
    if(PdfDocumentParser.ActiveTemplate == null)
    {
        logWarning("No template matched page: " + page.Number);
        continue;
    }
        
    //applying the chosen template to the page 
    object value1 = PdfDocumentParser.GetValue(page, "field1");
    //doing something with value1...
    <...>    
    object value2 = PdfDocumentParser.GetValue(page, "field2");
    //doing something with value2...
    <...>
}
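The template-selection step of the pseudo-code can also be written as compilable C#. The sketch below is only an illustration: `IPageTemplate`, `DelegateTemplate` and `TemplateSelector` are hypothetical stand-ins, not PdfDocumentParser types, and a page is reduced to its number to keep the example self-contained.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical abstractions for illustration only; not the PdfDocumentParser API.
public interface IPageTemplate
{
    string Name { get; }

    // True if the named condition, as defined by this template, holds for the page.
    bool IsCondition(int pageNumber, string conditionName);
}

// A template whose conditions are supplied as a delegate, handy for trying the selector out.
public sealed class DelegateTemplate : IPageTemplate
{
    private readonly Func<int, string, bool> check;

    public string Name { get; }

    public DelegateTemplate(string name, Func<int, string, bool> check)
    {
        Name = name;
        this.check = check;
    }

    public bool IsCondition(int pageNumber, string conditionName)
    {
        return check(pageNumber, conditionName);
    }
}

public static class TemplateSelector
{
    // Returns the first template whose "RightTemplateForPage" condition holds
    // for the page, or null if no template matches.
    public static IPageTemplate SelectFor(int pageNumber, IEnumerable<IPageTemplate> templates)
    {
        foreach (var template in templates)
        {
            if (template.IsCondition(pageNumber, "RightTemplateForPage"))
                return template;
        }
        return null;
    }
}
```

Returning null for an unmatched page leaves it to the caller to log a warning and skip the page, as the pseudo-code does.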

Note that conditions like 'RightTemplateForPage' are named and given their meaning by the custom application; PdfDocumentParser only provides the facility for checking them. Because of that, the parsing logic can be as complex as needed.

How exactly a condition is checked is up to the template, because every template provides its own definition of it. A condition definition is a boolean expression over what was and was not found on the PDF page.

For instance, when processing invoices, 'RightTemplateForPage' might check whether the company's name or logo is present on the PDF page and thus detect whether the page corresponds to the template.
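To make the notion of a per-template condition definition more concrete, here is a minimal sketch. The names (`TemplateSketch`, `ConditionExample`, and the anchor names "CompanyName", "Logo" and "CreditNoteTitle") are invented for illustration and do not come from PdfDocumentParser. A condition is modelled as a boolean expression over the set of anchors that were found on the page.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types for illustration only; not the PdfDocumentParser API.
public sealed class TemplateSketch
{
    // Condition name -> boolean expression over the names of anchors found on the page.
    private readonly Dictionary<string, Func<ISet<string>, bool>> conditions =
        new Dictionary<string, Func<ISet<string>, bool>>();

    public string Name { get; }

    public TemplateSketch(string name)
    {
        Name = name;
    }

    // Every template supplies its own definition for a condition name.
    public TemplateSketch Define(string conditionName, Func<ISet<string>, bool> expression)
    {
        conditions[conditionName] = expression;
        return this;
    }

    public bool IsCondition(ISet<string> foundAnchors, string conditionName)
    {
        Func<ISet<string>, bool> expression;
        if (!conditions.TryGetValue(conditionName, out expression))
            throw new ArgumentException("Condition '" + conditionName + "' is not defined in template '" + Name + "'.");
        return expression(foundAnchors);
    }
}

public static class ConditionExample
{
    // An invoice template: the company name or logo must be found,
    // and the (invented) credit note title must not be found.
    public static TemplateSketch BuildInvoiceTemplate()
    {
        return new TemplateSketch("Invoice").Define(
            "RightTemplateForPage",
            found => (found.Contains("CompanyName") || found.Contains("Logo"))
                     && !found.Contains("CreditNoteTitle"));
    }
}
```

The application only asks whether 'RightTemplateForPage' holds; what that means for a given layout stays inside the template's own definition.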

Creating a VS Solution

Do not download the latest code straight from a branch, because it may be under development. Instead, go to the releases and download the source code of the latest (pre-)release. Find SampleParser.sln there and open it in Visual Studio. It gives a complete example of using PdfDocumentParser that you can modify to fit your requirements.

Steps in Visual Studio if building from scratch without SampleParser:

  • Create your project.
  • Add the PdfDocumentParser project to the solution.
  • Reference PdfDocumentParser in your project.
  • Update the NuGet packages for the solution.
  • Start developing your parser using the PdfDocumentParser API.

Enjoy!

History

  • 12th February, 2020: Initial version

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)

About the Author

Sergey Stoyan
Architect CliverSoft (www.cliversoft.com)
Ukraine
Sergey graduated as an applied mathematician. He specializes in custom development, including parsing tools and client/server applications.
github: https://github.com/sergeystoyan
site: http://cliversoft.com

Article
Posted 26 Dec 2018