PdfView - Peeking into the Internals of PDFs

6 Oct 2005
A utility for viewing the internal structure of PDF documents.

Image 1

Introduction

PdfView is a utility that displays the structural elements of a PDF document. Since its introduction in 1993, PDF has gained popularity as the format for exchanging electronic documents and forms. It is possible to create a well-formed PDF using a text editor, and this simplicity lets developers create PDF documents with in-house solutions, without resorting to external toolkits. The problem is that, after a while, it becomes difficult to find your way around a document you have created, due to the format's hierarchical structure and the common use of indirect references within its objects. What's more, most PDF documents are a mixture of text and binary data. PdfView addresses that problem by letting you traverse the PDF document tree visually.

Background

Portable Document Format (PDF) is a file format developed by Adobe Systems for representing documents in a manner that is independent of the original application software, hardware, and operating system used to create those documents. A PDF file can describe documents containing any combination of text, graphics, and images in a device-independent and resolution-independent format. These documents can be one page or thousands of pages long, and very simple or extremely complex, with a rich use of fonts, graphics, colour, and images. PDF is an open standard, and anyone can write applications that read or write PDFs royalty-free.

Main PDF concepts

PDF supports seven basic types of objects: booleans, numbers, strings, names, arrays, dictionaries, and streams. (The PDF specification also defines a null object.) Booleans, numbers, and strings are simple values. As they are not nested, PdfView simply displays them as values (Image 2).

An array (Image 3) is a sequence of PDF objects and may contain a mixture of object types. A dictionary (Image 4) is an associative table containing pairs of objects. The first element of each pair is the key, which must be a name. The second element is the value, which can be any kind of object, including another dictionary.

A stream (Image 5) consists of a dictionary that describes a sequence of characters, followed by the keyword stream, followed by zero or more lines of characters, followed by the keyword endstream. Since streams are basically binary blobs, PdfView simply skips stream blocks.

An indirect reference (Image 6) is a reference to an indirect object, and consists of the indirect object's object number, generation number, and the R keyword, for example 12 0 R. The cross reference table contains information that permits random access to indirect objects in the file, so that the entire file need not be read to locate any particular object.
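To make the indirect reference syntax concrete, here is a minimal C++ sketch that recognizes the pattern "object number, generation number, R" at the start of a piece of text. It is not taken from the PdfView source: the IndirectRef struct and the parseIndirectRef function are hypothetical names chosen for illustration, and the sketch ignores comments and integer overflow.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical holder for the two numbers of an indirect reference.
struct IndirectRef {
    unsigned long objectNumber;
    unsigned long generation;
};

// PDF white-space characters: NUL, tab, line feed, form feed, carriage return, space.
static bool isPdfSpace(char c) {
    return c == '\0' || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
}

// PDF delimiter characters, any of which ends a keyword such as R.
static bool isPdfDelimiter(char c) {
    return std::string("()<>[]{}/%").find(c) != std::string::npos;
}

// Skips white space; returns true if at least one character was skipped.
static bool skipSpace(const std::string& s, std::size_t& pos) {
    std::size_t start = pos;
    while (pos < s.size() && isPdfSpace(s[pos])) ++pos;
    return pos > start;
}

// Reads an unsigned decimal integer at pos (no overflow checking in this sketch).
static bool readUnsigned(const std::string& s, std::size_t& pos, unsigned long& out) {
    std::size_t start = pos;
    unsigned long value = 0;
    while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]))) {
        value = value * 10 + static_cast<unsigned long>(s[pos] - '0');
        ++pos;
    }
    if (pos == start) return false;
    out = value;
    return true;
}

// Returns true and fills ref if text starts with "<number> <generation> R".
bool parseIndirectRef(const std::string& text, IndirectRef& ref) {
    std::size_t pos = 0;
    skipSpace(text, pos);  // leading white space is optional
    IndirectRef parsed = {0, 0};
    if (!readUnsigned(text, pos, parsed.objectNumber)) return false;
    if (!skipSpace(text, pos)) return false;
    if (!readUnsigned(text, pos, parsed.generation)) return false;
    if (!skipSpace(text, pos)) return false;
    if (pos >= text.size() || text[pos] != 'R') return false;
    ++pos;
    // R must be followed by end of input, white space, or a delimiter.
    if (pos < text.size() && !isPdfSpace(text[pos]) && !isPdfDelimiter(text[pos])) return false;
    ref = parsed;
    return true;
}
```

Called on "12 0 R", the function reports object number 12 and generation 0, while "12 0 obj" is rejected because it is the start of an object definition, not a reference to one.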

The trailer enables an application reading a PDF file to quickly find the cross reference table and certain special objects. Applications should therefore read a PDF file from its end: the trailer dictionary sits near the very end of the document and serves as the root of the PDF object tree.
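Reading from the end works because the last lines of a PDF file contain the startxref keyword followed by the byte offset of the cross reference table. The following C++ sketch shows one way to recover that offset from a file already loaded into memory. The function name findStartXref is hypothetical and not part of PdfView, and a real reader would also validate that the offset lies inside the file.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns true and sets xrefOffset to the number that follows the LAST
// "startxref" keyword in the buffer. Searching backwards matters because
// incrementally updated files contain more than one startxref section.
bool findStartXref(const std::string& pdf, unsigned long& xrefOffset) {
    static const std::string keyword = "startxref";

    // rfind scans from the end of the buffer towards the beginning.
    std::size_t pos = pdf.rfind(keyword);
    if (pos == std::string::npos) return false;
    pos += keyword.size();

    // Skip the end-of-line marker(s) between the keyword and the offset.
    while (pos < pdf.size() && std::isspace(static_cast<unsigned char>(pdf[pos]))) ++pos;

    // Read the decimal byte offset (no overflow checking in this sketch).
    std::size_t start = pos;
    unsigned long value = 0;
    while (pos < pdf.size() && std::isdigit(static_cast<unsigned char>(pdf[pos]))) {
        value = value * 10 + static_cast<unsigned long>(pdf[pos] - '0');
        ++pos;
    }
    if (pos == start) return false;  // keyword present but no number after it

    xrefOffset = value;
    return true;
}
```

Once the offset is known, a reader can jump straight to the cross reference table and from there to any indirect object, without scanning the rest of the file.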

Using the code

PdfView is a typical MFC Document/View application. It is a utility in itself, and the code within was not intended to be reused in other applications. However, let me summarize the main classes:

CBRawPdf: This class stores the currently displayed file as a byte array. CBPdf uses it to traverse that byte array. The class has no knowledge of higher level PDF structures such as dictionaries, arrays, and cross reference tables. It performs navigational tasks such as getting the next or previous token or line.

CBPdf: This class deals with the higher level structure of the PDF. It uses CBRawPdf to traverse the document. It can render a PDF file in a tree or a rich text control.

CBPdfValue, CBPdfReference, CBPdfArray, CBPdfDictionary, CBPdfStream: Each one of these classes stores a type of PDF object, namely values, references, arrays, dictionaries, and streams. All are derived from the same base class, CBPdfObject.
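The shape of such a hierarchy can be sketched as follows. This is an illustration only, not the PdfView source: the class names PdfObject, PdfValue, PdfReference, PdfArray, and PdfDictionary and the describe() method are all hypothetical, streams are left out for brevity, and the real classes are MFC-based and may be organized quite differently.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical common base: every PDF object can describe itself as text.
class PdfObject {
public:
    virtual ~PdfObject() {}
    virtual std::string describe() const = 0;
};

// A simple, non-nested value such as a boolean, number, string, or name.
class PdfValue : public PdfObject {
public:
    explicit PdfValue(const std::string& text) : text_(text) {}
    std::string describe() const override { return text_; }
private:
    std::string text_;
};

// An indirect reference: object number, generation number, R keyword.
class PdfReference : public PdfObject {
public:
    PdfReference(unsigned long number, unsigned long generation)
        : number_(number), generation_(generation) {}
    std::string describe() const override {
        return std::to_string(number_) + " " + std::to_string(generation_) + " R";
    }
private:
    unsigned long number_;
    unsigned long generation_;
};

// An array owns a sequence of objects of any type.
class PdfArray : public PdfObject {
public:
    void add(std::unique_ptr<PdfObject> item) { items_.push_back(std::move(item)); }
    std::string describe() const override {
        std::string out = "[";
        for (std::size_t i = 0; i < items_.size(); ++i) {
            if (i != 0) out += " ";
            out += items_[i]->describe();
        }
        return out + "]";
    }
private:
    std::vector<std::unique_ptr<PdfObject>> items_;
};

// A dictionary maps names (stored without the leading slash) to objects.
// std::map keeps the keys sorted, so describe() lists them alphabetically.
class PdfDictionary : public PdfObject {
public:
    void set(const std::string& key, std::unique_ptr<PdfObject> value) {
        entries_[key] = std::move(value);
    }
    std::string describe() const override {
        std::string out = "<<";
        for (const auto& entry : entries_) {
            out += " /" + entry.first + " " + entry.second->describe();
        }
        return out + " >>";
    }
private:
    std::map<std::string, std::unique_ptr<PdfObject>> entries_;
};
```

A common base class with one virtual method lets a tree view or a text renderer walk nested arrays and dictionaries without knowing the concrete type of each child.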

Graph visualization of PDF objects

Optionally, the utility enables you to create a relational graph of the objects within the PDF file. For this, it needs Graphviz.

Graphviz is open source graph visualization software. It includes several graph layout programs, which take descriptions of graphs in a simple text language and produce diagrams in several useful formats, such as images and SVG for web pages, PostScript for inclusion in PDF or other documents, or a display in an interactive graph browser.

After you open a PDF file using the utility, you can create a Graphviz compatible text file by selecting "File | Save As Dot File...". After that, the following command converts that text file to an image file:

dot.exe -Tgif pdf.dot -o pdf.gif

which gives you an image similar to the following one:

Image 7

It is important to note that large PDF files can have thousands of objects. Graphviz struggles with such files, and the output image tends to be huge. To prevent this, I have hard-coded a maximum limit of 250 objects into the utility. Experienced users can remove that limit, simplify the generated text file by removing the objects that are not needed in the graph, and then create the image file.
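The idea of writing a capped graph description can be sketched in C++ as follows. This is an assumption-laden illustration, not the code PdfView uses: the ReferenceMap type, the toDot function, the objN node names, and the choice to keep the lowest-numbered objects are all hypothetical, and the actual layout of the file PdfView saves may differ.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>

// Hypothetical input: object number -> numbers of the objects it references.
typedef std::map<unsigned long, std::set<unsigned long> > ReferenceMap;

// Builds a Graphviz "digraph" with at most maxObjects nodes. Objects are
// kept in ascending object-number order; edges are written only when both
// ends were kept, so the capped graph never mentions a dropped object.
std::string toDot(const ReferenceMap& refs, std::size_t maxObjects) {
    // First pass: decide which objects survive the cap.
    std::set<unsigned long> kept;
    for (ReferenceMap::const_iterator it = refs.begin();
         it != refs.end() && kept.size() < maxObjects; ++it) {
        kept.insert(it->first);
    }

    // Second pass: write one node line per kept object, then its edges.
    std::ostringstream out;
    out << "digraph pdf {\n";
    for (ReferenceMap::const_iterator it = refs.begin(); it != refs.end(); ++it) {
        if (kept.count(it->first) == 0) continue;
        out << "  obj" << it->first << ";\n";
        for (std::set<unsigned long>::const_iterator target = it->second.begin();
             target != it->second.end(); ++target) {
            if (kept.count(*target) != 0) {
                out << "  obj" << it->first << " -> obj" << *target << ";\n";
            }
        }
    }
    out << "}\n";
    return out.str();
}
```

Saving the returned text as pdf.dot and running the dot command shown earlier would render it. Passing 250 as maxObjects mirrors the limit described above.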

A final note

Since there are dozens of PDF generators, there are probably some PDF documents that this utility cannot parse correctly. If you e-mail me a link to these documents, I can update the utility to support these documents as well.

History

  • 07th October, 2005: Version 1.1 (Graph visualization)
  • 25th September, 2005: Version 1.0

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

Written By
Turkey