PdfView - Peeking into the Internals of PDFs

By Bedri Egrilmez, 6 Oct 2005
 

Introduction

PdfView is a utility that displays the structural elements of a PDF document. Since its introduction in 1993, PDF has become a popular format for exchanging electronic documents and forms. It is possible to create a well-formed PDF using nothing but a text editor, and this simplicity lets developers generate PDF documents with in-house solutions instead of external toolkits. The problem is that such a document soon becomes difficult to navigate, due to the format's hierarchical structure and the common use of indirect references within its objects. What's more, most PDF documents are a mixture of text and binary data. PdfView addresses that problem by letting you traverse the PDF document tree visually.

Background

Portable Document Format (PDF) is a file format developed by Adobe Systems for representing documents in a manner that is independent of the original application software, hardware, and operating system used to create those documents. A PDF file can describe documents containing any combination of text, graphics, and images in a device independent and resolution independent format. These documents can be one page or thousands of pages, very simple or extremely complex with a rich use of fonts, graphics, colour, and images. PDF is an open standard, and anyone can write applications that can read or write PDFs royalty-free.

Main PDF concepts

PDF supports seven basic types of objects: booleans, numbers, strings, names, arrays, dictionaries, and streams. Booleans, numbers, and strings are simple values. As they are not nested, PdfView simply displays them as values.

An array is a sequence of PDF objects, and may contain a mixture of object types. A dictionary is an associative table containing pairs of objects. The first element of each pair is called the key and the second element is called the value. The key must be a name. A value can be any kind of object, including another dictionary.

A stream consists of a dictionary that describes a sequence of bytes, followed by the keyword stream, followed by zero or more lines of data, followed by the keyword endstream. Since streams are basically binary blobs, PdfView skips stream blocks.

An indirect reference is a reference to an indirect object, and consists of the indirect object's object number, its generation number, and the R keyword. The cross reference table contains information that permits random access to indirect objects in the file, so that the entire file need not be read to locate any particular object.
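To make the reference syntax concrete, here is a minimal Python sketch that picks indirect references out of a dictionary body. It is not taken from PdfView's source; the name `find_references` and the sample dictionary are assumptions made for illustration, and the regular expression is a simplification that does not skip strings, comments, or stream data the way a real parser must.

```python
import re

# An indirect reference is "<object number> <generation number> R".
# The lookbehind and lookahead stop us from matching digits or an "R" that
# belong to a longer token. Simplification: strings and comments are not skipped.
REFERENCE = re.compile(rb"(?<![\w.+-])(\d+)\s+(\d+)\s+R(?![\w])")


def find_references(body: bytes) -> list[tuple[int, int]]:
    """Return (object number, generation number) pairs found in `body`."""
    return [(int(num), int(gen)) for num, gen in REFERENCE.findall(body)]


# Hypothetical page dictionary: two values are references, one is a plain number.
sample = b"<< /Type /Page /Parent 3 0 R /Contents 7 0 R /Rotate 90 >>"
print(find_references(sample))
```

Following each pair through the cross reference table is what turns the flat file into the tree that a viewer can display.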

The trailer enables an application reading a PDF file to quickly find the cross reference table and certain special objects. Applications should read a PDF file from its end. The trailer dictionary is near the very end of the PDF document. It is the root of a PDF object tree.
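Reading from the end can be sketched as follows. This is an illustrative Python outline, not PdfView's code: the function names, the 1024-byte tail size, and the tiny hand-built file are assumptions, and only a classic `xref` table is handled (cross-reference streams, introduced in PDF 1.5, are not).

```python
import re


def find_xref_offset(data: bytes, tail_size: int = 1024) -> int:
    """Return the byte offset written after the last 'startxref' keyword."""
    tail = data[-tail_size:]
    pos = tail.rfind(b"startxref")
    if pos < 0:
        raise ValueError("startxref keyword not found")
    match = re.match(rb"startxref\s+(\d+)", tail[pos:])
    if match is None:
        raise ValueError("no offset after startxref")
    return int(match.group(1))


def read_xref_table(data: bytes, offset: int) -> dict[int, int]:
    """Map the object numbers of in-use entries to their byte offsets."""
    lines = data[offset:].splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("no xref table at the given offset")
    offsets = {}
    i = 1
    while i < len(lines):
        header = lines[i].split()  # subsection header: first number, count
        if len(header) != 2 or not (header[0].isdigit() and header[1].isdigit()):
            break  # reached the 'trailer' keyword
        first, count = int(header[0]), int(header[1])
        for n in range(count):
            entry = lines[i + 1 + n].split()  # offset, generation, 'n' or 'f'
            if entry[2] == b"n":  # 'f' marks a free entry
                offsets[first + n] = int(entry[0])
        i += 1 + count
    return offsets


# Hand-built minimal file, just large enough to exercise the two functions.
body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref_at = len(body)
data = body + (
    b"xref\n0 2\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    b"startxref\n" + str(xref_at).encode() + b"\n%%EOF\n"
)
table = read_xref_table(data, find_xref_offset(data))
print(table)
```

The offset stored for object 1 points directly at the bytes `1 0 obj`, which is the random access described above.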

Using the code

PdfView is a typical MFC Document/View application. It is a utility in itself, and the code within was not intended to be reused in other applications. However, let me summarize the main classes:

CBRawPdf: This class stores the currently displayed file as a byte array. CBPdf uses it to traverse that byte array. The class has no knowledge of higher-level PDF structures such as dictionaries, arrays, and cross reference tables. It performs navigational tasks such as getting the next or previous token or line.

CBPdf: This class deals with the higher-level structure of the PDF. It uses CBRawPdf to traverse the document. It can render a PDF file in a tree control or a rich text control.

CBPdfValue, CBPdfReference, CBPdfArray, CBPdfDictionary, CBPdfStream: Each one of these classes stores a type of PDF object, namely values, references, arrays, dictionaries, and streams. All are derived from the same base class, CBPdfObject.
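The shape of such a hierarchy can be sketched in a few lines. The following is a rough Python analogue, not a translation of the actual C++/MFC classes, whose members the article does not show; every class name, method name, and the `render` function are hypothetical, and streams are left out for brevity.

```python
from dataclasses import dataclass, field


class PdfObject:
    """Common base, playing the role CBPdfObject has in the article."""

    def label(self) -> str:
        raise NotImplementedError

    def children(self) -> list:
        return []  # leaf objects have no children


@dataclass
class PdfValue(PdfObject):
    text: str

    def label(self) -> str:
        return self.text


@dataclass
class PdfReference(PdfObject):
    number: int
    generation: int

    def label(self) -> str:
        return f"{self.number} {self.generation} R"


@dataclass
class PdfArray(PdfObject):
    items: list = field(default_factory=list)

    def label(self) -> str:
        return "[array]"

    def children(self) -> list:
        return [(str(i), item) for i, item in enumerate(self.items)]


@dataclass
class PdfDictionary(PdfObject):
    entries: dict = field(default_factory=dict)

    def label(self) -> str:
        return "<<dictionary>>"

    def children(self) -> list:
        return list(self.entries.items())


def render(obj: PdfObject, name: str = "trailer", depth: int = 0) -> list:
    """Flatten the object tree into indented lines, one per tree node."""
    lines = [f"{'  ' * depth}{name}: {obj.label()}"]
    for child_name, child in obj.children():
        lines.extend(render(child, child_name, depth + 1))
    return lines


# Hypothetical trailer dictionary used as the root of the tree.
tree = PdfDictionary({
    "/Size": PdfValue("8"),
    "/Root": PdfReference(1, 0),
    "/ID": PdfArray([PdfValue("<0a1b>"), PdfValue("<0a1b>")]),
})
print("\n".join(render(tree)))
```

Because every type shares one base class, the code that fills a tree view only needs `label` and `children` and never has to know which concrete type it is looking at.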

Graph visualization of PDF objects

Optionally, the utility enables you to create a relational graph of the objects within the PDF file. For this, it needs Graphviz.

Graphviz is open source graph visualization software with several graph layout programs. These programs take descriptions of graphs in a simple text language and produce diagrams in several useful formats, such as images and SVG for web pages or PostScript for inclusion in PDF and other documents; the diagrams can also be displayed in an interactive graph browser.

After you open a PDF file using the utility, you can create a Graphviz compatible text file by selecting "File | Save As Dot File...". After that, the following command converts that text file to an image file:

dot.exe -Tgif pdf.dot -o pdf.gif

which gives you an image similar to the following one:

[Figure: Graphviz rendering of the PDF object graph]

Note that large PDF files contain thousands of objects. Graphviz copes poorly with such files, as the output image tends to be huge. To prevent this, I have hard-coded a limit of 250 objects into the utility. Experienced users can remove that limit, simplify the generated text file by deleting the objects that are not needed in the graph, and then create the image file.
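For readers who have not seen the DOT language, the idea of an object-count cap can be sketched like this. The Python function below is an assumption-laden illustration: the exact text PdfView writes is not shown in the article, so the node names, labels, the `to_dot` function, and the choice to keep the lowest object numbers are all hypothetical.

```python
def to_dot(references: dict[int, list[int]], max_objects: int = 250) -> str:
    """Build Graphviz DOT text: one node per object, one edge per reference.

    Only the first `max_objects` object numbers (ascending) are kept, and
    edges pointing at an object that was not kept are dropped as well.
    """
    kept = sorted(references)[:max_objects]
    kept_set = set(kept)
    lines = ["digraph pdf {"]
    for number in kept:
        lines.append(f'    obj{number} [label="obj {number}"];')
        for target in references[number]:
            if target in kept_set:
                lines.append(f"    obj{number} -> obj{target};")
    lines.append("}")
    return "\n".join(lines) + "\n"


# Hypothetical reference map: object number -> object numbers it refers to.
refs = {1: [2], 2: [3, 4], 3: [], 4: [2]}
print(to_dot(refs, max_objects=3))
```

Saving the returned text as pdf.dot and running the dot command shown earlier produces the image. Lowering `max_objects`, or deleting lines from the file by hand, is the kind of simplification described above.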

A final note

Since there are dozens of PDF generators, there are probably some PDF documents that this utility cannot parse correctly. If you e-mail me a link to these documents, I can update the utility to support these documents as well.

History

  • 07th October, 2005: Version 1.1 (Graph visualization)
  • 25th September, 2005: Version 1.0

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

About the Author

Bedri Egrilmez
Turkey

Comments and Discussions

Question: Images? (Kalpana Volety, 11 Jan '13 - 7:26)
Can you explain a bit more about the way that images are embedded inside pdf files. Especially the color space representations are very confusing.
 
Kalpana Volety
Extract Images from PDF

General: My vote of 5 (manoj kumar choubey, 9 Feb '12 - 21:59)
Very Nice ....

General: Optional content (ignaciordc, 28 Jan '07 - 11:39)
Bedri:
 
Very nice tool. Congratulations for your work.
In the final note you ask for new PDF documents, so that your tool can support them. I don't know how to e-mail them to you, but I can tell you that when a PDF file contains layers, the dictionary that implements the catalog of the document can contain several keys that are not currently displayed in your tree. In one document I see the /Metadata and /Version keys. Then, inside the /OCProperties dictionary, which contains the /D dictionary, other keys are missing, such as /OFF or /Category.
 
Any way to get in touch with you with the files and/or the screenshots of your tool?
 
Best regards,
Ignacio

General: Read image from pdf (OSDN, 17 Jun '06 - 2:49)
Thanks for the great article, but how can I read an image from a PDF and replace it with a new one?


General: something about pdf (Adrian Bacaianu, 15 Nov '05 - 9:25)
Hi again!
First of all, I want to thank you for your article. It looks very nice and light, and it provides a very good tool for those who work with PDF files and need to know their internal structure.

I have two questions for you:
1. Do you know anything about how to put an electronic signature hash into a PDF file?
2. Do you have more standalone C# samples for working with tree graphs (similar to the one you provide in the picture)?
 
Adrian Bacaianu

General: Missing images ! (Adrian Bacaianu, 13 Nov '05 - 0:38)
Missing images !
 
Adrian Bacaianu

General: Re: Missing images ! (Bedri Egrilmez, 15 Nov '05 - 4:13)
Hi,
 
Perry Zh posted a similar message to yours last month (the next message on this page). I have not changed anything for at least one month.
 
There seem to be some intermittent problems with the CodeProject server. I sometimes get 404 errors when I try to download the zip files too, or cannot see the images.

So my only advice would be to take a deep breath (and pray if you are religious :laugh:) and refresh the page.

General: Re: Missing images ! (Adrian Bacaianu, 15 Nov '05 - 9:19)
Yes, correct, you are right!
After refreshing the page a few times, I get all the pictures.
Thank you!
 
Adrian Bacaianu

General: Unable to download Src and Demo :( (Perry Zh, 29 Sep '05 - 21:33)
File not Found
The file '/useritems/pdfview/pdfview_demo.zip' doesn't exist. Please contact webmaster@codeproject.com.

 
旧日重来

General: Re: Unable to download Src and Demo :( (Bedri Egrilmez, 29 Sep '05 - 21:50)
Hi,

A couple of friends and I tested the links and they seem to work :confused:, so for anyone who has experienced similar problems, I have uploaded both the application and the source code to an alternate location. Hopefully, it should work.
 
The link is:
http://rapidshare.de/files/5696328/pdfview.zip.html

General: Re: Unable to download Src and Demo :( (Perry Zh, 29 Sep '05 - 22:26)
I tried again several times and found that sometimes it works and sometimes it does not. Maybe there's something wrong with the CodeProject server.
Sorry for bothering you, and thanks for the reply :)
 
旧日重来

General: Enhancement Request (Darren Schroeder, 26 Sep '05 - 1:56)
First of all. Thank you. I've been looking for such a utility for a long time. The one thing that I'd like to see added is the ability to select and jump to the part of the PDF file when you double click on an item in the tree. I find myself scrolling all over the document to find where this tree element is.
 
Darren

General: Few suggestions if I may... (Pandele Florin, 25 Sep '05 - 21:37)
Take a look at
http://www.codeproject.com/editctrl/scintillawnd.asp
It could prove very useful for the right view.
Also it could jump into the file at the appropriate location when clicking a tree item.
This could mean a 5 star rating for you and the end of expensive PDF toolkits.;)
Thank You!

Question: extraction? (Unruled Boy, 25 Sep '05 - 17:08)
how about text extraction?
 
such as pdf->txt?
 
Regards,
unruledboy@hotmail.com

Answer: Re: extraction? (Jun Du, 11 Oct '05 - 9:25)
Check out this freeware:
http://www.etrusoft.com/pdf-to-word-html/pdf-to-text.htm
 
jd

General: Re: extraction? (Marco Tenuti, 6 Jan '10 - 19:47)
I tried this PDF-to-Word tool by Etrusoft, but it is not a library, so it cannot be reused or linked into a project and, what's more, it is really useless with some text encodings.
 
Marco Tenuti - www.tencas.com

Last Updated 6 Oct 2005
Article Copyright 2005 by Bedri Egrilmez
Everything else Copyright © CodeProject, 1999-2013