
1. Introduction
This project lets you read and parse a PDF file and display its internal structure. The PDF file specification is available from Adobe. This
project is based on “PDF Reference, Sixth Edition, Adobe Portable Document Format Version 1.7 November 2006”. It is an intimidating 1310-page document.
This article provides a concise overview of the specification. The associated project defines C# classes for reading and parsing a PDF file. To test these
classes, the attached test program PdfFileAnalyzer lets you read a PDF file, analyze it, and display and save the result. The program breaks the PDF file
into individual page descriptions, fonts, images and other objects. Two types of PDF files are not supported by this program: encrypted files and multi-generation files.
Revision 1.1 of this program allows programmers in world regions that use a comma as the decimal separator to compile and run the program.
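The decimal separator problem typically comes from culture-sensitive number parsing. The following is a minimal sketch of the general technique, not necessarily the change made in revision 1.1: PDF numbers always use a period, so parsing and formatting can be pinned to CultureInfo.InvariantCulture. The class and method names are hypothetical.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not part of the PdfFileAnalyzer source.
public static class PdfNumberText
{
    // Parse a PDF real number. PDF always uses '.' as the decimal separator,
    // so the current culture must not influence the result.
    public static double ParseReal(string text)
    {
        return double.Parse(text, NumberStyles.Float, CultureInfo.InvariantCulture);
    }

    // Format a number for output in PDF style, regardless of the current culture.
    public static string FormatReal(double value)
    {
        return value.ToString("0.######", CultureInfo.InvariantCulture);
    }
}
```

Without the explicit culture argument, a string such as 56.424 may be rejected or misread on a system whose regional settings use a comma as the decimal separator.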
If you are interested in incorporating a PDF file writer into your application, please read the "PDF File Writer C# Class Library" article.
2. Overview
The PDF file is structured to allow Adobe Acrobat to display and print each page on a variety of screens and printers. If you open the file with a binary
editor you will see that most of the file is unreadable. The small sections that are readable look like:
1 0 obj
<</Lang(en-CA)/MarkInfo<</Marked true>>/Pages 2 0 R
/StructTreeRoot 10 0 R/Type/Catalog>>
endobj
2 0 obj
<</Count 1/Kids[4 0 R]/Type/Pages>>
endobj
4 0 obj
<</Contents 5 0 R/Group <</CS/DeviceRGB /S/Transparency /Type/Group>>
/MediaBox[0 0 612 792] /Parent 2 0 R
/Resources <</Font <</F1 6 0 R /F2 8 0 R>>
/ProcSet[/PDF/Text/ImageB/ImageC/ImageI]>>
/StructParents 0/Tabs/S/Type/Page>>
endobj
5 0 obj
<</Filter/FlateDecode/Length 2319>>
stream
. . .
endstream
endobj
At first glance the file is made of objects enclosed between the “n 0 obj” and “endobj” keywords. The PDF term for these is indirect objects. The
numbers before “obj” are the object number and the generation number. Items enclosed within double angle brackets <<>> are dictionaries. Items
enclosed within square brackets [] are arrays. Items starting with a slash / are parameter names (e.g. /Pages). In the example above, the first item “1 0 obj”
is the document catalog, or root object. The catalog dictionary has an item “/Pages 2 0 R”. This is a reference to the object that defines the tree of
pages. In this case, object number 2 has a reference to one page, “/Kids[4 0 R]”, so this is a one-page document. Object number 4 is the only page definition. The
page size is 612 by 792 points, in other words 8.5” by 11” (1” is 72 points). The page uses two fonts, F1 and F2, which are defined in objects 6 and 8. The
page contents are described in object number 5. Object number 5 has a stream that describes the painting of the page. In the example, “. . .”
is a placeholder for this description. If you look at the PDF file with a binary editor, the stream looks like a long block of unreadable random
bytes. The reason is that you are looking at compressed data. The stream is compressed with the ZLib deflate method. This is specified in the
dictionary by “/Filter/FlateDecode”. The compressed stream is 2319 bytes long. If you decompress the stream, the first few items will look something like this:
q
37.08 56.424 537.84 679.18 re
W* n
/P <</MCID 0>> BDC 0.753 g
36.6 465.43 537.96 24.84 re
f*
EMC /P <</MCID 1/Lang (x-none)>> BDC BT
/F1 18 Tf
1 0 0 1 39.6 718.8 Tm
0 g
0 G
[(GRA)29(NOTECH LI)-3(MIT)-4(ED)] TJ
ET
This is a small sample of the page description language. In this example “re” stands for rectangle. The four numbers before it are the position and size: “X Y Width Height”.
This simplified example demonstrates the general idea behind PDF files. You start with a root object that points to a hierarchy of pages. Each page defines
resources such as fonts and images, and one or more contents streams. Contents streams are made
of the operators and arguments required to paint the pages. The PdfFileAnalyzer
produces an object summary file. This file contains all the objects without
the streams. Each stream is decoded and saved as a separate file. Page
descriptions are saved as text files. Image streams are saved as .jpg or .bmp
files. Font streams are saved as .ttf files. Other binary streams are
saved as .bin files. Text streams are saved as .txt files. Page descriptions go
through another parsing process that translates the cryptic operator codes of
one to three letters into pseudo C# source. As an example, the page description above is
translated to:
SaveGraphicsState(); // q
Rectangle(37.08, 56.424, 537.84, 679.18); // re
ClippingPathEvenOddRule(); // W*
NoPaint(); // n
BeginMarkedContentPropList("/P", "<</MCID 0>>"); // BDC
GrayLevelForNonStroking(0.753); // g
Rectangle(36.6, 465.43, 537.96, 24.84); // re
FillEvenOddRule(); // f*
EndMarkedContent(); // EMC
BeginMarkedContentPropList("/P", "<</Lang(x-none)/MCID 1>>"); // BDC
BeginText(); // BT
SelectFontAndSize("/F1", 18); // Tf
TextMatrix(1, 0, 0, 1, 39.6, 718.8); // Tm
GrayLevelForNonStroking(0); // g
GrayLevelForStroking(0); // G
ShowTextWithGlyphPos("[(GRA)29(NOTECH LI)-3(MIT)-4(ED)]"); // TJ
EndTextObject(); // ET
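The translation shown above can be sketched as a table lookup: the last token of a statement is the operator and everything before it is an argument. This is a minimal illustration, assuming numeric arguments separated by spaces. The method names in the table are taken from the listing above, while the PageSourceSketch class and its Translate method are hypothetical and not the actual PdfFileAnalyzer code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the operator-to-pseudo-C# translation.
public static class PageSourceSketch
{
    // Operator code -> method name, copied from the listing above (partial table).
    static readonly Dictionary<string, string> OperatorNames = new Dictionary<string, string>
    {
        { "q", "SaveGraphicsState" },
        { "re", "Rectangle" },
        { "W*", "ClippingPathEvenOddRule" },
        { "n", "NoPaint" },
        { "g", "GrayLevelForNonStroking" },
        { "G", "GrayLevelForStroking" },
        { "f*", "FillEvenOddRule" },
        { "BT", "BeginText" },
        { "ET", "EndTextObject" },
        { "Tm", "TextMatrix" },
    };

    // Translate one statement: arguments first, operator last.
    // Only simple space-separated numeric arguments are handled here.
    public static string Translate(string statement)
    {
        string[] tokens = statement.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        if (tokens.Length == 0) throw new ArgumentException("Empty statement");

        string op = tokens[tokens.Length - 1];
        string name;
        if (!OperatorNames.TryGetValue(op, out name)) name = "Unknown";

        // Join all tokens except the last one as the argument list.
        string args = string.Join(", ", tokens, 0, tokens.Length - 1);
        return name + "(" + args + "); // " + op;
    }
}
```

A real translator also has to handle names, strings, arrays and dictionaries as arguments, which requires the full parser described later in the article.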
The remaining part of this article goes into the PDF file structure and the
parsing process in more detail. The following sections cover object
definitions, file structure, file parsing, file reading, and using the PdfFileAnalyzer
program.
3. Disclaimer
The PdfFileAnalyzer will work with most PDF files. This was my experience
scanning many of the PDF files on my own system. However, the program does not
support encrypted files or multi-generation files (the second number before
obj is not zero). The number of features available in the PDF specification is
very large. It is not possible for a single developer to systematically
test all the features. If the program throws an exception during file
analysis, an error message is displayed showing the source code module name
and line number.
4. Object Definitions
A PDF file is made of objects. Each PDF object has a corresponding class in
the PdfFileAnalyzer project. All of these object classes are derived
from the PdfBase class. The source code for the object class definitions is in
BasicObjects.cs. The exact definition of PDF objects is available in Chapter 3 of
Adobe's PDF specification.
4.1. Basic Objects
- Boolean object is implemented by the PdfBoolean class. The PDF definition of
Boolean is the same as in C#.
- Integer object is implemented by the PdfInt class. The PDF definition is the
same as Int32 in C#.
- Real number object is implemented by the PdfReal class. The PDF definition
is the same as Single in C#.
- String object is implemented by the PdfStr class. The PDF definition is
different from C#. A string is made of bytes, not characters. It is enclosed
in parentheses (). The PdfFileAnalyzer saves the PDF string in a C# string
including the parentheses. A PDF string is useful for ASCII encoding.
- Hexadecimal string object is implemented by the PdfHex class. It is a string
of bytes written as two hex digits per byte and enclosed within angle
brackets <>. The PdfFileAnalyzer saves the PDF hex string in a C# string
including the angle brackets. For PDF readers the string and the hex string objects serve
the same purpose. The string (AB) is equivalent to <4142>. A PDF hex
string is useful for any encoding.
- Name object is implemented by the PdfName class. A name object is made of a
forward slash followed by a sequence of characters, for example /Width.
Name objects are used as parameter names. The PdfFileAnalyzer saves the
name object in a C# string including the leading /.
- Null object is implemented by the PdfNull class. The PDF definition of
null is basically the same as in C#.
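The equivalence between (AB) and <4142> can be checked with a small decoder. This is a sketch only: the HexStringSketch class is hypothetical and is not the PdfHex implementation. It relies on the rule in the PDF reference that a missing final digit is assumed to be 0, and for simplicity it uses the .NET definition of white space rather than the PDF one.

```csharp
using System;
using System.Text;

// Hypothetical sketch, not the PdfHex class from the project.
public static class HexStringSketch
{
    // Decode a PDF hex string such as "<4142>" into its bytes.
    public static byte[] Decode(string hex)
    {
        if (hex.Length < 2 || hex[0] != '<' || hex[hex.Length - 1] != '>')
            throw new FormatException("Hex string must be enclosed in <>");

        // Collect hex digits, ignoring white space between them.
        StringBuilder digits = new StringBuilder();
        for (int i = 1; i < hex.Length - 1; i++)
        {
            char c = hex[i];
            if (Uri.IsHexDigit(c)) digits.Append(c);
            else if (!char.IsWhiteSpace(c)) throw new FormatException("Invalid hex digit");
        }

        // An odd number of digits means the final digit is assumed to be 0.
        if (digits.Length % 2 != 0) digits.Append('0');

        byte[] result = new byte[digits.Length / 2];
        for (int i = 0; i < result.Length; i++)
            result[i] = Convert.ToByte(digits.ToString(2 * i, 2), 16);
        return result;
    }
}
```

For example, Encoding.ASCII.GetString(HexStringSketch.Decode("<4142>")) yields the two characters AB.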
4.2. Compound Objects
- Array object is implemented by the PdfArray class. A PDF array is a
collection of objects enclosed within square brackets []. The objects of one array
can be a mix of any type except stream. The PdfFileAnalyzer saves the objects
in a C# array of PdfBase. Since all objects are derived from
PdfBase, there is no problem saving a mix of object types within this array.
When an array object is converted to a string (ToString() method), the program
adds leading and trailing square brackets. An array can be empty. Example of
an array with six objects: [120 9.56 true null (string) <414243>].
- Dictionary object is implemented by the PdfDict class. A PDF dictionary
is a collection of key and value pairs enclosed within double angle brackets
<<>>. The dictionary key is a name object and the value is any
object except stream. The PdfFileAnalyzer saves one key value pair in the PdfPair
class. The key is a C# string and the value is a PdfBase. The PdfDict class has
an array of PdfPair objects. A dictionary is accessed by key, so pair
ordering is not important. PdfFileAnalyzer sorts the pairs by key. Example
of a dictionary with three pairs: <</CropBox [0 0 612 792] /Rotate 0 /Type /Page>>.
- Stream object is implemented by PdfStream. Streams are used to
hold page description language, images and fonts. A PDF stream is made of two
parts: a dictionary and a stream of bytes. The dictionary defines the stream
parameters. One of the stream dictionary entries is /Filter. The PDF specification
defines 10 types of filters. PdfFileAnalyzer supports 4 filters.
These 4 filters are the only ones I found to be in general use. The compression filter
FlateDecode is the filter most used by current PDF writers. FlateDecode is
ZLib deflate compression. The LZWDecode compression filter was used a few years
ago. In order to read older PDF files, this program supports this filter. The ASCII85Decode
filter converts printable ASCII to binary. The DCTDecode filter is JPEG image compression. The
PdfFileAnalyzer implements decoding for the first three. The DCTDecode stream
is saved as is with file extension .jpg. It is an image file that can be
viewed.
- Object stream was introduced in PDF 1.5. It is a stream that
contains multiple indirect objects (described below). Stream objects described
above are compressed one stream at a time. An object stream compresses all the included
objects in one compressed section.
- Cross-reference stream was introduced in PDF 1.5. It is a stream
that contains the cross-reference table described later in the article.
- Inline image object is implemented by PdfInlineImage. It is a
stream within a stream. An inline image is part of the page description language. It
is made of three operators: BI (begin image), ID (image data) and EI (end image). The
area between BI and ID is an image dictionary and the area between ID and EI is
the image data.
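The dictionary storage described above (pairs kept sorted by key and accessed by key) can be sketched as follows. The class names PdfBase, PdfInt, PdfName, PdfPair and PdfDict come from the article, but every member shown here (constructors, Add, Find, the fields) is an assumption made for illustration and may differ from the real source in BasicObjects.cs.

```csharp
using System;
using System.Collections.Generic;

// Simplified stand-ins for the project's classes; members are hypothetical.
public abstract class PdfBase { }

public class PdfInt : PdfBase
{
    public int Value;
    public PdfInt(int value) { Value = value; }
    public override string ToString() { return Value.ToString(); }
}

public class PdfName : PdfBase
{
    public string Name; // stored with the leading '/'
    public PdfName(string name) { Name = name; }
    public override string ToString() { return Name; }
}

public class PdfPair : IComparable<PdfPair>
{
    public string Key;
    public PdfBase Value;
    public PdfPair(string key, PdfBase value) { Key = key; Value = value; }
    public int CompareTo(PdfPair other) { return string.CompareOrdinal(Key, other.Key); }
}

public class PdfDict : PdfBase
{
    private readonly List<PdfPair> pairs = new List<PdfPair>();

    // Insert in key order so the list is always sorted; a repeated key replaces the old pair.
    public void Add(string key, PdfBase value)
    {
        PdfPair pair = new PdfPair(key, value);
        int index = pairs.BinarySearch(pair);
        if (index >= 0) pairs[index] = pair;
        else pairs.Insert(~index, pair);
    }

    // Look up a value by key; returns null if the key is absent.
    public PdfBase Find(string key)
    {
        int index = pairs.BinarySearch(new PdfPair(key, null));
        return index >= 0 ? pairs[index].Value : null;
    }

    public override string ToString()
    {
        List<string> parts = new List<string>();
        foreach (PdfPair pair in pairs) parts.Add(pair.Key + " " + pair.Value);
        return "<<" + string.Join(" ", parts.ToArray()) + ">>";
    }
}
```

Keeping the pairs sorted makes lookups a binary search and gives a stable order in the object summary, at the cost of no longer reproducing the order used in the original file.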
4.3. Indirect Objects
- Indirect object is implemented by PdfIndirectObject. It is the
main building block of a PDF document. An indirect object is any object enclosed
between “n 0 obj” and “endobj”. Other objects can refer to an indirect object by
specifying “n 0 R”. The “n” is the object number. The “0” is the generation
number. This program does not support a generation number other than 0. The PDF
specification allows for other numbers. The idea behind multi-generation is to
allow PDF modifications by keeping the original file and appending changes.
- Object reference is a way of referring to indirect objects. For
example, /Pages 2 0 R is a dictionary entry in the catalog object. It is a
pointer to the /Pages object. The pages object is indirect object number 2.
4.4. Operators and keywords
- Operators and keywords are not considered PDF objects. However,
the PdfFileAnalyzer program has PdfOp and PdfKeyword classes that are derived
from PdfBase. During the parsing process the parser creates a PdfOp
or a PdfKeyword for each valid sequence of characters. Appendix A, Operator
Summary, of Adobe's PDF specification lists all the operators. The list
is made of 73 operators. Here are some examples of operators: BT (begin text
object), G (set gray level for stroking operations), m (move to), re (rectangle) and
Tc (set character spacing). Examples of keywords: stream, obj, endobj, xref.
5. File Structure
A PDF file is made of four parts: header, body, cross-reference and trailer signature.
- Header: The header is the file signature. It must be %PDF-1.x
where x is 0 to 7.
- Body: The body area contains all the indirect objects.
- Cross-reference: The cross-reference is a table of file position
pointers to all indirect objects. There are two types of cross-reference tables.
The original style is made of ASCII characters. The new style is a stream within
an indirect object, with the information encoded as binary numbers. At the end of
the cross-reference table there is a trailer dictionary. A file can have more
than one cross-reference area.
- Trailer signature: The trailer signature is made of the keyword
“startxref”, the byte offset to the last cross-reference table, and the end signature
%%EOF. Please note: the trailer dictionary is part of the cross-reference area.
6. File Parsing
The PDF file is a sequence of bytes. Some of the bytes have special meaning.
White space is defined as: null, tab, line feed, form feed, carriage return
and space.
Delimiters are defined as: (, ), <, >, [, ], {, }, /, %, and white space characters.
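The two character classes above translate directly into byte tests. The following sketch uses a hypothetical PdfCharSketch class; the byte values are the standard ASCII codes for the characters listed.

```csharp
// Hypothetical helper, not the project's PdfParser code.
public static class PdfCharSketch
{
    // White space: null (0), tab (9), line feed (10), form feed (12),
    // carriage return (13) and space (32).
    public static bool IsWhiteSpace(byte b)
    {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    // Delimiters: the ten special characters plus all white space characters.
    public static bool IsDelimiter(byte b)
    {
        switch ((char) b)
        {
            case '(': case ')': case '<': case '>': case '[': case ']':
            case '{': case '}': case '/': case '%':
                return true;
            default:
                return IsWhiteSpace(b);
        }
    }
}
```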
File parsing is done with the PdfParser class. To start the parsing process the
program sets the file position to the area to be parsed. ParseNextItem() is the
method that extracts the next object.
The parser skips white space and comments. If the next byte is “(” the object is
a string. If the next byte is “[” the object is an array. If the next two bytes are
“<<” the object is a dictionary. If the next byte is a single “<” the object is a
hex string. If the next byte is “/” the object is a name. If the next byte is none
of the above, the parser accumulates the following bytes until a delimiter is
found. The delimiter is not part of the current token. The token can be an
integer, real number, operator or keyword. In the case of an integer, the program
searches further for an object reference “n 0 R” or an indirect object “n 0 obj”
where n is the integer. The value returned from ParseNextItem() is the
appropriate object as per Section 4. Object Definitions. The object is
returned as a PdfBase.
In the case of an array or dictionary, the program calls ParseNextItem()
recursively to parse the internal objects of the array or
dictionary.
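The first-byte dispatch described above can be sketched as follows. This is a simplified illustration, not the actual ParseNextItem() method: it only classifies the next item and extracts plain tokens, and the ItemKind enum, the NextItemSketch class and the Classify method are hypothetical names.

```csharp
using System;
using System.Text;

public enum ItemKind { End, String, Array, Dictionary, HexString, Name, Token }

// Hypothetical sketch of the dispatch step only; no objects are built.
public static class NextItemSketch
{
    static bool IsWhiteSpace(byte b)
    {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    static bool IsDelimiter(byte b)
    {
        return IsWhiteSpace(b) || "()<>[]{}/%".IndexOf((char) b) >= 0;
    }

    // Skip white space and comments. A comment runs from % to the end of the line.
    static int SkipWhiteSpace(byte[] data, int pos)
    {
        while (pos < data.Length)
        {
            if (IsWhiteSpace(data[pos])) { pos++; continue; }
            if (data[pos] != '%') break;
            while (pos < data.Length && data[pos] != 10 && data[pos] != 13) pos++;
        }
        return pos;
    }

    // Classify the item starting at pos. For ItemKind.Token the token text is
    // returned and pos is advanced past it; otherwise pos is left at the
    // opening character so a specialized reader can take over.
    public static ItemKind Classify(byte[] data, ref int pos, out string token)
    {
        token = null;
        pos = SkipWhiteSpace(data, pos);
        if (pos >= data.Length) return ItemKind.End;

        switch ((char) data[pos])
        {
            case '(': return ItemKind.String;
            case '[': return ItemKind.Array;
            case '/': return ItemKind.Name;
            case '<':
                return pos + 1 < data.Length && data[pos + 1] == '<'
                    ? ItemKind.Dictionary : ItemKind.HexString;
        }

        // Accumulate bytes until a delimiter; the delimiter is not part of the token.
        int start = pos;
        while (pos < data.Length && !IsDelimiter(data[pos])) pos++;
        if (pos == start) pos++; // a stray delimiter such as ']' forms its own token
        token = Encoding.ASCII.GetString(data, start, pos - start);
        return ItemKind.Token;
    }
}
```

A full parser would go on to decide whether a token is an integer, real number, operator or keyword, and would look ahead after an integer for the “n 0 R” and “n 0 obj” patterns.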
7. File Reading
The PdfDocument class is the main class for PDF file analysis. The entry method
is ReadPdfFile(String FileName). The program opens the PDF file for binary
reading (one byte at a time).
File analysis starts with checking the header signature %PDF-1.x, where x is
0 to 7, and the trailer end signature %%EOF. One would think that all PDF
writers would put the header at position zero of the file and the trailer at
the very end of the file. Unfortunately, this is not the case. The program has to
search for these two signatures at the two ends of the file. If the header
signature is not at position zero, the file position pointers of all indirect
objects have to be adjusted.
Just before the trailer signature there is a pointer to the start of the
last cross-reference table.
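The search for the two signatures and the cross-reference pointer can be sketched as below. This is an illustration under stated assumptions, not the PdfDocument code: the file is held in a byte array, the search is limited to the first and last 1024 bytes (an arbitrary window chosen for this example), and the SignatureSketch class and its methods are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical sketch; the 1024-byte search window is an assumption.
public static class SignatureSketch
{
    const int Window = 1024;

    // True if the ASCII pattern occurs in data at position pos.
    static bool Match(byte[] data, int pos, string pattern)
    {
        if (pos < 0 || pos + pattern.Length > data.Length) return false;
        for (int j = 0; j < pattern.Length; j++)
            if (data[pos + j] != pattern[j]) return false;
        return true;
    }

    // Returns the position of the %PDF-1.x header (usually 0).
    public static int FindHeader(byte[] file)
    {
        int limit = Math.Min(file.Length, Window);
        for (int pos = 0; pos < limit; pos++)
        {
            if (!Match(file, pos, "%PDF-1.") || pos + 7 >= file.Length) continue;
            if (file[pos + 7] >= '0' && file[pos + 7] <= '7') return pos;
        }
        throw new FormatException("PDF header signature not found");
    }

    // Returns the number that follows startxref, searching backward from the end.
    public static long FindStartXref(byte[] file)
    {
        int from = Math.Max(0, file.Length - Window);

        int eof = file.Length - 5;
        while (eof >= from && !Match(file, eof, "%%EOF")) eof--;
        if (eof < from) throw new FormatException("%%EOF not found");

        int start = eof - 9;
        while (start >= from && !Match(file, start, "startxref")) start--;
        if (start < from) throw new FormatException("startxref not found");

        // The text between the keyword and %%EOF is the decimal byte offset.
        string text = Encoding.ASCII.GetString(file, start + 9, eof - (start + 9)).Trim();
        return long.Parse(text, NumberStyles.None, CultureInfo.InvariantCulture);
    }
}
```

If FindHeader returns a non-zero value, that value is the adjustment that has to be added to the pointer returned by FindStartXref and to every object position read from the cross-reference table.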
The parser sets the file position to the cross-reference table. If the next object
is the “xref” keyword, we have the original style cross-reference. Otherwise, it is
the new stream-based cross-reference. The file can have more than one
cross-reference table. The file can have both the new and old styles of table. Each
table is a list of object numbers and file position pointers to the starting
point of indirect objects. For each active object the program creates
a PdfIndirectObject object and saves it in ObjectArray. The object is empty
except for object number and position. For the original cross-reference table the
position is relative to the file. For the stream type cross-reference, the
position of an object stored inside an object stream is relative to that parent object stream.
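In the original style table, each entry is a fixed-format line: a 10-digit byte offset, a space, a 5-digit generation number, a space, and the letter n (in use) or f (free). That layout comes from the PDF reference rather than from this article. The sketch below parses one such entry; the XrefEntry struct and XrefTableSketch class are hypothetical names.

```csharp
using System;
using System.Globalization;

// Hypothetical types for illustration only.
public struct XrefEntry
{
    public long Position;   // byte offset of the object (for entries in use)
    public int Generation;  // generation number
    public bool InUse;      // true for 'n', false for 'f'
}

public static class XrefTableSketch
{
    // Parse one entry such as "0000000017 00000 n" (end-of-line characters removed or not).
    public static XrefEntry ParseEntry(string line)
    {
        if (line.Length < 18 || line[10] != ' ' || line[16] != ' ')
            throw new FormatException("Malformed cross-reference entry");

        char type = line[17];
        if (type != 'n' && type != 'f')
            throw new FormatException("Entry type must be n or f");

        return new XrefEntry
        {
            Position = long.Parse(line.Substring(0, 10), NumberStyles.None, CultureInfo.InvariantCulture),
            Generation = int.Parse(line.Substring(11, 5), NumberStyles.None, CultureInfo.InvariantCulture),
            InUse = type == 'n'
        };
    }
}
```

An analyzer with the restriction described in this article would reject any in-use entry whose Generation is not zero.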
During this process, if an indirect object has a generation number other than
zero, program execution is aborted. PdfFileAnalyzer does not support
multi-generation files.
At the end of the cross-reference table we have a trailer dictionary. In
order to include this dictionary in the analysis we create a dummy indirect
object with a negative object number and save the dictionary in it.
The program looks for four particular entries in the trailer dictionary. If
/Encrypt is found, the file is encrypted and execution is terminated because
this program does not support encryption. Next the program looks for /Root, which
gives the object number of the catalog object. If an /XRefStm entry exists, we have both
types of cross-reference. Finally, if /Prev exists, we have another
cross-reference table to process.
After the cross-reference processing is done we have an array of all
indirect objects. The information available at this stage of the process is
object number and position. Next, the program loops through the array and reads
and parses each indirect object. This process sets the object value. If the
object is a stream, only the dictionary part is parsed. The reason is
that the stream length might not be known at this time. In addition to the
object, the program sets the object type and subtype members for dictionary and
stream objects if these two values are available.
Next the program loops through all objects and processes object streams.
Object streams have object type equal to “/ObjStm”. The program reads the
stream associated with each of these objects and breaks it into multiple indirect
objects.
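Breaking an object stream into its objects starts with the index at the beginning of the decoded stream. According to the PDF reference (this detail is not spelled out in the article), the stream dictionary supplies /N, the number of objects, and /First, the offset of the first object, and the decoded data begins with N pairs of integers: object number and offset relative to /First. The sketch below reads that index; ObjectStreamSketch and ReadIndex are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical sketch, not the project's object stream code.
public static class ObjectStreamSketch
{
    // decoded: the decompressed object stream
    // n:       value of /N in the stream dictionary
    // first:   value of /First in the stream dictionary
    // Returns (object number, position within decoded) for each contained object.
    public static KeyValuePair<int, int>[] ReadIndex(byte[] decoded, int n, int first)
    {
        string header = Encoding.ASCII.GetString(decoded, 0, first);
        string[] tokens = header.Split(
            new[] { ' ', '\t', '\r', '\n', '\f', '\0' },
            StringSplitOptions.RemoveEmptyEntries);
        if (tokens.Length < 2 * n)
            throw new FormatException("Object stream index is too short");

        KeyValuePair<int, int>[] index = new KeyValuePair<int, int>[n];
        for (int i = 0; i < n; i++)
        {
            int objectNumber = int.Parse(tokens[2 * i], NumberStyles.None, CultureInfo.InvariantCulture);
            int offset = int.Parse(tokens[2 * i + 1], NumberStyles.None, CultureInfo.InvariantCulture);
            // Offsets in the index are relative to the first object.
            index[i] = new KeyValuePair<int, int>(objectNumber, first + offset);
        }
        return index;
    }
}
```

Each position returned can then be handed to the regular parser, which reads one object starting at that point in the decoded data.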
Next the program searches all dictionary objects and stream dictionaries
for object references. The program is looking for key value pairs such
as “/name n 0 R”. If such a pair is found, the program checks the type of the referenced object.
If the object type was not set during the object parsing phase, the type is set to the /name
value.
The next step is to read all streams that were not read before. The program reads the
stream from the file. The stream is decoded and saved to an appropriate file. The
PdfFileAnalyzer supports the following filters: /FlateDecode, /LZWDecode,
/ASCII85Decode and /DCTDecode. Text files have extension .txt, binary files .bin, image
files .jpg or .bmp, font files .ttf and cross-reference files .xref. The
/FlateDecode filter is the ZLib deflate compression method. The decompression source code
is taken from the article “Processing Standard ZIP Files with C# Compression/Decompression
Classes” published on the CodeProject.com website.
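For readers who only want to experiment with a /FlateDecode stream, the decoding step can be approximated with the .NET class library instead of the article's own decompression classes. This sketch is a substitute technique, not the project's code: it skips the 2-byte ZLib header and lets System.IO.Compression.DeflateStream inflate the rest. The trailing Adler-32 checksum is not verified and predictor parameters are not handled. The FlateSketch class name is hypothetical.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical sketch using the framework's DeflateStream.
public static class FlateSketch
{
    public static byte[] Decode(byte[] compressed)
    {
        if (compressed.Length < 2)
            throw new InvalidDataException("Stream is too short to hold a ZLib header");

        // ZLib data = 2-byte header + raw deflate data + 4-byte Adler-32 checksum.
        // Skipping the header lets DeflateStream read the raw deflate data.
        using (MemoryStream input = new MemoryStream(compressed, 2, compressed.Length - 2))
        using (DeflateStream inflate = new DeflateStream(input, CompressionMode.Decompress))
        using (MemoryStream output = new MemoryStream())
        {
            inflate.CopyTo(output);
            return output.ToArray();
        }
    }
}
```

The input is exactly the /Length bytes that follow the stream keyword, so the length in the stream dictionary must be resolved before this step.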
The next step is to build page contents. The program follows the page tree
starting from the root. Page objects are not stream objects. In other words,
page description commands are not available directly within the page object. Page
object dictionaries have a /Contents key value pair. If this pair is missing,
the page is blank. The value of the contents entry can be a single reference or
an array of references. The program creates a dummy contents stream for the
page from the one or more contents streams. The page contents dummy streams
are saved in PageObj_xx.txt and in PageSource_xx.txt. The former file is the
actual page description contents for the page. The latter file is the same
information converted to pseudo C# source code. Section 2. Overview has examples
of these two files.
The page contents stream is made of arguments and operators. For example, a rectangle
is four real numbers followed by re. The inline image is the exception to this
rule. It is described above in Section 4.2. Compound Objects.
Finally, the program produces the object summary file ObjectSummary.txt. The
file shows all indirect objects information without the streams.
8. PdfFileAnalyzer Program
The PdfFileAnalyzer application was
developed to test the PDF file parsing classes. If you want to test the
executable program outside the development environment, create a PdfFileAnalyzer
directory, copy the PdfFileAnalyzer.exe program into this directory and run
the program. If you run the project from the Visual C# development environment,
make sure you define a working directory in the Debug tab of the project
properties. This program was developed using Microsoft Visual C# 2005. If you
want to run it with Visual C# 2010, the development environment will convert it
with no errors.
Start the program.
The available options are: Open, Setup and Exit.
On first program execution you must run Setup and define the project directory.
This directory will hold the sub-directories that are created for each PDF
file being analyzed.
The Open button displays a standard file selection dialog. Navigate to the
PDF file you want to analyze.
The PdfFileAnalyzer screen will change to the object summary screen.

Each row represents an indirect PDF object. The columns are:
- Object No. The indirect object number. For a trailer dictionary the object
number is a dummy negative number, but on the screen it shows as TRn.
- Object. The type of object as per Section 4. Object Definitions.
- Type. If the object is a dictionary or a stream, the type is the value of the /Type
dictionary pair. If the object is not a dictionary or the dictionary does not
contain /Type, the displayed value comes from an indirect reference to this
object.
- Subtype. If the object is a dictionary or a stream and the dictionary contains
a /Subtype entry, it is displayed in this column.
- Parent Object No. If the indirect object is part of an object stream (see Section 4.2.
Compound Objects), this column is the object number of the object stream.
- Parent Index. If the indirect object is part of an object stream, this number is the
index number within the parent object stream.
- File Name. A file name exists for stream objects and page objects. It is the name of
the file storing the stream contents. This file can have the following
extensions: .txt for text files, .bin for binary files, .bmp for images, .jpg
for images, .ttf for fonts and .xref for cross-reference streams. If the file
being analyzed is MyFile.PDF, the stream files are located in the subdirectory
MyFile of the project directory as specified in the setup screen. Page objects are not
streams. For a page object the file is the concatenation of all contents objects for this
page.
- Object Position. For indirect objects that are not part of an object stream, this is the
object position within the PDF file. For indirect objects that are part of an object
stream, this is the position within the parent. The position is given in decimal
and hexadecimal for programmers who would like to view the PDF file in a binary
editor.
- Stream Position and Stream Length. The position and length of the stream. The position
is relative to the file or the parent in the same way as the object position above.
To view the ObjectSummary.txt file, press the Summary button.
To view the details of an indirect object either select a row and press the
View button or double click on a row. The object analysis screen will be displayed.

For all non-stream objects, the first three buttons are disabled. The only
information available is the object itself. You can view it in text or hexadecimal format.
For stream objects the name of the first button
is the object type. The first two buttons, object type and Stream, allow you to
toggle between viewing the object and the stream. The Hex and Text buttons allow you to
view it in hexadecimal or text format. If the stream is an image, the image is
displayed rather than text. If the stream is a cross-reference stream, the text
format shows four columns: (1) object number, (2) type (0-unused, 1-normal
object, 2-object stored in an object stream), (3) position for type 1 and parent for type 2 and (4)
parent index number. If the stream is binary (e.g. a font), it can be viewed in
hexadecimal only.
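The type, position or parent, and parent index columns of a cross-reference stream come from fixed-width binary fields. According to the PDF reference (not stated in this article), the stream dictionary has a /W array of three field widths in bytes, each field is a big-endian unsigned integer, and a missing first field means type 1. The sketch below decodes the rows of an already decompressed stream; XrefStreamSketch is a hypothetical name, and the object numbers themselves are not stored in the rows but are assigned from the /Index array or, by default, counted from 0.

```csharp
using System;

// Hypothetical sketch, not the project's cross-reference stream code.
public static class XrefStreamSketch
{
    // Read a big-endian unsigned integer of the given width in bytes.
    static long ReadField(byte[] data, ref int pos, int width, long defaultValue)
    {
        if (width == 0) return defaultValue;
        long value = 0;
        for (int i = 0; i < width; i++) value = (value << 8) | data[pos++];
        return value;
    }

    // Decode the stream into rows of { type, field 2, field 3 }.
    // Type 1: field 2 is the file position, field 3 the generation number.
    // Type 2: field 2 is the parent object stream number, field 3 the index within it.
    public static long[][] Decode(byte[] data, int[] w)
    {
        if (w.Length != 3) throw new ArgumentException("/W must have three entries");
        int rowSize = w[0] + w[1] + w[2];
        if (rowSize == 0 || data.Length % rowSize != 0)
            throw new FormatException("Stream length does not match the /W field widths");

        long[][] rows = new long[data.Length / rowSize][];
        int pos = 0;
        for (int r = 0; r < rows.Length; r++)
        {
            long type = ReadField(data, ref pos, w[0], 1);
            long field2 = ReadField(data, ref pos, w[1], 0);
            long field3 = ReadField(data, ref pos, w[2], 0);
            rows[r] = new[] { type, field2, field3 };
        }
        return rows;
    }
}
```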
A page object is treated as a stream object.
The text displayed is the concatenation of all contents objects. In addition,
the Source button allows you to view the page description language as pseudo
C# code.
Images (.jpg and .bmp) can be rotated and scaled.
9. References
Adobe PDF file specification document: “PDF Reference, Sixth Edition, Adobe
Portable Document Format Version 1.7 November 2006”. Available from the Adobe
website.
“Processing Standard Zip Files with C# Compression/Decompression Classes” by
Uzi Granot, published on the CodeProject.com website.
10. History
- 2012/08/25: Version 1.0, Original revision.
- 2013/04/10: Version 1.1, Support for world regions that define comma as decimal separator.