PDF Parser and FlateDecoder

Corey Fournier

Jul 23, 2009

MPL

Demonstrates how to parse objects in a PDF and inflate FlateDecode sections.

Download source code - 1.53 MB

Introduction

This article provides complete code, from start to finish, for extracting streams from a PDF file and inflating FlateDecode sections. The SharpZipLib source is included, so everything runs right out of the box.

Using the code

The attached code contains a small project with a single file that shows how everything fits together. Below is the code from the OpenDocument event handler, so you can see how simple it is.

Dim ofd As New OpenFileDialog()
Dim ow() As PDFParser.ObjectWrapper
Dim sb As New System.Text.StringBuilder()

ofd.Filter = "PDF|*.pdf"
' Start in the user's Desktop folder. (GetEnvironmentVariable expects the
' variable name without percent signs, so GetFolderPath is used instead.)
ofd.InitialDirectory = _
  System.Environment.GetFolderPath(System.Environment.SpecialFolder.Desktop)

If ofd.ShowDialog() = Windows.Forms.DialogResult.OK Then
    ' Read the whole file and split it into object blobs.
    ow = PDFParser.Objects.GetAllObjectBlobs( _
            New System.IO.MemoryStream( _
            System.IO.File.ReadAllBytes(ofd.FileName)))
    For Each wrapper As PDFParser.ObjectWrapper In ow
        sb.Append("********************" & wrapper.header & _
                  "**************************" & vbCrLf)
        ' Inflate only FlateDecode streams; streams that also carry
        ' DecodeParms are skipped.
        If wrapper.header.Contains("FlateDecode") AndAlso Not _
               wrapper.header.Contains("DecodeParms") Then
            Try
                sb.Append(PDFParser.Inflator.FlateDecodeToASCII(New _
                          System.IO.MemoryStream(wrapper.bytes)))
            Catch ex As Exception
                sb.Append("EXCEPTION: " & ex.Message)
            End Try
        End If
        sb.Append(vbCrLf)
        sb.Append("*********************************" & _
                  "***************************************" & vbCrLf)
    Next
    txtInflatedContents.Text = sb.ToString()
End If

Detailed code use

Call the static method "GetAllObjectBlobs" and pass in a stream containing the bytes of the PDF file.

PDFParser.Objects.GetAllObjectBlobs()

The method returns an array of ObjectWrappers. Each one gives you all of the bytes in the stream as well as the header.
You can then decide what to do with each stream. I implemented a simple decode method. I say simple because it does not cover everything in Adobe's specification: filters can be nested, and a stream may be Flate-encoded several times.
Once you have determined that a stream needs to be decoded, make a call to "FlateDecodeToASCII".

PDFParser.Inflator.FlateDecodeToASCII(New System.IO.MemoryStream(wrapper.bytes))
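For readers curious about what such a call involves, here is a minimal sketch of inflating a zlib-wrapped stream to ASCII text with SharpZipLib's InflaterInputStream. It is not the library's actual implementation; the module name InflateSketch and the function InflateToAscii are hypothetical names chosen for illustration.

```vbnet
Imports System.IO
Imports System.Text
Imports ICSharpCode.SharpZipLib.Zip.Compression.Streams

Module InflateSketch
    ' Hypothetical helper (not part of PDFParser): inflates a zlib-wrapped
    ' stream and returns the result decoded as ASCII text.
    Function InflateToAscii(ByVal compressed As Stream) As String
        Using inflater As New InflaterInputStream(compressed)
            Using output As New MemoryStream()
                Dim buffer(4095) As Byte
                ' Copy inflated bytes until the inflater reports end of data.
                Dim read As Integer = inflater.Read(buffer, 0, buffer.Length)
                While read > 0
                    output.Write(buffer, 0, read)
                    read = inflater.Read(buffer, 0, buffer.Length)
                End While
                Return Encoding.ASCII.GetString(output.ToArray())
            End Using
        End Using
    End Function
End Module
```

Under these assumptions it would be called the same way as the library method, for example InflateToAscii(New System.IO.MemoryStream(wrapper.bytes)). Bytes outside the ASCII range come back as "?" with this encoding.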

That's it. These are very simple functions that let you break out object streams and inflate them using FlateDecode.
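Since filters can be nested, it may help to see which filters a header names before deciding how to decode. The sketch below is an illustration only, not part of PDFParser: GetFilterNames is a hypothetical helper that assumes the header is the stream dictionary as text and that the /Filter value is written inline as a single name or an array of names.

```vbnet
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module FilterListSketch
    ' Hypothetical helper: returns the filter names from a stream header
    ' in the order they are listed, which is the order used for decoding.
    Function GetFilterNames(ByVal header As String) As List(Of String)
        Dim names As New List(Of String)()
        ' /Filter is followed by either one name or an array of names.
        Dim m As Match = Regex.Match(header, "/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)")
        If m.Success Then
            For Each n As Match In Regex.Matches(m.Groups(1).Value, "/([A-Za-z0-9]+)")
                names.Add(n.Groups(1).Value)
            Next
        End If
        Return names
    End Function
End Module
```

A header containing "/Filter [/ASCII85Decode /FlateDecode]" would yield two names, which tells you that a single inflate pass is not enough for that stream. Filter values given as indirect references are not handled by this sketch.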

Points of interest

The code does not look for encryption.
It only inflates to a stream or to ASCII.
While testing a file that is not compressed, I noticed that it has sections marked as FlateDecode, yet inflating them throws an invalid header exception. I don't know why that is.
This has only been tested on PDFs created by Adobe LiveCycle Designer ES 8.2.
Code examples are in VB.NET and all libraries are in C#.
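When chasing an invalid header exception like the one mentioned above, one option is to check whether the bytes start with a valid zlib header before inflating them. The sketch below applies the two header rules from the zlib format (RFC 1950); LooksLikeZlib is a hypothetical helper, not part of PDFParser, and it is not a diagnosis of the specific file described above.

```vbnet
Module ZlibHeaderSketch
    ' Hypothetical helper: True when the first two bytes form a valid
    ' zlib header (RFC 1950) that declares the deflate method.
    Function LooksLikeZlib(ByVal data As Byte()) As Boolean
        If data Is Nothing OrElse data.Length < 2 Then Return False
        Dim cmf As Integer = data(0)
        Dim flg As Integer = data(1)
        ' The low nibble of CMF is the compression method; 8 means deflate.
        If (cmf And &HF) <> 8 Then Return False
        ' CMF * 256 + FLG must be a multiple of 31.
        Return (cmf * 256 + flg) Mod 31 = 0
    End Function
End Module
```

Calling LooksLikeZlib(wrapper.bytes) before inflating lets you log or skip streams whose data does not begin with a zlib header, instead of relying on the exception.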

History

06-17-09 - First release.