PDF Parser and FlateDecoder

23 Jul 2009 | MPL
Demonstrates how to parse objects in a PDF and inflate FlateDecode sections.

Introduction

This article provides complete code, from start to finish, for extracting streams from a PDF file and inflating the FlateDecode sections. The SharpZipLib source is included, so everything runs right out of the box.

Using the code

The attached download contains a small project with a single file that demonstrates the whole workflow. The code below comes from the OpenDocument event handler and shows how little is needed.

VB
Dim ofd As New OpenFileDialog()
Dim ow() As PDFParser.ObjectWrapper
Dim sb As New System.Text.StringBuilder()

ofd.Filter = "PDF|*.pdf"
' GetEnvironmentVariable expects the bare variable name, without % signs.
ofd.InitialDirectory = _
  System.Environment.GetEnvironmentVariable("USERPROFILE") & "\Desktop"

If ofd.ShowDialog() = Windows.Forms.DialogResult.OK Then
    ' Read the whole file and split it into objects (header plus stream bytes).
    ow = PDFParser.Objects.GetAllObjectBlobs( _
            New System.IO.MemoryStream( _
            System.IO.File.ReadAllBytes(ofd.FileName)))
    For Each wrapper As PDFParser.ObjectWrapper In ow
        sb.Append("********************" & wrapper.header & _
                  "**************************" & vbCrLf)
        ' Inflate only plain FlateDecode streams; streams with DecodeParms are skipped.
        If wrapper.header.Contains("FlateDecode") AndAlso Not _
               wrapper.header.Contains("DecodeParms") Then
            Try
                sb.Append(PDFParser.Inflator.FlateDecodeToASCII(New _
                          System.IO.MemoryStream(wrapper.bytes)))
            Catch ex As Exception
                sb.Append("EXCEPTION: " & ex.Message)
            End Try
        End If
        sb.Append(vbCrLf)
        sb.Append("*********************************" & _
                  "***************************************" & vbCrLf)
    Next
    txtInflatedContents.Text = sb.ToString()
End If
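
For readers curious what the inflate step amounts to, here is a minimal sketch using only the .NET base class library. It is not how PDFParser.Inflator is implemented (the download bundles SharpZipLib for that); the module and function names are hypothetical. It assumes the stream bytes are zlib-wrapped DEFLATE data with no preset dictionary, which is what FlateDecode normally produces, and that Stream.CopyTo (.NET 4.0 or later) is available.

```vb
Imports System
Imports System.IO
Imports System.IO.Compression
Imports System.Text

' Hypothetical helper module, not part of PDFParser.
Public Module FlateSketch

    ' Inflates zlib-wrapped data (assumption: 2-byte header, no preset dictionary).
    Public Function InflateZlib(ByVal data As Byte()) As Byte()
        If data Is Nothing OrElse data.Length < 2 Then
            Throw New ArgumentException("Too short to be a zlib stream.")
        End If
        ' DeflateStream reads raw DEFLATE, so skip the 2-byte zlib header.
        ' The trailing Adler-32 checksum is simply left unread.
        Using input As New MemoryStream(data, 2, data.Length - 2)
            Using inflater As New DeflateStream(input, CompressionMode.Decompress)
                Using output As New MemoryStream()
                    inflater.CopyTo(output)
                    Return output.ToArray()
                End Using
            End Using
        End Using
    End Function

    ' Convenience wrapper that decodes the inflated bytes as ASCII text.
    Public Function InflateZlibToAscii(ByVal data As Byte()) As String
        Return Encoding.ASCII.GetString(InflateZlib(data))
    End Function

End Module
```

Under those assumptions, FlateSketch.InflateZlibToAscii(wrapper.bytes) could stand in for the library call when experimenting. Because the checksum is never verified, corrupted data may go unnoticed.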

Detailed code use

  1. Call the static method "GetAllObjectBlobs" and pass in a stream over the bytes of the PDF file.
     VB
     PDFParser.Objects.GetAllObjectBlobs()
  2. The method returns an array of ObjectWrappers. Each one gives you the object's header and all of the bytes in its stream.
  3. You can then decide what to do with each stream. I implemented a simple decode check. I say simple because it does not cover everything in Adobe's specification: filters can be nested, so a stream may be encoded with several filters or flate-encoded more than once.
  4. Once you have determined that the stream needs to be decoded, call "FlateDecodeToASCII".
     VB
     PDFParser.Inflator.FlateDecodeToASCII(New System.IO.MemoryStream(wrapper.bytes))
  5. That's it. These very simple functions give you the ability to break out object streams and inflate them using FlateDecode.
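
The nested-filter caveat above can be made more concrete. The sketch below is one possible way to read the /Filter entry out of a header so that a chain such as [/ASCII85Decode /FlateDecode] is not mistaken for a plain FlateDecode stream. The module and function names are hypothetical, and it assumes that wrapper.header holds the object's dictionary as text and that filter names are plain alphanumeric.

```vb
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

' Hypothetical helper module, not part of PDFParser.
Public Module FilterSketch

    ' Returns the filter names of a stream dictionary, in the order they are listed.
    ' Handles both "/Filter /FlateDecode" and "/Filter [/ASCII85Decode /FlateDecode]".
    Public Function GetFilterNames(ByVal header As String) As List(Of String)
        Dim names As New List(Of String)()
        Dim m As Match = Regex.Match(header, _
            "/Filter\s*(\[(?<list>[^\]]*)\]|/(?<single>[A-Za-z0-9]+))")
        If Not m.Success Then Return names

        If m.Groups("single").Success Then
            names.Add(m.Groups("single").Value)
        Else
            ' Pull every /Name out of the bracketed array.
            For Each n As Match In Regex.Matches(m.Groups("list").Value, "/([A-Za-z0-9]+)")
                names.Add(n.Groups(1).Value)
            Next
        End If
        Return names
    End Function

    ' True only when FlateDecode is the sole filter and no DecodeParms are present.
    Public Function IsPlainFlate(ByVal header As String) As Boolean
        Dim names As List(Of String) = GetFilterNames(header)
        Return names.Count = 1 AndAlso names(0) = "FlateDecode" _
               AndAlso Not header.Contains("/DecodeParms")
    End Function

End Module
```

A check like FilterSketch.IsPlainFlate(wrapper.header) is stricter than the Contains test used in the sample: it also rejects streams where FlateDecode is only one link in a filter chain, which the simple inflater cannot handle anyway.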

Points of interest

  • The code does not check for encryption.
  • It only inflates to a stream or to ASCII text.
  • While testing a file that is not compressed, I noticed that it has sections marked as FlateDecode, yet inflating them throws an invalid header exception. I don't know why that is.
  • This has only been tested on PDFs created by Adobe LiveCycle Designer ES 8.2.
  • Code examples are in VB.NET; all libraries are in C#.
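
When chasing an invalid header exception like the one mentioned above, it can help to check whether the bytes start with a well-formed zlib header before trying to inflate them. The sketch below applies the header rules from the zlib format (RFC 1950). The module and function names are hypothetical. It only reports whether the first two bytes look right; it does not explain why a particular stream fails.

```vb
Imports System

' Hypothetical diagnostic helper, not part of PDFParser.
Public Module ZlibHeaderSketch

    ' Returns True when the first two bytes form a valid zlib header.
    Public Function LooksLikeZlib(ByVal data As Byte()) As Boolean
        If data Is Nothing OrElse data.Length < 2 Then Return False

        Dim cmf As Integer = data(0)   ' compression method and window size
        Dim flg As Integer = data(1)   ' flags, including the check bits

        ' The low nibble of CMF is the compression method; 8 means DEFLATE.
        If (cmf And &HF) <> 8 Then Return False

        ' The high nibble encodes the window size; values above 7 are not allowed.
        If (cmf >> 4) > 7 Then Return False

        ' CMF * 256 + FLG must be a multiple of 31.
        Return ((cmf << 8) Or flg) Mod 31 = 0
    End Function

End Module
```

Calling ZlibHeaderSketch.LooksLikeZlib(wrapper.bytes) before inflating would separate streams that are not zlib data at all (for example, because they are encrypted or the stream bytes were cut at the wrong offset) from streams that fail later during decompression.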

History

  • 06-17-09 - First release.

License

This article, along with any associated source code and files, is licensed under the Mozilla Public License 1.1 (MPL 1.1).