Licence MPL
First Posted 23 Jul 2009

PDF Parser and FlateDecoder

By Corey Fournier | 23 Jul 2009
Demonstrates how to parse objects in a PDF and inflate FlateDecode sections.

Introduction

This article provides complete code, from start to finish, for extracting the streams from a PDF file and inflating the ones compressed with FlateDecode. The SharpZipLib source is included, so everything runs right out of the box.

Using the code

The attached code includes a small project with a single file that shows how everything works. Below is the code from the OpenDocument event, so you can see how simple it is.

Dim ofd As New OpenFileDialog()
Dim ow() As PDFParser.ObjectWrapper
Dim sb As New System.Text.StringBuilder()

ofd.Filter = "PDF|*.pdf"
' Start browsing in the current user's Desktop folder.
ofd.InitialDirectory = _
  System.Environment.GetFolderPath(System.Environment.SpecialFolder.Desktop)

If ofd.ShowDialog() = Windows.Forms.DialogResult.OK Then
    ' Split the PDF into its objects: a header plus the raw stream bytes.
    ow = PDFParser.Objects.GetAllObjectBlobs( _
            New System.IO.MemoryStream( _
            System.IO.File.ReadAllBytes(ofd.FileName)))
    For Each wrapper As PDFParser.ObjectWrapper In ow
        sb.Append("********************" & wrapper.header & _
                  "**************************" & vbCrLf)
        ' Inflate only FlateDecode streams that carry no DecodeParms.
        If wrapper.header.Contains("FlateDecode") AndAlso Not _
               wrapper.header.Contains("DecodeParms") Then
            Try
                sb.Append(PDFParser.Inflator.FlateDecodeToASCII(New _
                          System.IO.MemoryStream(wrapper.bytes)))
            Catch ex As Exception
                ' Report the failure in the output and keep going.
                sb.Append("EXCEPTION: " & ex.Message)
            End Try
        End If
        sb.Append(vbCrLf)
        sb.Append("*********************************" & _
                  "***************************************" & vbCrLf)
    Next
    txtInflatedContents.Text = sb.ToString()
End If

Detailed code use

  1. Call the static method "GetAllObjectBlobs", passing in a stream over the bytes of the PDF file:
     PDFParser.Objects.GetAllObjectBlobs()
  2. The method returns an array of ObjectWrappers. Each wrapper gives you the header of an object along with all of the bytes in its stream.
  3. You can then decide what to do with each stream. I implemented a simple decode method. I say simple because it does not cover everything Adobe's specification allows: filters can be nested, so a stream may be encoded with several filters or flate-encoded more than once.
  4. Once you determine that a stream needs to be decoded, call "FlateDecodeToASCII":
     PDFParser.Inflator.FlateDecodeToASCII(New System.IO.MemoryStream(wrapper.bytes))
  5. That's it. These simple functions let you break out object streams and inflate them using FlateDecode.
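
Deciding whether a stream can be handled by a single inflate pass comes down to reading the /Filter entry in its header, which in PDF syntax is either one name or an array of names. The following is a minimal sketch of that check, not part of the PDFParser library: the module and function names are hypothetical, and it assumes wrapper.header holds the object's dictionary as plain text with no #-escaped characters in the filter names.

```vbnet
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module FilterSketch
    ' Hypothetical helper: returns the filter names listed in a stream
    ' header, in the order they appear. Returns an empty list when the
    ' header has no /Filter entry.
    Function GetFilterNames(ByVal header As String) As List(Of String)
        Dim names As New List(Of String)()
        ' Matches "/Filter /Name" as well as "/Filter [ /Name1 /Name2 ]".
        Dim m As Match = Regex.Match(header, "/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)")
        If m.Success Then
            For Each n As Match In Regex.Matches(m.Groups(1).Value, "/([A-Za-z0-9]+)")
                names.Add(n.Groups(1).Value)
            Next
        End If
        Return names
    End Function

    ' True only when FlateDecode is the one and only filter, which is the
    ' case a single inflate pass can handle.
    Function IsSingleFlate(ByVal header As String) As Boolean
        Dim names As List(Of String) = GetFilterNames(header)
        Return names.Count = 1 AndAlso names(0) = "FlateDecode"
    End Function
End Module
```

A caller could test FilterSketch.IsSingleFlate(wrapper.header) before inflating, and report any stream with more than one filter as unsupported instead of attempting to decode it.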

Points of interest

  • The code does not check for encryption.
  • It inflates only to a stream or to ASCII text.
  • While testing a file that is not compressed, I noticed that it has sections marked as FlateDecode, yet inflating them throws an invalid header exception. I don't know why that is.
  • The code has only been tested on PDFs created by Adobe LiveCycle Designer ES 8.2.
  • The code examples are in VB.NET, and all libraries are in C#.
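
One thing that can be checked when an invalid header exception appears is whether the stream bytes actually begin with a zlib header as defined in RFC 1950. The sketch below is a diagnostic aid only and does not claim to explain the exception described above. It also shows inflating to raw bytes, so the caller can pick a text encoding other than ASCII. The names are hypothetical, and it uses .NET's built-in DeflateStream in place of SharpZipLib, assuming .NET Framework 4 or later (for Stream.CopyTo) and a stream without a preset dictionary.

```vbnet
Imports System.IO
Imports System.IO.Compression

Module ZlibSketch
    ' Hypothetical helper: checks the two-byte zlib header (CMF, FLG).
    Function LooksLikeZlibHeader(ByVal data As Byte()) As Boolean
        If data Is Nothing OrElse data.Length < 2 Then Return False
        Dim cmf As Integer = data(0)
        Dim flg As Integer = data(1)
        ' The low four bits of CMF hold the compression method; 8 is deflate.
        If (cmf And &HF) <> 8 Then Return False
        ' RFC 1950: CMF * 256 + FLG must be a multiple of 31.
        Return ((cmf * 256 + flg) Mod 31) = 0
    End Function

    ' Inflates a zlib-wrapped stream to raw bytes.
    ' Assumes the FDICT bit is clear (no preset dictionary).
    Function InflateToBytes(ByVal data As Byte()) As Byte()
        If Not LooksLikeZlibHeader(data) Then
            Throw New InvalidDataException("Stream does not start with a zlib header.")
        End If
        ' DeflateStream reads raw deflate data, so skip the 2-byte header.
        Using input As New MemoryStream(data, 2, data.Length - 2)
            Using inflater As New DeflateStream(input, CompressionMode.Decompress)
                Using output As New MemoryStream()
                    inflater.CopyTo(output)
                    Return output.ToArray()
                End Using
            End Using
        End Using
    End Function
End Module
```

If LooksLikeZlibHeader returns False for wrapper.bytes, the first few bytes are worth inspecting: the extracted data may include bytes that are not part of the compressed stream, or the stream may be transformed in some other way, such as by encryption, which this code does not look for.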

History

  • 06-17-09 - First release.

License

This article, along with any associated source code and files, is licensed under The Mozilla Public License 1.1 (MPL 1.1).

About the Author

Corey Fournier

Software Developer

United States

Graduate of the University of Louisiana at Lafayette in computer science.

Last Updated 23 Jul 2009
Article Copyright 2009 by Corey Fournier