PDF Parser and FlateDecoder






Demonstrates how to parse objects in a PDF and inflate FlateDecode sections.
Introduction
This article provides complete code for extracting streams from a PDF file and inflating their FlateDecode sections. The SharpZipLib source is included, so everything runs right out of the box.
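To give a feel for what "extracting streams" means at the byte level, here is a minimal sketch that collects the raw bytes between each "stream" and "endstream" keyword pair. It is not the attached PDFParser code: the module and function names are hypothetical, and it assumes the keywords never occur inside binary stream data, which a robust parser would avoid by honoring the /Length entry instead.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text
Imports Microsoft.VisualBasic

Module StreamScanSketch
    ' Hypothetical helper (not part of PDFParser): returns the raw bytes
    ' found between each "stream" / "endstream" keyword pair.
    Function FindStreamBodies(ByVal pdf As Byte()) As List(Of Byte())
        ' ISO-8859-1 maps every byte to exactly one Char, so string
        ' indexes equal byte offsets.
        Dim raw As String = Encoding.GetEncoding("ISO-8859-1").GetString(pdf)
        Dim bodies As New List(Of Byte())
        Dim pos As Integer = 0
        Do
            Dim startIdx As Integer = raw.IndexOf("stream", pos, StringComparison.Ordinal)
            If startIdx < 0 Then Exit Do
            Dim bodyStart As Integer = startIdx + 6
            ' The keyword is followed by CRLF or LF before the data begins.
            If bodyStart < raw.Length AndAlso raw(bodyStart) = ControlChars.Cr Then bodyStart += 1
            If bodyStart < raw.Length AndAlso raw(bodyStart) = ControlChars.Lf Then bodyStart += 1
            Dim endIdx As Integer = raw.IndexOf("endstream", bodyStart, StringComparison.Ordinal)
            If endIdx < 0 Then Exit Do
            ' Any end-of-line marker before "endstream" stays in the body.
            Dim body(endIdx - bodyStart - 1) As Byte
            Array.Copy(pdf, bodyStart, body, 0, body.Length)
            bodies.Add(body)
            ' Continue after "endstream" so its tail is not matched as "stream".
            pos = endIdx + 9
        Loop
        Return bodies
    End Function
End Module
```

The sketch trades correctness on unusual files for brevity; it is only meant to show the shape of the problem that the library solves.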
Using the code
The attached code includes a small project with a single file that shows how everything works. Below is the code from the OpenDocument event handler, so you can see how simple it is.
Dim ofd As New OpenFileDialog()
Dim ow() As PDFParser.ObjectWrapper
Dim sb As New System.Text.StringBuilder()
ofd.Filter = "PDF|*.pdf"
' Start browsing in the user's desktop folder.
ofd.InitialDirectory = _
    System.Environment.GetFolderPath(System.Environment.SpecialFolder.Desktop)
If ofd.ShowDialog() = Windows.Forms.DialogResult.OK Then
    ' Split the PDF into its objects (header plus raw stream bytes).
    ow = PDFParser.Objects.GetAllObjectBlobs( _
        New System.IO.MemoryStream( _
            System.IO.File.ReadAllBytes(ofd.FileName)))
    For Each wrapper As PDFParser.ObjectWrapper In ow
        sb.Append("********************" & wrapper.header & _
            "**************************" & vbCrLf)
        ' Only inflate plain FlateDecode streams; streams that carry
        ' DecodeParms need extra processing that is not handled here.
        If wrapper.header.Contains("FlateDecode") AndAlso Not _
            wrapper.header.Contains("DecodeParms") Then
            Try
                sb.Append(PDFParser.Inflator.FlateDecodeToASCII(New _
                    System.IO.MemoryStream(wrapper.bytes)))
            Catch ex As Exception
                sb.Append("EXCEPTION: " & ex.Message)
            End Try
        End If
        sb.Append(vbCrLf)
        sb.Append("*********************************" & _
            "***************************************" & vbCrLf)
    Next
    txtInflatedContents.Text = sb.ToString()
End If
Detailed code use
- Call the static method "GetAllObjectBlobs" and pass in a stream containing the bytes of the PDF file.
- The method returns an array of ObjectWrappers. Each one gives you all of the bytes in a stream as well as its header.
- You can then decide what to do with each stream. I implemented a simple decode method. I say simple because it does not fully follow Adobe's specification: filters can be nested, and a stream can be flate-encoded several times.
- Once you determine that a stream needs to be decoded, call "FlateDecodeToASCII".
- That's it. These are very simple functions that let you break out object streams and inflate them using FlateDecode.
PDFParser.Objects.GetAllObjectBlobs()
PDFParser.Inflator.FlateDecodeToASCII(New System.IO.MemoryStream(wrapper.bytes))
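For readers curious about what the inflate step involves, the sketch below shows one way to do it with the framework's own System.IO.Compression.DeflateStream instead of SharpZipLib. It is not the attached library's implementation: the names InflateSketch, InflateToAscii, and IsPlainFlate are hypothetical, and it assumes the stream bytes are zlib-wrapped deflate data (a 2-byte header followed by the compressed payload), as FlateDecode streams normally are.

```vbnet
Imports System
Imports System.IO
Imports System.IO.Compression
Imports System.Text

Module InflateSketch
    ' Hypothetical stand-in for illustration only; the attached library
    ' relies on SharpZipLib for this step.
    Function InflateToAscii(ByVal data As Byte()) As String
        If data Is Nothing OrElse data.Length < 2 Then
            Throw New ArgumentException("Stream is too short to hold a zlib header.")
        End If
        ' Skip the 2-byte zlib header: DeflateStream expects raw deflate data.
        Using input As New MemoryStream(data, 2, data.Length - 2)
            Using inflater As New DeflateStream(input, CompressionMode.Decompress)
                Using output As New MemoryStream()
                    inflater.CopyTo(output)
                    Return Encoding.ASCII.GetString(output.ToArray())
                End Using
            End Using
        End Using
    End Function

    ' Same header test as the sample code: FlateDecode without DecodeParms.
    Function IsPlainFlate(ByVal header As String) As Boolean
        Return header.Contains("FlateDecode") AndAlso Not header.Contains("DecodeParms")
    End Function
End Module
```

A caller would check IsPlainFlate(wrapper.header) first and then pass wrapper.bytes to InflateToAscii. The trailing checksum of the zlib wrapper is simply left unread here, so corrupted data is not detected.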
Points of interest
- The code does not check for encryption.
- It only inflates to a stream or to ASCII.
- While testing a file that is not compressed, I noticed that it has sections marked as FlateDecode, but inflating them throws an invalid header exception. I don't know why that is.
- This has only been tested on PDFs created by Adobe LiveCycle Designer ES 8.2.
- Code examples are in VB.NET and all libraries are in C#.
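One way to investigate the invalid header exception mentioned above is to test whether a stream's bytes actually begin with a valid zlib header before trying to inflate them. The sketch below applies the header rules from RFC 1950; the module and function names are hypothetical, and it does not explain the cause of the exception, it only tells you whether the first two bytes look right. The check would fail, for example, if stray line-ending bytes precede the data or if the stream is encrypted.

```vbnet
Module ZlibHeaderSketch
    ' Hypothetical helper: True when the first two bytes form a valid
    ' zlib (RFC 1950) header for the deflate method.
    Function LooksLikeZlib(ByVal data As Byte()) As Boolean
        If data Is Nothing OrElse data.Length < 2 Then Return False
        Dim cmf As Integer = data(0)
        Dim flg As Integer = data(1)
        ' The low nibble of CMF is the compression method; 8 means deflate.
        If (cmf And &HF) <> 8 Then Return False
        ' The high nibble encodes the window size and may not exceed 7.
        If (cmf >> 4) > 7 Then Return False
        ' Read as a big-endian number, the two bytes must be a multiple of 31.
        Return (cmf * 256 + flg) Mod 31 = 0
    End Function
End Module
```

Calling LooksLikeZlib(wrapper.bytes) before FlateDecodeToASCII lets you log the offending leading bytes instead of only catching the exception.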
History
- 06-17-09 - First release.