Introduction
This article presents VB.NET code to create thumbnail images from a directory of Adobe Acrobat PDF documents.
Often when looking for documents it is much easier to find what you want visually, for example seeing the cover of a document.
The application was written for a website that I was developing that needed to display links to PDF documents. Instead of just showing a little PDF icon next to each document we wanted to display the front page of the actual document.
As shown below, this makes the listings more attractive and also helps users find documents more quickly when they recognise the cover.
[Images: a listing with generic PDF icons vs. the same listing with front-cover thumbnails]
Note: please ignore the strange text; lorem ipsum is simply dummy text for this example.
Hopefully people will agree that having the actual front cover displayed next to the hyperlink works better than the generic PDF icon.
Background
The web site was a Content Management System (CMS) so new PDF documents were uploaded to the site by the users. We then had this application scheduled as a batch service to run every 5 minutes and check for new files.
In the back-end system the documents have metadata stored in a SQL Server 2000 database. Once a thumbnail had been created we wrote a flag to say so, and when we generated the HTML content for a page request in ASP/ASP.NET we returned the appropriate IMG tag and source.
Using the Acrobat SDK also meant we could programmatically read the PDF metadata and retrieve the number of pages in the document, which could then be displayed as well. Although the end users could have entered that information, reading it automatically meant less work for them and a better overall impression of the web site. Another advantage was that many users relied on the number of pages, rather than the more technical KB/MB value, to judge how large a document was.
Approach
To generate the thumbnail image for each document I used the Adobe Acrobat 5.0 SDK and the Microsoft .NET 1.1 Framework.
Note: do not confuse the thumbnails that are part of a PDF document with the .png files this application generates.
The Acrobat SDK combined with the full version of Adobe Acrobat (sadly the free reader does not expose the COM interfaces) exposes a COM library of objects that can be used to manipulate and access PDF information.
So using these COM objects via COM Interop, we can load the PDF document, get the first page and render that page to the clipboard. Then using the .NET Framework we can copy this to a bitmap, scale and combine that image and then save the result as a .gif or .png file.
At first I just saved the scaled down image, but then decided to “fancy” up the thumbnail with a drop-shadow and folded corner. To achieve this effect I created a transparent .gif, called pdftemplate_portrait.gif, using Macromedia Fireworks MX where the main body of the page template was transparent.
By making the bottom-left pixel transparent too we can easily set the transparent colour for a bitmap in .NET.
I keep the top-right of the image white where the corner folds over, which means I can combine the images simply by drawing the transparent template directly over the PDF image to achieve the final look.
Pre-requisites
The full version of Adobe Acrobat (the free reader does not expose the COM interfaces) which exposes a COM library of objects to manipulate and access PDF information.
The Adobe Acrobat 5.0 SDK which is a free download from the Adobe Solutions Network website (note: the site requires registration). The latest SDK for Acrobat 6.0 requires paid membership, so we will use the previous SDK version.
To quickly see if you have the full version of Adobe Acrobat installed, use regedit.exe and look under HKEY_CLASSES_ROOT for an entry called AcroExch.PDDoc.
You'll also need the .NET 1.1 Framework and some PDF files to test the solution.
The code was written in VB.NET using the .NET 1.1 Framework and Visual Studio.NET 2003 on Windows XP, but there is no reason it wouldn't work on Windows NT/2000 or .NET 1.0.
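The registry check described above can also be done from code. The following is a minimal sketch, assuming that the presence of the AcroExch.PDDoc key under HKEY_CLASSES_ROOT is a good enough sign that the full version of Acrobat is installed. The module name AcrobatCheck and the helper IsProgIdRegistered are hypothetical and are not part of the downloadable project.

```vbnet
Imports System
Imports Microsoft.Win32

Module AcrobatCheck
    ' Returns True when the given ProgID has a key under HKEY_CLASSES_ROOT.
    Function IsProgIdRegistered(ByVal progId As String) As Boolean
        Dim key As RegistryKey = Registry.ClassesRoot.OpenSubKey(progId)
        If key Is Nothing Then
            Return False
        End If
        key.Close()
        Return True
    End Function

    Sub Main()
        If IsProgIdRegistered("AcroExch.PDDoc") Then
            Console.WriteLine("AcroExch.PDDoc is registered.")
        Else
            Console.WriteLine("AcroExch.PDDoc was not found; the full version of Acrobat may be missing.")
        End If
    End Sub
End Module
```

A check like this could run once at start-up so the batch process fails early with a clear message instead of failing later inside CreateObject.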
Using the code
The code is quite simple, with a try/catch over the main body. It is purposely in one large block so it's easy to see what is happening and to step through and examine with the debugger.
Initially we create an instance of AcroExch.PDDoc using late binding. The referenced Adobe Acrobat 5.0 Type Library (Acrobat.tlb from C:\Program Files\Adobe\Acrobat 5.0 SDK\InterAppCommunicationSupport\Headers) does not expose a COM class you can create using early binding. Referencing the type library still gives us IntelliSense and strong typing for the other Acrobat objects.
Pass the filename of the PDF document to be opened to the PDDoc object, which can then be queried for metadata on the document: GetNumPages() for the page count and GetInfo() for custom document properties.
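' Create the Acrobat document object via late binding and open the file
pdfDoc = CreateObject("AcroExch.PDDoc")
ret = pdfDoc.Open(inputFile)
If ret = False Then
    ' Open returns False when Acrobat cannot load the document
    Throw New FileNotFoundException("Could not open PDF document", inputFile)
End If
' Number of pages, read from the document itself
pageCount = pdfDoc.GetNumPages()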
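As a rough illustration of the metadata access mentioned above, the sketch below opens a document and prints its page count and title. It assumes that the full version of Acrobat is installed, that the file compiles with Option Strict Off (needed for late binding), and that "Title" is one of the info keys the document carries. The module name and the command-line handling are hypothetical and do not come from the project.

```vbnet
Option Strict Off
Imports System
Imports System.Runtime.InteropServices
Imports Microsoft.VisualBasic

Module PdfInfoSketch
    Sub Main(ByVal args As String())
        If args.Length = 0 Then
            Console.WriteLine("Usage: PdfInfoSketch <file.pdf>")
            Return
        End If

        ' Late-bound Acrobat document object, created the same way as in the article
        Dim pdfDoc As Object = CreateObject("AcroExch.PDDoc")
        Try
            If Not CBool(pdfDoc.Open(args(0))) Then
                Console.WriteLine("Acrobat could not open {0}", args(0))
                Return
            End If
            Console.WriteLine("Pages: {0}", pdfDoc.GetNumPages())
            ' GetInfo takes the name of a document info key such as "Title"
            Console.WriteLine("Title: {0}", pdfDoc.GetInfo("Title"))
            pdfDoc.Close()
        Finally
            ' Always release the COM reference, even when opening fails
            Marshal.ReleaseComObject(pdfDoc)
        End Try
    End Sub
End Module
```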
Set a reference to the first page of the document as pdfPage, which is of type Acrobat.CAcroPDPage. From this we can get a rectangle object of the actual page dimensions. One strange point to notice here is that the Adobe Acrobat SDK documentation seems incorrect, as the PDFRect that is returned from the GetSize() method has IDispatch properties x and y, but the PDFRect we need to supply to CopyToClipboard must have left, right, top and bottom.
Finally we render the PDF page to the clipboard at full size. We could have Acrobat scale the image down for us by a percentage, but we can get better visual results using the .NET scaling algorithms of the Bitmap class.
It would have been more efficient to render directly to an off-screen bitmap, and that would also have avoided overwriting whatever was previously on the clipboard, but I found the clipboard method the most stable way to get a rendered bitmap of the page using Acrobat.
Although it looks like the pdfPage object has a DrawEx method that can take an hDC, I couldn't get the method to work consistently. Calling DrawEx in the paint event of a Windows Forms application did work, but it still wouldn't write to an off-screen bitmap directly. Therefore the clipboard method is used, and if the process runs on a batch server this shouldn't cause too much worry.
Note: the Draw method is deprecated, as it only works on Win16 systems where hWnd was unique to Windows and not to each process as on NT.
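' Acquire the first page (page numbers are zero-based) and read its size
pdfPage = pdfDoc.AcquirePage(0)
pdfRectTemp = pdfPage.GetSize
' Build the rectangle that CopyToClipboard expects from the page size
pdfRect = CreateObject("AcroExch.Rect")
pdfRect.Left = 0
pdfRect.Right = pdfRectTemp.x
pdfRect.Top = 0
pdfRect.Bottom = pdfRectTemp.y
' Render the page to the clipboard at 100% zoom with no x/y offset
Call pdfPage.CopyToClipboard(pdfRect, 0, 0, 100)
Dim clipboardData As IDataObject = Clipboard.GetDataObject()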
Grab the rendered page bitmap from the clipboard and, based on the pdfRectTemp object, determine if it's a portrait or landscape document. Set the correct file to load as the template and, if the page is landscape, switch the width and height.
' The clipboard now holds the rendered page as a bitmap
Dim pdfBitmap As Bitmap = DirectCast(clipboardData.GetData(DataFormats.Bitmap), Bitmap)
' Default thumbnail size, for a portrait page
Dim thumbnailWidth As Integer = 38
Dim thumbnailHeight As Integer = 52
Dim templateFile As String
If (pdfRectTemp.x < pdfRectTemp.y) Then
    templateFile = templatePortraitFile
Else
    templateFile = templateLandscapeFile
    ' Swap width and height for landscape pages (XOR swap, no temporary variable)
    thumbnailWidth = thumbnailWidth Xor thumbnailHeight
    thumbnailHeight = thumbnailWidth Xor thumbnailHeight
    thumbnailWidth = thumbnailWidth Xor thumbnailHeight
End If
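The three Xor assignments above swap the two values without a temporary variable. The small sketch below, a stand-alone console module that is not part of the project, shows that trick next to the more conventional temporary-variable swap, which some readers may find easier to follow.

```vbnet
Imports System

Module SwapSketch
    Sub Main()
        Dim width As Integer = 38
        Dim height As Integer = 52

        ' XOR swap, as used in the article's code
        width = width Xor height
        height = width Xor height
        width = width Xor height
        Console.WriteLine("{0} x {1}", width, height)   ' prints "52 x 38"

        ' Equivalent swap using a temporary variable; this swaps the values back
        Dim temp As Integer = width
        width = height
        height = temp
        Console.WriteLine("{0} x {1}", width, height)   ' prints "38 x 52"
    End Sub
End Module
```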
Load the template file as a Bitmap and as an Image. We use both because the Bitmap class supports MakeTransparent and the image can easily be passed to the Graphics.DrawImage() method. It is slightly inefficient, but speed isn't the primary objective for this application.
Render the pdfImage using the GetThumbnailImage() method of the .NET Framework Bitmap class; this provides a very smooth scaled version of the image.
Next create a blank bitmap with room for the template border. Set the templateBitmap to use the bottom-left pixel of the image as the transparency colour by calling MakeTransparent(). See an article on Chris Sells' website for more on transparencies in .NET.
Using the new blank bitmap, draw the rendered PDF page image to it and then draw the template with transparency directly over the top. Because the main area of the template is transparent, the page image still shows through.
Finally, save the composited image as a .png or .gif file, although .png does look better.
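' Load the template twice: as a Bitmap (for MakeTransparent) and as an Image (for DrawImage)
Dim templateBitmap As Bitmap = New Bitmap(templateFile)
Dim templateImage As Image = Image.FromFile(templateFile)
' Scale the rendered page down to thumbnail size
Dim pdfImage As Image = pdfBitmap.GetThumbnailImage(thumbnailWidth, _
                                                    thumbnailHeight, _
                                                    Nothing, Nothing)
' Blank canvas with 7 extra pixels in each direction for the template border
Dim thumbnailBitmap As Bitmap = New Bitmap(thumbnailWidth + 7, _
                                           thumbnailHeight + 7, _
                                           Imaging.PixelFormat.Format32bppArgb)
' Use the colour of the bottom-left pixel as the transparent colour
templateBitmap.MakeTransparent()
' Draw the page first, then the template over the top
Dim thumbnailGraphics As Graphics = Graphics.FromImage(thumbnailBitmap)
thumbnailGraphics.DrawImage(pdfImage, 2, 2, thumbnailWidth, thumbnailHeight)
thumbnailGraphics.DrawImage(templateImage, 0, 0)
thumbnailBitmap.Save(outputFile, Imaging.ImageFormat.Png)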
Write some feedback to the console as we work through each of the files.
Then explicitly release the references to the COM objects, as Acrobat isn't the application best suited to opening and closing many PDF documents without falling over. Luckily the code doesn't cause Acrobat to display any UI that might cause the process to hang waiting for user interaction.
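Console.WriteLine("Generated thumbnail... {0}", outputFile)
' Free the GDI+ drawing surface and close the document
thumbnailGraphics.Dispose()
pdfDoc.Close()
' Release every COM reference explicitly, including the temporary size object
Marshal.ReleaseComObject(pdfPage)
Marshal.ReleaseComObject(pdfRect)
Marshal.ReleaseComObject(pdfRectTemp)
Marshal.ReleaseComObject(pdfDoc)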
Visual Studio.NET Solution
The project you can download has all the VB.NET code and the COM Interop DLL that was generated. Even though the application is a console application, we still need a reference to System.Windows.Forms, as the clipboard classes and data formats come from there.
Use the app.config to set the input and output paths for the .pdf files and .png files respectively. By default it reads from and writes to C:\thumbnails\.
Output
Running the PDFThumbnail.exe console application enumerates all the .pdf files in the directory specified in the .config file and writes out a .png image of the first page of each, as can be seen in the screenshot below.
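The enumeration step described above can be sketched roughly as follows. This is not the project's actual code: the appSettings key names inputPath and outputPath are assumptions (check the app.config in the download for the real ones), and the module and helper names are hypothetical. The sketch uses the .NET 1.1 ConfigurationSettings class and only lists which .png file would be produced for each .pdf file; it does not call Acrobat.

```vbnet
Imports System
Imports System.Configuration
Imports System.IO

Module EnumerateSketch
    ' Reads an appSettings value, falling back to a default when the key is absent.
    Function GetSetting(ByVal key As String, ByVal fallback As String) As String
        Dim value As String = ConfigurationSettings.AppSettings(key)
        If value Is Nothing OrElse value.Length = 0 Then
            Return fallback
        End If
        Return value
    End Function

    Sub Main()
        ' "inputPath" and "outputPath" are assumed key names, not the project's
        Dim inputPath As String = GetSetting("inputPath", "C:\thumbnails\")
        Dim outputPath As String = GetSetting("outputPath", "C:\thumbnails\")

        ' Map every .pdf in the input directory to a .png of the same name
        For Each inputFile As String In Directory.GetFiles(inputPath, "*.pdf")
            Dim name As String = Path.GetFileNameWithoutExtension(inputFile) & ".png"
            Dim outputFile As String = Path.Combine(outputPath, name)
            Console.WriteLine("{0} -> {1}", inputFile, outputFile)
        Next
    End Sub
End Module
```

Directory.GetFiles throws if the input directory does not exist, so a real batch job would probably want to check for it first.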
Further Enhancements
Further improvements might be to:
- Render directly to an off-screen bitmap rather than to the clipboard.
- Remove the reliance on having a full version of Adobe Acrobat by using Ghostscript libraries instead.
One case we had was documents that could be viewed internally but were blocked for external users due to compliance issues. By designing different templates and rendering them with the page, it was obvious that the document was private, further enhancing usability.
Points of Interest
The Adobe Acrobat 5.0 SDK documentation is not the best written, but most of the information is there if you dig a little.
If running under an NT service account, the screen resolution and colour depth make a difference. For example, if your server is only set for 256 colours at 640 x 480 and the console application is run via the service, it will not be able to render 24-bit colour thumbnails. I've seen the same effect when using charting controls from ASP, where the production IIS servers had low screen resolutions set and the colour depth of the charts was low.
Also, if running in a batch on a server, you should check the terms of the Acrobat license agreement to see whether you are allowed to run the Adobe Acrobat application in a server-type process.
The images are about 2-3 KB in size, and for about 3 GB of documents the thumbnails would take an additional 60 MB, so storage requirements are not excessive. The actual time to generate thumbnails for thousands of documents would be a few hours, as Acrobat needs to load each document and render it to the clipboard before the .NET bitmap scaling and compositing can take place.
References
- Microsoft .NET Framework 1.1 documentation
- Chris Sells' web site for the transparency example code
- Adobe Acrobat 5.0 SDK documentation and examples
- Code Complete Second Edition for the example PDF document (which I hope Steve doesn't mind me including and which I can totally recommend even nearly ten years since it was first published)
Conclusion
This article has shown how to manipulate PDF documents using the Acrobat SDK and combine images using the .NET framework.
At first it can be quite daunting trying to find good information on working with PDF documents programmatically, although there are now a number of good commercial components which hide a lot of the underlying PostScript complexities.
I originally wrote this utility in Visual Basic 6 using third-party imaging components, but now it is easier to share the code using the .NET framework, especially as the complex imaging and manipulation can now be done with a few simple statements.
Thanks and I hope you enjoyed reading this article; I'd be interested to hear if people found it useful.
History
- 19th January 2004 - Initial release to the Code Project.
- 12th May 2004 - Added C# version