See more: PDF VB.NET
Hello,
I have a project where I need to extract text and images from PDF pages and build a documentation database.
I can do this using VB.NET, IKVM and PDFBox. However, I still cannot get the x, y position of the text and images I am extracting.
 
Are there any solutions (other than going full Java; I am not a Java developer :) )?
 
Here is the code I am using to extract images (adapted from examples in the PDFBox documentation). The problem is that ImageX and ImageY always return 0. The other image properties (height and width) are set correctly.
 
 
    Private PDF As PDDocument = Nothing
    Private PDFPage As PDPage = Nothing
    Private PDFPageResources As PDResources = Nothing
    Private PDFPageStream As COSStream = Nothing

    Private PDFDocumentPages As java.util.ArrayList = Nothing
    Private ImageItem As PDXObjectImage = Nothing
    Private ImageMap As java.util.Map = Nothing
    Private ImageMapIterator As java.util.Iterator = Nothing

    Dim PDFEngine As New PDFStreamEngine

    ' Run the content stream of the first page through the stream engine
    PDFDocumentPages = PDF.getDocumentCatalog.getAllPages()
    PDFPage = CType(PDFDocumentPages.get(0), PDPage)
    PDFEngine.processStream(PDFPage, PDFPage.findResources, PDFPage.getContents.getStream)

    ' Walk the images listed in the page resources
    ImageMap = PDFPage.getResources.getImages()
    If ImageMap IsNot Nothing Then
        Dim ImageNumber As Integer = 1
        ImageMapIterator = ImageMap.keySet.iterator
        While ImageMapIterator.hasNext()

            Dim key As String
            key = CType(ImageMapIterator.next(), String)
            ImageItem = CType(ImageMap.get(key), PDXObjectImage)

            ' Graphics state as left by the processStream call above
            Dim CTM As org.apache.pdfbox.util.Matrix
            CTM = PDFEngine.getGraphicsState.getCurrentTransformationMatrix()

            ' Build the page rotation and its inverse as PDFBox matrices
            Dim rotationInRadians As Double = (PDFPage.findRotation * Math.PI) / 180
            Dim rotation As New java.awt.geom.AffineTransform
            rotation.setToRotation(rotationInRadians)

            Dim rotationInverse As java.awt.geom.AffineTransform = rotation.createInverse
            Dim rotationInverseMatrix As New org.apache.pdfbox.util.Matrix
            rotationInverseMatrix.setFromAffineTransform(rotationInverse)

            Dim rotationMatrix As New org.apache.pdfbox.util.Matrix
            rotationMatrix.setFromAffineTransform(rotation)

            ' Undo the page rotation, then read scale and position
            Dim unrotatedCTM As org.apache.pdfbox.util.Matrix = CTM.multiply(rotationInverseMatrix)
            Dim xScale As Single = unrotatedCTM.getXScale()
            Dim yScale As Single = unrotatedCTM.getYScale()

            Dim ImageX As Single = unrotatedCTM.getXPosition()
            Dim ImageY As Single = unrotatedCTM.getYPosition()
            Dim ImageH As Single = yScale / 100.0F * ImageItem.getHeight()
            Dim ImageW As Single = xScale / 100.0F * ImageItem.getWidth()

            ' ... code to save the image, etc.

            ImageNumber += 1
        End While
    End If
Posted 12-Jul-12 19:01pm
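
For reference, the position the code above tries to read comes from the current transformation matrix (CTM) in effect at the moment an image is painted: PDF draws every image into a unit square, and the CTM maps that square onto the page. The translation part of the matrix gives the position, and the lengths of its two basis vectors give the rendered size. Below is a minimal, self-contained sketch of that arithmetic. `ImagePlacement`, `CtmMath` and `PlacementFromCtm` are hypothetical names made up for illustration and are not part of PDFBox. An identity matrix yields position 0, 0, which would match the symptom described in the question.

```vbnet
Imports System

' Hypothetical helper type, not part of PDFBox.
Public Structure ImagePlacement
    Public X As Double        ' image origin corner (lower-left when unrotated), user-space units
    Public Y As Double
    Public Width As Double    ' rendered width, user-space units
    Public Height As Double   ' rendered height, user-space units
End Structure

Public Module CtmMath
    ' A PDF matrix [a b c d e f] maps (x, y) to (a*x + c*y + e, b*x + d*y + f).
    ' An image occupies the unit square before the matrix is applied.
    Public Function PlacementFromCtm(a As Double, b As Double, c As Double,
                                     d As Double, e As Double, f As Double) As ImagePlacement
        Dim p As ImagePlacement
        p.X = e                              ' where the corner (0, 0) lands
        p.Y = f
        p.Width = Math.Sqrt(a * a + b * b)   ' length of the transformed x axis
        p.Height = Math.Sqrt(c * c + d * d)  ' length of the transformed y axis
        Return p
    End Function

    Public Sub Main()
        ' Example values (assumed): a 200 x 100 pt image placed at (72, 500).
        Dim placed = PlacementFromCtm(200, 0, 0, 100, 72, 500)
        Console.WriteLine("x={0} y={1} w={2} h={3}", placed.X, placed.Y, placed.Width, placed.Height)

        ' Identity matrix: position 0, 0 and size 1 x 1.
        Dim untouched = PlacementFromCtm(1, 0, 0, 1, 0, 0)
        Console.WriteLine("x={0} y={1} w={2} h={3}", untouched.X, untouched.Y, untouched.Width, untouched.Height)
    End Sub
End Module
```

The size is computed from vector lengths rather than from `a` and `d` directly so that the result also holds when the image is rotated.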

1 solution


Solution 1

Hi,
Here is an article that covers this. Please go through it:
http://bytescout.com/products/developer/pdfextractorsdk/find-text-and-get-coordinates-pdf
 
Hope this will help you.
--Amit
Comments
Patrick PAL at 13-Jul-12 11:19am
   
Indeed this might help, thanks.
I downloaded the SDK and will check it over the weekend.
Of course, I would rather go for an open-source solution, but if we have to...
Sumit Rastogi SRA at 3-Mar-14 1:28am
   
Hello Amit,
 
Thanks for your reply, but I can't use a paid library. If you know of any open-source library, it would really help me.
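
On the open-source side, PDFBox itself (already used in the question) may be enough. A likely cause of the zeros is that `New PDFStreamEngine` is created without an operator table, so no `cm` operators are applied, and that the CTM is read after `processStream` has returned rather than while the image is being painted. The sketch below is one possible approach, modeled on the PrintImageLocations example that ships with PDFBox 1.x: subclass the engine and read the CTM when the `Do` operator is handled. It assumes PDFBox 1.x compiled with IKVM; `ImageLocationEngine` is a hypothetical class name, and the properties resource path is an assumption that has varied between PDFBox releases. It has not been run against the asker's setup.

```vbnet
Imports System
Imports org.apache.pdfbox.cos
Imports org.apache.pdfbox.pdmodel.graphics.xobject
Imports org.apache.pdfbox.util

' Sketch only: assumes PDFBox 1.x via IKVM, as in the question.
Public Class ImageLocationEngine
    Inherits PDFStreamEngine

    Public Sub New()
        ' Assumption: this properties file is bundled with PDFBox and registers
        ' the operators (cm, q, Q, Do, ...) that keep the graphics state current.
        MyBase.New(ResourceLoader.loadProperties(
            "org/apache/pdfbox/resources/PDFTextStripper.properties", True))
    End Sub

    ' Called by the engine for every operator in the content stream.
    Protected Overrides Sub processOperator(ByVal op As PDFOperator,
                                            ByVal arguments As java.util.List)
        If String.Equals(op.getOperation(), "Do") Then
            ' "Do" paints the XObject whose name is the first operand.
            Dim name As COSName = CType(arguments.get(0), COSName)
            Dim xobject As Object = getResources().getXObjects().get(name.getName())
            If TypeOf xobject Is PDXObjectImage Then
                ' The CTM at this moment maps the unit square to the image area.
                Dim ctm As Matrix = getGraphicsState().getCurrentTransformationMatrix()
                Console.WriteLine("{0}: x={1} y={2} w={3} h={4}",
                                  name.getName(),
                                  ctm.getXPosition(), ctm.getYPosition(),
                                  ctm.getXScale(), ctm.getYScale())
            End If
        End If
        ' Let the base class apply the operator (this also descends into form XObjects).
        MyBase.processOperator(op, arguments)
    End Sub
End Class
```

It would be driven the same way as in the question, for example `Dim engine As New ImageLocationEngine()` followed by `engine.processStream(PDFPage, PDFPage.findResources, PDFPage.getContents.getStream)`. The values are in PDF user space, with the origin at the bottom-left of the page.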

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Last Updated 13 Jul 2012
Copyright © CodeProject, 1999-2014