Hello,
I tried the following statements. They compile successfully and no error occurs, but the document is not loaded by the first statement.
    PDDocument doc = PDDocument.load(SourceFile);
    PDFTextStripper stripper = new PDFTextStripper();
    return stripper.getText(doc);
I have given a valid path; if I give a wrong path, it throws an error. What can be wrong? The reason I say the doc object is not being filled from the file is that I get an "Object reference not set to an instance of an object." error on the third line (return stripper.getText(doc)).
~Sanjivani
modified on Tuesday, June 9, 2009 5:59 AM
Hi,
I can read only the text in a PDF file with PDFBox; it does not let me read the images in the PDF file.
How can I read images from a PDF file?
Kind Regards, Saurabh
Hi
I have tried to use:
PDFTextStripper, but it is impossible to parse the text since the table cells are not delimited by any character.
PDFStreamParser, but I failed to understand how to navigate through the result. See the code below:
    ...
    page = CType(allPages.get(pindex), PDPage)
    contents = page.getContents()
    Dim parser As org.pdfbox.pdfparser.PDFStreamParser = New org.pdfbox.pdfparser.PDFStreamParser(contents.getStream())
    parser.parse()
    Dim tokens As java.util.List = parser.getTokens()
    ' Indexes run from 0 to size() - 1
    For tokenI As Integer = 0 To tokens.size() - 1
        ' here I should try to identify the table start/end
        Console.WriteLine(String.Format(" Token {0}/{1}", tokenI, tokens.size()))
    Next 'For tokenI As Integer = 0 To tokens.size() - 1
1. Is there a way to identify a table in a PDF file?
2. What are the alternatives for extracting only table data using PDFBox?
3. How is it possible to step through a table?
Regards, Hanan
I've followed the code examples and placed the references where they needed to go, but I keep running into this error. Can anyone help?
This is my code:
    using System;
    using System.Collections.Generic;
    using System.ComponentModel;
    using System.Data;
    using System.Drawing;
    using System.IO; // needed for StreamWriter and File
    using System.Linq;
    using System.Text;
    using System.Windows.Forms;
    using org.pdfbox.util;
    using org.pdfbox.pdmodel;
    using IKVM.GNU.Classpath;
    using IKVM.Runtime;

    namespace PDF_Parse
    {
        public partial class Form1 : Form
        {
            public Form1()
            {
                InitializeComponent();
            }

            private static string parseUsingPDFBox(string filename)
            {
                PDDocument doc = PDDocument.load(filename);
                PDFTextStripper stripper = new PDFTextStripper();
                return stripper.getText(doc);
            }

            public static void Main(string[] args)
            {
                StreamWriter writer = File.CreateText("output.txt");
                writer.WriteLine(parseUsingPDFBox("pod.pdf"));
                writer.Close();
            }
        }
    }
I want to convert a PDF that is located at a URL to text. Is there a way to do that without having to save the PDF file?
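One way to avoid a temporary file is to read the download into memory and hand the parser a stream instead of a path. Below is a minimal sketch in Java (PDFBox's native language; the java.io and java.net classes are also exposed to .NET through the IKVM class library). The names `InMemoryFetch`, `readAll` and `fetch` are hypothetical. The sketch stops at an in-memory stream and assumes that your PDFBox build offers a `PDDocument.load` overload accepting an `InputStream`, which you should verify for your version.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

class InMemoryFetch {
    // Drain a stream into a byte array without touching the disk.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    // Download the resource at the given address and expose it as an
    // in-memory stream. The address is whatever URL holds the PDF.
    static InputStream fetch(String address) throws IOException {
        InputStream in = new URL(address).openStream();
        try {
            return new ByteArrayInputStream(readAll(in));
        } finally {
            in.close();
        }
    }
}
```

Buffering the whole file keeps the code simple but holds the entire PDF in memory, so it is best suited to documents of moderate size.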
Initially I used PDFBox with VB.NET and it works smoothly. But when I ported the code to C#, a try/catch for WrappedIOException (org.pdfbox.exceptions.WrappedIOException) is required, and a WrappedIOException is always thrown for any PDF file; e.getMessage() is "The signature is incorrect."
What could be wrong? Meanwhile, the VB.NET code works without any try/catch.
I'm using PDFBox-0.7.3 with Visual C# 2008 Express Edition and VB.NET 2008 Express Edition.
I created another new C# project just to test PDFBox, and it works. But the old project still has the same exception problem. How can I repair this problem in that specific C# project?
Without the FontBox-*-dev.dll library you will receive the same error.
So, my working directory is as follows:
    19.02.2009 14:17     16 384 Pdf2Text.exe
    12.10.2006 12:20  4 653 056 PDFBox-0.7.3.dll
    10.08.2006 10:17  9 568 256 IKVM.GNU.Classpath.dll
    19.02.2009 14:14  1 290 714 sample.pdf
    12.10.2006 12:20     86 016 FontBox-0.1.0-dev.dll
    10.08.2006 10:14    344 064 IKVM.Runtime.dll
Why is it you all are able to load a PDFBox-0.7.x.dll into the GAC? Are you compiling your own PDFBox-0.7.x.dll with a .snk? If so, from what PDFBox-0.7.x.dll source? If not, where can I locate, for download, a strongly named PDFBox-0.7.[2 or 3 or whatever].dll?
I've tried PDFBox versions .2 and .3 and my gacutil.exe fails on adding either assembly to the cache.
But you guys appear to have no problem with that. I'm using .NET SDK v2.0.
BTW, is anybody even using the GAC? Or are you allowed to just drop these DLLs directly into a directory path and start compiling the PDF sample?
Thanks for any replies.
modified on Friday, December 26, 2008 8:10 PM
Hello, when I convert a PDF file to text, the text is displayed in a different font type.
Amitkumar Prajapati Anjar(Kutch)/Baroda
I want to convert PDF to DOC. I am using PDFBox-0.7.3 from CodeProject with C# .NET. When converting a file from PDF to DOC, the text is converted correctly but without formatting, and I also can't get the images from the PDF file.
I want to read/parse tabular data from PDF documents. I have found some third-party software that can convert the entire PDF to text while preserving the layout (display). But none of the tools provide predefined separators/delimiters between the text of each cell.
I have also investigated some tools that can convert the PDF to HTML, which could then be parsed. But even in the HTML files the entire table is represented by absolutely positioned divs. It would have been easy to parse the tables if they were real HTML tables.
Is there any way I can read the tables from a PDF document into some object (which can be easily represented in terms of rows and columns)? Or is there any third-party developer library with which I can easily read the cells of a table in a PDF? Please let me know even if there is only third-party software that converts a PDF containing tabular data to an HTML document that represents the data in HTML tables.
Thanks in anticipation.
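Since both the layout-preserving text and the HTML output keep only absolute positions, one heuristic is to rebuild the grid from the coordinates: fragments whose vertical positions lie within a tolerance of each other form a row, and each row is then sorted left to right. Below is a minimal sketch in Java (PDFBox's native language), assuming you have already obtained each fragment's text and x/y position by some means. The `Fragment` and `TableGrouper` types and the `tolerance` parameter are hypothetical and not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// A piece of text with the position where it was drawn (hypothetical type).
class Fragment {
    final double x, y;
    final String text;
    Fragment(double x, double y, String text) {
        this.x = x;
        this.y = y;
        this.text = text;
    }
}

class TableGrouper {
    // Group fragments into rows by y (within tolerance), then order each row by x.
    // Rows come out in ascending y; whether that is top-to-bottom depends on
    // the coordinate system the positions were taken from.
    static List<List<String>> toRows(List<Fragment> fragments, double tolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> rows = new ArrayList<>();
        double rowY = 0; // y of the first fragment in the current row
        for (Fragment f : sorted) {
            if (rows.isEmpty() || Math.abs(f.y - rowY) > tolerance) {
                rows.add(new ArrayList<>());
                rowY = f.y;
            }
            rows.get(rows.size() - 1).add(f);
        }

        List<List<String>> result = new ArrayList<>();
        for (List<Fragment> row : rows) {
            row.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            List<String> cells = new ArrayList<>();
            for (Fragment f : row) {
                cells.add(f.text);
            }
            result.add(cells);
        }
        return result;
    }
}
```

Cells that wrap onto several lines, merged cells, and empty cells will defeat this simple grouping, so treat it as a starting point rather than a general table parser.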
Hi,
Trying to run your code, I am getting this error at run time:
Could not load file or assembly 'bcprov-jdk14-132, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null' or one of its dependencies. The system cannot find the file specified.
on the first line of the given code:
    PDDocument doc = PDDocument.load(filename);
I've copied:
    bcprov-jdk14-132.dll
    FontBox-0.1.0-dev.dll
    IKVM.GNU.Classpath.dll
    IKVM.Runtime.dll
    PDFBox-0.7.3.dll
from the PDFBox-0.7.3 bin directory to my project, but the problem persists.
Any suggestions?
I had copied the DLL files to the bin directory, imported the Classpath and PDFBox DLL file references, and put in the namespaces
    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.IO;
    using org.pdfbox.util;
    using org.pdfbox.pdmodel;
but it still was not working. It threw a System.IO.File exception on my input file.
The problem was the later version of PDFBox (0.7.3).
I used the following files from http://www.netlikon.de/docs/PDFBox-0.7.2/bin/?C=M;O=A :
    IKVM.Runtime.dll (9/7/2005 356K)
    IKVM.GNU.Classpath.dll (9/7/2005 6.8M)
    PDFBox-0.7.2.dll (9/11/2005 8.1M)
and this fixed it, along with a rewrite. (The rewrite on its own, before I changed the files, did NOT solve the issue, so it wasn't the fix, but the code does make more sense this way. This is all in C#.)
This assumes the input and output files have been created and are in the same directory as your built exe file. As I said in the subject, your input PDF file CANNOT be a URL path, as this is NOT supported.
    static void Main() // string[] args
    {
        // DateTime dt = DateTime.Now;
        StreamWriter writer = File.CreateText("output.txt");
        writer.WriteLine(TransformPdfToText("input.pdf"));
        writer.Close();
    }

    static string TransformPdfToText(string SourceFile)
    {
        PDDocument doc = PDDocument.load(SourceFile);
        PDFTextStripper stripper = new PDFTextStripper();
        return stripper.getText(doc);
    }
Happy coding!!
-Tom
I'm using Visual Studio 2005 and C# .NET.
I put the DLL files into the bin directory, added the files as References, and added these namespaces as directed in this forum:
    using org.pdfbox.util;
    using org.pdfbox.pdmodel;
and am trying to use this code in my Main function:
    string filename = "test.pdf";
    PDDocument doc = PDDocument.load(filename);
    PDFTextStripper stripper = new PDFTextStripper();
    StreamWriter writer = File.CreateText("output.txt");
    writer.Write(stripper.getText(doc));
    writer.Close();
Wouldn't this be the right way to do it? No matter what I put for filename (e.g. C:\\test.pdf, or http://localhost/test.pdf when I put it in my C:\inetpub\wwwroot directory), it throws an exception on the PDDocument line: The type initializer for 'java.io.File' threw an exception.
Any help? Thanks in advance.
-Tom
The problem was the later version (0.7.3).
I used the following files from http://www.netlikon.de/docs/PDFBox-0.7.2/bin/?C=M;O=A :
    IKVM.Runtime.dll (9/7/2005 356K)
    IKVM.GNU.Classpath.dll (9/7/2005 6.8M)
    PDFBox-0.7.2.dll (9/11/2005 8.1M)
and this fixed it, along with a rewrite following Dan's lead with his VB, which I converted back to C#. This assumes the files are in the same directory as your built exe file.
    static void Main() // string[] args
    {
        // DateTime dt = DateTime.Now;
        StreamWriter writer = File.CreateText("output.txt");
        writer.WriteLine(TransformPdfToText("input.pdf"));
        writer.Close();
    }

    static string TransformPdfToText(string SourceFile)
    {
        PDDocument doc = PDDocument.load(SourceFile);
        PDFTextStripper stripper = new PDFTextStripper();
        return stripper.getText(doc);
    }
Happy coding!! 
-Tom