Dll Library not shown (
Wow, surprised nobody helped.
Basically, for PDDocument to load, the DLL files have to go in both the bin and the bin/debug folders.
If they are missing, the printed exception will show that it cannot find them there.
private static string parseUsingPDFBox(string filename)
{
    PDDocument doc = null;
    try
    {
        // Load the PDF from the path passed in
        doc = PDDocument.load(filename);
    }
    catch (Exception f)
    {
        // A missing DLL shows up in this exception
        Console.WriteLine(f);
    }
    PDFTextStripper stripper = new PDFTextStripper();
    string s = null;
    try
    {
        s = stripper.getText(doc);
    }
    catch (Exception g)
    {
        Console.WriteLine(g);
    }
    return s;
}
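A quick way to confirm the DLL placement described above is to list which expected files are absent from the output folder before calling PDFBox. This is only a minimal sketch: the class and method names are hypothetical, and the DLL names are supplied by the caller because this thread does not list them.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of PDFBox or IKVM.
public static class DependencyCheck
{
    // Returns the names from fileNames that do not exist in directory.
    // The caller supplies the DLL names; none are assumed here.
    public static List<string> FindMissingFiles(string directory, IEnumerable<string> fileNames)
    {
        var missing = new List<string>();
        foreach (string name in fileNames)
        {
            if (!File.Exists(Path.Combine(directory, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

Calling `DependencyCheck.FindMissingFiles(AppDomain.CurrentDomain.BaseDirectory, names)` at startup and printing the result should show whether anything is missing from bin/debug. The `names` list would hold whatever PDFBox and IKVM DLLs your project references.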
The stripper had a problem as well:
System.NullReferenceException: Object reference not set to an instance of an object.
at org.pdfbox.pdmodel.PDPageNode.getAllKids(List , COSDictionary , Boolean )
at org.pdfbox.pdmodel.PDPageNode.getAllKids(List result)
at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages()
at org.pdfbox.util.PDFTextStripper.writeText(PDDocument doc, Writer outputStream)
at org.pdfbox.util.PDFTextStripper.getText(PDDocument doc)
private static string parseUsingPDFBox(string filename)
{
    PDDocument doc = new PDDocument();
    try
    {
        doc = PDDocument.load(filename);
    }
    catch (Exception f)
    {
        Console.WriteLine(f);
    }
    PDFTextStripper stripper = new PDFTextStripper();
    string s = null;
    try
    {
        s = stripper.getText(doc);
        doc.close();
        Console.WriteLine(s);
    }
    catch (Exception g)
    {
        Console.WriteLine(g);
    }
    return s;
}
This is the furthest I got with it, and it still gives the above error. I don't get why this code won't work.
If anybody could be so kind as to post an answer to these java.io questions, it would be great.
System.NullReferenceException: Object reference not set to an instance of an object.
at org.pdfbox.pdmodel.PDPageNode.getAllKids(List , COSDictionary , Boolean )
at org.pdfbox.pdmodel.PDPageNode.getAllKids(List result)
at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages()
at org.pdfbox.util.PDFTextStripper.writeText(PDDocument doc, Writer outputStream)
at org.pdfbox.util.PDFTextStripper.getText(PDDocument doc)
I am getting this same error. I have narrowed it down: it occurs only with certain PDF files.
For instance...
Works:
PDF Producer: FPDF 1.6
Fast Web View: No
PDF Version: 1.3
Doesn't Work:
PDF Producer: Acrobat Distiller 9.0.0 (Windows)
Fast Web View: Yes
PDF Version: 1.5
I already tried downconverting my file to 1.3 and then processing it, but that still did not work. I think the problem is that PDFBox cannot handle the newer format or the way the producer constructs the PDF.
Hi,
I want to build a PDF-to-text conversion tool. Please provide some code and guidelines for my project.
Thank you,
prem
PDFBox is not working with trust level = medium. Is there any way to use it in medium trust?
First of all, great solution.
I have a problem with this code and Hebrew (and I think the same problem applies to PDFs in Arabic).
The text appears mirrored (left to right instead of right to left).
Does anybody have a solution?
Thanks in advance
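One possible workaround for the mirrored output described above is to reverse each extracted line after the text comes back from the stripper. The sketch below is not a PDFBox feature and `RtlTextFixer` is a hypothetical name. It assumes every line is fully mirrored, so any embedded numbers or Latin words get reversed too; a proper fix would apply the Unicode bidirectional algorithm.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Hypothetical post-processing helper, not part of PDFBox.
public static class RtlTextFixer
{
    // Reverses the characters of each line while keeping the line order.
    public static string ReverseEachLine(string text)
    {
        if (text == null) return null;
        string[] lines = text.Split('\n');
        for (int i = 0; i < lines.Length; i++)
        {
            lines[i] = ReverseLine(lines[i]);
        }
        return string.Join("\n", lines);
    }

    private static string ReverseLine(string line)
    {
        // Keep a trailing '\r' (Windows line ending) at the end of the line.
        bool hasCr = line.EndsWith("\r", StringComparison.Ordinal);
        string body = hasCr ? line.Substring(0, line.Length - 1) : line;

        // Reverse by text element so combining marks (e.g. Hebrew vowel
        // points) stay attached to their base letters.
        var elements = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(body);
        while (e.MoveNext())
        {
            elements.Add(e.GetTextElement());
        }
        elements.Reverse();

        var sb = new StringBuilder(body.Length + 1);
        foreach (string element in elements)
        {
            sb.Append(element);
        }
        if (hasCr) sb.Append('\r');
        return sb.ToString();
    }
}
```

It would be applied to the string returned by `stripper.getText(doc)`, and only for documents known to come out mirrored.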
Hi, I am still looking for a solution to the wrong Hebrew text order.
Did you end up solving it?
Thank you
Hey,
Did you find a solution for the Hebrew problem?
First of all, thank you for posting this article! It's exactly what I needed.
Just wanted to let you know that PDFBox-0.7.3 already has a tool for extracting text (extracttext.exe) and another for extracting images (extractimages.exe). Here's usage information for the former:
C:\PDFBox-0.7.3\bin>extracttext
Usage: java org.pdfbox.ExtractText [OPTIONS] <PDF file> [Text File]
-password <password> Password to decrypt document
-encoding <output encoding> (ISO-8859-1,UTF-16BE,UTF-16LE,...)
-console Send text to console instead of file
-html Output in HTML format instead of raw text
-sort Sort the text before writing
-startPage <number> The first page to start extraction(1 based)
-endPage <number> The last page to extract(inclusive)
<PDF file> The PDF document to use
[Text File] The file to write the text to
I don't know if those tools were available when you posted the article. Thank you once again for taking the time to share your findings with the community.
It's not producing the exact result.
When I use this code, I get the following message. Please help.
System.NullReferenceException: Object reference not set to an instance of an object.
at org.pdfbox.pdmodel.PDPageNode.getAllKids(List , COSDictionary , Boolean )
at org.pdfbox.pdmodel.PDPageNode.getAllKids(List result)
at org.pdfbox.pdmodel.PDDocumentCatalog.getAllPages()
at org.pdfbox.util.PDFTextStripper.writeText(PDDocument doc, Writer outputStream)
at org.pdfbox.util.PDFTextStripper.getText(PDDocument doc)
at WindowsApplication1.Form1.getpdftext(String filename) in D:\My Documents\Visual Studio 2008\Projects\WindowsApplication3\WindowsApplication3\Form1.vb:line 12
I'm getting the same problem... any solutions?
I have the same problem. Did anyone ever find a reason for this?
Hi,
I converted some Java code to a DLL using IKVM. I'm using the DLL in .NET, but its functions return null. What's the problem? (My source code works correctly.)
I'm using this simple example. Some PDF documents work great, but for others the output file is empty. Where is the problem? Some filter? Some structure?
As the subject says..
Of course there is no problem converting a file in English,
but how can I convert one in Japanese?
By the way,
the Japanese file can be converted, but not correctly:
some words are copied many times!
If anyone has ideas, please tell me!
And thanks a lot!
modified on Friday, August 21, 2009 12:21 AM
Hi, I have an e-learning application that needs to convert PDF to HTML.
How?
Hello,
I tried the following statements. They compile successfully and no error occurs, but the document does not seem to get loaded by the first statement.
PDDocument doc = PDDocument.load(SourceFile);
PDFTextStripper stripper = new PDFTextStripper();
return stripper.getText(doc);
I have given a valid path; if I give a wrong path, it throws an error. What can be wrong? The reason I say the doc object is not being filled from the file is that I get an "Object reference not set to an instance of an object." error on the third line (return stripper.getText(doc);).
~Sanjivani
modified on Tuesday, June 9, 2009 5:59 AM
Hi,
I can read only the text in a PDF file with PDFBox,
but it doesn't let me read the images in the PDF file.
How can I read images from a PDF file?
Kind Regards,
Saurabh
I reckon PDResources.getImages will extract images from the PDF document for you.
Examine the PDFBox assembly in your object browser.
Hi
I have tried to use:
PDFTextStripper, but it is impossible to parse the text since the table cells are not delimited by any character.
PDFStreamParser, but I failed to understand how to navigate through the result. See the code below:
...
page = CType(allPages.get(pindex), PDPage)
contents = page.getContents()
Dim parser As org.pdfbox.pdfparser.PDFStreamParser = New org.pdfbox.pdfparser.PDFStreamParser(contents.getStream())
parser.parse()
Dim tokens As java.util.List = parser.getTokens()
' "To" is inclusive in VB, so stop at size() - 1
For tokenI As Integer = 0 To tokens.size() - 1
    ' here I should try to identify the table start/end
    Console.WriteLine(String.Format(" Token {0}/{1}", tokenI, tokens.size()))
Next 'For tokenI As Integer = 0 To tokens.size() - 1
1. Is there a way to identify a table in PDF file ?
2. What are the alternatives for extracting tables data only using pdfBox ?
3. How is it possible to step through a table ?
Regards,
Hanan