PDF Text Extraction Problem

Question

When I try to extract plain text from a PDF, I get garbled data instead of the actual text. The fonts in that PDF are embedded subsets with names like TT222FO00, and the encoding is custom.

Can anybody help me with this?

Thanks in advance.

[moved up from comment]

This is how I'm doing it:

Posted 6-Jul-11 23:50pm

ajaad

Updated 30-Jul-12 3:52am


Comments

Joan M 7-Jul-11 6:06am

In case Manfred R. Bihy's answer does not work for you (though I think it will), please post a small sample of what you are doing so that we can help. Good luck.

Nagy Vilmos 7-Jul-11 9:46am

I've removed the duplicate comments and placed them in the question. If you want to add something, click the nice green "Improve question" link.

3 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Manfred Rudolf Bihy · Answer 1 · 2011-07-07T00:01:00

Solution 1

Maybe you'd want to try one of the free libraries listed at http://java-source.net/open-source/pdf-libraries.

Hope you'll find something appropriate there :).

Cheers!

—MRB

Comments

ajaad 8-Jul-11 7:27am

I am currently using that same library. Is there any better solution for this?

TorstenH. · Answer 2 · 2012-07-30T04:13:00

Solution 2

I can recommend PDF Clown.

It is well documented and works fine.

Pandvi · Answer 3 · 2012-07-30T15:24:00

Solution 3

iText is the third-party library that most developers use. For extraction, please see this discussion: http://stackoverflow.com/questions/4026614/extract-text-from-pdf-files
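
As background to the garbled output described in the question: one common cause with embedded subset fonts that use a custom encoding is that the bytes in the page content are font-specific glyph codes rather than standard character codes, so an extractor can only recover readable text if it has a code-to-Unicode table (in a PDF this usually comes from the font's ToUnicode map). The thread does not confirm that this is the cause here, so treat the following as a rough illustration only. It uses plain Java with no PDF library, and the class name, the sample codes, and the mapping table are all made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

class CustomEncodingDemo {

    // Hypothetical code-to-Unicode table, standing in for the mapping
    // a subset font would need to carry for text extraction to work.
    static Map<Integer, String> sampleTable() {
        Map<Integer, String> table = new HashMap<>();
        table.put(0x01, "H");
        table.put(0x02, "e");
        table.put(0x03, "l");
        table.put(0x04, "o");
        return table;
    }

    // Translate each glyph code through the table.
    // Codes with no entry become U+FFFD (the replacement character).
    static String decode(byte[] codes, Map<Integer, String> table) {
        StringBuilder sb = new StringBuilder();
        for (byte b : codes) {
            sb.append(table.getOrDefault(b & 0xFF, "\uFFFD"));
        }
        return sb.toString();
    }

    // Naive extraction: treat the glyph codes as if they were Latin-1 text.
    static String decodeNaively(byte[] codes) {
        return new String(codes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Made-up glyph codes as they might appear in a page's text operator.
        byte[] codes = {0x01, 0x02, 0x03, 0x03, 0x04};

        // With the table, the codes resolve to readable text.
        System.out.println("mapped: " + decode(codes, sampleTable()));

        // Without it, the same bytes are just control characters.
        System.out.println("naive readable? "
                + decodeNaively(codes).equals(decode(codes, sampleTable())));
    }
}
```

If a document's fonts carry no usable mapping, switching libraries may not help, because every extractor sees the same glyph codes; in that situation people often fall back to OCR on the rendered pages.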
