It works only with English text; it does not work with Bengali, Hindi, Chinese, or any other language.
RAM
There aren't any spaces between the words - the text all runs together.
This also doesn't work for PDFs from a URL, so if that's anybody's goal, don't waste your time.
-Tom
Hi, after playing with this example I found that it needs bool result = pdfParser.ExtractText("sample.pdf","output.pdf"); rather than String result = pdfParser.ExtractText(pdfFile); When I enter an input and output filename, it reads in example.pdf (or at least does not complain about reading it) and outputs a PDF that cannot be read by Acrobat. Any comments? I'm new at this, so I might be missing something obvious. Thank you.
There is Text Mining Tool, a freeware program that can convert PDF, DOC, RTF, CHM, and HTML files to text (extracting the text) without needing any other programs such as Word or Acrobat installed.
One of its most important features is a simple, user-friendly interface with hotkeys. It also includes the console tool minetext.exe, which can be helpful for developers or system administrators. The tool is based on the .NET 2.0 Framework, which should be installed from microsoft.com if you do not have it.
Actually, the tool is based on .NET 2.0, IKVM and pdfbox. (look at the dll files it uses) It seems to work quite well. For my needs it's very useful. Thanks.
I like the spirit of this article, but even with the correction posted by http://www.kilon.co.uk, this code is not as good as the original posted by Dan Letecky (http://www.codeproject.com/cs/samples/pdf2text.asp). Notably, spacing is not handled correctly as spaces will appear in the middle of words that appear whole in the PDF.
Has anybody fixed this problem with the code? Or does anybody have an alternative solution that uses iTextSharp or another 100% .NET library?
Thanks in advance.
-Craig
Hi,
I've been playing with this sample and found that it only pulled certain text from my PDF document. I have since tracked this down to the CheckToken(string[] tokens, char[] recent) method of the PDFParser class: I was getting an index out of bounds error. To resolve this, simply replace the following lines:
    foreach(string token in tokens)
    {
        if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
with
    foreach(string token in tokens)
    {
        if (token.Length > 1) // Otherwise the "if" fails
        {
            if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
Obviously, add the matching end brackets.
This resolved my problem of only being able to read two pages out of nine!
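For anyone who wants to try the guard in isolation, here is a minimal sketch of the idea. It is not the article's code: the class name CheckTokenSketch is made up, the buffer length is assumed to equal _numberOfCharsToKeep, and the rest of the original condition (which is not quoted above) is left out.

```csharp
using System;

static class CheckTokenSketch
{
    // Returns true if the two characters just before the final character of
    // `recent` match any two-character token. Only the two comparisons quoted
    // in the thread are reproduced; the remainder of the original condition
    // is omitted in this sketch.
    public static bool CheckToken(string[] tokens, char[] recent)
    {
        int n = recent.Length; // assumption: same as _numberOfCharsToKeep
        if (n < 3) return false;

        foreach (string token in tokens)
        {
            // A single-character token such as "'" has no token[1],
            // so skip it instead of indexing past the end of the string.
            if (token.Length < 2) continue;

            if (recent[n - 3] == token[0] && recent[n - 2] == token[1])
            {
                return true;
            }
        }
        return false;
    }
}
```

Skipping with continue has the same effect as wrapping the comparison in if (token.Length > 1). Note that with this guard the single-character tokens are simply never matched, which may or may not be what you want.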
The code works great for document text, but I am unable to pull footer information.
Does anyone have a solution for this?
I tried the PDF of the English edition and it works, but for the Complex (Traditional) Chinese edition it cannot extract the correct Chinese characters I need! Please offer some advice or good news. Thanks a lot!
I have a .Net web application based on C#. Can this code be used in my web app? If so, what do I need to do to port it to the web?
Thanks!
CheckToken is called in a few places to check whether a particular sequence of characters exists at the end of the previousCharacters array. CheckToken always checks two characters, token[0] and token[1], but there is a line in ExtractTextFromPDFBytes: "if (CheckToken(new string[] {"'", "T*", "\""}, previousCharacters))". Note that "'" is a single-character string, so checking token[1] in CheckToken will give an index out of range error.
I don't understand what checking "'" is doing. How do I work around this?
Does anyone know why I would get the "Attempted to read past end of stream" error on some PDF files? Note that these PDFs are only one page long.
Thanks
Hi,
I've noticed that your algorithm does not work with some PDFs: it does not extract any text. Trying with pdfbox, it works!
Here is an example PDF: http://petoulachi.coldwire.net/datas/test.pdf
I'm starting to look for where the problem is.
I found the "bug", but not the solution.
PDFs that have embedded font subsets refer to each character as a byte number, which you need to look up against the embedded font.
If you manage to figure out how to load the font and see which characters are embedded, you can replace the byte numbers and parse the text correctly.
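A minimal sketch of just the replacement step, assuming the code-to-character table has already been recovered from the font (for example from its ToUnicode CMap) and that each character code is a single byte. The class name, method name, and table contents are hypothetical; loading the font itself is not shown.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class SubsetFontSketch
{
    // Translates raw single-byte character codes from a content stream into
    // text using a per-font lookup table. Codes that are missing from the
    // table are emitted as U+FFFD (the Unicode replacement character).
    public static string Decode(byte[] codes, IDictionary<byte, char> codeToChar)
    {
        var sb = new StringBuilder(codes.Length);
        foreach (byte code in codes)
        {
            char c;
            sb.Append(codeToChar.TryGetValue(code, out c) ? c : '\uFFFD');
        }
        return sb.ToString();
    }
}
```

Each font in the document would need its own table, since a subset font can assign the same byte number to different characters.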
I tried the demo command line tool. Looks good except that it didn't extract all the text from the document. I have a suspicion that part of the text was marked with a code to prevent it being extracted in addition to the global flag.
Well, as you can see on this page, the author said: "Although the code worked well for me, i didn't find in Adobe's pdf reference how to parse special characters. So if someone knows how to do this, just post it and i will update the class."