Thanks, your code is useful, and it is very simple to overload the extract method to get single pages or a page range. Nice work.
 |
PDFBox creates many, many temporary files that it does not automatically get rid of. Beware! Your disk will fill up quickly! These are created in the default Windows temp folder.
PDFParser does not create these temp files. It certainly is far from perfect (index out of range exceptions, spaces in words etc) but it is fast, and with some tweaking can be made to work, kind of...
 |
Hello,
I am merging a number of PDFs into a single PDF, and it works fine. But when I click File --> Properties, it shows the PDF producer name, and I don't want that to be shown.
Is there any way to hide it, perhaps with a few lines of code? Please help.
Thanks
 |
Hi all,
I want to update the text in a TIFF image after drawing a rectangle around that text. Is there any way to do this, so that the user draws a rectangle on the TIFF and then updates the text inside it?
I am able to draw a rectangle on the TIFF image and show that text in a RichTextBox. If the user changes the text in the RichTextBox, the TIFF image should be updated to match.
Please help me with that.
Any example would be a great help!
Thanks, Raj
 |
Hi, I'm trying to run this class to extract text, but it only extracts a small portion of the text from the PDF. Can anyone tell me why?
modified on Monday, July 27, 2009 7:32 PM
 |
According to the PDF documentation, non-ASCII characters are coded as an octal value after an escape character (\). To adjust the code, just create a string (called octalCode, for example) outside the block and add the following lines:
if (nextLiteral && c >= '0' && c <= '9')
{
    octalCode = ((char)input[i]).ToString() + ((char)input[++i]).ToString() + ((char)input[++i]).ToString();
    c = (char)Convert.ToInt32(octalCode, 8);
}
if (((c >= ' ') && (c <= '~')) || ((c >= 128) && (c < 255)))
{
    resultString += c.ToString();
}
nextLiteral = false;
It works for me.
Regards.
Alberto
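For readers who want to try the idea in isolation, here is a minimal sketch of octal-escape decoding as a standalone helper. It is not the article's code: the class and method names (PdfOctal, DecodeOctalEscapes) are hypothetical, and it assumes the decoded byte maps directly to the character with the same code, which only holds for Latin-1-like encodings. Unlike the snippet above, it accepts one to three octal digits (0 to 7) after the backslash, which is what the PDF literal string syntax allows, and it leaves other escapes such as \n or \( untouched.

```csharp
using System;
using System.Text;

public static class PdfOctal
{
    // Hypothetical helper: replaces \ddd escapes (1 to 3 octal digits) with the
    // character of that code. Any other escape pair is copied through unchanged.
    public static string DecodeOctalEscapes(string input)
    {
        var result = new StringBuilder();
        int i = 0;
        while (i < input.Length)
        {
            char c = input[i];
            if (c == '\\' && i + 1 < input.Length && IsOctalDigit(input[i + 1]))
            {
                int value = 0;
                int digits = 0;
                i++; // skip the backslash
                while (digits < 3 && i < input.Length && IsOctalDigit(input[i]))
                {
                    value = value * 8 + (input[i] - '0');
                    i++;
                    digits++;
                }
                // Keep only the low byte; assumption: byte value == character code.
                result.Append((char)(value & 0xFF));
            }
            else if (c == '\\' && i + 1 < input.Length)
            {
                // Not an octal escape: copy the backslash and the next character
                // together so an escaped backslash is never misread.
                result.Append(c);
                result.Append(input[i + 1]);
                i += 2;
            }
            else
            {
                result.Append(c);
                i++;
            }
        }
        return result.ToString();
    }

    private static bool IsOctalDigit(char c)
    {
        return c >= '0' && c <= '7';
    }
}
```

Calling PdfOctal.DecodeOctalEscapes(@"\307a") should give "Ça", since octal 307 is decimal 199.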
 |
Hi,
Every special character (such as Ç, for example) must be converted to an octal string. Just as an example (I don't have the correct codes at hand): if I need to write a "Ç", I must obtain its character code and then convert this value to octal. Character code 192 converted to octal is 300. To put this character in a PDF file, you must then write "\300" in the file.
If you need to write "ÇÇ", then you must write "\300\300".
Hope this helps.
Alberto
 |
Hi, thanks for the fast response.
When I try to write "\300\300" in a string, I get a syntax error.
I tried to use this method to convert the string, but I also get an error.
public static String convertToOctal(String input)
{
    String resultString="";
    char c;
    for(int i=0;i<input.Length;i++)
    {
        c = (char)input[i];
        if (c > 500)
        {
            resultString+=(char)(Convert.ToInt32(((int)c).ToString(),8));
        }
        else
        {
            resultString += c;
        }
    }
    return resultString;
}
What am I missing?
Thanks very much in advance.
Regards, Zlatko
 |
Hi,
To adjust your code you can use:
resultString += @"\" + Convert.ToString(input[i], 8);
Instead of:
resultString+=(char)(Convert.ToInt32(((int)c).ToString(),8));
Regards,
Alberto
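A minimal sketch of the encoding direction discussed in this exchange, under stated assumptions: the names PdfOctalWriter and EncodeOctalEscapes are hypothetical, only characters 128 to 255 are escaped, and the character code is assumed to equal the byte value expected by the PDF font encoding (which is not true in general). Characters above 255 are rejected because an octal escape holds at most three digits. The sketch does not escape backslashes or parentheses. Note also that C# has no octal escapes in string literals, so "\300" does not compile; a verbatim string such as @"\300" or a doubled backslash "\\300" is needed.

```csharp
using System;
using System.Text;

public static class PdfOctalWriter
{
    // Hypothetical helper: writes characters 128..255 as \ddd (three octal
    // digits) and copies characters below 128 unchanged.
    public static string EncodeOctalEscapes(string input)
    {
        var result = new StringBuilder();
        foreach (char c in input)
        {
            if (c >= 128 && c <= 255)
            {
                result.Append('\\');
                // Convert.ToString(int, 8) yields the octal digits of the code.
                result.Append(Convert.ToString((int)c, 8).PadLeft(3, '0'));
            }
            else if (c > 255)
            {
                // Assumption: one byte per character; anything larger is refused.
                throw new ArgumentException(
                    "Character does not fit in one byte: U+" + ((int)c).ToString("X4"));
            }
            else
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

With this sketch, PdfOctalWriter.EncodeOctalEscapes("ÇÇ") should return the eight characters \307\307, because Ç has code 199 in Latin-1.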
 |
It works only with English text; it does not work with Bengali, Hindi, Chinese, or any other language.
RAM
 |
There aren't any spaces between the words - the text all runs together.
This also doesn't work for PDFs from a URL, so if that's anybody's goal, don't waste your time.
-Tom
 |
Hi, after playing with this example I found that it needs bool result = pdfParser.ExtractText("sample.pdf","output.pdf"); rather than String result = pdfParser.ExtractText(pdfFile); When I enter an input and an output filename, it reads in (or at least does not complain about reading in) the example.pdf and outputs a PDF that cannot be read by Acrobat. Any comments? I'm new at this, so I might be missing something obvious. Thank you.
 |
There is Text Mining Tool, a freeware program that can convert PDF, DOC, RTF, CHM and HTML files to text (extracting text) without needing any other programs such as Word or Acrobat to be installed.
One of its most important features is a simple, user-friendly interface with hotkeys. It also includes the console tool minetext.exe, which can be helpful for developers or system administrators. The tool is based on the .NET 2.0 Framework, which should be installed from microsoft.com if you do not have it.
 |
Actually, the tool is based on .NET 2.0, IKVM and PDFBox (look at the DLL files it uses). It seems to work quite well. For my needs it's very useful. Thanks.
 |
I like the spirit of this article, but even with the correction posted by http://www.kilon.co.uk, this code is not as good as the original posted by Dan Letecky (http://www.codeproject.com/cs/samples/pdf2text.asp). Notably, spacing is not handled correctly as spaces will appear in the middle of words that appear whole in the PDF.
Has anybody fixed this problem with the code? Or does anybody have an alternative solution that uses iTextSharp or another 100% .NET library?
Thanks in advance.
-Craig
 |
Hi,
I've been playing with this sample and found that it only pulled certain text from my PDF document. I tracked this down to the CheckToken(string[] tokens, char[] recent) method of the PDFParser class: I was getting an index out of bounds error. To resolve this, simply replace the following lines:
foreach(string token in tokens)
{
    if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
with
foreach(string token in tokens)
{
    if (token.Length > 1) // Otherwise the "If" fails
    {
        if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
Obviously, add the matching end brackets.
 This resolved my problem of only being able to read two pages out of nine!
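As an alternative to skipping short tokens, the check can be written so that it works for tokens of any length. The following is only a sketch and not the article's CheckToken: the names TokenCheck and EndsWithToken are hypothetical, and it assumes the intent is "the buffer ends with one of the tokens followed by one whitespace character". The rest of the original condition is not shown in this thread, so it is not reproduced here.

```csharp
using System;

public static class TokenCheck
{
    // Hypothetical stand-in for CheckToken: true when `recent` ends with one of
    // the tokens followed by a single whitespace character.
    public static bool EndsWithToken(string[] tokens, char[] recent)
    {
        if (recent.Length == 0 || !IsPdfWhitespace(recent[recent.Length - 1]))
        {
            return false;
        }
        foreach (string token in tokens)
        {
            // Index where the token would have to start in the buffer.
            int start = recent.Length - 1 - token.Length;
            if (token.Length == 0 || start < 0)
            {
                continue; // empty token, or token longer than the buffer
            }
            bool match = true;
            for (int j = 0; j < token.Length; j++)
            {
                if (recent[start + j] != token[j])
                {
                    match = false;
                    break;
                }
            }
            if (match)
            {
                return true;
            }
        }
        return false;
    }

    private static bool IsPdfWhitespace(char c)
    {
        return c == ' ' || c == '\r' || c == '\n';
    }
}
```

The difference from the token.Length > 1 guard is that a one-character token such as "'" is still matched here instead of being silently ignored; whether that matters depends on the PDFs being parsed.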
 |
The code works great for document text, but I am unable to pull footer information.
Does anyone have a solution for this?
 |
I tried the PDF of the English edition and it works, but with the Complex-Chinese edition the extracted text file does not contain the correct Chinese characters I need! Please offer some advice or good news. Thanks a lot!
 |
I have a .Net web application based on C#. Can this code be used in my web app? If so, what do I need to do to port it to the web?
Thanks!
 |
CheckToken is called in a few places to check whether a particular sequence of characters exists at the end of the previousCharacters array. CheckToken always checks two characters, token[0] and token[1], but there is a line in ExtractTextFromPDFBytes: "if (CheckToken(new string[] {"'", "T*", "\""}, previousCharacters))". Note that "'" is a single-character string, so checking token[1] in CheckToken will give an index out of range error.
I don't understand what checking "'" is doing. How do I work around this?