According to the PDF documentation, non-ASCII characters are coded as an octal value after an escape character (\). To adjust the code, create a string (called octalCode, for example) outside the block and add the following lines:
if (nextLiteral && c >= '0' && c <= '7')
{
    // Octal escape: this assumes exactly three octal digits follow the backslash.
    octalCode = ((char)input[i]).ToString() + ((char)input[++i]).ToString() + ((char)input[++i]).ToString();
    c = (char)Convert.ToInt32(octalCode, 8);
}
if (((c >= ' ') && (c <= '~')) || ((c >= 128) && (c < 255)))
{
    resultString += c.ToString();
}
nextLiteral = false;
It works for me.
Regards.
Alberto
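The PDF specification allows an octal escape to have one, two, or three digits, each in the range 0-7, whereas the snippet above always reads exactly three. A minimal sketch of a decoder that accepts all three lengths follows. PdfLiteralSketch and DecodeOctalEscapes are hypothetical names, not part of the article's parser, and mapping each decoded byte straight to the same code point is an assumption that only holds for Latin-1-like font encodings.

```csharp
using System;
using System.Text;

static class PdfLiteralSketch
{
    // Hypothetical helper, not part of the article's PDFParser class.
    // Decodes \ddd octal escapes (one to three digits, each 0-7) in the
    // body of a PDF literal string. Other escape pairs are copied as-is.
    public static string DecodeOctalEscapes(string input)
    {
        var result = new StringBuilder();
        int i = 0;
        while (i < input.Length)
        {
            char c = input[i];
            if (c == '\\' && i + 1 < input.Length)
            {
                if (IsOctalDigit(input[i + 1]))
                {
                    int value = 0;
                    int digits = 0;
                    i++; // skip the backslash
                    while (digits < 3 && i < input.Length && IsOctalDigit(input[i]))
                    {
                        value = value * 8 + (input[i] - '0');
                        i++;
                        digits++;
                    }
                    // Assumption: the byte value equals the code point (Latin-1 style).
                    result.Append((char)(value & 0xFF));
                }
                else
                {
                    // Not an octal escape: keep the pair untouched so an
                    // escaped backslash is never mistaken for a new escape.
                    result.Append(c).Append(input[i + 1]);
                    i += 2;
                }
            }
            else
            {
                result.Append(c);
                i++;
            }
        }
        return result.ToString();
    }

    private static bool IsOctalDigit(char c) => c >= '0' && c <= '7';
}
```

For example, DecodeOctalEscapes(@"\53") and DecodeOctalEscapes(@"\053") both yield "+", while @"\0053" yields the control character with code 5 followed by the digit 3.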
 |
Hi,
Every special character (such as Ç) must be converted to an octal string. As an illustration only (I don't have the correct codes at hand): to write a "Ç" I must obtain its character code and then convert that value to octal. A character with code 192 converts to octal 300, so to put that character in a PDF file you must write "\300" in the file.
If you need to write "ÇÇ" then you must write "\300\300".
Hope this helps.
Alberto
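To make the arithmetic concrete, here is a small sketch of the decimal-to-octal conversion. In Latin-1 (and in PDF's WinAnsiEncoding) Ç has code 199, which is octal 307; code 192, octal 300, is À. OctalCodeSketch and ToOctal are illustrative names only, and which code a character really needs depends on the encoding of the font used in the PDF.

```csharp
using System;

static class OctalCodeSketch
{
    // Illustrative helper (not from the article): character code -> three-digit octal text.
    public static string ToOctal(int code) => Convert.ToString(code, 8).PadLeft(3, '0');

    public static void Main()
    {
        Console.WriteLine(ToOctal(192));               // 192 = 3*64 + 0*8 + 0 -> prints 300
        Console.WriteLine(ToOctal('\u00C7'));          // Ç = 199 = 3*64 + 0*8 + 7 -> prints 307
        Console.WriteLine(Convert.ToInt32("307", 8));  // back to decimal -> prints 199
    }
}
```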
 |
Hi, thanks for the fast response.
When I try to write "\300\300" in a string I get a syntax error.
I tried to use this method to convert the string, but I also get an error.
public static String convertToOctal(String input)
{
    String resultString="";
    char c;
    for(int i=0;i<input.Length;i++)
    {
        c = (char)input[i];
        if (c > 500)
        {
            resultString+=(char)(Convert.ToInt32(((int)c).ToString(),8));
        }
        else
        {
            resultString += c;
        }
    }
    return resultString;
}
What am I missing?
Thanks very much in advance.
Regards, Zlatko
 |
Hi,
To adjust your code you can use:
resultString += @"\" + Convert.ToString(input[i], 8);
Instead of:
resultString+=(char)(Convert.ToInt32(((int)c).ToString(),8));
Regards,
Alberto
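Putting the suggestion into a complete method might look like the sketch below. It is one possible reading of the fix, not the author's code: PdfEscapeSketch is a hypothetical class name, and the 128-255 range check assumes the PDF font uses a single-byte encoding whose codes match Unicode below 256 (true for Latin-1). Note also that the syntax error with "\300\300" comes from C# itself, which rejects \3 as an unrecognized escape; write @"\300\300" or "\\300\\300" instead.

```csharp
using System;
using System.Text;

static class PdfEscapeSketch
{
    // Hypothetical rewrite of convertToOctal from the question above.
    public static string ConvertToOctal(string input)
    {
        var result = new StringBuilder();
        foreach (char c in input)
        {
            if (c > 127 && c < 256)
            {
                // Codes 128-255 always produce three octal digits (200-377),
                // so no zero padding is needed in this range.
                result.Append('\\').Append(Convert.ToString((int)c, 8));
            }
            else
            {
                // ASCII is copied as-is. Characters above 255 cannot be
                // written as a single octal byte and are left unchanged here.
                result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

With this sketch, ConvertToOctal("\u00C7\u00C7") returns the eight characters \307\307.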
 |
It works only with English text; it does not work with Bengali, Hindi, Chinese, or any other language.
RAM
 |
There aren't any spaces between the words - the text all runs together.
This also doesn't work for PDFs from a URL, so if that's anybody's goal, don't waste your time.
-Tom
 |
Hi, after playing with this example I found that it needs bool result = pdfParser.ExtractText("sample.pdf","output.pdf"); rather than String result = pdfParser.ExtractText(pdfFile); When I enter an input and an output filename, it reads in (or at least does not complain about reading in) the example.pdf and outputs a PDF that cannot be read by Acrobat. Any comments? I'm new at this, so I might not be seeing something obvious. Thank you.
 |
There is Text Mining Tool, a freeware program that can convert PDF, DOC, RTF, CHM, and HTML files to text (extracting the text) without needing any other programs such as Word or Acrobat to be installed.
One of its most important features is a simple, user-friendly interface with hotkeys. It also includes the console tool minetext.exe, which can be helpful for developers or system administrators. The tool is based on the .NET 2.0 Framework, which should be installed from microsoft.com if you do not have it.
 |
Actually, the tool is based on .NET 2.0, IKVM and pdfbox. (look at the dll files it uses) It seems to work quite well. For my needs it's very useful. Thanks.
 |
I like the spirit of this article, but even with the correction posted by http://www.kilon.co.uk, this code is not as good as the original posted by Dan Letecky (http://www.codeproject.com/cs/samples/pdf2text.asp). Notably, spacing is not handled correctly as spaces will appear in the middle of words that appear whole in the PDF.
Has anybody fixed this problem with the code? Or does anybody have an alternative solution that uses iTextSharp or another 100% .NET library?
Thanks in advance.
-Craig
 |
Hi,
I've been playing with this sample and found that it only pulled certain text from my PDF document. I have since tracked this down to the CheckToken(string[] tokens, char[] recent) method of the PDFParser class: I was getting an index out of bounds error. To resolve this, replace the following lines:
foreach(string token in tokens)
{
    if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
with
foreach(string token in tokens)
{
    if (token.Length > 1) // Otherwise the "if" fails
    {
        if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
Obviously add the matching end brackets.
This resolved my problem of only being able to read two pages out of nine!
 |
The code works great for document text, but I am unable to pull footer information.
Does anyone have a solution for this?
 |
I tried the PDF of the English edition and it works, but for the Traditional Chinese edition the extracted text file does not contain the correct Chinese characters. Please offer some advice. Thanks a lot!
 |
I have a .Net web application based on C#. Can this code be used in my web app? If so, what do I need to do to port it to the web?
Thanks!
 |
CheckToken is called in a few places to check whether a particular sequence of characters exists at the end of the previousCharacters array. CheckToken always checks two characters, token[0] and token[1], but there is a line in ExtractTextFromPDFBytes that reads "if (CheckToken(new string[] {"'", "T*", "\""}, previousCharacters))". Note that "'" is a single-character string, so checking token[1] in CheckToken will give an index out of range error.
I don't understand what checking "'" is doing. How do I work around this?
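For background, ' and " are PDF text-showing operators that move to the next line before drawing a string, and T* moves to the next line, which is presumably why the parser looks for all three together. One way to avoid the index error without skipping the one-character operators is to compare however many characters each token has. The sketch below is a hypothetical stand-in, not the article's CheckToken: TokenCheckSketch and EndsWithToken are made-up names, and it assumes the last slot of the recent-characters buffer holds the delimiter that follows the operator. The real method's other checks are not reproduced.

```csharp
using System;

static class TokenCheckSketch
{
    // Hypothetical, length-aware variant of a token check.
    // Returns true if the characters just before the final (delimiter)
    // character of 'recent' spell out one of the tokens.
    public static bool EndsWithToken(string[] tokens, char[] recent)
    {
        if (recent.Length == 0 || !IsDelimiter(recent[recent.Length - 1]))
        {
            return false;
        }
        foreach (string token in tokens)
        {
            // Position where the token would have to start.
            int start = recent.Length - 1 - token.Length;
            if (token.Length == 0 || start < 0)
            {
                continue; // token cannot fit in the buffer
            }
            bool match = true;
            for (int j = 0; j < token.Length; j++)
            {
                if (recent[start + j] != token[j])
                {
                    match = false;
                    break;
                }
            }
            if (match)
            {
                return true;
            }
        }
        return false;
    }

    // Assumption: space, CR and LF are the delimiters of interest.
    private static bool IsDelimiter(char c) => c == ' ' || c == '\r' || c == '\n';
}
```

Because the loop is driven by token.Length, a one-character token such as "'" never touches token[1].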
 |
Does anyone know why I would get the "Attempted to read past end of stream" error on some PDF files? Note that these PDFs are only one page long.
Thanks
 |
Hi,
I've noticed that your algorithm does not work with some PDFs: it does not extract any text. With pdfbox, it works!
Here is an example PDF: http://petoulachi.coldwire.net/datas/test.pdf
I'm starting to look for where the problem is.
 |
I found the "bug", but not the solution.
PDFs that have embedded font subsets refer to each character by a byte number, which you need to look up against the embedded font.
If you manage to figure out how to load the font and see which characters are embedded, you can replace them and parse correctly.
 |
I tried the demo command line tool. Looks good except that it didn't extract all the text from the document. I have a suspicion that part of the text was marked with a code to prevent it being extracted in addition to the global flag.