Dear all,
Could anybody please point me to a way to export data from Excel to PDF,
or from a DataGridView to PDF, with support for Unicode strings?
I have been struggling with this for several days and can't find a solution.
Please help me.
|
Very simple feedback: it does not work!
|
I couldn't extract text from a PDF. I downloaded the code and added the DLL file and the PDFSharper file to my project, but I didn't get any string in the output file.
Could anybody please help me?
Thanks in advance.
|
The iTextSharp.dll that is included with this project bombed when I ran the program on a test file that I created with Acrobat X. The latest version of iTextSharp works better. The program itself sort of works with PDF files created with ABBYY, but it does not interpret all the tokens correctly, which results in unwanted spaces within the text. While looking for an explanation of the tokens that are embedded in the stream, I came across http://www.java2s.com/Open-Source/CSharp/PDF/iTextSharp/iTextSharp/text/pdf/parser/Catalogparser.htm. It has source that compiles to a program that not only extracts the text, but also lists the dictionary and content stream. The only drawback is that you have to copy and paste 26 files into Visual Studio, since I have not been able to find a download link, but it does what I needed it to do and more.
|
This code is actually in the latest version of iTextSharp (5.1.3.0) and is much simpler to use than the code in this project (and works more reliably too). Here's a simple class that uses iTextSharp:
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public class PdfTextParser
{
    // Returns the text of a single page (page numbers start at 1).
    public string ExtractTextFromPDFPage(string pdfFile, int pageNumber)
    {
        PdfReader reader = new PdfReader(pdfFile);
        try
        {
            return PdfTextExtractor.GetTextFromPage(reader, pageNumber);
        }
        finally
        {
            // Release the file even if extraction throws.
            reader.Close();
        }
    }
}
|
Thank you very much for your code!
You saved my ass on my current project when PDFBox failed to extract the text!
I still ran into a problem reading some pages of some documents,
because in the code (method ExtractTextFromPDFBytes) you call:
if (CheckToken(new string[] {"'", "T*", "\""}, previousCharacters))
{
    resultString += "\n";
}
But CheckToken takes for granted that ALL tokens are at least 2 characters long:
if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
    (recent[_numberOfCharsToKeep - 2] == token[1]) &&
    ...
Checking whether the token is 1 character long or more solves the problem.
You need to replace the CheckToken method with this one:
private bool CheckToken(string[] tokens, char[] recent)
{
    foreach (string token in tokens)
    {
        if (token.Length > 1)
        {
            if ((recent[_numberOfCharsToKeep - 3] == token[0]) &&
                (recent[_numberOfCharsToKeep - 2] == token[1]) &&
                ((recent[_numberOfCharsToKeep - 1] == ' ') ||
                 (recent[_numberOfCharsToKeep - 1] == 0x0d) ||
                 (recent[_numberOfCharsToKeep - 1] == 0x0a)) &&
                ((recent[_numberOfCharsToKeep - 4] == ' ') ||
                 (recent[_numberOfCharsToKeep - 4] == 0x0d) ||
                 (recent[_numberOfCharsToKeep - 4] == 0x0a)))
            {
                return true;
            }
        }
        else
        {
            if ((recent[_numberOfCharsToKeep - 2] == token[0]) &&
                ((recent[_numberOfCharsToKeep - 1] == ' ') ||
                 (recent[_numberOfCharsToKeep - 1] == 0x0d) ||
                 (recent[_numberOfCharsToKeep - 1] == 0x0a)) &&
                ((recent[_numberOfCharsToKeep - 4] == ' ') ||
                 (recent[_numberOfCharsToKeep - 4] == 0x0d) ||
                 (recent[_numberOfCharsToKeep - 4] == 0x0a)))
            {
                return true;
            }
        }
    }
    return false;
}
|
Thanks for pointing this out.
The CheckToken method seems to look for [whitespace][token_chars][whitespace] at the end of recent.
So when tokens[i] contains 2 characters, the check covers positions -4 to -1:
(
    (recent[_numberOfCharsToKeep - 4] == ' ')
    || (recent[_numberOfCharsToKeep - 4] == 0x0d)
    || (recent[_numberOfCharsToKeep - 4] == 0x0a)
)
&& (recent[_numberOfCharsToKeep - 3] == token[0])
&& (recent[_numberOfCharsToKeep - 2] == token[1])
&& (
    (recent[_numberOfCharsToKeep - 1] == ' ')
    || (recent[_numberOfCharsToKeep - 1] == 0x0d)
    || (recent[_numberOfCharsToKeep - 1] == 0x0a)
)
But when tokens[i] contains 1 character, it should cover positions -3 to -1:
(
    (recent[_numberOfCharsToKeep - 3] == ' ')
    || (recent[_numberOfCharsToKeep - 3] == 0x0d)
    || (recent[_numberOfCharsToKeep - 3] == 0x0a)
)
&& (recent[_numberOfCharsToKeep - 2] == token[0])
&& (
    (recent[_numberOfCharsToKeep - 1] == ' ')
    || (recent[_numberOfCharsToKeep - 1] == 0x0d)
    || (recent[_numberOfCharsToKeep - 1] == 0x0a)
)
Right?
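If that reading is right, the two cases could be folded into one check that derives the offsets from the token length. This is only a rough sketch, not the article's code: the names TokenCheckDemo, IsPdfWhitespace and EndsWithToken are made up for illustration, and it assumes recent always has _numberOfCharsToKeep elements, so recent.Length stands in for that field.

```csharp
using System;

public static class TokenCheckDemo
{
    // The same three whitespace characters the original CheckToken tests for.
    private static bool IsPdfWhitespace(char c)
    {
        return c == ' ' || c == (char)0x0d || c == (char)0x0a;
    }

    // Hypothetical replacement for CheckToken: returns true when 'recent'
    // ends with [whitespace][token][whitespace], for a token of any length.
    public static bool EndsWithToken(string[] tokens, char[] recent)
    {
        foreach (string token in tokens)
        {
            // Index of the first token character; the last slot holds the trailing whitespace.
            int start = recent.Length - 1 - token.Length;
            if (token.Length == 0 || start < 1) continue;   // token does not fit in the buffer
            if (!IsPdfWhitespace(recent[recent.Length - 1])) continue;
            if (!IsPdfWhitespace(recent[start - 1])) continue;

            bool match = true;
            for (int k = 0; k < token.Length; k++)
            {
                if (recent[start + k] != token[k]) { match = false; break; }
            }
            if (match) return true;
        }
        return false;
    }

    public static void Main()
    {
        string[] newLineTokens = { "'", "T*", "\"" };
        Console.WriteLine(EndsWithToken(newLineTokens, ") ' ".ToCharArray()));    // one-character token
        Console.WriteLine(EndsWithToken(newLineTokens, "0 T*\n".ToCharArray()));  // two-character token
        Console.WriteLine(EndsWithToken(new string[] { "Tj" }, "aTj ".ToCharArray())); // no leading whitespace
    }
}
```

With this shape the leading whitespace is checked at -4 for a two-character token and at -3 for a one-character token, without a separate branch for each length.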
modified on Wednesday, June 22, 2011 1:47 PM
|
Hi all,
Is there any way to know where a new line occurs in the PDF file? I got the text from the PDF file, but I need to know where the lines end.
I tried checking whether
char c = (char) input [i];
is a newline character,
but without success.
Is there a way to do it?
BR,
Dejan
|
This solution does not work on scanned-image PDF files; the output is a blank text file.
|
The code segment provided does NOT match the pdfParser class supplied in the source code.
|
PDF stores ASCII strings (one byte per character) as-is in parentheses: (string). Unicode strings (two bytes per character), however, are stored as hex strings in angle brackets: <0073007400720069006e0067>. Every 4 hex digits make up one Unicode character.
I have added decoding of such hex strings to ExtractTextFromPDFBytes(), together with a new utility function, GetCharFromHex(), which converts 4 hex digits to a Unicode char. The result string is now built in a StringBuilder instead of a String to improve performance. I also included support for ANSI characters encoded as octal codes (thanks to MrVeloso), and removed the try/catch in ExtractTextFromPDFBytes() because it masks program/logic problems and can lead to text being lost without any warning.
#region GetCharFromHex
private char GetCharFromHex(char[] previousCharacters, int hexDigits)
{
    // Parse the last hexDigits characters of the buffer as a hexadecimal number.
    short code = Convert.ToInt16(new string(previousCharacters, previousCharacters.Length - hexDigits, hexDigits), 16);
    // A short final group is treated as if it were padded with trailing zeros.
    while (hexDigits < 4)
    {
        code <<= 4;
        hexDigits++;
    }
    return (char)code;
}
#endregion
#region ExtractTextFromPDFBytes
public string ExtractTextFromPDFBytes(byte[] input)
{
    if (input == null || input.Length == 0) return "";
    StringBuilder resultString = new StringBuilder(4096);
    bool inTextObject = false;
    bool nextLiteral = false;
    int bracketDepth = 0;
    bool inHexString = false;
    int hexDigits = 0;
    char[] previousCharacters = new char[_numberOfCharsToKeep];
    for (int j = 0; j < _numberOfCharsToKeep; j++) previousCharacters[j] = ' ';
    for (int i = 0; i < input.Length; i++)
    {
        char c = (char)input[i];
        if (inTextObject)
        {
            if (bracketDepth == 0 && !inHexString)
            {
                // Positioning and text-showing operators become line breaks or spaces.
                if (CheckToken(new string[] { "TD", "Td" }, previousCharacters))
                    resultString.Append("\n\r");
                else if (CheckToken(new string[] { "'", "T*", "\"" }, previousCharacters))
                    resultString.Append("\n");
                else if (CheckToken(new string[] { "Tj" }, previousCharacters))
                    resultString.Append(' ');
            }
            if (bracketDepth == 0 && !inHexString &&
                CheckToken(new string[] { "ET" }, previousCharacters))
            {
                // End of the text object.
                inTextObject = false;
                if (resultString.Length > 0 && resultString[resultString.Length - 1] != ' ')
                    resultString.Append(' ');
            }
            else if ((c == '(') && (bracketDepth == 0) && (!nextLiteral))
            {
                bracketDepth = 1;
            }
            else if ((c == ')') && (bracketDepth == 1) && (!nextLiteral))
            {
                bracketDepth = 0;
            }
            else if (c == '<' && !inHexString && bracketDepth == 0)
            {
                // Start of a hex string.
                inHexString = true;
                hexDigits = 0;
            }
            else if ((c == '<' || c == '>') && inHexString)
            {
                // End of a hex string: flush the digits collected so far.
                if (hexDigits > 0) resultString.Append(GetCharFromHex(previousCharacters, hexDigits));
                inHexString = false;
            }
            else if (bracketDepth == 1)
            {
                // Inside a literal (parenthesized) string.
                if (c == '\\' && !nextLiteral)
                {
                    nextLiteral = true;
                }
                else
                {
                    if (nextLiteral && c >= '0' && c <= '9')
                    {
                        // Escaped character given as a three-digit octal code.
                        char[] octalCode = { (char)input[i], (char)input[++i], (char)input[++i] };
                        c = (char)Convert.ToInt32(new string(octalCode), 8);
                    }
                    if (((c >= ' ') && (c <= '~')) ||
                        ((c >= 128) && (c < 255)))
                    {
                        resultString.Append(c);
                    }
                    nextLiteral = false;
                }
            }
            else if (inHexString)
            {
                // Every 4 collected hex digits produce one character.
                if (hexDigits == 4)
                {
                    resultString.Append(GetCharFromHex(previousCharacters, hexDigits));
                    hexDigits = 0;
                }
                hexDigits++;
            }
        }
        // Shift the current character into the look-behind buffer.
        for (int j = 0; j < _numberOfCharsToKeep - 1; j++)
        {
            previousCharacters[j] = previousCharacters[j + 1];
        }
        previousCharacters[_numberOfCharsToKeep - 1] = c;
        if (!inTextObject && CheckToken(new string[] { "BT" }, previousCharacters))
        {
            inTextObject = true;
        }
    }
    return resultString.ToString();
}
#endregion
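To see the hex-string rule on its own, outside the parser loop, here is a small standalone sketch. HexStringDemo and DecodeHexString are names invented for this example, and it assumes that every group of 4 hex digits is a UTF-16 code unit. That assumption will not hold for PDFs whose fonts use their own glyph codes, so treat it as an illustration rather than a general decoder.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class HexStringDemo
{
    // Hypothetical helper: decodes the contents of a hex string (without the
    // angle brackets), assuming each group of 4 hex digits is one UTF-16 code unit.
    public static string DecodeHexString(string hex)
    {
        // Drop any whitespace between the digits.
        StringBuilder digits = new StringBuilder(hex.Length);
        foreach (char ch in hex)
        {
            if (!char.IsWhiteSpace(ch)) digits.Append(ch);
        }

        // Pad a short final group with trailing zeros, mirroring the
        // left shift done in GetCharFromHex.
        while (digits.Length % 4 != 0) digits.Append('0');

        StringBuilder result = new StringBuilder(digits.Length / 4);
        for (int i = 0; i < digits.Length; i += 4)
        {
            ushort code = ushort.Parse(digits.ToString(i, 4),
                                       NumberStyles.HexNumber,
                                       CultureInfo.InvariantCulture);
            result.Append((char)code);
        }
        return result.ToString();
    }

    public static void Main()
    {
        // The example from the post: 0073 0074 0072 0069 006e 0067
        Console.WriteLine(DecodeHexString("0073007400720069006e0067")); // prints "string"
    }
}
```

Working on the whole string at once avoids depending on the size of the look-behind buffer, at the cost of needing the complete hex string up front.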
modified on Wednesday, September 29, 2010 12:47 PM
|
I can't get this to work with my PDF documents.
For example, a PDF with content like the one below returns very strange characters. Could this be a character set problem?
BT
308.8 521.8 Td /F1 10 Tf[<1110>7<10>-3<0E>4<12>]TJ
ET
Q
q 0 0 0 rg
BT
344.8 510.4 Td /F1 10 Tf[<2E>4<08>3<01>-3<08>3<16>1<04>2<0B17>-4<0C>3<04>2<18>-5<0C>-17<15>1<15>1<1B>-1<27>-2<02>9<2B>-5<1C>]TJ
ET
Q
q 0 0 0 rg
BT
92.9 489.8 Td /F1 14 Tf[<29>1<0C>-3<07>3<18>-1<04>-1<24>-4<04>-1<04>-1<18>-1<18>-1<06>3<15>-2<27>1<2F>1<02>-2<1B>]TJ
ET
Q
q 0 0 0 rg
BT
92.9 461.7 Td /F1 10 Tf[<0D>3<0E>4<0F>-4<10>-3<1112>]TJ
|
The project from this article cannot parse the TJ operator (not to be confused with Tj). Does it return anything for you?
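For context, the operand of TJ is an array that mixes strings with numeric position adjustments, as in [<1110>7<10>-3<0E>4<12>]TJ. A minimal sketch of pulling out just the string operands is below. TjArrayDemo and GetStringOperands are hypothetical names, not part of the article's project, and the sketch assumes literal strings contain no nested or escaped parentheses. The extracted codes are still font-specific, so they would need the font's own character mapping before they read as text.

```csharp
using System;
using System.Collections.Generic;

public static class TjArrayDemo
{
    // Hypothetical helper: returns the string operands of a TJ array,
    // delimiters included, and skips the numbers between them.
    public static List<string> GetStringOperands(string tjArray)
    {
        List<string> operands = new List<string>();
        int i = 0;
        while (i < tjArray.Length)
        {
            char open = tjArray[i];
            if (open == '<' || open == '(')
            {
                char close = (open == '<') ? '>' : ')';
                int end = tjArray.IndexOf(close, i + 1);
                if (end < 0) break;                       // unterminated string: stop
                operands.Add(tjArray.Substring(i, end - i + 1));
                i = end + 1;
            }
            else
            {
                i++;                                      // bracket, digit, sign or space
            }
        }
        return operands;
    }

    public static void Main()
    {
        List<string> operands = GetStringOperands("[<1110>7<10>-3<0E>4<12>]");
        Console.WriteLine(string.Join(" ", operands.ToArray())); // prints "<1110> <10> <0E> <12>"
    }
}
```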
|
It returns very strange characters that look like Hebrew or something similar. Do you know how the code should be modified so that it works with these types of PDFs as well?
|
How can I get the correct position (X, Y in pixels) of each 'word'
or 'char' of the extracted text?
|
This sample has a problem extracting text from a PDF when the file contains the character " (ASCII 34).
Any suggestions?
|
String result = pdfParser.ExtractText (path);
This line is totally wrong: your method pdfParser.ExtractText returns a boolean, yet the example writes the result out with Console.WriteLine.
It doesn't work for me.
|
Yes, it is a mistake. It is actually just:
pdfParser.ExtractText(path, "path to output text file");
|
In your example, result is a string, but pdfParser returns a bool. On top of that, it populated a txt file with rubbish and no text at all (the PDF file does have legitimate text in it). Otherwise this would have been great. Oh well.
|
This is a really useful article. My only point is that this code works only for text PDFs. What about scanned-image PDFs? How can we tell whether a PDF is scanned or text?
|
I tried to port this console application to the web, but I can't.
I changed the path that is passed when the application is called, but at the outFile step (the StreamWriter in PDFParser.cs) it always returns false and I don't understand why.
Can someone help?