Code to extract plain text from a PDF file

I am unable to locate the compiled zlib DLL files from zlib. Can you help me?

Hi,
I've been able to build and use debug, but the release build is crashing somewhere inside the "inflate" call. Anyone else have this issue?
I'm building a simple console app and am linking with the static version of zlib.
The problem goes away if I force my release exe to link with the debug version of zlib (although I need to ignore some default libraries at that point to get around some linker warnings).
Any tips?
Thanks in advance.

Hi y'all,
I have downloaded the code and tried to compile it.
Unfortunately, it didn't work :/

Do I need to add something to the code, or should it work without any changes?

Thanks a lot,
BR
Lukas

Which environment do you have? If it's Microsoft Visual Studio, ensure you have ZLIB_WINAPI defined in your ZLIB.H file.

Your code is very helpful; it helps me a lot!
But I have a question: our PDF documents also contain many images. How can I extract them from a PDF document? I want to encrypt a PDF document, so I have to read all the data from it and put it into a char array so that my algorithm can encrypt it.
So how can I extract all the data, including images, from a PDF document?

modified 20-May-22 21:01pm.

Lots of !!! for simplicity and portability!

Just noted that the text strings in the PDF text objects aren't ASCII (<128) as is assumed in the code. Changing the data type of each byte read from char to unsigned char throughout solved some issues with accented characters (å,ä,ö), but we would really need a more flexible treatment of the encodings. I think that the PDF spec says that we are to expect BOM markers for UTF-16, so modifying the code wouldn't be that difficult.
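A minimal sketch of both points, offered as an illustration and not as part of the article's code. `IsPrintableByte` and `HasUtf16BeBom` are hypothetical helper names, and the BOM check assumes the string bytes have already been unescaped. The first helper compares bytes as `unsigned char`, so values of 128 and above are not treated as negative on compilers where `char` is signed; the second tests for the UTF-16BE byte order mark (0xFE 0xFF) at the start of a string.

```cpp
#include <cstddef>

// Hypothetical helper: printable test done on unsigned bytes, so that
// accented characters (>= 128) are not rejected where char is signed.
bool IsPrintableByte(char c)
{
    unsigned char uc = static_cast<unsigned char>(c);
    return (uc >= ' ' && uc <= '~') || (uc >= 128 && uc < 255);
}

// Hypothetical helper: true if the string starts with the UTF-16BE
// byte order mark 0xFE 0xFF.
bool HasUtf16BeBom(const char* s, std::size_t len)
{
    return len >= 2
        && static_cast<unsigned char>(s[0]) == 0xFE
        && static_cast<unsigned char>(s[1]) == 0xFF;
}
```

A string that passes `HasUtf16BeBom` would still need a real UTF-16 to output-encoding conversion, which this sketch does not attempt.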

A note on efficiency:
FindStringInBuffer() would benefit greatly from using the unix memmem.c "non-naive" implementation.
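`memmem` is a GNU/BSD extension and is not available everywhere (for example, not in the Microsoft C runtime), so the sketch below swaps in a portable approximation and is not `memmem` itself: it uses `memchr` to jump to the next candidate first byte and `memcmp` to verify the match. `FindBytes` is a hypothetical name; it returns `(size_t)-1` when nothing is found.

```cpp
#include <cstddef>
#include <cstring>

// Sketch of a faster search helper: memchr skips ahead to the next
// occurrence of the first byte, memcmp checks the full match.
std::size_t FindBytes(const char* buffer, std::size_t buffersize, const char* search)
{
    const std::size_t notfound = static_cast<std::size_t>(-1);
    const std::size_t len = std::strlen(search);
    if (len == 0 || len > buffersize) return notfound;

    const char* p = buffer;
    const char* last = buffer + (buffersize - len);   // last possible match start
    while (p <= last)
    {
        // Look for the first byte of the pattern in [p, last].
        p = static_cast<const char*>(
            std::memchr(p, search[0], static_cast<std::size_t>(last - p) + 1));
        if (!p) return notfound;
        if (std::memcmp(p, search, len) == 0)
            return static_cast<std::size_t>(p - buffer);
        ++p;
    }
    return notfound;
}
```

How much this gains depends on the C library's `memchr` and on how often the first byte of the pattern occurs in the file.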

Hi All,
Can anybody send the C# source code for the same logic (PDF to text)?

Ashok.R

Does anyone know how to make a search engine in C that can search the folders on a computer for files of any type, such as PDFs, media files, and so on?

I was using your code example, and it failed when I used a different set of input files than before. After much research, I found that versions of PDF above 1.4 use the Tj and TJ commands to 'show text' for end-of-line, cases that your code does not detect. Once I get my version working, I will be glad to send you my changes to help improve your work. I appreciate the base work you did that got me started.

Did not work....
????

Guess the version is not right.

The function: FindStringInBuffer (buffer, stream, filelen); has a very BIG bug in and around it.

The function restarts searching the buffer at the beginning again with each call instead of sequentially through the file to find each stream (object)! This makes chunks of converted text appear many, many times throughout the output of a large file.

FindStringInBuffer must be used by starting the next search where it left off last.

I will gladly submit my fixes when I am done.
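One way to search sequentially, sketched under assumptions and not taken from the poster's fixes: pass a start offset to the search and resume after each "endstream". `FindFrom`, `StreamRange` and `FindAllStreams` are hypothetical names, and the ranges returned here still include the line break after "stream" and before "endstream", which the article's code trims separately.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: search [start, buffersize) and return an offset
// relative to the beginning of the buffer, or (size_t)-1 if not found.
std::size_t FindFrom(const char* buffer, std::size_t buffersize,
                     const char* search, std::size_t start)
{
    const std::size_t len = std::strlen(search);
    if (len == 0) return static_cast<std::size_t>(-1);
    for (std::size_t pos = start; pos + len <= buffersize; ++pos)
        if (std::memcmp(buffer + pos, search, len) == 0) return pos;
    return static_cast<std::size_t>(-1);
}

struct StreamRange { std::size_t begin, end; };   // [begin, end) of the raw data

// Collect every stream/endstream pair in file order, never rescanning
// the part of the buffer that was already handled.
std::vector<StreamRange> FindAllStreams(const char* buffer, std::size_t buffersize)
{
    const std::size_t notfound = static_cast<std::size_t>(-1);
    std::vector<StreamRange> ranges;
    std::size_t pos = 0;
    for (;;)
    {
        std::size_t s = FindFrom(buffer, buffersize, "stream", pos);
        if (s == notfound) break;
        std::size_t e = FindFrom(buffer, buffersize, "endstream", s + 6);
        if (e == notfound) break;
        ranges.push_back({ s + 6, e });   // data starts after "stream"
        pos = e + 9;                      // resume after "endstream"
    }
    return ranges;
}
```

Each range can then be handed to the inflate step in turn, which keeps the original buffer pointer and length unchanged.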

My company doesn't allow .zip files to be downloaded here, would it be possible to get the source files in .tar format?

I am quite interested in trying your code out.

Thanks in advance!

Good work !

In some PDF files TAB is used in place of SPACE.

You need these two lines:

&& (recent[oldchar-1]==' ' || recent[oldchar-1]=='\t' || recent[oldchar-1]==0x0d || recent[oldchar-1]==0x0a)
&& (recent[oldchar-4]==' ' || recent[oldchar-4]=='\t' || recent[oldchar-4]==0x0d || recent[oldchar-4]==0x0a)
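The same idea can be taken one step further by testing for every PDF white-space character (NUL, TAB, LF, FF, CR and SPACE) in one helper. This is only a sketch: `IsPdfWhitespace`, `Seen2Whitespace` and `kOldChar` are hypothetical names, with `kOldChar` standing in for the `oldchar` macro used by the code in this thread.

```cpp
// PDF white-space characters: NUL, TAB, LF, FF, CR and SPACE.
static bool IsPdfWhitespace(char c)
{
    return c == 0x00 || c == 0x09 || c == 0x0a
        || c == 0x0c || c == 0x0d || c == 0x20;
}

const int kOldChar = 15;   // stands in for the "oldchar" macro

// Variant of seen2() that accepts any PDF white-space character
// immediately before and after the two-character token.
bool Seen2Whitespace(const char* search, const char* recent)
{
    return recent[kOldChar - 3] == search[0]
        && recent[kOldChar - 2] == search[1]
        && IsPdfWhitespace(recent[kOldChar - 1])
        && IsPdfWhitespace(recent[kOldChar - 4]);
}
```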

x

Hi!

The following extension worked well for me with my Identity-H encoded files, but I have not tested it with other files, and I worked it out only by trial and error, not by reading the standards!

C++

//This file contains extremely crude C source code to extract plain text
//from a PDF file. It is only intended to show some of the basics involved
//in the process and by no means good enough for commercial use.
//But it can be easily modified to suit your purpose. Code is by no means
//warranted to be bug free or suitable for any purpose.
//
//Adobe has a web site that converts PDF files to text for free,
//so why would you need something like this? Several reasons:
//
//1) This code is entirely free including for commercial use. It only
//   requires ZLIB (from www.zlib.org) which is entirely free as well.
//
//2) This code tries to put tabs into appropriate places in the text,
//   which means that if your PDF file contains mostly one large table,
//   you can easily take the output of this program and directly read it
//   into Excel! Otherwise if you select and copy the text and paste it into
//   Excel there is no way to extract the various columns again.
//
//This code assumes that the PDF file has text objects compressed
//using FlateDecode (which seems to be standard).
//
//This code is free. Use it for any purpose.
//The author assumes no liability whatsoever for the use of this code.
//Use it at your own risk!


//PDF file strings (based on PDFReference15_v5.pdf from www.adobe.com):
//
//BT = Beginning of a text object, ET = end of a text object
//5 Ts = superscript
//-5 Ts = subscript
//Td move to start next line

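//Precompiled header; remove this line if your project does not use one:
#include "stdafx.h"

#include <stdio.h>
#include <string.h>   //strlen, memmove
#include <ctype.h>    //isdigit
#include <tchar.h>    //_tmain, _TCHAR
#include <windows.h>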

//Your project must also include zdll.lib (ZLIB) as a dependency.
//ZLIB can be freely downloaded from the internet, www.zlib.org
//Use 4 byte struct alignment in your project!

#include "zlib.h"

bool identity = false;

//Find a string in a buffer. Returns the offset of the first match,
//or (size_t)-1 if the string does not occur:
size_t FindStringInBuffer (const char* buffer, const char* search, size_t buffersize)
{
	size_t len = strlen(search);
	//Nothing to find, or the pattern is longer than the buffer:
	if (len == 0 || len > buffersize) return (size_t)-1;

	for (size_t pos=0; pos+len<=buffersize; pos++)
	{
		if (memcmp(buffer+pos, search, len)==0) return pos;
	}
	return (size_t)-1;
}

//Keep this many previous recent characters for back reference:
#define oldchar 15

//Convert a recent set of characters into a number if there is one.
//Otherwise return -1:
float ExtractNumber(const char* search, int lastcharoffset)
{
	int i = lastcharoffset;
	while (i>0 && search[i]==' ') i--;
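	//Cast to unsigned char: isdigit() is undefined for negative values.
	while (i>0 && (isdigit((unsigned char)search[i]) || search[i]=='.')) i--;
	float flt=-1.0;
	char buffer[oldchar+5]; ZeroMemory(buffer,sizeof(buffer));
	strncpy_s(buffer, search+i+1, lastcharoffset-i);
	//sscanf_s returns the number of fields converted (EOF on failure):
	if (buffer[0] && sscanf_s(buffer, "%f", &flt)==1)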
	{
		return flt;
	}
	return -1.0;
}

//Check if a certain 2 character token just came along (e.g. BT):
bool seen2(const char* search, char* recent)
{
if (    recent[oldchar-3]==search[0] 
     && recent[oldchar-2]==search[1] 
	 && (recent[oldchar-1]==' ' || recent[oldchar-1]==0x0d || recent[oldchar-1]==0x0a) 
	 && (recent[oldchar-4]==' ' || recent[oldchar-4]==0x0d || recent[oldchar-4]==0x0a)
	 )
	{
		return true;
	}
	return false;
}

//This method processes an uncompressed Adobe (text) object and extracts text.
void ProcessOutput(FILE* file, char* output, size_t len)
{
	//Are we currently inside a text object?
	bool intextobject = false;

	//Is the next character literal (e.g. \\ to get a \ character or \( to get ( ):
	bool nextliteral = false;
	
	//() Bracket nesting level. Text appears inside ()
	int rbdepth = 0;

	//Keep previous chars to extract numbers etc.:
	char oc[oldchar];

	char tb, te;

	int j=0;
	for (j=0; j<oldchar; j++) oc[j]=' ';

	if(identity)
	{
		tb='<';
		te='>';
	}
	else
	{
		tb='(';
		te=')';
	}

	j=1;
	for (size_t i=0; i<len; i+=j)
	{
		char c = output[i];
		if (intextobject)
		{
			if (rbdepth==0 && seen2("TD", oc))
			{
				//Positioning.
				//See if a new line has to start or just a tab:
				float num = ExtractNumber(oc,oldchar-5);
				if (num>1.0)
				{
					fputc(0x0d, file);
					fputc(0x0a, file);
				}
				if (num<1.0 && !identity)
				{
					fputc('\t', file);
				}
			}
			if (rbdepth==0 && seen2("ET", oc))
			{
				//End of a text object, also go to a new line.
				intextobject = false;
				fputc(0x0d, file);
				fputc(0x0a, file);
				//fputc(0x0d, file);
				//fputc(0x0a, file);
			}
			else if (c==tb && rbdepth==0 && !nextliteral) 
			{
				//Start outputting text!
				rbdepth=1;
				//See if a space or tab (>1000) is called for by looking
				//at the number in front of (
				if(!identity)
				{
					float num = ExtractNumber(oc,oldchar-1);
					if (num>0.0)
					{
						if (num>1000.0)
						{
							fputc('\t', file);
						}
						else if (num>100.0)
						{
							fputc(' ', file);
						}
					}
				}
			}
			else if (c==te && rbdepth==1 && !nextliteral) 
			{
				//Stop outputting text
				rbdepth=0;
				if(identity)
					j=1;
			}
			else if (rbdepth==1) 
			{
				if(identity)
				{
					char tmp = 0;   //stays 0 if a hex digit is not recognised
					j=4;
					if ((output[i+2]>='0') && (output[i+2]<='9'))
						tmp = (output[i+2]-0x30)<<4;
					else if ((output[i+2]>='a') && (output[i+2]<='f'))
						tmp = (output[i+2]-0x57)<<4;
					if ((output[i+3]>='0') && (output[i+3]<='9'))
						tmp += (output[i+3]-0x30);
					else if ((output[i+3]>='a') && (output[i+3]<='f'))
						tmp += (output[i+3]-0x57);

					fputc(tmp, file);
				}
				else
				{
					//Just a normal text character:
					if (c=='\\' && !nextliteral)
					{
						//Only print out next character no matter what. Do not interpret.
						nextliteral = true;
					}
					else
					{
						nextliteral = false;

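						//Compare the high range as unsigned char; a signed char is never >=128:
						if ( ((c>=' ') && (c<='~')) || (((unsigned char)c>=128) && ((unsigned char)c<255)) )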
						{
							fputc(c, file);
						}
					}
				}
			}
			
		}
		//Store the recent characters for when we have to go back for a number:
		//for (j=0; j<oldchar-1; j++) oc[j]=oc[j+1];
		memmove(oc, oc+1, oldchar-1);
		oc[oldchar-1]=c;
		if (!intextobject)
		{
			if (seen2("BT", oc))
			{
				//Start of a text object:
				intextobject = true;
			}
		}
	}
}

int _tmain(int argc, _TCHAR* argv[])
{
	FILE *fileo, *filei;

	//Discard existing output:
	fopen_s(&fileo, "C:\\pdf\\output.txt", "w");
	if (fileo) fclose(fileo);
	fopen_s(&fileo, "C:\\pdf\\output.txt", "a");

	//Open the PDF source file:
	fopen_s(&filei, "C:\\pdf\\somepdf.pdf", "rb");

	if (filei && fileo)
	{
		//Get the file length:
		int fseekres = fseek(filei,0, SEEK_END);   //fseek==0 if ok
		long filelen = ftell(filei);
		fseekres = fseek(filei,0, SEEK_SET);

		//Read the entire file into memory (!):
		char* buffer = new char [filelen]; ZeroMemory(buffer, filelen);
		size_t actualread = fread(buffer, filelen, 1 ,filei);  //must return 1

		bool morestreams = true;

		//Not found is reported as (size_t)-1, which is also >0, so test for it explicitly:
		if(FindStringInBuffer(buffer, "Identity-H", filelen) != (size_t)-1)
			identity = true;

		//Now search the buffer repeatedly for streams of data:
		while (morestreams)
		{
			//Search for stream, endstream. We ought to first check the filter
			//of the object to make sure it is FlateDecode, but skip that for now!
			size_t streamstart = FindStringInBuffer (buffer, "stream", filelen);
			size_t streamend   = FindStringInBuffer (buffer, "endstream", filelen);
			//FindStringInBuffer returns (size_t)-1 when nothing is found:
			if (streamstart!=(size_t)-1 && streamend!=(size_t)-1 && streamend>streamstart)
			{
				//Skip to beginning and end of the data stream:
				streamstart += 6;

				if (buffer[streamstart]==0x0d && buffer[streamstart+1]==0x0a) streamstart+=2;
				else if (buffer[streamstart]==0x0a) streamstart++;

				if (buffer[streamend-2]==0x0d && buffer[streamend-1]==0x0a) streamend-=2;
				else if (buffer[streamend-1]==0x0a) streamend--;

				//Assume output will fit into 50 times input buffer:
				size_t outsize = (streamend - streamstart)*50;
				char* output = new char [outsize]; ZeroMemory(output, outsize);

				//Now use zlib to inflate:
				z_stream zstrm; ZeroMemory(&zstrm, sizeof(zstrm));

				zstrm.avail_in = streamend - streamstart + 1;
				zstrm.avail_out = outsize;
				zstrm.next_in = (Bytef*)(buffer + streamstart);
				zstrm.next_out = (Bytef*)output;

				int rsti = inflateInit(&zstrm);
				if (rsti == Z_OK)
				{
					int rst2 = inflate (&zstrm, Z_FINISH);
					if (rst2 >= 0)
					{
						//Ok, got something, extract the text:
						size_t totout = zstrm.total_out;
						ProcessOutput(fileo, output, totout);
					}
					//Release zlib's internal state for this stream:
					inflateEnd(&zstrm);
				}
				delete[] output; output=0;
				buffer+= streamend + 7;
				filelen = filelen - (streamend+7);
			}
			else
			{
				morestreams = false;
			}
		}
		fclose(filei);
	}
	if (fileo) fclose(fileo);
	return 0;
}

FG

Can anybody send the header file for this code? I am getting the linker errors below...

Linking...
nafxcw.lib(thrdcore.obj) : error LNK2001: unresolved external symbol __endthreadex
nafxcw.lib(thrdcore.obj) : error LNK2001: unresolved external symbol __beginthreadex
Release/PDF2TEXT.exe : fatal error LNK1120: 2 unresolved externals

PDF2TEXT.exe - 3 error(s), 0 warning(s)

Thank you, it solved my trouble with PDFs, but I had to comment out the if statement that checks whether the character is between ' ' and '~' or between 128 and 255, because the condition was never true for these characters and so they were not output, and my language is full of these characters XD.

Thank you again.

Not working :-D

How can I change the text streams of a PDF file? I want to replace the streams with their translation into another language.

My MalwareBytes reports the executable within the setup.exe as a malicious threat and blocks it.

andy 15/11/12

1) The line "int num = ExtractNumber(oc,oldchar-1);" should be:
"float num = ExtractNumber(oc,oldchar-1);".
2) "//Assume output will fit into 10 times input buffer:" was wrong for some PDF files created by FreePDF 4.06 and by PDFCreator, where every letter was separately in brackets followed by a long number. That led to an empty result text, because the inflate procedure returned error -5.
Assuming output would fit into 100 times input buffer succeeded in those cases.
Gerald Schade

modified 1-Nov-12 6:49am.

Hi,
I want to convert a Word file to PDF format.
I need complete code for this.
Will you please help me?
I am using VS2010.
Thanks in advance.

Hi,
Excellent code, very helpful for developers who only need to extract text from a PDF file, but I think I found a little bug :laugh:

If the PDF file size is > 1 MB, kaboom!
While bug tracking and debugging, I found this.

C++

z_stream zstrm; ZeroMemory(&zstrm, sizeof(zstrm));
zstrm.avail_in = streamend - streamstart + 1;
zstrm.avail_out = outsize;
zstrm.next_in = (Bytef*)(buffer + streamstart);
zstrm.next_out = (Bytef*)output;
int rsti = inflateInit(&zstrm);
if (rsti == Z_OK)
{
  int rst2 = inflate (&zstrm, Z_FINISH);
  if (rst2 >= 0)
  {
    //Ok, got something, extract the text:
    size_t totout = zstrm.total_out;
    ProcessOutput(fileo, output, totout);
  }
  //If inflate() fails, print zlib's message (msg may be NULL):
  else
  {
    cout<<(zstrm.msg ? zstrm.msg : "no message")<<endl;
  }
}

It displays the error "incorrect header check".

Any advice?

Thanks in advance :thumbsup:
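zlib reports "incorrect header check" when the first two bytes it is given do not form a valid zlib header. One possible cause here (an assumption, since the failing file is not available) is a stream that is not FlateDecode-compressed, such as an embedded image; the code in this thread inflates every stream without looking at its filter. A minimal sketch of a pre-check, following the zlib stream format in RFC 1950; `LooksLikeZlibHeader` is a hypothetical helper name:

```cpp
#include <cstddef>

// Hypothetical pre-check: true if the first two bytes form a valid
// zlib (RFC 1950) header for a deflate stream.
bool LooksLikeZlibHeader(const unsigned char* data, std::size_t len)
{
    if (len < 2) return false;
    const unsigned cmf = data[0];
    const unsigned flg = data[1];
    if ((cmf & 0x0F) != 8) return false;    // compression method must be 8 (deflate)
    if ((cmf >> 4) > 7) return false;       // window size field above 7 is not allowed
    return ((cmf << 8) | flg) % 31 == 0;    // the two bytes must be a multiple of 31
}
```

Calling it on `buffer + streamstart` before `inflate` would let the program skip streams that are clearly not zlib data; it does not replace reading the object's `/Filter` entry.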


I got an error similar to this. I had to increase the size of the inflated stream, changing it to support up to 50 times the size of the input. Here is the modified code:

//Assume output will fit into 50 times input buffer:
size_t outsize = (streamend - streamstart)*50;
