PDF to Text

9 Oct 2007 | CPOL | 2 min read
Convert a PDF to text.

Introduction

When I looked for examples of how to extract text from a PDF, I didn't find much. There are a few, but they cost money. I found an example written in Java and converted it to VB.NET, with additions and different logic. The code in this application is far from complete; it will eventually be used in an automated process that uses a file watcher to extract text from PDFs and then formats the text for insertion into a SQL Server database. I hope someone finds this code, and any recommended changes or updates, useful.

Using the code

The code is easy to use. Both test functions are stored in the class ExtractPDF. The function that extracts the text requires a PDF file name and a password. The password can be Nothing, in which case it is ignored. If the PDF file is password protected, a valid password must be converted to a Byte array and then passed in. The project needs a reference to ItextSharp.dll. The source code files for ItextSharp.dll are also available.
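As a minimal sketch of the password step, assuming an ASCII-only password: the helper below returns Nothing when there is no password and the encoded bytes otherwise. PasswordToBytes and PasswordSketch are hypothetical names, not part of the article's download, and the name of the extraction function itself is not given in the article, so the call to it is left out.

```vb
Imports System
Imports System.Text

Module PasswordSketch
    ' Hypothetical helper: converts a password string to the Byte array
    ' that the extraction function expects. Returns Nothing when there is
    ' no password, which the article says is ignored.
    Function PasswordToBytes(ByVal password As String) As Byte()
        If String.IsNullOrEmpty(password) Then
            Return Nothing
        End If
        ' Assumption: the password contains only ASCII characters.
        Return Encoding.ASCII.GetBytes(password)
    End Function

    Sub Main()
        Dim Bytes() As Byte = PasswordToBytes("secret")
        Console.WriteLine(Bytes.Length) ' prints 6
    End Sub
End Module
```

The Byte array (or Nothing) would then be passed as the second argument, next to the PDF file name.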

The function contains two Select Case statements, one on the token type and a nested one on the token's string value, so that new options, formats, or anything else that turns up in a PDF file can be recognised and the appropriate action taken.

VB
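' Walk the tokens of the content stream one at a time
While Token.NextToken()
    Select Case Token.TokenType
        Case Token.TK_STRING
            ' A string token carries text to keep
            StrBuf.Append(Token.StringValue)
        Case Token.TK_OTHER
            ' Operators and other keywords: decide what to do with each
            Select Case Token.StringValue
                Case "ET"
                    ' ET ends a PDF text object, so start a new line
                    StrBuf.Append(vbCrLf)
            End Select
            ' Could add more here
    End Select
End While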

Update

I have updated the program and worked out why I was getting the cast error. Sometimes the object is returned as an array rather than as a single object. There is probably a smarter way to handle this in one loop, but I store the streams in an ArrayList and process them later:

VB
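' Collect every stream object so they can be processed later
Dim Stream As New ArrayList

If objectref.IsArray Then
    ' The reference is an array: resolve each element separately
    Dim Counter As Integer

    For Counter = 0 To objectref.ArrayList.Count - 1
        Stream.Add(Reader.GetPdfObject(objectref.ArrayList(Counter)))
    Next
Else
    ' A single reference: resolve it directly
    Stream.Add(Reader.GetPdfObject(objectref))
End If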
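One way the stored streams might be processed later is sketched below. It assumes an iTextSharp 4.x build, where PRTokeniser accepts a Byte array and exposes the TK_STRING and TK_OTHER constants (a commenter reports that the article's code does not work with itextsharp 5.4.0). TextFromStreams and StreamTextSketch are hypothetical names, and the token loop simply mirrors the one shown earlier in the article.

```vb
Imports System.Collections
Imports System.Text
Imports iTextSharp.text.pdf

Module StreamTextSketch
    ' Hypothetical helper, not from the article's download.
    ' Assumption: iTextSharp 4.x API (PRTokeniser built from a Byte array).
    Function TextFromStreams(ByVal streams As ArrayList) As String
        Dim StrBuf As New StringBuilder()

        For Each Item As Object In streams
            ' Skip anything that did not resolve to a content stream
            Dim ContentStream As PRStream = TryCast(Item, PRStream)
            If ContentStream Is Nothing Then Continue For

            ' Decode the stream's filters into the raw content operators
            Dim Bytes() As Byte = PdfReader.GetStreamBytes(ContentStream)
            Dim Token As New PRTokeniser(Bytes)

            While Token.NextToken()
                If Token.TokenType = PRTokeniser.TK_STRING Then
                    ' String tokens carry the text to keep
                    StrBuf.Append(Token.StringValue)
                ElseIf Token.TokenType = PRTokeniser.TK_OTHER _
                       AndAlso Token.StringValue = "ET" Then
                    ' ET ends a PDF text object, so start a new line
                    StrBuf.Append(vbCrLf)
                End If
            End While
        Next

        Return StrBuf.ToString()
    End Function
End Module
```

Keeping the collection step and the tokenising step separate means the array and single-object cases share one processing loop, at the cost of holding all stream objects in memory first.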

This code is far from complete, but I thought it might help a VB programmer out there, as the other examples I found were in C# (funnily enough, ItextSharp.dll itself is written entirely in C#). If anybody has any additions, they are welcome; please feel free to use the code.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).


Written By
South Africa

Comments and Discussions

 
General: My vote of 4 (roni.net, 19-Apr-16 0:47)
Question: Doesn't work with itextsharp5.4.0 (Nikolaj Strauss, 25-Mar-13 4:41)
General: My vote of 5 (Manoj Kumar Choubey, 9-Feb-12 22:13)
General: My vote of 5 (myaccountram, 14-Aug-10 9:20)
General: Help, problematic text extraction (Member 3364440, 27-Oct-08 20:09)
General: Re: Help, problematic text extraction (rhwiebe, 19-Nov-08 8:02)
General: Message Closed (Vitaliy Petrenko, 23-Nov-07 22:20)
General: Re: Free Text Mining Tool that can convert PDF files to text (rhwiebe, 19-Nov-08 8:06)
General: Will not read some files (Darcy J Williamson, 5-Nov-07 7:40)
I found this article excellent for reading some PDF files, but for a file produced by
Acrobat Distiller 4.05 for Windows
it returns nothing more than the info stream. It does read files produced by
PDFXC Library (version 2.5), or
Producer: Acrobat Distiller 7.0 (Windows), Creator: PScript5.dll Version 5.2.2
Do you have a suggested update for this?

Darcy Williamson

Answer: Re: Will not read some files (RG_SA, 5-Nov-07 18:59)
General: Re: Will not read some files (Darcy J Williamson, 5-Nov-07 19:34)
Question: Specified cast is not valid (coppocks, 9-Oct-07 7:59)
Answer: Re: Specified cast is not valid (RG_SA, 10-Oct-07 1:47)
