Retrieving content from a PDF

Question

Hi guys,

I am doing a university project based on "Extracting Information from PDF".

My idea is to:

1. Search for and find the correct PDF using the text that I input.
For example, there are lots of unsorted PDFs on the hard disk and I want to select the PDFs about "Artificial Intelligence".

2. Using the same search query, "Artificial Intelligence", extract the content about artificial intelligence from inside the PDF.

3. Finally, display the content relevant to my input query in the interface.

Can anyone help me sort this out, including help with the code?

Would hOOt, the full text search engine, help me with indexing?

I look forward to your replies.

Student

Posted 15-Apr-13 23:52pm

Asa code

Updated 18-Apr-13 3:52am

Comments

Jochen Arndt 16-Apr-13 6:16am

What you are describing is a full text search engine limited to PDF files. You may look for an Open Source engine that fits your needs or provides sources that can be used as starting point for your own implementation.

2 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Hemant761 · Accepted Answer · 2013-04-16T00:15:00

You will need to set up full-text search over the PDF files.
I have added the code below; I hope it is clear.
At step 4, run the query with your own search text.

SQL

Step 1: Create Full Text Catalog
EXEC sp_fulltext_database 'enable'
GO

IF NOT EXISTS ( SELECT * FROM sys.fulltext_catalogs
            WHERE name = 'Ducuments_Catalog' )
BEGIN
    EXEC sp_fulltext_catalog 'Ducuments_Catalog' , 'create' ;
END

GO


Step 2: Create a Table

CREATE TABLE [dbo].[T_Document](
    [ID] [bigint] IDENTITY(1,1) NOT NULL,
    [FileName] [varchar](100) NULL,
    [FileType] [varchar](50) NULL,
    [Content] [varbinary](max) NULL,
CONSTRAINT [PK_T_Document] PRIMARY KEY CLUSTERED
(
    [ID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

Step 3:  Create Full Text Index on Table Columns

CREATE FULLTEXT INDEX ON [dbo].[T_Document]

(      Content Type Column FileType Language 1033,
        [FileName] language 1033
)
KEY INDEX PK_T_Document
ON Ducuments_Catalog WITH CHANGE_TRACKING AUTO;
GO


Step 4: Run the Query


SELECT * FROM T_Document WHERE FREETEXT (Content,'Borrower Name')

SELECT * FROM T_Document WHERE CONTAINS (Content,'"Borrower Name"')
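
The steps above do not show how the PDFs get into T_Document. Here is a minimal sketch of one way to load them from Python. It is not part of the original answer: it assumes the third-party pyodbc package, a connection string you supply yourself, and that the FileType column should hold the file extension (for example ".pdf") so the server can pick a filter for the binary content. As far as I know, SQL Server can only index PDF content if a PDF filter (IFilter) is installed on the server, so check that first. The names document_row and load_folder are made up for this example.

```python
from pathlib import Path

# Parameterised insert for the T_Document table created in step 2.
INSERT_SQL = (
    "INSERT INTO dbo.T_Document (FileName, FileType, Content) "
    "VALUES (?, ?, ?)"
)


def document_row(path):
    """Build the (FileName, FileType, Content) values for one file."""
    p = Path(path)
    # Assumption: the type column stores the lower-cased extension, e.g. ".pdf".
    return (p.name, p.suffix.lower(), p.read_bytes())


def load_folder(connection_string, folder):
    """Insert every *.pdf file directly inside `folder`; return the row count."""
    import pyodbc  # third-party; imported here so document_row works without it

    rows = [document_row(p) for p in sorted(Path(folder).glob("*.pdf"))]
    conn = pyodbc.connect(connection_string)
    try:
        cursor = conn.cursor()
        for row in rows:
            cursor.execute(INSERT_SQL, row)
        conn.commit()
    finally:
        conn.close()
    return len(rows)
```

Because the index in step 3 uses CHANGE_TRACKING AUTO, newly inserted rows should be picked up by the full-text index without a manual rebuild.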

Matthew Faithfull · Accepted Answer · 2013-04-16T00:10:00

First off, no one here is going to write your university project for you, and neither should they. It's supposed to be about what you've learned and what you can do.

However, here are some pointers.

You're going to need code that can read a PDF and parse the format to get the plain text separated from all the other stuff that's in a PDF.
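
To give an idea of what that step can look like, here is a small sketch. It assumes the third-party pypdf package, which is my choice for illustration and not something the question mentions; any PDF library that returns page text would do. The helper names extract_pdf_text and clean_text are invented for this example.

```python
import re


def extract_pdf_text(path):
    """Return the text of each page of a PDF as a list of strings."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(path)
    # Scanned, image-only pages may yield no text at all, hence the fallback.
    return [page.extract_text() or "" for page in reader.pages]


def clean_text(raw):
    """Tidy extracted text before it is indexed."""
    # Rejoin words split by a hyphen at a line break. This also merges
    # genuinely hyphenated words that happen to break across lines.
    joined = re.sub(r"-\s*\n\s*", "", raw)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", joined).strip()
```

Keeping the text per page makes it easier later to show the user where in the PDF a match was found.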

All searching these days is based on forms of indexing. It sounds like you're going to want full text search so you'll be wanting to investigate the kind of techniques that Google use. I gather their key data structure is called a BigTable. I can't imagine why :-)
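
Whatever Google does internally, the classic structure behind full text search is an inverted index: a map from each word to the documents (and positions) where it occurs. The sketch below is only an illustration using in-memory dictionaries; the function names and the simple lower-case tokenizer are my own assumptions, not a description of any particular engine.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lower-case alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """documents: dict of doc_id -> text. Returns term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search_all(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], {}))
    for term in terms[1:]:
        result &= set(index.get(term, {}))
    return result
```

Storing positions costs memory, but it is what makes exact phrase matching and showing the surrounding text possible later.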

You need to consider how to build the index (all at once or incrementally), how to keep it up to date when PDFs are added and deleted, or whether to throw it away and start again for each search.
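
One simple way to do the incremental option is to remember each file's last-modified time and compare it with a fresh scan. This is a sketch under that assumption (a content hash would be more robust), and scan_folder and diff_snapshots are names I made up for it.

```python
import os


def scan_folder(folder):
    """Map each PDF path under `folder` to its last-modified time."""
    snapshot = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if name.lower().endswith(".pdf"):
                path = os.path.join(root, name)
                snapshot[path] = os.path.getmtime(path)
    return snapshot


def diff_snapshots(old, new):
    """Compare two snapshots and return (to_index, to_remove).

    to_index: paths that are new or whose modification time changed.
    to_remove: paths that were indexed before but no longer exist.
    """
    to_index = sorted(p for p, mtime in new.items() if old.get(p) != mtime)
    to_remove = sorted(p for p in old if p not in new)
    return to_index, to_remove
```

You would save the snapshot alongside the index, and on the next run only re-read the files in to_index and drop the entries in to_remove.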

You need to decide what to allow in the user-entered query: a single word with no punctuation, multiple words, an exact phrase for matching, search codes like +intelligence -Einstein, or even full regular expressions.
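
As an illustration of the "search codes" option, here is a small parser sketch. The syntax it accepts (+word for required, -word for excluded, "quoted text" for an exact phrase) is just one possible design that I am assuming for the example, and the Query class and parse_query function are hypothetical names.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Query:
    required: list = field(default_factory=list)  # +word
    excluded: list = field(default_factory=list)  # -word
    optional: list = field(default_factory=list)  # plain word
    phrases: list = field(default_factory=list)   # "exact phrase"


# Either a quoted phrase, or a word with an optional leading + or -.
# The lookbehind stops the hyphen in "full-text" being read as an exclusion.
TOKEN = re.compile(r'"([^"]+)"|(?<!\w)([+-]?)(\w+)')


def parse_query(text):
    """Split a user query into required, excluded, optional terms and phrases."""
    query = Query()
    for phrase, sign, word in TOKEN.findall(text):
        if phrase:
            query.phrases.append(phrase.lower())
        elif sign == "+":
            query.required.append(word.lower())
        elif sign == "-":
            query.excluded.append(word.lower())
        else:
            query.optional.append(word.lower())
    return query
```

For example, parse_query('+intelligence -Einstein "neural network" robots') puts "intelligence" in required, "einstein" in excluded, "neural network" in phrases and "robots" in optional. Everything is lower-cased, so the index has to be built the same way.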

Once you have all of this 'specification' stuff nailed down, and all the low-level technologies (like actually reading a PDF file) working in test cases, then you're ready to write your application and also to write it up properly.

I don't know what it's like where you are but in my day most of the credit for university programming projects was for the write up. As long as the program worked they didn't dig much deeper into usability criteria or quality of the source code but skipped straight to the documents.

If you get stuck with the code parts then by all means post more questions; CP is a great resource.
