Hi guys,

I am doing a university project based on "Extracting Information from PDF".

My idea is to:

1. Search for and find the correct PDF using the text I input.
E.g. there are lots of jumbled PDFs on the hard disk and I want to select the PDFs about "Artificial Intelligence".

2. My search query is also "Artificial Intelligence", and I also need to extract the content about artificial intelligence from inside the PDF.

3. Finally, the content relevant to my input query will be displayed in the interface.

Can anyone help me sort this out, including coding help?

Would hOOt (the full text search engine) help me with indexing?

I look forward to your replies.

Student
Updated 18-Apr-13 3:52am
Comments
Jochen Arndt 16-Apr-13 6:16am    
What you are describing is a full text search engine limited to PDF files. You may look for an Open Source engine that fits your needs or provides sources that can be used as starting point for your own implementation.

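You will need to create a full text search over the PDF files.
I have added the code below; I hope you will understand it. Note that SQL Server can only index PDF content if a PDF IFilter is installed on the server, and the FileType column must hold the file extension (for example '.pdf').
At step 4, you need to run the query with your own search text.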

SQL
Step 1: Create Full Text Catalog
EXEC sp_fulltext_database 'enable'
GO

IF NOT EXISTS ( SELECT * FROM sys.fulltext_catalogs
            WHERE name = 'Ducuments_Catalog' )
BEGIN
    EXEC sp_fulltext_catalog 'Ducuments_Catalog' , 'create' ;
END

GO


Step 2: Create a Table

CREATE TABLE [dbo].[T_Document](
    [ID] [bigint] IDENTITY(1,1) NOT NULL,
    [FileName] [varchar](100) NULL,
    [FileType] [varchar](50) NULL,
    [Content] [varbinary](max) NULL,
CONSTRAINT [PK_T_Document] PRIMARY KEY CLUSTERED
(
    [ID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

Step 3:  Create Full Text Index on Table Columns

CREATE FULLTEXT INDEX ON [dbo].[T_Document]
(
    [Content] TYPE COLUMN [FileType] LANGUAGE 1033,
    [FileName] LANGUAGE 1033
)
KEY INDEX PK_T_Document
ON Ducuments_Catalog WITH CHANGE_TRACKING AUTO;
GO


Step 4: Run the Query


SELECT * FROM T_Document WHERE FREETEXT (Content,'Borrower Name')

SELECT * FROM T_Document WHERE CONTAINS (Content,'"Borrower Name"')
 
Comments
Asa code 22-Apr-13 13:29pm    
Can hOOt help me find the location of the PDF? Does it not extract the content that matches the search query?

Can someone help?
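Whichever engine locates the file, pulling out the matching passage can be done separately once the PDF's plain text is available. Below is a minimal sketch in Python, assuming the text has already been extracted into a string; `find_snippets` and its `context` parameter are hypothetical names chosen for illustration and are not part of hOOt.

```python
def find_snippets(text: str, query: str, context: int = 40) -> list[str]:
    """Return a window of text around each case-insensitive match of query."""
    snippets = []
    if not query:
        return snippets
    lowered, needle = text.lower(), query.lower()
    start = lowered.find(needle)
    while start != -1:
        # Keep `context` characters on each side of the match
        begin = max(0, start - context)
        end = min(len(text), start + len(needle) + context)
        snippets.append(text[begin:end].strip())
        # Continue searching after the current match
        start = lowered.find(needle, start + len(needle))
    return snippets
```

Each returned string could be shown in the interface next to the file name it came from.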
First off, no one here is going to write your university project for you, and neither should they. It's supposed to be about what you've learned and what you can do.

However, here are some pointers.

You're going to need code that can read a PDF and parse the format to get the plain text separated from all the other stuff that's in a PDF.
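As a rough illustration of that first step, here is a sketch in Python. It assumes the third-party pypdf package is installed (an assumption, not something the thread prescribes), and `clean_text` and `extract_pdf_text` are hypothetical helper names.

```python
import re


def clean_text(raw: str) -> str:
    """Collapse the runs of whitespace that PDF extraction tends to leave behind."""
    return re.sub(r"\s+", " ", raw).strip()


def extract_pdf_text(path: str) -> list[str]:
    """Return the cleaned text of each page of the PDF at `path`.

    Assumes pypdf is installed (pip install pypdf). Scanned PDFs hold
    images rather than text and would need OCR instead.
    """
    from pypdf import PdfReader  # imported lazily so clean_text works without it

    reader = PdfReader(path)
    # Pages without a text layer may yield an empty string
    return [clean_text(page.extract_text() or "") for page in reader.pages]
```

Keeping the text per page makes it possible to report which page a match was found on later.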

All searching these days is based on some form of indexing. It sounds like you want full text search, so you'll want to investigate the kind of techniques that Google uses. The usual data structure for this is an inverted index, which maps each word to the documents that contain it; I gather Google keeps its data in something called a BigTable. I can't imagine why :-)
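To make the idea of indexing concrete, here is a minimal sketch of an inverted index in Python. It assumes the text of each PDF has already been extracted into a string; `tokenize`, `build_index` and `search` are illustrative names, and real engines add ranking, stemming and much more.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map every term to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for term in tokenize(text):
            index[term].add(name)
    return dict(index)


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

For example, `search(build_index({"ai.pdf": "Artificial intelligence"}), "artificial intelligence")` looks up each word once instead of rescanning every file.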

You need to consider how to build the index (all at once or incrementally), how to keep it up to date when PDFs are added and deleted, or whether to throw it away and start again for each search.
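One way to picture the incremental option is an index that remembers which terms each document contributed, so a deleted or changed PDF can be taken out again. This is only a sketch in Python under that assumption; `IncrementalIndex` is a hypothetical class, and it splits on whitespace for brevity.

```python
from collections import defaultdict


class IncrementalIndex:
    """Inverted index that supports adding and removing single documents."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> names of documents containing it
        self.doc_terms = {}               # document name -> terms it contributed

    def add(self, name: str, text: str) -> None:
        self.remove(name)  # re-adding a document replaces its old entry
        terms = set(text.lower().split())
        self.doc_terms[name] = terms
        for term in terms:
            self.postings[term].add(name)

    def remove(self, name: str) -> None:
        for term in self.doc_terms.pop(name, set()):
            self.postings[term].discard(name)
            if not self.postings[term]:
                del self.postings[term]  # drop terms no document uses any more

    def lookup(self, term: str) -> set:
        return set(self.postings.get(term.lower(), set()))
```

Rebuilding from scratch is simpler to get right; the incremental version pays off once the collection is large enough that a full rebuild is slow.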

You need to decide what to allow in the user-entered query: a single word with no punctuation, multiple words, an exact phrase for matching, search codes like +intelligence -Einstein, or even full regular expressions.
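As an illustration of one possible query syntax, the sketch below parses +required and -excluded words plus double-quoted exact phrases. The syntax, the `Query` class and `parse_query` are assumptions made for this example, not a standard.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Query:
    required: list = field(default_factory=list)  # +word
    excluded: list = field(default_factory=list)  # -word
    optional: list = field(default_factory=list)  # plain word
    phrases: list = field(default_factory=list)   # "exact phrase"


# Either a double-quoted phrase or a run of non-space characters
_TOKEN = re.compile(r'"([^"]+)"|(\S+)')


def parse_query(raw: str) -> Query:
    """Split a raw query string into required, excluded, optional and phrase parts."""
    query = Query()
    for phrase, word in _TOKEN.findall(raw):
        if phrase:
            query.phrases.append(phrase.lower())
        elif word.startswith("+") and len(word) > 1:
            query.required.append(word[1:].lower())
        elif word.startswith("-") and len(word) > 1:
            query.excluded.append(word[1:].lower())
        else:
            query.optional.append(word.lower())
    return query
```

Deciding the syntax first keeps the search code simple, because it only ever sees the four lists.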

Once you have all of this 'specification' stuff nailed down, and all the low-level technologies like actually reading a PDF file working in test cases, then you're ready to write your application and also to write it up properly.

I don't know what it's like where you are, but in my day most of the credit for university programming projects was for the write-up. As long as the program worked they didn't dig much deeper into usability criteria or quality of the source code but skipped straight to the documents.

If you get stuck with the code parts then by all means post more questions; CP is a great source.
 
 
Comments
Asa code 16-Apr-13 7:17am    
I understood that "no one here is going to write your university project for you and neither should they."

To extract the correct PDF from the database, I have worked out this algorithm:

// Extract the 1-itemsets (candidates) from the whole XML dataset belonging to a category AI
Forall categories AI in C do begin
    Forall pages t in AI do begin
        Forall terms x in t do
            X = { x | x.freq ≥ minfreq }
    End
    L1 = { x | x.support ≥ minsup }
    // Association rule extraction algorithm to discover termsets
    For ( k = 2; L(k-1) ≠ ∅; k++ ) do begin
        Ck = apriori-gen(L(k-1)); // generate new candidates using the Apriori algorithm
        Forall pages t in AI do begin
            Ct = subset(Ck, t); // candidates contained in t
            Forall candidates c ∈ Ct do c.count++;
        End
        Lk = { c ∈ Ck | c.count ≥ minsup }
    End
    Answer = ∪k Lk; // set of rules R
    // Repeat the above for all categories
End
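The pseudocode above could be turned into working code along the following lines. This is a sketch in Python for a single category, assuming each page has already been reduced to the set of terms that pass the minfreq filter; `apriori` and its parameters are illustrative names, and `minsup` is taken as an absolute page count.

```python
from itertools import combinations


def apriori(pages: list[set[str]], minsup: int) -> dict[frozenset, int]:
    """Return every termset found in at least `minsup` pages, with its page count."""
    # L1: count single terms and keep the frequent ones
    counts = {}
    for page in pages:
        for term in page:
            key = frozenset([term])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup}
    answer = dict(frequent)

    k = 2
    while frequent:
        # apriori-gen: join frequent (k-1)-sets and keep only the size-k unions
        # whose (k-1)-subsets are all frequent
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Count the candidates contained in each page
        counts = {c: 0 for c in candidates}
        for page in pages:
            for c in candidates:
                if c <= page:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= minsup}
        answer.update(frequent)
        k += 1
    return answer
```

Running it once per category and keeping the results separately corresponds to the outer loop over C.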
Matthew Faithfull 16-Apr-13 8:53am    
That's seriously hard to read but I get that you have worked out a logical algorithm for a part of the project. Have a go at turning that into working code in the language of your choice or whatever you're having to use and let us know if you get stuck along the way.
Just to note: if you want a topical/category or semantic search, then you are going to need to build an index by hand, Microsoft style. This is going to be very tricky, and you will need to write or get very good dictionary editing tools to help you.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


