I have a task on my hands.

Basically, I have to create a simple search engine that goes through a group of text documents and records, for each word in the document collection, all the documents that contain that word.

The search engine must accept a search query (a set of keywords) and identify each document that contains all or some of the keywords.

It should then print the document names in descending order of the number of keywords found, so the document that contains all the keywords appears at the top of the list.

I'm struggling with the pseudocode, let alone the program for it.
Updated 15-Apr-15 14:00pm
Comments
Nelek 15-Apr-15 17:15pm    
Don't think we can read minds or do astral projection to see your monitor. If you need help, the least you could do is add some relevant code to your question or explain your problem in a way that the users of CP can understand. Otherwise, nobody will be able to help you.

You just gave a list of requirements and said "I am stuck with the pseudocode". OK, perfect... where? Why? What have you tried?
Sascha Lefèvre 15-Apr-15 18:13pm    
If your task is not to develop this yourself but just to get a working solution, then go for Lucene.
Member 11610671 16-Apr-15 7:09am    
Apologies. To be honest, I'm pretty weak at programming and I'm not too sure what to do. To sum it up, the user enters a search query of one or more words and finds out whether those keywords exist in the document or documents.

For example, the pseudocode might be:

> define a class Result with variables int count and string filename
> make an ArrayList or other collection to add Results to
> get list of file names from directory
> get list of keywords from user
> for each file in file names do:
>> set count to 0
>> for each keyword do:
>>> search the file for the keyword
>>> if found: count++
>> end
>> if count > 0: add Result(count, filename) to list
> end
> sort list by count, descending
> print list
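
The pseudocode above could be sketched in Java roughly as follows. This is only a minimal sketch under some assumptions: the class name `KeywordSearch`, the method `rank`, and the idea of passing the documents in as a map from file name to text are all made up for illustration (in a real program the text would be read from the files in the directory), and keywords are matched as whole words, ignoring case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class KeywordSearch {

    // One entry per document that matched at least one keyword.
    static final class Result {
        final String filename;
        final int count;

        Result(String filename, int count) {
            this.filename = filename;
            this.count = count;
        }
    }

    // documents maps a file name to its full text (a hypothetical input shape;
    // the text could be loaded from disk before calling this method).
    static List<Result> rank(Map<String, String> documents, List<String> keywords) {
        List<Result> results = new ArrayList<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            int count = 0; // reset for every document
            for (String keyword : keywords) {
                // Whole-word, case-insensitive match; Pattern.quote escapes regex metacharacters.
                Pattern p = Pattern.compile(
                        "\\b" + Pattern.quote(keyword) + "\\b", Pattern.CASE_INSENSITIVE);
                if (p.matcher(doc.getValue()).find()) {
                    count++;
                }
            }
            if (count > 0) {
                results.add(new Result(doc.getKey(), count));
            }
        }
        // Documents containing the most keywords come first; ties keep input order.
        results.sort(Comparator.comparingInt((Result r) -> r.count).reversed());
        return results;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a.txt", "the quick brown fox");
        docs.put("b.txt", "a lazy dog");
        docs.put("c.txt", "the fox and the dog");
        for (Result r : rank(docs, Arrays.asList("fox", "dog"))) {
            System.out.println(r.filename + " " + r.count);
        }
    }
}
```

With the sample data in `main`, c.txt is listed first because it contains both keywords. Note that this scans every document for every query; the assignment's "record for each word all documents that contain it" suggests building a word-to-documents map once up front instead, which this sketch does not do.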
 
 
I don't know Java, but in C# I'd read the whole file with System.IO.File.ReadAllText(String) and then use a regular expression.
I definitely would not use IndexOf: it matches substrings, so a search for "the" also hits "other", and that leads to false positives.

For example:

C#
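// Create the expression from the provided terms.
// The alternatives share one group so that \b applies to each of them;
// written as \b(a)|(the)|(this)\b, only "a" and "this" would get a boundary check.
System.Text.RegularExpressions.Regex reg = 
  new System.Text.RegularExpressions.Regex
  ( @"(?i)\b(a|the|this)\b" ) ;

// args [ 0 ] is the text to search, e.g. the content read from the file
System.Text.RegularExpressions.MatchCollection mat = reg.Matches ( args [ 0 ] ) ;
          
// Number of whole-word matches found
System.Console.WriteLine ( mat.Count ) ;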
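
Since the question seems to be about Java, the same idea might look roughly like this there. It is a sketch, not a tested port of the C# above: the class name `WholeWordMatch` and the helpers `wholeWords` and `countMatches` are invented for illustration. It also shows why a plain substring search gives a false positive.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class WholeWordMatch {

    // Builds a case-insensitive pattern that matches any of the terms as a whole word.
    static Pattern wholeWords(List<String> terms) {
        String alternatives = terms.stream()
                .map(Pattern::quote) // escape regex metacharacters in user input
                .collect(Collectors.joining("|"));
        // The non-capturing group makes both \b anchors apply to every alternative.
        return Pattern.compile("\\b(?:" + alternatives + ")\\b", Pattern.CASE_INSENSITIVE);
    }

    // Counts how many times the pattern occurs in the text.
    static int countMatches(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Another theme: this is a test of the idea.";

        // Substring search: reports a hit because "the" occurs inside "Another".
        System.out.println(text.indexOf("the") >= 0);

        // Whole-word search: counts only "this", "a" and "the" as separate words.
        Pattern p = wholeWords(Arrays.asList("a", "the", "this"));
        System.out.println(countMatches(p, text));
    }
}
```

Here the substring check prints true even before the real word "the" is reached, while the whole-word pattern counts 3 matches and ignores "Another", "theme" and "idea".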
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


