
Full Text Search Tool

By getmayukh

This is an implementation of a text search tool that can index and search text within a collection of documents. The code is written in C# and requires .NET Framework 1.1 to run. No third-party search tool is used; every bit of the algorithm is implemented from scratch.
C#, Windows, .NET, Visual Studio, Dev

Posted: 14 Sep 2006
Updated: 14 Sep 2006

Sample screenshot


Introduction

The code is capable of searching Office documents as well as ".txt", ".htm" and ".html" files, among others.
The code lets you specify the file patterns you want to include during indexing.
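
As a rough illustration of pattern-based file selection, here is a minimal sketch. It is not the article's IndexManager code: the FileCollector class and its Collect method are hypothetical names, and it is written in current C# (generics) rather than for .NET 1.1. It only looks at files directly inside the folder, matching the tool's non-recursive behaviour.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the article's IndexManager library.
public static class FileCollector
{
    // Returns the files directly inside 'folder' that match any of the patterns.
    // Sub folders are not visited (SearchOption.TopDirectoryOnly).
    public static List<string> Collect(string folder, params string[] patterns)
    {
        // The set avoids duplicates when two patterns match the same file
        // (on .NET Framework, "*.htm" can also match ".html" files).
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new List<string>();
        foreach (string pattern in patterns)
        {
            foreach (string path in Directory.GetFiles(folder, pattern, SearchOption.TopDirectoryOnly))
            {
                if (seen.Add(path)) result.Add(path);
            }
        }
        return result;
    }
}
```

A call such as FileCollector.Collect(folder, "*.txt", "*.htm", "*.doc") would then yield the list of files to hand to the indexer.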

Included is the IndexManager namespace, which is the core library of the project.

The Search Documents project is a demo project that shows how to use the IndexManager class.
In this example I have used the IndexManager class to build something like a desktop search; you could equally use it to build an intranet search engine, for example.

 

We begin by defining the data structures needed to store the document indexes. Note that we need a global index: the collection of unique words extracted from all the documents we intend to search.

We define a Document class that stores information about a document: its location, its name, and a hashtable “index” holding the indexed representation of the text contained in the document.

 

using System;
using System.Collections;

[Serializable]
public class Document
{
    private string _name = string.Empty;
    private string _filename = string.Empty;
    private string _applicationdirectory = string.Empty;
    private string _virtualdirectory = string.Empty;
    private Hashtable _index;   // indexed representation of the document text

    public Document(){}

    // Body not shown in this listing.
    public void SetContent(string sContent){}

    public Hashtable Index
    {
        get { return _index; }
        set { _index = value; }
    }

    public string Name
    {
        get { return _name; }
        set { _name = value; }
    }

    public string FileName
    {
        get { return _filename; }
        set { _filename = value; }
    }

    public string ApplicationDirectory
    {
        get { return _applicationdirectory; }
        set { _applicationdirectory = value; }
    }

    public string VirtualDirectory
    {
        get { return _virtualdirectory; }
        set { _virtualdirectory = value; }
    }
}

 

What follows is the definition of the Documents class, which is a collection of Document objects.

[Serializable]
public class Documents : CollectionBase
{
    public void Add(Document document)
    {
        List.Add(document);
    }

    public void Remove(Document document)
    {
        List.Remove(document);
    }

    // Typed indexer over the untyped list provided by CollectionBase.
    public Document this[int index]
    {
        get { return (Document) List[index]; }
        set { List[index] = value; }
    }
}


We need to parse the documents, extract the words from them, and build both our global word list and the index for each document.

We will now look at a word breaker implementation.

We could use the English_US word breaker implementation, which is free to use in most cases. Whether it is can be verified with the GetLicenseToUse call defined in the IWordBreaker interface we would create.

For more details on WordBreaker and WordSink interfaces you can refer to

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/indexsrv/html/ixrefint_18ac.asp

 

We begin by looking up the class IDs that we would use to define the COM interfaces for our word breaker.

Look for WBreakerClass under the path HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ContentIndex\Language in the registry. This class ID will be used to mark our WordBreaker COM interface.

Also open the file indexsrv.h in the location “../Program Files\Microsoft Visual Studio .NET 2003\Vc7\PlatformSDK\Include” and search for the class ID of the IWordBreaker interface. You may use it to mark the IWordBreaker interface.

 

[ComImport]
[Guid("D53552C8-77E3-101A-B552-08002B33B0E6")]
[InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]
public interface IWordBreaker
{
    ....
}

 

There is an excellent article at

http://sqljunkies.com/How%20To/B28D8759-C093-4A09-AF79-257B46C41D49.scuk that explains how to implement a WordBreaker for the search engine.

In the present case, we will not use the IWordBreaker and IWordSink interfaces; instead, we use a regular expression to achieve somewhat similar results.

Here is the BreakWords method of the WordCollector class:

 

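public void BreakWords(string sString)
{
    // The pattern is a capturing group, so Split also returns the separators
    // themselves as one-character entries; they are filtered out below.
    Regex regEx = new Regex("([ \\t{}():;., \n\r\\s*])");
    string[] strArray = regEx.Split(sString.ToLower());
    foreach (string str in strArray)
    {
        if (str == string.Empty) continue;

        // Skip one-character entries listed in regexpatterns
        // (a field defined elsewhere in the class).
        if (str.Length == 1)
        {
            bool found = false;
            foreach (string c in regexpatterns)
            {
                if (str == c)
                {
                    found = true;
                    break;
                }
            }
            if (found) continue;
        }

        if (fetchunique)
        {
            // Unique mode: add each non-stop word only once.
            if (!arrCombined.Contains(str) && !StopWords.ContainsKey(str))
                arrCombined.Add(str);
        }
        else
        {
            // Otherwise keep every occurrence of a non-stop word.
            if (!StopWords.ContainsKey(str))
                arrCombined.Add(str);
        }
    }
}
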

You can improve the above method further by filtering the special characters from the list of words/characters retrieved this way.
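
One possible way to do that filtering is sketched below. It is an assumption-laden illustration, not the article's code: SimpleTokenizer and Tokenize are hypothetical names, it uses current C# (generics), and it treats every run of characters that is neither a letter nor a digit as a separator, which may be stricter than you want (for example, it splits "e-mail" into two words).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical alternative to BreakWords; not part of the article's library.
public static class SimpleTokenizer
{
    // Any run of characters that is neither a letter nor a decimal digit separates words.
    // There is no capturing group, so Split does not return the separators.
    private static readonly Regex Separator = new Regex(@"[^\p{L}\p{Nd}]+");

    public static List<string> Tokenize(string text, ICollection<string> stopWords, bool unique)
    {
        var words = new List<string>();
        var seen = new HashSet<string>();
        foreach (string token in Separator.Split(text.ToLowerInvariant()))
        {
            if (token.Length == 0) continue;                // leading/trailing separator
            if (stopWords.Contains(token)) continue;        // drop stop words
            if (unique && !seen.Add(token)) continue;       // drop repeats in unique mode
            words.Add(token);
        }
        return words;
    }
}
```

For example, Tokenize("The quick, brown fox (the fox)!", new[] { "the" }, true) returns the three words quick, brown and fox.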

 

Stop words are a collection of words we want to exclude from both indexing and search. We instantiate the WordCollector class with them like this:

public WordCollector(string[] stopwords)
{
    // Fill the stop-word table only if it has not been created yet.
    if (StopWords == null)
    {
        StopWords = new Hashtable();
        double dummy = 0;   // placeholder value; only the keys matter here
        foreach (string word in stopwords)
        {
            AddWords(StopWords, word, dummy);
        }
    }
}

 

 

We now write a method called LoadDocument, which calls the WordCollector to break a text into words and returns the result in an ArrayList.

private ArrayList LoadDocument(string sString, ref ArrayList ar, bool uniquewords)
{
    WordProcessor.WordCollector wc = new IndexManager.WordProcessor.WordCollector(stopwordslist);
    wc.fetchunique = uniquewords;
    wc.BreakWords(sString);
    // Merge this document's words into the caller's combined list.
    ar = Combine(ar, wc.arrCombined);
    return wc.arrCombined;
}

 

The above word splitter is slow when the documents being parsed are huge; in that case an implementation based on IWordBreaker and IWordSink can help.

 

Finally, we are ready to populate the data structures with the indexing information for the documents.

We maintain a global collection of words in a hashtable, which is used to generate the hash for each document we process. The hash for each document is called its “Index”.

For each input document, we parse the document to collect its set of unique words and add them to the global word list. Once all the documents have been processed, the global list of unique words is used to calculate the hash for each document.

We then go back to each document and compare its word list with the global word list to populate the hash table associated with that document.
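
The two passes described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the IndexManager implementation: IndexBuilder and its methods are hypothetical names, generic dictionaries stand in for the Hashtable used in the project, and the weight stored per word is a plain occurrence count (the project's actual term weighting may differ).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch of the two indexing passes; not the article's code.
public static class IndexBuilder
{
    // Pass 1: give every distinct word a position in the global word list.
    public static Dictionary<string, int> BuildGlobalWordList(IEnumerable<List<string>> documents)
    {
        var positions = new Dictionary<string, int>();
        foreach (List<string> words in documents)
            foreach (string word in words)
                if (!positions.ContainsKey(word))
                    positions.Add(word, positions.Count);
        return positions;
    }

    // Pass 2: map "position in global list" -> "occurrences in this document".
    // Raw counts are an assumption; any term-weighting scheme could be used instead.
    public static Dictionary<int, double> BuildDocumentIndex(
        List<string> words, Dictionary<string, int> globalWords)
    {
        var index = new Dictionary<int, double>();
        foreach (string word in words)
        {
            int position;
            if (!globalWords.TryGetValue(word, out position)) continue;
            double count;
            index.TryGetValue(position, out count);   // count stays 0 when absent
            index[position] = count + 1;
        }
        return index;
    }
}
```

Storing only the words that actually occur keeps each document's index sparse, which matters because the global word list grows with every document added.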

 

You need to be familiar with the concepts of term weighting to fully understand the implementation of the algorithm.

You may refer to the article “Building a Vector Space Search Engine” at

http://www.perl.com/pub/a/2003/02/19/engine.html for the details.

 

What you will find in my sample project is an implementation of the “Vector Space Search Engine” in C#.
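
In a vector space search engine, a query and a document are typically compared by the cosine of the angle between their term vectors. The sketch below shows that calculation on sparse vectors. It is illustrative only: VectorMath and Cosine are hypothetical names, the vectors are assumed to map a word position to a weight, and the sample project may rank results differently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper showing cosine similarity on sparse vectors.
public static class VectorMath
{
    // Returns a value between 0 (no shared terms) and 1 (same direction)
    // for vectors with non-negative weights.
    public static double Cosine(IDictionary<int, double> a, IDictionary<int, double> b)
    {
        double dot = 0;
        foreach (KeyValuePair<int, double> entry in a)
        {
            double other;
            if (b.TryGetValue(entry.Key, out other))
                dot += entry.Value * other;       // only shared terms contribute
        }
        double norm = Math.Sqrt(SumOfSquares(a)) * Math.Sqrt(SumOfSquares(b));
        return norm == 0 ? 0 : dot / norm;        // an empty vector matches nothing
    }

    private static double SumOfSquares(IDictionary<int, double> v)
    {
        double sum = 0;
        foreach (double weight in v.Values) sum += weight * weight;
        return sum;
    }
}
```

A search would compute this value between the query vector and each document's index and list the documents in descending order of similarity.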

 

Once the documents are indexed, we serialize the Documents collection object and persist it in the local file system. When you issue a search, this persisted object is deserialized and the vector operations are performed on the indexes (which are essentially hash tables).

 

 

You would also find the usage of IFilter within this project which is used to filter and parse the text from documents.

You may refer to the article http://www.codeproject.com/csharp/IFilter.asp for a detailed understanding and usage of the IFilter.

 

I have also included a simple implementation of the directory selector. To enhance the performance, it loads only up to depth 1 initially and loads the rest as required. This will be used to select the directory which you want to index and later search in.
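
The load-on-demand idea can be sketched independently of any UI control. The DirectoryNode class below is a hypothetical illustration, not the directory selector shipped with the project: each node reads its immediate sub folders only the first time they are requested.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical lazy directory tree node; not the article's directory selector.
public class DirectoryNode
{
    private List<DirectoryNode> _children;   // null until first requested

    public string FullPath { get; private set; }

    public DirectoryNode(string fullPath) { FullPath = fullPath; }

    // True once the sub folders of this node have been read from disk.
    public bool IsLoaded { get { return _children != null; } }

    public IList<DirectoryNode> Children
    {
        get
        {
            if (_children == null)
            {
                _children = new List<DirectoryNode>();
                try
                {
                    // Reads one level only; grandchildren load when they are expanded.
                    foreach (string dir in Directory.GetDirectories(FullPath))
                        _children.Add(new DirectoryNode(dir));
                }
                catch (UnauthorizedAccessException)
                {
                    // Inaccessible folders are shown as having no children.
                }
            }
            return _children;
        }
    }
}
```

In a tree view, the Children property would be read when the user expands a node, so opening the dialog costs one directory listing instead of a full scan of the drive.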

 

The current implementation does not support recursive searching within directories. Indexing is asynchronous, however, which prevents the program from choking in case the user inadvertently selects a folder with too many files.
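
A minimal way to keep indexing off the UI thread is to run it on a background thread and raise an event when it finishes. The sketch below assumes exactly that and nothing more: BackgroundIndexer, its constructor argument and its Completed event are hypothetical names, and the project's own asynchronous mechanism may be built differently.

```csharp
using System;
using System.Threading;

// Hypothetical background indexer; not the article's implementation.
public class BackgroundIndexer
{
    private readonly Action<string> _indexFile;   // per-file work supplied by the caller

    // Raised on the worker thread with the number of files processed.
    public event Action<int> Completed;

    public BackgroundIndexer(Action<string> indexFile)
    {
        _indexFile = indexFile;
    }

    // Starts indexing and returns immediately; the caller may Join the thread if needed.
    public Thread Start(string[] files)
    {
        var worker = new Thread(() =>
        {
            int done = 0;
            foreach (string file in files)
            {
                _indexFile(file);
                done++;
            }
            Action<int> handler = Completed;
            if (handler != null) handler(done);
        });
        worker.IsBackground = true;   // do not keep the process alive on exit
        worker.Start();
        return worker;
    }
}
```

In a Windows Forms application the Completed handler runs on the worker thread, so any update to a control has to be marshalled back with Control.Invoke or Control.BeginInvoke.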

 

Steps to use:

1. Choose "File -> Search In" to specify the folder containing the files to index and search. (Sub folders will not be included in indexing.)

2. Choose "Tools -> Index" to index all the files in the folder specified above.

3. Finally, we are ready to search.

 

Author

Mayukh Dutta

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

getmayukh



Occupation: Web Developer
Location: United States
