5,699,997 members and growing! (12,246 online)
Email Password   helpLost your password?
Languages » C / C++ Language » General     Advanced

Simple Text Indexer Using SQLite Database

By Chulliyan

Simple Text Indexer Using SQLite Database
VC8.0, C++Windows, WinXP, Visual Studio, Dev

Posted: 14 Jun 2006
Updated: 14 Jun 2006
Views: 21,767
Bookmarked: 26 times
Announcements
Loading...



Search    
Advanced Search
Sitemap
9 votes for this Article.
Popularity: 3.74 Rating: 3.92 out of 5
0 votes, 0.0%
1
0 votes, 0.0%
2
3 votes, 33.3%
3
4 votes, 44.4%
4
2 votes, 22.2%
5
Note: This is an unedited contribution. If this article is inappropriate, needs attention or copies someone else's work without reference then please Report This Article

Introduction

Text Indexer Using SQLite

SQLite is a powerful, free, open source database. It has an excellent implementation of the B-Tree, which is fast and small. SQLite is not copyrighted and therefore you can is it freely in any commercial projects. This project uses the SQLite’s B-Tree extensively to index text files. Each text document is scanned and the content is tokenized to build a dictionary in the SQLite database. Later searches are conducted against the dictionary and related map tables to retrieve the file names containing the search string.

 

This article assumes the reader is well versed in C++ and data structures. Also, an in-depth understanding of Windows APIs is also necessary.

Why use B-Tree?

A well balanced B-Tree has several advantages over a Hashtable

  1. Faster to insert a new item
  2. It is fast and easy to search for ‘Nearest Hit’ items. This is very important in building a search dictionary. Searches can be done with partial string, for example to search for the word ‘where’, we can start with the string ‘whe’.

Once the first “Nearest Hit” node is obtained, it is easy to walk through the B-Tree list to obtain the matching nodes.

Design

The indexer consists of a set of B-Trees (Word Dictionary and MAP tables) and a collection of ‘observers’ looking for ‘user input’, ‘file changes’ and ‘system idle time’ to update or query the B-Trees.  Each observer runs in separate threads.

 

  1. ‘User Input’ observer reads user’s input and queries the indexer for matches
  2. ‘File Change’ observer waits for file change notifications from the observed directories and queue the change information in a list.
  3. ‘System Idle Time’ observer wakes up when the system is idling and process the queued file change information. This observer is responsible for indexing modified files.

File Change Observer

            This observer opens the directory names specified in the ‘settings.ini’ file (located in the same folder as ‘INDEXER.EXE’). Later it creates an IoCompletionPort and waits for file changes. ‘IoCompletionPort’ allows a single thread to observe several directories at the same time. It uses ‘ReadDirectoryChangesW’ API to retrieve the file changes from the system whenever a IoCompletionPort reports a completion status. All the new, modified or deleted files names are put in a file name list.

             Class Name: CFileObserver  (filechg.h and filechg.cpp)

System Idle Observer

                   This observer sleeps most of the time and whenever it wakes up it uses the ‘NtQuerySystemInformation’ API exposed by NTDLL.LL to calculate the CPU usage. If the CPU usage is less than 5% and there are file names in the file name list (queued up by the File Change Observer), it will de-queue the file names will parse and index the content of the files.

            Class Name: CIdleObserver  (observer.h and observer.cpp)

User Input Observer

            This observer waits for user inputs on ‘STDIN’ and queries the indexer to retrieve file names containing the input string. It runs in the main application thread and will continue to run until user presses CTRL+C.

             Class Name: CInputObserver  (observer.h and observer.cpp)

The Indexer

            This class indexes the file content. It uses the dictionary lookup algorithm to index and search words. There is only ONE dictionary in the sample code provided. If you need high performance lookup, you should create an array of dictionaries based on the hash value of a portion of the word (say, the first three characters of the word). In such a case the dictionary is a Hashtable of B-Trees and should provide fast lookups. See below for the schema design of the indexer.

  1. Dictionary (m_dict of CIndexer class)

            This table maps words to its address. An address is 64bit integer. Every query will lookup the address of the word from this table and later uses the address for all other table lookups.

 

Word1

Address1

Word2

Address2

Word3

Address3

 

  1. File Map Table (m_file of CIndexer class)

            This table maps file names to FILE IDs. Each FILE ID is a unique 64bit integer.

File name1

FILE ID1

File nam2

FILE ID2

File name3

FILE ID3

 

  1. File ID to File Name map table (m_fileId of CIndexer class)

            This table is a reverse lookup table for looking up a file name given the FILE ID.

 

FILE ID1

File name1

FILE ID2

File name2

FILE ID3

File name3

 

  1. WORD address to FILE ID map table (MAP table) (m_map of CIndexer class)

            This table maps a word address to a set of FILE IDs. Each word address entry is a primary key to a vector record containing the FILE IDs of all the files containing at least one occurrence of the word. 

Adress1

FILE ID1, FILE ID2

Address2

FILE ID1

Address3

FILE ID3

 

Indexing Algorithm:

  • Create an entry in the FILE ID table if the file name is new. Also, create an entry in the reverse lookup table.
  • Parse the content of the file by reading one line at a time and tokenizing the string. Each token is assumed to be a WORD and index the word.
  • Lookup the Dictionary to get the word’s address. If the word is not present in the dictionary, create an entry.
  • Lookup up the MAP table using the word’s address as the primary key to retrieve the vector of FILE IDs. Insert the new FILE ID into the vector and save it in the database.
  • Create a line number and column number entry in the Line number map table.

 

Searches Algorithm:

·       Get the nearest hit word addresses from Dictionary (For example, a search of ‘w’ may result in ‘where’, ‘while’, ‘which’ etc).

  • For every word address, lookup the MAP table to retrieve the FILE IDs
  • For every FILE IDs, lookup the FILE ID to File Name reverse lookup table to retrieve the file name.
  • Lookup the line number map table to obtain the line number and column number of the word in the file. 

Source Code

The source code is arranged as follows.

 Directory ‘./sqlite’ contains all the SQLite source code. I have picked only the relevant ‘C’ and ‘H’ files from the SQLite source tree and have modified some of the files to remove the extra dependencies. If you want to use the latest code from the SQLite site (http://www.sqlite.org), you will have to modify the code to make it work.

  • Directory ‘./indexer’ contains all the indexer code
  • All the project files are based on Visual Studio 2005
  • Binary file is created in ./bin/win32 directory
  • Settings file (settins.ini) is located in ./bin/win32 directory
  • The index file is created in %TEMP%/index directory

 

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here

About the Author

Chulliyan


I have been writing code for a living for last three hundred years! I have written code in almost all the languages- C/C++, JAVA, C#, VB, Pascal, Delphi, JScript and so on.
Occupation: Web Developer
Location: United States United States

Other popular C / C++ Language articles:

Article Top
Sign Up to vote for this article
You must Sign In to use this message board.
FAQ FAQ Noise ToleranceSearch Search Messages 
 Layout  Per page   
 Msgs 1 to 9 of 9 (Total in Forum: 9) (Refresh)FirstPrevNext
GeneralCombo usememberAnderson Luís16:38 19 Jun '06  
GeneralRe: Combo usememberdude1238:46 20 Jun '06  
GeneralHmmm ThanksmemberRandor3:20 15 Jun '06  
GeneralRe: Hmmm ThanksmemberNeville Franks10:20 15 Jun '06  
GeneralRe: Hmmm ThanksmemberQuynh Nguyen8:01 19 Jun '06  
GeneralRe: Hmmm Thanksmemberdude1239:32 19 Jun '06  
GeneralRe: Hmmm ThanksmemberNeville Franks10:45 19 Jun '06  
GeneralRe: Hmmm ThanksmemberRandor3:59 22 Jul '06  
GeneralRe: Hmmm ThanksmemberNeville Franks15:57 22 Jul '06  

General General    News News    Question Question    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

PermaLink | Privacy | Terms of Use
Last Updated: 14 Jun 2006
Editor:
Copyright 2006 by Chulliyan
Everything else Copyright © CodeProject, 1999-2008
Web15 | Advertise on the Code Project