|
Note: This is an unedited contribution. If this article is inappropriate,
needs attention or copies someone else's work without reference then please
Report This Article
Introduction
Text Indexer Using SQLite
SQLite is a powerful, free, open source database. It has an excellent implementation of the B-Tree, which is fast and small. SQLite is not copyrighted and therefore you can is it freely in any commercial projects. This project uses the SQLite’s B-Tree extensively to index text files. Each text document is scanned and the content is tokenized to build a dictionary in the SQLite database. Later searches are conducted against the dictionary and related map tables to retrieve the file names containing the search string.
This article assumes the reader is well versed in C++ and data structures. Also, an in-depth understanding of Windows APIs is also necessary.
Why use B-Tree?
A well balanced B-Tree has several advantages over a Hashtable
- Faster to insert a new item
- It is fast and easy to search for ‘Nearest Hit’ items. This is very important in building a search dictionary. Searches can be done with partial string, for example to search for the word ‘where’, we can start with the string ‘whe’.
Once the first “Nearest Hit” node is obtained, it is easy to walk through the B-Tree list to obtain the matching nodes.
Design
The indexer consists of a set of B-Trees (Word Dictionary and MAP tables) and a collection of ‘observers’ looking for ‘user input’, ‘file changes’ and ‘system idle time’ to update or query the B-Trees. Each observer runs in separate threads.
- ‘User Input’ observer reads user’s input and queries the indexer for matches
- ‘File Change’ observer waits for file change notifications from the observed directories and queue the change information in a list.
- ‘System Idle Time’ observer wakes up when the system is idling and process the queued file change information. This observer is responsible for indexing modified files.
File Change Observer
This observer opens the directory names specified in the ‘settings.ini’ file (located in the same folder as ‘INDEXER.EXE’). Later it creates an IoCompletionPort and waits for file changes. ‘IoCompletionPort’ allows a single thread to observe several directories at the same time. It uses ‘ReadDirectoryChangesW’ API to retrieve the file changes from the system whenever a IoCompletionPort reports a completion status. All the new, modified or deleted files names are put in a file name list.
Class Name: CFileObserver (filechg.h and filechg.cpp)
System Idle Observer
This observer sleeps most of the time and whenever it wakes up it uses the ‘NtQuerySystemInformation’ API exposed by NTDLL.LL to calculate the CPU usage. If the CPU usage is less than 5% and there are file names in the file name list (queued up by the File Change Observer), it will de-queue the file names will parse and index the content of the files.
Class Name: CIdleObserver (observer.h and observer.cpp)
User Input Observer
This observer waits for user inputs on ‘STDIN’ and queries the indexer to retrieve file names containing the input string. It runs in the main application thread and will continue to run until user presses CTRL+C.
Class Name: CInputObserver (observer.h and observer.cpp)
The Indexer
This class indexes the file content. It uses the dictionary lookup algorithm to index and search words. There is only ONE dictionary in the sample code provided. If you need high performance lookup, you should create an array of dictionaries based on the hash value of a portion of the word (say, the first three characters of the word). In such a case the dictionary is a Hashtable of B-Trees and should provide fast lookups. See below for the schema design of the indexer.
- Dictionary (m_dict of CIndexer class)
This table maps words to its address. An address is 64bit integer. Every query will lookup the address of the word from this table and later uses the address for all other table lookups.
|
Word1 |
Address1 |
|
Word2 |
Address2 |
|
Word3 |
Address3 |
- File Map Table (m_file of CIndexer class)
This table maps file names to FILE IDs. Each FILE ID is a unique 64bit integer.
|
File name1 |
FILE ID1 |
|
File nam2 |
FILE ID2 |
|
File name3 |
FILE ID3 |
- File ID to File Name map table (m_fileId of CIndexer class)
This table is a reverse lookup table for looking up a file name given the FILE ID.
|
FILE ID1 |
File name1 |
|
FILE ID2 |
File name2 |
|
FILE ID3 |
File name3 |
- WORD address to FILE ID map table (MAP table) (m_map of CIndexer class)
This table maps a word address to a set of FILE IDs. Each word address entry is a primary key to a vector record containing the FILE IDs of all the files containing at least one occurrence of the word.
|
Adress1 |
FILE ID1, FILE ID2 |
|
Address2 |
FILE ID1 |
|
Address3 |
FILE ID3 |
Indexing Algorithm:
- Create an entry in the FILE ID table if the file name is new. Also, create an entry in the reverse lookup table.
- Parse the content of the file by reading one line at a time and tokenizing the string. Each token is assumed to be a WORD and index the word.
- Lookup the Dictionary to get the word’s address. If the word is not present in the dictionary, create an entry.
- Lookup up the MAP table using the word’s address as the primary key to retrieve the vector of FILE IDs. Insert the new FILE ID into the vector and save it in the database.
- Create a line number and column number entry in the Line number map table.
Searches Algorithm:
· Get the nearest hit word addresses from Dictionary (For example, a search of ‘w’ may result in ‘where’, ‘while’, ‘which’ etc).
- For every word address, lookup the MAP table to retrieve the FILE IDs
- For every FILE IDs, lookup the FILE ID to File Name reverse lookup table to retrieve the file name.
- Lookup the line number map table to obtain the line number and column number of the word in the file.
Source Code
The source code is arranged as follows.
Directory ‘./sqlite’ contains all the SQLite source code. I have picked only the relevant ‘C’ and ‘H’ files from the SQLite source tree and have modified some of the files to remove the extra dependencies. If you want to use the latest code from the SQLite site (http://www.sqlite.org), you will have to modify the code to make it work.
- Directory ‘./indexer’ contains all the indexer code
- All the project files are based on Visual Studio 2005
- Binary file is created in ./bin/win32 directory
- Settings file (settins.ini) is located in ./bin/win32 directory
- The index file is created in %TEMP%/index directory
| You must Sign In to use this message board. |
|
| | Msgs 1 to 9 of 9 (Total in Forum: 9) (Refresh) | FirstPrevNext |
|
 |
|
|
If I have one application that use SQL and full text retrieve, Can I use only your modified versione of SQLite?
Anderson Luís Oliveira e Silva nosredna@terra.com.br
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
You can use the FULL SQLite library if you want to. Keep in mind that if you compile SQLite into a DLL, the B-Tree APIs are not exported by default and that will result in linker errors. Therefore, you need to compile the full SQLite into a .LIB (or UNIX archive).
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
I just spent the last 2 days implementing an image cache utilizing a hash table lookup. Its blazing fast, however it has its limitations. When the cache file grows large in size the speed degrades. Ive looked around the net for a Btree implementation that I could use to make the cache lookups faster.
Looking through your code, I can see that SQLite is exactly what I needed.
I gave you a 4/5 because while your source code is good, I would like to see more information about your implementation.
|
| Sign In·View Thread·PermaLink | 1.00/5 (1 vote) |
|
|
|
 |
|
|
 |
|
|
Your file indexer is very good. Thank for your sharing.
However, I don't know why you chose modifying SQLite B-tree instead of using QDBM since QDBM provides all what you need. Moreover, because your library uses a modified version of another library - SQLite in this case, it might need more effort to update your library whenever the SQLite B-tree is updated for bug fixing. Please tell me if I'm missing something here. Thank you.
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
First of all QDBM is GNU copyrighted, that means you cannot use it in commercial products any way you want. Where as SQLite does not have any copyright! I modified only TWO files to simplify the compilation process. Here is list of changes made to SQLite
1. Removed 'sqlite3_step' function from vdbeapi.c 2. Removed 'sqlite3ValueFromExpr' function from vdbemem.c
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
QDBM is LGPL, not GPL which means you can use it in commercial applications.
SQLite stores its data in a Btree using overflow pages, which I've always thought a little strange.
If you don't need SQL and do need BTree's or Hashed lookup, QDBM is in IMO a better library to use and has a much smaller footprint than SQLite.
Neville Franks, Author of Surfulater www.surfulater.com "Save what you Surf" and ED for Windows www.getsoft.com
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
Hi Neville,
I didn't realize the conversation developed past your original reply or I would have replied sooner. Thanks for the heads up about QDBM, I was not aware of that project when you made your post. I was using GDBM back in the early 90's on the Unix platform. Its nice to see another implementation.
The day you posted your reply, I immediately researched QDBM. I came to the conclusion that QDBM is not suitable for our product due to its restrictive LGPL licence. I am simply not willing to release product source code in return for static linking. Nor am I willing to distribute a DLL, and LGPL licence in text form.
This morning, after reading the replies I decided to take a peek at your Surfulater product. After installing, I immediately noticed the SAIGDBM.dll library. Being curious, I dissassembled it and observed that it appeared to be a derivative work based on QDBM. I am not a lawyer, but it appears your product is in violation of the LGPL in many ways.
Best Regards, -Dave (Randor)
|
| Sign In·View Thread·PermaLink | |
|
|
|
 |
|
|
Hi David, I'm using an old version of QDBM that has one line changed to make a variable external for VC6. I notified the QDBM developer of this and assume that later versions of the library resolve this issue. Because of this I'm using a different DLL name to avoid any confusuion. You will see Surfulater listed on the QDBM Web site.
You will see lots of products with LGPL.
eg. http://quesa.designcommunity.com/reference/lgpl.html[^]
"The LGPL license allows you to link Quesa into your software, with no real obligation other than that any modifications you make to Quesa itself are contributed back."
And: http://www.jboss.com/company/licensing[^]
"The LGPL is the most business-friendly license for enterprise IT organizations, independent software vendors, and the open source community itself. It promotes software freedom without affecting the proprietary software that sits alongside or on top of it. The LGPL also ensures stability and consistency by disallowing proprietary forks to the core software."
|
| Sign In·View Thread·PermaLink | 2.60/5 (2 votes) |
|
|
|
 |
|
|
General News Question Answer Joke Rant Admin
|