Click here to Skip to main content
15,880,796 members
Articles / Programming Languages / Visual Basic
Article

Advanced text searching

Rate me:
Please Sign up or sign in to vote.
4.40/5 (30 votes)
4 Mar 2005Public Domain5 min read 164.7K   575   69   17
Building a simple query language with the OR and AND boolean operators

Introduction

When dealing with large amounts of data (either a website or in other specialized programs) it is very good to offer the user the possibility to narrow down the subset of data he needs to look at. The only way to avoid this and still not confuse the user is to partition the data in small chunks (such as presenting an address book by the first letter of each name). This isn't always possible so a search system can be helpful.

Description of the solution

I've developed a class which implements a simple query language. The syntax is presented below by some elementary examples and their interpretation:

abc cde fgh The character sequences "abc", "cde" and "fgh" must be all present in the string, however their order doesn't matter (so cdeabcfgh is a valid sequence for this rule). Every blank character is considered a separator (actually I use the Char.IsSeparator method to determine if a character is separator)
"abc cde""" The sequence abc cde" must be present in the string. Observe that in order to include the quote character ("), you must double it. Besides of grouping words the quote character is also useful in inserting special characters ( (, ), [, ]).
[a b c] The string must contain at least one of the elements (OR).
(a b c) The string must contain all the elements (the order is not important!)
[(a b) (c "[")] The parenthesizes can be nested. The expression means:
the string either contains a AND b (the order is not important)
OR
it contains c and [ (observe how the special character [ needs to be between quotes)

A more formal description of the syntax would be:

queryString ::= <queryPart>[<empty characters><queryString>]
queryPart ::= <keyword>|<orComposition>|<andComposition>
orComposition ::= [<queryPart>{<empty characters><queryPart>}]
andComposition ::= (<queryPart>{<empty characters><queryPart>})
emptyCharacters ::= as defined by the Char.IsSeparator method
keyword ::= "<characters>"|<non empty, non special characters>

This syntax may appear a little strange and has the drawback that it's non-standard. Because of this it might require a little getting used to from the users perspective. I don't recommend using it as the default search language on a website, however it can be presented as an alternative (for advanced searching for example). Also, I found it to be a little more obvious than the normal approach (if you write "Anna and Betty", what do you want to search for: for the exact string or for a string containing both words?) and it's simpler to implement.

To search using this code, you must do the following simple steps:

  1. Create a search tree and remember its root
    Dim x As SearchTreeNode = SearchTreeNode.ContructSearchTree("[a b] [c d]")
  2. Now call the .MatchesString(string to match, root) function for each element you want to verify. It returns true or false depending whether the supplied string matches the rule or not.

Implementation details

The SearchTreeNode is a node of a multi-way tree which remembers it's first child and next sibling as shown below:

The nodes of the tree can be of three types: AND nodes, OR nodes and KEYWORD nodes (or leafs). An AND node matches a string if all of its children match the string. An OR node matches a string if at least one of its children match the string. A keyword node matches a string if it is contained within the string. For example the following rule is translated to the tree seen below: "[(a b) (c d)] e"

You probably observed the extra level with an AND node. This is done to avoid treating a particular case in the routine and has no impact on the run. The above tree is constructed with the ConstructSearchTree static (shared) method. This method throws an exception if it's unable to construct the tree. The message of the exception tries to hint to a possible cause of the syntax error.

The MatchesString static method needs a string which it should compare to the rule and the root of the tree. It then performs a recursive walk of the tree using the MatchString private method and returns true or false.

Some notes on the performance

I'm actually quite satisfied with the performance of the code. It can search 10000 strings containing 100 "x" characters with the rule "[a b] [c d]" in 16 milliseconds. This represents the worst case scenario for the leafs (where the actual searching is done using the indexOf method), because no match is found. Also in this case the full tree must be walked, so with a string that actually matches the rule the search could be aborted much sooner. The tests were done on an AMD Sempron 2200+ machine with 256 MB RAM running WinXP SP2.

One possibility to speed up the matching would be to implement a faster string matching algorithm like Boyer-Moore or TurboSearch. The helper tables for these could be generated during the creation of the tree. The problem is that the strings are Unicode encoded and thus a helper table (for the TurboSearch algorithm for example) could reach very big sizes (64K elements). Also there is the fact that the keywords will probably be rather small (5-10 charactes) and the indexOf method performs quite well.

Additional features

There are two methods: one for building the filter part of an SQL query from the tree and one to build a regular expression from it. They are both static (shared) methods and their syntax is as follows:

BuildSQLQuery(field name, 
                root element)
returns a string that looks like:
(<field name> 
            LIKE "%keyword%" AND
... In the keyword all occurrences of the quote character (") are replaced with backslash quote (\").

The second method builds a regular expression from the search tree:

BuildRegEx(root 
                element)
and returns the regular expression as string. One limitation of this method is that it doesn't generate a regular expression that handles all possible orders or appearances of the supplied argument, just the one defined in the tree. I used this approach because given the limitations of regular expressions a full implementation would generate very long and complicated expressions (one would have to explicitly specify all possible permutations in the expression).

Final notes

I hope that you found this article useful. If you found this article stupid, annoying, incorrect, etc. express this fact by rating the article as you see fit. However I would appreciate if you would leave a comment explaining your reasons so that I can (hopefully) learn from my mistakes.

License

This article, along with any associated source code and files, is licensed under A Public Domain dedication


Written By
Web Developer
Romania Romania
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions

 
Generaladvanced search Pin
aisyahhumairah8-Oct-07 17:10
aisyahhumairah8-Oct-07 17:10 
QuestionSeacrching document or speific word in windows Drives Pin
kk_upadhyay7-Oct-07 21:08
kk_upadhyay7-Oct-07 21:08 
QuestionBinding Data to DataGrid from database? Pin
khmays201228-Mar-07 3:10
khmays201228-Mar-07 3:10 
QuestionBinding Data to DataGrid from database? Pin
khmays201228-Mar-07 3:09
khmays201228-Mar-07 3:09 
Generali am a chinese Pin
JohnnyQiang23-Jan-07 0:09
JohnnyQiang23-Jan-07 0:09 
there is two reasons coming here for me .
one,i want to practise my english
the other is i want to prove my program level and make some foreign friends!
thanks
Laugh | :laugh:

chinese!

Question.net frame lock Pin
Hiran Rajapaksha19-Dec-06 9:19
Hiran Rajapaksha19-Dec-06 9:19 
QuestionCan the approach be used for normal (OR/AND) syntax? Pin
Michael Freidgeim20-Jun-06 18:43
Michael Freidgeim20-Jun-06 18:43 
AnswerRe: Can the approach be used for normal (OR/AND) syntax? Pin
Cd-MaN20-Jun-06 20:32
Cd-MaN20-Jun-06 20:32 
GeneralNicely Done Pin
Paul Conrad1-Mar-06 16:36
professionalPaul Conrad1-Mar-06 16:36 
NewsC# Version Pin
RubyPdf16-Jan-06 23:50
RubyPdf16-Jan-06 23:50 
GeneralC++ re-writing + NOT operator + optimisation Pin
christophe.carpentier@alogie.fr17-May-05 1:29
christophe.carpentier@alogie.fr17-May-05 1:29 
GeneralUsing this to search NTEXT column in SQL Server 2000 Pin
Ravidhu3-Apr-05 11:10
Ravidhu3-Apr-05 11:10 
QuestionWhy not just use Regular Expressions? Pin
Terence Wallace12-Mar-05 5:22
Terence Wallace12-Mar-05 5:22 
AnswerRe: Why not just use Regular Expressions? Pin
Cd-MaN12-Mar-05 8:58
Cd-MaN12-Mar-05 8:58 
GeneralRe: Why not just use Regular Expressions? Pin
danielcomer19-Jan-06 6:43
danielcomer19-Jan-06 6:43 
GeneralRe: Why not just use Regular Expressions? Pin
Polymorpher3-Feb-06 3:52
Polymorpher3-Feb-06 3:52 
GeneralRe: Why not just use Regular Expressions? Pin
Shane Story27-Dec-06 8:44
Shane Story27-Dec-06 8:44 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.