5,557,174 members and growing! (16,870 online)
Email Password   helpLost your password?
General Programming » Algorithms & Recipes » Algorithms     Intermediate

Advanced text searching

By Cd-MaN

Building a simple query language with the OR and AND boolean operators
VB, Windows, .NET 1.0, .NET 1.1, .NETVisual Studio, VS.NET2002, VS.NET2003, Dev

Posted: 4 Mar 2005
Updated: 4 Mar 2005
Views: 117,942
Bookmarked: 47 times
Announcements
Want a new Job?



Search    
Advanced Search
Sitemap
Prize winner in Competition "VB.NET Feb 2005"
27 votes for this Article.
Popularity: 5.32 Rating: 3.72 out of 5
3 votes, 11.1%
1
2 votes, 7.4%
2
2 votes, 7.4%
3
10 votes, 37.0%
4
10 votes, 37.0%
5

Introduction

When dealing with large amounts of data (either a website or in other specialized programs) it is very good to offer the user the possibility to narrow down the subset of data he needs to look at. The only way to avoid this and still not confuse the user is to partition the data in small chunks (such as presenting an address book by the first letter of each name). This isn't always possible so a search system can be helpful.

Description of the solution

I've developed a class which implements a simple query language. The syntax is presented below by some elementary examples and their interpretation:

abc cde fgh The character sequences "abc", "cde" and "fgh" must be all present in the string, however their order doesn't matter (so cdeabcfgh is a valid sequence for this rule). Every blank character is considered a separator (actually I use the Char.IsSeparator method to determine if a character is separator)
"abc cde""" The sequence abc cde" must be present in the string. Observe that in order to include the quote character ("), you must double it. Besides of grouping words the quote character is also useful in inserting special characters ( (, ), [, ]).
[a b c] The string must contain at least one of the elements (OR).
(a b c) The string must contain all the elements (the order is not important!)
[(a b) (c "[")] The parenthesizes can be nested. The expression means:
the string either contains a AND b (the order is not important)
OR
it contains c and [ (observe how the special character [ needs to be between quotes)

A more formal description of the syntax would be:

queryString ::= <queryPart>[<empty characters><queryString>]
queryPart ::= <keyword>|<orComposition>|<andComposition>
orComposition ::= [<queryPart>{<empty characters><queryPart>}]
andComposition ::= (<queryPart>{<empty characters><queryPart>})
emptyCharacters ::= as defined by the Char.IsSeparator method
keyword ::= "<characters>"|<non empty, non special characters>

This syntax may appear a little strange and has the drawback that it's non-standard. Because of this it might require a little getting used to from the users perspective. I don't recommend using it as the default search language on a website, however it can be presented as an alternative (for advanced searching for example). Also, I found it to be a little more obvious than the normal approach (if you write "Anna and Betty", what do you want to search for: for the exact string or for a string containing both words?) and it's simpler to implement.

To search using this code, you must do the following simple steps:

  1. Create a search tree and remember its root
    Dim x As SearchTreeNode = SearchTreeNode.ContructSearchTree("[a b] [c d]")
  2. Now call the .MatchesString(string to match, root) function for each element you want to verify. It returns true or false depending whether the supplied string matches the rule or not.

Implementation details

The SearchTreeNode is a node of a multi-way tree which remembers it's first child and next sibling as shown below:

The nodes of the tree can be of three types: AND nodes, OR nodes and KEYWORD nodes (or leafs). An AND node matches a string if all of its children match the string. An OR node matches a string if at least one of its children match the string. A keyword node matches a string if it is contained within the string. For example the following rule is translated to the tree seen below: "[(a b) (c d)] e"

You probably observed the extra level with an AND node. This is done to avoid treating a particular case in the routine and has no impact on the run. The above tree is constructed with the ConstructSearchTree static (shared) method. This method throws an exception if it's unable to construct the tree. The message of the exception tries to hint to a possible cause of the syntax error.

The MatchesString static method needs a string which it should compare to the rule and the root of the tree. It then performs a recursive walk of the tree using the MatchString private method and returns true or false.

Some notes on the performance

I'm actually quite satisfied with the performance of the code. It can search 10000 strings containing 100 "x" characters with the rule "[a b] [c d]" in 16 milliseconds. This represents the worst case scenario for the leafs (where the actual searching is done using the indexOf method), because no match is found. Also in this case the full tree must be walked, so with a string that actually matches the rule the search could be aborted much sooner. The tests were done on an AMD Sempron 2200+ machine with 256 MB RAM running WinXP SP2.

One possibility to speed up the matching would be to implement a faster string matching algorithm like Boyer-Moore or TurboSearch. The helper tables for these could be generated during the creation of the tree. The problem is that the strings are Unicode encoded and thus a helper table (for the TurboSearch algorithm for example) could reach very big sizes (64K elements). Also there is the fact that the keywords will probably be rather small (5-10 charactes) and the indexOf method performs quite well.

Additional features

There are two methods: one for building the filter part of an SQL query from the tree and one to build a regular expression from it. They are both static (shared) methods and their syntax is as follows: BuildSQLQuery(field name, root element) returns a string that looks like: (<field name> LIKE "%keyword%" AND ... In the keyword all occurrences of the quote character (") are replaced with backslash quote (\").

The second method builds a regular expression from the search tree: BuildRegEx(root element) and returns the regular expression as string. One limitation of this method is that it doesn't generate a regular expression that handles all possible orders or appearances of the supplied argument, just the one defined in the tree. I used this approach because given the limitations of regular expressions a full implementation would generate very long and complicated expressions (one would have to explicitly specify all possible permutations in the expression).

Final notes

I hope that you found this article useful. If you found this article stupid, annoying, incorrect, etc. express this fact by rating the article as you see fit. However I would appreciate if you would leave a comment explaining your reasons so that I can (hopefully) learn from my mistakes.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.

A list of licenses authors might use can be found here

About the Author

Cd-MaN



Occupation: Web Developer
Location: Romania Romania

Other popular Algorithms & Recipes articles:

Article Top
Sign Up to vote for this article
You must Sign In to use this message board.
FAQ FAQ Noise ToleranceSearch Search Messages 
 Layout  Per page   
 Msgs 1 to 17 of 17 (Total in Forum: 17) (Refresh)FirstPrevNext
Subject  Author Date 
Generaladvanced searchmemberaisyahhumairah18:10 8 Oct '07  
QuestionSeacrching document or speific word in windows Drivesmemberkk_upadhyay22:08 7 Oct '07  
QuestionBinding Data to DataGrid from database?memberkhmays20124:10 28 Mar '07  
QuestionBinding Data to DataGrid from database?memberkhmays20124:09 28 Mar '07  
Generali am a chinesemembermqcan01161:09 23 Jan '07  
Question.net frame lockmemberHiran Rajapaksha10:19 19 Dec '06  
QuestionCan the approach be used for normal (OR/AND) syntax?memberMichael Freidgeim19:43 20 Jun '06  
AnswerRe: Can the approach be used for normal (OR/AND) syntax?memberCd-MaN21:32 20 Jun '06  
GeneralNicely Donemembercomputerguru9238217:36 1 Mar '06  
NewsC# VersionmemberNeo Fount0:50 17 Jan '06  
GeneralC++ re-writing + NOT operator + optimisationmemberchristophe.carpentier@alogie.fr2:29 17 May '05  
GeneralUsing this to search NTEXT column in SQL Server 2000memberRavidhu12:10 3 Apr '05  
GeneralWhy not just use Regular Expressions?memberTLWallace.NET6:22 12 Mar '05  
GeneralRe: Why not just use Regular Expressions?memberCd-MaN9:58 12 Mar '05  
GeneralRe: Why not just use Regular Expressions?memberdanielcomer7:43 19 Jan '06  
GeneralRe: Why not just use Regular Expressions?memberPolymorpher4:52 3 Feb '06  
GeneralRe: Why not just use Regular Expressions?memberShane Story9:44 27 Dec '06  

General General    News News    Question Question    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

PermaLink | Privacy | Terms of Use
Last Updated: 4 Mar 2005
Editor: Paul Watson
Copyright 2005 by Cd-MaN
Everything else Copyright © CodeProject, 1999-2008
Web20 | Advertise on the Code Project