|
|||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||
|
Announcements
Want a new Job?
Chapters
Services
Feature Zones
|
IntroductionWhen dealing with large amounts of data (either a website or in other specialized programs) it is very good to offer the user the possibility to narrow down the subset of data he needs to look at. The only way to avoid this and still not confuse the user is to partition the data in small chunks (such as presenting an address book by the first letter of each name). This isn't always possible so a search system can be helpful. Description of the solutionI've developed a class which implements a simple query language. The syntax is presented below by some elementary examples and their interpretation:
A more formal description of the syntax would be: queryString ::= <queryPart>[<empty characters><queryString>]queryPart ::= <keyword>|<orComposition>|<andComposition> orComposition ::= [<queryPart>{<empty characters><queryPart>}] andComposition ::= (<queryPart>{<empty characters><queryPart>}) emptyCharacters ::= as defined by the Char.IsSeparator methodkeyword ::= "<characters>"|<non empty, non special characters> This syntax may appear a little strange and has the drawback that it's non-standard. Because of this it might require a little getting used to from the users perspective. I don't recommend using it as the default search language on a website, however it can be presented as an alternative (for advanced searching for example). Also, I found it to be a little more obvious than the normal approach (if you write "Anna and Betty", what do you want to search for: for the exact string or for a string containing both words?) and it's simpler to implement. To search using this code, you must do the following simple steps:
Implementation detailsThe SearchTreeNode is a node of a multi-way tree which remembers it's first child and next sibling as shown below:
The nodes of the tree can be of three types: AND nodes, OR nodes and KEYWORD nodes (or leafs). An AND node matches a string if all of its children match the string. An OR node matches a string if at least one of its children match the string. A keyword node matches a string if it is contained within the string. For example the following rule is translated to the tree seen below: "[(a b) (c d)] e"
You probably observed the extra level with an AND node. This is done to avoid
treating a particular case in the routine and has no impact on the run. The above
tree is constructed with the The Some notes on the performanceI'm actually quite satisfied with the performance of the code. It can search 10000 strings containing 100 "x" characters with the rule "[a b] [c d]" in 16 milliseconds. This represents the worst case scenario for the leafs (where the actual searching is done using the indexOf method), because no match is found. Also in this case the full tree must be walked, so with a string that actually matches the rule the search could be aborted much sooner. The tests were done on an AMD Sempron 2200+ machine with 256 MB RAM running WinXP SP2. One possibility to speed up the matching would be to implement a faster string
matching algorithm like Boyer-Moore or TurboSearch. The helper tables for these
could be generated during the creation of the tree. The problem is that the
strings are Unicode encoded and thus a helper table (for the TurboSearch
algorithm for example) could reach very big sizes (64K elements). Also there is
the fact that the keywords will probably be rather small (5-10 charactes) and
the Additional featuresThere are two methods: one for building the filter part of an SQL query from the
tree and one to build a regular expression from it. They are both static
(shared) methods and their syntax is as follows: The second method builds a regular expression from the search tree: Final notesI hope that you found this article useful. If you found this article stupid, annoying, incorrect, etc. express this fact by rating the article as you see fit. However I would appreciate if you would leave a comment explaining your reasons so that I can (hopefully) learn from my mistakes.
|
||||||||||||||||||||||||||||||||||