
SQL Server Style Wildcard Matching

17 Jun 2010
Match SQL Server style wildcards in client code

Introduction

I was asked to highlight fields in a CListCtrl that matched part of the SQL filter used to retrieve the records. Rather than hit the database multiple times for each potentially matching filter criterion, the neatest solution seemed to be to record the LIKE patterns and match them on the client for each row in the list. The code presented here implements that idea.

Background

Of course, you could convert the LIKE syntax into regex syntax and run it through your favourite regex engine, but to do this well requires careful tokenisation of the pattern anyway (C style escapes are not supported by SQL Server, for example), so why not just write a class to do the whole job?

As I had written a lexer generator class already, I had a code base to work from and, in fact, just needed to vastly simplify what I had already written.

The first task is to tokenise the wildcard pattern. SQL Server supports '%' for zero or more repetitions of any character, '_' for any single character, and character ranges which can be optionally negated, e.g., [a-z], [^0-9].

The following code tokenises the wildcard:

enum e_Token {eEOF, eZeroOrMore, eAny, eCharSet};

typedef basic_string_token<CharT> string_token;

static e_Token next(const CharT * &curr_, const CharT *end_,
    string_token &chars_, const bool icase_,
    const std::locale &locale_)
{
    e_Token eToken = eCharSet;

    if (curr_ >= end_) return eEOF;

    switch (*curr_)
    {
    case '%':
#if defined _MSC_VER && _MSC_VER <= 1200
        chars_._charset.erase();
#else
        chars_._charset.clear();
#endif
        chars_._negated = true;
        eToken = eZeroOrMore;
        break;
    case '_':
#if defined _MSC_VER && _MSC_VER <= 1200
        chars_._charset.erase();
#else
        chars_._charset.clear();
#endif
        chars_._negated = true;
        eToken = eAny;
        break;
    case '[':
        charset(curr_, end_, chars_, icase_, locale_);
        chars_.remove_duplicates();
        chars_.normalise();
        break;
    default:
        if (icase_ && (std::isupper(*curr_, locale_) ||
            std::islower(*curr_, locale_)))
        {
            const CharT upper_ = std::toupper(*curr_, locale_);
            const CharT lower_ = std::tolower(*curr_, locale_);

            chars_._charset = upper_;
            chars_._charset += lower_;
        }
        else
        {
            chars_._charset = *curr_;
        }

        chars_._negated = false;
        break;
    }

    ++curr_;
    return eToken;
}

As you can see, there is nothing earth-shattering about the main tokenisation process, although the charset processing is more involved. In a regex implementation, this routine is far more complex!

The character ranges are tokenised like this:

static void charset(const CharT * &curr_, const CharT *end_,
       string_token &chars_, const bool icase_,
       const std::locale &locale_)
{
    CharT prev_ = 0;

    ++curr_;

    if (curr_ >= end_)
    {
        // Pointless returning index if at end of string
        throw runtime_error("Unexpected end of wildcard "
            "following '['.");
    }

    chars_._negated = *curr_ == '^';

    if (chars_._negated)
    {
        ++curr_;

        if (curr_ >= end_)
        {
            // Pointless returning index if at end of string
            throw runtime_error("Unexpected end of wildcard "
                "following '^'.");
        }
    }

    while (*curr_ != ']')
    {
        prev_ = *curr_;
        ++curr_;

        if (curr_ >= end_)
        {
            // Pointless returning index if at end of string
            throw runtime_error("Unexpected end of wildcard "
                "(missing ']').");
        }

        if (*curr_ == '-')
        {
            charset_range(prev_, curr_, end_, chars_, icase_, locale_);
        }
        else
        {
            if (icase_ && (std::isupper(prev_, locale_) ||
                std::islower(prev_, locale_)))
            {
                const CharT upper_ = std::toupper(prev_, locale_);
                const CharT lower_ = std::tolower(prev_, locale_);

                chars_._charset += upper_;
                chars_._charset += lower_;
            }
            else
            {
                chars_._charset += prev_;
            }
        }
    }

    if (!chars_._negated && chars_._charset.empty())
    {
        throw runtime_error("Empty charsets not allowed.");
    }
}

static void charset_range(const CharT prev_, const CharT * &curr_,
    const CharT *end_, string_token &chars_, const bool icase_,
    const std::locale &locale_)
{
    ++curr_;

    if (curr_ >= end_)
    {
        // Pointless returning index if at end of string
        throw runtime_error("Unexpected end of wildcard "
            "following '-'.");
    }

    std::size_t range_end_ = 
      static_cast<typename Traits::index_type>(*curr_);

    ++curr_;

    if (curr_ >= end_)
    {
        // Pointless returning index if at end of string
        throw runtime_error("Unexpected end of wildcard "
            "(missing ']').");
    }

    std::size_t range_start_ = 
      static_cast<typename Traits::index_type>(prev_);

    // Semantic check
    if (range_end_ < range_start_)
    {
        throw runtime_error("Invalid range in charset.");
    }

    chars_._charset.reserve(chars_._charset.size() +
        (range_end_ + 1 - range_start_));

    for (; range_start_ <= range_end_; ++range_start_)
    {
        const CharT ch_ = static_cast<CharT>(range_start_);

        if (icase_ && (std::isupper(ch_, locale_) ||
            std::islower(ch_, locale_)))
        {
            const CharT upper_ = std::toupper(ch_, locale_);
            const CharT lower_ = std::tolower(ch_, locale_);

            chars_._charset += upper_;
            chars_._charset += lower_;
        }
        else
        {
            chars_._charset += ch_;
        }
    }
}

Again, due to the simplicity of wildcard ranges, this is all pretty basic stuff.

Having tokenised the wildcard, we can now build a syntax tree from the tokens; the code for this is short and quite easy to understand:

static node *parse(const CharT * &curr_, const CharT *end_,
       node_ptr_vector &node_ptr_vector_, const bool icase_,
       const std::locale &locale_)
{
    node *root_ = 0;
    typename tokeniser::string_token chars_;

    while (curr_ < end_)
    {
        tokeniser::e_Token eToken = tokeniser::next(curr_, end_, chars_,
            icase_, locale_);
        node *node_ = 0;

        switch (eToken)
        {
        case tokeniser::eZeroOrMore:
            node_ptr_vector_->push_back(0);
            node_ptr_vector_->back() = new leaf_node(chars_);
            node_ = node_ptr_vector_->back();
            node_ptr_vector_->push_back(0);
            node_ptr_vector_->back() = new iteration_node(node_);
            node_ = node_ptr_vector_->back();
            break;
        case tokeniser::eAny:
        case tokeniser::eCharSet:
            node_ptr_vector_->push_back(0);
            node_ptr_vector_->back() = new leaf_node(chars_);
            node_ = node_ptr_vector_->back();
            break;
        }

        if (root_)
        {
            node *lhs_ = root_;

            node_ptr_vector_->push_back(0);
            node_ptr_vector_->back() = new sequence_node(lhs_, node_);
            root_ = node_ptr_vector_->back();
        }
        else
        {
            root_ = node_;
        }
    }

    if (root_)
    {
        node *rhs_ = 0;

        node_ptr_vector_->push_back(0);
        node_ptr_vector_->back() = new end_node;
        rhs_ = node_ptr_vector_->back();
        node_ptr_vector_->push_back(0);
        node_ptr_vector_->back() = new sequence_node(root_, rhs_);
        root_ = node_ptr_vector_->back();
    }

    return root_;
}

Having got as far as generating a syntax tree, we really have to take things up a notch to generate a DFA. The method used is straight out of the famous 'Dragon Book'. As simple as the tree generation code looks, there is actually more going on behind the scenes! The Dragon Book explains that the firstpos, lastpos, and followpos sets can be computed by a single tree traversal once the tree is built. In practice there is no need for a separate pass: thanks to the way the tree is constructed, we can build these sets as we build each subtree...

Here are some quick snippets of the tree nodes doing this:

basic_leaf_node(const string_token &token_) :
    node(false),
    _token(token_)
{
    if (!_nullable)
    {
        _firstpos.push_back(this);
        _lastpos.push_back(this);
    }
}

basic_sequence_node(node *left_, node *right_) :
    node(left_->nullable() && right_->nullable()),
    _left(left_),
    _right(right_)
{
    _left->append_firstpos(_firstpos);

    if (_left->nullable())
    {
        _right->append_firstpos(_firstpos);
    }

    if (_right->nullable())
    {
        _left->append_lastpos(_lastpos);
    }

    _right->append_lastpos(_lastpos);

    node_vector &lastpos_ = _left->lastpos();
    const node_vector &firstpos_ = _right->firstpos();

    for (node_vector::iterator iter_ = lastpos_.begin(),
        end_ = lastpos_.end(); iter_ != end_; ++iter_)
    {
        (*iter_)->append_followpos(firstpos_);
    }
}

basic_iteration_node(node *next_) :
    node(true),
    _next(next_)
{
    node_vector::iterator iter_;
    node_vector::iterator end_;

    _next->append_firstpos(_firstpos);
    _next->append_lastpos(_lastpos);

    for(iter_ = _lastpos.begin(), end_ = _lastpos.end();
        iter_ != end_; ++iter_)
    {
        (*iter_)->append_followpos(_firstpos);
    }
}

basic_end_node() : 
    node(false)
{
    node::_firstpos.push_back(this);
    node::_lastpos.push_back(this);
}

basic_node(const bool nullable_) :
     _nullable(nullable_)
{
}

Here's how we convert the tree to a DFA:

void build(const CharT *curr_, const CharT *end_)
{
    typename parser::node_ptr_vector node_ptr_vector_;
    // parse() also takes the case-insensitivity flag (assumed here to be
    // held in an _icase member alongside _locale)
    node *root_ = parser::parse(curr_, end_, node_ptr_vector_, _icase,
        _locale);
    const typename node::node_vector *followpos_ = &root_->firstpos();
    node_set_vector seen_sets_;
    size_t_vector hash_vector_;

    closure(followpos_, seen_sets_, hash_vector_);

    for (std::size_t index_ = 0; index_ < seen_sets_->size(); ++index_)
    {
        equivset_list equiv_list_;

        build_equiv_list(seen_sets_[index_], equiv_list_);

        for (typename equivset_list::list::const_iterator iter_ =
            equiv_list_->begin(), end_ = equiv_list_->end();
            iter_ != end_; ++iter_)
        {
            equivset *equivset_ = *iter_;
            const std::size_t transition_ = closure
                (&equivset_->_followpos, seen_sets_,
                hash_vector_);

            if (transition_ != npos)
            {
                _dfa[index_]._transitions.push_back
                    (transition(equivset_->_token, transition_));
            }
        }
    }
}

The closure() function builds a vector of followpos sets and recognises when the same set pops up more than once, returning the index of the existing entry rather than creating a new one. build_equiv_list() intersects the character ranges at a given tree node and resolves any ambiguities. As we have no 'or' or independent repeat operators to deal with, the only possible ambiguity comes from the '%' wildcard.

Let's say we have the sequence '%a'. We would intersect 'any character' with 'a', leaving us with two character ranges: '[^a]' and '[a]'. During the intersection process, these transitions have their followpos sets updated to reflect the split. In this example, '[^a]' loops back around on itself (state 0) and '[a]' goes to DFA state 1. When a closure occurs on state 1, we end up with the same two character ranges, but this time '[^a]' goes back to state 0 (implying the first 'a' was actually part of '%') and '[a]' loops round on itself to state 1 (also implying that the first 'a' was part of '%'). State 1 of the DFA is the end state in this example.

I have left the character ranges as strings rather than converting to equivalence classes as my lexer generator does, as performance should be less of an issue for a simple wildcard class. Hopefully, this also makes the code easier to follow, as the introduction of equivalence classes makes things rather opaque when stepping through the code to see what is going on!

Finally, here are the match routines:

bool match(const string &str_) const
{
    bool match_ = true;
    const CharT *curr_ = str_.c_str();
    const CharT *end_ = curr_ + str_.size();
    std::size_t state_ = 0;

    for (; curr_ < end_; ++curr_)
    {
        if (!_dfa[state_].match(*curr_, state_))
        {
            match_ = false;
            break;
        }
    }

    if (match_)
    {
        match_ = _dfa[state_]._end;
    }

    return match_;
}

The DFA state's match routine:

bool match(const CharT c_, std::size_t &next_) const
{
    bool match_ = false;
    typename transition_deque::const_iterator iter_ =
        _transitions.begin();
    typename transition_deque::const_iterator end_ =
        _transitions.end();

    for (; iter_ != end_; ++iter_)
    {
        if (iter_->match(c_, next_))
        {
            match_ = true;
            break;
        }
    }

    return match_;
}

The transition match routine:

bool match(const CharT c_, std::size_t &next_) const
{
    bool match_ = false;

    match_ = _chars._charset.find(c_) != string_token::string::npos;
    match_ ^= _chars._negated;

    if (match_)
    {
        next_ = _next;
    }

    return match_;
}

Using the Code

The zip comes with a simple example, and here it is:

#include <iostream>
#include "tfbWildcard/wildcard.hpp"

int main(int argc, char* argv[])
{
    std::string test_("%blah%");

    try
    {
        tfb::wildcard wc;

        wc.assign(test_, true);

        bool match = wc.match("blah");

        std::cout << "Match: " << match << std::endl;
    }
    catch (const std::exception &e)
    {
        std::cout << e.what();
    }

    return 0;
}

Points of Interest

Note that tfb::wildcard is a typedef; you can create a wide character class using tfb::wwildcard. You can also construct the class with a string instead of calling assign(), but other than that, this is the entire API!

History

  • 2009-07-03: First version released
  • 2009-07-03: Fixed case insensitivity for single characters, and made case sensitivity an option
  • 2010-04-29: Updated source code - this version is C++0x compatible
  • 2010-06-17: This version copes with an empty string as the wildcard

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

TFB_PLC
Software Developer (Senior) TFB Plc
United Kingdom


Article Copyright 2009 by TFB_PLC