Click here to Skip to main content
14,385,667 members

Fast String Matching with Wildcards, Globs, and Gitignore-Style Globs - How Not to Blow it Up

Rate this:
4.98 (25 votes)
Please Sign up or sign in to vote.
4.98 (25 votes)
19 Sep 2019CPOL
Classic globbing and modern gitignore-style globbing algorithms can be fast, whereas recursive implementations are known to blow up exponentially; why some freely available source code should not be used.

Introduction

Jack Handy’s wildcard string compare efficiently compares a string to a pattern containing * and ? wildcards. The algorithm is fast and safe, but does not support gitignore-style globbing rules. In this article, I will illustrate that classic globbing and modern gitignore-style globbing algorithms can be fast too. I will also explain what’s wrong with recursive implementations that blow up exponentially and why some freely available source code should not be used.

Background

Wildcard string matching and globbing isn’t as trivial as it may seem at first. In fact, mistakes in the past resulted in serious vulnerabilities such as denial of service. Simple patches, such as limiting the number of matches to limit the CPU time, have been applied to fix implementations that suffer exponential blow-up in execution time. More disconcerting is that buggy globbing implementations can be easily located on the Web: just search for wildmat.c to find copies of implementations that may crash when a character class range is incomplete, such as [a-.

The Trouble With Recursion

To understand how globbing implementations may cause denial of service, we will take a quick look at some examples. Recursion is often used in simple implementations of wildcard matching with * and ?. The idea is to scan the pattern and the string from left to right while pairwise matching the characters. When a star is encountered in the pattern, the do-match function is recursively called to compare the rest of the pattern to the rest of the string. If the recursive do-match call succeeds, then we’re done. If the recursive call fails however, we go back where we left off before the recursive call, advance one character further in the string, and repeat the recursive call. This loop is the “star-loop” that consumes none, one, or more characters in place of a star in the pattern as shown in the example below:

// returns TRUE if text string matches wild pattern with * and ?
int naive_recursive_match(const char *text, const char *wild)
{
  while (*text != '\0')
  {
    if (*wild == '*')
    {
      // any number of stars act as one star
      while (*++wild == '*')
        continue;
      // a star at the end of the pattern matches any text
      if (*wild == '\0')
        return TRUE;
      // star-loop: match the rest of the pattern and text
      while (naive_recursive_match(text, wild) == FALSE && *text != '\0')
        text++;
      return *text != '\0' ? TRUE : FALSE;
    }
    // ? matches any character or we match the current non-NUL character
    if (*wild != '?' && *wild != *text)
      return FALSE;
    text++;
    wild++;
  }
  // ignore trailing stars
  while (*wild == '*')
    wild++;
  // at end of text means success if nothing else is left to match
  return *wild == '\0' ? TRUE : FALSE;
}

This algorithm is fairly simple and needs little explanation: recursive calls are made for stars until matching or the end of the string of text is reached. A ? matches any character. Otherwise, the current pattern character is matched with the current character in the text. We move up in the text and in the pattern by one character, and repeat until the end of the text is reached. Note that any trailing stars can be ignored in the pattern when we get to the end of the text.

This approach works well for short strings to match and patterns with few stars. However, it is easy to find examples that result in an explosive execution time blow-up due to excessive backtracking on stars. Cases such as a*a*a*a*a*a*a*a*b and a string aaa…aaa of 100 a’s take minutes to terminate with a failure to match. Just try the following two shell commands:

$ touch `python3 -c 'print(100*"a")'`
$ ls a*a*a*a*a*a*a*a*b

It takes bash-3.2 about 8 minutes to terminate on my 2.9GHz i7 machine with decent performance. It takes tcsh-6.18 over an hour. The explosive 2n states for n stars is to blame for this behavior.

Using ABORT

One of the first globbing algorithms in history written by Rich Salz wildmat.c (original) had this undesirable behavior, but was updated by Lars Mathiesen in the early 2000s with an ABORT state. Their comments in the updated wildmat.c source code still remain there to this day and are quite insightful:

Quote:

Once the control of one instance of DoMatch enters the star-loop, that instance will return either TRUE or ABORT, and any calling instance will therefore return immediately after (without calling recursively again). In effect, only one star-loop is ever active. It would be possible to modify the code to maintain this context explicitly, eliminating all recursive calls at the cost of some complication and loss of clarity (and the ABORT stuff seems to be unclear enough by itself).

Apparently, the exponential blow-up problem was already well known at that time, as we find the following comment in wildmat.c demonstrating the issue:

pattern: -*-*-*-*-*-*-12-*-*-*-m-*-*-*
text 1:  -adobe-courier-bold-o-normal--12-120-75-75-m-70-iso8859-1
text 2:  -adobe-courier-bold-o-normal--12-120-75-75-X-70-iso8859-1
Text 1 matches with 51 calls, while text 2 fails with 54 calls. 
Without the ABORT, then it takes 22310 calls to fail.  Ugh.

The powerful insight here is that only the last star-loop should be active, because advancing any of the previous star-loops is not productive when the last star-loop fails. This clever change made globbing execute in linear time for the typical case. Only in the worst case, the algorithm takes quadratic time in the length of the pattern and string.

Let’s apply this improvement to our example recursive matcher:

enum { FALSE = 0, TRUE = 1, ABORT = 2 };
// returns TRUE if text string matches wild pattern with * and ?
int recursive_match(const char *text, const char *wild)
{
  while (*text != '\0')
  {
    if (*wild == '*')
    {
      int ret;
      // any number of stars act as one star
      while (*++wild == '*')
        continue;
      // a star at the end of the pattern matches any text
      if (*wild == '\0')
        return TRUE;
      // star-loop: match the rest of the pattern and text
      while ((ret = recursive_match(text, wild)) == FALSE && *text != '\0')
        text++;
      // ABORT recursion when failing the last star-loop
      if (ret != TRUE)
        return ABORT;
      return *text != '\0' ? TRUE : FALSE;
    }
    // ? matches any character or we match the current non-NUL character
    if (*wild != '?' && *wild != *text)
      return FALSE;
    text++;
    wild++;
  }
  // ignore trailing stars
  while (*wild == '*')
    wild++;
  // at end of text means success if nothing else is left to match
  return *wild == '\0' ? TRUE : FALSE;
}

This improvement avoids exponential blow-up, but still makes a recursive call for each star in the pattern. This is not necessary. We can save that state of the matcher so that we can restore it from our backup to execute another iteration of the last star-loop until we are done.

Non-Recursive Wildcard String Matching

To replace recursion with iteration, we need two backup positions: the current position in the pattern right after the star and the current position in the text string. To iterate the star-loop, we restore the backed-up positions, advance one character in the text string, and try again. We keep iterating until the match succeeds or we run out of text. If a star occurs after our last backup position, we effectively enter a new star-loop and remove the one we were iterating on previously:

// returns TRUE if text string matches wild pattern with * and ?
int match(const char *text, const char *wild)
{
  const char *text_backup = NULL;
  const char *wild_backup = NULL;
  while (*text != '\0')
  {
    if (*wild == '*')
    {
      // new star-loop: backup positions in pattern and text
      text_backup = text;
      wild_backup = ++wild;
    }
    else if (*wild == '?' || *wild == *text)
    {
      // ? matched any character or we matched the current non-NUL character
      text++;
      wild++;
    }
    else
    {
      // if no stars we fail to match
      if (wild_backup == NULL)
        return FALSE;
      // star-loop: backtrack to the last * by restoring the backup positions 
      // in the pattern and text
      text = ++text_backup;
      wild = wild_backup;
    }
  }
  // ignore trailing stars
  while (*wild == '*')
    wild++;
  // at end of text means success if nothing else is left to match
  return *wild == '\0' ? TRUE : FALSE;
}

The execution time of this algorithm is linear in the length of the pattern and string for typical cases. In the worst case, this algorithm takes quadratic time (to see why, consider pattern *aaa…aaab with ½n a’s and string aaa…aaa with n a’s, which requires ¼n2 comparisons before failing.)

In case you're asking if there is a linear worst-case algorithm, the answer is yes: by constructing a deterministic finite automaton (DFA) from the pattern for matching. This is beyond the scope of this article and requires taking into account the (high) cost of DFA construction.

Non-Recursive Glob Matching

Now that we reviewed the background and issues with string matching and globbing, let's move on to the main part of this article. Globbing differs from string matching with respect to a subtle meaning of the * and ? wildcards in globs: these wildcards do not match pathname separators, i.e., do not match / (and do not match \ in Windows). For example, foo*.h does not match foo/bar.h.

To adjust our previous match algorithm to perform glob matching, we add some checks for the / pathname separator such that * and ? cannot match it:

// returns TRUE if text string matches glob-like pattern with * and ?
int globly_match(const char *text, const char *glob)
{
  const char *text_backup = NULL;
  const char *glob_backup = NULL;
  while (*text != '\0')
  {
    if (*glob == '*')
    {
      // new star-loop: backup positions in pattern and text
      text_backup = text;
      glob_backup = ++glob;
    }
    else if ((*glob == '?' && *text != '/') || *glob == *text)
    {
      // ? matched any character except /, or we matched the current non-NUL character
      text++;
      glob++;
    }
    else
    {
      if (glob_backup == NULL || *text_backup == '/')
        return FALSE;
      // star-loop: backtrack to the last * but do not jump over /
      text = ++text_backup;
      glob = glob_backup;
    }
  }
  // ignore trailing stars
  while (*glob == '*')
    glob++;
  // at end of text means success if nothing else is left to match
  return *glob == '\0' ? TRUE : FALSE;
}

For practical glob matching applications however, this glob-like ("globly") algorithm is not yet complete. It lacks support for character classes, such as [a-zA-Z] to match letters and [^0-9] to match any character except a digit. It also lacks a means to escape the special meaning of the *, ?, and [ meta characters with a backslash. Adding these new features requires a redesign of the glob meta character tests in the main loop, using a switch to select glob meta characters and by using continue to commence the main loop. Breaking from the switch moves control to the star-loop:

#define PATHSEP '/'  // pathname separator, we should define '\\' for Windows instead
// returns TRUE if text string matches glob pattern with * and ?
int glob_match(const char *text, const char *glob)
{
  const char *text_backup = NULL;
  const char *glob_backup = NULL;
  while (*text != '\0')
  {
    switch (*glob)
    {
      case '*':
        // new star-loop: backup positions in pattern and text
        text_backup = text;
        glob_backup = ++glob;
        continue;
      case '?':
        // match any character except /
        if (*text == PATHSEP)
          break;
        text++;
        glob++;
        continue;
      case '[':
      {
        int lastchr;
        int matched = FALSE;
        int reverse = glob[1] == '^' || glob[1] == '!' ? TRUE : FALSE;
        // match any character in [...] except /
        if (*text == PATHSEP)
          break; 
        // inverted character class
        if (reverse)
          glob++;
        // match character class
        for (lastchr = 256; *++glob != '\0' && *glob != ']'; lastchr = *glob)
          if (lastchr < 256 && *glob == '-' && 
              glob[1] != '\0' ? *text <= *++glob && *text >= lastchr : *text == *glob)
            matched = TRUE;
        if (matched == reverse)
          break;
        text++;
        if (*glob != '\0')
          glob++;
        continue;
      }
      case '\\':
        // literal match \-escaped character
        glob++;
        // FALLTHROUGH
      default:
        // match the current non-NUL character
        if (*glob != *text && !(*glob == '/' && *text == PATHSEP))
          break;
        text++;
        glob++;
        continue;
    }
    if (glob_backup == NULL || *text_backup == PATHSEP)
      return FALSE;
    // star-loop: backtrack to the last * but do not jump over /
    text = ++text_backup;
    glob = glob_backup;
  }
  // ignore trailing stars
  while (*glob == '*')
    glob++;
  // at end of text means success if nothing else is left to match
  return *glob == '\0' ? TRUE : FALSE;
}

The PATHSEP character is either the conventional / or the Windows \ used in the string text to separate pathnames. Note that traditional Unix uses ! for character class inversion, for example [!0-9]. Here, we offer a choice of ! and the more conventional ^ to invert a character class, for example [^0-9].

The execution time of this algorithm is linear in the length of the pattern and string for typical cases and takes quadratic time in the worst case.

Non-recursive gitignore-style glob Matching

As we arrive at the second main part of this article, let's take a look at modern gitignore-style globbing rules and design an efficient non-recursive algorithm.

Gitignore-style globbing applies the following rules to determine file and directory pathname matches:

* matches anything except a /
? matches any one character except a /
[a-z] matches one character in the selected range of characters
[^a-z] matches one character not in the selected range of characters
[!a-z] matches one character not in the selected range of characters
/ when used at the begin of a glob, matches if pathname has no /
**/ matches zero or more directories
/** when at the end of a glob, matches everything after the /
\? matches a ? (or any character specified after the backslash)

We also make the following assumption: when a glob contains a path separator /, the full pathname is matched. Otherwise, the basename of a file or directory is matched. For example, *.h matches foo.h and bar/foo.h, bar/*.h matches bar/foo.h but not foo.h and not bar/bar/foo.h. A leading / in the glob can be used to force /*.h to match foo.h but not bar/foo.h. Furthermore, a leading ./ or / is ignored in pathnames, so ./foo/bar is matched as foo/bar as we would expect.

Examples:

* matches a, b, x/a, x/y/b  
a matches a, x/a, x/y/a but not b, x/b, a/a/b
/* matches a, b but not x/a, x/b, x/y/a
/a matches a but not x/a, x/y/a
a?b matches axb, ayb but not a, b, ab, a/b
a[xy]b matches axb, ayb but not a, b, azb
a[a-z]b matches aab, abb, acb, azb but not a, b, a3b, aAb, aZb
a[^xy]b matches aab, abb, acb, azb but not a, b, axb, ayb
a[^a-z]b matches a3b, aAb, aZb but not a, b, aab, abb, acb, azb
a/*/b matches a/x/b, a/y/b but not a/b, a/x/y/b
**/a matches a, x/a, x/y/a but not b, x/b
a/**/b matches a/b, a/x/b, a/x/y/b but not x/a/b, a/b/x
a/** matches a/x, a/y, a/x/y but not a, b/x
a\?b matches a?b but not a, b, ab, axb, a/b

To implement gitignore-style globbing, we need two star-loops: one for single star, the “*-loop”, and another for double star, the “**-loop”. The **-loop overrules the *-loop because there is no point in backtracking on a single star when we encounter a double star after it in the glob pattern. The converse is not true: we should backtrack on a double star when a single star that comes after it in the glob does not match:

// returns TRUE if text string matches gitignore-style glob pattern
int gitignore_glob_match(const char *text, const char *glob)
{
  const char *text1_backup = NULL;
  const char *glob1_backup = NULL;
  const char *text2_backup = NULL;
  const char *glob2_backup = NULL;
  // match pathname if glob contains a / otherwise match the basename
  if (*glob == '/')
  {
    // if pathname starts with ./ then ignore these pairs
    while (*text == '.' && text[1] == PATHSEP)
      text += 2;
    // if pathname starts with / then ignore it
    if (*text == PATHSEP)
      text++;
    glob++;
  }
  else if (strchr(glob, '/') == NULL)
  {
    const char *sep = strrchr(text, PATHSEP);
    if (sep)
      text = sep + 1;
  }
  while (*text != '\0')
  {
    switch (*glob)
    {
      case '*':
        if (*++glob == '*')
        {
          // trailing ** match everything after /
          if (*++glob == '\0')
            return TRUE;
          // ** followed by a / match zero or more directories
          if (*glob != '/')
            return FALSE;
          // new **-loop, discard *-loop
          text1_backup = NULL;
          glob1_backup = NULL;
          text2_backup = text;
          glob2_backup = ++glob;
          continue;
        }
        // trailing * matches everything except /
        text1_backup = text;
        glob1_backup = glob;
        continue;
      case '?':
        // match any character except /
        if (*text == PATHSEP)
          break;
        text++;
        glob++;
        continue;
      case '[':
      {
        int lastchr;
        int matched = FALSE;
        int reverse = glob[1] == '^' || glob[1] == '!' ? TRUE : FALSE;
        // match any character in [...] except /
        if (*text == PATHSEP)
          break;
        // inverted character class
        if (reverse)
          glob++;
        // match character class
        for (lastchr = 256; *++glob != '\0' && *glob != ']'; lastchr = *glob)
          if (lastchr < 256 && *glob == '-' && 
              glob[1] != '\0' ? *text <= *++glob && *text >= lastchr : *text == *glob)
            matched = TRUE;
        if (matched == reverse)
          break;
        text++;
        if (*glob != '\0')
          glob++;
        continue;
      }
      case '\\':
        // literal match \-escaped character
        glob++;
        // FALLTHROUGH
      default:
        // match the current non-NUL character
        if (*glob != *text && !(*glob == '/' && *text == PATHSEP))
          break;
        text++;
        glob++;
        continue;
    }
    if (glob1_backup != NULL && *text1_backup != PATHSEP)
    {
      // *-loop: backtrack to the last * but do not jump over /
      text = ++text1_backup;
      glob = glob1_backup;
      continue;
    }
    if (glob2_backup != NULL)
    {
      // **-loop: backtrack to the last **
      text = ++text2_backup;
      glob = glob2_backup;
      continue;
    }
    return FALSE;
  }
  // ignore trailing stars
  while (*glob == '*')
    glob++;
  // at end of text means success if nothing else is left to match
  return *glob == '\0' ? TRUE : FALSE;
}

The execution time of this algorithm is linear in the length of the pattern and string for typical cases and cubic in the worst case, when both a **-loop and a *-loop are active.

Refinements

Files with names starting with a dot (.) are treated differently by shell globs, meaning that a dot beginning a name must be matched explicitly and cannot be matched by a wildcard. To replicate this behavior, we add the following:

int glob_match(const char *text, const char *glob)
{
  int nodot = 1; // new flag to indicate that we shouldn't match a dot, e.g. after a /
  ...
  while (*text != '\0')
  {
    switch (*glob)
    {
      case '*':
        // match anything except . after /
        if (nodot && *text == '.')
          break;
        ...
      case '?':
        // match anything except . after /
        if (nodot && *text == '.')
          break;
        ...
      case '[':
      {
        ...
        // match any character in [...] except . after /
        if (nodot && *text == '.')
          break;
        ...
      }
      ...
      default:
        // match the current non-NUL character
        if (*glob != *text && !(*glob == '/' && *text == PATHSEP))
          break;
        nodot = *glob == '/';
        ...

Case-insensitive matching in ASCII can be achieved with tolower() as follows:

// match the current non-NUL character
if (tolower(*glob) != tolower(*text) && !(*glob == '/' && *text == PATHSEP))
  break;
// match character class
for (lastchr = 256; *++glob != '\0' && *glob != ']'; lastchr = tolower(*glob))
  if (lastchr < 256 && *glob == '-' &&
      glob[1] != '\0' ? tolower(*text) <= tolower(*++glob) &&
      tolower(*text) >= lastchr : tolower(*text) == tolower(*glob))
    matched = TRUE;

Unicode matching can be supported in two ways: by using wide strings, i.e., wchar_t or std::wstring or by using UTF-8 encoded strings. The wide string option requires no changes to the algorithms. The UTF-8 version requires a few changes for ? and character classes, with everything else staying the same:

case '?':
  // match anything except /
  if (*text == PATHSEP)
    break;
  utf8(&text);
  glob++;
  continue;
case '[':
{
  int chr;
  …
  // match character class
  lastchr = 0x10ffff;
  glob++;
  while (*glob != '\0' && *glob != ']')
    if (lastchr < 0x10ffff && *glob == '-' ? (chr = utf8(&text)) <= utf8(&glob) &&
        chr >= lastchr : utf8(&text) == (lastchr = utf8(&glob)))
      matched = TRUE;
  if (matched == reverse)
    break;
  if (*glob)
    glob++;
  continue;
}

where utf8() returns the wide character and advances by one UTF-8 character in the glob:

int utf8(const char **s)
{
  int c1, c2, c3, c = (unsigned char)**s;
  if (c != '\0')
    (*s)++;
  if (c < 0x80)
    return c;
  c1 = (unsigned char)**s;
  if (c < 0xC0 || (c == 0xC0 && c1 != 0x80) || c == 0xC1 || (c1 & 0xC0) != 0x80)
    return 0xFFFD;
  if (c1 != '\0')
    (*s)++;
  c1 &= 0x3F;
  if (c < 0xE0)
    return (((c & 0x1F) << 6) | c1);
  c2 = (unsigned char)**s;
  if ((c == 0xE0 && c1 < 0x20) || (c2 & 0xC0) != 0x80)
    return 0xFFFD;
  if (c2 != '\0')
    (*s)++;
  c2 &= 0x3F;
  if (c < 0xF0)
    return (((c & 0x0F) << 12) | (c1 << 6) | c2);
  c3 = (unsigned char)**s;
  if (c3 != '\0')
    (*s)++;
  if ((c == 0xF0 && c1 < 0x10) || (c == 0xF4 && c1 >= 0x10) || c >= 0xF5 || (c3 & 0xC0) != 0x80)
    return 0xFFFD;
  return (((c & 0x07) << 18) | (c1 << 12) | (c2 << 6) | (c3 & 0x3F));
}

Conclusions

Wildcard matching, classic globbing, and modern gitignore-style globbing can be fast when implemented with star-loop iterations instead of recursion. These algorithms typically run in linear time in the length of the string and pattern and in the worst case, may run in quadratic or cubic time in the length of the pattern and string, without blowing up CPU usage exponentially.

Extended globbing features such as shell brace expansion can be added to our matcher. However, brace expansion requires recursion with backtracking, which may make the exponential blow-up problem worse if we're not careful. Consider for example a*{b,a*{b,a*{b,a*{b,a*{b,a*{b,a*{b,a*b}}}}}}}. There is no easy way to limit the execution time for brace expansion, except perhaps by limiting the number of braces or by converting an extended glob to a regex for matching although regex matching can be costly too.

I wrote this article after implementing gitignore-style globbing for ugrep, due to the lack of available and usable open source code. The versions of wildmat.c that I found in official repositories all had a nasty bug that is actually documented in the source code: "Might not be robust in face of malformed patterns; e.g., "foo[a-" could cause a segmentation violation." Ugh.

History

  • 3rd August, 2019: First draft
  • 5th August, 2019: First publication
  • 8th August, 2019: Updated
  • 12th September, 2019: Added Java download sources
  • 19th September, 2019: Added shell execution example; [] does not match /; dotglob refinement; new and updated download sources
  • 17th October, 2019: Minor update to improve gitignore globbing by removing a leading / from the pathname, if present

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

Robert van Engelen
CEO
United States United States
Founder of Genivia inc, Professor of Computer Science

Comments and Discussions

 
QuestionExcellent thank you, a small addition with a,/a Pin
David Dumke27-Sep-19 14:42
MemberDavid Dumke27-Sep-19 14:42 
AnswerRe: Excellent thank you, a small addition with a,/a Pin
Robert van Engelen16-Oct-19 11:32
MemberRobert van Engelen16-Oct-19 11:32 
GeneralMy vote of 5 Pin
EJ Smits17-Sep-19 5:44
MemberEJ Smits17-Sep-19 5:44 
GeneralA Note On Slashes Pin
Rick York13-Sep-19 8:21
mveRick York13-Sep-19 8:21 
GeneralRe: A Note On Slashes Pin
Nelek14-Sep-19 11:56
protectorNelek14-Sep-19 11:56 
GeneralMy vote of 5 Pin
Jan Heckman6-Aug-19 1:37
professionalJan Heckman6-Aug-19 1:37 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

Article
Posted 5 Aug 2019

Stats

10.1K views
270 downloads
30 bookmarked