|
The code has an exponential behavior when it tries to reject matches with multiple asterisks. The solution to it comes from 1991 in Lars Mathiesen's modification to wildmat. Here's the full code sample with knobs and test-cases:
#include <stdio.h>
#define TCHAR char
#define _T(x) (x)
#define USE_ABORT
typedef enum {
false = 0,
true = 1,
abort = 2,
none = 3
} WM_RET;
#ifndef USE_ABORT
#define abort false
#endif
const char* STATUS[] = {
"FALSE", "TRUE", "ABORT", "NONE"
};
int recurse;
WM_RET WildcardMatch(const TCHAR *pszText, const TCHAR *pszPattern) {
recurse++;
WM_RET result = none;
while (*pszPattern)
{
if (*pszPattern == _T('?'))
{
if (!*pszText)
return abort;
++pszText;
++pszPattern;
}
else if (*pszPattern == _T('*'))
{
result = WildcardMatch(pszText, pszPattern + 1);
if (result != false)
return result;
if (!*pszText)
return abort;
if (WildcardMatch(pszText + 1, pszPattern) == true)
return true;
return abort;
}
else if (*pszText++ != *pszPattern++)
return false;
}
return !*pszText && !*pszPattern;
}
void inspect(const TCHAR *pat, const TCHAR *str) {
recurse = 0;
int res = WildcardMatch(str, pat);
printf("%s (%d calls)\n", STATUS[res], recurse);
}
int main(void) {
inspect(
"-*-*-*-*-*-*-12-*-*-*-m-*-*-*",
"-adobe-courier-bold-o-normal--12-120-75-75-m-70-iso8859-1");
inspect(
"-*-*-*-*-*-*-12-*-*-*-m-*-*-*",
"-adobe-courier-bold-o-normal--12-120-75-75-X-70-iso8859-1");
return 0;
}
I have set up a sandbox at https://repl.it/repls/ScarceMinorTechnician for you all to play with it. Before you put it into a library though, remember to UPPERCASE the enums and add some prefixes to avoid collision.
You can read a description of this technique at Wikipedia.
(What is all this dehydrated fiasco? I thought code minimization is something that can be done automatically without a need for humans to introduce bugs.)
modified 27-Nov-19 5:30am.
|
|
|
|
|
exactly what I need - thanks
|
|
|
|
|
Carefully crafted, clear it works - just fine!
|
|
|
|
|
on an empty pszString and
(pszMatch == _T("*?")) or any other combination where there an empty pszString and any characters after the initial _T('*') in pszMatch.
Since one _T('*') is sufficient for any subsequent match, wouldn't it be better to close out the function by skipping the
while ((*pszMatch) == _T('*'))
{
pszMatch++;
}
loop at the end and just close with
return ((*pszMatch) == _T('\0')) || (*pszMatch) == _T('*'));
? Why advance the character being considered in pszMatch after any _T('*") is found?
|
|
|
|
|
Hi
I have a requirement for a similar functionality in linux, Can you please help me with linux version of the code ?
Thanks a Lot
|
|
|
|
|
I like it. Thank you very much.
|
|
|
|
|
Awesome, I love MVP's lol
|
|
|
|
|
Thank you Martin for this simple algorithm for doing wildcard string matching with out messy regex. Tested it and all the contributed versions in the comments and both the original version and Dezhi Zhao's "straight" version perform wonderfully and as expected.
|
|
|
|
|
|
I always prefer the code to either shorter or faster if I cannot have both
The code should be much shorter since you use recursion already.
So, here is my dehydrated version that still keeps the original spirit of the work.
I believe the dehydrated is both smaller and faster than the original one though I did not bench them.
Have fun and thanks for sharing!
bool WildcardMatch(const TCHAR *pszString, const TCHAR *pszMatch)
{
while (*pszMatch)
{
if (!*pszString)
return *pszMatch ==_T('*') && !*(pszMatch + 1);
if (*pszMatch ==_T('*'))
return WildcardMatch(pszString, pszMatch + 1) || WildcardMatch(pszString + 1, pszMatch);
if (!(*pszMatch ==_T('?') || _totupper(*pszString) == _totupper(*pszMatch)))
return false;
pszString++;
pszMatch++;
}
return !*pszString;
}
modified on Friday, April 29, 2011 6:08PM removed the redundant !*pszMatch in the last return statement;
|
|
|
|
|
Thanks for your version.
I would prefer mine. I would say, that your code is less readable and I guess that there is no difference in speed.
If you write to if statements or one or condition, for the compiler it is the same...
Try and do a benchmark, buti think it isn't worth the effort.
--
Martin Richter (MVP for C++) WWJD http://blog.m-ri.de
"A well-written program is its own heaven; a poorly written program is its own hell!" The Tao of Programming
|
|
|
|
|
You don't even need to benechmark them to know that the dehydrated is faster.
A rather simple observation will prove these:
1. the dehydrated does fewer comparisions by combining some.
2. the dehydrated does early exit that elminates some unneccesary recursive calls.
As for the readability, I think it is still on par with the original one if we bother to put comments there.
Still need a real life bench number for an arbitary test set data?
Here it is:
1000000 test set runs of WildcardMatch = 11.108s, r = 7000000
1000000 test set runs of WildcardMatch_dehydrated = 9.890s, r = 7000000
So, the dehydrated is 12+% faster in this test run.
The compiler is Intel C/C++ 12. Intel CPU E5400 2.70G
Here is the code snip of the test:
const TCHAR* string[] = { _T(""), _T("MyName.doc"), _T("MyName.docx"), _T("MyNamName.docx") };
const TCHAR* match[] = {_T("*"), _T(""), _T("?yName.*x"), _T("My*Name.*oc") };
int bench(bool (*pfn)(const TCHAR* pszString, const TCHAR* pszMatch), TCHAR* fname)
{
int r = 0;
int count = 1000000;
unsigned start, finish;
int c = count;
start = GetTickCount();
while(c--)
{
int i = 0;
do
{
int j = 0;
do
{
r += (*pfn)(string[i], match[j]);
} while(++j < sizeof(match)/sizeof(TCHAR*));
} while(++i < sizeof(string)/sizeof(TCHAR*));
}
finish = GetTickCount();
_tprintf(_T("%d test set runs of %s = %.3fs, r = %d\n"), count, fname, (float)(finish - start) / 1000, r);
return r;
}
int _tmain(int argc, _TCHAR* argv[])
{
bench(WildcardMatch, _T("WildcardMatch"));
bench(WildcardMatch_dehydrated, _T("WildcardMatch_dehydrated"));
return 0;
}
|
|
|
|
|
Hi Dezhi Zhao,
I like your optimizing approach, thanks, I hope you will find interesting my attempts to write superfast ITERATIVE wildcard finder, just see my KAZAHANA dedicated article.
Please try to improve it, it turned out it is second-to-fastest (written by Jack Handy - a fellow CP member) the one you are using in your bench.
Get down get down get down get it on show love and give it up
What are you waiting on?
|
|
|
|
|
Here are two more versions.
One is iterative rather than recursive.
bool WildcardMatch(const TCHAR *pszString, const TCHAR *pszMatch)
{
const TCHAR *cp;
const TCHAR *mp;
while ((*pszString) && (*pszMatch != _T('*')))
{
if ((_totupper(*pszMatch) != _totupper(*pszString)) && (*pszMatch != _T('?')))
{
return false;
}
pszMatch++;
pszString++;
}
while (*pszString)
{
if (*pszMatch == _T('*'))
{
if (!*++pszMatch)
{
return true;
}
mp = pszMatch;
cp = pszString + 1;
}
else if ((_totupper(*pszMatch) == _totupper(*pszString)) || (*pszMatch == _T('?')))
{
pszMatch++;
pszString++;
}
else
{
pszMatch = mp;
pszString = cp++;
}
}
while (*pszMatch == _T('*'))
{
pszMatch++;
}
return !*pszMatch;
}
The other is hardly readable but very condensed:
bool WildcardMatch3(const TCHAR *pszString, const TCHAR *pszMatch)
{
return *pszMatch-_T('*') ? *pszString ? (*pszMatch==_T('?')) | (_totupper(*pszString)==_totupper(*pszMatch)) &&
WildcardMatch3(pszString+1, pszMatch+1) : !*pszMatch : WildcardMatch3(pszString, pszMatch+1) || *pszString && WildcardMatch3(pszString+1, pszMatch);
}
Enjoy!
Axel Kemper
|
|
|
|
|
Here is a test run:
1000000 test set runs of WildcardMatch = 11.092s, r = 7000000
1000000 test set runs of WildcardMatch_dehydrated = 10.109s, r = 7000000
1000000 test set runs of WildcardMatch_iterative = 5.834s, r = 7000000
1000000 test set runs of WildcardMatch3 = 10.452s, r = 7000000
The iterative one has the obvious advatage. However, it is disqualified in this game because it does not comform to the original intent of the author
And I believe that the iterative version could still be further improved without much effort. But again, it does not belong to this amatuer group
Have fun!
|
|
|
|
|
The iterative version is the best for sure!
But it isn't so obvious why it works but it works.
Saddly the creator doen't document so I will try to explain it here:
1. Assume that a string matches up to a wildcard. (Part one ofthe loop=
2. If string part to the right matches it be checked by the second loop.
3. All non matching characters are eaten by the last found *.
So no fallback to an ealier * is needed. So recursion isn't needed.
--
Martin Richter (MVP for C++) WWJD http://blog.m-ri.de
"A well-written program is its own heaven; a poorly written program is its own hell!" The Tao of Programming
|
|
|
|
|
Now, you prefer the speed over elegance. Here is a straight entry that is slightly faster and shorter than the iterative one
bool WildcardMatch_straight(const TCHAR *pszString, const TCHAR *pszMatch)
{
const TCHAR *mp;
const TCHAR *cp = NULL;
while (*pszString)
{
if (*pszMatch == _T('*'))
{
if (!*++pszMatch)
return true;
mp = pszMatch;
cp = pszString + 1;
}
else if (*pszMatch == _T('?') || _totupper(*pszMatch) == _totupper(*pszString))
{
pszMatch++;
pszString++;
}
else if (!cp)
return false;
else
{
pszMatch = mp;
pszString = cp++;
}
}
while (*pszMatch == _T('*'))
pszMatch++;
return !*pszMatch;
}
benchmarks:
1000000 test set runs of WildcardMatch = 11.107s, r = 7000000
1000000 test set runs of WildcardMatch_dehydrated = 10.000s, r = 7000000
1000000 test set runs of WildcardMatch3 = 11.560s, r = 7000000
1000000 test set runs of WildcardMatch_iterative = 5.865s, r = 7000000
1000000 test set runs of WildcardMatch_straight = 5.694s, r = 7000000
|
|
|
|
|
Thank you for sharing, my friend. I don't understand why you were voted down. Your code is a great contribution. In fact, I think it would be fantastic if you were to elaborate on your thoughts on this subjects by submitting an article with these an other alternative implementations. Thanks again.
|
|
|
|
|
Thank you for your compliments. You know this is just a code forum and you may have some smart guys casting their random votes here
As I said in my intial post, the code should be short or fast if you cannot have both. I just wanted to show that it can be done but choose to skip comments.
As you can see that a guy claimed a few weeks ago that the dehydrated produced a different result. I tested them again and told him that he must have done an invalid test. However, that guy never came back admitting his mistake
|
|
|
|
|
Awesome, just what I was looking for. Simple and easy to follow.
|
|
|
|
|
Dehydrated version is bad and not functional.
The part:
if (!*pszString)
return *pszMatch ==_T('*') && !*(pszMatch + 1);
produces bad result when more * is at the end of pszMatch string.
|
|
|
|
|
I'm not sure that you know what you know what you're trying to say.
I don't see any problems here at all. That line is perfect legal and functional.
The dehydrated version is the same as the original in terms of functionality or behaviors from a caller's view, but only shorter and faster.
Please do us a favor and provide a simple example to support your claim. Let us know what is the result from the original and the dehydrated. Thanks
|
|
|
|
|
I try this:
int _tmain(int argc, _TCHAR* argv[])
{
bool b;
b=WildcardMatch_poriadne(_T("abcd"),_T("a*b*d**"));
printf("%d\n",b);
b=WildcardMatch_Zhao1(_T("abcd"),_T("a*b*d**"));
printf("%d\n",b);
b=WildcardMatch_akemper1(_T("abcd"),_T("a*b*d**"));
printf("%d\n",b);
return 0;
}
The results are:
1
0
1
So dehydrated version gives bad result. Problem is with test, which check ending *.
There may be hundred * at the end, which can match zero characters.
The bad test is:
if (!*pszString)
return *pszMatch ==_T('*') && !*(pszMatch + 1);
The return statement gives bad result when there is multiple * at the end of pszMatch.
|
|
|
|
|
Now I think I know what you meant
However, I think you did an invalid test. I just run the same test and the results from the original and the dehydrated are the same, 1.
So, please check if you messed up the dehydrated version somehow.
In my test, I just copied and pasted the code snip of the dehydrated from my original post in this thread.
Please do the same thing, run your tests again and let us know the results. Thanks
modified 22-Mar-13 14:13pm.
|
|
|
|
|
I just tested it as well with the following input:
WildcardMatch_straight("RooK!BaseD@cdg-EF70F305.woh.res.rr.com", "*!BaseD@*.res.??.**")
Works fine...
|
|
|
|
|