|
Hello,
I think you have a bug in the code. You initialize a second parameter for m_apnFSM as:
std::vector<SEntry>( c_nMaxChar )
and later you have code:
// follow up the table states
size_t nIdxChar = min(static_cast<size_t>(pszSearch[nIdx]), c_nMaxChar) ;
const SEntry & rEntry = m_apnFSM[dwCurState][nIdxChar] ;
If c_nMaxChar wins in the "min" function, it makes nIdxChar = c_nMaxChar and the index goes out of range. It seems you need:
size_t nIdxChar = min(static_cast<size_t>(pszSearch[nIdx]), c_nMaxChar - 1) ;
there
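The off-by-one can be seen in isolation with a minimal sketch (kMaxChar and ClampIndex are illustrative names, not the article's; the point is that min(v, N) still lets N through, one past the last valid index of an N-element row):

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical table width, mirroring c_nMaxChar in the article's code.
const std::size_t kMaxChar = 128;

// Clamp a character value to a valid column index in [0, kMaxChar - 1].
// min(v, kMaxChar) would allow kMaxChar itself through, which indexes
// one past the end of a row of kMaxChar entries.
std::size_t ClampIndex(std::size_t v)
{
    return std::min(v, kMaxChar - 1);
}
```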
|
|
|
|
|
I agree. Thank you for pointing it out. In fact, I found that problem the day after I posted this (April 26th), and made some more design changes to the class. I didn't upload my revisions here because I just wasn't sure how important they would be to anyone. (Since I was updating a really old article, I didn't know if anybody really cared anymore ).
One of the changes I made was to add some new methods to the CIVStringSet class header:
void SetCharRange( TCHAR cMin, TCHAR cMax ) ;
size_t CharIndex( TCHAR cVal ) { return static_cast<size_t>(max(min(cVal, m_cMaxChar), m_cMinChar) - m_cMinChar) ; }
bool IsPrefixFound( const _tstring & strText ) ;
I added two new member variables:
TCHAR m_cMinChar,
m_cMaxChar ;
in the .cpp, I changed implementation as follows:
* I got rid of the c_nMaxChar from file scope.
* Changed/added class initializers to
m_cMinChar( _T('A') ),
m_cMaxChar( _T('z') ),
m_apnFSM( dwInitialWidth, vector<SEntry>( _T('z') - _T('A') + 1 ) ),
* Implemented SetCharRange:
void CIVStringSet::SetCharRange( TCHAR cMin, TCHAR cMax )
{
m_cMinChar = cMin ;
m_cMaxChar = cMax ;
size_t nSize = m_apnFSM.size() ;
m_apnFSM.clear() ;
m_apnFSM.resize( nSize, vector<SEntry>( m_cMaxChar - m_cMinChar + 1 ) ) ;
}
* In InsertWord, I changed the line that initialized nIdxChar:
size_t nIdxChar = CharIndex( pszWord[nChar] ) ;
* and at the bottom of that same loop, I changed the line to this:
m_apnFSM.resize( ((dwCurState * 2) + 63) & 0xFFFFFFC0, vector<SEntry>( m_cMaxChar - m_cMinChar + 1 ) ) ;
* Then, at the line where you pointed out my bug, I now have:
size_t nIdxChar = CharIndex( m_strSearch[nIdx] ) ;
* and I implemented IsPrefixFound:
bool CIVStringSet::IsPrefixFound( const _tstring & strText )
{
unsigned long dwCurState = 0UL ;
size_t nLen = strText.length() ;
for ( UINT nIdx = 0U ; nIdx < nLen ; nIdx++ )
{
size_t nIdxChar = CharIndex( strText[nIdx] ) ;
const SEntry & rEntry = m_apnFSM[dwCurState][nIdxChar] ;
dwCurState = rEntry.m_dwState ;
}
return (dwCurState > 0UL) ;
}
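The CharIndex clamping above can be sketched standalone, using plain char instead of TCHAR and file-scope variables standing in for the members, so it compiles without the Windows headers (names mirror the class, but this is just an illustration):

```cpp
#include <algorithm>
#include <cstddef>

// Stand-ins for m_cMinChar / m_cMaxChar with the default range above.
char g_cMinChar = 'A';
char g_cMaxChar = 'z';

// Clamp into [g_cMinChar, g_cMaxChar], then shift to a zero-based index,
// so out-of-range characters collapse onto the first or last column
// instead of indexing outside the FSM row.
std::size_t CharIndex(char cVal)
{
    char c = std::max(std::min(cVal, g_cMaxChar), g_cMinChar);
    return static_cast<std::size_t>(c - g_cMinChar);
}
```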
...just in case anybody cares
Thanks again for pointing out the bug!
Scot Brennecke
Software Developer
VC++ MVP
|
|
|
|
|
A minor change to consider for your very helpful test program: Have the "Search" button be the default button, so that when an end-user presses the "Return" key, the search proceeds.
As it is now, the "Done" button is the default, and the app closes. This is disconcerting ... my first impression was that the app had crashed.
|
|
|
|
|
Thanks for providing this.
Can this code (or a derivative) simply detect if all search tokens are in a string, and quit as soon as this has been determined? My impression is that the existing code continues to the end of the string and finds all matches of all tokens.
Here is a scenario: a 40 meg text file is in a memory buffer and split into about 300,000 lines. I want to make a list of those lines which contain all of the search words. The memory buffer will be searched repeatedly for different groups of tokens.
Suppose line 1000 is:
"There are probably dozens of reasons a programmer might want to search a text string for the presence of any of several substrings simultaneously. The brute force approach taken all too often is to simply loop through a list of tokens, then search the string sequentially, beginning to end, for each one. Needless to say, this is very inefficient, and can even become outrageously long if the list of tokens is long and/or the searched string is long. I have created a class to simplify the task and perform it very efficiently."
If the search tokens were "string" and "search", then this line can be considered a "match" after about 85 characters, and move on to the next line.
I wonder if there is a way for the code to stop looking for the token "search" after it has been found (about position 70), and proceed to only look for "string". Can a token be removed part way thru a search if this would speed up the search (maybe it wouldn't speed up the search more than a minor amount, if at all)?
Another possible optimization would be to switch to something like Boyer-Moore-Horspool for the rest of the string when there was only one token left (but that is getting pretty complicated).
Can a call such as DetectWhetherAllTokensExist be done with the existing code? If so, how? If not, what modifications would be involved?
Thanks again.
|
|
|
|
|
You're welcome. As you can see, I wrote these classes almost four years ago, so I had to reacquaint myself with the code.
As Phil B pointed out back in 2003, my code doesn't force full-word matches. He offered a semi-solution that insists that a match not be counted unless the following character is a space. Of course, a space is not the only word delimiter, and to check for all those would make the code slower. I'd use something like strspn or SpanIncluding if I needed to do that, though.
Removing strings from the FSM would be a speed increase, because it reduces the number of paths followed when searching. I didn't provide for removal because I never considered a project like yours before. Detecting that there is only one token left would add too much overhead to the algorithm, and negate any performance increase you might have gotten from switching to another algorithm.
To implement DetectWhetherAllTokensExist, you'd obviously need to somehow know that a particular word has already been found. If you implement a Remove method in the class, certainly removing the word from the search would accomplish that. Once all words have been removed, you'd know you're done.
So, how do you remove a string from the set? Remember that it's vital to not change the index of an entry in the set, so removing can't involve changing the position of elements in the array. I haven't pondered this enough to give you a solution now, and I just got a call from my son, who needs a ride. Maybe I'll come up with an answer later. Let me know if you get there first.
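The bookkeeping half of that idea - tracking which words remain, separate from actually removing them from the FSM - could look something like this sketch (CTokenTracker is an illustrative helper, not part of the article's classes; ReportFound would be called from the search loop whenever a match fires):

```cpp
#include <set>
#include <string>

// Track the set of tokens not yet seen; the search can stop as soon as
// the set is empty, without modifying the FSM at all.
class CTokenTracker
{
public:
    void Add(const std::string& tok) { m_remaining.insert(tok); }

    // Record a match; returns true once every token has been found.
    bool ReportFound(const std::string& tok)
    {
        m_remaining.erase(tok);
        return m_remaining.empty();
    }

    bool AllFound() const { return m_remaining.empty(); }

private:
    std::set<std::string> m_remaining;
};
```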
Scot Brennecke
Software Developer
VC++ MVP
|
|
|
|
|
That was a prompt reply. Thanks.
I haven't stared at your algorithm closely enough to really understand what it is doing. Is your approach somewhat similar to the Aho-Corasick algorithm?
Just a quick, uninformed guess .... would it be possible to replace the matched member of the FSM with some convention like { 0xFF 0xFF .... 0xFF } which would never match again? Then you could keep a count of matches, and when all were found, you could quit processing that line and go to the next.
|
|
|
|
|
Check the Acknowledgement in the article, and you'll see that the algorithm was created by Halibard and Rubin. It was published in the June 2002 issue of C/C++ Users Journal, which you can get by registering at http://www.cuj.com. No explanation of the algorithm I give here could be as thorough as reading it straight from the horses' mouths.
Once you read the article (or parse the code more closely), you'll see that the substrings do not get stored as single entities in the FSM. Indeed, by definition, a Finite State Machine is not like a table so much as it is like a map. Strings that have the same initial letters will share some entries in the FSM as they are mapped into it. This is partly why it is more efficient. Rather than comparing strings one at a time with your text, it compares all possible matches at once.
However, because of this design, it somewhat confounds the idea of removing entries from the FSM. Since entries don't necessarily belong to just one string, it's more difficult to determine whether or not it can be removed. I haven't yet had the brainstorm to reveal that solution.
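To make the sharing concrete, here is a minimal trie-style sketch (not the article's actual DWORD-packed layout) showing that words with a common prefix reuse states, which is why an entry can belong to more than one string:

```cpp
#include <map>
#include <string>
#include <vector>

// One state per node; transitions keyed by character.
struct State { std::map<char, int> next; };

// Insert each word, reusing existing transitions where prefixes overlap,
// and return how many states the machine ends up with.
int CountStates(const std::vector<std::string>& words)
{
    std::vector<State> states(1);  // state 0 is the start state
    for (const std::string& w : words)
    {
        int cur = 0;
        for (char c : w)
        {
            std::map<char, int>::iterator it = states[cur].next.find(c);
            if (it == states[cur].next.end())
            {
                states.push_back(State());
                int fresh = static_cast<int>(states.size()) - 1;
                states[cur].next[c] = fresh;
                cur = fresh;
            }
            else
                cur = it->second;
        }
    }
    return static_cast<int>(states.size());
}
```

"string" and "search" share the initial 's' state, so together they need 12 states rather than the 13 two independent six-letter words would take.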
Scot Brennecke
Software Developer
VC++ MVP
|
|
|
|
|
I registered to look at CUJ. I didn't realize you could do that without charge ... Thanks.
According to the original article, the algorithm was oriented to detecting prohibited words in e-mail, so it quit once one of the banned words was found.
"The implementation in this article returns the offset within the string of the first matching substring and stops processing."
which is quite a bit different from what I am trying to do.
The article indicates that 0 and -1 have special meaning. I haven't stared at your code enough to be close to grok'ing it, but perhaps a 0 or -1 could be "moved up" the data structure related to a word that was matched, so that detection would be shortcut and proceed to the next letter with the state machine "cleared".
If there was no prefix with any other search word (such as looking for "hello" and "goodbye"), then it would be up near the "top". If there was a common prefix (as in my original question using the example "string" and "search"), then the state associated with 's' + 't' would be "shortcut/truncated" when "string" was found.
I was wondering if it would be simpler and still advantageous if the code could keep track of which words had been matched, and quit when they were all "hit". It wouldn't necessarily stop detecting previously matched substrings.
Thanks for taking the time to refresh your memory with this code, and considering a revision. It does seem like a worthwhile enhancement with broad applicability.
BTW, have you ever seen a C or C++ implementation of the Aho-Corasick algorithm for detecting multiple sub-strings in a string? I'm only aware of that algorithm, and my understanding is that it also uses a FSM approach.
As they say, "if this stuff was easy, it wouldn't pay so well." (written by a freeware developer to a software contributor <g>)
|
|
|
|
|
I was not familiar with that algorithm before, although I know the name Aho to be the 'a' in the famous "awk" Unix utility (Weinberger and Kernighan, also of great Unix fame, are the 'w' and 'k'). Upon cursory examination, their algorithm appears very similar to the Halibard-Rubin one upon which my classes are based. Interestingly, they reference Kernighan's spam filter as an inspiration to their work.
There is a CP article (http://www.codeproject.com/csharp/ahocorasick.asp) that may interest you if you haven't already seen it.
At this time, I'm not interested in enhancing my classes (for free, at least). Especially since there are so many other similar pieces of code out there. It would be a simple matter to keep track of which words have been hit, but a performance drain to detect when all have been hit -- at least given what my imagination has conjured so far.
Scot Brennecke
Software Developer
VC++ MVP
|
|
|
|
|
l_d_allan, did you ever locate a solution? Because I too am looking for the same thing: a class/procedure that returns true if each token exists in the search string. Ideally, I don't care how many times a token is present, so it seems a short-circuited logic works best: when a token is found once, we stop searching for that one. But, I guess that's a bit different from how FSM works.
|
|
|
|
|
I have not spent a lot of time viewing the code, but it appears that the efficient part of this procedure is the 'looking' for the substring. Creating a new FSM from words/substrings of interest appears not to be expensive.
So here is my suggestion. Use FindFirstIn to locate the first substring. Now create a new 'problem' with the rest of the original string from the point where substring was found. Create a new FSM with one less substring and start a new search. Repeat the process till all the words/substrings have been found. You do not have to make any changes to the class or the search procedure to do this.
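That loop might be sketched like this, using std::string::find as a stand-in for the class's FindFirstIn (so the names and structure here are illustrative, not the article's API): find the earliest match among the remaining tokens, drop that token, and resume past the match.

```cpp
#include <string>
#include <vector>
#include <cstddef>

// True if every token occurs in text, searching left to right and
// removing each token from the working set as soon as it is found.
bool AllTokensExist(const std::string& text, std::vector<std::string> tokens)
{
    std::size_t start = 0;
    while (!tokens.empty())
    {
        // Earliest match among all remaining tokens (mimics FindFirstIn).
        std::size_t bestPos = std::string::npos;
        std::size_t bestTok = 0;
        for (std::size_t i = 0; i < tokens.size(); ++i)
        {
            std::size_t pos = text.find(tokens[i], start);
            if (pos < bestPos) { bestPos = pos; bestTok = i; }
        }
        if (bestPos == std::string::npos)
            return false;  // some token never appears

        // Continue with the rest of the string and one fewer token.
        start = bestPos + tokens[bestTok].size();
        tokens.erase(tokens.begin() + bestTok);
    }
    return true;
}
```

One caveat with this sketch: a token whose only occurrence overlaps an earlier match could be skipped, so a production version would need to decide how to resume.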
|
|
|
|
|
Please help me to add a CStringList to an object of CIVStringSet. I tried to use the overloaded member function Add( CStringList list ), but an error is shown. It says "cannot convert from CStringList to const char *".
Moreover, the Add() function is taking too much time to add all the CStrings to the list when I pass a CString as a parameter to the Add function.
Why can't I use the overloaded member function Add( CStringList list )? Please reply, as I am facing a deadline.
|
|
|
|
|
Are you using a CStringList object, or perhaps a pointer to a CStringList? The function requires a direct reference to the object, not a pointer. You are using the MFC version of the class, and not the STL version, right?
As concerns "taking too much time", can you be more specific? Have you identified which step is taking the most time?
Scot Brennecke
Software Developer
VC++ MVP
|
|
|
|
|
Excellent bit of code; however, I used it to search for 1400 substrings in a short bit of text (a URL) and it fell over because the DFA it builds gets too large.
I think the large space requirement for many substrings is inevitable, but if anyone knows otherwise I'd love to know.
|
|
|
|
|
You can dramatically increase the number of strings this class can handle by changing the array m_apnFSM to some dynamically-allocating STL type, and using that template class's allocation mechanisms instead of the SetColDim method. This should also be faster if you're adding a large number of strings, because the original grows by a small static amount each time it reallocates, while the STL containers amortize the cost of memory allocation.
With that change it seems to scale to around 80,000 strings on my machine.
(I used a deque< vector<DWORD[128]> >, as the [] operator is a little faster for vector, but deques are faster if you keep forcing the collection to allocate more memory.)
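The amortization point can be demonstrated by counting reallocations under the two growth policies - a standalone sketch, unrelated to the class's actual types: growing by a fixed step needs a reallocation every step, while std::vector's geometric growth needs only a handful.

```cpp
#include <vector>
#include <cstddef>

// Reallocations when capacity grows by a fixed step of `step` elements
// (like the original code's small static increase).
std::size_t FixedStepReallocs(std::size_t n, std::size_t step)
{
    std::size_t cap = 0, reallocs = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (i >= cap) { cap += step; ++reallocs; }
    return reallocs;
}

// Reallocations std::vector actually performs for n push_backs,
// detected by watching capacity() change.
std::size_t VectorReallocs(std::size_t n)
{
    std::vector<int> v;
    std::size_t reallocs = 0;
    std::size_t cap = v.capacity();
    for (std::size_t i = 0; i < n; ++i)
    {
        v.push_back(0);
        if (v.capacity() != cap) { ++reallocs; cap = v.capacity(); }
    }
    return reallocs;
}
```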
|
|
|
|
|
Hi
There is a bug in the STL version of the code.
It reads:
int CIVStringSet::Add( LPCTSTR pszWords, LPCTSTR pszDelims )
{
    .....
    // Now, move beyond the token just extracted and the delimiters
    int nTokenLen = strToken.length() ;
    if ( nTokenLen < strUnparsed.length() )
    {
        int nDelimLen = _tcsspn( &strUnparsed[nTokenLen], pszDelims ) ;
        nTokenLen += nDelimLen ;
        if ( nTokenLen < strUnparsed.length() )
            strUnparsed = strUnparsed.substr( nTokenLen ) ;
        else
            break ;
    }
    else
        break ;
    .......
which gets stuck in a loop at this line:
    int nDelimLen = _tcsspn( &strUnparsed[nTokenLen], pszDelims ) ;
It works with this instead, though:
.......
    // Now, move beyond the token just extracted and the delimiters
    int nTokenLen = strToken.length() ;
    if ( nTokenLen < strUnparsed.length() )
    {
        int nDelimLen = strUnparsed.find_first_not_of( pszDelims, nTokenLen ) ;
        //nTokenLen += nDelimLen ;
        if ( nDelimLen < strUnparsed.length() )
            strUnparsed = strUnparsed.substr( nDelimLen ) ;
        else
            break ;
    }
    else
        break ;
.........
Thanks for a really useful class though.
Spencer
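For reference, the delimiter-skipping idea in Spencer's fix can be written as a standalone tokenizer with find_first_not_of / find_first_of, which sidesteps the pointer-into-the-string indexing entirely (this is a generic sketch, not the class's Add method):

```cpp
#include <string>
#include <vector>
#include <cstddef>

// Split text into words, treating any character in delims as a
// separator; leading, trailing, and repeated delimiters are skipped.
std::vector<std::string> SplitWords(const std::string& text,
                                    const std::string& delims)
{
    std::vector<std::string> words;
    std::size_t pos = text.find_first_not_of(delims);
    while (pos != std::string::npos)
    {
        std::size_t end = text.find_first_of(delims, pos);
        words.push_back(text.substr(pos, end - pos));
        pos = text.find_first_not_of(delims, end);
    }
    return words;
}
```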
|
|
|
|
|
Hi there,
Your example is great. I'm wondering if you have any implementation for Unicode strings. If not, can you point me to how this can be done?
WBR Oleg
|
|
|
|
|
Oleg,
I have not needed to adapt this to Unicode. This could be adaptable to Unicode strings by substituting wchar_t for char in the code. However, there is an assumption that only 128 characters are in the alphabet, as we only allocate 128 arrays and mask out indexes above 0x7F. With some small effort, I think you could do it.
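For what it's worth, the 128-character assumption comes down to masking indexes to 7 bits, so any wide character above 0x7F would alias a lower index (an illustrative sketch, not the class's actual code):

```cpp
#include <cstddef>

// Mask a character down to a 7-bit table index, as the class does.
// Any wchar_t above 0x7F collides with a character below it, which is
// why a Unicode port needs a wider table or a different mapping.
std::size_t MaskedIndex(wchar_t c)
{
    return static_cast<std::size_t>(c) & 0x7F;
}
```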
Scot Brennecke
Software Developer
VC++ MVP
|
|
|
|
|
I am designing a Postscript editing program, which will search through the Postscript text file looking for a particular string. The problem I am having is the amount of time it is taking to search each line for the string. Is there any way of speeding this up, and not looking at each line, maybe somehow "jump" to the correct place in the file?
Any advice would be appreciated.
Bryster
|
|
|
|
|
As an example string to search through....
"Joshua Lewis soll nicht so machen als ob er das kann, sondern wie alle kinder sollen sie lieb sein"
(The German grammar is horrible, I know; my English also.)
And as the substrings/words to be found are "so sollen wie er"
The results shown in the dlg app are..
so,so,er,so,er,wie,er,so...
soll and sollen were found, but not as complete words.
Is there any way the parser can tell the difference between partial and complete words?
I'm no genius, and therefore can only guess that once a substring has been found, it is "marked as being found", and the parser continues with the next token in the string to be searched.
Is there a way to "force" the parser to continue so that "full" words are also found?
bum... and I thought I'd got rid of all the bugs
|
|
|
|
|
Found a (semi) solution for this problem
if ( dwEntry > MAXWORD )
{
    int nTmpIndex = nIdx ;
    if ( (!(nTmpIndex++ > m_strSearch.length()) && m_strSearch[nTmpIndex] == ' ')
         || nTmpIndex == m_strSearch.length() )
    {
        int nIndex = HIWORD(dwEntry) - 1 ;
        ASSERT( nIndex < size() ) ;
        rnNext = nIndex ;
        m_nCurTextChar = nIdx + 1 ;  // set the index for the next time
        break ;
    }
    else
    {
        // did not find a complete word
        wCurState = LOWORD(dwEntry) ;
    }
}
Changed this in the function FindNext( int & rnNext )....
This only returns COMPLETE words that have been found; maybe a flag or something to differentiate between partial and full word search and replace...??
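A related sketch of the whole-word test, checking both neighbours against any non-word character rather than a space alone, as was suggested earlier in the thread (IsWholeWord is an illustrative helper, not part of the class):

```cpp
#include <cctype>
#include <string>
#include <cstddef>

// A match at [pos, pos + len) counts as a whole word only if both
// neighbouring characters are non-alphanumeric, or the match touches
// the start or end of the string.
bool IsWholeWord(const std::string& text, std::size_t pos, std::size_t len)
{
    bool leftOk = (pos == 0) ||
        !std::isalnum(static_cast<unsigned char>(text[pos - 1]));
    bool rightOk = (pos + len >= text.length()) ||
        !std::isalnum(static_cast<unsigned char>(text[pos + len]));
    return leftOk && rightOk;
}
```

With this, "soll" inside "sollen" is rejected because the character after the match is a letter, while "soll" followed by a space or the end of the line is accepted.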
bum... and I thought I'd got rid of all the bugs
|
|
|
|
|
I thought of porting the class to Java, as I find it useful. During the process I found that it uses DWORD as a container for (non-defined) structs.
This makes the algorithm less than obvious, not to mention cumbersome to port.
C++ is blessed with all that is required to make the algorithm both readable and efficient, so why not use it?
Apart from that - great class.
|
|
|
|
|
It took me several minutes of head-scratching before I finally made sense of what you meant by "uses DWORD as a container for (non-defined) structs.". At least, I think I now know to what you refer. . . I assume you mean that you don't approve of using the arrays of DWORDs m_apnFSM, rather than having arrays of some struct with two smaller integers. The fact is that after reading the article upon which my class is based, I had no trouble understanding the algorithm, so it never occurred to me to divide the DWORD into two variables. The original did not do this, and I didn't feel compelled to improve on it in that way.
As for porting this class to Java. . . well, I can't say that I see the wisdom. The whole idea was to make a class that was fast. And, we all know that Java is not known for its efficiency.
Nevertheless, if you feel inclined to make these changes and port to Java, I won't protest.
Scot Brennecke
Software Developer
VC++ MVP
|
|
|
|
|
Hi - love the article!! More often, however, I need to know WHERE (i.e., by character position) a particular string was found!!
Is there a way to return this from the FSM?? I don't have the base CUJ article; maybe it had notes there. I will try to find a copy (the June 2002 issue may not even be out 'Down-Under' here in Aus yet/hopefully).
In general, I'm using multiple regular expressions to (for example) split a report,
so (re's in brackets)
(^1) Matches a page-break
(^0UserId: ABCDEFG) Indicates The Start Of Data
(**** End Of Report ****) Is The End Of The Data, as long as the start has been found (ie there are multiple reports in the one file)
so I have to apply each RegEx to each line to keep track of where in the report I am (that's a lot of work). The multiple string search you present is a possible improvement, but not without knowing that the '1' for the page break was in column 0.
It's probably a waste to use such an efficient search tool on this sort of thing, given that I have to track through the file line by line - brute force may be the easiest option....
Thanks anyway
Cheers, Garth
|
|
|
|
|
Garth,
I must admit that I feel kinda silly for not providing that information! Looking at the FindNext function, the data you want is readily available (in nStartChar). The trick then becomes deciding the best way to return the position. For FindFirstIn and FindNext, we could simply change the return type from bool to int, and return the position. We would then return -1, rather than false, to indicate no more tokens are found. As for FindAllIn, however, we build a CStringList with the found tokens. We could have another array as a third parameter which would be filled with corresponding positions. Or perhaps, embed the position in the strings, using a delimiter. Or, we could just not return the positions at all.
I encourage anyone with an opinion on the matter to "speak up".
The real benefit of this class is seen when searching a large text and/or with a large number of seek words. For searching small strings for a small number of words, it may not be worth the overhead. I haven't taken the time to analyze the performance metrics to determine what the threshold is, but I suspect short lines such as yours may not warrant this. Even so, you may find this class easy enough to use that it's worth the overhead.
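The bool-to-int change for FindFirstIn/FindNext might look like this sketch, with std::string::find standing in for the class's search machinery (the function name and signature here are illustrative):

```cpp
#include <string>
#include <cstddef>

// Return the character position of the first match, or -1 when the
// token is not found - the convention proposed above in place of a
// bool return.
int FindFirstPos(const std::string& text, const std::string& token)
{
    std::size_t pos = text.find(token);
    return (pos == std::string::npos) ? -1 : static_cast<int>(pos);
}
```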
Scot Brennecke
Software Developer
VC++ MVP
|
|
|
|
|