Click here to Skip to main content
Click here to Skip to main content

A Multiple Substring Search Class: CIVStringSet

By , 25 Apr 2010
 

Introduction

There are probably dozens of reasons a programmer might want to search a text string for the presence of any of several substrings simultaneously. The brute force approach taken all too often is to simply loop through a list of tokens, then search the string sequentially, beginning to end, for each one. Needless to say, this is very inefficient, and can even become outrageously long if the list of tokens is long and/or the searched string is long. I have created a class to simplify the task and perform it very efficiently. The class is offered in two versions now: one using MFC CString and CStringArray, and another using STL std::string and std::vector.

Acknowledgement

In the June 2002 issue of C/C++ Users Journal, Moishe Halibard and Moshe Rubin published an article entitled "A Multiple Substring Search Algorithm", in which they greatly detailed their algorithm for an efficient finite state machine (FSM) to assist in searching for multiple substrings in one pass. My class is based on this algorithm.

Update Details

For those of you who have seen the versions of these classes that I published in 2002, here is a summary of the major changes I've made:

  • Support for a string of delimiter characters.
  • Acceleration of the FindNext method to skip over delimiters at the start of a search, and skip to the next delimiter when a word is rejected.
  • Replacement of the FSM implementation to use a std::vector of a new SEntry struct type instead of a dynamic array of DWORDs.
  • More use of the more appropriate size_t type instead of int.
  • The sample project uses VC++ 9.0 (VS2008) instead of the archaic and buggy VC++ 6.0.

Class Interface

The CIVStringSet class is derived from either CStringArray or std::vector, so as to easily inherit the capability of a dynamic array of CStrings or std::strings.

However, many of the base class methods have been hidden by my class to prevent any insertion or removal of strings except at the end of the array. It is critical that the array index for a given word not be changed after it is added to the FSM.

The public interfaces to the classes are as follows:

struct SEntry
{
    SEntry( DWORD dwState = 0UL, DWORD dwIndex = 0UL )
        : m_dwState( dwState ), m_dwIndex( dwIndex ) {}

    DWORD m_dwState ;
    DWORD m_dwIndex ;
} ;

class CIVStringSet : public CStringArray
{
public:
    CIVStringSet( DWORD dwInitialWidth = 64 ) ;  		// Initial width of FSM
    virtual ~CIVStringSet() ;

    bool     Add( LPCTSTR pszWord ) ;                     	// a single word
    bool     Add( const CString & rstrWord ) ;            	// a single word
    size_t   Add( LPCTSTR pszWords, LPCTSTR pszDelims ) ; 	// multiple words, 
					// delimited with chars from pszDelims
    size_t   Add( LPCTSTR pszzWords, size_t nWords ) ;    	// nWords words, 
					// each 0 term'd, with extra 0 at end
    size_t   Add( const std::set<CString< & sstrWords ) ; 	// all the elements of a 
							// set of CStrings
    size_t   Add( const CStringArray & astrWords ) ;      	// all the elements 
							// of a CStringArray
    size_t   Add( const CStringList & lstrWords ) ;       	// all the elements of a 
							// CStringList

    void     SetDelims( LPCTSTR pszDelims ) { m_strDelims = pszDelims ; } // delimiters 
							// to use in word search
    UINT     FindFirstIn( CString strText, size_t & rnFirst ) ; // Begin iteration
    UINT     FindNext( size_t & rnNext ) ;                      // Continue iteration

    typedef std::pair<size_t,UINT>              CWordPosPair ;     // first is index 
				// of word in array, second is position in text
    typedef std::list< std::pair<size_t,UINT< < CWordPosPairList ; // list of pairs to 
						// be returned by FindAllIn
    size_t   FindAllIn( CString strText, CWordPosPairList & rlstrWords ) ; // Iterate 
						// all at once and make list
} ;

#ifdef _UNICODE
typedef std::wstring _tstring ;
#else
typedef std::string _tstring ;
#endif

class CIVStringSet : public std::vector<_tstring>

{
public:
    CIVStringSet( unsigned long dwInitialWidth = 64 ) ;  // Initial width of FSM
    virtual ~CIVStringSet() ;

    bool        Add( LPCTSTR pszWord ) ;                         // a single word
    bool        Add( const _tstring & rstrWord ) ;               // a single word
    size_t      Add( LPCTSTR pszWords, LPCTSTR pszDelims ) ;     // multiple words, 
					// delimited with chars from pszDelims
    size_t      Add( LPCTSTR pszzWords, size_t nWords ) ;        // nWords words, 
					// each 0 term'd, with extra 0 at end
    size_t      Add( const std::set<_tstring> & sstrWords ) ;    // all the elements of 
							 // a set of strings
    size_t      Add( const std::vector<_tstring> & astrWords ) ; // all the elements 
						// of an array of strings
    size_t      Add( const std::list<_tstring> & lstrWords ) ;   // all the elements 
							 // of a list of strings

    void        SetDelims( LPCTSTR pszDelims ) { m_strDelims = pszDelims ; } //delimiters
							// to use in word search
    UINT        FindFirstIn( const _tstring & strText, size_t & rnFirst ) ;  // Begin 
								// iteration
    UINT        FindNext( size_t & rnNext ) ;                                // Continue 
								// iteration

    typedef std::pair<size_t,UINT>              CWordPosPair ;     // first is index 
				// of word in array, second is position in text
    typedef std::list< std::pair<size_t,UINT> > CWordPosPairList ; // list of pairs 
						// to be returned by FindAllIn
    size_t      FindAllIn( _tstring strText, CWordPosPairList & rlstrWords ) ; // Iterate
						// all at once and make list
} ;

The constructor takes a single, optional argument that determines the initial width of the FSM. The width will need to be at least as much as the combined lengths of all the strings in the array. The FSM will be enlarged, as needed as strings are added to the collection. A good guess of the total length when constructing may save a trivial reallocation later.

There are seven overloads of the Add method, allowing you to add one word at a time, or an entire set, list, or array of words at once. See the demo project for an example of using one form of Add, as well as FindFirstIn and FindNext.

As strings are added to the collection, the FSM is updated. Once all strings are added, no additional preparation is required to do a search. There are three methods for searching: FindFirstIn, FindNext, and FindAllIn. The first two are used to iterate through the found substrings one at a time. The third method uses the first two to build a CWordPosPairList for you. Simply pass in the text string to be searched as the first parameter to FindFirstIn or FindAllIn, and the class will find all instances of any strings in the array.

FindFirstIn and FindNext both return the offset within the searched text at which a token was found. The size_t & parameter will be filled upon return with the index in the string array of the string token that was found. FindAllIn returns the number of tokens found, and takes a CWordPosPairList & as its second parameter. A CWordPosPairList & is an STL list of pairs. Each pair returned contains the index and offset of a found token. See the demo program for an example of using this.

Demo Program

By request, I have also provided a simple dialog-based demo program that shows the usage of the Add and Find methods. It uses the form of Add that takes a delimited string, parses it to create the list of seek tokens, then fills a list box with those found.

Source Highlights

For those who are curious, but don't want to bother downloading the source, here are some of the most interesting functions. If the code and comments aren't satisfactory documentation, let me know.

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// InsertWord
//          put the new word into the FSM
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bool CIVStringSet::InsertWord( LPCTSTR pszWord, unsigned long dwIndex )
{
    bool bRetVal = false ;

    size_t nLen = _tcslen( pszWord ) ;
    if ( m_dwMaxUsedState < MAXDWORD )    	// if we've got over 4 billion states, 
					// something's wrong
    {
        unsigned long dwCurState = 0UL ;
        for ( UINT nChar = 0U ; nChar < nLen ; nChar++ )
        {
            size_t          nIdxChar = 
		min(static_cast<size_t>(pszWord[nChar]), c_nMaxChar) ;
            SEntry        & rEntry   = m_apnFSM[dwCurState][nIdxChar] ;  // the state, 
						// and possibly also an index
            unsigned long   dwState  = rEntry.m_dwState ;
            bool            bNew     = (dwState == 0UL) ;   	// no previous words 
							// have been here

            if ( bNew )
                dwState = ++m_dwMaxUsedState ;  // use a new state number

            // special case - last character of substring
            if ( nChar == ( nLen-1 ) )
            {
                // Put both the state number and the word index in the entry
                rEntry.m_dwState = dwState ;
                rEntry.m_dwIndex = dwIndex ;
                break ;
            }
            else if ( bNew ) // if current entry for this char value and state 
				// is still zero, add to FSM
                rEntry.m_dwState = dwState ;

            dwCurState = dwState ;  // prepare for next iteration
            if ( m_apnFSM.size() <= dwCurState )
                m_apnFSM.resize( ((dwCurState * 2) + 63) & 0xFFC0, 
			vector<sentry />( c_nMaxChar ) ) ;
        }

        bRetVal = true ;
    }

    return bRetVal ;
}

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// FindFirstIn
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
UINT CIVStringSet::FindFirstIn( const _tstring & strText, size_t & rnFirst )
{
    // The real work is done by FindNext, but we must initialize the starting
    //    character index and string buffer first
    m_nCurTextChar = 0U ;
    m_strSearch = strText ;

    return FindNext( rnFirst ) ;
}

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// FindNext
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
UINT CIVStringSet::FindNext( size_t & rnNext )
{
    unsigned long dwCurState = 0UL ; // start in state 0
    UINT nIdx ;
    UINT nStartChar = MAXDWORD ;
    size_t nLen = m_strSearch.length() ;

    // Start with the character following the last one examined
    for ( nIdx = m_nCurTextChar ; nIdx < nLen ; nIdx++ )
    {
        size_t nNextIdx = m_strSearch.find_first_not_of( m_strDelims, nIdx ) ;
        if ( nNextIdx != nIdx )
        {
            // Skip over delimiters
            nStartChar = MAXDWORD ;
            if ( ( nNextIdx != _tstring::npos )
              && ( nNextIdx < nLen )
               )
            {
                dwCurState = 0UL ;
                nIdx = nNextIdx ;
            }
            else
                break ;
        }

        // if we are in state 0, save the index of the starting character,
        //   in case we must backtrack or for the return value
        if ( dwCurState == 0UL )
            nStartChar = nIdx ;

        // follow up the table states
        size_t nIdxChar = min(static_cast<size_t>(m_strSearch[nIdx]), c_nMaxChar) ;
        const SEntry & rEntry = m_apnFSM[dwCurState][nIdxChar] ;

        // if we reach an entry with a an index, we may have found a full match
        //    if so, return the word we found
        if ( ( rEntry.m_dwIndex != 0UL )
          && ( ( nIdx == nLen-1 )
            || ( m_strDelims.find_first_of( m_strSearch[nIdx+1] ) != _tstring::npos )
             )
           )
        {
            size_t nIndex = rEntry.m_dwIndex - 1 ;
            ASSERT(nIndex < size());
            rnNext = nIndex ;
            m_nCurTextChar = nIdx + 1 ;  // set the index for the next time
            break ;
        }
        else
            dwCurState = rEntry.m_dwState ;

        if ( dwCurState == 0 )
        {
            nStartChar = MAXDWORD ;  // in case we're about to terminate the loop
            nIdx = m_strSearch.find_first_of( m_strDelims, nIdx ) ; // advance to 
							// the next delimiter
        }
    }

    // return the index of the start of the word, or -1 if no word found
    return (( nIdx < nLen ) ? nStartChar : MAXDWORD) ;
}

Updates

  • 1 July 2002 - Added STL version
  • 24 April 2010 - Revised MFC and STL versions

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Scot Brennecke
Software Developer (Senior) Microsoft
United States United States
Member
Scot is an Escalation Engineer for the Microsoft Developer Support Languages team. He helps software developers who are Microsoft customers find bugs in their own, or Microsoft's, code.
 
Scot spends most of his time writing, reading, or thinking about C++ software, thereby classifying him as a geek.

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board.
Search this forum  
    Spacing  Noise  Layout  Per page   
QuestionHow to find multi space sub string.memberHarsh Gandhi30 Oct '12 - 21:07 
How to find multi space sub sting from a main string buffer.
 
If the main string buffer is :
 
"The quick red fox jumped over the lazy brown dog. Now is the time for all good men to come
to the aid of their country. Ask not what your country can do for you, but what you can do
for your country."
 
And the search sub string is "fox jumped over the lazy" then we need method which will search this sub string and return a index as "14".
QuestionHow to find Embedded sub strings with in the main string buffer.memberHarsh Gandhi30 Oct '12 - 19:52 
If the main string buffer is :
 
"The quick red fox jumped over the lazy brown dog. Now is the time for all good men to come
to the aid of their country. Ask not what your country can do for you, but what you can do
for your country."
 
Embedded Word to Search for "uic" then I need a method which will search for sub string "uic" in the main buffer and return its index as "5".
 
Actually I need a methods which which will search the sub string as Embedded word or Non-Embedded word depending on its input parameters.
 
Your method provides a way to search for Non-Embedded words but not Embedded words.
QuestionDWORD rollover, newing stack variables in .loopsmembercodecloset6 Mar '12 - 5:13 
First, great, great stuff (glad you came back and revisited it) Yes, we do care! (laughing)...
 
A couple of comments:
 
1. ++ operators cause rollover, so you need a bigger container to test for exceeding the max value of a system type. m_dw64MaxUsedState should be an unsigned __int64. Then in CIVStringSet::InsertWord(), replace "if ( m_dw64MaxUsedState < MAXDWORD )" with "if ( m_ui64MaxUsedState < (unsigned __int64)MAXDWORD )". Replace "dwState = ++m_dwMaxUsedState); // use a new state number" with "dwState = (DWORD)(++m_ui64MaxUsedState); // use a new state number";
 
2. It's much faster if instead of instantiating new instances of things inside loops, you re-use a variable instantiated outside the loop - yes, the compiler optimizer will catch some of these, but you'd be surprised how many it misses (especially object pointers). Consider these changes to CIVStringSet::InsertWord():
 
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// InsertWord
// put the new word into the FSM
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bool CIVStringSet::InsertWord( LPCTSTR pszWord, unsigned long dwIndex )
{
bool bRetVal = false ;
 
size_t nLen = _tcslen( pszWord ) ;
if ( m_ui64MaxUsedState < (unsigned __int64)MAXDWORD ) // if we've got over 4 billion states, something's wrong
{
size_t nIdxChar;
unsigned long dwState;
bool bNew;
 
unsigned long dwCurState = 0UL ;
for ( UINT nChar = 0U ; nChar < nLen ; nChar++ )
{
#ifndef __USE_ORIGINAL_VERSION__
nIdxChar = CharIndex( pszWord[nChar] ) ;
#else // #ifndef __USE_ORIGINAL_VERSION__
nIdxChar = min(static_cast(pszWord[nChar]), c_nMaxChar) ;
SEntry & rEntry = m_apnFSM[dwCurState][nIdxChar] ; // the state, and possibly also an index
#endif // #else : #ifndef __USE_ORIGINAL_VERSION__
 
dwState = (m_apnFSM[dwCurState][nIdxChar]).m_dwState ;
bNew = (dwState == 0UL) ; // no previous words have been here
 
if ( bNew )
dwState = (DWORD)(++m_ui64MaxUsedState); // use a new state number
 
// special case - last character of substring
if ( nChar == ( nLen-1 ) )
{
// Put both the state number and the word index in the entry
(m_apnFSM[dwCurState][nIdxChar]).m_dwState = dwState ;
(m_apnFSM[dwCurState][nIdxChar]).m_dwIndex = dwIndex ;
break ;
}
else if ( bNew ) // if current entry for this char value and state is still zero, add to FSM
(m_apnFSM[dwCurState][nIdxChar]).m_dwState = dwState ;
 
dwCurState = dwState ; // prepare for next iteration
if ( m_apnFSM.size() <= dwCurState )
 
#ifndef __USE_ORIGINAL_VERSION__
m_apnFSM.resize( ((dwCurState * 2) + 63) & 0xFFFFFFC0, vector( m_cMaxChar - m_cMinChar + 1 ) ) ;
#else // #ifndef __USE_ORIGINAL_VERSION__
m_apnFSM.resize( ((dwCurState * 2) + 63) & 0xFFC0, vector( c_nMaxChar ) ) ;
#endif // #else : #ifndef __USE_ORIGINAL_VERSION__
}
 
bRetVal = true ;
}
 
return bRetVal ;
}
 

You'll see more improvement in CIVStringSet::FindNext() if you make the same kinds of changes there than in CIVStringSet::InsertWord() but InsertWord() was smaller to use as an example (smile).
 
3. Right idea on the character set limits, but the implementation could be improved. Rather than assuming 'A' to 'z' as the allowed characters, it's much faster (quicker search stops for unqualified matches) if you use a dynamically-determined char set drawn straight from the search words themselves. Loop over the input search words pulling out all unique characters actually used, sort them, and use members of that list as the character value test for matches. Not only will the search be fatser, but you can now support non-Latin searches and in-word punctuation such as hyphens and possessory single quotes(with some other minor changes I won't describe here). I'm too lazy to provide the code to do this within these comments, but will provide it if you really want it.
 
Thanks,
Scott
QuestionBugmembergribunin225 Aug '11 - 3:37 
Hello,
 
I think you have a bug in the code. You initialize a second parameter for m_apnFSM as:
std::vector<SEntry>( c_nMaxChar )
 
and later you have code:
 
// follow up the table states
size_t nIdxChar = min(static_cast<size_t>(pszSearch[nIdx]), c_nMaxChar) ;
const SEntry & rEntry = m_apnFSM[dwCurState][nIdxChar] ;
 
if n_nMaxChar wins in the "min" function it make nIdxChar=c_nMaxChar and index goes out of range. It seems you need:
 
size_t nIdxChar = min(static_cast<size_t>(pszSearch[nIdx]), c_nMaxChar - 1) ;
 
there
AnswerRe: BugmemberScot Brennecke25 Aug '11 - 8:59 
I agree. Thank you for pointing it out. In fact, I found that problem the day after I posted this (April 26th), and made some more design changes to the class. I didn't upload my revisions here because I just wasn't sure how important it would be to anyone. (Since I was updating a really old article, I didn't know if anybody really cared anymore Smile | :) ).
 
One of the changes I made was to add some new methods to the CIVStringSet class header:
    void        SetCharRange( TCHAR cMin, TCHAR cMax ) ;                     // range of allowable characters
    size_t      CharIndex( TCHAR cVal ) { return static_cast<size_t>(max(min(cVal, m_cMaxChar), m_cMinChar) - m_cMinChar) ; }
    bool        IsPrefixFound( const _tstring & strText ) ;                  // Is the string a prefix of any words in the set?
 
I added two new member variables:
    TCHAR         m_cMinChar,         // Range of allowable characters minimum
                  m_cMaxChar ;        //   ... and maximum
 
in the .cpp, I changed implementation as follows:
* I got rid of the c_nMaxChar from file scope.
* Changed/added class initializers to
    m_cMinChar( _T('A') ),
    m_cMaxChar( _T('z') ),
    m_apnFSM( dwInitialWidth, vector<SEntry>( _T('z') - _T('A') + 1 ) ),
 
* Implemented SetCharRange:
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// SetCharRange
// Set the range of allowable characters
// The farther apart the min and max are, the bigger the vectors will be
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
void CIVStringSet::SetCharRange( TCHAR cMin, TCHAR cMax )
{
    m_cMinChar = cMin ;
    m_cMaxChar = cMax ;
    size_t nSize = m_apnFSM.size() ;
    m_apnFSM.clear() ;
    m_apnFSM.resize( nSize, vector<SEntry>( m_cMaxChar - m_cMinChar + 1 ) ) ;
}
 
* In InsertWord, I changed the line that initialized nIdxChar:
            size_t          nIdxChar = CharIndex( pszWord[nChar] ) ;
* and at the bottom of that same loop, I changed the line to this:
                m_apnFSM.resize( ((dwCurState * 2) + 63) & 0xFFFFFFC0, vector<SEntry>( m_cMaxChar - m_cMinChar + 1 ) ) ;
* Then, at the line where you pointed out my bug, I now have:
        // follow up the table states
        size_t nIdxChar = CharIndex( m_strSearch[nIdx] ) ;
 
* and I implemented IsPrefixFound:
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// IsPrefixFound
// Is the string a prefix of any words in the set?
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bool CIVStringSet::IsPrefixFound( const _tstring & strText )
{
    unsigned long dwCurState = 0UL ; // start in state 0
    size_t nLen = strText.length() ;
 
    for ( UINT nIdx = 0U ; nIdx < nLen ; nIdx++ )
    {
        // follow up the table states
        size_t nIdxChar = CharIndex( strText[nIdx] ) ;
        const SEntry & rEntry = m_apnFSM[dwCurState][nIdxChar] ;
        dwCurState = rEntry.m_dwState ;
    }
 
    return (dwCurState > 0UL) ;
}
 
...just in case anybody cares Big Grin | :-D
 
Thanks again for pointing out the bug!
Scot Brennecke
Software Developer
VC++ MVP

GeneralHave "Search" button be default buttonmemberl_d_allan28 Feb '06 - 3:55 
A minor change to consider for your very helpful test program: Have the "Search" button be the default button, so that when an end-user presses the "Return" key, the search proceeds.
 
As it is now, the "Done" button is the default, and the app closes. This is disconcerting ... my first impression was that the app had crashed.
 

GeneralSpecialize to "Find if all in string"memberl_d_allan27 Feb '06 - 8:06 
Thanks for provding this.
 
Can this code (or a derivative) simply detect if all search tokens are in a string, and quit as soon as this has been determined? My impression is that the existing code continues to the end of the string and finds all matches of all tokens.
 
Here is a scenario: a 40 meg text file is in a memory buffer and split into about 300,000 lines. I want to make a list of those lines which contain all of the search words. The memory buffer will be searched repeatedly for different groups of tokens.
 
Suppose line 1000 is:
"There are probably dozens of reasons a programmer might want to search a text string for the presence of any of several substrings simultaneously. The brute force approach taken all too often is to simply loop through a list of tokens, then search the string sequentially, beginning to end, for each one. Needless to say, this is very inefficient, and can even become outrageously long if the list of tokens is long and/or the searched string is long. I have created a class to simplify the task and perform it very efficiently."
 
If the search tokens were "string" and "search", then this line can be considered a "match" after about 85 characters, and move on to the next line.
 
I wonder if there is a way for the code to stop looking for the token "search" after it has been found (about position 70), and proceed to only look for "string". Can a token be removed part way thru a search if this would speed up the search (maybe it wouldn't speed up the search more than a minor amount, if at all)?
 
Another possible optimization would be to switch to something like Boyer-Moore-Horspool for the rest of the string when there was only one token left (but that is getting pretty complicated).
 
Can a call such as DetectWhetherAllTokensExist be done with the existing code? If so, how? If not, what modifications would be involved?
 
Thanks again.
 

GeneralRe: Specialize to "Find if all in string"memberScot Brennecke27 Feb '06 - 10:32 
You're welcome. As you can see that I wrote these classes almost four years ago, I had to reacquaint myself with the code.
As Phil B pointed out back in 2003, my code doesn't force full-word matches. He offered a semi-solution that insists that a match not be counted unless the following character is a space. Of course, a space is not the only word delimiter, and to check for all those would make the code slower. I'd use something like strspn or SpanIncluding if I needed to do that, though.
Removing strings from the FSM would be a speed increase, because it reduce the number of paths followed when searching. I didn't provide for removal because I never considered a project like yours before. Detecting that there is only one token left would add too much overhead to the algorithm, and negate any performance increase you might have gotten with switching to another algorithm.
To implement DetectWhetherAllTokensExist, you'd obviously need to somehow know that a particular word has already been found. If you implement a Remove method in the class, certainly removing the word from the search would accomplish that. Once all words have been removed, you'd know you're done.
So, how do you remove a string from the set? Remember that it's vital to not change the index of an entry in the set, so removing can't involve changing the position of elements in the array. I haven't pondered this enough to give you a solution now, and I just got a call from my son, who needs a ride. Maybe I'll come up with an answer later. Let me know if you get there first.
 
Scot Brennecke
Software Developer
VC++ MVP
GeneralRe: Specialize to "Find if all in string"memberl_d_allan27 Feb '06 - 12:25 
That was a prompt reply. Thanks.
 
I haven't stared at your algorithm closely enough to really understand what it is doing. Is your approach somewhat similar to the Aho-Corasick algorithm?
 
Just a quick, uninformed guess .... would it be possible to replace the matched member of the FSM with some convention like { 0xFF 0xFF .... 0xFF } which would never match again. Then you could keep a count of matches, and when all were found, you could quit processing that line and go to the next.
 


GeneralRe: Specialize to "Find if all in string"memberScot Brennecke27 Feb '06 - 20:50 
Check the Acknowledgement in the article, and you'll see that the algorithm was created by Halibard and Rubin. It was published in the June 2002 issue of C/C++ Users Journal, which you can get by registering at http://www.cuj.com. No explanation of the algorithm I give here could be as thorough as reading it straight from the horses' mouths.
 
Once you read the article (or parse the code more closely), you'll see that the substrings do not get stored as single entities in the FSM. Indeed, by definition, a Finite State Machine is not like a table so much as it is like a map. Strings that have the same initial letters will share some entries in the FSM as they are mapped into it. This is partly why it is more efficient. Rather than comparing strings one at a time with your text, it compares all possible matches at once.
 
However, because of this design, it somewhat confounds the idea of removing entries from the FSM. Since entries don't necessarily belong to just one string, it's more difficult to determine whether or not it can be removed. I haven't yet had the brainstorm to reveal that solution.
 

 
Scot Brennecke
Software Developer
VC++ MVP
GeneralRe: Specialize to "Find if all in string"memberl_d_allan28 Feb '06 - 3:27 
I registered to look at CUG. I didn't realize you could do that without charge ... Thanks.
 
According to the original article, the algorithm was oriented to detecting prohibited words in e-mail, so it quit once one of the banned words was found.
"The implementation in this article returns the offset within the string of the first matching substring and stops processing."
 
which is quite a bit different from what I am trying to do.
 
The article indicates that 0 and -1 have special meaning. I haven't stared at your code enough to be close to grok'ing it, but perhaps a 0 or -1 could be "moved up" the data structure related to a word that was matched, so that detection would be shortcut and proceed to the next letter with the state machine "cleared".
 
If there was no prefix with any other search word (such as looking for "hello" and "goodbye"), then it would be up near the "top". If there was a common prefix (as in my original question using the example "string" and "search"), then the state associated with 's' + 't' would be "shortcut/truncated" when "string" was found.
 
I was wondering if it would be simpler and still advantageous if the code could keep track of which words had been matched, and quit when they were all "hit". It wouldn't necessarily stop detecting previously matched substrings.
 
Thanks for taking the time to refresh your memory with this code, and considering a revision. It does seem like a worthwhile enhancement with broad applicability.
 
BTW, have you ever seen a C or C++ implementation of the Aho-Corasick algorithm for detecting multiple sub-strings in a string? I'm only aware of that algorithm, and my understanding is that it also uses a FSM approach.
 
As they say, "if this stuff was easy, it wouldn't pay so well. (written by a freeware developer to a software contributor )

GeneralRe: Specialize to "Find if all in string"memberScot Brennecke28 Feb '06 - 3:52 
I was not familiar with that algorithm before, although I know the name Aho to be the 'a' in the famous "awk" Unix utility (Weinberg and Kernighan, also of great Unix fame are the 'w' and 'k'). Upon cursory examination, their algorithm appears very similar to the Halibard-Rubin one upon which my classes are based. Interestingly, they reference Kernighan's spam filter as an inspiration to their work.
 
There is a CP article (http://www.codeproject.com/csharp/ahocorasick.asp) that may interest you if you haven't already seen it.
 
At this time, I'm not interested in enhancing my classes (for free, at least). Especially since there are so many other similar pieces of code out there. It would be a simple matter to keep track of which words have been hit, but a performance drain to detect when all have been hit -- at least given what my imagination has conjured so far.

 
Scot Brennecke
Software Developer
VC++ MVP
GeneralRe: Specialize to "Find if all in string"memberjmeaux3 Jul '08 - 9:59 
l_d_allan, did you ever locate a solution? Because I too am looking for the same thing: a class/procedure that returns true if each token exists in the search string. Ideally, I don't care how many times a token is present, so it seems a short-circuited logic works best: when a token is found once, we stop searching for that one. But, I guess that's a bit different from how FSM works.
GeneralRe: Specialize to "Find if all in string"memberrajas11 Aug '09 - 23:00 
I have not spent a lot of time viewing the code, but it appears that the efficient part of this procedure is the 'looking' for the substring. Creating new FSM from words/substrings of interest appears not to be expensive.
 
So here is my suggestion. Use FindFirstIn to locate the first substring. Now create a new 'problem' with the rest of the original string from the point where substring was found. Create a new FSM with one less substring and start a new search. Repeat the process till all the words/substrings have been found. You do not have to make any changes to the class or the search procedure to do this.
GeneralADD a CStringList to CIVStringSetmembersath15422 Aug '05 - 23:36 
please help me to add a CSTringList to an object of CIVstringSet.I tried to use the overloaded member function ADD(CStringList list);but an error is being shown.IT says that "cannot convert from CStringList to const char *);
moreover the Add() function is taking too much time to add all the CString to the list when i pass a CString as a parameter to the ADD function.
 
WHY cant i use the overloaded member function ADD(CSTringLIST list);plz reply as i am facing a deadline.
GeneralRe: ADD a CStringList to CIVStringSetmemberScot Brennecke23 Aug '05 - 7:04 
Are you using a CStringList object, or perhaps a pointer to a CStringList? The function requires a direct reference to the object, not a pointer. You are using the MFC version of the class, and not the STL version, right?
 
As concerns "taking too much time", can you be more specific? Have you identified which step is taking the most time?
 
Scot Brennecke
Software Developer
VC++ MVP
GeneralLimitation...memberTDK90000018 Oct '04 - 6:29 
Excellent bit of code however I used it to search for 1400 substrings in a short bit of text (a url) and it fell over because the DFA it builds gets too large.
 
I think the large space requirement for many substrings in inevitable but if anyone knows otherwise I'd love to know.

GeneralRe: Limitation...sussq84571229 Jul '05 - 5:50 
you can dramatically increase the number of strings this class can handle by changing
the array m_apnFSM to be some STL dynamically-allocating type, and using that template-class's dynamic allocation mechanisms in favor of the SetColDim method -- this should also be faster
if you're adding a large number of strings, as the original increases by a small static amount each time it re-allocs, and the STL libraries try to amortize the cost of memory allocation.
with that change it seems to scale to around 80,000 strings on my machine.
(i used a deque >, as the [] operator is a little faster for vector, but deques are faster if you keep forcing the collection to allocate more memory).
GeneralBugmemberspencergunn29 Aug '03 - 5:07 
Hi
 
There is a bug in the STL version of the code
IT reads
 
int CIVStringSet::Add( LPCTSTR pszWords, LPCTSTR pszDelims )
{
.....
// Now, move beyond the token just extracted and the delimiters
int nTokenLen = strToken.length() ;
if ( nTokenLen < strUnparsed.length() )
{
int nDelimLen = _tcsspn( &strUnparsed[nTokenLen], pszDelims ) ;
nTokenLen += nDelimLen ;
if ( nTokenLen < strUnparsed.length() )
strUnparsed = strUnparsed.substr( nTokenLen ) ;
else
break ;
}
else
break ;
.......
 
which gets stuck in a loop by this line
 
int nDelimLen = _tcsspn( &strUnparsed[nTokenLen], pszDelims ) ;
 
it works with this instead though.
 
.......
// Now, move beyond the token just extracted and the delimiters
int nTokenLen = strToken.length() ;
if ( nTokenLen < strUnparsed.length() )
{
int nDelimLen = strUnparsed.find_first_not_of( pszDelims , nTokenLen ) ;
//nTokenLen += nDelimLen ;
if ( nDelimLen < strUnparsed.length() )
strUnparsed = strUnparsed.substr( nDelimLen ) ;
else
break ;
}
else
break ;

.........
Thanks for a reaaly useful class though.
Spencer
GeneralUnicode VersionmemberOleg Reabciuc21 Jul '03 - 22:15 
Hi there ,
Your example is great , I'm wondering if you have any implmentation for unicode strings . If no can you point me how this can be done ?
WBR Oleg
GeneralRe: Unicode VersionmemberScot Brennecke22 Jul '03 - 5:17 
Oleg,
I have not needed to adapt this to Unicode. This could be adaptable to Unicode strings by substituting wchar for char in the code. However, there is an assumption that only 128 characters are in the alphabet, as we only allocate 128 arrays and mask out indexes above 0x7F. With some small effort, I think you could do it.
 
Scot Brennecke
Software Developer
VC++ MVP
Generalre: Searching a large text filememberBryster18 Jun '03 - 2:50 
I am designing a Postscript editing program, which will search through the Postscript text file looking for a particular string. The problem I am having is the amount of time it is taking to search each line for the string. Is there any way of speeding this up, and not looking at each line, maybe somehow "jump" to the correct place in the file?
 
Any advice would be appreciated.
 
Bryster
QuestionSmall problem..?memberPhil.B10 Apr '03 - 1:37 
As an example string to search through....
"Joshua Lewis soll nicht so machen als ob er das kann, sondern wie alle kinder sollen sie lieb sein"
 
(The german grammer is horrible I know Smile | :) , my english also Frown | :( )
 
And as the substrings/words to be found are "so sollen wie er"
 
The results shown in the dlg app are..
so,so,er,so,er,wie,er,so...
 
soll and sollen were found, but not as complete words.
 
Is there any was the the parser can tell the difference between partial complete words?
 
I´m no genius, and therefore can only guess that once a substring has been found, it is "marked as being found", and the parser continues with the next token in the string to be searched.
 
Is there a way to "force" the parser to continue so that "full" words are also found?
 
bum... and I thought I´d got rid of all the bugs :(
AnswerRe: Small problem..?memberPhil.B10 Apr '03 - 2:59 
Found a (semi) solution for this problem
 
if ( dwEntry > MAXWORD ){
int nTmpIndex = nIdx;
if((!(nTmpIndex++ > m_strSearch.length()) && m_strSearch[nTmpIndex] == ' ')
|| nTmpIndex == m_strSearch.length())
{
int nIndex = HIWORD(dwEntry) - 1 ;
ASSERT(nIndex < size());
rnNext = nIndex ;
m_nCurTextChar = nIdx + 1 ; // set the index for the next time
break ;
}
else{
// did not find a complete word
wCurState = LOWORD(dwEntry);
}
 
}
 
Changed this in the function FindNext( int & rnNext )....
 
This only returns COMPLETE words that have been found, maybe a flag or something to differenciate between partial and full word search and replace...??
 
bum... and I thought I´d got rid of all the bugs :(
GeneralData structure documentationmemberomri18 Nov '02 - 1:31 
I thought of porting the class to Java, as I find it useful. During the process I found that it uses DWORD as a container for (non-defined) structs.
This makes the algorithm less then obvious, not to mention cumbersome to port.
C++ is blessed with all that is required to make the algorithm both readable and efficient, so why not use it?
Apart from that - great class.Hmmm | :|

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Permalink | Advertise | Privacy | Mobile
Web03 | 2.6.130523.1 | Last Updated 25 Apr 2010
Article Copyright 2002 by Scot Brennecke
Everything else Copyright © CodeProject, 1999-2013
Terms of Use
Layout: fixed | fluid