Click here to Skip to main content
Click here to Skip to main content

A Multiple Substring Search Class: CIVStringSet

By , 25 Apr 2010
 

Introduction

There are probably dozens of reasons a programmer might want to search a text string for the presence of any of several substrings simultaneously. The brute force approach taken all too often is to simply loop through a list of tokens, then search the string sequentially, beginning to end, for each one. Needless to say, this is very inefficient, and can even become outrageously long if the list of tokens is long and/or the searched string is long. I have created a class to simplify the task and perform it very efficiently. The class is offered in two versions now: one using MFC CString and CStringArray, and another using STL std::string and std::vector.

Acknowledgement

In the June 2002 issue of C/C++ Users Journal, Moishe Halibard and Moshe Rubin published an article entitled "A Multiple Substring Search Algorithm", in which they greatly detailed their algorithm for an efficient finite state machine (FSM) to assist in searching for multiple substrings in one pass. My class is based on this algorithm.

Update Details

For those of you who have seen the versions of these classes that I published in 2002, here is a summary of the major changes I've made:

  • Support for a string of delimiter characters.
  • Acceleration of the FindNext method to skip over delimiters at the start of a search, and skip to the next delimiter when a word is rejected.
  • Replacement of the FSM implementation to use a std::vector of a new SEntry struct type instead of a dynamic array of DWORDs.
  • More use of the more appropriate size_t type instead of int.
  • The sample project uses VC++ 9.0 (VS2008) instead of the archaic and buggy VC++ 6.0.

Class Interface

The CIVStringSet class is derived from either CStringArray or std::vector, so as to easily inherit the capability of a dynamic array of CStrings or std::strings.

However, many of the base class methods have been hidden by my class to prevent any insertion or removal of strings except at the end of the array. It is critical that the array index for a given word not be changed after it is added to the FSM.

The public interfaces to the classes are as follows:

struct SEntry
{
    SEntry( DWORD dwState = 0UL, DWORD dwIndex = 0UL )
        : m_dwState( dwState ), m_dwIndex( dwIndex ) {}

    DWORD m_dwState ;
    DWORD m_dwIndex ;
} ;

class CIVStringSet : public CStringArray
{
public:
    CIVStringSet( DWORD dwInitialWidth = 64 ) ;  		// Initial width of FSM
    virtual ~CIVStringSet() ;

    bool     Add( LPCTSTR pszWord ) ;                     	// a single word
    bool     Add( const CString & rstrWord ) ;            	// a single word
    size_t   Add( LPCTSTR pszWords, LPCTSTR pszDelims ) ; 	// multiple words, 
					// delimited with chars from pszDelims
    size_t   Add( LPCTSTR pszzWords, size_t nWords ) ;    	// nWords words, 
					// each 0 term'd, with extra 0 at end
    size_t   Add( const std::set<CString< & sstrWords ) ; 	// all the elements of a 
							// set of CStrings
    size_t   Add( const CStringArray & astrWords ) ;      	// all the elements 
							// of a CStringArray
    size_t   Add( const CStringList & lstrWords ) ;       	// all the elements of a 
							// CStringList

    void     SetDelims( LPCTSTR pszDelims ) { m_strDelims = pszDelims ; } // delimiters 
							// to use in word search
    UINT     FindFirstIn( CString strText, size_t & rnFirst ) ; // Begin iteration
    UINT     FindNext( size_t & rnNext ) ;                      // Continue iteration

    typedef std::pair<size_t,UINT>              CWordPosPair ;     // first is index 
				// of word in array, second is position in text
    typedef std::list< std::pair<size_t,UINT< < CWordPosPairList ; // list of pairs to 
						// be returned by FindAllIn
    size_t   FindAllIn( CString strText, CWordPosPairList & rlstrWords ) ; // Iterate 
						// all at once and make list
} ;

#ifdef _UNICODE
typedef std::wstring _tstring ;
#else
typedef std::string _tstring ;
#endif

class CIVStringSet : public std::vector<_tstring>

{
public:
    CIVStringSet( unsigned long dwInitialWidth = 64 ) ;  // Initial width of FSM
    virtual ~CIVStringSet() ;

    bool        Add( LPCTSTR pszWord ) ;                         // a single word
    bool        Add( const _tstring & rstrWord ) ;               // a single word
    size_t      Add( LPCTSTR pszWords, LPCTSTR pszDelims ) ;     // multiple words, 
					// delimited with chars from pszDelims
    size_t      Add( LPCTSTR pszzWords, size_t nWords ) ;        // nWords words, 
					// each 0 term'd, with extra 0 at end
    size_t      Add( const std::set<_tstring> & sstrWords ) ;    // all the elements of 
							 // a set of strings
    size_t      Add( const std::vector<_tstring> & astrWords ) ; // all the elements 
						// of an array of strings
    size_t      Add( const std::list<_tstring> & lstrWords ) ;   // all the elements 
							 // of a list of strings

    void        SetDelims( LPCTSTR pszDelims ) { m_strDelims = pszDelims ; } //delimiters
							// to use in word search
    UINT        FindFirstIn( const _tstring & strText, size_t & rnFirst ) ;  // Begin 
								// iteration
    UINT        FindNext( size_t & rnNext ) ;                                // Continue 
								// iteration

    typedef std::pair<size_t,UINT>              CWordPosPair ;     // first is index 
				// of word in array, second is position in text
    typedef std::list< std::pair<size_t,UINT> > CWordPosPairList ; // list of pairs 
						// to be returned by FindAllIn
    size_t      FindAllIn( _tstring strText, CWordPosPairList & rlstrWords ) ; // Iterate
						// all at once and make list
} ;

The constructor takes a single, optional argument that determines the initial width of the FSM. The width will need to be at least as much as the combined lengths of all the strings in the array. The FSM will be enlarged, as needed as strings are added to the collection. A good guess of the total length when constructing may save a trivial reallocation later.

There are seven overloads of the Add method, allowing you to add one word at a time, or an entire set, list, or array of words at once. See the demo project for an example of using one form of Add, as well as FindFirstIn and FindNext.

As strings are added to the collection, the FSM is updated. Once all strings are added, no additional preparation is required to do a search. There are three methods for searching: FindFirstIn, FindNext, and FindAllIn. The first two are used to iterate through the found substrings one at a time. The third method uses the first two to build a CWordPosPairList for you. Simply pass in the text string to be searched as the first parameter to FindFirstIn or FindAllIn, and the class will find all instances of any strings in the array.

FindFirstIn and FindNext both return the offset within the searched text at which a token was found. The size_t & parameter will be filled upon return with the index in the string array of the string token that was found. FindAllIn returns the number of tokens found, and takes a CWordPosPairList & as its second parameter. A CWordPosPairList & is an STL list of pairs. Each pair returned contains the index and offset of a found token. See the demo program for an example of using this.

Demo Program

By request, I have also provided a simple dialog-based demo program that shows the usage of the Add and Find methods. It uses the form of Add that takes a delimited string, parses it to create the list of seek tokens, then fills a list box with those found.

Source Highlights

For those who are curious, but don't want to bother downloading the source, here are some of the most interesting functions. If the code and comments aren't satisfactory documentation, let me know.

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// InsertWord
//          put the new word into the FSM
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
bool CIVStringSet::InsertWord( LPCTSTR pszWord, unsigned long dwIndex )
{
    bool bRetVal = false ;

    size_t nLen = _tcslen( pszWord ) ;
    if ( m_dwMaxUsedState < MAXDWORD )    	// if we've got over 4 billion states, 
					// something's wrong
    {
        unsigned long dwCurState = 0UL ;
        for ( UINT nChar = 0U ; nChar < nLen ; nChar++ )
        {
            size_t          nIdxChar = 
		min(static_cast<size_t>(pszWord[nChar]), c_nMaxChar) ;
            SEntry        & rEntry   = m_apnFSM[dwCurState][nIdxChar] ;  // the state, 
						// and possibly also an index
            unsigned long   dwState  = rEntry.m_dwState ;
            bool            bNew     = (dwState == 0UL) ;   	// no previous words 
							// have been here

            if ( bNew )
                dwState = ++m_dwMaxUsedState ;  // use a new state number

            // special case - last character of substring
            if ( nChar == ( nLen-1 ) )
            {
                // Put both the state number and the word index in the entry
                rEntry.m_dwState = dwState ;
                rEntry.m_dwIndex = dwIndex ;
                break ;
            }
            else if ( bNew ) // if current entry for this char value and state 
				// is still zero, add to FSM
                rEntry.m_dwState = dwState ;

            dwCurState = dwState ;  // prepare for next iteration
            if ( m_apnFSM.size() <= dwCurState )
                m_apnFSM.resize( ((dwCurState * 2) + 63) & 0xFFC0, 
			vector<sentry />( c_nMaxChar ) ) ;
        }

        bRetVal = true ;
    }

    return bRetVal ;
}

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// FindFirstIn
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
UINT CIVStringSet::FindFirstIn( const _tstring & strText, size_t & rnFirst )
{
    // The real work is done by FindNext, but we must initialize the starting
    //    character index and string buffer first
    m_nCurTextChar = 0U ;
    m_strSearch = strText ;

    return FindNext( rnFirst ) ;
}

//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// FindNext
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
UINT CIVStringSet::FindNext( size_t & rnNext )
{
    unsigned long dwCurState = 0UL ; // start in state 0
    UINT nIdx ;
    UINT nStartChar = MAXDWORD ;
    size_t nLen = m_strSearch.length() ;

    // Start with the character following the last one examined
    for ( nIdx = m_nCurTextChar ; nIdx < nLen ; nIdx++ )
    {
        size_t nNextIdx = m_strSearch.find_first_not_of( m_strDelims, nIdx ) ;
        if ( nNextIdx != nIdx )
        {
            // Skip over delimiters
            nStartChar = MAXDWORD ;
            if ( ( nNextIdx != _tstring::npos )
              && ( nNextIdx < nLen )
               )
            {
                dwCurState = 0UL ;
                nIdx = nNextIdx ;
            }
            else
                break ;
        }

        // if we are in state 0, save the index of the starting character,
        //   in case we must backtrack or for the return value
        if ( dwCurState == 0UL )
            nStartChar = nIdx ;

        // follow up the table states
        size_t nIdxChar = min(static_cast<size_t>(m_strSearch[nIdx]), c_nMaxChar) ;
        const SEntry & rEntry = m_apnFSM[dwCurState][nIdxChar] ;

        // if we reach an entry with a an index, we may have found a full match
        //    if so, return the word we found
        if ( ( rEntry.m_dwIndex != 0UL )
          && ( ( nIdx == nLen-1 )
            || ( m_strDelims.find_first_of( m_strSearch[nIdx+1] ) != _tstring::npos )
             )
           )
        {
            size_t nIndex = rEntry.m_dwIndex - 1 ;
            ASSERT(nIndex < size());
            rnNext = nIndex ;
            m_nCurTextChar = nIdx + 1 ;  // set the index for the next time
            break ;
        }
        else
            dwCurState = rEntry.m_dwState ;

        if ( dwCurState == 0 )
        {
            nStartChar = MAXDWORD ;  // in case we're about to terminate the loop
            nIdx = m_strSearch.find_first_of( m_strDelims, nIdx ) ; // advance to 
							// the next delimiter
        }
    }

    // return the index of the start of the word, or -1 if no word found
    return (( nIdx < nLen ) ? nStartChar : MAXDWORD) ;
}

Updates

  • 1 July 2002 - Added STL version
  • 24 April 2010 - Revised MFC and STL versions

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Scot Brennecke
Software Developer (Senior) Microsoft
United States United States
Member
Scot is an Escalation Engineer for the Microsoft Developer Support Languages team. He helps software developers who are Microsoft customers find bugs in their own, or Microsoft's, code.
 
Scot spends most of his time writing, reading, or thinking about C++ software, thereby classifying him as a geek.

Sign Up to vote   Poor Excellent
Add a reason or comment to your vote: x
Votes of 3 or less require a comment

Comments and Discussions

 
You must Sign In to use this message board.
Search this forum  
    Spacing  Noise  Layout  Per page   
GeneralRe: Specialize to "Find if all in string"memberl_d_allan28 Feb '06 - 3:27 
I registered to look at CUG. I didn't realize you could do that without charge ... Thanks.
 
According to the original article, the algorithm was oriented to detecting prohibited words in e-mail, so it quit once one of the banned words was found.
"The implementation in this article returns the offset within the string of the first matching substring and stops processing."
 
which is quite a bit different from what I am trying to do.
 
The article indicates that 0 and -1 have special meaning. I haven't stared at your code enough to be close to grok'ing it, but perhaps a 0 or -1 could be "moved up" the data structure related to a word that was matched, so that detection would be shortcut and proceed to the next letter with the state machine "cleared".
 
If there was no prefix with any other search word (such as looking for "hello" and "goodbye"), then it would be up near the "top". If there was a common prefix (as in my original question using the example "string" and "search"), then the state associated with 's' + 't' would be "shortcut/truncated" when "string" was found.
 
I was wondering if it would be simpler and still advantageous if the code could keep track of which words had been matched, and quit when they were all "hit". It wouldn't necessarily stop detecting previously matched substrings.
 
Thanks for taking the time to refresh your memory with this code, and considering a revision. It does seem like a worthwhile enhancement with broad applicability.
 
BTW, have you ever seen a C or C++ implementation of the Aho-Corasick algorithm for detecting multiple sub-strings in a string? I'm only aware of that algorithm, and my understanding is that it also uses a FSM approach.
 
As they say, "if this stuff was easy, it wouldn't pay so well. (written by a freeware developer to a software contributor )

GeneralRe: Specialize to "Find if all in string"memberScot Brennecke28 Feb '06 - 3:52 
I was not familiar with that algorithm before, although I know the name Aho to be the 'a' in the famous "awk" Unix utility (Weinberg and Kernighan, also of great Unix fame are the 'w' and 'k'). Upon cursory examination, their algorithm appears very similar to the Halibard-Rubin one upon which my classes are based. Interestingly, they reference Kernighan's spam filter as an inspiration to their work.
 
There is a CP article (http://www.codeproject.com/csharp/ahocorasick.asp) that may interest you if you haven't already seen it.
 
At this time, I'm not interested in enhancing my classes (for free, at least). Especially since there are so many other similar pieces of code out there. It would be a simple matter to keep track of which words have been hit, but a performance drain to detect when all have been hit -- at least given what my imagination has conjured so far.

 
Scot Brennecke
Software Developer
VC++ MVP
GeneralRe: Specialize to "Find if all in string"memberjmeaux3 Jul '08 - 9:59 
l_d_allan, did you ever locate a solution? Because I too am looking for the same thing: a class/procedure that returns true if each token exists in the search string. Ideally, I don't care how many times a token is present, so it seems a short-circuited logic works best: when a token is found once, we stop searching for that one. But, I guess that's a bit different from how FSM works.
GeneralRe: Specialize to "Find if all in string"memberrajas11 Aug '09 - 23:00 
I have not spent a lot of time viewing the code, but it appears that the efficient part of this procedure is the 'looking' for the substring. Creating new FSM from words/substrings of interest appears not to be expensive.
 
So here is my suggestion. Use FindFirstIn to locate the first substring. Now create a new 'problem' with the rest of the original string from the point where substring was found. Create a new FSM with one less substring and start a new search. Repeat the process till all the words/substrings have been found. You do not have to make any changes to the class or the search procedure to do this.
GeneralADD a CStringList to CIVStringSetmembersath15422 Aug '05 - 23:36 
please help me to add a CSTringList to an object of CIVstringSet.I tried to use the overloaded member function ADD(CStringList list);but an error is being shown.IT says that "cannot convert from CStringList to const char *);
moreover the Add() function is taking too much time to add all the CString to the list when i pass a CString as a parameter to the ADD function.
 
WHY cant i use the overloaded member function ADD(CSTringLIST list);plz reply as i am facing a deadline.
GeneralRe: ADD a CStringList to CIVStringSetmemberScot Brennecke23 Aug '05 - 7:04 
Are you using a CStringList object, or perhaps a pointer to a CStringList? The function requires a direct reference to the object, not a pointer. You are using the MFC version of the class, and not the STL version, right?
 
As concerns "taking too much time", can you be more specific? Have you identified which step is taking the most time?
 
Scot Brennecke
Software Developer
VC++ MVP
GeneralLimitation...memberTDK90000018 Oct '04 - 6:29 
Excellent bit of code however I used it to search for 1400 substrings in a short bit of text (a url) and it fell over because the DFA it builds gets too large.
 
I think the large space requirement for many substrings in inevitable but if anyone knows otherwise I'd love to know.

GeneralRe: Limitation...sussq84571229 Jul '05 - 5:50 
you can dramatically increase the number of strings this class can handle by changing
the array m_apnFSM to be some STL dynamically-allocating type, and using that template-class's dynamic allocation mechanisms in favor of the SetColDim method -- this should also be faster
if you're adding a large number of strings, as the original increases by a small static amount each time it re-allocs, and the STL libraries try to amortize the cost of memory allocation.
with that change it seems to scale to around 80,000 strings on my machine.
(i used a deque >, as the [] operator is a little faster for vector, but deques are faster if you keep forcing the collection to allocate more memory).
GeneralBugmemberspencergunn29 Aug '03 - 5:07 
Hi
 
There is a bug in the STL version of the code
IT reads
 
int CIVStringSet::Add( LPCTSTR pszWords, LPCTSTR pszDelims )
{
.....
// Now, move beyond the token just extracted and the delimiters
int nTokenLen = strToken.length() ;
if ( nTokenLen < strUnparsed.length() )
{
int nDelimLen = _tcsspn( &strUnparsed[nTokenLen], pszDelims ) ;
nTokenLen += nDelimLen ;
if ( nTokenLen < strUnparsed.length() )
strUnparsed = strUnparsed.substr( nTokenLen ) ;
else
break ;
}
else
break ;
.......
 
which gets stuck in a loop by this line
 
int nDelimLen = _tcsspn( &strUnparsed[nTokenLen], pszDelims ) ;
 
it works with this instead though.
 
.......
// Now, move beyond the token just extracted and the delimiters
int nTokenLen = strToken.length() ;
if ( nTokenLen < strUnparsed.length() )
{
int nDelimLen = strUnparsed.find_first_not_of( pszDelims , nTokenLen ) ;
//nTokenLen += nDelimLen ;
if ( nDelimLen < strUnparsed.length() )
strUnparsed = strUnparsed.substr( nDelimLen ) ;
else
break ;
}
else
break ;

.........
Thanks for a reaaly useful class though.
Spencer
GeneralUnicode VersionmemberOleg Reabciuc21 Jul '03 - 22:15 
Hi there ,
Your example is great , I'm wondering if you have any implmentation for unicode strings . If no can you point me how this can be done ?
WBR Oleg
GeneralRe: Unicode VersionmemberScot Brennecke22 Jul '03 - 5:17 
Oleg,
I have not needed to adapt this to Unicode. This could be adaptable to Unicode strings by substituting wchar for char in the code. However, there is an assumption that only 128 characters are in the alphabet, as we only allocate 128 arrays and mask out indexes above 0x7F. With some small effort, I think you could do it.
 
Scot Brennecke
Software Developer
VC++ MVP
Generalre: Searching a large text filememberBryster18 Jun '03 - 2:50 
I am designing a Postscript editing program, which will search through the Postscript text file looking for a particular string. The problem I am having is the amount of time it is taking to search each line for the string. Is there any way of speeding this up, and not looking at each line, maybe somehow "jump" to the correct place in the file?
 
Any advice would be appreciated.
 
Bryster
QuestionSmall problem..?memberPhil.B10 Apr '03 - 1:37 
As an example string to search through....
"Joshua Lewis soll nicht so machen als ob er das kann, sondern wie alle kinder sollen sie lieb sein"
 
(The german grammer is horrible I know Smile | :) , my english also Frown | :( )
 
And as the substrings/words to be found are "so sollen wie er"
 
The results shown in the dlg app are..
so,so,er,so,er,wie,er,so...
 
soll and sollen were found, but not as complete words.
 
Is there any was the the parser can tell the difference between partial complete words?
 
I´m no genius, and therefore can only guess that once a substring has been found, it is "marked as being found", and the parser continues with the next token in the string to be searched.
 
Is there a way to "force" the parser to continue so that "full" words are also found?
 
bum... and I thought I´d got rid of all the bugs :(
AnswerRe: Small problem..?memberPhil.B10 Apr '03 - 2:59 
Found a (semi) solution for this problem
 
if ( dwEntry > MAXWORD ){
int nTmpIndex = nIdx;
if((!(nTmpIndex++ > m_strSearch.length()) && m_strSearch[nTmpIndex] == ' ')
|| nTmpIndex == m_strSearch.length())
{
int nIndex = HIWORD(dwEntry) - 1 ;
ASSERT(nIndex < size());
rnNext = nIndex ;
m_nCurTextChar = nIdx + 1 ; // set the index for the next time
break ;
}
else{
// did not find a complete word
wCurState = LOWORD(dwEntry);
}
 
}
 
Changed this in the function FindNext( int & rnNext )....
 
This only returns COMPLETE words that have been found, maybe a flag or something to differenciate between partial and full word search and replace...??
 
bum... and I thought I´d got rid of all the bugs :(
GeneralData structure documentationmemberomri18 Nov '02 - 1:31 
I thought of porting the class to Java, as I find it useful. During the process I found that it uses DWORD as a container for (non-defined) structs.
This makes the algorithm less then obvious, not to mention cumbersome to port.
C++ is blessed with all that is required to make the algorithm both readable and efficient, so why not use it?
Apart from that - great class.Hmmm | :|
GeneralRe: Data structure documentationmemberScot Brennecke18 Nov '02 - 16:30 
It took me several minutes of head-scratching before I finally made sense of what you meant by "uses DWORD as a container for (non-defined) structs.". At least, I think I now know to what you refer. . . I assume you mean that you don't approve of using the arrays of DWORDS m_apnFSM, rather than having arrays of some struct with two smaller integers. The fact is that after reading the article upon which my class is based, I had no trouble understanding the algorithm, so it never occurred to me to divide the DWORD into two variables. The original did not do this, and I didn't feel compelled to improve on it in that way.
 
As for porting this class to Java. . . well, I can't say that I see the wisdom. The whole idea was to make a class that was fast. And, we all know that Java is not known for its efficiency.
Nevertheless, if you feel inclined to make these changes and port to Java, I won't protest.

 
Scot Brennecke
Software Developer
VC++ MVP
QuestionIdentify WHERE String Was Found ?memberGarth J Lancaster18 Jun '02 - 14:06 
Hi - love the article !! more often however, I need to know WHERE ie by character position a particular string was found !!
 
Is there a way to return this from the FSM ?? I dont have the base CUJ article, maybe it had notes there - I will try to find a copy (June 2002 may not even be out 'Down-Under' here in Aus yet/hopefully)
 
In General, Im using Multiple Regular Expressions To (For Example) Split a report
 
so (re's in brackets)
 
(^1) Matches a page-break
(^0UserId: ABCDEFG) Indicates The Start Of Data
(**** End Of Report ****) Is The End Of The Data, as long as the start has been found (ie there are multiple reports in the one file)
 
so I have to apply each RegEx to each line to keep track of where in the report I am (thats a lot of work). The multiple string search you present is a possible improvement, but not without knowing that the '1' for the Page break was in Column 0.
 
Its probably a waste to use such an efficient search tool on this sort of thing, given thats I have to track through the file line by line - brute force may be easiest option....
 
Thanks anyway
 
Cheers, Garth
AnswerRe: Identify WHERE String Was Found ?memberScot Brennecke18 Jun '02 - 17:49 
Garth,
I must admit that I feel kinda silly for not providing that information! Looking at the FindNext function, the data you want is readily available (in nStartChar). The trick then becomes in deciding the best way to return the position. For the FindFirstIn and FindNext, we could simply change the return type from bool to int, and return the position. We would then return -1, rather than false, to indicate no more tokens are found. As for FindAllIn, however, we build a CStringList with the found tokens. We could have another array as a third parameter which would be filled with corresponding positions. Or perhaps, embed the position in the strings, using a delimiter. Or, we could just not return the postions at all.
I encourage anyone with an opinion on the matter to "speak up".
The real benefit of this class is seen when searching a large text and/or with a large number of seek words. For searching small strings for a small number of words, it may not be worth the overhead. I haven't taken the time to analyze the performance metrics to determine what the threshhold is, but I suspect short lines such as yours may not warrant this. Even so, you may find this class easy enough to use that it's worth the overhead.
 
Scot Brennecke
Software Developer
VC++ MVP
GeneralRe: Identify WHERE String Was Found ?memberGarth J Lancaster18 Jun '02 - 18:40 
Scot - thanks for your reply - dont worry about not providing the info - I thought it might have been an obtuse question, but was prepared to ask anyway (and felt more of a goose for not being able to find it myself)
 
I agree with your synopsis for fixing the returned positions for FindFirstIn and FindNext - and even have a suggestion for FindAllIn .. instead of CStringList, how about an STL vector - 2 thoughts on its contents ..
 
1) the matched string and its position
 
ie (("quick", 4), ("fox", 14)) If I specified "quick" & "fox", but, since these strings are trivial and the real world wont be, this may be too much info to return - a better idea would appear to be :-
 
2) An index into the string array of strings to match (CIVStringSet) and the position the string was found at, ie
 
search for "men", "country", "rabbit", specified in that order, returns :-
 
((0, 20),(1, 50),(1, 80),(1, 90), (2, -1))
 
where 0 is the element in the list "men" and 1 is the element "country" etc
 
(the positions returned I used are not real - I was too lazy to count along the string) .. note how a) '1' for country is returned three times, since there are three occurences, and b) since "rabbit", ie element 2 wasnt found, it's vector element returns -1 ??
 
well, those are my thoughts .. Im not fantastic on STL, just learning, so it might take me a while to implement any of this
 
Thanks for your help anyway, It's still a great article !!
 
cheers & thanks
Garth
GeneralRe: Identify WHERE String Was Found ?memberScot Brennecke18 Jun '02 - 19:45 
An excellent idea!
I like your idea of simply returning the index into the array of the found token. Since these strings are already known (the user added them), there's really not much sense in my plan of creating a new string to return. The index should be sufficient.
If you agree, then, the solution would be to create a simple struct, or even an STL pair, that contains the index into the array, and the position at which found. The FindAllIn function would then return an array of such pairs or structs.

 
Scot Brennecke
Software Developer
VC++ MVP
GeneralRe: Identify WHERE String Was Found ?adminChris Maunder18 Jun '02 - 18:59 
I'd suggest returning a vector of 'match' objects (a simple struct) that would be indexed by the strings you are searching for an contain the index into the string being searched:
// pseudo code

CIVStringSet SearchSet;
SearchSet.Add("FindMe");
 
Matches matches = SearchSet.FindAll(szStringToSearch);
 
for (int i = 0 to matches.Length)
   Print "Found " + match.Text + " at position " + match.pos
 
// alternate method
Print "Found string at " + matches["FindMe"].pos;
 

 
cheers,
Chris Maunder
 
he was a VB programmer, but he got better - Christian Graus
GeneralRe: Identify WHERE String Was Found ?memberScot Brennecke18 Jun '02 - 19:43 
An excellent idea (of course)!
However, I also like Garth's idea of simply returning the index into the array of the found token. Since these strings are already known (the user added them), there's really not much sense in my plan of creating a new string to return. The index should be sufficient.
If you agree, then, the solution would be to create a simple struct, or even an STL pair, that contains the index into the array, and the position at which found. The FindAllIn function would then return an array of such pairs or structs.

 
Scot Brennecke
Software Developer
VC++ MVP
GeneralRe: Identify WHERE String Was Found ?memberGarth J Lancaster18 Jun '02 - 20:07 
Sounds ok Scot .. maybe STL vector of simple elements is overkill (the STL part anyway) - the 'kiss' principle should reign .. A CArray object might be more preferable, since STL isnt used anywhere else, it might add more overhead ...
 
Just went looking for the June 2002 CUJ, couldnt find it in 4 newsagents around town Frown | :-( that just gives me the excuse to go to Borders tonight Smile | :)
 

 
Smile | :) Garth
GeneralRe: Identify WHERE String Was Found ?memberScot Brennecke19 Jun '02 - 19:19 
My experience with STL is that it is lightweight, and more efficient than MFC classes. But it is also harder to read and use for the uninitiated. I'm certain that my CIVStringSet class would be a lot faster if I used std::string and vector instead of CString and CStringArray. But I made the choice for ease of use to balance with the efficiency of the algorithm.
 
Unless you or someone else has another more compelling reason not to, I'm leaning toward returning a std::pair with the index of the string and its position from the FindFirstIn and FindNext methods, and a vector of the same from FindAllIn.
 
I may even make another class that uses STL instead of MFC, as an alternative for those who don't want the MFC baggage.

 
Scot Brennecke
Software Developer
VC++ MVP
GeneralRe: Identify WHERE String Was Found ?memberGarth J Lancaster19 Jun '02 - 20:10 
Scot Brennecke wrote:
you or someone else has another more compelling reason not to
 
me ? Blush | :O I'm a rank beginner - hardly experienced enough to advise someone of your calibre to suck eggs.. I merely thought - perhaps a tad naive, that if one was writing a 'component' for other people, you'd write it all STL or all MFC, but not mix them .. as I said, that may be a tad naive, thats not meant to look or sound like a criticism ...
 
just because the STL version (or STL in general) may be harder to use, I'm all in favour of learning it where the gain outweighs the pain - so if someone presents a 'free' component, for example off CP, and it uses STL, and I dont know that much about STL, I'd do my damndest to learn enough to use that component .. (hopefully the example supplied would show most of it)
 

I think (generalisation) that some people complain too much about stuff from CP, when they should remember thats it's free, save for a bit of work learning new tricks .. me, Im just grateful of all the help and ideas one gets !!!

 
Wink | ;) Garth

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Permalink | Advertise | Privacy | Mobile
Web03 | 2.6.130523.1 | Last Updated 25 Apr 2010
Article Copyright 2002 by Scot Brennecke
Everything else Copyright © CodeProject, 1999-2013
Terms of Use
Layout: fixed | fluid