
Description

Below is a function I created and have found extremely useful for splitting strings based on a particular delimiter. The implementation only requires STL which makes it easy to port to any OS that supports STL. The function is fairly lightweight although I haven't done extensive performance testing.

The delimiter can be any number of characters, passed as a string. The parts of the input between delimiters are appended to a vector of strings. The class StringUtils contains one static function, SplitString. The int it returns is the number of delimiters found in the input string.

I used this utility mainly for parsing strings that were being passed across platform boundaries. Whether you are using raw sockets or middleware such as TIBCO, it is straightforward to pass string data. I found it more efficient to pass delimited string data than to make repeated calls or send repeated messages. Another place I used this was in passing BSTRs back and forth between a Visual Basic client and an ATL COM DLL. It proved to be easier than passing a SAFEARRAY as an [in] or [out] parameter. This was also beneficial when I did not want the added overhead of MFC and hence could not use CString.

Implementation

The SplitString function uses the STL string functions find and substr to iterate through the input string. The hardest part was figuring out how to get the substring of the input string based on the offsets of the delimiter, not forgetting to take into account the length of the delimiter. Another hurdle was making sure not to call substr with an offset greater than the length of the input string.
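
The offset arithmetic described above can be sketched in a single pass. This is a minimal illustration, not the article's implementation: splitSketch and splitSketchDemo are hypothetical names, the sketch uses std::string::size_type instead of int, and, unlike SplitString, it returns the whole input as one element when no delimiter is found.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, not part of StringUtils: splits `input` on every
// occurrence of the whole `delimiter` string and returns the pieces.
std::vector<std::string> splitSketch(const std::string& input,
                                     const std::string& delimiter)
{
    std::vector<std::string> parts;
    if (delimiter.empty())
    {
        // find("") matches at every position, so treat this as "no split".
        parts.push_back(input);
        return parts;
    }

    std::string::size_type start = 0;  // first character of the current piece
    std::string::size_type hit = input.find(delimiter, start);
    while (hit != std::string::npos)
    {
        // The piece runs from `start` up to, but not including, the delimiter.
        parts.push_back(input.substr(start, hit - start));
        // Skip the full delimiter length, not just one character.
        start = hit + delimiter.size();
        hit = input.find(delimiter, start);
    }
    // `start` is at most input.size() here, so substr cannot throw.
    parts.push_back(input.substr(start));
    return parts;
}

// Small self-check with a two-character delimiter.
void splitSketchDemo()
{
    const std::vector<std::string> parts = splitSketch("a::b::::c", "::");
    assert(parts.size() == 4);
    assert(parts[0] == "a" && parts[1] == "b");
    assert(parts[2].empty() && parts[3] == "c");
}
```

Because each piece is stored as soon as the next delimiter is located, this variant needs no separate vector of positions.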

Header

#ifndef __STRINGUTILS_H_
#define __STRINGUTILS_H_

#include <string>
#include <vector>

using namespace std;

class StringUtils
{

public:

    static int SplitString(const string& input, 
        const string& delimiter, vector<string>& results, 
        bool includeEmpties = true);

};

#endif

Source

int StringUtils::SplitString(const string& input, 
       const string& delimiter, vector<string>& results, 
       bool includeEmpties)
{
    int iPos = 0;
    int newPos = -1;
    int sizeS2 = (int)delimiter.size();
    int isize = (int)input.size();

    if( 
        ( isize == 0 )
        ||
        ( sizeS2 == 0 )
    )
    {
        return 0;
    }

    vector<int> positions;

    newPos = input.find (delimiter, 0);

    if( newPos < 0 )
    { 
        return 0; 
    }

    int numFound = 0;

    while( newPos >= iPos )
    {
        numFound++;
        positions.push_back(newPos);
        iPos = newPos;
        newPos = input.find (delimiter, iPos+sizeS2);
    }

    if( numFound == 0 )
    {
        return 0;
    }

    for( int i=0; i <= (int)positions.size(); ++i )
    {
        string s("");
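        if( i == 0 ) 
        { 
            // First piece: everything before the first delimiter.
            s = input.substr( i, positions[i] ); 
        }
        else
        {
            // positions[i-1] is only valid for i > 0, so the offset
            // has to be computed inside this else branch.
            int offset = positions[i-1] + sizeS2;
            if( offset < isize )
            {
                if( i == (int)positions.size() )
                {
                    // Last piece: everything after the final delimiter.
                    s = input.substr(offset);
                }
                else
                {
                    s = input.substr( offset, 
                          positions[i] - positions[i-1] - sizeS2 );
                }
            }
        }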
        if( includeEmpties || ( s.size() > 0 ) )
        {
            results.push_back(s);
        }
    }
    return numFound;
}

Output using demo project

main.exe "|mary|had|a||little|lamb||" "|"

int SplitString(
        const string& input,
        const string& delimiter,
        vector<string>& results,
        bool includeEmpties = true
)

-------------------------------------------------------
input           = |mary|had|a||little|lamb||
delimiter       = |
return value    = 8 // Number of delimiters found

results.size()  = 9
results[0]      = ''
results[1]      = 'mary'
results[2]      = 'had'
results[3]      = 'a'
results[4]      = ''
results[5]      = 'little'
results[6]      = 'lamb'
results[7]      = ''
results[8]      = ''

int SplitString(
        const string& input,
        const string& delimiter,
        vector<string>& results,
        bool includeEmpties = false
)

-------------------------------------------------------
input           = |mary|had|a||little|lamb||
delimiter       = |
return value    = 8 // Number of delimiters found

results.size()  = 5
results[0]      = 'mary'
results[1]      = 'had'
results[2]      = 'a'
results[3]      = 'little'
results[4]      = 'lamb'
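
The second run above contains exactly the first run's nine elements with the empty ones discarded. For a caller who already holds a vector that includes empties, the same filtering can be sketched with the standard erase/remove idiom. The names isEmpty, dropEmpties and dropEmptiesDemo are hypothetical and not part of the demo project.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Predicate for std::remove_if (hypothetical helper).
static bool isEmpty(const std::string& s)
{
    return s.empty();
}

// Removes every empty string from `results` in place, keeping the
// remaining elements in their original order.
void dropEmpties(std::vector<std::string>& results)
{
    results.erase(std::remove_if(results.begin(), results.end(), isEmpty),
                  results.end());
}

// Self-check using the nine elements listed in the first run.
void dropEmptiesDemo()
{
    const char* all[] = { "", "mary", "had", "a", "", "little", "lamb", "", "" };
    std::vector<std::string> results(all, all + 9);
    dropEmpties(results);
    assert(results.size() == 5);
    assert(results[0] == "mary" && results[4] == "lamb");
}
```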

MFC version

For those of you who absolutely cannot use STL and are committed to MFC, I made a few minor changes to the above implementation. It uses CString instead of std::string and a CStringArray instead of a std::vector. Note that this version has no includeEmpties parameter and always drops empty strings:

//------------------------
// SplitString in MFC
//------------------------

int StringUtils::SplitString(const CString& input, 
  const CString& delimiter, CStringArray& results)
{
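  int iPos = 0;
  int newPos = -1;
  int sizeS2 = delimiter.GetLength();
  int isize = input.GetLength();

  // Nothing to split, and an empty delimiter would never advance.
  if( isize == 0 || sizeS2 == 0 ) { return 0; }

  // CArray requires #include <afxtempl.h>
  CArray<INT, int> positions;

  newPos = input.Find (delimiter, 0);

  if( newPos < 0 ) { return 0; }

  int numFound = 0;

  // >= so that a delimiter at position 0 is recorded too.
  while( newPos >= iPos )
  {
    numFound++;
    positions.Add(newPos);
    iPos = newPos;
    // Resume right after this delimiter so adjacent delimiters are found.
    newPos = input.Find (delimiter, iPos+sizeS2);
  }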

  for( int i=0; i <= positions.GetSize(); i++ )
  {
    CString s;
    if( i == 0 )
      s = input.Mid( i, positions[i] );
    else
    {
      int offset = positions[i-1] + sizeS2;
      if( offset < isize )
      {
        if( i == positions.GetSize() )
          s = input.Mid(offset);
        else if( i > 0 )
          s = input.Mid( positions[i-1] + sizeS2, 
                 positions[i] - positions[i-1] - sizeS2 );
      }
    }
    if( s.GetLength() > 0 )
      results.Add(s);
  }
  return numFound;
}

String neutral version

I added this version in case you might need to use it with any type of string. The only requirement is that the string class must have a constructor that takes a char*. The code only depends on the STL vector. I also added the option to not include empty strings in the results, which will occur if delimiters are adjacent. Note that it returns results.size(), not the number of delimiters found:

#include <cstring>  // strstr, strlen, memcpy
#include <vector>

//-----------------------------------------------------------
// StrT:    Type of string to be constructed
//          Must have char* ctor.
// str:     String to be parsed.
// delim:   Pointer to delimiter.
// results: Vector of StrT for strings between delimiter.
// empties: Include empty strings in the results. 
//-----------------------------------------------------------

template< typename StrT >
int split(const char* str, const char* delim, 
     vector<StrT>& results, bool empties = true)
{
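  // An empty delimiter makes strstr match without ever advancing,
  // so the loop below would not terminate; bail out early.
  if( str == NULL || delim == NULL || *delim == '\0' )
  {
    return 0;
  }
  char* pstr = const_cast<char*>(str);  // never written through
  char* r = strstr(pstr, delim);
  size_t dlen = strlen(delim);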
  while( r != NULL )
  {
    char* cp = new char[(r-pstr)+1];
    memcpy(cp, pstr, (r-pstr));
    cp[(r-pstr)] = '\0';
    if( strlen(cp) > 0 || empties )
    {
      StrT s(cp);
      results.push_back(s);
    }
    delete[] cp;
    pstr = r + dlen;
    r = strstr(pstr, delim);
  }
  if( strlen(pstr) > 0 || empties )
  {
    results.push_back(StrT(pstr));
  }
  return results.size();
}

String neutral usage

// using CString
//------------------------------------------

int i = 0;
vector<CString> results;
split("a-b-c--d-e-", "-", results);
for( i=0; i < results.size(); ++i )
{
  cout << results[i].GetBuffer(0) << endl;
  results[i].ReleaseBuffer();
}

// using std::string
//------------------------------------------

vector<string> stdResults;
split("a-b-c--d-e-", "-", stdResults);
for( i=0; i < stdResults.size(); ++i )
{
  cout << stdResults[i].c_str() << endl;
}

// using std::string without empties
//------------------------------------------

stdResults.clear();
split("a-b-c--d-e-", "-", stdResults, false);
for( i=0; i < stdResults.size(); ++i )
{
  cout << stdResults[i].c_str() << endl;
}

Conclusion

Hope you find this as useful as I did. Feel free to let me know of any bugs or enhancements. Enjoy ;)

Bug: When no delimiter is found nothing is returned
Florian Rittmeier
9:47 10 May '07
Hello,

When a string like "Jane" is given and the delimiter " " cannot be found, the results vector is empty. I would expect the results vector to contain "Jane" in this case.

Greets Florian
Re: Bug: When no delimiter is found nothing is returned
ekey
23:13 14 May '07
Yeah, maybe the bug is here:
    if( i == 0 )
    {
        s = input.substr( i, positions[i] );
    }
    int offset = positions[i-1] + sizeS2;
If there is only one element in positions, positions[i-1] will throw an exception.

Best wishes,
Yun
Output iterator
Joergen Sigvardsson
6:31 28 Oct '06  
May I suggest that you replace the results vector with an output iterator instead? That way, you don't have to write a new version of the function if you want the results in a list, set, output stream, or whatever.

--
Not a substitute for human interaction

Handling i=0 properly?
miker2069
22:20 16 Oct '06
Hi, I don't think your STL code is handling i=0 properly. Look what happens at the following statement.

int offset = positions[i-1] + sizeS2;

Obviously you'll get a vector exception thrown. I assume you wanted an 'else' block right after your if(i==0) block. I put it in and it seems to work. Just posting so that someone else who simply copies and pastes won't go crazy wondering why it doesn't work :)



Mike

Re: Handling i=0 properly?
domini_harling
12:11 26 Apr '07

That's exactly what I did. I copied and pasted and it failed immediately. I suppose it's a good starting point for a string splitter, but the bug needs to be fixed. :)

Dom
Re: Handling i=0 properly?
mikecline
9:09 27 Mar '09  
I ditto this again.

What is up with putting code online that does not run?
Truly neutral version
elbertlev
10:39 8 Jun '06
In essence the neutral version allows two types of strings, MFC and STL (I know that templates allow more, but who uses other strings?). But the container used is vector<>. I believe that for MFC, CStringArray is a better choice.

Lev Elbert
Boost alternatives
MattyT
15:24 8 Feb '06
Nice article. :)

For completeness it's worthwhile noting that there are solutions to this problem in the Boost library.

The string_algo library has a section on splitting strings using a couple of templated methods. In particular, split() will do a similar job to your function.

Tokenizer may also be useful, allowing you to iterate over a split string based on a token.

I've used these functions in production code and can attest that they work very well... :)
No reason for position array
Martin Richter
0:46 7 Feb '06
Why are you collecting the positions first? There is no reason to do that. It's just wasting time. When you have a position and you have the next position, you can store the result.

Updated version
Paul J. Weiss
14:49 2 Feb '06
I updated the code to handle cases such as ";mary;had;;a;little;lamb;;;" where there could be empty strings in between the delimiters. I also added another argument to the function which is a boolean to include the empty strings as part of the resulting vector. If includeEmpties is false then only strings of size greater than zero will be included in the results. I updated the source and the demo project.

Enjoy

Paul J. Weiss
Hmmm, maybe I'm wrong
Andreas Tirok
11:54 24 Jan '06
Hi, I tried to use this ...

With the string "foo;boo" I always got numcount equal to 1 from SplitString ...
I modified SplitString and with this source it works quite well ...

call:

std::vector<std::string> Colums;
int nCols = SplitString(LinkList, ";", Colums);

Please ignore std:: :(

int SplitString(const std::string& input, const std::string& delimiter, std::vector<std::string>& results)
{
    int iPos = 0;
    int newPos = -1;
    int sizeS2 = delimiter.size();
    int isize = input.size();

    std::vector<int> positions;

    newPos = input.find (delimiter, 0);

    if( newPos < 0 )
    {
        return 0;
    }

    int numFound = 0;

    while( newPos > iPos )
    {
        //numFound++;
        positions.push_back(newPos);
        iPos = newPos;
        newPos = input.find (delimiter, iPos + sizeS2 + 1);
    }

    for( int i=0; i <= positions.size(); i++ )
    {
        std::string s;
        if( i == 0 )
        {
            s = input.substr( i, positions[i] );
        }
        int offset = positions[i-1] + sizeS2;
        if( offset < isize )
        {
            if( i == positions.size() )
            {
                s = input.substr(offset);
            }
            else if( i > 0 )
            {
                s = input.substr( positions[i-1] + sizeS2, positions[i] - positions[i-1] - sizeS2 );
            }
        }
        if( s.size() > 0 )
        {
            results.push_back(s);
            numFound++;
        }
    }
    return numFound;
}

Regards

Andy
Small modifications for patterns like ;;
khrl
4:45 20 Dec '05
In CSV files it often happens that you have to parse strings like
xx;yy;;zz
where ;; means that this entry is empty.
The class cannot detect this kind of pattern;
it returns
xx
yy
;zz

To change this behaviour, the following changes have to be applied:

while( newPos > iPos )
{
    numFound++;
    positions.push_back(newPos);
    iPos = newPos;
    // newPos = input.find (delimiter, iPos+sizeS2+1);
    newPos = input.find(delimiter,iPos + 1);
}

for( int i=0; i <= positions.size(); i++ )
{
    string s;
    if( i == 0 ) { s = input.substr( i, positions[i] ); }
    int offset = positions[i-1] + sizeS2;
    if( offset < isize )
    {
        if( i == positions.size() )
        {
            s = input.substr(offset);
        }
        else if( i > 0 )
        {
            s = input.substr( positions[i-1] + sizeS2,
                  positions[i] - positions[i-1] - sizeS2 );
        }
    }
    //if( s.size() > 0 )
    //{
    results.push_back(s);
    //}
}

regards
karl-heinz
The final MFC version
dis1411
14:31 24 Jul '05
My solution handles the issues that came up (delimiter at front or back, delimiter repeated in the input string). Beer26's version almost got it, except that he was copying the input string over and over (albeit slightly shorter each time). Mine doesn't copy the [entire] string at all. It can split a list of words 2MB in size by "\r\n" in 0.3 seconds, which is a lot faster than anything else on this page :)

void CyourMFCClassDlg::split(const CString& str, const CString& delimiter, CStringArray& CStrArray)
{
    long start = 0,
         delim = str.Find(delimiter),
         delimLen = delimiter.GetLength(),
         elemCnt = 1; // the ACTUAL number of items, there'll be at least 1
    // counting the elements, setting the size and then filling the array
    // is much faster than doing .Add for each new element
    while (delim > -1)
    {
        elemCnt++;
        start = delim + delimLen;
        delim = str.Find(delimiter, start);
    }

    // manually going through and finding each delimiter again is faster than
    // keeping track of the positions from the last loop.. because doing .Add over and over
    // to the position array would be such a bottleneck
    start = 0;
    delim = str.Find(delimiter);
    CStrArray.SetSize(elemCnt); // now we don't have to use .Add, saving tons of cpu cycles
    elemCnt = -1;

    while (delim > -1)
    {
        elemCnt++;
        CStrArray[elemCnt] = str.Mid(start, delim-start);
        start = delim + delimLen;
        delim = str.Find(delimiter, start);
    }

    if (start < str.GetLength())
        CStrArray[elemCnt+1] = str.Mid(start);
    else
        CStrArray[elemCnt+1] = "";
}

Thnx!
muff99
17:20 9 Jul '05
Exactly what I was looking for! I needed to include Afxtempl.h in order to get the MFC version up 'n' running.
Yet another version
Alexis Smirnov
13:20 18 Mar '05  
This version templates the output container and assumes elements are to be added to the end.

template
void split(const string& str, _Cont& _container, const string& delim=",")
{
string::size_type lpos = 0;
string::size_type pos = str.find_first_of(delim, lpos);
while(lpos != string::npos)
{
_container.insert(_container.end(), str.substr(lpos,pos - lpos));

lpos = ( pos == string::npos ) ? string::npos : pos + 1;
pos = str.find_first_of(delim, lpos);
}
}

Alexis

http://weblog.smirnov.ca
Re: Yet another version
Alexis Smirnov
13:26 18 Mar '05  
comment poster ate angle brackets in the earlier version. Use this one instead:

template<typename _Cont>
void split(const string& str, _Cont& _container, const string& delim=",")
{
      string::size_type lpos = 0;
      string::size_type pos = str.find_first_of(delim, lpos);
      while(lpos != string::npos)
      {
          _container.insert(_container.end(), str.substr(lpos,pos - lpos));

          lpos = ( pos == string::npos ) ? string::npos : pos + 1;
          pos = str.find_first_of(delim, lpos);
      }
}
Caveat programmer
David 'dex' Schwartz
1:59 8 Dec '09  
This algorithm always returns any blank fields (a sensible default to be sure, but not always what one wants) and yes the caller could then choose to throw those empty values away.
It also assumes delim is a set of possible delimiters but the original poster uses std::string::find not find_first_of so the meaning is quite different.
Returning the size is helpful for repeated use when appending to the same container.
This allows the caller to write cleaner code when field counts are of interest.

Please eschew the use of underscores at the start of names, that's really for standard library implementers and compiler vendors etc., not we mere mortal dev types.

Keep it simple
dex
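
The difference dex describes between std::string::find and std::string::find_first_of only shows up with a delimiter longer than one character. A minimal sketch, with splitBy and splitByDemo as hypothetical names that appear nowhere else on this page, and an anyOf flag assumed purely for illustration:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: when `anyOf` is false, splits on the whole
// `delim` string (std::string::find); when `anyOf` is true, splits on
// any single character contained in `delim` (find_first_of).
std::vector<std::string> splitBy(const std::string& str,
                                 const std::string& delim,
                                 bool anyOf)
{
    std::vector<std::string> out;
    if (delim.empty())
    {
        out.push_back(str);
        return out;
    }
    // A whole-string match consumes delim.size() characters,
    // a character-set match consumes exactly one.
    const std::string::size_type step = anyOf ? 1 : delim.size();
    std::string::size_type start = 0;
    for (;;)
    {
        const std::string::size_type hit =
            anyOf ? str.find_first_of(delim, start) : str.find(delim, start);
        if (hit == std::string::npos)
        {
            break;
        }
        out.push_back(str.substr(start, hit - start));
        start = hit + step;
    }
    out.push_back(str.substr(start));
    return out;
}

// Self-check: the same input and delimiter give different results.
void splitByDemo()
{
    const std::vector<std::string> whole = splitBy("a, b,c", ", ", false);
    assert(whole.size() == 2);   // "a" and "b,c"

    const std::vector<std::string> any = splitBy("a, b,c", ", ", true);
    assert(any.size() == 4);     // "a", "", "b", "c"
}
```

With a one-character delimiter the two modes produce identical results.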

Bug
_vin_
11:56 10 Aug '03
When the delimiter is in the first position, your function crashes.
Re: bug
_vin_
12:11 10 Aug '03
Or in the last position.

For example try to split " teststring" by " ", or split "teststring " by " ".
Re: bug
hiso7
23:18 20 Sep '07
Agreed.

I've figured out that the problem comes from this line:
(1) int offset = positions[i-1] + sizeS2;

But changing it to:
(2) int offset = positions[i] + sizeS2;

only shifts the problem to the last word, which is skipped.

My input is:
const string& input="mary|had|a||little|lamb|dsqdsqd|";

If I use (1), the program crashes with the error message "vector subscript out of range".
But if I use (2), I get the same message when reaching the last word "dsqdsqd".

Can someone explain or, even better, correct this portion of the code to make it work?
Re: bug
hiso7
23:40 20 Sep '07  
Never mind. I have solved it.

The change is as follows:

int offset = positions[i] + sizeS2;
if( offset <= isize )
{
if( i == positions.size() )
{
s = input.substr(offset);
}
Re: bug
SkyDiver
22:44 10 Sep '08  
No it's not - it still causes problems.

The solution is as follows:

if( i == 0 ) 
{
    s = input.substr( i, positions[i] );
}
else {
    int offset = positions[i-1] + sizeS2;
    if( offset < isize )
    {
        if( i == positions.size() )
        {
            s = input.substr(offset);
        }
        else if( i > 0 )
        {
            s = input.substr( positions[i-1] + sizeS2,
                  positions[i] - positions[i-1] - sizeS2 );
        }
    }
}

This is my version!
Anonymous
2:41 4 Mar '03  
template void split(const string& str, _Outit _Where, const string& delim=",")
{
string::size_type lpos = 0;
string::size_type pos = str.find_first_of(delim, lpos);
do
{
*_Where = str.substr(lpos,pos - lpos);
// front_inserter, back_inserter and inserter will do
// nothing with operator++
++_Where;
lpos = ( pos == string::npos ) ? string::npos : pos + 1;
pos = str.find_first_of(delim, lpos);
}
while(lpos != string::npos);
}

Re: This is my version!
Anonymous
2:42 4 Mar '03  
template<typename _Outit>
void split(const string& str, _Outit _Where, const string& delim=",")
{
     string::size_type lpos = 0;
     string::size_type pos = str.find_first_of(delim, lpos);
     do
     {
          *_Where = str.substr(lpos,pos - lpos);
          // front_inserter, back_inserter and inserter will do
          // nothing with operator++
          ++_Where;
          lpos = ( pos == string::npos ) ?   string::npos : pos + 1;
          pos = str.find_first_of(delim, lpos);
     }
     while(lpos != string::npos);
}

Re: This is my version!
Anonymous
2:47 4 Mar '03
This will support all of 'back_inserter', 'front_inserter' and 'inserter'.


Last Updated 1 Feb 2006 | Copyright © CodeProject, 1999-2010