boost::tokenizer does much more and is more stable!
|
Hi Daniel,
First of all, great article. However, I think there is a small bug in the following lines:
<code>string_token_iterator(const std::string & str_, const char * separator_ = " ") : separator(separator_), str(&str_), end(0) { find_next(); }</code>
You take str_ by const reference, so a temporary std::string is created in the following situation:
<code>std::vector<std::string> s(string_token_iterator("one two three"), string_token_iterator());</code>
That temporary is destroyed once the string_token_iterator constructor has finished, so this initializer is unsafe:
<code>str(&str_)</code>
It ends up pointing to invalid memory.
Best regards,
Morten
|
Hello Morten,
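You are right in that a temporary std::string object is created. However, that object lives until the end of the full expression that creates it - see section 12.2/3 of the C++ standard. In your example the vector is constructed within that same expression, so the pointer is still valid while it is used.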
It is a problem if you have code like this:
string_token_iterator x("one two three");
// can't use x here.
It is pretty subtle, and I would actually recommend that people use the Boost tokenizer instead.
Regards,
Daniel
|
Hi, I'm just getting back into C++, and my template knowledge is not very good.
I can't figure out why this doesn't work:
std::string test;
//>> is just to mark the code in question
>> string_token_iterator myIt("one two three");
test = *myIt;
If I trace the code specifically at '>>' myIt doesn't contain a str with "one two three".
It looks to me as if the copy constructor didn't copy.
The thing about this that causes me extra confusion is that this does work
std::string test;
test = *(string_token_iterator("one two three"));
If I can't get the first method to work I can't see how to use the ++ operator.
Thanks for your help.
Mark Twombley
I never said I knew it all (and not much at all, I might add).
|
The iterator just holds a pointer to the std::string it is tokenizing. This is purely for efficiency reasons, to make the iterator lightweight.
The problem with the above usage is that a temporary std::string is created and the iterator obtains a pointer to it. The lifetime of that temporary is just the expression it is created in, so by the time execution reaches test = *myIt; the temporary has been destroyed and the pointer is invalid. This is only a problem when the iterator is created with a const char* argument. The 'solution' is then to create a separate std::string variable and initialize the iterator with it:
std::string str("one two three"), test;
string_token_iterator myIt(str);
test = *myIt;
++myIt; // and so on
I'm sorry about this, it really should be documented better.
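For anyone following along, here is a minimal sketch of the pointer-holding iterator as it can be pieced together from the snippets quoted in this thread. Only the constructor, find_next and operator* are quoted here; the typedefs, the comparison operators and the helper tokens_of are assumptions added so that it compiles. It shows the safe pattern, where the tokenized std::string is a named object that outlives the iterator.

```cpp
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Sketch only: members not quoted in the thread are assumed.
class string_token_iterator {
public:
    typedef std::input_iterator_tag iterator_category;
    typedef std::string value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const std::string* pointer;
    typedef std::string reference;

    // Default-constructed iterator acts as the end marker.
    string_token_iterator() : separator(0), str(0), start(0), end(0) {}

    // Stores a POINTER to str_, so str_ must outlive the iterator.
    string_token_iterator(const std::string& str_, const char* separator_ = " ")
        : separator(separator_), str(&str_), start(0), end(0) {
        find_next();
    }

    std::string operator*() const { return std::string(*str, start, end - start); }

    string_token_iterator& operator++() { find_next(); return *this; }
    string_token_iterator operator++(int) {
        string_token_iterator tmp(*this);
        find_next();
        return tmp;
    }

    bool operator==(const string_token_iterator& rhs) const {
        return str == rhs.str && start == rhs.start && end == rhs.end;
    }
    bool operator!=(const string_token_iterator& rhs) const { return !(*this == rhs); }

private:
    void find_next() {
        start = str->find_first_not_of(separator, end);
        if (start == std::string::npos) {  // no more tokens: become the end marker
            start = end = 0;
            str = 0;
            return;
        }
        end = str->find_first_of(separator, start);
    }

    const char* separator;
    const std::string* str;
    std::string::size_type start;
    std::string::size_type end;
};

// Hypothetical helper: 'text' is a named object for the whole loop,
// so the pointer held by the iterator stays valid.
std::vector<std::string> tokens_of(const std::string& text) {
    std::vector<std::string> out;
    for (string_token_iterator it(text), last; it != last; ++it)
        out.push_back(*it);
    return out;
}
```

Calling tokens_of("one two three") is fine even with a literal, because the temporary std::string bound to the parameter lives until the call has returned.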
|
Thanks Daniel, your suggestion works, but only if you pass a std::string. If you did something like
string_token_iterator cmdLine(argv[1]);
That will not work.
After doing about 6 hrs of reading and playing, I have learned many new things.
So just in case another newbie comes along, here are the things I have learned.
1) I first read a wonderful book (okay, just the relevant parts) called "The C++ Standard Library: A Tutorial and Reference" by Nicolai M. Josuttis. Chapter 7 is all about iterators (which this code is). From that I realized that your iterator is meant to be used with a container, though your example proves you can do otherwise.
2) VC6, which I am using, isn't very compliant with the STL, as Saurwein pointed out under "Problems with vector". Once I installed STLport (just a little plug), the example you gave works great.
3) I also traced the working code and now know why my example above doesn't work. My experimenting has shown that if you don't pass a std::string to the constructor, the iterator has a very limited useful lifetime. When the constructor for your iterator is called, it stores a pointer to the original string. If the argument wasn't a std::string, a temporary std::string is created and the argument is copied into it. That temporary only lives until the end of the statement that creates it. So string_token_iterator cmdLine(argv[1]); doesn't work: as soon as the program moves to the next line, the temporary std::string has been destroyed, which leaves our pointer to the string invalid.
This code is valid (on compliant versions of the STL):
std::vector<std::string> s(string_token_iterator(argv[1],";"),
string_token_iterator());
The reason is that our temporary std::string isn't destroyed until after the vector is constructed. I still can't figure out at this time how to get the iterator to create a copy of the passed string and store it.
I hope this will help someone else.
Mark Twombley
I never said I knew it all (but I'm willing to learn).
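On the open question of storing a copy: one possible approach, sketched below under the assumption that copying the string is acceptable, is to hold the std::string by value instead of by pointer. The names owning_token_iterator and split_c_string are hypothetical and not part of the article's code. The trade-off is that every copy of the iterator also copies the string, which is exactly the cost the pointer-based design avoids.

```cpp
#include <cstddef>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical variant that owns a copy of the text it tokenizes.
class owning_token_iterator {
public:
    typedef std::input_iterator_tag iterator_category;
    typedef std::string value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const std::string* pointer;
    typedef std::string reference;

    // Default-constructed iterator acts as the end marker.
    owning_token_iterator() : at_end(true), start(0), end(0) {}

    // Copies both arguments, so temporaries (e.g. from const char*) are safe.
    explicit owning_token_iterator(const std::string& text_,
                                   const std::string& separators_ = " ")
        : text(text_), separators(separators_), at_end(false), start(0), end(0) {
        find_next();
    }

    std::string operator*() const { return text.substr(start, end - start); }

    owning_token_iterator& operator++() { find_next(); return *this; }
    owning_token_iterator operator++(int) {
        owning_token_iterator tmp(*this);
        find_next();
        return tmp;
    }

    bool operator==(const owning_token_iterator& rhs) const {
        if (at_end || rhs.at_end) return at_end == rhs.at_end;
        return start == rhs.start && end == rhs.end;
    }
    bool operator!=(const owning_token_iterator& rhs) const { return !(*this == rhs); }

private:
    void find_next() {
        start = text.find_first_not_of(separators, end);
        if (start == std::string::npos) {  // no more tokens
            at_end = true;
            start = end = 0;
            return;
        }
        end = text.find_first_of(separators, start);
    }

    std::string text;        // owned copy
    std::string separators;  // owned copy
    bool at_end;
    std::string::size_type start;
    std::string::size_type end;
};

// Usage in the style of "cmdLine(argv[1])": raw must be a valid, non-null C string.
std::vector<std::string> split_c_string(const char* raw, const char* seps) {
    owning_token_iterator it(raw, seps);  // temporaries die here, copies remain
    std::vector<std::string> out;
    for (; it != owning_token_iterator(); ++it)
        out.push_back(*it);
    return out;
}
```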
|
Yeah, like I said in my comment, this is a problem whenever the user passes a const char* to the tokenizer constructor. Both string_token_iterator("abc") and string_token_iterator(argv[1]) do that, so temporary std::string objects are created.
In retrospect I think it's safe to say that the iterator constructor should take a const std::string * as argument instead of a reference. It would make it a lot safer for people to use. The downside would be that std::vector<std::string> s(string_token_iterator(argv[1],";"), string_token_iterator()); would no longer work.
|
1.)
In void find_next(void):
end = str->find_first_of(separator, start);
... for the last element, 'std::string::npos' is returned. Next, in
std::string operator*() const
{
return std::string(*str, start, end - start);
}
the constructor fails (core dump) because 'std::string::npos - start' is something truly invalid.
2.)
You're using
start = str->find_first_not_of(separator, end);
What if your string contains two or more separators next to each other?
E.g. "|||c|d" with "|" should return 5 tokens, but returns only two ...
|
Thank you for your comments.
>1.)
A.
std::string::npos is an unsigned integral value. It is initialized to -1, so in effect it ends up as the maximum value of the underlying unsigned type.
B.
The last parameter in the std::string constructor call in operator* is the size of the resulting string. You are correct in that if end is npos it will take on a "weird" value. However, the effective length of the string that is created is the smaller of (str.size()-offset) and the length parameter.
>2.)
This is by design, the first example demonstrates this behaviour.
|
>1.)
>B.
>The last parameter in the std::string constructor call in operator* is the size of the
>resulting string. You are correct in that if end is npos it will take on a "weird" value.
>However, the effective length of the string that is created is the smallest
>of (str.size()-offset) and the length parameter.
A few more words on this issue: the behavior of the std::string constructor (in this case 'basic_string(const_iterator first, const_iterator last, const A& al = A());') seems to vary between STL implementations. While it works under VC6.0/7.x and gcc, it core dumps under Solaris/CC.
|
The tokenizer uses this constructor:
basic_string(const basic_string& str, size_type pos = 0, size_type n = npos, const Allocator& a = Allocator());
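A small check of that constructor's clamping, as a sketch: last_token is a hypothetical helper that mirrors the expression in operator* for the final token, where end is npos. With a conforming library the length argument is clamped to the characters remaining after pos, and the constructor only throws std::out_of_range if pos is greater than the string's size.

```cpp
#include <string>

// Hypothetical helper mirroring std::string(*str, start, end - start)
// for the last token, where find_first_of has returned npos.
std::string last_token(const std::string& s, std::string::size_type start) {
    const std::string::size_type end = std::string::npos;
    // end - start is a huge unsigned value; the substring constructor
    // clamps it to s.size() - start.
    return std::string(s, start, end - start);
}
```

For example, last_token("one two three", 8) yields "three".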
|
ok great work
but what if I want to output an empty token?
how can I modify the tokenizer?
Thanx TD
|
I found a solution
I hope you'd like it
.
.
.
string_token_iterator(const std::string & str_, const char * separator_ = " ") : separator(separator_),str(&str_),start(0),end(0)
{
find_next();
}
.
.
.
.
void find_next(void)
{
    // check for empty tokens
    if (start > 0)
    {
        start = str->find_first_of(separator, end + 1);
        if (start != end + 1)
        {
            start = str->find_first_not_of(separator, end);
        }
    }
    else
    {
        start = str->find_first_not_of(separator, end);
    }
    if (start == std::string::npos)
    {
        start = end = 0;
        str = 0;
        return;
    }
    end = str->find_first_of(separator, start);
}
.
.
.
By TD
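As a point of comparison, and not a replacement for the iterator above, here is a minimal sketch of a plain function that keeps empty tokens. The name split_keep_empty is hypothetical. It treats every separator character as a boundary, so "|||c|d" split on "|" gives the five tokens asked for earlier in this thread: three empty strings, then "c" and "d".

```cpp
#include <string>
#include <vector>

// Hypothetical helper: every separator ends a token, so adjacent
// separators (and leading or trailing ones) produce empty tokens.
std::vector<std::string> split_keep_empty(const std::string& text,
                                          const std::string& separators) {
    std::vector<std::string> tokens;
    std::string::size_type start = 0;
    for (;;) {
        const std::string::size_type end = text.find_first_of(separators, start);
        if (end == std::string::npos) {
            tokens.push_back(text.substr(start));  // last token, possibly empty
            break;
        }
        tokens.push_back(text.substr(start, end - start));
        start = end + 1;  // continue right after the separator
    }
    return tokens;
}
```

Note that an empty input yields a single empty token with this approach; whether that is wanted depends on the use case.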
|
Got my 5.
I suppose for VC6 the easy way to use it would be something like:
std::vector<std::string> s;
for(string_token_iterator iter("one two three"); iter != string_token_iterator(); iter++)
{
s.push_back(*iter);
}
Giles
"Je pense, donc je mange." - Rene Descartes 1689 - Just before his mother put his tea on the table.
|
In reality, one should avoid hand-written loops where a suitable STL algorithm exists. The reason becomes obvious in the statement below:
for(string_token_iterator iter("one two three"); iter != string_token_iterator(); iter++)
In the comparison part, one must rely on the compiler to optimize away the construction of a default string_token_iterator. Otherwise, the default constructor is called every time the loop reaches the comparison (followed by the string_token_iterator destructor, since the temporary's lifetime ends with the comparison expression).
In the increment part, iter++ should become ++iter to avoid constructing and destroying a temporary iterator object, which slows the loop unnecessarily.
I would rather rely on
#include <algorithm>
using std::copy;
#include <iterator>
using std::back_inserter;
copy( string_token_iterator("one two three"), string_token_iterator(), back_inserter(s) );
and let the STL do what it does best. I am possibly splitting hairs here, but anyway... Regardless, the article is splendid!
Philip
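To make the std::copy pattern concrete without repeating the tokenizer class, here is a self-contained sketch that uses std::istream_iterator as a stand-in input iterator. That is a swapped-in technique, not the article's code: it only splits on whitespace. The helper name collect_words is hypothetical. Note that std::back_inserter lives in the <iterator> header.

```cpp
#include <algorithm>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// Stand-in for the tokenizer: istream_iterator<std::string> yields
// whitespace-separated words, and a default-constructed one marks the end.
std::vector<std::string> collect_words(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> s;
    std::copy(std::istream_iterator<std::string>(in),
              std::istream_iterator<std::string>(),
              std::back_inserter(s));
    return s;
}
```

Any input iterator pair with the same begin/end convention, such as the string_token_iterator discussed here, can be passed to std::copy in the same way.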
|
Very elegant solution.
I have a problem with your example under VC++ V6.0 SP5 :
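std::vector<std::string> s(string_token_iterator("one two three"),string_token_iterator());
I get the following error message for this line:
error C2664: '__thiscall std::vector<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> >,class std::allocator<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > > >::std::vector<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> >,class std::allocator<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > > >(unsigned int,const class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > &,const class std::allocator<class std::basic_string<char,struct std::char_traits<char>,class std::allocator<char> > > &)' : cannot convert parameter 1 from 'struct string_token_iterator' to 'unsigned int'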
No user-defined-conversion operator available that can perform this conversion, or the operator cannot be called
Error executing cl.exe.
Any idea why this doesn't work ?
|
Thank you.
If I recall correctly, VC6 lacks member template support, which is needed for that construct. A workaround would be to use std::copy together with a back_inserter instead of the vector constructor. It would lose the elegance, but then again, VC6 isn't very elegant.
|
This seems to be a problem not with VC6 itself but rather with the STL implementation that comes with VC6. I'm using STLport and it works nicely with the vector constructor.
Just in case you didn't notice: VC6's STL sucks.
Finally moved to Brazil
|
Because the last thing anyone wants is to find out during the conversion to Unicode that one of his classes doesn't support it.
|
I would leave my name as anonymous too if I couldn't convert this to unicode.
|
Nice article, the way your iterator blends into the general STL frameworks is fine.
In case you didn't know, Boost features a string tokenizer with a design very similar to yours. You might want to have a look at it to switch to it or borrow some ideas for your own tokenizer.
Joaquín M López Muñoz
Telefónica, Investigación y Desarrollo