Harlinn Windows on CodePlex[^]
Introduction
This is the fifth article in this series about Windows C++ development.
The previous articles can be found here:
The String class is located in "\HarlinnWindows\hwinstring.h"
Why on earth another C++ string class, isn't CString, std::string & std::wstring good enough? They're certainly well designed classes, but it turns out there is a good case for another String class.
There is also something pleasurable about implementing a decent string class.
Motivation
String is a reference counted, copy-on-write string class that is binary compatible with a zero terminated wchar_t*. The good thing about that is that if you have something like this:
struct Foo1
{
int x;
int y;
wchar_t* pszText;
};
then
struct Foo2
{
int x;
int y;
String Text;
};
will have the the same binary layout as
Foo1, allowing you to pass
Foo2 to a function that expects a
Foo1.
void print(const Foo1* p);
void doprint(const String& s)
{
Foo2 foo2;
foo2.x = 5;
foo2.y = 5;
foo2.Text = s + L" Printed";
print( reinterpret_cast<Foo1*>(&foo2));
}
This means that the size of a String variable s is eight bytes, or four if you're compiling for a 32-bit architecture.
String is basically just a smart pointer to the data field of the following structure:
struct Buffer
{
size_type referenceCount;
size_type length;
wchar_t data[128];
};
So the following operation is required to convert the pointer to the zero terminated string into a pointer to a Buffer:
Buffer* toBuffer() const
{
if(data)
{
return (Buffer*)(((char*)data) - offsetof(Buffer,data));
}
return nullptr;
}
Since String objects are essentially smart pointers, replacing code like:
wchar_t* pointers[2];
pointers[0] = wcsdup(L"Some string");
pointers[1] = wcsdup(L"Some other string");
foo(pointers);
free(pointers[0]);
free(pointers[1]);
with:
std::vector<String> v;
v.push_back(String(L"Some string"));
v.push_back(String(L"Some other string"));
foo(reinterpret_cast<const wchar_t**> v.data());
can potentially make your life as a C++ developer a lot less exiting.
This is something you cannot do using CString, std::string & std::wstring, and having said that, I think it's also fair to mention that the String class does not perform small string optimization, and neither can you specify an allocator. It's primary purpose is to serve as a replacement for raw zero terminated wchar_t strings when working with the Windows API, so what I want is someting that I can efficiently pass around, modify, and return.
The three most important features for a string class, in relation to a framework for working with the Windows API, are:
- Access to the contents of the string.
- Access to the length of the string.
- Passing as an argument and returning as a result
Typical usage when interacting with the Windows API:
HWIN_EXPORT String Path::GetLongPathName(const String& path)
{
if(path)
{
wchar_t buffer[MAX_PATH+1] = {0,};
auto length = ::GetLongPathNameW(path.c_str(),
buffer,sizeof(buffer)/sizeof(wchar_t));
if(length == 0)
{
ThrowLastOSError();
}
if(length > (sizeof(buffer)/sizeof(wchar_t)))
{
String result;
result.SetLength(length-1);
length = ::GetLongPathNameW(path.c_str(),result.c_str(),length);
if(length == 0)
{
ThrowLastOSError();
}
return result;
}
else
{
String result(buffer,length);
return result;
}
}
return String();
}
During the second call to ::GetLongPathNameW, data will be copied directly to the buffer allocated by the call to SetLength, which is both convenient and efficient.
Performance
The String class performs very well in most situations, usually doing well enough compared to std::wstring and CString.
The tests operates on std::vector<T> containing 100 000 objects.
The source code for the tests is located in the "Examples\Windows\Strings\StringsExample" directory.
The test results are in milliseconds:
|
| String |
std::wstring |
CString |
.Net string |
| Default constructor | 0.2332 | 1.1048 | 0.9196 | 2.7119 |
| Initialize from small strings | 13.7469 | 13.6107 | 15.6871 | 15.6395 |
| Get length | 2.9202 | 7.6847 | 2.3535 | 1.4666 |
| Get wchar_t* | 1425.2 | 1815.01 | 1485.07 | N/A |
| Assignment | 2.5579 | 11.5627 | 2.7839 | 2.0621 |
| Initialize vector using push_back | 14.0322 | 15.0869 | 17.3816 | 10.8025 |
| Append string | 11.8091 | 15.6082 | 19.7568 | 2.4463 |
| Append char | 532.829 | 313.211 | 270.669 | 5102.9609 |
| Sort | 129.842 | 133.492 | 138.224 | 12689.2583 |
| Simple Find one of | 67.0268 | 71.8129 | 64.6829 | 44.737 |
| Find one of | 128.498 | 255.84 | N/A | 294.803 |
| Reverse find one of | 144.412 | 211.391 | N/A | 445.1848 |
| Find string | 52.0553 | 67.0184 | 28.6715 | 2543.4774 |
| Reverse find string | 132.073 | 162.209 | N/A | 3294.2206 |
| Insert string | 30.8257 | 34.8904 | 37.9284 | 177.7181 |
| Remove characters | 27.0146 | 24.288 | 25.2154 | 179.6267 |
| Recursion | 75.5416 | 298.694 | 257.099 | 190.6845 |
While CString outperforms String and std::wstring in the 'Find string' test, it does so by treating itself as a zero terminated string, ignoring it's own length - so when it contains '\x00' characters, it may not find occurences of the search string.
CString::FindOneOf does not allow us to specify an offset within the string to start the search from, limiting the usefulness of the function.
The library also contains a StringBuilder class, and running the 'Append char' test using this class takes 153.791 ms - outperforming all of the string classes.
Since the .Net string is an immutable type, it shouldn't come as a surprise that it does pretty bad in the 'Append char' test, it's not designed for this kind of use.
What came as a surprise is how bad it did in the search and sort tests, while I didn't expect it to match the C++ classes, I expected way better performance than this.
Observations
The String class does well enough, compared to the std::wstring and CString - with one notable exception, appending a character takes about twice as long for the String class compared to CString. Both std::wstring and CString stores their allocated capacity, while String calculates the capacity based on it's length, saving eight bytes of memory.
The "Append char" test appends a character to a string 38 250 000 times for each string type, so for now I think saving those 8 bytes is worth the performance hit, especially since the String class seems to outperform the other two classes once we start to add more than one character at the time.
The Recursion test
I mentioned that I wanted someting that I can efficiently pass around, modify, and return, and the Recursion test attempts to show whether I have succeded or not.
The recursive function takes a reference to a String arg as one it's argumenta. It combines the argument with the string L"Hi", and calls itself with the combined string as the String argument until recursionLevel reaches 10000, and then returns the final combined String.
const size_t maxRecursion = 10000;
String StringRecursion(const String& arg,size_t recursionLevel)
{
String result = arg + L"Hi";
if(recursionLevel < maxRecursion)
{
recursionLevel++;
result = StringRecursion(result,recursionLevel);
}
return result;
}
As the results show, the String class seems to be significantly better at this than CString and std::wstring.
An unexpected benefit related to /Qpar(Auto-Parallelizer) compiler option
The /Qpar compiler switch enables automatic parallelization of loops in our code.
The code for one of the String tests:
void StringVectorGetTotalLength(const std::vector<String>& v)
{
Stopwatch stopwatch;
size_t totalLength = 0;
stopwatch.Start();
for(size_t i = 0; i < 100000;i++)
{
totalLength += v[i].length();
}
stopwatch.Stop();
std::wcout
<< L"std::vector<String> Get total length ("
<< totalLength
<< L") : "
<< stopwatch.Elapsed().TotalMilliseconds()
<< std::endl;
}
The code for the std::wstring test:
void wstringVectorGetTotalLength(const std::vector<std::wstring>& v)
{
Stopwatch stopwatch;
size_t totalLength = 0;
stopwatch.Start();
for(size_t i = 0; i < 100000;i++)
{
totalLength += v[i].length();
}
stopwatch.Stop();
std::wcout
<< L"std::vector<std::wstring> Get total length ("
<< totalLength
<< L") : "
<< stopwatch.Elapsed().TotalMilliseconds()
<< std::endl;
}
When we enable Auto-Parallelization on the above test code, we get quite different results:
- String: 1.8227 - about 60% performance improvement
- std::wstring: 0.5685 - a whopping 1251% performance improvement
The last result is more than a bit baffling - because the compiler isn't parallelizing the above loops, but true nontheless - suddenly the compiler is able to do wonders while optimizing the standard C++ library based code.
It seems specifying /Qpar appearently enables more aggressive optimization, even if, in the end, no parallelization/vectorization takes place.
The String class
The constructors
String()
: data(nullptr)
{}
String s;
std::wcout << L"Empty string:" << (s?L"Not null" : L"null") << std::endl;
The default constructor just sets
data to
nullptr, making this operation very fast.
String(const String& other);
If
other is an empty string, the copy constructor sets
data to
nullptr, and when it's not, it just increments the reference count on the buffer.
String(String&& other);
The move constructor assigns
other.data to
data before assigning
nullptr to
other.data.
String(size_type length, wchar_t c);
Creates a new string with length
length, and fills it with the character
c.
String(const wchar_t* str,size_type length, wchar_t padCharacter = defaultPadCharacter );
Creates a new string with length
length. If
str is not
nullptr, it copies
length characters from
str into the string, otherwise the string gets filled with
length number of
padCharacter.
String(const wchar_t* str1,size_type length1,
const wchar_t* str2,size_type length2,
wchar_t padCharacter = defaultPadCharacter);
This constructor creates a new
String by concatenating two string sources, if
str1 or
str2 is
nullptr, the
padCharacter will be used to fill
length1 or
length2 characters in the new
String respectively.
String(const wchar_t* str1,size_type length1,
const wchar_t* str2,size_type length2,
const wchar_t* str3,size_type length3,
wchar_t padCharacter = defaultPadCharacter);
This constructor creates a new
String by concatenating three string sources, if
str1,
str2 or
str3 is
nullptr, the
padCharacter will be used to fill
length1,
length2 or
length3 characters in the new
String respectively.
String(const wchar_t* str);
Creates a new
String from a zero terminated string. If str is nullptr or the length is 0, the new
String will have data set to
nullptr.
The destructor
~String();
Decrements the reference count of the
Buffer, and destroys the
Buffer when the new reference count becomes 0.
The operators
String& operator = (const String& other)
Copy assignment sets
data to
other.data and if
data is not
nullptr it increments the reference count of the
Buffer. Handles self assignment.
String& operator = (String&& other)
Move assignment
data to
other.data before setting
other.data to
nullptr. Handles self assignment.
String& operator = (const wchar_t* str);
Copies a zero terminated string into this
String. Handles the special case of:
String s1 = L"Hello";
s1 = s1.c_str() + 1;
bool operator == (const String& other) const;
bool operator != (const String& other) const;
bool operator <= (const String& other) const;
bool operator < (const String& other) const;
bool operator >= (const String& other) const;
bool operator > (const String& other) const;
bool operator == (const wchar_t* str) const;
bool operator != (const wchar_t* str) const;
bool operator <= (const wchar_t* str) const;
bool operator < (const wchar_t* str) const;
bool operator >= (const wchar_t* str) const;
bool operator > (const wchar_t* str) const;
Full set of comparition operators.
operator bool() const;
Returns false if data is nullptr, which allows us to test for an empty string using a simple expression:
String s(L"Hello");
if(s)
{
}
wchar_t& operator[](size_type index);
Returns a reference to the character at
index. This makes it possible to assign characters to specific positions in the
String object:
String s1(L"Hi!",3);
String s2 = s1;
s2[1] = L'o';
This operator ensures that s2 references a unique buffer.
wchar_t operator[](size_type index) const;
Returns the character at
index.
String& operator += (const String& other);
Appends the
String other to this
String.
String& operator += (const wchar_t* str);
Appends the zero terminated string,
str, to this
String.
String& operator += (const wchar_t c);
Appends the charater,
c, to this
String.
friend String operator + (const String& str1,const String& str2)
Creates a new string by concatenating str1 and str2.
friend String operator + (const String& str1,const wchar_t* str2)
Creates a new string by concatenating str1 and str2.
friend String operator + (const String& str,const wchar_t c)
Creates a new string by concatenating the
String str and the character
c.
Comparition
int CompareTo(const String& other) const;
int CompareTo(const wchar_t* str) const;
returns:
- < 0 the argument is greater than this
String
- = 0 the argument is equal to this
String
- > 0 the argument is less than this
String
Size and character data access
String& SetLength(size_type newLength)
Ensures that
data points to an array that at least has a size of
newLength+1 characters, or
nullptr if
newLength is 0.
size_type length() const;
size_type Length() const;
Returns the lenght of the string, in characters, excluding the terminating zero, or 0 if
data is
nullptr.
const wchar_t* c_str() const;
Returns
data, be aware that
data may be shared between several
String objects.
wchar_t* c_str();
Returns
data, if
data is not
nullptr then
data is guaranteed to only be referenced by this
String object.
const wchar_t* begin() const;
wchar_t* begin();
const wchar_t* cbegin() const;
const wchar_t* end() const;
const wchar_t* cend() const;
wchar_t* end();
Provies "iterator like" access to the character buffer. For the non const versions when data is not nullptr then data is guaranteed to only be referenced by this String.
By "iterator like" I mean it's enough to provide the functionality required for range based for loops:
String s1 = L"Hello";
for(auto c : s1)
{
std::wcout << L'\'' << c << L'\'' << std::endl;
}
for(auto& c : s1)
{
c = c+1;
}
for(auto& c : s1)
{
std::wcout << L'\'' << c << L'\'' << std::endl;
}
const String& CopyTo( wchar_t* buffer, size_type bufferSize,
size_type start = 0, wchar_t padCharacter = defaultPadCharacter ) const;
Copies at most bufferSize characters from this String to the buffer specified by buffer starting at start. If there is not enough characters remaining between start and the end of this String, the remainder of the buffer will be filled with padCharacter.
String SubString ( size_type start, size_type length = npos) const;
Returns a String object containing a substring of this String, If start + length is greater than the length of this String, the returned String contain the characters between start and the end of this String.
Searching
size_type IndexOfAnyOf ( const wchar_t *searchChars,
size_type numberOfSearchChars, size_type start) const;
size_type IndexOfAnyOf ( const String& searchChars, size_type start = 0) const;
size_type IndexOfAnyOf( const wchar_t* searchChars, size_type start = 0) const;
Returns the index of the first occurence of any character from searchChars, starting the search at start. The function returns String::npos if no such character is found.
size_type IndexOfAnyBut ( const wchar_t *searchChars,
size_type numberOfSearchChars, size_type start) const;
size_type IndexOfAnyBut ( const String& searchChars, size_type start = 0) const;
size_type IndexOfAnyBut( const wchar_t* searchChars, size_type start = 0) const;
Returns the index of the first occurence of any character not from searchChars, starting the search at start. The function returns String::npos if no such character is found.
size_type LastIndexOfAnyOf ( const wchar_t *searchChars,
size_type numberOfSearchChars, size_type start) const;
size_type LastIndexOfAnyOf( const String& searchChars, size_type start = npos) const;
size_type LastIndexOfAnyOf( const wchar_t* searchChars, size_type start = npos) const;
Searches the String from the end, in backwards direction, for an occurence of any character from searchChars, starting the search at start. If a match is found, the funtion returns the index of the matching character, or String::npos if no such character is found.
size_type LastIndexOfAnyBut ( const wchar_t *searchChars,
size_type numberOfSearchChars, size_type start ) const;
size_type LastIndexOfAnyBut( const String& searchChars, size_type start = npos) const;
size_type LastIndexOfAnyBut( const wchar_t* searchChars, size_type start = npos) const;
Searches the String from the end, in backwards direction, for an occurence of any character not from searchChars, starting the search at start. If a match is found, the funtion returns the index of the matching character, or String::npos if no such character is found.
size_type IndexOf( const wchar_t *searchString, size_type searchStringLength, size_type start) const;
size_type IndexOf( const String& searchString, size_type start = 0) const;
size_type IndexOf( const wchar_t* searchString, size_type start = 0) const;
size_type IndexOf( const wchar_t c, size_type start = 0) const;
Searches the String for the content specified by searchString, or the character c, starting the search at start. Returns the index of the first match, or String::npos if no match is found.
size_type IndexOf( bool ( *test )(wchar_t ) , size_type start = 0) const;
size_type IndexOf( bool ( *test )(const wchar_t*, size_type length ) , size_type start = 0) const;
size_type IndexOf( bool ( *test )(const wchar_t*, const wchar_t* ) , size_type start = 0) const;
Searches the String for a match evaluated by test, starting the search at start. Returns the index of the first match, or String::npos if no match is found:
void CheckInvalidPathChars(const String& path)
{
if(path.IndexOf([] (wchar_t c) -> bool
{
return (c == '\"' || c == '<' || c == '>' || c == '|' || c < 32);
}) != String::npos)
{
throw ArgumentException("Invalid path character");
}
}
size_type LastIndexOf( const wchar_t *searchString,
size_type searchStringLength, size_type start ) const;
size_type LastIndexOf( const String& searchString, size_type start = npos) const;
size_type LastIndexOf( const wchar_t* searchString, size_type start = npos) const;
size_type LastIndexOf( wchar_t c, size_type start = npos ) const;
Searches the String, in backwards direction, for the content specified by searchString, or the character c, starting the search at start. Returns the index of the first match, or String::npos if no match is found.
size_type LastIndexOf( bool ( *test )(wchar_t ) , size_type start = npos) const;
size_type LastIndexOf( bool ( *test )(const wchar_t*, size_type length ) , size_type start = npos) const;
size_type LastIndexOf( bool ( *test )(const wchar_t*, const wchar_t*) , size_type start = npos) const;
Searches the String, in backwards direction, for a match evaluated by test, starting the search at start. Returns the index of the first match, or String::npos if no match is found.
bool StartsWith(const wchar_t* str) const;
bool StartsWith(const String& str) const;
Returns true if the string starts with a full match with the argument.
Editing
All editing operations ensures that the Buffer is unique to the String, a Buffer that is initially shared with other String objects will only have it's reference count decremented, while modifications are made to a new Buffer.
String& UpperCase();
Converts all the characters in this String to upper case.
String& LowerCase();
Converts all the characters in this String to lower case.
String& Remove(size_type start, size_type length = npos);
Removes length number of characters from this String, starting at start.
String& RemoveRange(size_type start, size_type end);
Removes the characters starting at index start, up to, but not including, the index specified by end.
String& Keep(size_type start, size_type length = npos);
Removes all characters up to the index start, and those from the index start+length up to the end of this String.
String& KeepRange(size_type start, size_type end);
Removes all characters up to the index start, and those from the index end up to the end of this String.
String& Insert( const wchar_t* text, size_type textLength, size_type position );
String& Insert( const String& text, size_type position = 0);
String& Insert( const wchar_t* text, size_type position = 0);
Inserts text at the specified position. If position is greater than the length of this String, the text is appended to this String.
String& TrimRight(const wchar_t* charactersToRemove, size_type numberOfCharactersToRemove);
String& TrimRight(const String& charactersToRemove);
Trims away any characters given by charactersToRemove from the "Right" side of this String.
String& TrimRight();
Trims away any "white space" characters from the "Right" side of this String.
String& TrimLeft();
Trims away any "white space" characters from the "Left" side of this String.
String& Trim();
Trims away any "white space" characters from the "Left" and "Right" side of this String.
History
- 23. of November, 2012 - Initial posting
- 24. of November, 2012 - Added performance information
- 28. of November, 2012 - Added new StringBuilder class
- 30. of November, 2012 - Library update
- 8. of December, 2012 - Library update
- 15. of December, 2012 - Library update