fix_str - An (Almost) Immutable String Class in C++

Oct 20, 2005

New style string class(es) for ASCII and UNICODE, single- and multi-threaded environments.

Introduction

C#, Java, Python and other programming languages have an immutable string class. Why not C++? Immutable value objects have demonstrated many advantages in the languages that foster them. The problem is that in C++ you cannot put an immutable object into a standard container or pass it to a someobj.set(string) function without operator=, which is a mutating function. Other languages seemingly don't face this problem because they treat strings as reference objects, whereas in C++ strings only make sense as value objects. On the other hand, immutable objects need mutable(!) references in order to be usable. So, from a conceptual point of view, the difference between the immutable string classes in C#, Java, ... and the (almost immutable) string class(es) for C++ presented here is not as big as it seems at first sight.
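The assignability problem can be illustrated with a small sketch. Both types below are hypothetical and are not part of fix_str; they only contrast a strictly immutable value type with an "almost immutable" one that still supports assignment.

```cpp
#include <string>
#include <type_traits>

// Hypothetical strictly immutable string: the const member removes operator=.
struct strict_immutable {
    const std::string value;
    explicit strict_immutable(const char* s) : value(s) {}
};

// Hypothetical "almost immutable" string: no mutating member except assignment.
class almost_immutable {
public:
    explicit almost_immutable(const char* s) : value_(s) {}
    const std::string& str() const { return value_; }
private:
    std::string value_;  // never modified in place, only replaced via operator=
};

// Copyable but not assignable, so code that relies on operator= does not compile.
static_assert(std::is_copy_constructible<strict_immutable>::value, "copyable");
static_assert(!std::is_copy_assignable<strict_immutable>::value, "not assignable");

// Assignable, so it behaves like an ordinary value type.
static_assert(std::is_copy_assignable<almost_immutable>::value, "assignable");
```

Any operation that reassigns elements, such as sorting a container or a setter that stores its argument, needs the second form.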

Background

Motivation

More often than not, you need not change a string once it has been created. It seems reasonable to design a string class that is 'cheap' to copy and assign but 'expensive' to modify. This is what the immutable string classes in C#, Java, and other languages aim at. Consequently, calling someobj.set(string) or someobj.get() functions, inserting strings into a container, sorting, replacing strings, ... can be done without ever requiring a 'deep' copy of the string contents.

In general, this kind of a string class is useful when the string changes rarely but is copied frequently. For heavily changing strings, C# and Java provide a mutable 'StringBuilder' companion class.

Three 'Prototypical' String Classes

Three prototypical (imaginary) C++ string classes can be distinguished:

  • 'StringBuilder': mutable string class for strings with frequently changing contents.
  • 'FixString': immutable string, no changes after construction (like in C# String).
  • 'AutoBuffer': array of characters on the stack without dynamic allocation.

In C++, the current std::string[5] implementations typically combine two of the above prototypical approaches, a compromise that is hardly optimal, or even appropriate, for all cases.

Comparison of fix_str with other string implementations

                                      | std::string[5] (VC 6.0)    | std::string[5] (VC 7.1) | CString (MFC)              | fix_str
sizeof                                | 16                         | 28                      | 4                          | 4
copy / assignment method              | reference-counted + COW[1] | deep copy + SSO[2]      | reference-counted + COW[1] | reference-counted
copy / assignment speed               | fast                       | fast or slow[3]         | fast                       | fast
default constructor                   | fast                       | fast                    | fast                       | fast
constructor for length > 0            | slow                       | fast or slow[3]         | slow                       | slow
usable in multi-threaded environments | ?                          | yes                     | ?                          | yes/no[4]
thread safe for concurrent write      | no                         | no                      | no                         | no
mutable                               | yes                        | yes                     | yes                        | assignable but otherwise immutable

  1. COW: Copy-On-Write
  2. SSO: Small String Optimization; the string contains a buffer (16 byte in VC 7.1) for small strings and allocates memory on the heap only for larger strings.
  3. Fast with SSO for strings <= 15 char or <= 7 wchar_t (UNICODE), respectively.
  4. Different classes for single- and multi-threaded environments (see below).
  5. 'std::string' is a typedef of template<class charT, class traits = char_traits<charT>, class Allocator = allocator<charT> > class basic_string.

fix_str

fix_str basics

  • fix_str is a (set of) very lightweight string class(es).
  • implemented deliberately as classes, not as a template, and without namespaces.
  • designed as a value type.
  • default constructor, copy constructor and operator= are always 'cheap'.
  • the contents of a fix_str object cannot be changed except by assignment.
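The properties listed above can be sketched with a minimal reference-counted class. This is not the fix_str source: the name rc_str, the internal rep struct and the use_count() accessor are assumptions made for illustration, and the sketch is single-threaded only.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical sketch: a value type holding one pointer to a shared buffer.
class rc_str {
    struct rep {
        std::size_t refs;          // number of rc_str objects sharing this buffer
        const std::string chars;   // contents, never modified after construction
        explicit rep(const char* s) : refs(1), chars(s) {}
    };
    rep* rep_;

    void release() noexcept {
        if (rep_ && --rep_->refs == 0) delete rep_;
        rep_ = nullptr;
    }

public:
    rc_str() noexcept : rep_(nullptr) {}                  // 'cheap': no allocation
    explicit rc_str(const char* s) : rep_(new rep(s)) {}  // 'expensive'; s must be non-null
    rc_str(const rc_str& other) noexcept : rep_(other.rep_) {
        if (rep_) ++rep_->refs;                           // 'cheap': share the buffer
    }
    rc_str& operator=(rc_str other) noexcept {            // copy-and-swap, never allocates
        std::swap(rep_, other.rep_);
        return *this;
    }
    ~rc_str() { release(); }

    const char* c_str() const noexcept { return rep_ ? rep_->chars.c_str() : ""; }
    std::size_t use_count() const noexcept { return rep_ ? rep_->refs : 0; }
};
```

Because the object holds a single pointer, its size equals the pointer size, which is consistent with the sizeof value of 4 reported for fix_str in the comparison table on a 32-bit platform.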

Using the Code

Examples:

// constructors for 0 - 8 arguments
fix_str fs ("Hello", " ", "world", "!");
fix_str fs2 (fs, " and again ", fs);
// no dynamic allocation for assignment, copying, and empty fix_str
fix_str fs4;
fix_str fs5 (fs);
fs4 = fs5;
// non-static member functions (and friends)
size_t pos = fs.find ("world"); // pos: 6
pos = fs.rfind ("Hell"); // pos: 0
long h = fs.hash_code();
if (fs == fs2) { ... }
if (fs2 > fs) { ... }
// static member functions create a new fix_str object
fs = fix_str::sub_str (fs, 5); // fs: " world!"
fs = fix_str::trim (fs); // fs: "world!"
fs = fix_str::pad_front (fs, 9, '.'); // fs: "...world!"
fs = fix_str::value_of (123); // fs: "123"

Four Types of fix_str Functions

You can distinguish four groups of fix_str functions:

  1. default constructor, copy constructor, assignment operator: these functions do not allocate heap memory and have the exception specification throw().
  2. constructors for 1 - 8 arguments (fix_strs or character strings).
  3. non-static member functions like find(), rfind(), hash_code() and (friend) operators ==, !=, <, >, <=, >=; these also have the exception specification throw().
  4. static member functions: sub_str(), duplicate(), trim_front(), trim_back(), trim(), pad_front(), pad_back(), value_of(); these create a new fix_str object and therefore allocate heap memory (of course, e.g. trim_front() only creates a new object if a trim is necessary, otherwise it just returns the input).

One design goal for fix_str is to clearly separate 'expensive' and 'cheap' functions. You always know the cost of each function call when you write it. There are no hidden, but sometimes expensive, 'optimizations' behind your back.
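The behaviour of the static functions in group 4 can be sketched as follows. The str_handle type and the free functions make_str(), trim() and duplicate() are hypothetical stand-ins built on std::shared_ptr, not the fix_str API, and this trim() only strips the space character.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical stand-in for a fix_str-like handle: a shared, read-only buffer.
typedef std::shared_ptr<const std::string> str_handle;

str_handle make_str(const std::string& s) {
    return std::make_shared<std::string>(s);   // 'expensive': allocates
}

// Allocates only when a trim is necessary, otherwise returns the input handle.
str_handle trim(const str_handle& in) {
    const std::string& s = *in;
    const std::size_t first = s.find_first_not_of(' ');
    if (first == std::string::npos) {
        return s.empty() ? in : make_str("");  // empty, or nothing but spaces
    }
    const std::size_t last = s.find_last_not_of(' ');
    if (first == 0 && last == s.size() - 1) return in;   // nothing to trim
    return make_str(s.substr(first, last - first + 1));
}

// Always allocates: the result shares nothing with the input.
str_handle duplicate(const str_handle& in) {
    return make_str(*in);
}
```

The caller can tell from the function name alone whether a call may allocate; comparing the returned handle with the input shows whether it actually did.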

Points of Interest

Unicode and Multi-Threading

Why four fix_str classes?

There are different classes for:

  • ASCII (char) and UNICODE (wchar_t) strings (similar to Win32-API functions).
  • Single- and multi-threaded environments.

Strictly speaking, having different classes for single- and multi-threaded environments indicates that an implementation detail (reference-counting) shows up in the class interface.

The fix_str variants:

                            | char (ASCII) | wchar_t (UNICODE)
Single-Threaded Environment | fix_str_as   | fix_str_ws
Multi-Threaded Environment  | fix_str_am   | fix_str_wm

About 'Thread-Safety'

The term "thread safety" is sometimes used with unclear or ambiguous meaning, especially in C++. One must always ask: 'Thread safe in what respect?'. I don't call fix_str classes 'thread safe'. Some are usable in multi-threaded environments.

fix_str Classes for Multi-Threaded Environments

fix_str objects which are used in different threads may share the same internal representation and hence the same reference-counter (because they are copies of each other). In this case, atomic increment and decrement of the reference-counter must be assured internally by the implementation. This is what the fix_str classes for multi-threaded environments, fix_str_am and fix_str_wm, guarantee. As a rule of thumb, take these classes when you use copies of the same object in different threads.
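A minimal sketch of an atomically updated reference counter is shown below. It uses std::atomic from C++11, which postdates this article, so it is an assumption about how the idea could be written today rather than the mechanism fix_str_am and fix_str_wm actually use; shared_rep, acquire() and release() are hypothetical names.

```cpp
#include <atomic>
#include <cstddef>
#include <string>

// Hypothetical shared buffer whose reference counter is updated atomically.
struct shared_rep {
    std::atomic<std::size_t> refs;
    const std::string chars;       // read-only after construction
    explicit shared_rep(const char* s) : refs(1), chars(s) {}
};

// Called when a copy is made: the increment is a single atomic operation.
inline shared_rep* acquire(shared_rep* r) noexcept {
    r->refs.fetch_add(1, std::memory_order_relaxed);
    return r;
}

// Called when a copy is destroyed. Returns true if this call dropped the
// last reference and deleted the buffer.
inline bool release(shared_rep* r) noexcept {
    if (r->refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
        delete r;
        return true;
    }
    return false;
}
```

Only the counter needs this protection: the characters are never modified after construction, so concurrent readers do not conflict.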

But it is never safe for two or more threads to concurrently write to (assign to) the same fix_str object (remember, assignment is the only way to change any fix_str object). Concurrent writes to the same object must always be protected by the user.
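One way for the user to provide that protection is to guard the shared variable with a mutex. The guarded_str class below is a hypothetical sketch that uses std::shared_ptr as a stand-in for a fix_str-like handle; it is not part of the fix_str code.

```cpp
#include <memory>
#include <mutex>
#include <string>

// Hypothetical wrapper around one shared variable that several threads reassign.
class guarded_str {
public:
    explicit guarded_str(const std::string& s)
        : value_(std::make_shared<std::string>(s)) {}

    // Writers build the new value outside the lock, then assign under the lock.
    void set(const std::string& s) {
        std::shared_ptr<const std::string> fresh = std::make_shared<std::string>(s);
        std::lock_guard<std::mutex> lock(mutex_);
        value_ = fresh;
    }

    // Readers take a cheap copy of the handle under the lock, then use it freely.
    std::shared_ptr<const std::string> get() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return value_;
    }

private:
    mutable std::mutex mutex_;
    std::shared_ptr<const std::string> value_;
};
```

A handle obtained from get() keeps referring to the old contents even after another thread calls set(), because the contents themselves are never modified.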

fix_str Classes for Single-Threaded Environments

On the other hand, concurrency problems cannot occur when you:

  • work exclusively in a single-threaded environment or
  • never pass copies of fix_str objects between threads

In these cases you may prefer the slightly faster fix_str classes for single-threaded environments, fix_str_as and fix_str_ws. Hint: the static member function fix_str::duplicate() can be used to create completely independent copies of fix_str objects (no shared reference-counter, see also the function documentation).

Win32

There is a default typedef in fix_str.h for fix_str, dependent on the definition of the macros _MT (Multi-Threaded) and _UNICODE.

#if    defined (_UNICODE) &&  defined (_MT) // UNICODE, Multi-Threaded
  typedef fix_str_wm fix_str;
#elif  defined (_UNICODE) && !defined (_MT) // UNICODE, Single-Threaded
  typedef fix_str_ws fix_str;
#elif !defined (_UNICODE) &&  defined (_MT) // ASCII,   Multi-Threaded
  typedef fix_str_am fix_str;
#elif !defined (_UNICODE) && !defined (_MT) // ASCII,   Single-Threaded
  typedef fix_str_as fix_str;
#endif

You can use fix_str in the familiar Win32 style, including the popular but annoying _T() macro:

fix_str fs (_T("Hello, world!"));

Each fix_str_xx class is available individually. You may even use different fix_str_xx classes in the same application:

fix_str_wm fs1 (L"Hello, world!");
fix_str_as fs2 ("Hello, world!");

The fix_str classes also work in non-Windows environments (at least the single-threaded variants).

Limitations

  • no operator+ is provided for performance reasons; instead use a constructor (actually, this is a feature, not a limitation):
    fs = fix_str ("Use ", "a ", "constructor ", "to ",
                  "efficiently ", "concatenate ", "strings");
  • embedded null characters ('\0') are not possible since fix_str is based on standard C string functions.
  • intended for fixed-width character encodings. UTF-16, the encoding standard at Microsoft (and on the Macintosh, on the Java platform, ...), is fixed-width only for characters in the Basic Multilingual Plane; characters outside it are encoded as surrogate pairs. fix_str objects are compared 'binary', i.e. they are equal only if they contain the same sequence of bytes.
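The performance argument for concatenating in a constructor instead of with operator+ can be sketched as follows. The concat() helper is hypothetical and returns a std::string; it only illustrates the measure-then-copy technique and is not the fix_str constructor.

```cpp
#include <cstddef>
#include <cstring>
#include <initializer_list>
#include <string>

// Hypothetical helper: joins all parts after reserving the buffer once.
std::string concat(std::initializer_list<const char*> parts) {
    std::size_t total = 0;
    for (const char* p : parts) total += std::strlen(p);   // pass 1: measure

    std::string result;
    result.reserve(total);                                 // reserve once
    for (const char* p : parts) result.append(p);          // pass 2: copy
    return result;
}
```

Because all parts are known at the call site, the total length can be computed before any characters are copied. A chain of operator+ calls, by contrast, may build an intermediate string for each step.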

Other 'Immutable' String Implementations in C++

  • const_string<>: in sum, 'immutable' but with some mutating functions, 'Boost-style' and 'boost' namespace but not a Boost library, 'thread safe' but not safe for concurrent writes.

Conclusion

fix_str is a set of lightweight string classes akin to the immutable string classes in other languages. You may consider using fix_str when a string changes rarely but is copied frequently, e.g. when a container is sorted and when set()/get() functions for strings are called a lot.

History

  • October 18, 2005 - Submission to CodeProject.
  • October 28, 2005 - Submission of updated article to CodeProject.
    • Article refactored, especially paragraphs 'Unicode and Multi-Threading' and 'About Thread-Safety' rewritten for more clarity (hopefully).