
Non MFC and Non STL string class: EsString

25 Nov 2004
Standalone refcounted string class.

Introduction

The EsString class described here is one more attempt (many other people have made theirs) to create a lightweight string class that is independent of other libraries and provides clone-on-write: several string objects may refer to the same actual string until one of them needs to change its contents.

In brief, what this string class is:

  1. copy-on-write string with chunk memory allocation (compiled conditionally).
  2. doesn't depend on ATL, MFC, or STL.
  3. internally relies solely on CRT calls and, if Windows support is enabled, on the Win API.
  4. may be used (with some restrictions) as a buffer of items rather than a zero-terminated string.
  5. provides conversion from single-byte chars to wide chars and vice versa (though the conversion is quite dumb).
  6. implements concatenation operators.
  7. provides string comparison, as operators as well as in functional form; case-insensitive comparison is included.
  8. provides string case manipulation: ToLower and ToUpper.
  9. provides substring extraction.
  10. provides left, right, or whole-string trimming by a specified char pattern.
  11. provides string replacement.
  12. provides string search methods:
    • for a specified char, forward and in reverse, starting from a specified position.
    • for a substring, forward and in reverse, starting from a specified position.
    • for the first char that is in (or not in) a char pattern, forward and in reverse, starting from a specified position.
  13. provides a C-style formatting method.
  14. provides simple numerical-to-string conversion for the double and int types.
  15. can represent an arbitrary binary buffer as a hex string.
  16. if used with ES_WINDOWS defined, it additionally can:
    • load strings from Windows resources
    • obtain textual information about system error codes
    • convert from BSTR strings

...and what it is not so far:

  1. it doesn't provide thread safety, though that is planned.
  2. it doesn't support MBCS at all; don't rely on it if you need proper handling of MBCS locales.

Background

I believe programmers like using string classes. As for myself, I'm a lazy developer: I hate keeping track of endless string buffer (re)(de)allocations, as well as thinking about how long each buffer should be to hold all the necessary data without causing overflows, etc. That's what one does when using only CRT calls to manipulate strings. Also, one should keep in mind that if A, B, and C point to the same memory block filled with (string) data, and later some changes are made to this block via B, then A and C will obviously see the changed sequence as well, which may not be what one wants. Well, let's see what's on the standard menu. Basically, there are:

  1. ATL|MFC string,
  2. VCL (one from C++ Builder) string,
  3. quite stand-alone _bstr_t string,
  4. STL string.

Strings (1)-(3) use the clone-on-write (or copy-on-write) approach. In a nutshell, you may have many string object instances that internally refer to one shared string buffer. Copy assignments are fast: no memory allocation or actual buffer copying is needed, just a change of the shared data's reference count. Only when you change one of these string objects does it internally create an exact copy of the originally referenced buffer, release the previously referenced buffer, and become the only referrer of the new one. This makes it possible to pass such string objects by value, because the object itself is relatively small and no actual string copy occurs.
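To make the mechanics concrete, here is a minimal sketch of the copy-on-write idea using CRT calls only. It is not the EsString implementation; the names SharedBuf, MakeBuf, ReleaseBuf, CowString, SetAt, and SharesBufferWith are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical shared buffer: a reference count plus the characters.
struct SharedBuf {
    long refs;
    std::size_t len;
    char* data;
};

// Allocates a new buffer with refs == 1 and copies len chars from src.
static SharedBuf* MakeBuf(const char* src, std::size_t len) {
    SharedBuf* b = static_cast<SharedBuf*>(std::malloc(sizeof(SharedBuf)));
    assert(b != NULL);
    b->refs = 1;
    b->len = len;
    b->data = static_cast<char*>(std::malloc(len + 1));
    assert(b->data != NULL);
    std::memcpy(b->data, src, len);
    b->data[len] = '\0';
    return b;
}

// Drops one reference and frees the buffer when nobody refers to it.
static void ReleaseBuf(SharedBuf* b) {
    if (--b->refs == 0) {
        std::free(b->data);
        std::free(b);
    }
}

// Hypothetical copy-on-write string (illustration only).
class CowString {
public:
    explicit CowString(const char* s) : m_buf(MakeBuf(s, std::strlen(s))) {}
    // Copying only bumps the reference count; no characters are copied.
    CowString(const CowString& other) : m_buf(other.m_buf) { ++m_buf->refs; }
    CowString& operator=(const CowString& other) {
        ++other.m_buf->refs;  // increment first, so self-assignment is safe
        ReleaseBuf(m_buf);
        m_buf = other.m_buf;
        return *this;
    }
    ~CowString() { ReleaseBuf(m_buf); }

    const char* c_str() const { return m_buf->data; }
    bool SharesBufferWith(const CowString& o) const { return m_buf == o.m_buf; }

    // Mutating access: clone the buffer first if somebody else refers to it.
    void SetAt(std::size_t idx, char ch) {
        assert(idx < m_buf->len);
        if (m_buf->refs > 1) {
            SharedBuf* clone = MakeBuf(m_buf->data, m_buf->len);
            ReleaseBuf(m_buf);
            m_buf = clone;
        }
        m_buf->data[idx] = ch;
    }

private:
    SharedBuf* m_buf;
};
```

Copying a CowString shares the buffer; the first SetAt on a shared copy detaches it, so the original keeps its contents.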

STL string (4) is plain and straightforward: unless you use references or pointers to string objects, you will have as many copies of the string contents as were created via assignment operators or copy constructors. Safe, but dumb. But the STL string has one advantage: it is actually not a string but a buffer of characters, so you may have an STL string of length n containing n binary zeroes.
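The "buffer of characters" point can be shown with a short sketch. The helper names MakeZeroBuffer and CStyleLength are hypothetical; only std::string and std::strlen are real library facilities.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Builds a std::string that holds n binary zeroes as real elements.
inline std::string MakeZeroBuffer(std::size_t n) {
    std::string s(n, '\0');
    assert(s.size() == n);
    return s;
}

// Length as a C string function sees it: counting stops at the first zero.
inline std::size_t CStyleLength(const std::string& s) {
    return std::strlen(s.c_str());
}
```

For a buffer made with MakeZeroBuffer(4), size() reports 4 while CStyleLength reports 0, which is exactly the difference between a character buffer and a zero-terminated string.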

So, my intention was to use the copy-on-write approach found in (1)-(3) and to have the ability to store binary zeroes (though the latter wasn't the main goal). Also, I tried to stay away from using (and thus depending on) something like ATL, MFC, VCL (for God's sake), or STL, unless it's really needed.

Using the code

The code consists of two header files with self-explanatory names: EsRefCounted.hpp and EsString.hpp. The former contains:

//refcounted base 
class EsRefCounted;


//refcounted smartptr template
//class ValueT is supposed to inherit from EsRefCounted 
template <class ValueT> 
class EsRefCountedPtrT;

EsRefCounted is the base class for a shared data holder; that's what EsRefCountedPtrT-derived wrapper classes expect to hold. The actual string data holder and manipulator class, EsStringValueT, described below, inherits from EsRefCounted.

The EsRefCountedPtrT class template provides all the basic refcounting logic and an overloaded operator =, so that all = assignments between EsRefCountedPtrT-derived objects of the same type go through it.
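A rough sketch of how such a pair of classes can fit together follows. It is written under my own assumptions and is not the contents of EsRefCounted.hpp; RefCounted, RefCountedPtr, and Payload are hypothetical names.

```cpp
#include <cassert>

// Hypothetical refcounted base: the object carries its own counter.
class RefCounted {
public:
    RefCounted() : m_refs(0) {}
    virtual ~RefCounted() {}
    void AddRef() { ++m_refs; }
    // Returns true when the last reference has just been dropped.
    bool Release() { assert(m_refs > 0); return --m_refs == 0; }
    long RefCount() const { return m_refs; }
private:
    long m_refs;
    RefCounted(const RefCounted&);             // not copyable
    RefCounted& operator=(const RefCounted&);  // not assignable
};

// Hypothetical smart pointer; ValueT is supposed to inherit from RefCounted.
template <class ValueT>
class RefCountedPtr {
public:
    explicit RefCountedPtr(ValueT* p = 0) : m_p(p) { if (m_p) m_p->AddRef(); }
    RefCountedPtr(const RefCountedPtr& o) : m_p(o.m_p) { if (m_p) m_p->AddRef(); }
    ~RefCountedPtr() { Drop(); }

    // Assignment shares the pointee instead of copying it.
    RefCountedPtr& operator=(const RefCountedPtr& o) {
        ValueT* incoming = o.m_p;            // remember before Drop() (self-assignment)
        if (incoming) incoming->AddRef();
        Drop();
        m_p = incoming;
        return *this;
    }

    ValueT* Get() const { return m_p; }

private:
    // Releases our reference and deletes the pointee if we were the last one.
    void Drop() {
        if (m_p && m_p->Release()) delete m_p;
        m_p = 0;
    }
    ValueT* m_p;
};

// Example payload used only to exercise the pointer.
struct Payload : public RefCounted {
    int value;
    explicit Payload(int v) : value(v) {}
};
```

Keeping the counter inside the pointee means the smart pointer itself stays the size of a raw pointer, which is what makes passing such objects by value cheap.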

EsString.hpp contains several helper templates, the refcounted string data holder, and the actual string class template derived from EsRefCountedPtrT.

//helper classes
template <bool IsWide, typename CharT>
struct EsCharTraitsBaseT; //main template

//non-wide char template partial specialization
template <typename CharT>
struct EsCharTraitsBaseT<false, CharT>

//wide char partial specialization
template <typename CharT>
struct EsCharTraitsBaseT<true, CharT>

//chartype helper struct
template <typename CharT>
struct EsIsWideCharTypeT 
{
    enum {Yes = sizeof(CharT) > sizeof(char), No = !Yes};
};

template <typename CharT >
class EsCharTraitsT : EsCharTraitsBaseT< 
                      EsIsWideCharTypeT<CharT>::Yes, CharT >

First, why the hell are these helpers needed at all? String manipulation internally uses CRT function calls. Of course, these functions have different names depending on whether they work on single-byte or wide strings and characters. OK, one may say, why not use the uniform tchar mappings? Tchar mappings are OK unless you have to use single-byte and wide strings at the same time, and I wanted each particular string template instantiation to "decide" which branch of CRT (and, in some cases, WinAPI) functions to use. Also, some methods should work differently internally, BSTR-to-string conversions for example, while these differences should be hidden from the string class itself. The "topmost" helper template, EsCharTraitsT, provides uniform static methods for generic CharT sequence manipulation, as well as a (static const) member equal to the byte size of the CharT used for the current template instantiation. Why static? EsCharTraitsT is just a placeholder for internally used code specific to a concrete CharT type. It never needs to be instantiated as an object. All it's used for are EsCharTraitsT<>::SomeMethod() calls inside the string methods.
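The dispatch described above can be sketched as follows. This is a simplified illustration, not the code from EsString.hpp: TraitsBase, IsWide, Traits, Length, Compare, CharSize, and ByteLength are hypothetical names, and only two CRT functions per branch are wrapped. It assumes the templates are instantiated for char and wchar_t only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Main template, selected by a compile-time bool.
template <bool IsWideFlag, typename CharT>
struct TraitsBase;

// Narrow branch: forwards to the char flavour of the CRT.
template <typename CharT>
struct TraitsBase<false, CharT> {
    static std::size_t Length(const CharT* s) { return std::strlen(s); }
    static int Compare(const CharT* a, const CharT* b) { return std::strcmp(a, b); }
};

// Wide branch: forwards to the wchar_t flavour of the CRT.
template <typename CharT>
struct TraitsBase<true, CharT> {
    static std::size_t Length(const CharT* s) { return std::wcslen(s); }
    static int Compare(const CharT* a, const CharT* b) { return std::wcscmp(a, b); }
};

// Decides "wide or not" from the size of the character type.
template <typename CharT>
struct IsWide {
    enum { Yes = sizeof(CharT) > sizeof(char) };
};

// Topmost helper: picks the right branch and exposes the char size.
template <typename CharT>
struct Traits : TraitsBase<(IsWide<CharT>::Yes != 0), CharT> {
    enum { CharSize = sizeof(CharT) };
};

// Generic code that never names strlen or wcslen directly.
template <typename CharT>
std::size_t ByteLength(const CharT* s) {
    return Traits<CharT>::Length(s) * static_cast<std::size_t>(Traits<CharT>::CharSize);
}
```

Because every member is static, nothing is ever constructed: the compiler resolves Traits<char>::Length and Traits<wchar_t>::Length to the matching CRT call at compile time, and both instantiations can coexist in one translation unit, which tchar mappings cannot offer.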

//refcounted string base (actually just data container)
template <typename CharT>
class EsStringValueT : public EsRefCounted

The EsStringValueT class provides (inlined where appropriate) methods for string search, manipulation, formatting, extraction, etc. The referrer wrapper class basically delegates its calls to the corresponding methods of this object, keeping the string buffer unique (unshared) when needed, and extending EsStringValueT methods appropriately.

//EsString class template
template <typename CharT>
class EsStringT : public EsRefCountedPtrT< EsStringValueT<CharT> >

This is the main string "worker", followed by its implementation. The EsString.hpp file also contains typedefs of the template for single-byte and wide chars, as well as some UNICODE mappings:

typedef EsStringT<char>    EsStringA;
typedef EsStringT<wchar_t> EsStringW;
//if std::list was included
#ifdef _LIST_
    typedef std::list<EsStringA> EsAStrings;
    typedef std::list<EsStringW> EsWStrings;
#endif //_LIST_
//unicode-specific string defines
#ifdef _UNICODE
    #define EsString EsStringW
    //if std::list was included
    #ifdef _LIST_
        #define EsStrings EsWStrings
    #endif //_LIST_
#else
    #define EsString EsStringA
    //if std::list was included
    #ifdef _LIST_
        #define EsStrings EsAStrings
    #endif //_LIST_
#endif //_UNICODE

Well, there are two strings defined, for char and wchar_t, plus a "standard" string that is char or wchar_t based depending on the _UNICODE flag. In addition, if the STL <list> header is included somewhere before the EsString.hpp header, string lists based on std::list become available. Actually, this last piece of code is a legacy thing: the string header was cut out of the project I'm currently working on, and I just decided to leave it as-is.

Features:

What follows is a detailed description of the interface provided by the EsString class; the ES_WINDOWS symbol allows you to include or exclude the Windows-dependent stuff:

//constructors
    EsStringT() //default constructor
    //construct from CharT sequence, if nCount == 0 - we suppose
    // that sequence is zero terminated, otherwise,
    // nCount items will be copied from it,
    //and if bAddZeroTerminator == false in that case,
    // resulting buffer may be not zero - terminated string
    EsStringT( const CharT *pStr, size_t nCount = 0, 
                                  bool bAddZeroTerminator = true ) 
    //initialize our contents to nCount cCh chars
    EsStringT( CharT cCh, size_t nCount = 1 )
    //initialize ourselves from string of other type,
    //making implicit char conversion
    template <typename OtherCharT>
    EsStringT( const EsStringT<OtherCharT>& crefOther )
//OS dependent
#ifdef ES_WINDOWS
    //copy BSTR contents to us, releasing source BSTR
    // if bReleaseBSTR is set; BSTR to single byte conversion
    // performed for EsStringT based on char
    explicit EsStringT( BSTR pStr, bool bReleaseBSTR = true )
#endif //ES_WINDOWS
//raw access - intended for use when char* based buffer
// access is needed, regardless of the actual CharT type
    //raw buffer length 
    inline size_t    GetRawLen() const
    //type casts
    inline const char* GetRaw() const
//string-like access
    inline size_t    GetLen() const
    inline const CharT* c_str() const
    inline const CharT& At(size_t nIdx) const
    //non-const reference method should create contents clone,
    // to make sure we don't change shared contents
    inline CharT& At(size_t nIdx)
    inline const CharT& operator[] (size_t nIdx) const
    //non-const reference operator implicitly 
    //calls non-const reference method
    inline CharT& operator[] (size_t nIdx)
    //return true if contained string is zero-terminated.
    // this method is not 100% guarantee that object
    // contains zero -terminated string though
    //it just tests the last char in internal buffer,
    // but we may use this class as sequential container,
    // so it may contain more zero items somewhere
    inline bool IsZeroTerminated() const
    //char position search
    inline int GetPos( CharT cCh, int iFrom = 0 ) const
    inline int GetRPos( CharT cCh, int iFrom ) const
    //char match search:
    //find pos of the first char listed in strPattern
    inline int FindFirstIn( const CharT* strPattern, 
                            int iFrom = 0 ) const
    //reverse find pos of the first char listed in strPattern
    inline int RFindFirstIn( const CharT* strPattern, int iFrom ) const
    //find pos of the first char not listed in strPattern
    inline int FindFirstNotIn( const CharT* strPattern, 
                               int iFrom = 0 ) const
    //reverse find pos of the first char not listed in strPattern
    inline int RFindFirstNotIn( const CharT* strPattern, 
                                int iFrom ) const
    //substring pos search
    inline int GetPos( const EsStringT<CharT> strPattern, 
                                                 int iFrom = 0 ) const
    inline int GetRPos( const EsStringT<CharT> strPattern, int iFrom ) const
//comparison
    inline int Compare( const EsStringT<CharT>& crefOther ) const
    inline int CompareIC( const EsStringT<CharT>& crefOther ) const
//string manipulation
    //addition
    inline void Add( const EsStringT<CharT>& crefOther )
    //replacement
    void Replace( const EsStringT<CharT>& crefPattern, 
                  const EsStringT<CharT>& crefReplaceBy )
    //trimming
    //patterned trimming
    inline void TrimLeft(const CharT* strPattern)
    inline void TrimRight(const CharT* strPattern)
    inline void Trim(const CharT* strPattern)
    //extraction
    inline EsStringT SubString(size_t nStart, int iCount = -1) const
    //lower\upper
    inline const EsStringT& ToLower() 
    inline const EsStringT& ToUpper() 
//formatters
    // standard formatter
    // alas, internally it uses _vscwprintf or _vsnprintf to calc
    // buffer length required to hold formatted results
    // and so far, these functions exist AFAIK in MS CRT only
    void Format(const EsStringT<CharT> strFormat, ...)
    //bin-to-hex formatter, strHexPfx is used to specify some
    // custom hex prefix, like 0x, for instance
    inline const EsStringT& BinToHex( const char* pBuff, 
           size_t nLen, const CharT* strHexPfx = NULL )
//conversion
    //value access helper. don't const cast its result!!
    inline const BaseValT* GetValue() const
    //conversion from other string
    template <class OtherCharT>
    inline void ConvertFrom( const EsStringT<OtherCharT>& crefSrc )
    //OS dependent stuff
#ifdef ES_WINDOWS
    inline void ConvertFrom(const BSTR pSrc, bool bReleaseBSTR = true)
#endif //ES_WINDOWS
    // assignment from other string type. if strings of the same
    // type are assigned, EsRefCountedPtrT
    // assignment operator is used instead
    template <class OtherCharT>
    inline void operator= (const EsStringT<OtherCharT>& crefSrc)
    //numerical conversions
    inline void ToString( double dVal )
    inline void ToString( int iVal )
//operators
    //addition
    inline EsStringT operator+ ( const EsStringT<CharT>& crefOther )
    inline void operator+= ( const EsStringT<CharT>& crefOther )
    //comparison
    inline bool operator< ( const EsStringT<CharT>& crefOther ) const
    inline bool operator== ( const EsStringT<CharT>& crefOther ) const
    inline bool operator> ( const EsStringT<CharT>& crefOther ) const
    inline bool operator!= ( const EsStringT<CharT>& crefOther ) const
    inline bool operator<= ( const EsStringT<CharT>& crefOther ) const
    inline bool operator>= ( const EsStringT<CharT>& crefOther ) const
//utilities
  // trailing char check, may be useful
  // in path backslash addition\checking
  static EsStringT IncludeTrailingChar( const EsStringT<CharT> strSrc, 
                                        CharT chTrail )
  //unicity check. return true if we're the only 
  //referrer of contained string buffer
  inline bool IsUnique() const
  //make sure we're unique referrer of contents
  inline void Unique()
//OS - dependent stuff
#ifdef ES_WINDOWS 
  //standard error description extraction
  static EsStringT GetErrorDescription(int iErrorCode, 
         DWORD nLangId = MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT) )
  //assign extracted value to itself
  inline void AssignErrorDescription(int iErrorCode, 
         DWORD nLangId = MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT) )
  //load string from resource
  static EsStringT GetResourceString( UINT uID, 
         HINSTANCE hInstance = NULL, size_t nSizeMax = 1024 )
  //assign loaded resource string to us
  inline void AssignResourceString( UINT uID, 
         HINSTANCE hInstance = NULL, size_t nSizeMax = 1024 )
#endif //ES_WINDOWS
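The comment on Format in the listing above notes that the buffer length is calculated with functions found only in the MS CRT. As a side note, on a C++11 (or C99) runtime the same "measure first, then format" idea can be expressed with std::vsnprintf, which returns the required length when called with no buffer. This is a different technique from the one the class uses, shown only as a sketch; FormatAlloc is a hypothetical helper, not part of EsString.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: returns a malloc'ed, zero-terminated formatted string.
// The caller releases the result with std::free(). Returns NULL on failure.
char* FormatAlloc(const char* fmt, ...) {
    assert(fmt != NULL);
    va_list args;
    va_start(args, fmt);
    va_list argsCopy;
    va_copy(argsCopy, args);  // the argument list is consumed twice

    // First pass: no buffer, only measure the required length.
    const int needed = std::vsnprintf(NULL, 0, fmt, args);
    va_end(args);

    char* out = NULL;
    if (needed >= 0) {
        const std::size_t size = static_cast<std::size_t>(needed) + 1;
        out = static_cast<char*>(std::malloc(size));
        if (out != NULL) {
            // Second pass: format into a buffer of exactly the right size.
            std::vsnprintf(out, size, fmt, argsCopy);
        }
    }
    va_end(argsCopy);
    return out;
}
```

Older MS CRT versions did not follow this return-value convention for their _vsnprintf, which is presumably why a dedicated length-counting function was needed there.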

History

  • 11.24.2004 - Added compilation flag and code for chunk memory allocation, fixed several minor bugs and typos. Demo project updated to allow for performance counting.
  • 11.10.2004 - First release.

Compatibility & Performance

First, I've tried to make the source compatible with three compilers that I use often. It was developed primarily under VC++ .NET; the other two are BCC from Borland C++ Builder 6 and a recent GCC. Unfortunately, MS compilers from earlier Visual Studio releases don't "understand" partial template specialization, so that branch was omitted; I just didn't want to spend time working around cl.exe bugs. The code as it is now should compile and run under BCC and .NET's cl; GCC may also work, but I didn't run a GCC-compiled demo.

As for performance testing, so far I've implemented simple code that performs 10^7 character summations (concatenations), repeated over 10 iterations for statistics, measuring the time taken by each iteration as well as the memory allocated within it. The "memory used" parameter is good only for a very, very rough estimate, because it's rather inaccurate. If you have experience in getting exact values of the memory allocated by a process, please advise. The times shown below were measured on a PM-1400 notebook with 512 MB of RAM.
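For readers who want to repeat a measurement of this kind, a timing loop can be sketched like this. It is not the demo project's code: TimeAppends is a hypothetical name, std::chrono (C++11) is used for timing, and the sketch works with any string type that supports += with a character. Memory measurement is left out.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <string>

// Hypothetical harness: appends `count` characters one by one to `target`
// and returns the elapsed wall-clock time in seconds.
template <class StringT>
double TimeAppends(std::size_t count, StringT& target) {
    typedef std::chrono::steady_clock Clock;
    const Clock::time_point start = Clock::now();
    for (std::size_t i = 0; i < count; ++i) {
        target += 'x';
    }
    const std::chrono::duration<double> elapsed = Clock::now() - start;
    return elapsed.count();
}
```

Calling TimeAppends(10000000, s) ten times, each time with a fresh string s, would mirror the shape of the runs listed below; the absolute numbers will of course differ on other hardware and compilers.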

Borland C++ Builder 6:

AnsiString, 10000000 summations
  #0 Run for: 2.58 sec, used mem: 10035200 bytes
  #1 Run for: 2.39 sec, used mem: 10010624 bytes
  #2 Run for: 2.39 sec, used mem: 10039296 bytes
  #3 Run for: 2.39 sec, used mem: 9981952 bytes
  #4 Run for: 2.39 sec, used mem: 9973760 bytes
  #5 Run for: 2.39 sec, used mem: 10047488 bytes
  #6 Run for: 2.39 sec, used mem: 9969664 bytes
  #7 Run for: 2.40 sec, used mem: 10059776 bytes
  #8 Run for: 2.38 sec, used mem: 9969664 bytes
  #9 Run for: 2.39 sec, used mem: 9994240 bytes

EsString, 10000000 summations:
  #0 Run for: 1.47 sec, used mem: 9273344 bytes
  #1 Run for: 1.30 sec, used mem: 9670656 bytes
  #2 Run for: 1.32 sec, used mem: 9707520 bytes
  #3 Run for: 1.31 sec, used mem: 9752576 bytes
  #4 Run for: 1.32 sec, used mem: 9576448 bytes
  #5 Run for: 1.30 sec, used mem: 9859072 bytes
  #6 Run for: 1.31 sec, used mem: 10010624 bytes
  #7 Run for: 1.32 sec, used mem: 9719808 bytes
  #8 Run for: 1.30 sec, used mem: 10010624 bytes
  #9 Run for: 1.31 sec, used mem: 10043392 bytes

STL string, 10000000 summations: bad_alloc assertion after 30sec run, used mem > 1Gb.

MS VC++ .NET:

MFC\ATL string, 10000000 summations:
#0      Run for: 38.15 sec,     used mem: 9039872 bytes
#1      Run for: 38.55 sec,     used mem: 9969664 bytes
#2      Run for: 38.56 sec,     used mem: 9977856 bytes
#3      Run for: 38.11 sec,     used mem: 25337856 bytes
#4      Run for: 38.80 sec,     used mem: 5672960 bytes
#5      Run for: 38.77 sec,     used mem: 2985984 bytes
#6      Run for: 38.79 sec,     used mem: 10039296 bytes
#7      Run for: 38.92 sec,     used mem: 6062080 bytes
#8      Run for: 38.76 sec,     used mem: 10522624 bytes
#9      Run for: 38.78 sec,     used mem: 9994240 bytes

EsString, 10000000 summations:
#0      Run for: 1.06 sec,      used mem: 10055680 bytes
#1      Run for: 1.05 sec,      used mem: 10047488 bytes
#2      Run for: 1.05 sec,      used mem: 9998336 bytes
#3      Run for: 1.05 sec,      used mem: 10027008 bytes
#4      Run for: 1.05 sec,      used mem: 10022912 bytes
#5      Run for: 1.04 sec,      used mem: 9977856 bytes
#6      Run for: 1.04 sec,      used mem: 10043392 bytes
#7      Run for: 1.06 sec,      used mem: 10027008 bytes
#8      Run for: 1.04 sec,      used mem: 10027008 bytes
#9      Run for: 1.05 sec,      used mem: 10006528 bytes

EsString, without chunk allocation
#0      Run for: 41.74 sec,     used mem: 9322496 bytes
#1      Run for: 42.10 sec,     used mem: 9830400 bytes
#2      Run for: 42.13 sec,     used mem: 9973760 bytes
#3      Run for: 42.14 sec,     used mem: 9912320 bytes
#4      Run for: 42.14 sec,     used mem: 9986048 bytes
#5      Run for: 42.14 sec,     used mem: 10006528 bytes
#6      Run for: 42.14 sec,     used mem: 9654272 bytes
#7      Run for: 42.12 sec,     used mem: 10080256 bytes
#8      Run for: 42.11 sec,     used mem: 10027008 bytes
#9      Run for: 42.12 sec,     used mem: 9953280 bytes

These tests show that the EsString class performs quite well with chunk memory allocation switched on. It clearly outperforms the standard strings during massive concatenations while using roughly the same amount of memory. The closest rival is, to my surprise, Borland's AnsiString, which is only about 1.8 times slower in the Borland build (roughly 2.39 sec against 1.31 sec per run). MS's CString seems surprisingly slow; I believe its memory allocation policy is responsible. For comparison, I ran the same test without chunk memory allocation, and EsString showed run times relatively close to CString's, although in that case the latter was slightly faster (about 38.6 sec against 42.1 sec). I didn't test the small string case, because I believe that the relative results would be the same, except for CString, which uses a statically (on-stack) allocated buffer for short strings, and that might boost its performance.
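Why chunk allocation matters so much for character-by-character concatenation can be seen by simply counting reallocations. The sketch below models a buffer whose capacity is rounded up to a whole number of chunks; the chunk size and the rounding policy are my assumptions for illustration, not the values used by EsString, and RoundUpToChunk and CountReallocations are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Rounds a requested size up to a whole number of chunks.
inline std::size_t RoundUpToChunk(std::size_t needed, std::size_t chunk) {
    assert(chunk > 0);
    return ((needed + chunk - 1) / chunk) * chunk;
}

// Counts how many times a buffer would have to be reallocated while
// `appends` characters are added one at a time.
// chunk == 1 models exact-fit allocation (no chunking).
inline std::size_t CountReallocations(std::size_t appends, std::size_t chunk) {
    std::size_t capacity = 0;
    std::size_t length = 0;
    std::size_t reallocations = 0;
    for (std::size_t i = 0; i < appends; ++i) {
        ++length;
        if (length > capacity) {
            capacity = RoundUpToChunk(length, chunk);
            ++reallocations;
        }
    }
    return reallocations;
}
```

With exact-fit allocation, 1000 appends cost 1000 reallocations; with a 64-character chunk the same appends cost 16. Each reallocation may copy the whole buffer, so the saving grows with the string length.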

As for the STL string, as I said at the beginning of this article, its logic is quite dumb and straightforward, so if you are looking for assignment and concatenation performance, don't use it unless you absolutely have to. Alternatively, if you know exactly how many characters you expect to add, reserve the string's capacity beforehand.
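Reserving capacity beforehand looks like this with std::string. BuildWithReserve is a hypothetical helper name; reserve and capacity are standard std::string members.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Appends `count` copies of `ch` after reserving room for all of them,
// so the appends never have to grow the buffer.
inline std::string BuildWithReserve(std::size_t count, char ch) {
    std::string result;
    result.reserve(count);
    const std::size_t capacityBefore = result.capacity();
    for (std::size_t i = 0; i < count; ++i) {
        result += ch;
    }
    // The capacity did not change, so no reallocation took place.
    assert(result.capacity() == capacityBefore);
    return result;
}
```

The single allocation happens inside reserve; every later += only writes into memory that is already there.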

Comments

The ES_ASSERT(x) macro is used in the string code, and it is defined in the precompiled header of the demo project as:

//debug defines
#ifndef ES_ASSERT
    #if !defined(_DEBUG) && !defined(NDEBUG)
        #define NDEBUG
    #endif
    #include <assert.h>
    #define ES_ASSERT(x)    assert(x)
#endif //ES_ASSERT

When used in a project, EsString and related headers may be included in the precompiled header after the ES_ASSERT define, as is done in the demo:

#include "EsRefCounted.hpp"
#include "EsString.hpp"

The source archive contains additional files: stdafx.h and its .cpp. That's because the former has several helper defines, as well as a sketchy EsException class used in the EsString code.

Any code (and performance) improvements, bug fixes, etc. are welcome.

Plans for further development: make this class thread-safe, with conditional defines to exclude the thread-safety locks for the sake of performance. Optionally, if project development plans demand it, add an EsString-based string stream that makes use of EsString's good concatenation performance.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.


About the Author

Vsevolod
Software Developer (Senior) Eko-Sfera
Russian Federation
Born in 05.20.1971, in Moscow.
Graduated from Moscow Physical Engineering Institute in 1993.
Gained PhD. in Phys. Math. sciences in 1998.
Programmer experience over 8 years.
Assembler(s), Pascal, VBasic, JScript, ANSI C, C++.
Microcontrollers, Serial communication, MSJet DB, MFC, ATL, COM.
MSDev Studio, Borland CBuilder.
Russian, English.
 
Married, with one child.

Last Updated 25 Nov 2004
Article Copyright 2004 by Vsevolod