|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Announcements
Want a new Job?
Chapters
Services
Feature Zones
|
IntroductionThis is an in-depth, behind-the-scenes look at strings in the Common Language Runtime and the .NET Framework. This study provides detailed information of the implementation of strings, describes efficient and inefficient ways of using strings. I plan on developing a series of articles, exploring the underlying implementations of various fundamental features in C# and the framework, of which this is the first article, if The Code Project community finds this information helpful and offers any additional suggestions to this piece. Most of the information presented here cannot be found anywhere else, not in MSDN, not in any book, and not in any article. My intent was to understand how to use C# efficiently to develop a serious commercial application. I am ex-Microsoft developer of Excel, who has started his own software company, developing applications employing artificial intelligence. BackgroundStrings, as you know are a fundamental type, in the .NET world. They are one of a small set of exclusive objects that the CLR and the JIT compiler have intimate knowledge of. The others include, of course, the primitive data types, In .NET version 1.0, each heap-allocated object consists of a 4 byte objectheader and 4 byte pointer to a method table. The 4 byte object header provides five bit flags, one of which is reserved for the garbage collector to mark an object as reachable and not free space. The remaining bits refer to a 27 bit index called a syncindex, which may refer to another table. This index has multiple purposes: As implied by its name it is used for synchronization, whenever the "lock" keyword is used. It is also used as the default hash code for The Internally, strings resembled OLE
Strings are always appended with a null character, even though it is valid for a string to contain embedded. This facilitates interoperability with unmanaged code and the traditional Win32 API. Altogether strings occupy 16 bytes of memory + 2 bytes per character allocated + 2 bytes for the final null character. As described in the table the number of characters allocated may be up to twice the string length, if is used to create the string. Extremely Efficient StringBuilderClosely related to
If the string needs to grow beyond its capacity, a new string will be constructed with the greater of twice the previous capacity or the new desired size, subject to the maxcapacity constraint. This approach takes O(3n) which is linear time. The alternative approach of growing the string by a fixed amount rather than a percentage results in quadratic time performance, O(n^2). When Modifying the StringBuilder costs 16 bytes not including the memory used by the string. However, the same As you can see, creating Strings using StringBuilder is an extremely efficient operation. Other performance tips:
Minimizing Garbage CollectionGarbage CollectorUtilizing .NET uses a three-generation approach to collecting memory, based on the heuristic that newly allocated memory tends to be freed more frequently than older allocations, which tend to be more permanent. Gen 0 is the youngest generation and, after a garbage collection, any survivors go on to Gen 1. Likewise, any survivors of a Gen 1 collection go on to Gen 2. Usually garbage collection will occur only on Gen 0, and only if after it has reached some limit. The cost of allocating memory on the heap under garbage collection is less than that under the C runtime heap allocator. Until memory is exhausted, the cost of allocating each new object is that of incrementing a pointer--which is close to the performance of advancing the stack pointer. According to Microsoft, the cost of a garbage collection of generation 0 is equivalent to a page fault--from 0 to 10 milliseconds. The cost of a generation 1 collection is from 10 to 30 ms, while a generation 2 collection depends on the working set. My own profiling indicated that generation 0 collections occur 10-100 times more frequently than generation the other two. One book by Jeffrey Richter that I have read suggested hypothetical limits of 250Kb for Gen 0, 2Mb for Gen 1 and 10Mb for Gen 2. However, my own investigation into the Rotor shared source CLI indicated that the threshold appears to be initially 800Kb for Gen 0 and 1Mb for Gen 1; of course, these are undocumented and subject to change. The thresholds are automatically adjusted dynamically according to actual program allocations. If very little memory is being freed in Gen 0 and survives to Gen 1, the threshold is expanded. Suboptimal string functionsMany of the functions provided by the String class often generated needless allocations that increase the frequency of garbage collections. For example, the functions It is very difficult, if not impossible, to escape the numerous hidden allocations occurring with the class library. For example, whenever any numeric data type such as int or float is formatted as string (for example, through When using Other parts of the library exhibit these inefficiencies, too. In the Windows Forms Library, for example, the Control Text property always returns a new string; this may actually be understandable, because the property may not be cached and therefore it must call into Windows API with the function GDI+ is the worst abuser of the garbage collector, because a new string must be constructed for each call to One good string function is the Direct Modifications of Strings1) in-place modification of stringspublic static unsafe void ToUpper(string str)
{
fixed (char *pfixed = str)
for (char *p=pfixed; *p != 0; p++)
*p = char.ToUpper(*p);
}
The example above demonstrates how to change an immutable string through the use of unsafe pointers. A good example of efficiency gained through this approach with is The fixed keyword pins the string in heap so that is cannot move during a garbage collection and allows the address of the string to be converted to a pointer. The new address points to the start of the string (or if an index is included as in The functions below provide a more complete set of functions for modify a string, without reflection. public static unsafe int GetCapacity(string s)
{
fixed(char *p = s)
{
int *pcapacity = (int *)p - 2;
return *pcapacity;
}
}
public static unsafe int GetLength(string s)
{
// This function is redundant, because it accomplishes
// the same role as s.Length
// but it does demonstrate some of the precautions
// that must be taken to
// recover the length variable
fixed(char *p = s)
{
int *plength = (int *)p - 1;
int length = *plength & 0x3fffffff;
Debug.Assert(length == s.Length);
return length;
}
}
public static unsafe void SetLength(string s, int length)
{
fixed(char *p = s)
{
int *pi = (int *)p;
if (length<0 || length > pi[-2])
throw( new ArgumentOutOfRangeException("length") );
pi[-1] = length;
p[length] = '\0';
}
}
public static unsafe string NewString(int capacity)
{
return new string('\0', capacity);
}
public static unsafe void SetChar(string s, int index, char ch)
{
if (index<0 || index >= s.Length)
throw( new ArgumentOutOfRangeException("length") );
fixed(char *p = s)
p[index] = ch;
}
Modifying the values will indeed change the string. If strings can be change, why then are the immutable? The fact that strings are immutable allows them to be as easily passed as around as regular integers. Strings can be initialized with null and can be blitted from one structure to next, without having to use an elaborate mechanism for copy-on-write. Capacity refers to the array length of the string, which cannot be changed. However, the logical string length can be changed, by referring to the previous 32-bit integer just before the string. However, examining or modifying this value requires caution. The string length must always be less then array length of the string. The high two bits contains flags that indicate whether the string consists entirely of 7-bit ASCII text, in which case in can be sorted and compared quickly. A value of zero for both of these bits indicate that the state is indeterminate. When reading the length value, the length must be anded with 0x3fffffff to avoid reading the two bits. Modifying the length is okay as long as the high two bits are clear, which is the normally the case. Future implementations or current non-Windows implementations of the CLR may change the underlying implementation of the string. Fortunately, the version of the runtime that you built with will be the one selected when your executable is started. System.Reflection also allows programmers to access hidden fields, members, and properties--yes, even those marked with internal and private. This reflection-based approach performs much more slowly than the manual approach above, but does not require unsafe code and should be more robust in the face of changing versions of the runtime. The line demonstrates the ability to change the length of an immutable string through reflection:
I have written a short test function to demonstrate the use of the various string functions described above. /// <SUMMARY>
/// The main entry point for the application.
/// </SUMMARY>
[STAThread]
static unsafe void Main(string[] args)
{
StringBuilder sb = new StringBuilder();
sb.Append("Good morning");
string test = "How are you";
SetLength(test, 5);
SetChar(test, 1, 'a');
string [] sarray = new String[] { String.Empty, "hello", "goodbye",
sb.ToString(), test};
foreach (string s in sarray)
Console.WriteLine("'{0}' has capacity {1}, length {2}", s,
GetCapacity(s),
GetLength(s));
Console.ReadLine();
}
The test code above results in the following output, demonstrating the full ease at which strings can actually be changed. '' has capacity 1, length 0
'hello' has capacity 6, length 5
'goodbye' has capacity 8, length 7
'Good morning' has capacity 17, length 12
'Haw a' has capacity 12, length 5
2) Retrieving the working copy of string in StringBuilderFor the faint-hearted, who prefer a less intimate, less low-level interaction with strings. It's possible to recover the actual string that is being used by
The returned string "test" will continue to reflect changes made within StringBuilder, unless a modification requires more capacity than the string holds. It which case StringBuilder abandons the original string and constructs an entirely new larger working string. One caveat is that A future article on "Objects UNDOCUMENTED" will demonstrate how to access hidden members without using reflection. String alternativesAn alternative to constructing Strings without impacting the garbage collector is to do one of the following: All of these alternatives use unsafe features, which have no real implications if you are writing a standalone application. Keep in mind, managed C++ is unsafe. I don't really recommend these approaches, but they are mentioned for your awareness. 1) stack-based character array char *array = stackalloc char[arraysize ];
2) special struct for fixed-sized strings [StructLayout(LayoutKind.Explicit, Size=514)]
public unsafe struct Str255
{
[FieldOffset(0)] char start;
[FieldOffset(512)] byte length;
#region Properties
public int Length
{
get { return length; }
set
{
length = Convert.ToByte(value);
fixed(char *p=&start) p[length] = '\0';
}
}
public char this[int index]
{
get
{
if (index<0 || index>=length)
throw(new IndexOutOfRangeException());
fixed(char *p=&start)
return p[index];
}
set
{
if (index<0 || index>=length)
throw(new IndexOutOfRangeException());
fixed(char *p=&start)p[index] = value;
}
}
#endregion
}
Str255 is a stack-allocated value type that allows string operations to be performed without impacting the garbage collector. It can handle up to 255 characters and includes a length byte. A reference to the structure can be directly passed to a Windows API call, because the start of the structure is the first character of a C-string. Of course additional functions need to be written by editing strings. Careful attention would also be needed to ensure that the string is null-terminated for Windows interoperability. Also a conversion routine would be necessary to convert the struct to a .NET string, so that others CLR functions can use it. If this is used to create a .NET string, it would be superior to StringBuilder in one respect: Str255 would require fewer allocations. Range Check EliminationIndexing an array or string normally includes range-checking. According to Microsoft, the compiler performs a special optimization that improves the performance of iterating through an array or string. First, compare the following three approaches to iterating through a string. Which is fastest? 1) standard iteration int hash = 0;
for (int i=0; i< s.Length; i++)
{
hash += s[i];
}
2) iterative loop with saved length variable int hash = 0;
int length = s.length;
for (int i=0; i< length; i++)
{
hash += s[i];
}
3) foreach iteration foreach (char ch in s)
{
hash += s[i];
}
Surprisingly, with NET v1.0 of the JIT compiler, the first example, which repeatedly calls Why is example 1 faster than example 2? This is because the compiler recognizes the pattern for (int i=0; i<s.length; i++) for both strings and arrays. Strings are immutable; they have constant length. Since both strings and array are fixed lengths, the compilers simply stores away the length, so that no function call is made on each iteration. (The JIT compiler, which automatically inlines non-virtual method calls that consist of simple control flow and less than 32 bytes of IL instructions, may actually be inlining references to the string length.) In addition, the compiler eliminates all range-check tests on any instance of s[i] within the loop, because i is guaranteed in the for condition to be within the range 0 <= i < NOTE: In the version 1.1, the C# team special cases There is a fourth approach that should be even faster, although I have not verified it; however, it does require unsafe code to be emitted. fixed (char *pfixed = s)
{
for (char *p = pfixed; *p; p++)
hash += *p++;
}
Efficient String SwitchesC# supports switches on strings. In doing so, it uses an efficient mechanism for switching.
public void Test(string test)
{
switch(test)
{
case "a": break;
case "b": break;
...
}
}
For small number of cases, the compiler will generate code that looks at the internal string hash table. The compiler will first call String.IsIntern on the switch value. String.IsIntern actually returns a string instead of boolean, returning null for failure, and the interned value of the string for success. Every string constant in a case label is automatically known at compile-time and is therefore guaranteed to be intern. At that point, the compiler will compare the interned value of the string to each string constant using reference equality.
public void Test(string test)
{
object interned = IsInterned(test);
if (interned != null)
{
if (interned == "a")
{
; // Handle case a
}
else if (interned == "b")
{
; // Handle case b
}
}
}
Here is the sample IL for this code. .method public hidebysig instance void Test(string test) cil managed
{
// Code size 34 (0x22)
.maxstack 2
.locals ([0] string CS$00000002$00000000)
IL_0000: ldstr "a"
IL_0005: leave.s IL_0007
IL_0007: ldarg.1
IL_0008: dup
IL_0009: stloc.0
IL_000a: brfalse.s IL_001f
IL_000c: ldloc.0
IL_000d: call string [mscorlib]System.String::IsInterned(string)
IL_0012: stloc.0
IL_0013: ldloc.0
IL_0014: ldstr "a"
IL_0019: beq.s IL_001d
IL_001b: br.s IL_001f
IL_001d: br.s IL_0021
IL_001f: br.s IL_0021
IL_0021: ret
}
If the number of cases is large (in this example it was 14), the compiler generates a sparse hashtable constructed with a loadfactor of .5 and with twice the capacity. [[ In actuality, a hashtable of with a loadfactor of .5 is generated with nearly a 3-1 ratio, since the The hashtable will map each string to a corresponding index as shown in the C# illustration of what the compiler generates below.
private static Hashtable CompilerGeneratedHash;
public void Test(string test)
{
if (CompilerGeneratedVariable==null)
{
CompilerGeneratedHash=new Hashtable(28, 0.5);
CompilerGeneratedHash.Add("a", 0);
CompilerGeneratedHash.Add("b", 1);
...
}
object result = CompilerGeneratedHash[test];
if (result != null)
{
switch( (int) result )
{
case 0:
case 1:
...
}
}
}
The actual IL version is shown below. .method public hidebysig instance void Test(string test) cil managed
{
// Code size 373 (0x175)
.maxstack 4
.locals ([0] object CS$00000002$00000000)
IL_0000: volatile.
IL_0002: ldsfld class [mscorlib]
System.Collections.Hashtable '<PRIVATEIMPLEMENTATIONDETAILS>
'
::'$$method0x6000106-1'
IL_0007: brtrue IL_0100
IL_000c: ldc.i4.s 28
IL_000e: ldc.r4 0.5
IL_0013: newobj instance void [mscorlib]
System.Collections.Hashtable::.ctor(int32, float32)
IL_0018: dup
IL_0019: ldstr "a"
IL_001e: ldc.i4.0
IL_001f: box [mscorlib]System.Int32
IL_0024: call instance void [mscorlib]
System.Collections.Hashtable::Add(object, object)..... object)
IL_00f9: volatile.
IL_00fb: stsfld class [mscorlib]
System.Collections.Hashtable '<PRIVATEIMPLEMENTATIONDETAILS>
'
::'$$method0x6000106-1'
IL_0100: ldarg.1
IL_0101: dup
IL_0102: stloc.0
IL_0103: brfalse.s IL_0172
IL_0105: volatile.
IL_0107: ldsfld class [mscorlib]
System.Collections.Hashtable '<PRIVATEIMPLEMENTATIONDETAILS>
'
::'$$method0x6000106-1'
IL_010c: ldloc.0
IL_010d: call
instance object [mscorlib]System.Collections.Hashtable::get_Item(object)
IL_0112: dup
IL_0113: stloc.0
IL_0114: brfalse.s IL_0172
IL_0116: ldloc.0
IL_0117: unbox [mscorlib]System.Int32
IL_011c: ldind.i4
IL_011d: switch (
IL_0158,
IL_015a,
IL_015c,
IL_015e,
....
IL_0170)
IL_0156: br.s IL_0172
....
IL_0174: ret
}
WhidbeyThe next version of the C# compiler will be introducing a number of improvements for strings. Strings will have an improved hashing function with better distributional properties, so you don't want to rely on the current behavior. Additional new functions in the string classes include the following: the static method These are in the current pre-release versions of Whidbey available, but there will likely be more changes prior to beta. ConclusionThis concludes my discourse on strings for now. I will continue update this article with new source code and actual benchmarks in the future. Be sure to watch for update versions of this page. As a result of the enthusiasm that this article has generated, I will continue to develop more UNDOCUMENTED articles. My next article will be a discussion of the implementations of arrays and collections. I hope to publish a couple dozen articles when I am done with the series. My sources include various books Microsoft publishes, the shared source CLI, MSDN, magazine articles, interviews, inside sources, developer conference presentations, and good old disassembly. All of this behind-the-scenes information takes some amount of work to research and obtain, so, if you enjoyed this article, don't forget to vote. Version History
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||