Introduction
Nearly every good book about the .NET Framework or the .NET programming languages (C#, VB.NET) dedicates a special section to strings (the System.String class). The reason is that this class behaves in special ways, and the better you understand how the .NET Framework handles it, the better your code will be.
Background
This article targets beginner and intermediate developers, although advanced developers may also find a few interesting things here.
1. Value Type vs. Reference Type
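String is a reference type because it is declared as a class and does not derive from System.ValueType. (Inheriting from System.Object is not what decides this, since value types ultimately derive from System.Object as well.) To assign a string literal to a System.String variable, the following simplified syntax can be used: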
String s = "Hello!";
As you know, the same syntax is used to assign a value to a variable of any value type in the .NET Framework:
int i = 10;
decimal d = 5.25m;
The syntax String s = new String("Hello!"); does not work on the .NET Framework. It does not even compile, because the String class has no constructor that accepts a String. Another way to create a String object is:
char[] chars = {'a','b','c'};
String s = new String(chars);
So, from the way the value is often assigned (first assignment), System.String can be mistakenly seen as a value type.
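Another source of confusion arises when System.String is the type of a method parameter. In .NET, every argument is passed by value by default, whether its type is a value type or a reference type. For a value type, the value itself is copied. For a reference type, the reference is copied, so the method and the caller refer to the same object. Either kind of type can also be passed by reference with the ref keyword, in which case the method works on the caller's variable itself. In the code below, the parameter of the "UpdateValue" method is an integer (a value type) passed by reference.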
static void Main(string[] args)
{
    int myValue = 10;
    UpdateValue(ref myValue);
    Console.WriteLine("My value: {0}", myValue);
}

static void UpdateValue(ref int sValue)
{
    sValue = 20;
}
If we change the code by replacing the int value type with the System.String reference type, we get the same end result: the myValue variable is changed by the call, and "UpdatedValue" is displayed in the console. Because ref passes the variable itself, the assignment inside the method replaces the reference held by myValue:
static void Main(string[] args)
{
    String myValue = "CurrentValue";
    UpdateString(ref myValue);
    Console.WriteLine("My value: {0}", myValue);
}

static void UpdateString(ref String sValue)
{
    sValue = "UpdatedValue";
}
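If the "ref" keyword is removed from both the method call and the method parameter definition, the output shows the value from before the call: "CurrentValue". The method then receives a copy of the reference, and the assignment inside it only points that local copy at a different String object, so the caller's variable is untouched. Any reference type behaves this way when a parameter is reassigned. What makes String feel like a value type is that it offers no way to modify the object a reference points to, so a method can never change the caller's string in place.
So, the question is: why is System.String a reference type when it so often acts like a value type? The answer is a little more complex. C# gives String its own keyword (string), a literal syntax, and value-based comparison with ==, so the compiler treats it much like the built-in primitive types, although it is not a CLR primitive type (and String is not a primitive in Java either). It is a reference type because it is a class, and therefore String objects live on the managed heap (the area of memory where all reference type instances are stored), while variables only hold references to them. Value types, by contrast, are stored inline wherever they are declared: on the stack for local variables, or inside the containing object on the heap for fields. Keep in mind, though, that System.String is a very special case of reference type.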
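The difference between reassigning a parameter and changing the object it refers to can be checked with a short program. This is a minimal sketch: the class and method names (ParameterPassingDemo, Reassign, ReassignByRef, Mutate) are made up for illustration, and StringBuilder is used only as an example of a mutable reference type.

```csharp
using System;
using System.Text;

public static class ParameterPassingDemo
{
    // The reference is copied, so reassigning the parameter
    // only changes the method's local copy.
    public static void Reassign(string value)
    {
        value = "UpdatedValue";
    }

    // With ref, the caller's variable itself is passed,
    // so the reassignment is visible to the caller.
    public static void ReassignByRef(ref string value)
    {
        value = "UpdatedValue";
    }

    // The copied reference still points to the caller's object,
    // so changes made to a mutable object are visible to the caller.
    public static void Mutate(StringBuilder builder)
    {
        builder.Append(" (mutated)");
    }

    public static void Main()
    {
        string text = "CurrentValue";
        Reassign(text);
        Console.WriteLine(text);                 // CurrentValue

        ReassignByRef(ref text);
        Console.WriteLine(text);                 // UpdatedValue

        var builder = new StringBuilder("CurrentValue");
        Mutate(builder);
        Console.WriteLine(builder.ToString());   // CurrentValue (mutated)
    }
}
```

String has no equivalent of Append that changes the existing object, which is why the first call can never affect the caller's string.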
2. String Interning Process
So, what happens behind the scenes? The CLR (Common Language Runtime) internally maintains a hash table (also called the "intern pool") that holds the string literals declared in an application as well as the strings that are added programmatically. How declared strings are interned has varied between versions of the .NET Framework and also depends on compile-time settings such as the CompilationRelaxations attribute, particularly for literals declared in different assemblies. The C# specification does guarantee that equivalent literals within a single assembly refer to the same instance, but as a rule of thumb, do not write code that depends on a declared string having been interned automatically (added to the internal hash table). What you can trust is that every string passed through the String.Intern method (described below) is in this hash table.
The key of the hash table is the string value, and the value is the reference to the String object. Strings are added to the hash table programmatically by calling the method String.Intern(String s). When this method is called, the CLR checks whether an identical string is already in the table. If an entry is found, the method returns the reference to the existing string. If no entry is found, the string is added to the internal hash table and a reference to the interned string is returned.
Another method related to the string interning process is String.IsInterned(String s). Despite its name, the return type of this method is not a boolean (Microsoft should have come up with a better name for this method or a different return type). The method returns a String, and the logic is the following: if the string passed to the method exists in the internal hash table, the method returns a reference to the String object that is already interned. Otherwise, the method returns null, and it is worth mentioning that the string is not added to the internal table in that case.
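The return values of both methods can be observed with a small program. This is only a sketch: the InternDemo class, the Build helper, and the content "intern-demo" are hypothetical, and the string is assembled from characters at run time so that it starts out as an ordinary heap object rather than a literal.

```csharp
using System;

public static class InternDemo
{
    // Builds the string at run time so it is a separate heap object, not a literal.
    // The content "intern-demo" is a hypothetical value chosen for this sketch.
    public static string Build()
    {
        return new string(new[] { 'i', 'n', 't', 'e', 'r', 'n', '-', 'd', 'e', 'm', 'o' });
    }

    public static void Main()
    {
        string first = Build();

        // IsInterned returns a String reference or null, never a bool, and it
        // does not add anything to the pool. Null is expected here, assuming no
        // other code in the process has interned the same content.
        Console.WriteLine(String.IsInterned(first) == null);

        // Intern adds the string to the pool and returns the pooled reference.
        string pooled = String.Intern(first);
        Console.WriteLine(String.IsInterned(first) != null);                       // True

        // A second run-time copy is a different object until it is interned.
        string second = Build();
        Console.WriteLine(Object.ReferenceEquals(second, pooled));                 // False
        Console.WriteLine(Object.ReferenceEquals(String.Intern(second), pooled));  // True
    }
}
```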
Note also that not every String object on the heap is in the internal hash table. Strings built at run time (for example, by concatenating variables or by calling the String(char[]) constructor) are ordinary heap objects, and they are interned only if String.Intern is called on them. Enough talk for now; let's see the results of executing some code:
String s1 = "Hello world!";
String s2 = "Hello world!";
Console.WriteLine("Object references are equal: {0}",
    Object.ReferenceEquals(s1, s2));

String s3 = String.Intern("Hello world!");
String s4 = String.Intern("Hello world!");
Console.WriteLine("Object references are equal: {0}",
    Object.ReferenceEquals(s3, s4));
In the first case (strings s1 and s2), both literals are declared in the same assembly, so they refer to a single String object and ReferenceEquals returns 'true'. For literals declared in different assemblies, the result depends on the framework version and on the compile parameters/settings, so you should never write code that relies on it. In the second case, the result is always the same: the ReferenceEquals method returns 'true' because the strings are programmatically added to the intern pool. So far, we've seen that System.String is a reference type, and that in the second scenario s3 and s4 refer to the same object (the same memory address on the heap). The question you may ask is: if I change the value of s4, will s3 change as well? The answer is 'No', or to be more precise, 'Not really', and the next section explains why.
3. Strings are Immutable
In the object-oriented programming world, an immutable object is an object that cannot be modified once it is created. This behaviour of strings is what makes the interning process possible. Because strings are immutable, a reference can be copied instead of the entire object, and multiple variables can safely refer to the same String object. Immutability does not mean, however, that the memory where the string data is stored is read-only. What it really means is that the .NET Framework gives you no way to change the contents of a String (or at least not when working with managed/safe code). Let's see what happens in the following code:
String s1 = String.Intern("ABC");   // Line 1
String s2 = String.Intern("ABC");   // Line 2
s2 = s2.ToLower();                  // Line 3
Line 1 adds the string "ABC" to the intern pool and assigns the returned reference to s1. Line 2 tries to add "ABC" to the intern pool again, but, as the .NET documentation describes, it is not added since it already exists. Instead, the same reference is returned and assigned to s2. At this point, both variables refer to the same String object. The interesting part comes in Line 3. Here, the ToLower() method creates a new String object with the value "abc" and returns a reference to it, so s2 now points to a new memory location. Note that the memory location holding "ABC" is not overwritten with the value "abc". We therefore end up with s1 still referring to "ABC" and s2 referring to "abc".
This holds as long as we stay within managed/safe code. When we call unmanaged code, we need to be very careful with string operations. As mentioned above, the memory location where the string data is stored is not read-only, so code that bypasses the managed rules can overwrite it, and unmanaged code can do exactly that. Let's see what happens in the example below:
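using System;
using System.Runtime.InteropServices;

static void Main(string[] args)
{
    String s1 = String.Intern("String cannot be changed");
    String s2 = String.Intern("String cannot be changed");
    int bufferLength = s1.Length;
    // The unmanaged function writes its result into the memory behind s1.
    GetUserName(s1, ref bufferLength);
    Console.WriteLine("The second string: {0}", s2);
}

// Deliberately declared with a String parameter to demonstrate the problem;
// the usual declaration passes a StringBuilder as the output buffer.
[DllImport("Advapi32", CharSet = CharSet.Unicode)]
static extern bool GetUserName(
    [MarshalAs(UnmanagedType.LPWStr)] string userName, ref int bufferLength);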
Running the above code on my computer displayed the following message in the console (Marius is my NT username): "The second string: Marius cannot be changed". So, we declare s1 and s2, and we make sure that they refer to the same String object by using the String.Intern(String s) method. Next, an unmanaged function is called: GetUserName from "Advapi32.dll" (see the MSDN documentation for a description of the function). What happens during the call is the interesting part: the function is passed one of the declared strings, and since unmanaged code does not follow the managed rules regarding string immutability, it writes its result directly into the memory that s1 refers to. But in the managed world, s2 refers to the same memory location, and therefore the content seen through s2 changes as well.
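For contrast with the unmanaged call, the managed behaviour of the earlier three-line example can be verified directly. This is a minimal sketch with a hypothetical class name (ImmutabilityDemo); it only confirms that ToLower() returns a different object and leaves the interned "ABC" untouched.

```csharp
using System;

public static class ImmutabilityDemo
{
    public static void Main()
    {
        string s1 = String.Intern("ABC");
        string s2 = String.Intern("ABC");

        // Both variables hold the same pooled reference.
        Console.WriteLine(Object.ReferenceEquals(s1, s2));   // True

        // ToLower() does not modify the object; it returns a new String.
        s2 = s2.ToLower();

        Console.WriteLine(s1);                               // ABC
        Console.WriteLine(s2);                               // abc
        Console.WriteLine(Object.ReferenceEquals(s1, s2));   // False
    }
}
```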