Licence CPOL
First Posted 14 Jan 2009

.NET Strings

By Marius Serban | 14 Jan 2009 | Article
Overview of .NET string handling.

Introduction

Every good book about the .NET Framework or the .NET programming languages (C#, VB.NET) dedicates a special section to strings (the System.String class). The reason is that this class behaves in a special way, and the better you understand how the .NET Framework handles it, the better your code will be.

Background

This article targets beginner and intermediate developers, though advanced developers may also find a few interesting things here.

1. Value Type vs. Reference Type

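String is a reference type: System.String is declared as a class, not a struct, so a String variable holds a reference to an object on the managed heap. (Inheriting from System.Object does not decide this, since value types also derive from it through System.ValueType.) To assign a string literal to a System.String variable, the following simplified syntax can be used: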

String s = "Hello!";

As you know, the same syntax is used to assign values to the value types of the .NET Framework:

int i = 10;
decimal d = 5.25m;

The syntax String s = new String("Hello!"); does not even compile, because the String class has no constructor overload that accepts a string. Instead, another way to create a String object is:

char[] chars = {'a','b','c'};
String s = new String(chars);
// creates a String object that holds the value "abc"

So, from the way the value is often assigned (first assignment), System.String can be mistakenly seen as a value type.
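The distinction can be checked at run time. The following is a minimal sketch, not part of the original article; the class name StringKindDemo and the helper FromChars are hypothetical names chosen for illustration.

```csharp
using System;

public static class StringKindDemo
{
    // Builds a String from a character array, as in the constructor example above.
    public static string FromChars(char[] chars)
    {
        return new string(chars);
    }

    public static void Main()
    {
        // Type.IsValueType reports whether a type is a struct/enum or a class.
        Console.WriteLine(typeof(int).IsValueType);     // True
        Console.WriteLine(typeof(decimal).IsValueType); // True
        Console.WriteLine(typeof(string).IsValueType);  // False
        Console.WriteLine(typeof(string).IsClass);      // True

        string s = FromChars(new[] { 'a', 'b', 'c' });
        Console.WriteLine(s == "abc");                  // True (value comparison)
    }
}
```

Only the literal-assignment syntax resembles a value type; the type itself is reported as a class.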

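Another scenario is when System.String is the parameter type of a method. By default, arguments in the .NET Framework are passed by value, and this holds for reference types too: the method receives a copy of the reference, so it can modify the object that the reference points to, but assigning a new object to the parameter does not affect the caller's variable. The ref keyword passes the variable itself, for value types and reference types alike. In the code below, the parameter of the "UpdateValue" method is an integer (a value type) passed with ref.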

static void Main(string[] args)
{
    int myValue = 10;
    UpdateValue(ref myValue);
    Console.WriteLine("My value: {0}", myValue);          
}

static void UpdateValue(ref int sValue)
{
    sValue = 20;
}

If we change the code by replacing the int value type with the System.String reference type, we get the same end result: the myValue variable is changed, and "UpdatedValue" is displayed in the console (I'll explain in the next articles on this subject why this happens with strings):

static void Main(string[] args)
{
    String myValue = "CurrentValue";
    UpdateString(ref myValue);
    Console.WriteLine("My value: {0}", myValue);          
}
static void UpdateString(ref String sValue)
{
    sValue = "UpdatedValue";
}

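If the "ref" keyword is removed from the method call and also from the method parameter definition, then the output displays the value from before the call to the method: "CurrentValue". So, the question is: why is System.String a reference type when it appears to act as a value type? The answer is a little more complex. Without ref, the method receives its own copy of the reference, so assigning a new string to the parameter only redirects that copy and leaves the caller's variable untouched. Any reference type behaves this way on assignment; what makes String feel like a value type is that it exposes no members that modify an existing instance (see section 3), so a method can never change the caller's string in place. In addition, languages such as Java and C# give strings built-in support (a keyword, literals, operators), which makes String look like a primitive type even though the CLR does not classify System.String as a primitive.

Again, why is String a reference type? Because System.String is a class, its instances are allocated on the managed heap, and a variable holds only a reference to the data. Value types, by contrast, are stored inline wherever they are declared: typically on the stack for local variables, or inside the containing object for fields. Keep in mind though that System.String is a very special case of reference type.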
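To see that the behaviour without ref is a matter of how the reference is passed, String can be compared with another reference type. The sketch below is illustrative only; ParameterPassingDemo and its method names are hypothetical, and StringBuilder is used simply because it is a mutable class from the framework.

```csharp
using System;
using System.Text;

public static class ParameterPassingDemo
{
    // The parameter is a copy of the caller's reference; reassigning it is local.
    public static void Reassign(string value)
    {
        value = "UpdatedValue";
    }

    // With ref, the method works on the caller's variable itself.
    public static void ReassignByRef(ref string value)
    {
        value = "UpdatedValue";
    }

    // Reassigning the parameter is local for a mutable class as well...
    public static void ReassignBuilder(StringBuilder builder)
    {
        builder = new StringBuilder("UpdatedValue");
    }

    // ...but changing the object behind the reference is visible to the caller.
    public static void Mutate(StringBuilder builder)
    {
        builder.Append(" (mutated)");
    }

    public static void Main()
    {
        string s = "CurrentValue";
        Reassign(s);
        Console.WriteLine(s);   // CurrentValue
        ReassignByRef(ref s);
        Console.WriteLine(s);   // UpdatedValue

        StringBuilder sb = new StringBuilder("CurrentValue");
        ReassignBuilder(sb);
        Console.WriteLine(sb);  // CurrentValue
        Mutate(sb);
        Console.WriteLine(sb);  // CurrentValue (mutated)
    }
}
```

String has no counterpart to Mutate, which is why a string passed without ref always comes back unchanged.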

2. String Interning Process

So, what happens behind the scenes? The CLR (Common Language Runtime) internally maintains a hash table (also called the "intern pool") that holds the strings declared in an application and the ones that are added programmatically. How declared strings are interned differs from one version of the .NET Framework to another, so as a rule of thumb, do not rely on declared strings being automatically interned (added to the internal hash table). However, you can trust that every string passed to the String.Intern method (described below) is in this hash table.

The key of the hash table is the string value, and the value is a reference to the String object. Strings are added to the hash table programmatically by calling the method String.Intern(String s). When this method is called, the CLR checks whether an identical string is already in the table. If an entry is found, the method returns the reference to the existing string. If no entry is found, a reference to the string is added to the internal hash table, and that reference is returned.

Another method related to the string interning process is String.IsInterned(String s). Despite its name, this method does not return a boolean (Microsoft should have come up with a better name for the method or a different return type). It returns a String, and the logic is the following: if the string passed in the method call exists in the internal hash table, the method returns a reference to the String object that is already interned. Otherwise, the method returns null; it is worth mentioning that in this case the string is not added to the internal table.
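The contract of the two methods can be sketched as follows. This is an illustration under stated assumptions, not the author's code: InternDemo and MakeUniqueString are hypothetical names, and a GUID is used only to obtain a value that is, for practical purposes, certain not to be in the intern pool beforehand.

```csharp
using System;

public static class InternDemo
{
    // Builds a string at run time whose value should not be in the intern pool yet.
    public static string MakeUniqueString()
    {
        return "demo-" + Guid.NewGuid().ToString();
    }

    public static void Main()
    {
        string a = MakeUniqueString();
        string b = new string(a.ToCharArray()); // same value, separate object

        // Not interned yet: IsInterned returns null and does not add the string.
        Console.WriteLine(String.IsInterned(a) == null);                 // True
        Console.WriteLine(Object.ReferenceEquals(a, b));                 // False

        // Interning two equal strings yields one shared reference.
        string internedA = String.Intern(a);
        string internedB = String.Intern(b);
        Console.WriteLine(Object.ReferenceEquals(internedA, internedB)); // True

        // Now the value is in the pool, so IsInterned returns that reference.
        Console.WriteLine(Object.ReferenceEquals(String.IsInterned(b), internedA)); // True
    }
}
```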

Note also that not every String object on the heap is in the internal hash table: strings created at run time stay outside it unless they are explicitly interned. Enough talk for now; let's see the results of executing some code:

// declarative vs. programmatic interning
String s1 = "Hello world!";
String s2 = "Hello world!";
Console.WriteLine("Object references are equal: {0}", 
                  Object.ReferenceEquals(s1, s2));

String s3 = String.Intern("Hello world!");
String s4 = String.Intern("Hello world!");
Console.WriteLine("Object references are equal: {0}", 
                  Object.ReferenceEquals(s3, s4));

In the first case (strings s1 and s2), the .NET Framework may add the strings to the intern pool. It depends on the framework version and also on the compile parameters/settings. Therefore, the result may differ depending on the platform and settings, and you should never write code that relies on it. In the second case, however, the result is always the same: the ReferenceEquals method returns 'true' because the string literals are programmatically added to the intern pool.

So far, we've seen that System.String is a reference type, and that in the second scenario, s3 and s4 hold the same reference (the same memory address on the heap). The question that you may ask is: if I change the value of s4, will s3 change as well? The answer is 'No', or to be more precise, 'Not really', and the next section explains why.

3. Strings are Immutable

In the object-oriented programming world, an immutable object is an object which cannot be modified once it is created. This behaviour of strings is what makes the interning process possible. Because strings are immutable, a reference can be copied instead of the entire object, so multiple variables can point to the same string literal. But immutability does not mean that the memory where the object data (string literal) is stored is read-only. What it really means is that behind the scenes, the .NET Framework makes sure that you cannot change the value of the string literal (or at least not when working with managed/safe code). Let's see what happens in the following code:

Line 1: String s1 = String.Intern("ABC");
Line 2: String s2 = String.Intern("ABC");
Line 3: s2 = s2.ToLower();

Line 1 adds the literal "ABC" to the intern pool and assigns the returned reference to s1. Line 2 tries to add the literal "ABC" to the intern pool, but, as the .NET documentation describes, "ABC" is not added since it already exists. Instead, the same reference is returned and assigned to s2. At this point, both variables point to the same string literal by holding the same reference.

The very interesting part comes in Line 3. Here, the ToLower() method creates a new String object and populates it with the value "abc". The reference to the new string is then returned, and s2 now points to a new memory location. Note that the memory location which holds the literal "ABC" was by no means overwritten with the value "abc". Therefore, s1 still points to "ABC" and s2 now points to "abc".

This holds as long as we stay in the context of managed/safe code. If we deal with unmanaged code, we need to be very careful when we do operations with strings. As mentioned above, the memory location where the string literal is stored is not read-only, so it can be overwritten by code that deliberately writes to it, and unmanaged code can do exactly that. Let's see what happens in the example below:
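The managed-code part of this reasoning can be verified directly. The sketch below is an illustration, not the author's code; ImmutabilityDemo and OriginalSurvives are hypothetical names, and the concatenation step is an extra operation added to show that it behaves like ToLower().

```csharp
using System;

public static class ImmutabilityDemo
{
    // Returns true if s1 is untouched after s2 has been "modified" twice.
    public static bool OriginalSurvives()
    {
        string s1 = String.Intern("ABC");
        string s2 = String.Intern("ABC"); // same reference as s1

        s2 = s2.ToLower(); // new object "abc"; s2 is redirected to it
        s2 += "d";         // another new object "abcd"

        return s1 == "ABC"
            && s2 == "abcd"
            && !Object.ReferenceEquals(s1, s2);
    }

    public static void Main()
    {
        Console.WriteLine(OriginalSurvives()); // True
    }
}
```

Every operation that appears to change a string returns a different object, so the variable is redirected while the original data stays as it was.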

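using System;
using System.Runtime.InteropServices;

class Program
{
    static void Main(string[] args)
    {
        // Both variables hold the same interned reference.
        String s1 = String.Intern("String cannot be changed");
        String s2 = String.Intern("String cannot be changed");

        int bufferLength = s1.Length;
        // Demonstration only: the unmanaged function writes into the memory of s1.
        GetUserName(s1, ref bufferLength);
        Console.WriteLine("The second string: {0}", s2);
    }

    // Declaring the buffer as string (instead of a writable buffer) is what
    // lets the unmanaged code overwrite the managed string's memory.
    [DllImport("Advapi32", CharSet = CharSet.Unicode)]
    static extern bool GetUserName(
        [MarshalAs(UnmanagedType.LPWStr)] string userName, ref int bufferLength);
}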

When I ran the above code on my computer, the following message was displayed in the console (Marius is my NT username): "The second string: Marius cannot be changed".

So, we declare s1 and s2, and we make sure that they point to the same literal by using the String.Intern(String s) method. Next, an unmanaged/unsafe piece of code is called: GetUserName from "Advapi32.dll" (see the MSDN description of the method). What happens during the method call is the interesting part: the method is passed one of the declared strings, and since unmanaged code does not follow the managed rules regarding the immutability of strings, it writes its response at the memory location that s1 points to. But in the managed world, s2 also points to the same memory location, and therefore the content of the string literal is actually changed.
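For comparison, the commonly used way to receive text from an unmanaged function is to pass a writable StringBuilder buffer instead of a string, so no managed string is ever overwritten. The following is a sketch under stated assumptions rather than a definitive implementation: SafeUserName and TryGetUserName are hypothetical names, the buffer size uses the UNLEN constant (256) from the Windows SDK, and the platform check is there only so the sample returns null instead of failing on non-Windows systems.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

public static class SafeUserName
{
    // UNLEN from the Windows SDK: maximum user name length, excluding the terminator.
    private const int MaxUserNameLength = 256;

    [DllImport("advapi32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    private static extern bool GetUserName(StringBuilder buffer, ref int size);

    // Returns the user name, or null when not on Windows or when the call fails.
    public static string TryGetUserName()
    {
        if (!RuntimeInformation.IsOSPlatform(OSPlatform.Windows))
        {
            return null;
        }

        StringBuilder buffer = new StringBuilder(MaxUserNameLength + 1);
        int size = buffer.Capacity; // in characters, including the terminator
        return GetUserName(buffer, ref size) ? buffer.ToString() : null;
    }

    public static void Main()
    {
        string s2 = String.Intern("String cannot be changed");
        string userName = TryGetUserName();

        Console.WriteLine("User name: {0}", userName ?? "(unavailable)");
        // The interned string was never handed to unmanaged code, so it is intact.
        Console.WriteLine("The second string: {0}", s2);
    }
}
```

With this signature, the marshaller gives the native function a buffer it is allowed to write to and copies the result back into the StringBuilder, so interned strings are left alone.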

The source of this article can also be found here.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Marius Serban

Software Developer
www.mariusserban.com
Canada

Comments and Discussions

 
String.Copy and interning | supercat9 | 6:43 15 Jan '09
Is it guaranteed that if a string is generated programmatically without an explicit request to intern it, that the allocated object will not get used by any other string even if the contents happen to match?
 
For example, if I had a function that was supposed to return a string if it worked, but return an indication if it failed (assuming the "failure" was sufficiently expected that an exception would be inappropriate), and if an empty string would be a legitimate result, would it be reasonable to do something like this:
Public Shared ErrorCondition1 As New String("?"c, 0)
Public Shared ErrorCondition2 As New String("?"c, 0)
 
Sub Whatever()
  Dim st As String = MyFunction()  ' Returns a legitimate string or an ErrorCondition one.
  ' Should return String.Empty if the result is legitimately a zero-length string.
 
  If String.IsNullOrEmpty(st) AndAlso st IsNot String.Empty Then
    If st Is ErrorCondition1 Then
      ' Handle error condition 1
    ElseIf st Is ErrorCondition2 Then
      ' Handle error condition 2
    Else
      ' It's some other error condition
    End If
  End If
End Sub
If the function always returns String.Empty for any 'legitimate' zero-length result, would code like the above be guaranteed to work (without the framework itself getting creative and deciding to map the zero-length strings to the same object)?
Last Updated 15 Jan 2009
Article Copyright 2009 by Marius Serban