
Microsoft Visual C# .NET 2003 Developer's Cookbook

3 Mar 2004
Chapter 3: Strings and Regular Expressions
Authors Mark Schmidt, Simon Robinson
Title Microsoft Visual C# .NET 2003 Developer's Cookbook
Publisher Sams
Published DEC 12, 2003
ISBN 0672325802
Price US$ 44.99
Pages 816

Chapter 3: Strings and Regular Expressions

3.0. Introduction

It would be very rare to create an entire application without using a single string. Strings help make sense of the seemingly random jumble of binary data that applications use to accomplish a task. They appear in all facets of application development, from the smallest system utility to large enterprise services. Their value is so apparent that more and more connected systems are leaning toward string data within their communication protocols by utilizing the Extensible Markup Language (XML) rather than the more cumbersome traditional transmission of large binary data. This book uses strings extensively to examine the internal contents of variables and the results of program flow using Framework Class Library (FCL) methods such as Console.WriteLine and MessageBox.Show.

In this chapter, you will learn how to take advantage of the rich support for strings within the .NET Framework and the C# language. Coverage includes ways to manipulate string contents, programmatically inspect strings and their character attributes, and optimize performance when working with string objects. Furthermore, this chapter uncovers the power of regular expressions and how they allow you to effectively parse and manipulate string data. After reading this chapter, you will be able to use regular expressions in a variety of different situations where their value is apparent.

3.1. Creating and Using String Objects

You want to create and manipulate string data within your application.

Technique

The C# language provides a string keyword whose behavior resembles that of a value data type, even though strings are reference types. To create a string, declare a variable using the string keyword. You can use the assignment operator to initialize the variable with either a string literal or an already initialized string variable.

string string1 = "This is a string";
string string2 = string1;

To gain more control over string initialization, declare a variable using the System.String  data type and create a new instance using the new keyword. The System.String  class contains several constructors that you can use to initialize the string value. For instance, to create a new string that is a small subset of an existing string, use the overloaded constructor, which takes a character array and two integers denoting the beginning index and the number of characters from that index to copy:

using System;

class Class1
{
  [STAThread]
  static void Main(string[] args)
  {
    string string1 = "Field1, Field2";
    System.String string2 = new System.String( string1.ToCharArray(), 8, 6 );

    Console.WriteLine( string2 );

  }
}

Finally, if you know a string will be intensively manipulated, use the System.Text.StringBuilder class. Creating a variable of this data type is similar to using the System.String class, and it contains several constructors to initialize the internal string value. The key difference between a regular string object and a StringBuilder lies in performance. Whenever a string is manipulated in some manner, a new object has to be created, and the old object becomes eligible for collection by the garbage collector once nothing refers to it. For a string that undergoes several transformations, the performance hit associated with frequent object creation and collection can be great. The StringBuilder class, on the other hand, maintains an internal buffer, which expands to make room for more string data should the need arise, thereby reducing the number of object allocations.
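The difference described above can be sketched as follows. This is only an illustration: the class name BuilderSketch and both method names are made up for this example, and the sketch shows the two coding styles rather than measuring their speed.

```csharp
using System;
using System.Text;

// Hypothetical helper class, not part of the original listings.
public static class BuilderSketch
{
    // Joins the numbers 1..count with commas using string concatenation.
    // Every += produces a brand-new string object.
    public static string JoinWithConcat( int count )
    {
        string result = "";
        for( int i = 1; i <= count; ++i )
        {
            if( i > 1 )
                result += ",";
            result += i.ToString();
        }
        return result;
    }

    // Produces the same text, but appends into one growable buffer.
    public static string JoinWithBuilder( int count )
    {
        StringBuilder sb = new StringBuilder();
        for( int i = 1; i <= count; ++i )
        {
            if( i > 1 )
                sb.Append( ',' );
            sb.Append( i );
        }
        return sb.ToString();
    }
}
```

Both methods return the same text, so calling either one from Main with a small count is enough to try them out. The StringBuilder version is the one to prefer when the loop count is large.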

Comments

There is no recommendation on whether you use the string  keyword or the System.String  class. The string keyword is simply an alias for this class, so it is all a matter of taste. We prefer using the string  keyword, but this preference is purely aesthetic. For this reason, we simply refer to the System.String  class as the string  class or data type.

The string class contains many methods, both instance and static, for manipulating strings. If you want to compare strings, you can use the Compare method. If you are just testing for equality, you might prefer the overloaded equality operator (==). The Compare method, however, returns an integer rather than a Boolean value, denoting how the two strings differ. If the return value is 0, the strings are equal. If the return value is greater than 0, the first operand sorts after the second operand. If the return value is less than 0, the opposite is true. Listing 3.1 handles all three cases. Note that this overload of Compare does not simply compare ASCII values. It compares the strings from left to right using the sorting rules of the current culture. To compare the numeric values of the individual characters instead, use the CompareOrdinal method.

Listing 3.1 Using the Compare Method in the String Class

using System;

namespace _1_UsingStrings
{
  class Class1
  {
    [STAThread]
    static void Main(string[] args)
    {
      string string1 = "";
      String string2 = "";

      Console.Write( "Enter string 1: " );
      string1 = Console.ReadLine();
      Console.Write( "Enter string 2: " );
      string2 = Console.ReadLine();
      
      // string and String are the same types

      Console.WriteLine( "string1 is a {0}\nstring2 is a {1}", 
        string1.GetType().FullName, string2.GetType().FullName );

      CompareStrings( string1, string2 );
    }

    public static void CompareStrings( string str1, string str2 )
    {
      int compare = String.Compare( str1, str2 );
      
      if( compare == 0 )
      {
        Console.WriteLine( "The strings {0} and {1} are the same.\n",
          str1, str2 );
      }
      else if( compare < 0 )
      {
        Console.WriteLine( "The string {0} is less than {1}",
          str1, str2 );
      }
      else if( compare > 0 )
      {
        Console.WriteLine( "The string {0} is greater than {1}",
          str1, str2 );
      }
    }
  }
}

As mentioned earlier, the string class contains both instance and static methods. Sometimes you have no choice about whether to use an instance or static method. However, a few of the instance methods have a static counterpart as well. Because calling a static method is a nonvirtual function call, the static version can be marginally faster. An example where both instance and static versions exist appears in Listing 3.1. The string comparison there uses the static Compare method. You can also perform it with the nonstatic CompareTo method, called on one of the string instances passed in as parameters. In most cases, the performance gain is negligible, but if an application needs to call these methods repeatedly, you might want to consider using the static method over the nonstatic one.
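As a small sketch of the two call styles, the following helper wraps both versions so that their results can be compared side by side. The class and method names are hypothetical, and Math.Sign is used only to reduce the result to -1, 0, or 1.

```csharp
using System;

// Hypothetical helper class, not part of the original listings.
public static class CompareSketch
{
    // Static form: both strings are passed as arguments.
    public static int SignOfStatic( string a, string b )
    {
        return Math.Sign( String.Compare( a, b ) );
    }

    // Instance form: the comparison is invoked on the first string,
    // so 'a' must not be null.
    public static int SignOfInstance( string a, string b )
    {
        return Math.Sign( a.CompareTo( b ) );
    }
}
```

One practical difference is that the static form accepts null references (a null string sorts before any other string), whereas the instance form throws a NullReferenceException when it is invoked on a null variable.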

The string class is immutable. Once a string is created, its contents cannot be changed. Methods within the string class that appear to modify a string instance actually create and return a new string object. The original instance is left untouched and is reclaimed by the garbage collector once nothing refers to it. It can be expensive to repeatedly call string methods if new objects are created and discarded continuously. To solve this, the .NET Framework provides the StringBuilder class within the System.Text namespace, which is explained later in this chapter.
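A minimal sketch of this behavior follows. The class and method names are invented for the example, and ToUpperInvariant is used so that the result does not depend on the current culture.

```csharp
using System;

// Hypothetical helper class, not part of the original listings.
public static class ImmutableSketch
{
    // Returns true when the "modifying" call handed back a different
    // object and left the original text untouched.
    public static bool OriginalSurvives()
    {
        string original = "hello";
        string upper = original.ToUpperInvariant();

        return original == "hello"
            && upper == "HELLO"
            && !Object.ReferenceEquals( original, upper );
    }
}
```

Forgetting to use the return value is a common slip: a statement such as original.ToUpperInvariant(); on its own line has no visible effect.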

3.2. Formatting Strings

Given one or more objects, you want to create a single formatted string representation.

Technique

You can format strings using numeric and picture formatting within String.Format  or within any method that uses string-formatting techniques for parameters such as Console.WriteLine.

Comments

The String class, along with several other methods within the .NET Framework, allows you to format strings to present them in a more ordered and readable form. Up to this point in the book, we used basic formatting when calling the Console.WriteLine method. The first parameter to Console.WriteLine is the format specifier string. This string controls how the remaining parameters to the method should appear when displayed. You use placeholders within the format string to insert the value of a variable. A placeholder uses the syntax {n}, where n is the zero-based index in the parameter list following the format specifier. Take the following line of code, for instance:

Console.WriteLine( "x={0}, y={1}, {0}+{1}={2}", x, y, x+y );

This line of code has three parameters following the format specifier string. You use placeholders within the format specification, and when this method is called, the appropriate substitutions are made. Although you can do the same thing using string concatenation, the resultant line of code is slightly obfuscated:

string s = "x=" + x + ", y=" + y + ", " + x + "+" + y + "=" + (x+y);
Console.WriteLine( s );
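The same placeholders also work with String.Format, which returns the formatted text instead of printing it. The following sketch assumes integer operands. The class name FormatSketch and the method DescribeSum are hypothetical.

```csharp
using System;

// Hypothetical helper class, not part of the original listings.
public static class FormatSketch
{
    // Builds the text that the Console.WriteLine call above would print,
    // but returns it as a string instead of writing it to the console.
    public static string DescribeSum( int x, int y )
    {
        return String.Format( "x={0}, y={1}, {0}+{1}={2}", x, y, x + y );
    }
}
```

Calling FormatSketch.DescribeSum( 2, 3 ) yields the string x=2, y=3, 2+3=5, which you can store, log, or pass on to another method.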

You can further refine the format by applying format attributes to the placeholders themselves. These additional attributes follow the parameter index value and are separated from that index with a : character. There are two types of special formatting available. The first is numeric formatting, which lets you format a numeric parameter into one of nine different numeric formats, as shown in Table 3.1. Each specifier can be followed by an optional precision value. Using the currency format as an example, the form is Cxx, where xx is a number from 0 to 99. The meaning of the precision depends on the specifier: for C it is the number of decimal places, whereas for D and X it is the minimum number of digits to display. Listing 3.2 shows how to display the values of an array in hexadecimal format after casting each one to an integer. Notice also how you can change the case of the hexadecimal digits A through F by using an uppercase or lowercase format specifier.

Table 3.1 Numeric Formatting Specifiers

Character | Format | Description
----------|--------|------------
C or c | Currency | Culturally aware currency format.
D or d | Decimal | Only supports integral numbers. Displays a string using decimal digits preceded by a minus sign if negative.
E or e | Exponential/scientific notation | Displays numbers in the form d.ddddddE±dd where d is a decimal digit.
F or f | Fixed point | Displays a series of decimal digits with a decimal point and additional digits.
G or g | General format | Displays either as a fixed-point or scientific notation based on the size of the number.
N or n | Number format | Similar to fixed point but uses a separator character (such as ,) for groups of digits.
P or p | Percentage | Multiplies the number by 100 and displays with a percent symbol.
R or r | Roundtrip | Formats a floating-point number so that it can be successfully converted back to its original value.
X or x | Hexadecimal | Displays an integral number using the base-16 number system.

Listing 3.2 Specifying a Different Numeric Format by Adding Format Specifiers on a Parameter Placeholder

using System;

namespace _2_Formatting
{
  class Class1
  {
    [STAThread]
    static void Main(string[] args)
    {
      double[] numArray = {2, 5, 4.5, 45.43, 200000};

      // format in lowercase hex

      Console.WriteLine( "\n\nHex (lower)\n-----------" );
      foreach( double num in numArray )
      {
        Console.Write( "0x{0:x}\t", (int) num );
      }

      // format in uppercase hex

      Console.WriteLine( "\n\nHex (upper)\n-----------" );
      foreach( double num in numArray )
      {
        Console.Write( "0x{0:X}\t", (int) num );
      }
    }
  }
}
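A precision value can be attached to the specifiers in Table 3.1. The following sketch shows a few combinations. The class and method names are hypothetical, and the invariant culture is passed explicitly only so that the separators are predictable. Real code would usually format with the user's culture.

```csharp
using System;
using System.Globalization;

// Hypothetical helper class, not part of the original listings.
public static class NumericFormatSketch
{
    // Assumption: a fixed culture keeps the output predictable.
    static readonly CultureInfo Inv = CultureInfo.InvariantCulture;

    // D5: pad an integer with leading zeros to at least five digits.
    public static string PaddedDecimal( int value )
    {
        return String.Format( Inv, "{0:D5}", value );
    }

    // X4: at least four uppercase hexadecimal digits.
    public static string PaddedHex( int value )
    {
        return String.Format( Inv, "0x{0:X4}", value );
    }

    // F2: fixed point with exactly two decimal places.
    public static string FixedPoint( double value )
    {
        return String.Format( Inv, "{0:F2}", value );
    }

    // N0: digit grouping with no decimal places.
    public static string Grouped( double value )
    {
        return String.Format( Inv, "{0:N0}", value );
    }
}
```

With these helpers, 42 becomes 00042, 255 becomes 0x00FF, 4.5 becomes 4.50, and 200000 becomes 200,000.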

Another type of formatting is picture formatting. Picture formatting allows you to create a custom format specifier using various symbols within the format specifier string. Table 3.2 lists the available picture format characters. Listing 3.3 also shows how to create a custom format specifier. In that code, the digits of the input number are extracted and displayed using a combination of digit placeholders and a decimal-point specifier. Furthermore, you can see that you are free to add characters not listed in the table. This freedom allows you to add literal characters intermixed with the digits.

Table 3.2 Picture Formatting Specifiers

Character | Name | Description
----------|------|------------
0 | Zero placeholder | Copies a digit to the result string if a digit is at the position of the 0. If no digit is present, a 0 is displayed.
# | Digit placeholder | Copies a digit to the result string if a digit appears at the position of the #. If no digit is present, nothing is displayed.
. | Decimal point | Represents the location of the decimal point in the resultant string.
, | Group separator and number scaling | Inserts thousands separators if placed between two placeholders or scales a number down by 1,000 per , character when placed directly to the left of a decimal point.
% | Percent | Multiplies a number by 100 and inserts a % symbol.
E±0, e±0 | Exponential notation | Displays the number in exponential notation using the number of 0s as a placeholder for the exponent value.
\ | Escape character | Causes the character that follows to be treated as a literal rather than as a formatting character. Note that within a regular C# string literal, the compiler first processes escape sequences such as \n for newline, \t for tab, and \\ for the \ character.
; | Section separator | Separates positive, negative, and zero numbers in the format string in which you can apply different formatting rules based on the sign of the original number.


Listing 3.3 shows how custom formatting can separate a number by its decimal point. Using a foreach loop, each value is printed using three different formats. The first format will output the value's integer portion using the following format string:

0:$#,#

Next, the decimal portion is written. If the value does not explicitly define a decimal portion, zeroes are written instead. The format string to output the decimal value is

$.#0;

Finally, the entire value is displayed up to two decimal places using the following format string:

{0:$#,#.00}

Listing 3.3 Using Picture Format Specifiers to Create Special Formats

using System;

namespace _2_Formatting
{
  class Class1
  {
    [STAThread]
    static void Main(string[] args)
    {
      double[] numArray = {2, 5, 4.5, 45.43, 200000};

      // format as custom

      Console.WriteLine( "\n\nCustom\n------" );
      foreach( double num in numArray )
      {
        Console.WriteLine( "{0:$#,# + $.#0;} = {0:$#,#.00}", num );
      }
    }
  }
}
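The section separator from Table 3.2 is easiest to see with a format that has all three sections. The following is a sketch under a few assumptions: the class and method names are invented, the invariant culture is used to keep the separators fixed, and the zero section is a quoted literal chosen for the example.

```csharp
using System;
using System.Globalization;

// Hypothetical helper class, not part of the original listings.
public static class SectionSketch
{
    // Three sections separated by semicolons:
    //   positive values ; negative values ; zero
    // The negative section formats the absolute value, so the minus
    // sign has to be written out as a literal character.
    public static string Signed( double value )
    {
        return value.ToString( "+#,0.00;-#,0.00;'zero'",
            CultureInfo.InvariantCulture );
    }
}
```

Here 1234.5 is rendered as +1,234.50, -2 as -2.00, and 0 as the word zero.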

3.3. Accessing Individual String Characters

You want to process individual characters within a string.

Technique

Use the index operator ([]) by specifying the zero-based index of the character within the string that you want to extract. Furthermore, you can also use the foreach  enumerator on the string using a char  structure as the enumeration data type.

Comments

The string class is really a collection of objects. These objects are individual characters. You can access each character using the same methods you would use to access an object in most other collections (which are covered in the next chapter).

You use an indexer to specify which object in a collection you want to retrieve. In C#, the first object begins at the 0 index of the string. The objects are individual characters whose data type is System.Char, which is aliased with the char  keyword. The indexer for the string  class, however, can only access a character and cannot set the value of a character at that position. Because a string is immutable, you cannot change the internal array of characters unless you create and return a new string. If you need the ability to index a string to set individual characters, use a StringBuilder  object.

Listing 3.4 shows how to access the characters in a string. One thing to point out is that because the string also implements the IEnumerable interface, you can use the foreach  control structure to enumerate through the string.

Listing 3.4 Accessing Characters Using Indexers and Enumeration

using System;
using System.Text;

namespace _3_Characters
{
  class Class1
  {
    [STAThread]
    static void Main(string[] args)
    {
      string str = "abcdefghijklmnopqrstuvwxyz";

      str = ReverseString( str );
      Console.WriteLine( str );

      str = ReverseStringEnum( str );
      Console.WriteLine( str );
    }

    static string ReverseString( string strIn )
    {
      StringBuilder sb = new StringBuilder(strIn.Length);

      for( int i = 0; i < strIn.Length; ++i )
      {
        sb.Append( strIn[(strIn.Length-1)-i] );
      }
      return sb.ToString();
    }

    static string ReverseStringEnum( string strIn )
    {
      StringBuilder sb = new StringBuilder( strIn.Length );
      foreach( char ch in strIn )
      {
        sb.Insert( 0, ch );
      }

      return sb.ToString();
    }
  }
}
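Listing 3.4 only reads characters through the indexer. To set a character by index, the text above suggests a StringBuilder, whose indexer can also be assigned to. A minimal sketch follows. The class name IndexerSketch and the method SwapChar are hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical helper class, not part of the original listings.
public static class IndexerSketch
{
    // Replaces every occurrence of one character by writing through the
    // StringBuilder indexer, which (unlike the string indexer) can be set.
    public static string SwapChar( string text, char from, char to )
    {
        StringBuilder sb = new StringBuilder( text );
        for( int i = 0; i < sb.Length; ++i )
        {
            if( sb[i] == from )
                sb[i] = to;
        }
        return sb.ToString();
    }
}
```

The equivalent assignment on a string, such as text[i] = to, does not compile because the string indexer is read-only.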

3.4. Analyzing Character Attributes

You want to evaluate the individual characters in a string to determine a character's attributes.

Technique

The System.Char  structure contains several static functions that let you test individual characters. You can test whether a character is a digit, letter, or punctuation symbol or whether the character is lowercase or uppercase.

Comments

One of the hardest issues to handle when writing software is making sure users input valid data. You can use many different methods, such as restricting input to only digits, but ultimately, you always need an underlying validating test of the input data.

You can use the System.Char structure to perform a variety of text-validation procedures. Listing 3.5 demonstrates validating user input as well as inspecting the characteristics of a character. It begins by displaying a menu and then waiting for user input using the Console.ReadLine method. Once a user enters a command, you make a check using the method ValidateMainMenuInput. This method checks to make sure the first character in the input string is not a digit, punctuation symbol, or other symbol and that it matches one of the menu commands. If the validation passes and the user chooses to enter a new phrase, that phrase is passed to a method that inspects each character in the input string. This method simply enumerates through all the characters in the input string and prints descriptive messages based on the characteristics. Not all of the System.Char inspection methods are used in Listing 3.5. Table 3.3 lists the complete set of inspection methods and their functionality. The results of running the application in Listing 3.5 appear in Figure 3.1.

Listing 3.5 Using the Static Methods in System.Char to Inspect the Details of a Single Character

using System;

namespace _4_CharAttributes
{
  class Class1
  {
    [STAThread]
    static void Main(string[] args)
    {
      char cmd = 'x';

      string input;
      do
      {
        DisplayMainMenu();
        input = Console.ReadLine();

        if( (input == "" ) || 
           ValidateMainMenuInput( Char.ToUpper(input[0]) ) == 0 )
        {
          Console.WriteLine( "Invalid command!" );
        }
        else
        {
          cmd = Char.ToUpper(input[0]);

          switch( cmd )
          {
            case 'Q':
            {
              break;
            }

            case 'N':
            {
              Console.Write( "Enter a phrase to inspect: " );
              input = Console.ReadLine();
              InspectPhrase( input );
              break;
            }
          }
        }
      } while ( cmd != 'Q' );
    }

    private static void InspectPhrase( string input )
    {
      foreach( char ch in input )
      {
        Console.Write( ch + " - ");

        if( Char.IsDigit(ch) )
          Console.Write( "IsDigit " );
        if( Char.IsLetter(ch) )
        {
          Console.Write( "IsLetter " );
          Console.Write( "(lowercase={0}, uppercase={1})", 
            Char.ToLower(ch), Char.ToUpper(ch));
        }
        if( Char.IsPunctuation(ch) )
          Console.Write( "IsPunctuation " );
        if( Char.IsWhiteSpace(ch) )
          Console.Write( "IsWhitespace" );

        Console.Write("\n");
        
      }
    }
    private static int ValidateMainMenuInput( char input )
    {
      // a simple check to see if input == 'N' or 'Q' is good enough

      // the following is for illustrative purposes

      if( Char.IsDigit( input ) == true )
        return 0;
      else if ( Char.IsPunctuation( input ) )
        return 0;
      else if( Char.IsSymbol( input ))
        return 0;
      else if( input != 'N' && input != 'Q' )
        return 0;

      return (int) input;
    }

    private static void DisplayMainMenu()
    {
      Console.WriteLine( "\nPhrase Inspector\n-------------------" );
      Console.WriteLine( "N)ew Phrase" );
      Console.WriteLine( "Q)uit\n" );
      Console.Write( ">> " );
    }
  }
}

Table 3.3 System.Char  Inspection Methods

Name | Description
-----|------------
IsControl | Denotes a control character such as a tab or carriage return.
IsDigit | Indicates a single decimal digit.
IsLetter | Used for alphabetic characters.
IsLetterOrDigit | Returns true if the character is a letter or a digit.
IsLower | Used to determine whether a character is lowercase.
IsNumber | Tests whether a character belongs to one of the Unicode number categories, which include the decimal digits.
IsPunctuation | Denotes whether a character is a punctuation symbol.
IsSeparator | Denotes a character used to separate strings. An example is the space character.
IsSurrogate | Checks whether a character is one half of a Unicode surrogate pair, which consists of two 16-bit values that together represent a single character outside the 16-bit range.
IsSymbol | Used for symbolic characters such as $ or +.
IsUpper | Used to determine whether a character is uppercase.
IsWhiteSpace | Indicates a character classified as whitespace such as a space character, tab, or carriage return.


Figure 3.1 Use the static methods in the System.Char structure to inspect character attributes.
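Several of the methods in Table 3.3 do not appear in Listing 3.5. The following sketch exercises some of them. The class name CharSketch, the method Describe, and the labels it returns are all made up for this example.

```csharp
using System;

// Hypothetical helper class, not part of the original listings.
public static class CharSketch
{
    // Builds a space-separated list of the categories from Table 3.3
    // that apply to a single character.
    public static string Describe( char ch )
    {
        string result = "";
        if( Char.IsControl( ch ) )       result += "control ";
        if( Char.IsLetterOrDigit( ch ) ) result += "letter-or-digit ";
        if( Char.IsLower( ch ) )         result += "lower ";
        if( Char.IsUpper( ch ) )         result += "upper ";
        if( Char.IsSeparator( ch ) )     result += "separator ";
        if( Char.IsSymbol( ch ) )        result += "symbol ";
        if( Char.IsWhiteSpace( ch ) )    result += "whitespace ";
        return result.TrimEnd();
    }
}
```

A character can fall into more than one category. A space, for example, is reported as both a separator and whitespace, whereas a tab is a control character and whitespace but not a separator.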

The System.Char structure is designed to work with a single Unicode character. Because a char is a 16-bit (2-byte) value, the range of a character is from 0 to 0xFFFF. For portability reasons in future systems, you can always check the upper bound of a char by using the MaxValue constant declared in the System.Char structure. One thing to keep in mind when working with characters is to avoid the confusion of mixing char types with integer types. Characters have an ordinal value, which is an integer value used as a lookup into a table of symbols. One example of a table is the ASCII table, which contains 128 characters and includes the digits 0 through 9, letters, punctuation symbols, and control characters. The confusion lies in the fact that the digit 6, for instance, has an ordinal char value of 0x36. Therefore, the line of code meant to initialize a character to the digit 6

char ch = (char) 6;

is wrong because the actual character in this instance is ^F, the ACK control character used in modem handshaking protocols. Displaying this value in the console would not provide the 6 that you were looking for. You could have chosen two different methods to initialize the variable. The first way is

char ch = (char) 0x36;

which produces the desired result and prints the number 6 to the console if passed to the Console.Write  method. However, unless you have the ASCII table memorized, this procedure can be cumbersome. To initialize a char variable, simply place the value between single quotes:

char ch = '6';
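The relationship between a digit character and its ordinal value can be sketched as follows. The class and method names are hypothetical, and the helpers deliberately handle only the ASCII digits 0 through 9.

```csharp
using System;

// Hypothetical helper class, not part of the original listings.
public static class DigitSketch
{
    // The ordinal value of a character, for example 0x36 for '6'.
    public static int OrdinalOf( char ch )
    {
        return (int) ch;
    }

    // Converts a digit character to the number it depicts by measuring
    // its distance from '0' in the character table.
    public static int DigitValue( char ch )
    {
        if( ch < '0' || ch > '9' )
            throw new ArgumentException( "Expected a digit from 0 to 9.", "ch" );
        return ch - '0';
    }

    // Goes the other way: the character that depicts a number from 0 to 9.
    public static char DigitChar( int value )
    {
        if( value < 0 || value > 9 )
            throw new ArgumentOutOfRangeException( "value" );
        return (char) ( '0' + value );
    }
}
```

Offsetting from '0' works because the digits occupy consecutive positions in the table, so no ordinal values need to be memorized.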

3.5. Case-Insensitive String Comparison

You want to perform case-insensitive string comparison on two strings.

Technique

Use the overloaded Compare  method in the System.String  class which accepts a Boolean value, ignoreCase, as the last parameter. This parameter specifies whether the comparison should be case insensitive (true) or case sensitive (false). To compare single characters, convert them to uppercase or lowercase, using ToUpper or ToLower, and then perform the comparison.

Comments

Validating user input requires a lot of forethought into the possible values a user can enter. Making sure you cover the range of possible values can be a daunting task, and you might ultimately run into human-computer interaction issues by severely limiting what a user can enter. Case-sensitivity issues increase the possible range of values, leading to greater security with respect to such things as passwords, but this security is usually at the expense of a user's frustration when she forgets whether a character is capitalized. As with many other programming problems, you must weigh the pros and cons.

To perform a case-insensitive comparison, you can use one of the many overloaded Compare  methods within the System.String  class. The methods that allow you to ignore case issues use a Boolean value as the last parameter in the method. This parameter is named ignoreCase, and when you set it to true, you make a case-insensitive comparison, as demonstrated in Listing 3.6.

Listing 3.6 Performing a Case-Insensitive String Comparison

using System;

namespace _5_CaseComparison
{
  class Class1
  {
    [STAThread]
    static void Main(string[] args)
    {
      string str1 = "This Is A String.";
      string str2 = "This is a string.";

      Console.WriteLine( "Case sensitive comparison of" +
        " str1 and str2 = {0}", String.Compare( str1, str2 ));

      Console.WriteLine( "Case insensitive comparison of" + 
        " str1 and str2 = {0}", String.Compare( str1, str2, true ));
    }
  }
}
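The Technique above also mentions comparing single characters by converting their case first. The following sketch wraps both approaches in helper methods. The class and method names are invented for this example.

```csharp
using System;

// Hypothetical helper class, not part of the original listings.
public static class CaseSketch
{
    // Compares two strings while ignoring case; true when they match.
    public static bool SameText( string a, string b )
    {
        return String.Compare( a, b, true ) == 0;
    }

    // Compares two characters by first converting both to uppercase.
    public static bool SameChar( char a, char b )
    {
        return Char.ToUpper( a ) == Char.ToUpper( b );
    }
}
```

Both helpers follow the casing rules of the current culture, so results for some letters can differ between cultures.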

3.6. Working with Substrings

You need to change or extract a specific portion of a string.

Technique

To copy a portion of a string into a new string, use the Substring method within the System.String class. You call this method using the string object instance of the source string:

string source = "ABCD1234WXYZ";
string dest = source.Substring( 4, 4 );
Console.WriteLine( "{0}\n", dest );

To copy a substring into an already existing character array, use the CopyTo method. To assign a character array to an existing string object, create a new instance of the string using the new keyword, passing the character array as a parameter to the string constructor as shown in the following code, whose output appears in Figure 3.2:

string source = "ABCD";
char [] dest = { '1', '2', '3', '4', '5', '6', '7', '8' };

Console.Write( "Char array before = " );
Console.WriteLine( dest );

// copy substring into char array

source.CopyTo( 0, dest, 4, source.Length );

Console.Write( "Char array after = " );
Console.WriteLine( dest );

// copy back into source string

source = new String( dest );

Console.WriteLine( "New source = {0}\n", source ); 


Figure 3.2 Use the CopyTo  method to copy a substring into an existing character array.

If you need to remove a substring within a string and replace it with a different substring, use the Replace method. This method accepts two parameters, the substring to replace and the string to replace it with:

string replaceStr = "1234";
string dest = "ABCDEFGHWXYZ";

dest = dest.Replace( "EFGH", replaceStr );

Console.WriteLine( dest );

To extract an array of substrings that are separated from each other by one or more delimiters, use the Split method. This method accepts a character array of delimiter characters and returns a string array of each substring within the original string as shown in the following code, whose output appears in Figure 3.3. You can optionally supply an integer specifying the maximum number of substrings to return:

char delim = '\\';
string filePath = "C:\\Windows\\Temp";
string [] directories = null;

directories = filePath.Split( delim );
        
foreach (string directory in directories) 
{
  Console.WriteLine("{0}", directory);
}


Figure 3.3 You can use the Split  method in the System.String  class to place delimited substrings into a string array.

Comments

Parsing strings is not for the faint of heart. However, the job becomes easier if you have a rich set of methods that allow you to perform all types of operations on strings. Substrings are the goal of a majority of these operations, and the string  class within the .NET Framework contains many methods that are designed to extract or change just a portion of a string.

The Substring method extracts a portion of a string and places it into a new string object. You have two options with this method. If you pass a single integer, the Substring method extracts the substring that starts at that index and continues until it reaches the end of the string. One thing to keep in mind is that C# string indices are 0 based. The first character within the string has an index of 0. The second Substring overload accepts an additional parameter that denotes the number of characters to extract, not an ending index. It lets you extract parts of a string in the middle of the string.
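The two overloads can be sketched as follows, using thin wrapper methods whose names (SubstringSketch, Tail, and Middle) are hypothetical. Note that the second argument of the two-parameter overload is a character count.

```csharp
using System;

// Hypothetical helper class, not part of the original listings.
public static class SubstringSketch
{
    // Everything from startIndex to the end of the string.
    public static string Tail( string source, int startIndex )
    {
        return source.Substring( startIndex );
    }

    // Exactly 'length' characters beginning at startIndex.
    public static string Middle( string source, int startIndex, int length )
    {
        return source.Substring( startIndex, length );
    }
}
```

For the source string ABCD1234WXYZ, Tail with a start index of 8 returns WXYZ, and Middle with a start index of 4 and a length of 4 returns 1234.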

You can create a new character array from a string by using the ToCharArray method of the string class. Furthermore, you can extract a substring from the string and place it into a character array by using the CopyTo method. The difference between these two methods is that the character array used with the CopyTo method must be an already instantiated array. Whereas the ToCharArray method returns a new character array, the CopyTo method expects an existing character array as a parameter. Furthermore, although methods exist to extract character arrays from a string, there is no instance method available to assign a character array to a string. To do this, you must create a new string object using the new keyword, rather than assigning a string literal, and pass the character array as a parameter to the string constructor.

Using the Replace method is a powerful way to alter the contents of a string. This method searches for all instances of a specified substring within a string and replaces them with a different substring. Additionally, the substring you want to replace does not have to be the same length as the string you are replacing it with. If you recall the number of times you have performed a search and replace in any application, you can see the possible advantages of this method.
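A short sketch of both points, replacing every instance and using a replacement of a different length, follows. The class name ReplaceSketch and the method ReplaceAll are hypothetical.

```csharp
using System;

// Hypothetical helper class, not part of the original listings.
public static class ReplaceSketch
{
    // Replaces every occurrence of oldText and returns the result as a
    // new string; the text passed in is left unchanged.
    public static string ReplaceAll( string text, string oldText, string newText )
    {
        return text.Replace( oldText, newText );
    }
}
```

For example, replacing fish with octopus in the text one fish, two fish produces one octopus, two octopus.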

One other powerful method is Split. By passing a character array consisting of delimiter characters, you can split a string into a group of substrings and place them into a string array. By passing an additional integer parameter, you can also control how many substrings to extract from the source string. Referring to the code example earlier demonstrating the Split method, you can split a string representing a directory path into individual directory names by passing the \  character as the delimiter. You are not, however, confined to using a single delimiter. If you pass a character array consisting of several delimiters, the Split method extracts substrings based on any of the delimiters that it encounters.
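The multiple-delimiter and maximum-count variations described above can be sketched as follows. The class and method names are hypothetical, and the delimiter sets are chosen only for illustration.

```csharp
using System;

// Hypothetical helper class, not part of the original listings.
public static class SplitSketch
{
    // Splits on either of two delimiter characters.
    public static string[] SplitFields( string text )
    {
        return text.Split( new char[] { ',', ';' } );
    }

    // Limits the number of substrings returned; the last element
    // keeps the unsplit remainder of the string.
    public static string[] SplitFirst( string path, int maxParts )
    {
        return path.Split( new char[] { '\\' }, maxParts );
    }
}
```

Splitting a,b;c with SplitFields gives the three substrings a, b, and c. Splitting C:\Windows\Temp with SplitFirst and a maximum of 2 gives C: followed by Windows\Temp.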

3.7. Using Verbatim String Syntax

You want to represent a path to a file using a string without using escape characters for path separators.

Technique

When assigning a literal string to a string object, preface the string with the @ symbol. It turns off backslash escape processing (the only special sequence left is "", which stands for a single double-quote character), so there is no need to escape path separators:

string nonVerbatim = "C:\\Windows\\Temp";
string verbatim = @"C:\Windows\Temp";

Comments

A frequent compiler error comes from forgetting to escape path separators. Although hard-coding path strings is a common programming faux pas, that rule is often overlooked when testing an application. Visual C# .NET includes verbatim string syntax as a feature to alleviate the frustration of having to escape all the path separators within a file path string, which can be especially cumbersome for long paths.
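The following sketch compares the two literal styles and shows how a double quote is written inside a verbatim string. The class and method names are invented, and the paths are placeholders.

```csharp
using System;

// Hypothetical helper class, not part of the original listings.
public static class VerbatimSketch
{
    // Both literals produce exactly the same characters.
    public static bool SamePath()
    {
        string escaped  = "C:\\Windows\\Temp";
        string verbatim = @"C:\Windows\Temp";
        return escaped == verbatim;
    }

    // Inside a verbatim string, a double quote is written as "".
    public static string QuotedPath()
    {
        return @"""C:\Program Files\App""";
    }
}
```

QuotedPath returns the path wrapped in one pair of double quotes, which is a common requirement when passing paths that contain spaces to a command line.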

3.8. Choosing Between Constant and Mutable Strings

You want to choose the correct string data type to best fit your current application design.

Technique

If you know a string's value will not change often, use a string object, which is a constant value. If you need a mutable string, one that can change its value without having to allocate a new object, use a StringBuilder.

Comments

Using a regular string object is best when you know the string will not change or will only change slightly. This change includes the whole gamut of string operations that produce a different value, such as concatenation, insertion, replacement, or removal of characters. The Common Language Runtime (CLR) can use certain properties of strings to optimize performance. If the CLR can determine that two string objects are the same, it can share the memory that these string objects occupy. These strings are then known as interned strings. The CLR contains an intern pool, which is a lookup table of string instances. String literals that appear within code are interned automatically. However, you can also manually place a string within the intern pool by using the Intern method. To test whether a string is interned, use the IsInterned method, as shown in Listing 3.7.

Listing 3.7 Interning a String by Using the Intern  Method

using System;

namespace _7_StringBuilder
{
  /// <summary>
  /// Summary description for Class1.
  /// </summary>
  class Class1
  {
    /// <summary>
    /// The main entry point for the application.
    /// </summary>
    [STAThread]
    static void Main(string[] args)
    {
      string sLiteral = "Automatically Interned";
      string sNotInterned = "Not " + sLiteral;

      TestInterned( sLiteral );
      TestInterned( sNotInterned );

      String.Intern( sNotInterned );
      TestInterned( sNotInterned );
    }
    
    static void TestInterned( string str )
    {
      if( String.IsInterned( str ) != null )
      {
        Console.WriteLine( "The string \"{0}\" is interned.", str );
      }
      else
      {
        Console.WriteLine( "The string \"{0}\" is not interned.", str );
      }
    }
  }
}
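To make the memory sharing visible, the following minimal sketch compares object references rather than string values. It assumes nothing beyond String.Intern, Object.ReferenceEquals, and StringBuilder; the class name InternDemo and the helper BuildAtRunTime are illustrative. Note that Intern returns the pooled instance, so a caller that wants to share memory should keep the returned reference.

```csharp
using System;
using System.Text;

class InternDemo
{
  // Builds "Interned" at run time so the result is a new object,
  // not the literal that was placed in the intern pool.
  public static string BuildAtRunTime()
  {
    return new StringBuilder().Append( "Inter" ).Append( "ned" ).ToString();
  }

  static void Main()
  {
    string literal = "Interned";
    string built = BuildAtRunTime();

    // Same characters, but two different objects.
    Console.WriteLine( literal == built );                          // True
    Console.WriteLine( Object.ReferenceEquals( literal, built ) );  // False

    // Intern returns the pooled instance, which is the literal itself.
    string pooled = String.Intern( built );
    Console.WriteLine( Object.ReferenceEquals( literal, pooled ) ); // True
  }
}
```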

A StringBuilder  behaves similarly to a regular string object and also contains similar method calls. However, there are no static methods because the StringBuilder  class is designed to work on string instances. Method calls on an instance of a StringBuilder object change the internal string of that object, as shown in Listing 3.8. A StringBuilder  maintains its mutable appearance by creating a buffer that is large enough to contain a string value and additional memory should the string need to grow.

Listing 3.8 Manipulating an Internal String Buffer Instead of Returning New String Objects

using System;
using System.Text;

namespace _7_StringBuilder
{
  class Class1
  {
    [STAThread]
    static void Main(string[] args)
    {
      string string1 = "";
      String string2 = "";

      Console.Write( "Enter string 1: " );
      string1 = Console.ReadLine();
      Console.Write( "Enter string 2: " );
      string2 = Console.ReadLine();
      
      BuildStrings( string1, string2 );
    }

    public static void BuildStrings( string str1, string str2 )
    {
      StringBuilder sb = new StringBuilder( str1 + str2 );
      sb.Insert( str1.Length, " is the first string.\n" );
      sb.Insert( sb.Length, " is the second string.\n" );

      Console.WriteLine( sb );
    }
  }
}
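A StringBuilder tends to pay off when a value is assembled piece by piece, for example inside a loop. The following sketch, with an illustrative JoinDemo class and JoinWithCommas helper that are not part of the listings, appends every item to a single buffer instead of creating a new string object for each concatenation.

```csharp
using System;
using System.Text;

class JoinDemo
{
  // Appends each item to one StringBuilder buffer instead of
  // creating a new string object for every concatenation.
  public static string JoinWithCommas( string[] items )
  {
    StringBuilder sb = new StringBuilder();
    for( int i = 0; i < items.Length; i++ )
    {
      if( i > 0 )
      {
        sb.Append( ", " );
      }
      sb.Append( items[i] );
    }
    return sb.ToString();
  }

  static void Main()
  {
    string[] names = { "alpha", "beta", "gamma" };
    Console.WriteLine( JoinWithCommas( names ) );  // alpha, beta, gamma
  }
}
```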

3.9. Optimizing StringBuilder  Performance

Knowing that a StringBuilder object can suffer more of a performance hit than a regular string object, you want to optimize the StringBuilder object to minimize performance issues.

Technique

Use the EnsureCapacity method in the StringBuilder class. Pass it an integer that signifies the length of the longest string you may store in this buffer.

Comments

The StringBuilder class contains methods that allow you to expand the memory of the internal buffer based on the size of the string you may store. If you reserve enough room up front, the StringBuilder won't have to repeatedly allocate new memory for the internal buffer as your string grows. In other words, if you attempt to place a longer string than the internal buffer of the StringBuilder  class can accept, the class has to allocate additional memory to accept the new data. If you continuously add strings that increase in size from the last input string, the StringBuilder class has to keep allocating a larger buffer, which it does internally by calling the GetStringForStringBuilder  method defined in the System.String  class. This method ultimately calls the unmanaged method FastAllocateString. By giving the StringBuilder class a hint using the EnsureCapacity method, you reduce the number of memory allocations needed to store a string value.
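A minimal sketch of the technique follows. The class name CapacityDemo, the helper BuildLine, and the figure of 500 characters are assumptions chosen for illustration. The exact capacity the StringBuilder ends up with is an implementation detail, so the example only checks that it is at least the requested size.

```csharp
using System;
using System.Text;

class CapacityDemo
{
  // Reserves room for expectedLength characters before appending.
  public static StringBuilder BuildLine( int expectedLength )
  {
    StringBuilder sb = new StringBuilder();
    sb.EnsureCapacity( expectedLength );
    for( int i = 0; i < expectedLength; i++ )
    {
      sb.Append( '-' );
    }
    return sb;
  }

  static void Main()
  {
    StringBuilder sb = BuildLine( 500 );
    Console.WriteLine( "Length = {0}", sb.Length );                             // Length = 500
    Console.WriteLine( "Capacity is at least 500: {0}", sb.Capacity >= 500 );  // True
  }
}
```

If the expected size is known when the object is created, the StringBuilder constructor that takes an initial capacity achieves the same effect in one step.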

3.10. Understanding Basic Regular Expression Syntax

You want to create a regular expression.

Technique

Regular expressions consist of a series of characters and quantifiers on those characters. The characters themselves can be literal or can be denoted by using character classes, such as \d, which denotes a digit character class, or \S, which denotes any nonwhitespace character.

Table 3.4 Regular Expression Single Character Classes

Class    Description
\d       Any digit
\D       Any nondigit
\w       Any word character
\W       Any nonword character
\s       Any whitespace character
\S       Any nonwhitespace character

In addition to the single character classes, you can also specify a range or set of characters using ranged and set character classes. This ability allows you to narrow the search for a specified character by limiting characters within a specified range or within a defined set.

Table 3.5 Ranged and Set Character Classes

Format       Description
.            Any character except newline.
\p{uc}       Any character within the Unicode character category uc.
[abcxyz]     Any literal character in the set.
\P{uc}       Any character not within the Unicode character category uc.
[^abcxyz]    Any character not in the set of literal characters.

Quantifiers work on character classes to expand the number of characters the character classes should match. You can specify, for instance, the * wildcard on a character class, which means 0 or more characters within that class. You can also specify an exact number of matches of a class by using an integer within braces following the character class designation.

Table 3.6 Character Class Quantifiers

Format    Description
*         0 or more characters
+         1 or more characters
?         0 or 1 characters
{n}       Exactly n characters
{n,}      At least n characters
{n,m}     At least n but no more than m characters

You can also specify where a certain regular expression should start within a string. Positional assertions allow you to, for instance, match a certain expression as long as it occurs at the beginning or ending of a string. Furthermore, you can create a regular expression that operates on a set of words within a string by using a positional assertion that continues matching on each subsequent word separated by any nonalphanumeric character.

Table 3.7 Positional (Atomic Zero-Width) Assertions

Format    Description
^         Beginning of a string, or beginning of a line in multiline mode
\z        End of the string only
$         End of a string or before a final newline character, or end of a line in multiline mode
\G        Continues where the last match left off
\A        Beginning of a string
\b        At a word boundary (between a word character and a nonword character)
\Z        End of the string or before a final newline character
\B        Not at a word boundary
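The following sketch tries a few of these assertions with the static Regex.Matches and Regex.IsMatch methods. The class name BoundaryDemo, the helper names, and the sample inputs are illustrative assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

class BoundaryDemo
{
  // Counts occurrences of "cat" that stand alone as a whole word.
  public static int CountWholeWord( string input )
  {
    return Regex.Matches( input, @"\bcat\b" ).Count;
  }

  // \A and \z pin the pattern to the very start and very end of the input.
  public static bool IsAllDigits( string input )
  {
    return Regex.IsMatch( input, @"\A\d+\z" );
  }

  static void Main()
  {
    Console.WriteLine( CountWholeWord( "cat concat cat, category" ) );  // 2
    Console.WriteLine( IsAllDigits( "12345" ) );                        // True
    Console.WriteLine( IsAllDigits( "12345\n" ) );                      // False
  }
}
```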

Comments

Regular expressions use a variety of characters both symbolic and literal to designate how a particular string of text should be parsed. The act of parsing a string is known as matching, and when applied to a regular expression, the match will be either true or false. In other words, when you use a regular expression to match a series of characters, the match will either succeed or fail. As you can see, this process has powerful applicability in the area of input validation.

You build regular expressions using a series of character classes and quantifiers on those character classes as well as a few miscellaneous regular-expression constructs. You use character classes to match a single character based either on what type of character it is, such as a digit or letter, or whether it belongs within a specified range or set of characters (as shown in Tables 3.4 and 3.5). Using this information, you can create a series of character classes to match a certain string of text. For instance, if you want to specify a phone number using character classes, you can use the following regular expression:

\(\d\d\d\)\s\d\d\d-\d\d\d\d

This expression begins by first escaping the left parenthesis. You must escape it because parentheses are used for grouping expressions. Next you can see three digits representing a phone number's area code followed by the closing parenthesis. You use a \s  to denote a whitespace character. The remainder of the regular expression contains the remaining digits of the phone number.

In addition to the single character classes, you can also use ranged and set character classes. They give you fine-grain control on exactly the type of characters the regular expression should match. For instance, if you want to match any character as long as it is a vowel, use the following expression:

[aeiou]

This expression means that a character should match one of the literal characters within that set. An even more specialized form of single character class is the Unicode character category. Unicode categories are similar to some of the character-attribute inspection methods shown earlier in this chapter. For instance, you can use Unicode categories to match on uppercase or lowercase characters. Other categories include punctuation characters, currency symbols, and math symbols, to name a few. You can easily find the full list of Unicode categories in MSDN under the topic "Unicode Categories Enumeration."
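As a small illustration, the following sketch counts matches for the vowel set and for the Unicode uppercase-letter category, written \p{Lu}. The class name CharClassDemo and the sample inputs are assumptions made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

class CharClassDemo
{
  // Counts lowercase vowels by using a set character class.
  public static int CountVowels( string input )
  {
    return Regex.Matches( input, "[aeiou]" ).Count;
  }

  // \p{Lu} is the Unicode category for uppercase letters.
  public static int CountUppercase( string input )
  {
    return Regex.Matches( input, @"\p{Lu}" ).Count;
  }

  static void Main()
  {
    Console.WriteLine( CountVowels( "regular expression" ) );  // 7
    Console.WriteLine( CountUppercase( "Visual C# .NET" ) );   // 5
  }
}
```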

You can simplify the phone-number expression, although it's completely valid, by using quantifiers. A quantifier specifies additional information about the character, character class, or expression to which it applies. Some quantifiers include wildcards such as *, which means 0 or more occurrences, and ?, which means only 0 or 1 occurrences of a pattern. You can also use braces containing an integer to specify how many characters within a given character class to match. Using this quantifier in the phone-number expression, you can specify that the phone number should contain three digits for the area code followed by three digits and four digits separated by a dash:

\(\d{3}\)\s\d{3}-\d{4}

Although the regular expression itself isn't that complicated, you can still see that using quantifiers simplifies regular-expression creation. In addition to character classes and quantifiers, you can also use positional information within a regular expression. For instance, you can specify that given an input string, the regular expression should operate at the beginning of the string. You express it using the ^  character. Likewise, you can also denote the end of a string using the $  symbol. Take note that this doesn't mean start at the end of the string and attempt to make a match, which would be pointless because no characters exist past the end of the string. Rather, placing the $  character after the rest of the regular expression means that the match must occur at the end of the string. For instance, if you want to match a sentence in which a phone number is the last portion of the sentence, you could use the following:

\(\d{3}\)\s\d{3}-\d{4}$
My phone number is (555) 555-5555 = Match
(555) 555-5555 is my phone number = Not a match
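The two sample sentences can be checked with a short sketch that uses the static Regex.IsMatch method. The class name AnchorDemo and the helper EndsWithPhone are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

class AnchorDemo
{
  // True when the input ends with a phone number in the form (ddd) ddd-dddd.
  public static bool EndsWithPhone( string input )
  {
    return Regex.IsMatch( input, @"\(\d{3}\)\s\d{3}-\d{4}$" );
  }

  static void Main()
  {
    Console.WriteLine( EndsWithPhone( "My phone number is (555) 555-5555" ) );  // True
    Console.WriteLine( EndsWithPhone( "(555) 555-5555 is my phone number" ) );  // False
  }
}
```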

3.11. Validating User Input with Regular Expressions

You want to ensure valid user input by using regular expressions to test for validity.

Technique

Create a Regex object, which exists within the System.Text.RegularExpressions  namespace, passing the regular expression in as a parameter to the constructor. Next, call the member method Match  using the string you want to validate as a parameter to the method. The method returns a Match object regardless of the outcome. To test whether a match is made, evaluate the Boolean Success property on that Match object as demonstrated in Listing 3.9. Note also that the backslash (\) character appears frequently in regular expressions. To avoid compilation errors from inadvertently specifying an invalid escape sequence, use the @  symbol to turn off escape processing.

Listing 3.9 Validating User Input of a Phone Number Using a Regular Expression

using System;
using System.Text.RegularExpressions;

namespace _11_RegularExpressions
{
  class Class1
  {
    [STAThread]
    static void Main(string[] args)
    {
      Regex phoneExp = new Regex( @"^\(\d{3}\)\s\d{3}-\d{4}$" );
      string input;

      Console.Write( "Enter a phone number: " );
      input = Console.ReadLine();
      
      while( phoneExp.Match( input ).Success == false )
      {
        Console.WriteLine( "Invalid input. Try again." );
        Console.Write( "Enter a phone number: " );
        input = Console.ReadLine();
      }

      Console.WriteLine( "Validated!" );
    }
  }
}

Comments

Earlier in this chapter I mentioned that you could perform data validation using the static methods within the System.Char class. You can inspect each character within the input string to ensure it matches exactly what you are looking for. However, this method of input validation can be extremely cumbersome if you have different input types to validate because it requires custom code for each validation. In other words, using the methods in the System.Char  class is not recommended for anything but the simplest of data-validation procedures.

Regular expressions, on the other hand, allow you to express even complex input validation within a single expression. You are in effect passing the parsing of the input string to the regular-expression engine and offloading all the work that you would normally do.

In Listing 3.9, you can see how you create and use a regular expression to test the validity of a phone number entered by a user. The regular expression is similar to the expressions used earlier for phone numbers except for the addition of positional markers. Because of the ^ and $ markers, the match succeeds only if a user enters a phone number and nothing else. A match is successful when the Success  property within the Match  object, which is returned from the Regex.Match method, is true. The only caveat to using regular expressions for input validation is that even though you know the validation failed, you are unable to query the Regex or Match class to see what part of the string failed.
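Because each rule is just a pattern string, one helper can serve several kinds of input. The sketch below is one possible arrangement, not the book's listing: the ValidationDemo class, the IsValid helper, and the five-digit pattern are hypothetical, while the phone pattern is the one from Listing 3.9.

```csharp
using System;
using System.Text.RegularExpressions;

class ValidationDemo
{
  // Pattern taken from Listing 3.9.
  public const string PhonePattern = @"^\(\d{3}\)\s\d{3}-\d{4}$";

  // Hypothetical second rule, not from the listing: exactly five digits.
  public const string FiveDigitPattern = @"^\d{5}$";

  // One helper serves every rule; only the pattern changes.
  public static bool IsValid( string input, string pattern )
  {
    return Regex.IsMatch( input, pattern );
  }

  static void Main()
  {
    Console.WriteLine( IsValid( "(555) 555-5555", PhonePattern ) );  // True
    Console.WriteLine( IsValid( "555-5555", PhonePattern ) );        // False
    Console.WriteLine( IsValid( "12345", FiveDigitPattern ) );       // True
    Console.WriteLine( IsValid( "1234a", FiveDigitPattern ) );       // False
  }
}
```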

3.12. Replacing Substrings Using Regular Expressions

You want to replace all substrings that match a regular expression with a different substring that also uses regular-expression syntax.

Technique

Create a Regex object, passing the regular expression used to match characters in the input string to the Regex constructor. Next, call the Regex  method Replace, passing the input string to process and the string to replace each match within the input string, as shown in the last statement of Listing 3.10. You can also use the static Replace method, passing the input string as the first parameter, the regular expression as the second, and the replacement string as the third.

Listing 3.10 Using Regular Expressions to Replace Numbers in a Credit Card with xs

using System;
using System.Text.RegularExpressions;

namespace _12_RegExpReplace
{
  class Class1
  {
    [STAThread]
    static void Main(string[] args)
    {
      Regex cardExp = new Regex( @"(\d{4})-(\d{4})-(\d{4})-(\d{4})" );
      string safeOutputExp = "$1-xxxx-xxxx-$4";
      string cardNum;

      Console.Write( "Please enter your credit card number: " );
      cardNum = Console.ReadLine();
    
      while( cardExp.Match( cardNum ).Success == false )
      {
        Console.WriteLine( "Invalid card number. Try again." );
        Console.Write( "Please enter your credit card number: " );

        cardNum = Console.ReadLine();
      }

      Console.WriteLine( "Secure Output Result = {0}", 
        cardExp.Replace( cardNum, safeOutputExp ));
    }
  }
}

Comments

Although input validation is an extremely useful feature of regular expressions, they also work well as text parsers. The previous recipe used regular expressions to verify that a particular string matched a regular expression exactly. However, you can also use regular expressions to match substrings within a string and return each of those substrings as a group. Furthermore, you can use a separate regular expression that acts on the result of the regular-expression evaluation to replace substrings within the original input string.

Listing 3.10 creates a regular expression that matches the format for a credit card. In that regular expression, you can see that it will match on four different groups of four digits apiece separated by a dash. However, you might also notice that each one of these groups is surrounded with parentheses. In an earlier recipe, I mentioned that to use a literal parenthesis, you must escape it using a backslash because of the conflict with regular-expression grouping symbols. In this case, you want to use the grouping feature of regular expressions. When you place a portion of a regular expression within parentheses, you are creating a numbered group. Groups are numbered starting with 1 and are incremented for each subsequent group. In this case, there are four numbered groups. These groups are used by the replacement string, which is contained in the string safeOutputExp. To reference a numbered group, use the $  symbol followed by the number of the group to reference. This sequence represents all characters within the input string that match the group expression within the regular expression. Therefore, in the replacement string, you can see that it prints the characters within the first group, replaces the characters in the second and third groups with xs, and finally prints the characters in the fourth group.
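The numbered groups can also be read directly from the Match object through its Groups collection. The following sketch, with an illustrative GroupDemo class, reads the first group and then performs the same masking with the static Regex.Replace overload, whose parameters are the input string, the pattern, and the replacement string, in that order.

```csharp
using System;
using System.Text.RegularExpressions;

class GroupDemo
{
  const string CardPattern = @"(\d{4})-(\d{4})-(\d{4})-(\d{4})";

  // Returns the text captured by group 1, or an empty string if there is no match.
  public static string FirstGroup( string cardNum )
  {
    Match m = Regex.Match( cardNum, CardPattern );
    return m.Success ? m.Groups[1].Value : "";
  }

  // Static overload: input first, then pattern, then replacement.
  public static string Mask( string cardNum )
  {
    return Regex.Replace( cardNum, CardPattern, "$1-xxxx-xxxx-$4" );
  }

  static void Main()
  {
    Console.WriteLine( FirstGroup( "1234-5678-9012-3456" ) );  // 1234
    Console.WriteLine( Mask( "1234-5678-9012-3456" ) );        // 1234-xxxx-xxxx-3456
  }
}
```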

One thing to note is that you can use the Regex class to view the individual matches themselves. If you change the regular expression to @"\d{4}", you can then use the Matches method to enumerate all the matches using the foreach keyword, as shown in Listing 3.11. In the listing, the program first checks to make sure at least four matches were made. This number corresponds to four groups of four digits. Next, it uses a foreach enumeration on each Match  object that is returned from the Matches method. If the match is the second or third group of digits, its value is replaced with xs; otherwise, the Match object's value, the characters within that group, is concatenated to the result string.

Listing 3.11 Enumerating Through the Match  Collection to Perform Special Operations on Each Match in a Regular Expression

static void TestManualGrouping()
{
  Regex cardExp = new Regex( @"\d{4}" );
  string cardNum;
  string safeOutputExp = "";

  Console.Write( "Please enter your credit card number: " );
  cardNum = Console.ReadLine();

  if( cardExp.Matches( cardNum ).Count < 4 )
  {
    Console.WriteLine( "Invalid card number" );
    return;
  }

  foreach( Match field in cardExp.Matches( cardNum ))
  {
    if( field.Success == false )
    {
      Console.WriteLine( "Invalid card number" );
      return;
    }

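    // Separate consecutive groups of digits with a single dash.
    if( safeOutputExp.Length > 0 )
    {
      safeOutputExp += "-";
    }

    // Matches at index 5 and 10 are the second and third groups.
    if( field.Index == 5 || field.Index == 10 )
    {
      safeOutputExp += "xxxx";
    }
    else
    {
      safeOutputExp += field.Value;
    }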
  }

  Console.WriteLine( "Secure Output Result = {0}", safeOutputExp );
}
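A different technique from the one in Listing 3.11 is to hand the per-match decision to a MatchEvaluator delegate and let Regex.Replace rebuild the string. The sketch below is only one way to arrange this; the EvaluatorDemo class and MaskMiddle method are illustrative, and the index test assumes the input uses the ####-####-####-#### layout. Because only the matched digits are replaced, the dashes in the input pass through untouched.

```csharp
using System;
using System.Text.RegularExpressions;

class EvaluatorDemo
{
  // Called once per match; decides the replacement text for that match.
  static string MaskMiddle( Match field )
  {
    // Matches at index 5 and 10 are the second and third groups of digits.
    if( field.Index == 5 || field.Index == 10 )
    {
      return "xxxx";
    }
    return field.Value;
  }

  public static string Mask( string cardNum )
  {
    Regex cardExp = new Regex( @"\d{4}" );
    return cardExp.Replace( cardNum, new MatchEvaluator( MaskMiddle ) );
  }

  static void Main()
  {
    Console.WriteLine( Mask( "1234-5678-9012-3456" ) );  // 1234-xxxx-xxxx-3456
  }
}
```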

3.13. Building a Regular Expression Library

You want to create a library of regular expressions that you can reuse in other projects.

Technique

Use the CompileToAssembly static method within the Regex class to compile a regular expression into an assembly. This method uses an array of RegexCompilationInfo objects that contain any number of regular expressions you want to add to the assembly.

The RegexCompilationInfo class contains a constructor with five parameters that you must supply. The parameters denote the string for the regular expression; any options for the regular expression, which appear in the RegexOptions  enumerated type; a name for the class that is created to hold the regular expression; a corresponding namespace; and a Boolean value specifying whether the created class should have a public access modifier.

After creating the RegexCompilationInfo  object, create an AssemblyName object, making sure to reference the System.Reflection namespace, and set the Name property to the name you want for the resultant assembly file. Because the CompileToAssembly method creates a DLL, exclude the DLL extension from the assembly name. Finally, place all the RegexCompilationInfo objects within an array and call the CompileToAssembly  method, as shown in Listing 3.12.

Listing 3.12 Using the CompileToAssembly  Regex  Method to Save Regular Expressions in a New Assembly for Later Reuse

using System;
using System.Text.RegularExpressions;
using System.Reflection;

namespace _12_RegExpReplace
{
  class Class1
  {
    [STAThread]
    static void Main(string[] args)
    {
      CompileRegex(@"(\d{4})-(\d{4})-(\d{4})-(\d{4})", @"regexlib" );
    }
    
    static void CompileRegex( string exp, string assemblyName )
    {
      RegexCompilationInfo compInfo = 
        new RegexCompilationInfo( exp, RegexOptions.None, "CreditCardExp", "", true );
      AssemblyName assembly = new AssemblyName();
      assembly.Name = assemblyName;

      RegexCompilationInfo[] rciArray = { compInfo };

      Regex.CompileToAssembly( rciArray, assembly );
    }
  }
}

Comments

If you use regular expressions regularly, then you might find it advantageous to create a reusable library of the expressions you tend to use the most. The Regex class contains a method named CompileToAssembly  that allows you to compile several regular expressions into an assembly that you can then reference within other projects.

Inside the generated assembly, you will find a class for each regular expression you added, each contained within the namespace you specified in its RegexCompilationInfo object. Furthermore, each of these classes inherits from the Regex class so all the Regex methods are available for you to use. As you can see, creating a library of commonly used regular expressions allows you to reuse and share these expressions in a multitude of different projects. A change in a regular expression simply involves changing one assembly instead of each project that hard-coded the regular expression.
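CompileToAssembly is a .NET Framework feature, and newer .NET runtimes do not support it. Where it is unavailable, a rough alternative is a small class library that exposes shared Regex instances as static fields. This is a swapped-in technique rather than the one in Listing 3.12, and the RegexLibrary and LibraryDemo names below are hypothetical. The patterns are the ones used earlier in this chapter.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical shared library class; the names are illustrative.
public class RegexLibrary
{
  public static readonly Regex CreditCardExp =
    new Regex( @"(\d{4})-(\d{4})-(\d{4})-(\d{4})", RegexOptions.Compiled );

  public static readonly Regex PhoneExp =
    new Regex( @"^\(\d{3}\)\s\d{3}-\d{4}$", RegexOptions.Compiled );
}

class LibraryDemo
{
  static void Main()
  {
    Console.WriteLine( RegexLibrary.CreditCardExp.IsMatch( "1234-5678-9012-3456" ) );  // True
    Console.WriteLine( RegexLibrary.PhoneExp.IsMatch( "555-5555" ) );                  // False
  }
}
```

As with the compiled assembly, a change to a pattern is made in one place, and every project that references the library picks it up.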

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
