Authors: Mark Schmidt, Simon Robinson
Title: Microsoft Visual C# .NET 2003 Developer's Cookbook
Publisher: Sams
Published: Dec 12, 2003
ISBN: 0672325802
Price: US$ 44.99
Pages: 816
Chapter 3: Strings and Regular Expressions
3.0. Introduction
It would be very rare to create an entire application without using a single string. Strings help make sense of the seemingly random jumble of binary data that applications use to accomplish a task. They appear in all facets of application development from the smallest system utility to large enterprise services. Their value is so apparent that more and more connected systems are leaning toward string data within their communication protocols by utilizing the Extensible Markup Language (XML) rather than the more cumbersome traditional transmission of large binary data. This book uses strings extensively to examine the internal contents of variables and the results of program flow using Framework Class Library (FCL) methods such as Console.WriteLine and MessageBox.Show.
In this chapter, you will learn how to take advantage of the rich support for strings within the .NET Framework and the C# language. Coverage includes ways to manipulate string contents, programmatically inspect strings and their character attributes, and optimize performance when working with string objects. Furthermore, this chapter uncovers the power of regular expressions and how they allow you to effectively parse and manipulate string data. After reading this chapter, you will be able to use regular expressions in a variety of different situations where their value is apparent.
3.1. Creating and Using String Objects
You want to create and manipulate string data within your application.
Technique
The C# language, recognizing the importance of string data, provides a string keyword. Although string is a reference type, it behaves much like a value type because its contents cannot change once it is created. To create a string, declare a variable using the string keyword. You can use the assignment operator to initialize the variable with a string literal or with an already initialized string variable.
string string1 = "This is a string";
string string2 = string1;
To gain more control over string initialization, declare a variable using the System.String data type and create a new instance using the new keyword. The System.String class contains several constructors that you can use to initialize the string value. For instance, to create a new string that is a small subset of an existing string, use the overloaded constructor, which takes a character array and two integers denoting the beginning index and the number of characters from that index to copy:
using System;
class Class1
{
    [STAThread]
    static void Main(string[] args)
    {
        string string1 = "Field1, Field2";
        // Copy 6 characters starting at index 8, which yields "Field2".
        System.String string2 = new System.String( string1.ToCharArray(), 8, 6 );
        Console.WriteLine( string2 );
    }
}
Finally, if you know a string will be intensively manipulated, use the System.Text.StringBuilder class. Creating a variable of this data type is similar to using the System.String class, and it contains several constructors to initialize the internal string value. The key difference between a regular string object and a StringBuilder lies in performance. Whenever a string is manipulated in some manner, a new object has to be created, and the old object becomes eligible for collection by the garbage collector once nothing references it. For a string that undergoes several transformations, the performance hit associated with frequent object creation and collection can be great. The StringBuilder class, on the other hand, maintains an internal buffer, which expands to make room for more string data should the need arise, thereby reducing the number of object allocations.
Comments
There is no recommendation on whether you use the string keyword or the System.String class. The string keyword is simply an alias for this class, so it is all a matter of taste. We prefer using the string keyword, but this preference is purely aesthetic. For this reason, we simply refer to the System.String class as the string class or data type.
The string class contains many methods, both instance and static, for manipulating strings. If you want to compare strings, you can use the Compare method. If you are just testing for equality, you might want to use the overloaded equality operator (==). The Compare method, however, returns an integer instead of a Boolean value, denoting how the two strings differ. If the return value is 0, the strings are equal. If the return value is greater than 0, the first operand sorts after the second operand. If the return value is less than 0, the opposite is true. Listing 3.1 reports all three cases. By default, Compare performs a culture-sensitive comparison using the sort rules of the current culture, so the result does not necessarily follow the numeric code values of the characters. To compare strings character by character using those numeric values, use the CompareOrdinal method.
Listing 3.1 Using the Compare Method in the String Class
using System;
namespace _1_UsingStrings
{
class Class1
{
[STAThread]
static void Main(string[] args)
{
string string1 = "";
String string2 = "";
Console.Write( "Enter string 1: " );
string1 = Console.ReadLine();
Console.Write( "Enter string 2: " );
string2 = Console.ReadLine();
Console.WriteLine( "string1 is a {0}\nstring2 is a {1}",
string1.GetType().FullName, string2.GetType().FullName );
CompareStrings( string1, string2 );
}
public static void CompareStrings( string str1, string str2 )
{
int compare = String.Compare( str1, str2 );
if( compare == 0 )
{
Console.WriteLine( "The strings {0} and {1} are the same.\n",
str1, str2 );
}
else if( compare < 0 )
{
Console.WriteLine( "The string {0} is less than {1}",
str1, str2 );
}
else if( compare > 0 )
{
Console.WriteLine( "The string {0} is greater than {1}",
str1, str2 );
}
}
}
}
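The difference between a culture-sensitive and an ordinal comparison can be sketched as follows. The class name CompareSketch and the helper Sign are hypothetical names introduced only for this illustration; String.Compare and String.CompareOrdinal are the Framework methods discussed above. No output is shown for the culture-sensitive call because its sign depends on the culture settings of the machine running the code.

```csharp
using System;

class CompareSketch
{
    // Hypothetical helper: reduces a comparison result to -1, 0, or 1
    // so that only the sign matters.
    public static int Sign( int compareResult )
    {
        if( compareResult < 0 ) return -1;
        if( compareResult > 0 ) return 1;
        return 0;
    }

    static void Main()
    {
        // Ordinal: 'a' (0x61) has a higher code value than 'B' (0x42).
        Console.WriteLine( Sign( String.CompareOrdinal( "a", "B" ) ) );  // 1

        // Culture-sensitive: the sign depends on the current culture's sort rules.
        Console.WriteLine( Sign( String.Compare( "a", "B" ) ) );
    }
}
```

If a comparison must give the same answer on every machine, for example when comparing identifiers rather than text shown to a user, the ordinal form is usually the safer choice.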
As mentioned earlier, the string class contains both instance and static methods. Sometimes you have no choice about whether to use an instance or static method. However, a few of the instance methods have a static counterpart as well. Because a static method call is never dispatched virtually, the static version can be marginally faster. An example where you might see both instance and static versions appears in Listing 3.1. The string comparison uses the static Compare method. You can also do so using the nonstatic CompareTo method on one of the string instances passed in as parameters. In most cases, the performance gain is negligible, but if an application needs to call these methods repeatedly, you might want to consider using the static method over the nonstatic one.
The string class is immutable. Once a string is created, its contents cannot be changed. Methods within the string class that appear to modify a string actually leave the original instance untouched and return a new string object. It can be expensive to repeatedly call string methods if new objects are created and discarded continuously. To solve this, the .NET Framework contains a StringBuilder class within the System.Text namespace, which is explained later in this chapter.
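A minimal sketch of what immutability means in practice follows. The class ImmutabilitySketch and its two helper methods are hypothetical names used only for illustration; ToUpper and StringBuilder.Append are existing Framework methods.

```csharp
using System;
using System.Text;

class ImmutabilitySketch
{
    // Returns an upper-cased copy; the argument itself is never modified.
    public static string Shout( string text )
    {
        return text.ToUpper();
    }

    // Appends in place: the same StringBuilder instance is changed.
    public static void AppendBang( StringBuilder sb )
    {
        sb.Append( '!' );
    }

    static void Main()
    {
        string original = "hello";
        string shouted = Shout( original );
        Console.WriteLine( original );       // hello
        Console.WriteLine( shouted );        // HELLO

        StringBuilder sb = new StringBuilder( "hello" );
        AppendBang( sb );
        Console.WriteLine( sb.ToString() );  // hello!
    }
}
```

The string method hands back a second object, whereas the StringBuilder method changes the object it was given.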
3.2. Formatting Strings
Given one or more objects, you want to create a single formatted string representation.
Technique
You can format strings using numeric and picture formatting within String.Format or within any method that uses string-formatting techniques for parameters, such as Console.WriteLine.
Comments
The String class as well as a few other methods within the .NET Framework allow you to format strings to present them in a more ordered and readable format. Up to this point in the book, we used basic formatting when calling the Console.WriteLine method. The first parameter to Console.WriteLine is the format specifier string. This string controls how the remaining parameters to the method should appear when displayed. You use placeholders within the format string to insert the value of a variable. This placeholder uses the syntax {n}, where n is the zero-based index of the parameter in the list following the format specifier. Take the following line of code, for instance:
Console.WriteLine( "x={0}, y={1}, {0}+{1}={2}", x, y, x+y );
This line of code has three parameters following the format specifier string. You use placeholders within the format specification, and when this method is called, the appropriate substitutions are made. Although you can do the same thing using string concatenation, the resultant line of code is slightly obfuscated:
string s = "x=" + x + ",y=" + y + ", " + x + "+" + y + "=" + (x+y);
Console.WriteLine( s );
You can further refine the format by applying format attributes on the placeholders themselves. These additional attributes follow the parameter index value and are separated from that index with a : character. There are two types of special formatting available. The first is numeric formatting, which lets you format a numeric parameter into one of nine different numeric formats, as shown in Table 3.1. The format of these specifiers, using the currency format as an example, is Cxx, where xx is an optional precision from 0 to 99. What the precision controls depends on the specifier: for the currency and fixed-point formats it is the number of decimal places, whereas for the decimal and hexadecimal formats it is the minimum number of digits. Listing 3.2 shows how to display an array of numbers, cast to integers, in hexadecimal format. Notice how you can change the case of the hexadecimal digits A through F by using an uppercase or lowercase format specifier.
Table 3.1 Numeric Formatting Specifiers
Character | Format | Description
C or c | Currency | Culturally aware currency format.
D or d | Decimal | Only supports integral numbers. Displays a string using decimal digits preceded by a minus sign if negative.
E or e | Exponential/scientific notation | Displays numbers in the form ±d.ddddddE±dd where d is a decimal digit.
F or f | Fixed point | Displays a series of decimal digits with a decimal point and additional digits.
G or g | General format | Displays either as a fixed-point or scientific notation based on the size of the number.
N or n | Number format | Similar to fixed point but uses a separator character (such as ,) for groups of digits.
P or p | Percentage | Multiplies the number by 100 and displays with a percent symbol.
R or r | Roundtrip | Formats a floating-point number so that it can be successfully converted back to its original value.
X or x | Hexadecimal | Displays an integral number using the base-16 number system.
Listing 3.2 Specifying a Different Numeric Format by Adding Format Specifiers on a Parameter Placeholder
using System;
namespace _2_Formatting
{
class Class1
{
[STAThread]
static void Main(string[] args)
{
double[] numArray = {2, 5, 4.5, 45.43, 200000};
Console.WriteLine( "\n\nHex (lower)\n-----------" );
foreach( double num in numArray )
{
Console.Write( "0x{0:x}\t", (int) num );
}
Console.WriteLine( "\n\nHex (upper)\n-----------" );
foreach( double num in numArray )
{
Console.Write( "0x{0:X}\t", (int) num );
}
}
}
}
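As a rough illustration of how a precision changes the output of several specifiers from Table 3.1, consider the following sketch. The class NumericFormatSketch and the helper Fmt are hypothetical names, and the invariant culture is passed explicitly (an assumption made here so that the separators do not vary from machine to machine).

```csharp
using System;
using System.Globalization;

class NumericFormatSketch
{
    // Hypothetical helper: formats one value with the invariant culture
    // so that decimal and group separators are predictable.
    public static string Fmt( string format, object value )
    {
        return String.Format( CultureInfo.InvariantCulture, format, value );
    }

    static void Main()
    {
        Console.WriteLine( Fmt( "{0:x}", 255 ) );       // ff
        Console.WriteLine( Fmt( "{0:X4}", 255 ) );      // 00FF
        Console.WriteLine( Fmt( "{0:D5}", 42 ) );       // 00042
        Console.WriteLine( Fmt( "{0:F2}", 3.14159 ) );  // 3.14
        Console.WriteLine( Fmt( "{0:N0}", 1234567 ) );  // 1,234,567
    }
}
```

For X and D the precision pads the result with leading zeros, while for F and N it sets the number of decimal places.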
Another type of formatting is picture formatting. Picture formatting allows you to create a custom format specifier using various symbols within the format specifier string. Table 3.2 lists the available picture format characters. Listing 3.3 also shows how to create a custom format specifier. In that code, the digits of the input number are extracted and displayed using a combination of digit placeholders and a decimal-point specifier. Furthermore, you can see that you are free to add characters not listed in the table. This freedom allows you to add literal characters intermixed with the digits.
Table 3.2 Picture Formatting Specifiers
Character | Name | Description
0 | Zero placeholder | Copies a digit to the result string if a digit is at the position of the 0. If no digit is present, a 0 is displayed.
# | Display digit placeholder | Copies a digit to the result string if a digit appears at the position of the #. If no digit is present, nothing is displayed.
. | Decimal point | Represents the location of the decimal point in the resultant string.
, | Group separator and number scaling | Inserts thousands separators if placed between two placeholders or scales a number down by 1,000 per , character when placed directly to the left of a decimal point.
% | Percent | Multiplies a number by 100 and inserts a % symbol.
E±0, e±0 | Exponential notation | Displays the number in exponential notation using the number of 0s as a placeholder for the exponent value.
\ | Escape character | Causes the character that follows to be displayed as a literal instead of being interpreted as a formatting character. In a regular C# string literal, the backslash itself must be written as \\.
; | Section separator | Separates positive, negative, and zero numbers in the format string in which you can apply different formatting rules based on the sign of the original number.
Listing 3.3 shows how custom formatting can separate a number at its decimal point. Using a foreach loop, each value is printed using two placeholders. The picture in the first placeholder begins by outputting the integer portion of the value:
$#,#
Next, after a literal + sign, the decimal portion is written. If the value does not explicitly define a decimal portion, zeroes are written instead. The part of the picture that outputs the decimal value is
$.#0;
Finally, the second placeholder displays the entire value to two decimal places using the following format string:
{0:$#,#.00}
Listing 3.3 Using Picture Format Specifiers to Create Special Formats
using System;
namespace _2_Formatting
{
class Class1
{
[STAThread]
static void Main(string[] args)
{
double[] numArray = {2, 5, 4.5, 45.43, 200000};
Console.WriteLine( "\n\nCustom\n------" );
foreach( double num in numArray )
{
Console.WriteLine( "{0:$#,# + $.#0;} = {0:$#,#.00}", num );
}
}
}
}
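The picture characters from Table 3.2 can also be tried in isolation, as in this sketch. PictureFormatSketch and Fmt are hypothetical names, and the invariant culture is again an assumption made to keep the separators fixed. The literal text in the third section is wrapped in single quotes so that none of its letters can be mistaken for a format character.

```csharp
using System;
using System.Globalization;

class PictureFormatSketch
{
    // Hypothetical helper: formats one value with the invariant culture.
    public static string Fmt( string format, object value )
    {
        return String.Format( CultureInfo.InvariantCulture, format, value );
    }

    static void Main()
    {
        // Group separator plus two forced decimal places.
        Console.WriteLine( Fmt( "{0:#,#.00}", 1234.5 ) );            // 1,234.50

        // Zero placeholders pad the integer portion.
        Console.WriteLine( Fmt( "{0:000.0}", 4.5 ) );                // 004.5

        // The percent character multiplies by 100 before formatting.
        Console.WriteLine( Fmt( "{0:0.0%}", 0.256 ) );               // 25.6%

        // Three sections: positive; negative; zero.
        Console.WriteLine( Fmt( "{0:0.0;(0.0);'zero'}", 5.5 ) );     // 5.5
        Console.WriteLine( Fmt( "{0:0.0;(0.0);'zero'}", -5.5 ) );    // (5.5)
        Console.WriteLine( Fmt( "{0:0.0;(0.0);'zero'}", 0.0 ) );     // zero
    }
}
```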
3.3. Accessing Individual String Characters
You want to process individual characters within a string.
Technique
Use the index operator ([]) by specifying the zero-based index of the character within the string that you want to extract. Furthermore, you can also use the foreach enumerator on the string using a char structure as the enumeration data type.
Comments
The string class is really a collection of objects. These objects are individual characters. You can access each character using the same methods you would use to access an object in most other collections (which are covered in the next chapter).
You use an indexer to specify which object in a collection you want to retrieve. In C#, the first object begins at the 0 index of the string. The objects are individual characters whose data type is System.Char, which is aliased with the char keyword. The indexer for the string class, however, can only read a character and cannot set the value of a character at that position. Because a string is immutable, you cannot change the internal array of characters unless you create and return a new string. If you need the ability to index a string to set individual characters, use a StringBuilder object.
Listing 3.4 shows how to access the characters in a string. One thing to point out is that because the string also implements the IEnumerable interface, you can use the foreach control structure to enumerate through the string.
Listing 3.4 Accessing Characters Using Indexers and Enumeration
using System;
using System.Text;
namespace _3_Characters
{
class Class1
{
[STAThread]
static void Main(string[] args)
{
string str = "abcdefghijklmnopqrstuvwxyz";
str = ReverseString( str );
Console.WriteLine( str );
str = ReverseStringEnum( str );
Console.WriteLine( str );
}
static string ReverseString( string strIn )
{
StringBuilder sb = new StringBuilder(strIn.Length);
for( int i = 0; i < strIn.Length; ++i )
{
sb.Append( strIn[(strIn.Length-1)-i] );
}
return sb.ToString();
}
static string ReverseStringEnum( string strIn )
{
StringBuilder sb = new StringBuilder( strIn.Length );
foreach( char ch in strIn )
{
sb.Insert( 0, ch );
}
return sb.ToString();
}
}
}
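Setting a character by index, which the string indexer does not allow, might look like the following sketch. IndexerSketch and ReplaceAt are hypothetical names chosen for this example; the writable indexer on StringBuilder is part of the Framework.

```csharp
using System;
using System.Text;

class IndexerSketch
{
    // Hypothetical helper: returns a copy of text with the character
    // at the given index replaced.
    public static string ReplaceAt( string text, int index, char replacement )
    {
        StringBuilder sb = new StringBuilder( text );
        sb[index] = replacement;   // the StringBuilder indexer is writable
        return sb.ToString();
    }

    static void Main()
    {
        string word = "cat";
        char first = word[0];      // reading through the string indexer is fine
        // word[0] = 'b';          // would not compile: the string indexer is read-only
        Console.WriteLine( first );                      // c
        Console.WriteLine( ReplaceAt( word, 0, 'b' ) );  // bat
        Console.WriteLine( word );                       // cat
    }
}
```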
3.4. Analyzing Character Attributes
You want to evaluate the individual characters in a string to determine a character's attributes.
Technique
The System.Char structure contains several static functions that let you test individual characters. You can test whether a character is a digit, letter, or punctuation symbol or whether the character is lowercase or uppercase.
Comments
One of the hardest issues to handle when writing software is making sure users input valid data. You can use many different methods, such as restricting input to only digits, but ultimately, you always need an underlying validating test of the input data.
You can use the System.Char structure to perform a variety of text-validation procedures. Listing 3.5 demonstrates validating user input as well as inspecting the characteristics of a character. It begins by displaying a menu and then waiting for user input using the Console.ReadLine method. Once a user enters a command, you make a check using the method ValidateMainMenuInput. This method rejects the input if its first character is a digit, punctuation mark, or symbol, or if it is not one of the menu commands N or Q. If the validation passes, the string is passed to a method that inspects each character in the input string. This method simply enumerates through all the characters in the input string and prints descriptive messages based on the characteristics. Not all of the System.Char inspection methods are used in Listing 3.5. Table 3.3 lists the inspection methods and their functionality. The results of running the application in Listing 3.5 appear in Figure 3.1.
Listing 3.5 Using the Static Methods in System.Char to Inspect the Details of a Single Character
using System;
namespace _4_CharAttributes
{
class Class1
{
[STAThread]
static void Main(string[] args)
{
char cmd = 'x';
string input;
do
{
DisplayMainMenu();
input = Console.ReadLine();
if( (input == "" ) ||
ValidateMainMenuInput( Char.ToUpper(input[0]) ) == 0 )
{
Console.WriteLine( "Invalid command!" );
}
else
{
cmd = Char.ToUpper(input[0]);
switch( cmd )
{
case 'Q':
{
break;
}
case 'N':
{
Console.Write( "Enter a phrase to inspect: " );
input = Console.ReadLine();
InspectPhrase( input );
break;
}
}
}
} while ( cmd != 'Q' );
}
private static void InspectPhrase( string input )
{
foreach( char ch in input )
{
Console.Write( ch + " - ");
if( Char.IsDigit(ch) )
Console.Write( "IsDigit " );
if( Char.IsLetter(ch) )
{
Console.Write( "IsLetter " );
Console.Write( "(lowercase={0}, uppercase={1})",
Char.ToLower(ch), Char.ToUpper(ch));
}
if( Char.IsPunctuation(ch) )
Console.Write( "IsPunctuation " );
if( Char.IsWhiteSpace(ch) )
Console.Write( "IsWhitespace" );
Console.Write("\n");
}
}
private static int ValidateMainMenuInput( char input )
{
if( Char.IsDigit( input ) == true )
return 0;
else if ( Char.IsPunctuation( input ) )
return 0;
else if( Char.IsSymbol( input ))
return 0;
else if( input != 'N' && input != 'Q' )
return 0;
return (int) input;
}
private static void DisplayMainMenu()
{
Console.WriteLine( "\nPhrase Inspector\n-------------------" );
Console.WriteLine( "N)ew Phrase" );
Console.WriteLine( "Q)uit\n" );
Console.Write( ">> " );
}
}
}
Table 3.3 System.Char Inspection Methods
Name | Description
IsControl | Denotes a control character such as a tab or carriage return.
IsDigit | Indicates a single decimal digit.
IsLetter | Used for alphabetic characters.
IsLetterOrDigit | Returns true if the character is a letter or a digit.
IsLower | Used to determine whether a character is lowercase.
IsNumber | Tests whether a character belongs to a numeric Unicode category, which includes decimal digits as well as characters such as fractions and Roman numerals.
IsPunctuation | Denotes whether a character is a punctuation symbol.
IsSeparator | Denotes a character used to separate strings. An example is the space character.
IsSurrogate | Indicates whether a character is a surrogate, one of the two 16-bit values that together form a surrogate pair representing a character outside the 16-bit range.
IsSymbol | Used for symbolic characters such as $ or +.
IsUpper | Used to determine whether a character is uppercase.
IsWhiteSpace | Indicates a character classified as whitespace such as a space character, tab, or carriage return.
Figure 3.1 Use the static methods in the System.Char structure to inspect character attributes.
The System.Char structure is designed to work with a single Unicode character, stored as a 16-bit value. Because a char occupies 2 bytes, its range is from 0 to 0xFFFF. For portability reasons in future systems, you can always check the upper bound of a char by using the MaxValue constant declared in the System.Char structure. One thing to keep in mind when working with characters is to avoid the confusion of mixing char types with integer types. Characters have an ordinal value, which is an integer value used as a lookup into a table of symbols. One example of a table is the ASCII table, which contains 128 characters and includes the digits 0 through 9, letters, punctuation symbols, and control characters. The confusion lies in the fact that the character 6, for instance, has an ordinal char value of 0x36. Therefore, the line of code meant to initialize a character to the number 6
char ch = (char) 6;
is wrong because the actual character in this instance is ^F, the ACK control character used in modem handshaking protocols. Displaying this value in the console would not provide the 6 that you were looking for. You could have chosen two different methods to initialize the variable. The first way is
char ch = (char) 0x36;
which produces the desired result and prints the number 6 to the console if passed to the Console.Write method. However, unless you have the ASCII table memorized, this procedure can be cumbersome. To initialize a char variable, simply place the value between single quotes:
char ch = '6';
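The relationship between a digit character and its ordinal value can be checked directly, as in this sketch. CharOrdinalSketch and DigitValue are hypothetical names, and DigitValue assumes that its argument is one of the characters '0' through '9'.

```csharp
using System;

class CharOrdinalSketch
{
    // Hypothetical helper: converts a decimal digit character to its
    // numeric value. Assumes the argument is in the range '0'..'9'.
    public static int DigitValue( char digit )
    {
        return digit - '0';
    }

    static void Main()
    {
        char six = '6';
        Console.WriteLine( (int) six == 0x36 );     // True
        Console.WriteLine( (char) 0x36 == six );    // True
        Console.WriteLine( (char) 6 == six );       // False: (char) 6 is the ACK control character
        Console.WriteLine( DigitValue( six ) );     // 6
        Console.WriteLine( (int) Char.MaxValue );   // 65535
    }
}
```

Subtracting '0' works because the digit characters occupy consecutive ordinal values starting at 0x30.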
3.5. Case-Insensitive String Comparison
You want to perform case-insensitive string comparison on two strings.
Technique
Use the overloaded Compare method in the System.String class, which accepts a Boolean value, ignoreCase, as the last parameter. This parameter specifies whether the comparison should be case insensitive (true) or case sensitive (false). To compare single characters, convert them to uppercase or lowercase, using ToUpper or ToLower, and then perform the comparison.
Comments
Validating user input requires a lot of forethought into the possible values a user can enter. Making sure you cover the range of possible values can be a daunting task, and you might ultimately run into human-computer interaction issues by severely limiting what a user can enter. Case-sensitivity issues increase the possible range of values, leading to greater security with respect to such things as passwords, but this security is usually at the expense of a user's frustration when she forgets whether a character is capitalized. As with many other programming problems, you must weigh the pros and cons.
To perform a case-insensitive comparison, you can use one of the many overloaded Compare methods within the System.String class. The methods that allow you to ignore case issues use a Boolean value as the last parameter in the method. This parameter is named ignoreCase, and when you set it to true, you make a case-insensitive comparison, as demonstrated in Listing 3.6.
Listing 3.6 Performing a Case-Insensitive String Comparison
using System;
namespace _5_CaseComparison
{
class Class1
{
[STAThread]
static void Main(string[] args)
{
string str1 = "This Is A String.";
string str2 = "This is a string.";
Console.WriteLine( "Case sensitive comparison of" +
" str1 and str2 = {0}", String.Compare( str1, str2 ));
Console.WriteLine( "Case insensitive comparison of" +
" str1 and str2 = {0}", String.Compare( str1, str2, true ));
}
}
}
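Both techniques, the ignoreCase overload for strings and case conversion for single characters, can be wrapped in small helpers along these lines. IgnoreCaseSketch, SameText, and SameChar are hypothetical names for this sketch. Both calls use the current culture, so the results shown assume plain English letters.

```csharp
using System;

class IgnoreCaseSketch
{
    // Hypothetical helper: true when two strings match, ignoring case.
    public static bool SameText( string a, string b )
    {
        return String.Compare( a, b, true ) == 0;
    }

    // Hypothetical helper: true when two characters match, ignoring case.
    public static bool SameChar( char a, char b )
    {
        return Char.ToUpper( a ) == Char.ToUpper( b );
    }

    static void Main()
    {
        Console.WriteLine( SameText( "Hello World", "hello world" ) );  // True
        Console.WriteLine( SameText( "Hello World", "hello word" ) );   // False
        Console.WriteLine( SameChar( 'q', 'Q' ) );                      // True
        Console.WriteLine( SameChar( 'q', 'n' ) );                      // False
    }
}
```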
3.6. Working with Substrings
You need to change or extract a specific portion of a string.
Technique
To copy a portion of a string into a new string, use the Substring method within the System.String class. You call this method using the string object instance of the source string:
string source = "ABCD1234WXYZ";
string dest = source.Substring( 4, 4 );
Console.WriteLine( "{0}\n", dest );
To copy a substring into an already existing character array, use the CopyTo method. To assign a character array to an existing string object, create a new instance of the string using the new keyword, passing the character array as a parameter to the string constructor as shown in the following code, whose output appears in Figure 3.2:
string source = "ABCD";
char [] dest = { '1', '2', '3', '4', '5', '6', '7', '8' };
Console.Write( "Char array before = " );
Console.WriteLine( dest );
source.CopyTo( 0, dest, 4, source.Length );
Console.Write( "Char array after = " );
Console.WriteLine( dest );
source = new String( dest );
Console.WriteLine( "New source = {0}\n", source );
Figure 3.2 Use the CopyTo method to copy a substring into an existing character array.
If you need to remove a substring within a string and replace it with a different substring, use the Replace method. This method accepts two parameters, the substring to replace and the string to replace it with:
string replaceStr = "1234";
string dest = "ABCDEFGHWXYZ";
dest = dest.Replace( "EFGH", replaceStr );
Console.WriteLine( dest );
To extract an array of substrings that are separated from each other by one or more delimiters, use the Split method. This method uses a character array of delimiter characters and returns a string array of each substring within the original string as shown in the following code, whose output appears in Figure 3.3. You can optionally supply an integer specifying the maximum number of substrings to return:
char delim = '\\';
string filePath = "C:\\Windows\\Temp";
string [] directories = null;
directories = filePath.Split( delim );
foreach (string directory in directories)
{
Console.WriteLine("{0}", directory);
}
Figure 3.3 You can use the Split method in the System.String class to place delimited substrings into a string array.
Comments
Parsing strings is not for the faint of heart. However, the job becomes easier if you have a rich set of methods that allow you to perform all types of operations on strings. Substrings are the goal of a majority of these operations, and the string class within the .NET Framework contains many methods that are designed to extract or change just a portion of a string.
The Substring method extracts a portion of a string and places it into a new string object. You have two options with this method. If you pass a single integer, the Substring method extracts the substring that starts at that index and continues until it reaches the end of the string. Keep in mind that string indices are 0 based, so the first character within the string has an index of 0. The second Substring overload accepts an additional parameter that denotes the number of characters to extract, not an ending index. It lets you extract a portion from the middle of the string.
You can create a new character array from a string by using the ToCharArray method of the string class. Furthermore, you can extract a substring from the string and place it into a character array by using the CopyTo method. The difference between these two methods is that the character array used with the CopyTo method must be an already instantiated array. Whereas ToCharArray returns a new character array, the CopyTo method expects an existing character array as a parameter to the method. Furthermore, although methods exist to extract character arrays from a string, there is no instance method available to assign a character array to a string. To do this, you must create a new string object using the new keyword, rather than assigning a string literal, and pass the character array as a parameter to the string constructor.
Using the Replace method is a powerful way to alter the contents of a string. This method searches for all occurrences of a specified substring within a string and returns a new string in which each occurrence is replaced with a different substring. Additionally, the substring you want to replace does not have to be the same length as the string you are replacing it with. If you recall the number of times you have performed a search and replace in any application, you can see the possible advantages of this method.
One other powerful method is Split. By passing a character array consisting of delimiter characters, you can split a string into a group of substrings and place them into a string array. By passing an additional integer parameter, you can also control how many substrings to extract from the source string. Referring to the code example earlier demonstrating the Split method, you can split a string representing a directory path into individual directory names by passing the \ character as the delimiter. You are not, however, confined to using a single delimiter. If you pass a character array consisting of several delimiters, the Split method extracts substrings based on any of the delimiters that it encounters.
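The following sketch exercises both Substring overloads and Split with several delimiters. SubstringSketch and Slice are hypothetical names; Slice is included only to show how an end index can be converted into the length that Substring expects.

```csharp
using System;

class SubstringSketch
{
    // Hypothetical helper: extracts the characters from startIndex up to,
    // but not including, endIndex by converting the end index to a length.
    public static string Slice( string text, int startIndex, int endIndex )
    {
        return text.Substring( startIndex, endIndex - startIndex );
    }

    static void Main()
    {
        string source = "ABCD1234WXYZ";
        Console.WriteLine( source.Substring( 4, 4 ) );  // 1234
        Console.WriteLine( Slice( source, 4, 8 ) );     // 1234
        Console.WriteLine( source.Substring( 8 ) );     // WXYZ

        // Split on either delimiter, then limit the result to two substrings.
        char[] delims = { ',', ';' };
        string[] all = "a,b;c".Split( delims );
        string[] firstTwo = "a,b;c".Split( delims, 2 );
        Console.WriteLine( all.Length );                // 3
        Console.WriteLine( firstTwo[1] );               // b;c
    }
}
```

When a count is supplied, the final element holds the unsplit remainder of the source string.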
3.7. Using Verbatim String Syntax
You want to represent a path to a file using a string without using escape characters for path separators.
Technique
When assigning a literal string to a string object, preface the string with the @ symbol. It turns off escape-character processing (the only exception is a double-quote character, which is written as two consecutive double quotes), so there is no need to escape path separators:
string nonVerbatim = "C:\\Windows\\Temp";
string verbatim = @"C:\Windows\Temp";
Comments
Forgetting to escape path separators is a frequent source of compiler errors. Although hard-coding path strings is a common programming faux pas, you might overlook that rule when testing an application. Visual C# .NET provides verbatim string syntax to alleviate the frustration of having to escape every path separator within a file path string, which can be especially cumbersome for long paths.
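A short sketch confirms that the two notations produce identical strings and shows how a double quote is written inside a verbatim string. VerbatimSketch and its three methods are hypothetical names used only here.

```csharp
using System;

class VerbatimSketch
{
    // The same path written with escapes and as a verbatim string.
    public static string EscapedPath()   { return "C:\\Windows\\Temp"; }
    public static string VerbatimPath()  { return @"C:\Windows\Temp"; }

    // Inside a verbatim string, a double quote is written as two double quotes.
    public static string VerbatimQuote() { return @"Say ""hi"""; }

    static void Main()
    {
        Console.WriteLine( EscapedPath() == VerbatimPath() );  // True
        Console.WriteLine( VerbatimQuote() );                  // Say "hi"

        // No escape processing: the backslash and the n are two separate characters.
        Console.WriteLine( @"a\nb".Length );                   // 4
    }
}
```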
3.8. Choosing Between Constant and Mutable Strings
You want to choose the correct string data type to best fit your current application design.
Technique
If you know a string's value will not change often, use a string object, which is immutable. If you need a mutable string, one that can change its value without having to allocate a new object, use a StringBuilder.
Comments
Using a regular string object is best when you know the string will not change or will only change slightly. This change includes the whole gamut of string operations that change the value of the object itself, such as concatenation, insertion, replacement, or removal of characters. The Common Language Runtime (CLR) can use certain properties of strings to optimize performance. If the CLR can determine that two string objects are the same, it can share the memory that these string objects occupy. These strings are then known as interned strings. The CLR contains an intern pool, which is a lookup table of string instances. Strings are automatically interned if they are assigned to a literal string within code. However, you can also manually place a string within the intern pool by using the Intern method. To test whether a string is interned, use the IsInterned method, as shown in Listing 3.7.
Listing 3.7 Interning a String by Using the Intern Method
using System;
namespace _7_StringBuilder
{
class Class1
{
[STAThread]
static void Main(string[] args)
{
string sLiteral = "Automatically Interned";
string sNotInterned = "Not " + sLiteral;
TestInterned( sLiteral );
TestInterned( sNotInterned );
String.Intern( sNotInterned );
TestInterned( sNotInterned );
}
static void TestInterned( string str )
{
if( String.IsInterned( str ) != null )
{
Console.WriteLine( "The string \"{0}\" is interned.", str );
}
else
{
Console.WriteLine( "The string \"{0}\" is not interned.", str );
}
}
}
}
A StringBuilder behaves similarly to a regular string object and also contains similar method calls. However, there are no static methods because the StringBuilder class is designed to work on string instances. Method calls on an instance of a StringBuilder object change the internal string of that object, as shown in Listing 3.8. A StringBuilder achieves its mutability by creating a buffer that is large enough to contain a string value plus additional memory should the string need to grow.
Listing 3.8 Manipulating an Internal String Buffer Instead of Returning New String Objects
using System;
using System.Text;
namespace _7_StringBuilder
{
    class Class1
    {
        [STAThread]
        static void Main(string[] args)
        {
            string string1 = "";
            string string2 = "";
            Console.Write( "Enter string 1: " );
            string1 = Console.ReadLine();
            Console.Write( "Enter string 2: " );
            string2 = Console.ReadLine();
            BuildStrings( string1, string2 );
        }
        public static void BuildStrings( string str1, string str2 )
        {
            StringBuilder sb = new StringBuilder( str1 + str2 );
            sb.Insert( str1.Length, " is the first string.\n" );
            sb.Insert( sb.Length, " is the second string.\n" );
            Console.WriteLine( sb );
        }
    }
}
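To make the contrast with a regular string concrete, the following sketch compares what Insert does on each type. The class and method names are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Text;

public static class MutabilitySketch
{
    // String.Insert leaves the original untouched and returns a new object.
    public static bool StringInsertReturnsNewObject(string original)
    {
        string changed = original.Insert(0, ">> ");
        return !Object.ReferenceEquals(original, changed);
    }

    // StringBuilder.Insert modifies the internal buffer and returns the
    // same StringBuilder instance, which allows calls to be chained.
    public static bool BuilderInsertReturnsSameObject(string original)
    {
        StringBuilder sb = new StringBuilder(original);
        StringBuilder result = sb.Insert(0, ">> ");
        return Object.ReferenceEquals(sb, result)
            && sb.ToString() == ">> " + original;
    }

    public static void Main()
    {
        Console.WriteLine(StringInsertReturnsNewObject("text"));
        Console.WriteLine(BuilderInsertReturnsSameObject("text"));
    }
}
```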
3.9. Optimizing StringBuilder Performance
Knowing that a StringBuilder object can suffer more of a performance hit than a regular string object, you want to optimize the StringBuilder object to minimize performance issues.
Technique
Use the EnsureCapacity method in the StringBuilder class, passing an integer that signifies the length of the longest string you may store in this buffer.
Comments
The StringBuilder class contains methods that allow you to expand the memory of the internal buffer based on the size of the string you may store. If you reserve enough memory up front, the StringBuilder won't have to repeatedly allocate new memory for the internal buffer as your string grows. In other words, if you attempt to place a string that is longer than the internal buffer of the StringBuilder class can hold, the class has to allocate additional memory to accept the new data. If you continuously add strings that increase in size from the last input string, the StringBuilder class has to allocate a new buffer each time, which it does internally by calling the GetStringForStringBuilder method defined in the System.String class. This method ultimately calls the unmanaged method FastAllocateString. By giving the StringBuilder class a hint using the EnsureCapacity method, you can alleviate some of this continual memory reallocation, thereby optimizing StringBuilder performance by reducing the number of memory allocations needed to store a string value.
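As a rough illustration of the hint described above, the following sketch computes the final length before appending anything and reserves that much space once. The class and method names are hypothetical, and whether the saving matters depends on how many strings you append.

```csharp
using System;
using System.Text;

public static class CapacitySketch
{
    // Joins the lines, each followed by a newline, after reserving
    // the whole buffer with a single EnsureCapacity call.
    public static string JoinLines(string[] lines)
    {
        // Work out how many characters the result needs.
        int total = 0;
        foreach (string line in lines)
        {
            total += line.Length + 1; // +1 for the newline
        }

        StringBuilder sb = new StringBuilder();
        sb.EnsureCapacity(total); // one allocation up front

        foreach (string line in lines)
        {
            sb.Append(line).Append('\n');
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.Write(JoinLines(new string[] { "first", "second", "third" }));
    }
}
```

If you know the size when you create the object, the StringBuilder constructor overload that takes an initial capacity achieves the same effect.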
3.10. Understanding Basic Regular Expression Syntax
You want to create a regular expression.
Technique
Regular expressions consist of a series of characters and quantifiers on those characters. The characters themselves can be literal or can be denoted by using character classes, such as \d, which denotes a digit character class, or \S, which denotes any nonwhitespace character.
Table 3.4 Regular Expression Single Character Classes
Class | Description
\d | Any digit
\D | Any nondigit
\w | Any word character
\W | Any nonword character
\s | Any whitespace character
\S | Any nonwhitespace character
In addition to the single character classes, you can also specify a range or set of characters using ranged and set character classes. This ability allows you to narrow the search for a specified character by limiting characters within a specified range or within a defined set.
Table 3.5 Ranged and Set Character Classes
Format | Description
. | Any character except newline
\p{uc} | Any character within the Unicode character category uc
[abcxyz] | Any literal character in the set
\P{uc} | Any character not within the Unicode character category uc
[^abcxyz] | Any character not in the set of literal characters
Quantifiers work on character classes to expand the number of characters the character classes should match. You can specify, for instance, a wildcard character on a character class, which means 0 or more characters within that class. You can also specify a set number of matches of a class that should occur by using an integer within braces following the character class designation.
Table 3.6 Character Class Quantifiers
Format | Description
* | 0 or more characters
+ | 1 or more characters
? | 0 or 1 characters
{n} | Exactly n characters
{n,} | At least n characters
{n,m} | At least n but no more than m characters
You can also specify where a certain regular expression should start within a string. Positional assertions allow you to, for instance, match a certain expression as long as it occurs at the beginning or ending of a string. Furthermore, you can create a regular expression that operates on a set of words within a string by using a positional assertion that continues matching on each subsequent word separated by any nonalphanumeric character.
Table 3.7 Positional (Atomic Zero-Width) Assertions
Format | Description
^ | Beginning of the string, or beginning of a line in multiline mode
\z | End of the string only
$ | End of the string or before a final newline character, or end of a line in multiline mode
\G | Continues where the last match left off
\A | Beginning of the string
\b | At a word boundary (between a word character and a nonword character)
\Z | End of the string or before a final newline character
\B | Not at a word boundary
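The end-of-string assertions are easy to confuse, so a short sketch may help. It tests the same input, which ends in a newline, against \Z, \z, and $, and then uses \b to match a whole word. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorSketch
{
    // Tests whether the literal word, followed by the given assertion,
    // matches somewhere in the input.
    public static bool EndsWith(string input, string word, string assertion)
    {
        return Regex.IsMatch(input, Regex.Escape(word) + assertion);
    }

    // Counts occurrences of the word that stand alone between word boundaries.
    public static int CountWholeWord(string input, string word)
    {
        return Regex.Matches(input, @"\b" + Regex.Escape(word) + @"\b").Count;
    }

    public static void Main()
    {
        string input = "last line\n";
        Console.WriteLine(EndsWith(input, "line", @"\Z")); // True: a final newline is allowed
        Console.WriteLine(EndsWith(input, "line", @"\z")); // False: must be the very end
        Console.WriteLine(EndsWith(input, "line", "$"));   // True: same as \Z by default
        Console.WriteLine(CountWholeWord("cat concat", "cat")); // 1
    }
}
```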
Comments
Regular expressions use a variety of characters both symbolic and literal to designate how a particular string of text should be parsed. The act of parsing a string is known as matching, and when applied to a regular expression, the match will be either true or false. In other words, when you use a regular expression to match a series of characters, the match will either succeed or fail. As you can see, this process has powerful applicability in the area of input validation.
You build regular expressions using a series of character classes and quantifiers on those character classes as well as a few miscellaneous regular-expression constructs. You use character classes to match a single character based either on what type of character it is, such as a digit or letter, or whether it belongs within a specified range or set of characters (as shown in Table 3.4). Using this information, you can create a series of character classes to match a certain string of text. For instance, if you want to specify a phone number using character classes, you can use the following regular expression:
\(\d\d\d\)\s\d\d\d-\d\d\d\d
This expression begins by first escaping the left parenthesis. You must escape it because parentheses are used for grouping expressions. Next you can see three digits representing a phone number's area code followed by the closing parenthesis. You use a \s to denote a whitespace character. The remainder of the regular expression contains the remaining digits of the phone number.
In addition to the single character classes, you can also use ranged and set character classes. They give you fine-grained control over exactly the type of characters the regular expression should match. For instance, if you want to match any character as long as it is a vowel, use the following expression:
[aeiou]
This line means that a character should match one of the literal characters within that set of characters. An even more specialized form of single character class is the Unicode character category. Unicode categories are similar to some of the character-attribute inspection methods shown earlier in this chapter. For instance, you can use Unicode categories to match on uppercase or lowercase characters. Other categories include punctuation characters, currency symbols, and math symbols, to name a few. You can easily find the full list of Unicode categories in MSDN under the topic "Unicode Categories Enumeration."
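As a small sketch of both ideas, the following counts the characters in a string that match the vowel set and the characters that fall in the Unicode uppercase-letter category, Lu. The class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharacterClassSketch
{
    // Counts how many times the pattern matches within the input.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        string text = "Regular Expressions";

        // Set character class: lowercase vowels only.
        Console.WriteLine(CountMatches(text, "[aeiou]"));  // 6

        // Unicode category: Lu is the uppercase-letter category.
        Console.WriteLine(CountMatches(text, @"\p{Lu}"));  // 2
    }
}
```

The set is case sensitive, so the uppercase E in "Expressions" is not counted as a vowel unless you add it to the set or use the RegexOptions.IgnoreCase option.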
You can simplify the phone-number expression, although it's completely valid, by using quantifiers. A quantifier specifies additional information about the character, character class, or expression to which it applies. Some quantifiers include wildcards such as *, which means 0 or more occurrences, and ?, which means only 0 or 1 occurrences of a pattern. You can also use braces containing an integer to specify how many characters within a given character class to match. Using this quantifier in the phone-number expression, you can specify that the phone number should contain three digits for the area code followed by three digits and four digits separated by a dash:
\(\d{3}\)\s\d{3}-\d{4}
Although the regular expression itself isn't that complicated, you can still see that using quantifiers can simplify regular-expression creation. In addition to character classes and quantifiers, you can also use positional information within a regular expression. For instance, you can specify that given an input string, the regular expression should operate at the beginning of the string. You express it using the ^ character. Likewise, you can also denote the end of a string using the $ symbol. Take note that this doesn't mean start at the end of the string and attempt to make a match because that obviously seems counterintuitive; no characters exist at the end of the string. Rather, placing the $ character after the rest of the regular expression means to match the string with the regular expression as long as the match occurs at the end of the string. For instance, if you want to match a sentence in which a phone number is the last portion of the sentence, you could use the following:
\(\d{3}\)\s\d{3}-\d{4}$
My phone number is (555) 555-5555 = Match
(555) 555-5555 is my phone number = Not a match
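The following sketch checks the two claims made in this discussion: that the quantified expression accepts the same phone number as the longer form, and that appending $ restricts the match to the end of the input. The class, constant, and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhonePatternSketch
{
    // The expression written out digit by digit, and the quantified form.
    public const string LongForm = @"\(\d\d\d\)\s\d\d\d-\d\d\d\d";
    public const string ShortForm = @"\(\d{3}\)\s\d{3}-\d{4}";

    // True when both forms of the expression find a match in the input.
    public static bool MatchesBoth(string input)
    {
        return Regex.IsMatch(input, LongForm) && Regex.IsMatch(input, ShortForm);
    }

    // True only when a phone number is the last thing in the input.
    public static bool EndsWithPhone(string input)
    {
        return Regex.IsMatch(input, ShortForm + "$");
    }

    public static void Main()
    {
        Console.WriteLine(MatchesBoth("(555) 555-5555"));                       // True
        Console.WriteLine(EndsWithPhone("My phone number is (555) 555-5555"));  // True
        Console.WriteLine(EndsWithPhone("(555) 555-5555 is my phone number"));  // False
    }
}
```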
3.11. Validating User Input with Regular Expressions
You want to ensure valid user input by using regular expressions to test for validity.
Technique
Create a Regex object, which exists within the System.Text.RegularExpressions namespace, passing the regular expression in as a parameter to the constructor. Next, call the member method Match using the string you want to validate as a parameter to the method. The method returns a Match object regardless of the outcome. To test whether a match is made, evaluate the Boolean Success property on that Match object as demonstrated in Listing 3.9. Note also that the backslash (\) character appears frequently in regular expressions. To avoid compilation errors from inadvertently specifying an invalid escape sequence, prefix the string literal with the @ symbol to turn off escape processing.
Listing 3.9 Validating User Input of a Phone Number Using a Regular Expression
using System;
using System.Text.RegularExpressions;
namespace _11_RegularExpressions
{
    class Class1
    {
        [STAThread]
        static void Main(string[] args)
        {
            Regex phoneExp = new Regex( @"^\(\d{3}\)\s\d{3}-\d{4}$" );
            string input;
            Console.Write( "Enter a phone number: " );
            input = Console.ReadLine();
            while( phoneExp.Match( input ).Success == false )
            {
                Console.WriteLine( "Invalid input. Try again." );
                Console.Write( "Enter a phone number: " );
                input = Console.ReadLine();
            }
            Console.WriteLine( "Validated!" );
        }
    }
}
Comments
Earlier in this chapter I mentioned that you could perform data validation using the static methods within the System.Char class. You can inspect each character within the input string to ensure it matches exactly what you are looking for. However, this method of input validation can be extremely cumbersome if you have different input types to validate because it requires custom code for each validation. In other words, using the methods in the System.Char class is not recommended for anything but the simplest of data-validation procedures.
Regular expressions, on the other hand, let you express even complex validation rules within a single expression. You are in effect passing the parsing of the input string to the regular-expression engine and offloading all the work that you would normally do.
In Listing 3.9, you can see how you create and use a regular expression to test the validity of a phone number entered by a user. The regular expression is similar to the expressions used earlier for phone numbers except for the addition of positional markers. The match succeeds only if a user enters a phone number and nothing else. A match is successful when the Success property within the Match object, which is returned from the Regex.Match method, is true. The only caveat to using regular expressions for input validation is that even though you know the validation failed, you are unable to query the Regex or Match class to see what part of the string failed.
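When you only need a yes-or-no answer, the static Regex.IsMatch method is a compact alternative to creating a Regex object and reading the Success property. The sketch below wraps it in a hypothetical helper and also shows that the verbatim (@) literal and the fully escaped literal produce the same pattern text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PhoneValidator
{
    // The same pattern written with and without the @ prefix.
    public const string VerbatimPattern = @"^\(\d{3}\)\s\d{3}-\d{4}$";
    public const string EscapedPattern = "^\\(\\d{3}\\)\\s\\d{3}-\\d{4}$";

    // True when the input is a phone number and nothing else.
    public static bool IsValid(string input)
    {
        return Regex.IsMatch(input, VerbatimPattern);
    }

    // True when both literals compile to identical pattern text.
    public static bool PatternsAreIdentical()
    {
        return VerbatimPattern == EscapedPattern;
    }

    public static void Main()
    {
        Console.WriteLine(IsValid("(555) 555-5555"));       // True
        Console.WriteLine(IsValid("call (555) 555-5555"));  // False
        Console.WriteLine(PatternsAreIdentical());          // True
    }
}
```

Without the @ prefix, every backslash has to be doubled, which is easy to get wrong in longer expressions.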
3.12. Replacing Substrings Using Regular Expressions
You want to replace all substrings that match a regular expression with a different substring that also uses regular-expression syntax.
Technique
Create a Regex object, passing the regular expression used to match characters in the input string to the Regex constructor. Next, call the Regex method Replace, passing the input string to process and the string to replace each match within the input string, as shown in the last line of Listing 3.10. You can also use the static Replace method, which takes the input string, the regular expression, and the replacement string as parameters.
Listing 3.10 Using Regular Expressions to Replace Numbers in a Credit Card with xs
using System;
using System.Text.RegularExpressions;
namespace _12_RegExpReplace
{
    class Class1
    {
        [STAThread]
        static void Main(string[] args)
        {
            Regex cardExp = new Regex( @"(\d{4})-(\d{4})-(\d{4})-(\d{4})" );
            string safeOutputExp = "$1-xxxx-xxxx-$4";
            string cardNum;
            Console.Write( "Please enter your credit card number: " );
            cardNum = Console.ReadLine();
            while( cardExp.Match( cardNum ).Success == false )
            {
                Console.WriteLine( "Invalid card number. Try again." );
                Console.Write( "Please enter your credit card number: " );
                cardNum = Console.ReadLine();
            }
            Console.WriteLine( "Secure Output Result = {0}",
                cardExp.Replace( cardNum, safeOutputExp ));
        }
    }
}
Comments
Although input validation is an extremely useful feature of regular expressions, they also work well as text parsers. The previous recipe used regular expressions to verify that a particular string matched a regular expression exactly. However, you can also use regular expressions to match substrings within a string and return each of those substrings as a group. Furthermore, you can use a separate regular expression that acts on the result of the regular-expression evaluation to replace substrings within the original input string.
Listing 3.10 creates a regular expression that matches the format for a credit card. In that regular expression, you can see that it will match on four different groups of four digits apiece separated by a dash. However, you might also notice that each one of these groups is surrounded with parentheses. In an earlier recipe, I mentioned that to use a literal parenthesis, you must escape it using a backslash because of the conflict with regular-expression grouping symbols. In this case, you want to use the grouping feature of regular expressions. When you place a portion of a regular expression within parentheses, you are creating a numbered group. Groups are numbered starting with 1 and are incremented for each subsequent group. In this case, there are four numbered groups. These groups are used by the replacement string, which is contained in the string safeOutputExp. To reference a numbered group, use the $ symbol followed by the number of the group to reference. This sequence represents all characters within the input string that match the group expression within the regular expression. Therefore, in the replacement string, you can see that it prints the characters within the first group, replaces the characters in the second and third groups with xs, and finally prints the characters in the fourth group.
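The numbered groups can also be read directly from a Match object through its Groups collection, using the same numbers that the replacement string uses. The following sketch shows both; the class and method names are hypothetical, and the card number is made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CardGroupSketch
{
    private static readonly Regex CardExp =
        new Regex(@"(\d{4})-(\d{4})-(\d{4})-(\d{4})");

    // Keeps groups 1 and 4 and replaces groups 2 and 3 with xs.
    public static string Mask(string cardNum)
    {
        return CardExp.Replace(cardNum, "$1-xxxx-xxxx-$4");
    }

    // Returns the text captured by the numbered group (1 through 4).
    // Groups[0] holds the entire match.
    public static string GetGroup(string cardNum, int groupNumber)
    {
        return CardExp.Match(cardNum).Groups[groupNumber].Value;
    }

    public static void Main()
    {
        string card = "1234-5678-9012-3456";
        Console.WriteLine(Mask(card));         // 1234-xxxx-xxxx-3456
        Console.WriteLine(GetGroup(card, 1));  // 1234
        Console.WriteLine(GetGroup(card, 4));  // 3456
    }
}
```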
One thing to note is that you can use the Regex class to view each group of digits as a separate match. If you change the regular expression to "\d{4}", you can then use the Matches method to enumerate all the matches using the foreach keyword, as shown in Listing 3.11. In the listing, the program first checks to make sure at least four matches were made. This number corresponds to four groups of four digits. Next, it uses a foreach enumeration on each Match object that is returned from the Matches method. If the match is the second or third group of digits, its value is replaced with xs; otherwise, the Match object's value, the characters within that match, is concatenated to the result string.
Listing 3.11 Enumerating Through the Match Collection to Perform Special Operations on Each Match in a Regular Expression
static void TestManualGrouping()
{
    Regex cardExp = new Regex( @"\d{4}" );
    string cardNum;
    string safeOutputExp = "";
    Console.Write( "Please enter your credit card number: " );
    cardNum = Console.ReadLine();
    if( cardExp.Matches( cardNum ).Count < 4 )
    {
        Console.WriteLine( "Invalid card number" );
        return;
    }
    foreach( Match field in cardExp.Matches( cardNum ))
    {
        if( field.Success == false )
        {
            Console.WriteLine( "Invalid card number" );
            return;
        }
        // Separate each group of digits with a single dash.
        if( safeOutputExp.Length > 0 )
        {
            safeOutputExp += "-";
        }
        // In "dddd-dddd-dddd-dddd" the second and third groups
        // start at index 5 and index 10.
        if( field.Index == 5 || field.Index == 10 )
        {
            safeOutputExp += "xxxx";
        }
        else
        {
            safeOutputExp += field.Value;
        }
    }
    Console.WriteLine( "Secure Output Result = {0}", safeOutputExp );
}
3.13. Building a Regular Expression Library
You want to create a library of regular expressions that you can reuse in other projects.
Technique
Use the CompileToAssembly static method within the Regex class to compile a regular expression into an assembly. This method uses an array of RegexCompilationInfo objects that contain any number of regular expressions you want to add to the assembly.
The RegexCompilationInfo class contains a constructor with five parameters that you must supply. The parameters denote the string for the regular expression; any options for the regular expression, which appear in the RegexOptions enumerated type; a name for the class that is created to hold the regular expression; a corresponding namespace; and a Boolean value specifying whether the created class should have a public access modifier.
After creating the RegexCompilationInfo object, create an AssemblyName object, making sure to reference the System.Reflection namespace, and set the Name property to the name you want for the resultant assembly file. Because the CompileToAssembly method creates a DLL, exclude the DLL extension from the assembly name. Finally, place all the RegexCompilationInfo objects within an array and call the CompileToAssembly method. Listing 3.12 demonstrates how to create a RegexCompilationInfo object and how to use that object to compile a regular expression into an assembly using the CompileToAssembly method.
Listing 3.12 Using the CompileToAssembly Regex Method to Save Regular Expressions in a New Assembly for Later Reuse
using System;
using System.Text.RegularExpressions;
using System.Reflection;
namespace _12_RegExpReplace
{
    class Class1
    {
        [STAThread]
        static void Main(string[] args)
        {
            CompileRegex( @"(\d{4})-(\d{4})-(\d{4})-(\d{4})", @"regexlib" );
        }
        static void CompileRegex( string exp, string assemblyName )
        {
            RegexCompilationInfo compInfo =
                new RegexCompilationInfo( exp, RegexOptions.None, "CreditCardExp", "", true );
            AssemblyName assembly = new AssemblyName();
            assembly.Name = assemblyName;
            RegexCompilationInfo[] rciArray = { compInfo };
            Regex.CompileToAssembly( rciArray, assembly );
        }
    }
}
Comments
If you use regular expressions regularly, then you might find it advantageous to create a reusable library of the expressions you tend to use the most. The Regex class contains a method named CompileToAssembly that allows you to compile several regular expressions into an assembly that you can then reference within other projects.
Inside the assembly, you will find a class for each regular expression you added, each contained within the namespace you specified in its RegexCompilationInfo object. Furthermore, each of these classes inherits from the Regex class, so all the Regex methods are available for you to use. As you can see, creating a library of commonly used regular expressions allows you to reuse and share these expressions in a multitude of different projects. A change in a regular expression simply involves changing one assembly instead of each project that hard-coded the regular expression.