Introduction
The purpose of this article is to help developers enhance the text-processing capabilities of a .NET Framework application via the System.Text namespace. Processing text is one of the most common programming tasks. User input typically arrives as text, and it might need to be validated, sanitized, and reformatted. As a developer, you might also need to process text files generated by a legacy system to extract important data; these legacy systems often use nonstandard encoding techniques, and the majority of software development companies maintain large legacy code bases. Finally, a developer might need to output text files in specific formats to feed data into a legacy system.
Forming Regular Expressions – The Basics
As I stated earlier, developers often need to process text. For example, you might need to process text from a user to remove or replace special characters. A regular expression is a set of characters that can be compared to a string to determine whether the string meets specified format requirements. That is, regular expressions are simply arrangements of text that describe a pattern to match within some input text. Simple variants of regular expressions pop up everywhere. The MS-DOS command 'dir *.cs' employs a very simple pattern-matching technique much like what regular expressions can do: the asterisk followed by the period and the two letters is a simple wildcard pattern. To use regular expressions for pattern matching, we will create a simple console application named TestRegExp that accepts two strings as input and determines whether the first string (a regular expression) matches the second string. Here is some basic code. The regular expression will not make sense at all yet, but it should by the end of the article:
using System;
using System.Text.RegularExpressions;

namespace TestRegExp
{
    class Class1
    {
        [STAThread]
        static void Main(string[] args)
        {
            if (Regex.IsMatch(args[1], args[0]))
                Console.WriteLine("Input matches regular expression");
            else
                Console.WriteLine("Input does not match regular expression");
        }
    }
}
Now we run the application to determine whether the regular expression "^\d{5}$" matches the string "12345" or "1234":
C:\Windows\MICROS~1.NET\FRAMEW~1\V20~1.507>TestRegExp ^\d{5}$ 1234
Input does not match regular expression
C:\Windows\MICROS~1.NET\FRAMEW~1\V20~1.507>TestRegExp ^\d{5}$ 12345
Input matches regular expression
The Regex.IsMatch method compares a regular expression to a string and returns true if the string matches the regular expression. In this example, "^\d{5}$" means that the string must consist of exactly five numeric digits. The caret (^) represents the start of the string, "\d" means a numeric digit, "{5}" indicates five sequential numeric digits, and "$" represents the end of the string. Those of you who are familiar with the C language know that "%d" is the numeric specifier in the printf() function, though the resemblance to "\d" is only superficial. Here is another code example. Notice that the first array we create holds Regex objects whose constructor parameters are regular expression patterns; when the CLR allocates memory (because we used the "new" operator), those objects are referenced through the variable rr. The second array is an array of strings containing both numeric and alphabetic characters. Each string will be tested against each pattern: true if there is a match, false if there is no match:
using System;
using System.Text.RegularExpressions;

public sealed class Program {
    public static void Main()
    {
        Regex[] rr = new Regex[] { new Regex("^\\d+"),
            new Regex(@"\d+$"), new Regex("^\\d+$") };
        string[] ss = new string[] { "123abc", "abc123", "123", "abc", "a123bc" };
        foreach (Regex r in rr)
            foreach (string s in ss)
                Console.WriteLine("{0}.IsMatch({1}) = {2}", r.ToString(), s, r.IsMatch(s));
    }
}
Here is the output:
^\d+.IsMatch(123abc) = True
^\d+.IsMatch(abc123) = False
^\d+.IsMatch(123) = True
^\d+.IsMatch(abc) = False
^\d+.IsMatch(a123bc) = False
\d+$.IsMatch(123abc) = False
\d+$.IsMatch(abc123) = True
\d+$.IsMatch(123) = True
\d+$.IsMatch(abc) = False
\d+$.IsMatch(a123bc) = False
^\d+$.IsMatch(123abc) = False
^\d+$.IsMatch(abc123) = False
^\d+$.IsMatch(123) = True
^\d+$.IsMatch(abc) = False
^\d+$.IsMatch(a123bc) = False
Now let's look at the first example again. If you remove the first character from the regular expression, you drastically change the meaning of the pattern. The regular expression "\d{5}$" will still match valid five-digit numbers, such as "12345". However, it will also match input strings such as "abcd12345" or "drop table customers – 12345".
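The difference is easy to verify in code. This short sketch (the class name is my own) checks both patterns against a benign input and a suspicious one:

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    public static void Main()
    {
        string[] inputs = { "12345", "abcd12345" };
        foreach (string input in inputs)
        {
            // Fully anchored: the entire string must be five digits.
            bool strict = Regex.IsMatch(input, @"^\d{5}$");
            // Only end-anchored: any string *ending* in five digits passes.
            bool loose = Regex.IsMatch(input, @"\d{5}$");
            Console.WriteLine("{0}: anchored = {1}, end-only = {2}",
                input, strict, loose);
        }
    }
}
```

For "abcd12345", the anchored pattern reports False while the end-only pattern reports True, which is exactly the validation hole described above.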
If you have a background in Perl or Unix, regular expressions may already look familiar. Unix/Linux provides the AWK programming language and the sed stream editor, both built around pattern matching: often a predefined pattern of text has a one-to-one correspondence with an operation to perform on the matching data. I mention this because regular expression syntax appears rather cryptic; in fact, regular expressions derive from the Unix world. As you can see, regular expressions rely heavily on special characters whose meanings are hardly self-evident. If they seem cryptic, that's because they are.
Characters that Match Locations in Strings
- ^ specifies that the match must begin either at the first character of the string or the first character of the line.
- $ specifies that the match must end at either the last character of the string, the last character before \n at the end of the string, or the last character at the end of the line.
- \A specifies that the match must begin at the first character of the string.
- \Z specifies that the match must end at either the last character of the string or the last character before the \n at the end of the string.
- \z specifies that the match must end at the last character of the string.
- \b specifies that the match must occur on a boundary between \w (alphanumeric) and \W (non-alphanumeric) characters.
- \G specifies that the match must occur at the point where the previous match ended.
- \B specifies that the match must not occur on a \b boundary.
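A quick sketch of my own shows the difference between ^ and \A when RegexOptions.Multiline is in effect, and between \Z and \z at the end of a string:

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorCharacters
{
    public static void Main()
    {
        string text = "abc\ndef";

        // In Multiline mode, ^ matches at the start of every line...
        Console.WriteLine(Regex.IsMatch(text, "^def", RegexOptions.Multiline));   // True

        // ...but \A still matches only at the very start of the string.
        Console.WriteLine(Regex.IsMatch(text, @"\Adef", RegexOptions.Multiline)); // False

        // \Z tolerates a trailing \n; \z does not.
        Console.WriteLine(Regex.IsMatch("def\n", @"def\Z")); // True
        Console.WriteLine(Regex.IsMatch("def\n", @"def\z")); // False
    }
}
```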
How to Match Simple Text
The simplest use of regular expressions is to determine whether a string matches a pattern. For example, the regular expression "abc" matches the strings "abc", "abcde", and "yabc", because each of these strings contains the pattern; no wildcards are necessary. To match text that ends at the last character of a string, place a "$" symbol at the end of the regular expression. To exactly match a string, include both "^" and "$". When searching for words, use "\b" to match a word boundary. For example, "car\b" matches "car" or "tocar", but not "carburetor". Similarly, "\B" matches a nonword boundary and can be used to ensure that a character sequence appears in the middle of a word. For example, "car\B" matches "carburetor" but not "tocar".
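The boundary claims above are easy to confirm with a few IsMatch calls (a minimal sketch of my own):

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaries
{
    public static void Main()
    {
        // "car\b": "car" must end at a word boundary.
        Console.WriteLine(Regex.IsMatch("tocar", @"car\b"));      // True
        Console.WriteLine(Regex.IsMatch("carburetor", @"car\b")); // False

        // "car\B": "car" must be followed by another word character.
        Console.WriteLine(Regex.IsMatch("carburetor", @"car\B")); // True
        Console.WriteLine(Regex.IsMatch("tocar", @"car\B"));      // False
    }
}
```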
How to Match Special Characters
Special characters can be matched in regular expressions. Below is a list of the standard character escapes used in regular expressions. Note that in the .NET Framework, characters are always represented as 16-bit Unicode values. A character is represented with an instance of the System.Char structure (a value type), which is pretty simple: it offers two public read-only constant fields, MinValue, defined as '\0', and MaxValue, defined as '\uffff'.
Escape, Character#, and Description
- \0 \u0000 Null
- \a \u0007 Bell
- \b \u0008 Backspace
- \t \u0009 Tab
- \v \u000B Vertical Tab
- \f \u000C Form Feed
- \n \u000A Newline
- \r \u000D Carriage Return
- \e \u001B Escape
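To see a couple of these escapes in action, the sketch below (my own) matches a real tab character and an escaped literal dollar sign, which would otherwise be treated as the end-of-string metacharacter:

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    public static void Main()
    {
        // \t in the pattern matches a real tab character (\u0009).
        Console.WriteLine(Regex.IsMatch("col1\tcol2", @"\t"));  // True

        // $ is a metacharacter, so a literal dollar sign must be escaped.
        Console.WriteLine(Regex.IsMatch("price: $5", @"\$\d")); // True
        Console.WriteLine(Regex.IsMatch("price: 5", @"\$\d"));  // False
    }
}
```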
A Closer Look at Regular Expressions: The Terminology and the Usage
Regular expressions consist of metacharacters, symbols that are interpreted and treated in a special manner by the expression matcher, and literals, which are characters that match input literally without interpretation by the matcher. There are many kinds of metacharacters, including wildcards (a.k.a. character classes) and quantifiers, and most are represented with either a single or an escaped character. For example, examine the illustration below:
Literal: abc Quantifiers: *, {1,10} Alternation: | Backreference: \1
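Alternation and backreferences appear in the illustration but nowhere else in the article, so here is a small sketch of my own demonstrating both:

```csharp
using System;
using System.Text.RegularExpressions;

public static class MetaDemo
{
    public static void Main()
    {
        // Alternation: match either "cat" or "dog" anywhere in the input.
        Console.WriteLine(Regex.IsMatch("hotdog", "cat|dog"));     // True

        // Backreference: (\w) captures one word character, and \1 requires
        // the same character to appear again immediately (a doubled letter).
        Console.WriteLine(Regex.IsMatch("bookkeeper", @"(\w)\1")); // True
        Console.WriteLine(Regex.IsMatch("abcdef", @"(\w)\1"));     // False
    }
}
```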
Now assume that you are accepting input from a web page where the user has to enter his or her social security number. Social security numbers have very precise formatting requirements, so you would probably want to validate this input. Specifically, they consist of three digits, followed by a dash, followed by two digits, followed by a dash, followed by four digits. You could write this down more concretely as the pattern nnn-nn-nnnn, where each n is any number from 0 to 9. This is exactly what regular expressions enable you to do: they let you precisely capture textual formatting details in a computer-readable expression. This brings us to our first incarnation of a regular expression for social security numbers:
[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]
We can use quantifiers to make this expression more precise:
[0-9]{3}-[0-9]{2}-[0-9]{4}
The curly braces modify the expression preceding them and indicate that the pattern must match exactly n occurrences of it, where n is the number found between the braces. With this pattern, we describe the expression just as we would when explaining social security number formatting to another human being. Another slight modification can make the pattern even simpler. The class \d is nearly the same as the custom class [0-9]:
\d{3}-\d{2}-\d{4}
Now let's return to the System.Text.RegularExpressions APIs. The following code uses the Regex class to check a couple of strings for social-security pattern matches:
Regex ssn = new Regex(@"\d{3}-\d{2}-\d{4}");
bool IsMatch1 = ssn.IsMatch("011-01-0111");
bool IsMatch2 = ssn.IsMatch("abc-01-0111");
Now consider this code below:
using System;
using System.Text.RegularExpressions;

public sealed class Program {
    public static void Main()
    {
        string s = "011-01-0111";
        string s2 = "abc-01-0111";
        Regex r1 = new Regex("[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]");
        Console.WriteLine("{0}\r\nMatch? {1}: {2}, {3}: {4}",
            r1.ToString(), s, r1.IsMatch(s), s2, r1.IsMatch(s2));
        Regex r2 = new Regex("[0-9]{3}-[0-9]{2}-[0-9]{4}");
        Console.WriteLine("{0}\r\nMatch? {1}: {2}, {3}: {4}",
            r2.ToString(), s, r2.IsMatch(s), s2, r2.IsMatch(s2));
    }
}
Here is the output:
[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]
Match? 011-01-0111: True, abc-01-0111: False
[0-9]{3}-[0-9]{2}-[0-9]{4}
Match? 011-01-0111: True, abc-01-0111: False
Now let's combine generics with regular expressions. We will build a value type struct named Customer, store instances of it in a generic Dictionary, and use a regular expression with a replacement callback (a MatchEvaluator) to expand a predefined pattern:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

struct Customer
{
    public string Name;
    public string Ssn;
    public string Company;

    public Customer(int i)
    {
        Name = "Silly Willy";
        Ssn = "xxx-xx-xxxx";
        Company = "Acme";
    }
}

public class CustomerEvaluator
{
    private Dictionary<int, Customer> customers = new Dictionary<int, Customer>();

    private Customer LoadFromDb(int id)
    {
        if (customers.ContainsKey(id))
            return customers[id];
        Customer c = new Customer(10);
        customers.Add(id, c);
        return c;
    }

    public string Evaluate(Match match)
    {
        string id = match.Groups["custid"].Value;
        Customer c = LoadFromDb(int.Parse(id));
        string a = match.Groups["attrib"].Value;
        switch (a)
        {
            case "name":
                return c.Name;
            case "ssn":
                return c.Ssn;
            case "company":
                return c.Company;
            default:
                return "?unknown?";
        }
    }

    public string Process(string input)
    {
        Regex r = new Regex(@"\$\{(?<custid>\d+):(?<attrib>\w+)\}");
        return r.Replace(input, Evaluate);
    }
}

public sealed class Program {
    public static void Main()
    {
        string input = "Customer name is ${1011:name}; company is ${1011:company}.";
        CustomerEvaluator ce = new CustomerEvaluator();
        Console.WriteLine(ce.Process(input));
    }
}
Here is what the compiled code outputs:
Customer name is Silly Willy; company is Acme.
Now let's examine encoding.
Understanding Encoding
Although it was not the first encoding type, the American Standard Code for Information Interchange (ASCII) is still the foundation for existing encoding types. ASCII assigned characters to 7-bit values, using the numbers 0 through 127. These characters included English uppercase and lowercase letters, numbers, punctuation, and some special control characters. For example, 0x21 is "!", 0x31 is "1", 0x43 is "C", 0x63 is "c", and 0x7D is "}". Note that 127 in decimal is 0x7F in hexadecimal. While ASCII was sufficient for most English-language communications, it did not include letters used in non-English alphabets. To enable computers to be used in non-English-speaking locations, computer manufacturers made use of the remaining values, 128 through 255, in an 8-bit byte. Since one bit can have one of two possible values, a 0 or a 1, an 8-bit byte allows 2 raised to the 8th power, or 256 (0-255), possible combinations. Over time, different locations assigned different characters to the values greater than 127. Because different locations might have different characters assigned to a single value, transferring documents between different languages created problems. (Note, incidentally, that Windows Notepad does not offer a 7-bit ASCII encoding type.) To help reduce these problems, the American National Standards Institute (ANSI) defined standardized code pages that had standard ASCII values for 0 through 127 and language-specific values for 128 through 255. A code page is a list of selected character codes (characters represented as code points) in a certain order. Code pages are usually defined to support specific languages or groups of languages that share common writing systems. Microsoft Windows code pages contain 256 code points and are zero-based.
If you have ever received an email message or seen a Web page that seems to have box characters or question marks where letters should appear, you have witnessed an encoding problem. Because people create Web pages and e-mails in many different languages, each must be tagged with an encoding type. For example, an e-mail might include one of the following headers:
Content-Type: text/plain; charset=ISO-8859-1
Content-Type: text/plain; charset="Windows-1251"
"ISO-8859-1" corresponds to code page 28591, "Western European (ISO)". If the header had specified "ISO-8859-7", the message could have contained characters from the "Greek (ISO)" code page, number 28597. Similarly, HTML web pages typically include a meta tag such as one of the following:
<meta http-equiv="Content-Type" content = "text/html; charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
As most computer users have noticed, ASCII and ISO-8859 encoding types are gradually being replaced with Unicode. The .NET Framework uses UTF-16 (Unicode Transformation Format, 16-bit encoding) to represent characters. In some cases, the .NET Framework uses UTF-8 internally. The System.Text namespace provides classes that allow you to encode and decode characters. System.Text encoding support includes the following encodings:
- Unicode UTF-32 encoding encodes all characters as 4 bytes. UTF-32 is available in both little-endian and big-endian variants, so it can be used to convert between the two byte orders.
- Unicode UTF-16 encoding encodes each 16-bit character as 2 bytes. The characters are not transformed at all, and no compression occurs. UTF-16 is likewise available in both little-endian and big-endian variants.
- Unicode UTF-8 encoding encodes some characters as 1 byte, some as 2 bytes, some as 3 bytes, and some as 4 bytes. Characters with a value below 0x0080 (128 in decimal) are compressed to 1 byte, which works very well for text in the United States. Characters between 0x0080 and 0x07FF are converted to 2 bytes, which works well for European and Middle Eastern languages. Characters of 0x0800 and above are converted to 3 bytes, which works well for East Asian languages. Finally, surrogate pairs are written out as 4 bytes.
- ASCII encoding encodes 16-bit characters into ASCII characters; that is, any character with a value less than 0x0080 is converted into a single byte. Any character with a value greater than 0x007F can't be converted, so the character's value is lost.
- ANSI/ISO encodings, which map characters to specific code pages.
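The byte counts described above are easy to observe. The sketch below (my own, with sample characters I chose) also shows the lossy behavior of the ASCII encoding:

```csharp
using System;
using System.Text;

public static class EncodingSizes
{
    public static void Main()
    {
        // UTF-8 byte counts grow with the code point value.
        Console.WriteLine(Encoding.UTF8.GetByteCount("A"));      // 1 (below 0x0080)
        Console.WriteLine(Encoding.UTF8.GetByteCount("\u00E9")); // 2 (é: 0x0080-0x07FF)
        Console.WriteLine(Encoding.UTF8.GetByteCount("\u20AC")); // 3 (€: 0x0800 and above)
        Console.WriteLine(
            Encoding.UTF8.GetByteCount(char.ConvertFromUtf32(0x10348))); // 4 (surrogate pair)

        // ASCII cannot represent values above 0x007F; they degrade to '?'.
        byte[] ascii = Encoding.ASCII.GetBytes("caf\u00E9");
        Console.WriteLine(Encoding.ASCII.GetString(ascii)); // caf?
    }
}
```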
Using the System.Text.Encoding Class
When you need to encode or decode a set of characters, you should obtain an instance of a class derived from System.Text.Encoding. Encoding is an abstract base class that offers several static read-only properties, each of which returns an instance of an Encoding-derived class. Here is an example that encodes and decodes characters by using UTF-8:
using System;
using System.Text;

public static class Program {
    public static void Main() {
        String s = "Hi there.";
        Encoding encodingUTF8 = Encoding.UTF8;
        Byte[] encodedBytes = encodingUTF8.GetBytes(s);
        Console.WriteLine("Encoded bytes: " +
            BitConverter.ToString(encodedBytes));
        String decodedString = encodingUTF8.GetString(encodedBytes);
        Console.WriteLine("Decoded string: " + decodedString);
    }
}
Here is the output:
Encoded bytes: 48-69-20-74-68-65-72-65-2E
Decoded string: Hi there.
And here is code that performs a code page conversion:
using System;
using System.Text;
using System.IO;

public static class Program {
    public static void Main(String[] args) {
        if (args.Length != 4) {
            Console.WriteLine(
                "CodePageConverter <SrcFile> <SrcCodePage> <DstFile> <DstCodePage>{0}{0}" +
                "Examples:{0}" +
                "   CodePageConverter InFile.txt 65001 OutFile.txt 1200{0}" +
                "      => Converts from UTF-8 (codepage 65001) to UTF-16 (Little Endian){0}{0}" +
                "   CodePageConverter InFile.txt 932 OutFile.txt UTF-8{0}" +
                "      => Converts from shift-jis (codepage 932) to UTF-8",
                Environment.NewLine);
            return;
        }

        StreamReader srcText = new StreamReader(args[0], GetEncoding(args[1]));
        StreamWriter dstText = new StreamWriter(args[2], false, GetEncoding(args[3]));
        dstText.Write(srcText.ReadToEnd());
        srcText.Close();
        dstText.Close();
    }

    static Encoding GetEncoding(String s) {
        try {
            return Encoding.GetEncoding(Int32.Parse(s));
        }
        catch (FormatException) {
        }
        return Encoding.GetEncoding(s);
    }
}
And the good old output:
CodePageConverter <SrcFile> <SrcCodePage> <dstfile> <dstcodepage>
Examples:
CodePageConverter InFile.txt 65001 OutFile.txt 1200
=> Converts from UTF-8 (codepage 65001) to UTF-16 (Little Endian)
CodePageConverter InFile.txt 932 OutFile.txt UTF-8
=> Converts from shift-jis (codepage 932) to UTF-8
You use the Encoding class's GetEncoding method to return an encoding object for a specified encoding, and the Encoding.GetBytes method to convert a Unicode string to its byte representation in that encoding. The code example shown below uses Encoding.GetEncoding to create a target encoding object for the Korean code page, then calls Encoding.GetBytes to convert a Unicode string to its byte representation in the Korean encoding. The code then displays the byte representation of the string in the Korean code page:
using System;
using System.Text;

public sealed class Program {
    public static void Main() {
        Encoding e = Encoding.GetEncoding("Korean");
        byte[] encoded;
        encoded = e.GetBytes("Hello, World!");
        for (int i = 0; i < encoded.Length; ++i)
            Console.WriteLine("Byte {0} : {1}", i, encoded[i]);
    }
}
Here is the output:
Byte 0 : 72
Byte 1 : 101
Byte 2 : 108
Byte 3 : 108
Byte 4 : 111
Byte 5 : 44
Byte 6 : 32
Byte 7 : 87
Byte 8 : 111
Byte 9 : 114
Byte 10 : 108
Byte 11 : 100
Byte 12 : 33
The code sample above just demonstrates how to convert text to a different code page; however, you would not normally convert an English-language phrase into a different code page. In most code pages, the code points 0 through 127 represent the same ASCII characters, which allows for continuity and legacy code. The code points 128 through 255, however, differ significantly between code pages. To examine all code pages supported by the .NET Framework, call the Encoding class's static GetEncodings method, which returns an array of EncodingInfo objects. The following code displays the number, official name, and friendly name of each .NET Framework code page:
using System;
using System.Text;

public sealed class Program {
    public static void Main() {
        EncodingInfo[] ei = Encoding.GetEncodings();
        foreach (EncodingInfo e in ei)
            Console.WriteLine("{0}: {1}, {2}", e.CodePage, e.Name, e.DisplayName);
    }
}
And take a look at the output:
37: IBM037, IBM EBCDIC (US-Canada)
437: IBM437, OEM United States
500: IBM500, IBM EBCDIC (International)
708: ASMO-708, Arabic (ASMO 708)
720: DOS-720, Arabic (DOS)
737: ibm737, Greek (DOS)
775: ibm775, Baltic (DOS)
850: ibm850, Western European (DOS)
852: ibm852, Central European (DOS)
855: IBM855, OEM Cyrillic
857: ibm857, Turkish (DOS)
858: IBM00858, OEM Multilingual Latin I
860: IBM860, Portuguese (DOS)
861: ibm861, Icelandic (DOS)
862: DOS-862, Hebrew (DOS)
863: IBM863, French Canadian (DOS)
864: IBM864, Arabic (864)
865: IBM865, Nordic (DOS)
866: cp866, Cyrillic (DOS)
869: ibm869, Greek, Modern (DOS)
870: IBM870, IBM EBCDIC (Multilingual Latin-2)
874: windows-874, Thai (Windows)
875: cp875, IBM EBCDIC (Greek Modern)
932: shift_jis, Japanese (Shift-JIS)
936: gb2312, Chinese Simplified (GB2312)
949: ks_c_5601-1987, Korean
950: big5, Chinese Traditional (Big5)
1026: IBM1026, IBM EBCDIC (Turkish Latin-5)
1047: IBM01047, IBM Latin-1
1140: IBM01140, IBM EBCDIC (US-Canada-Euro)
1141: IBM01141, IBM EBCDIC (Germany-Euro)
1142: IBM01142, IBM EBCDIC (Denmark-Norway-Euro)
1143: IBM01143, IBM EBCDIC (Finland-Sweden-Euro)
1144: IBM01144, IBM EBCDIC (Italy-Euro)
1145: IBM01145, IBM EBCDIC (Spain-Euro)
1146: IBM01146, IBM EBCDIC (UK-Euro)
1147: IBM01147, IBM EBCDIC (France-Euro)
1148: IBM01148, IBM EBCDIC (International-Euro)
1149: IBM01149, IBM EBCDIC (Icelandic-Euro)
1200: utf-16, Unicode
1201: unicodeFFFE, Unicode (Big-Endian)
1250: windows-1250, Central European (Windows)
1251: windows-1251, Cyrillic (Windows)
1252: Windows-1252, Western European (Windows)
1253: windows-1253, Greek (Windows)
1254: windows-1254, Turkish (Windows)
1255: windows-1255, Hebrew (Windows)
1256: windows-1256, Arabic (Windows)
1257: windows-1257, Baltic (Windows)
1258: windows-1258, Vietnamese (Windows)
1361: Johab, Korean (Johab)
10000: macintosh, Western European (Mac)
10001: x-mac-japanese, Japanese (Mac)
10002: x-mac-chinesetrad, Chinese Traditional (Mac)
10003: x-mac-korean, Korean (Mac)
10004: x-mac-arabic, Arabic (Mac)
10005: x-mac-hebrew, Hebrew (Mac)
10006: x-mac-greek, Greek (Mac)
10007: x-mac-cyrillic, Cyrillic (Mac)
10008: x-mac-chinesesimp, Chinese Simplified (Mac)
10010: x-mac-romanian, Romanian (Mac)
10017: x-mac-ukrainian, Ukrainian (Mac)
10021: x-mac-thai, Thai (Mac)
10029: x-mac-ce, Central European (Mac)
10079: x-mac-icelandic, Icelandic (Mac)
10081: x-mac-turkish, Turkish (Mac)
10082: x-mac-croatian, Croatian (Mac)
12000: utf-32, Unicode (UTF-32)
12001: utf-32BE, Unicode (UTF-32 Big-Endian)
20000: x-Chinese-CNS, Chinese Traditional (CNS)
20001: x-cp20001, TCA Taiwan
20002: x-Chinese-Eten, Chinese Traditional (Eten)
20003: x-cp20003, IBM5550 Taiwan
20004: x-cp20004, TeleText Taiwan
20005: x-cp20005, Wang Taiwan
20105: x-IA5, Western European (IA5)
20106: x-IA5-German, German (IA5)
20107: x-IA5-Swedish, Swedish (IA5)
20108: x-IA5-Norwegian, Norwegian (IA5)
20127: us-ascii, US-ASCII
20261: x-cp20261, T.61
20269: x-cp20269, ISO-6937
20273: IBM273, IBM EBCDIC (Germany)
20277: IBM277, IBM EBCDIC (Denmark-Norway)
20278: IBM278, IBM EBCDIC (Finland-Sweden)
20280: IBM280, IBM EBCDIC (Italy)
20284: IBM284, IBM EBCDIC (Spain)
20285: IBM285, IBM EBCDIC (UK)
20290: IBM290, IBM EBCDIC (Japanese katakana)
20297: IBM297, IBM EBCDIC (France)
20420: IBM420, IBM EBCDIC (Arabic)
20423: IBM423, IBM EBCDIC (Greek)
20424: IBM424, IBM EBCDIC (Hebrew)
20833: x-EBCDIC-KoreanExtended, IBM EBCDIC (Korean Extended)
20838: IBM-Thai, IBM EBCDIC (Thai)
20866: koi8-r, Cyrillic (KOI8-R)
20871: IBM871, IBM EBCDIC (Icelandic)
20880: IBM880, IBM EBCDIC (Cyrillic Russian)
20905: IBM905, IBM EBCDIC (Turkish)
20924: IBM00924, IBM Latin-1
20932: EUC-JP, Japanese (JIS 0208-1990 and 0212-1990)
20936: x-cp20936, Chinese Simplified (GB2312-80)
20949: x-cp20949, Korean Wansung
21025: cp1025, IBM EBCDIC (Cyrillic Serbian-Bulgarian)
21866: koi8-u, Cyrillic (KOI8-U)
28591: iso-8859-1, Western European (ISO)
28592: iso-8859-2, Central European (ISO)
28593: iso-8859-3, Latin 3 (ISO)
28594: iso-8859-4, Baltic (ISO)
28595: iso-8859-5, Cyrillic (ISO)
28596: iso-8859-6, Arabic (ISO)
28597: iso-8859-7, Greek (ISO)
28598: iso-8859-8, Hebrew (ISO-Visual)
28599: iso-8859-9, Turkish (ISO)
28603: iso-8859-13, Estonian (ISO)
28605: iso-8859-15, Latin 9 (ISO)
29001: x-Europa, Europa
38598: iso-8859-8-i, Hebrew (ISO-Logical)
50220: iso-2022-jp, Japanese (JIS)
50221: csISO2022JP, Japanese (JIS-Allow 1 byte Kana)
50222: iso-2022-jp, Japanese (JIS-Allow 1 byte Kana - SO/SI)
50225: iso-2022-kr, Korean (ISO)
50227: x-cp50227, Chinese Simplified (ISO-2022)
51932: euc-jp, Japanese (EUC)
51936: EUC-CN, Chinese Simplified (EUC)
51949: euc-kr, Korean (EUC)
52936: hz-gb-2312, Chinese Simplified (HZ)
54936: GB18030, Chinese Simplified (GB18030)
57002: x-iscii-de, ISCII Devanagari
57003: x-iscii-be, ISCII Bengali
57004: x-iscii-ta, ISCII Tamil
57005: x-iscii-te, ISCII Telugu
57006: x-iscii-as, ISCII Assamese
57007: x-iscii-or, ISCII Oriya
57008: x-iscii-ka, ISCII Kannada
57009: x-iscii-ma, ISCII Malayalam
57010: x-iscii-gu, ISCII Gujarati
57011: x-iscii-pa, ISCII Punjabi
65000: utf-7, Unicode (UTF-7)
65001: utf-8, Unicode (UTF-8)
To specify an encoding type when writing a file, use an overloaded StreamWriter constructor that accepts an Encoding object. For example, the following code creates several files with different encoding types:
using System;
using System.IO;
using System.Text;

public sealed class Program {
    public static void Main() {
        StreamWriter swUtf7 = new StreamWriter("utf7.txt", false, Encoding.UTF7);
        swUtf7.WriteLine("Hello, World!");
        swUtf7.Close();

        StreamWriter swUtf8 = new StreamWriter("utf8.txt", false, Encoding.UTF8);
        swUtf8.WriteLine("Hello, World!");
        swUtf8.Close();

        StreamWriter swUtf16 = new StreamWriter("utf16.txt", false, Encoding.Unicode);
        swUtf16.WriteLine("Hello, World!");
        swUtf16.Close();

        StreamWriter swUtf32 = new StreamWriter("utf32.txt", false, Encoding.UTF32);
        swUtf32.WriteLine("Hello, World!");
        swUtf32.Close();
    }
}
No output will appear on the console, but if you run the MS-DOS 'dir' command you will find a group of text files named "utf7.txt", "utf8.txt", "utf16.txt", and "utf32.txt". Take particular notice of the utf32.txt file: opened in an editor that does not understand UTF-32, it appears as "H e l l o ,   W o r l d !" because of the zero bytes between characters. The utf16.txt file, by contrast, displays simply as "Hello, World!". Open up the remaining files and take a look.
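To read such files back correctly, StreamReader can detect the encoding from the byte order mark (BOM) that StreamWriter emits at the start of the file. Here is a small sketch of my own (the file name is arbitrary):

```csharp
using System;
using System.IO;
using System.Text;

public static class RoundTrip
{
    public static void Main()
    {
        // Write a UTF-16 (little-endian) file; StreamWriter prepends a BOM.
        using (StreamWriter sw = new StreamWriter("demo16.txt", false, Encoding.Unicode))
            sw.WriteLine("Hello, World!");

        // Passing true for detectEncodingFromByteOrderMarks lets the reader
        // switch encodings based on the BOM it finds in the stream.
        using (StreamReader sr = new StreamReader("demo16.txt", Encoding.UTF8, true))
        {
            Console.WriteLine(sr.ReadLine());              // Hello, World!
            Console.WriteLine(sr.CurrentEncoding.WebName); // utf-16
        }
    }
}
```

Note that CurrentEncoding reports the detected encoding only after the first read has occurred.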
Encodings: Converting Between Characters and Bytes
In Win32, programmers have to write code to convert between Unicode characters and strings and Multi-Byte Character Set (MBCS) characters and strings. If you have ever worked with C++ and MFC, you have undoubtedly changed a project's properties between Unicode and Multi-Byte Character Set. Encodings allow a managed application to interact with strings created by non-Unicode systems. For example, if you want to produce a file readable by an application running on a Japanese version of Windows 95, you have to save the Unicode text by using the Shift-JIS (code page 932) encoding. Likewise, you'd have to use the Shift-JIS encoding to read a file produced on a Japanese Windows 95 system into the CLR.
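As a rough illustration (my own sketch; the sample text is arbitrary), the same Japanese string occupies different numbers of bytes under different encodings. Shift-JIS itself would store each of these characters in 2 bytes, but obtaining it requires Encoding.GetEncoding(932), which on modern .NET also requires registering a code-pages encoding provider, so this sketch sticks to the Unicode encodings that are always available:

```csharp
using System;
using System.Text;

public static class JapaneseBytes
{
    public static void Main()
    {
        string konnichiwa = "\u3053\u3093\u306B\u3061\u306F"; // こんにちは (5 characters)

        // UTF-16: 2 bytes per character for these code points.
        Console.WriteLine(Encoding.Unicode.GetByteCount(konnichiwa)); // 10

        // UTF-8: these code points fall in the 3-byte range (0x0800 and above).
        Console.WriteLine(Encoding.UTF8.GetByteCount(konnichiwa));    // 15

        // Both round-trip losslessly.
        byte[] utf8 = Encoding.UTF8.GetBytes(konnichiwa);
        Console.WriteLine(Encoding.UTF8.GetString(utf8) == konnichiwa); // True
    }
}
```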
References
- Professional .NET Framework 2.0, written by Joe Duffy
- CLR via C#, written by Jeffrey Richter