Introduction
The purpose of this article is to help developers enhance the text-processing capabilities of a .NET Framework application via the System.Text namespace. Processing text is one of the most common programming tasks. User input typically arrives as text, and it might need to be validated, sanitized, and reformatted. As a developer, you might also need to process text files generated by a legacy system to extract important data; these legacy systems often use nonstandard encoding techniques, and the majority of software development companies maintain large legacy code bases. Finally, a developer might need to output text files in specific formats to feed data into a legacy system.
Forming Regular Expressions – The Basics
As I stated earlier, developers often need to process text. For example, you might need to process text from a user to remove or replace special characters. A regular expression is a set of characters that can be compared to a string to determine whether the string meets specified format requirements. That is, regular expressions are simply arrangements of text that describe a pattern to match within some input text. Simple variants of regular expressions pop up everywhere. The MS-DOS command 'dir *.cs' employs a very simple pattern-matching technique much like what regular expressions can do: the asterisk followed by the period and the two letters is a simple wildcard pattern. To use regular expressions for pattern matching, we will create a simple console application named TestRegExp that accepts two strings as input and determines whether the first string (a regular expression) matches the second string. Here is some basic code. The regular expression will not make sense at all yet, but it should by the end of the article:
using System;
using System.Text.RegularExpressions;

namespace TestRegExp
{
    class Class1
    {
        [STAThread]
        static void Main(string[] args)
        {
            if (Regex.IsMatch(args[1], args[0]))
                Console.WriteLine("Input matches regular expression");
            else
                Console.WriteLine("Input does not match regular expression");
        }
    }
}
Now we run the application to determine whether the regular expression "^\d{5}$" matches the string "12345" or "1234":
C:\Windows\MICROS~1.NET\FRAMEW~1\V20~1.507>TestRegExp ^\d{5}$ 1234
Input does not match regular expression
C:\Windows\MICROS~1.NET\FRAMEW~1\V20~1.507>TestRegExp ^\d{5}$ 12345
Input matches regular expression
The Regex.IsMatch method compares a regular expression to a string and returns true if the string matches the regular expression. In this example, "^\d{5}$" means that the string must consist of exactly five numeric digits. The caret (^) represents the start of the string, "\d" means a numeric digit, "{5}" indicates five sequential numeric digits, and "$" represents the end of the string. Those of you who are familiar with the C language know that "%d" is the numeric specifier in the printf() function, though the resemblance to "\d" is only superficial. Here is another code example. Notice that the first array we create holds Regex objects whose constructor parameters are regular expression patterns; when the CLR allocates memory (because we used the "new" operator), those objects are referenced through the variable rr. The second array is an array of strings containing both numeric and alphabetic characters. Each string will be tested against each pattern: true if there is a match, false if there is no match:
using System;
using System.Text.RegularExpressions;

public sealed class Program {
    public static void Main()
    {
        Regex[] rr = new Regex[] { new Regex("^\\d+"),
            new Regex(@"\d+$"), new Regex("^\\d+$") };
        string[] ss = new string[] { "123abc", "abc123", "123", "abc", "a123bc" };
        foreach (Regex r in rr)
            foreach (string s in ss)
                Console.WriteLine("{0}.IsMatch({1}) = {2}", r.ToString(), s, r.IsMatch(s));
    }
}
Here is the output:
^\d+.IsMatch(123abc) = True
^\d+.IsMatch(abc123) = False
^\d+.IsMatch(123) = True
^\d+.IsMatch(abc) = False
^\d+.IsMatch(a123bc) = False
\d+$.IsMatch(123abc) = False
\d+$.IsMatch(abc123) = True
\d+$.IsMatch(123) = True
\d+$.IsMatch(abc) = False
\d+$.IsMatch(a123bc) = False
^\d+$.IsMatch(123abc) = False
^\d+$.IsMatch(abc123) = False
^\d+$.IsMatch(123) = True
^\d+$.IsMatch(abc) = False
^\d+$.IsMatch(a123bc) = False
Now let's look at the first example again. If you remove the first character from the regular expression, you drastically change the meaning of the pattern. The regular expression "\d{5}$" will still match valid five-digit numbers, such as "12345". However, it will also match input strings such as "abcd12345" or "drop table customers – 12345".
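The difference is easy to verify in code. This short sketch (the class name is my own) checks both patterns against a benign input and a suspicious one:

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    public static void Main()
    {
        string[] inputs = { "12345", "abcd12345" };
        foreach (string input in inputs)
        {
            // Fully anchored: the entire string must be five digits.
            bool strict = Regex.IsMatch(input, @"^\d{5}$");
            // Only end-anchored: any string *ending* in five digits passes.
            bool loose = Regex.IsMatch(input, @"\d{5}$");
            Console.WriteLine("{0}: anchored = {1}, end-only = {2}",
                input, strict, loose);
        }
    }
}
```

For "abcd12345", the anchored pattern reports False while the end-only pattern reports True, which is exactly the validation hole described above.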
If you have a background in Perl or Unix, regular expressions may already look familiar. Unix/Linux provides the AWK programming language and the sed stream editor, both built around pattern matching: often a predefined pattern of text has a one-to-one correspondence with an operation to perform on the matching data. I mention this because regular expression syntax appears rather cryptic; in fact, regular expressions derive from the Unix world. As you can see, regular expressions rely heavily on special characters whose meanings are hardly self-evident. If they seem cryptic, that's because they are.
Characters that Match Locations in Strings
- ^ specifies that the match must begin either at the first character of the string or the first character of the line.
- $ specifies that the match must end at either the last character of the string, the last character before \n at the end of the string, or the last character at the end of the line.
- \A specifies that the match must begin at the first character of the string.
- \Z specifies that the match must end at either the last character of the string or the last character before the \n at the end of the string.
- \z specifies that the match must end at the last character of the string.
- \b specifies that the match must occur on a boundary between \w (alphanumeric) and \W (non-alphanumeric) characters.
- \G specifies that the match must occur at the point where the previous match ended.
- \B specifies that the match must not occur on a \b boundary.
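A quick sketch of my own shows the difference between ^ and \A when RegexOptions.Multiline is in effect, and between \Z and \z at the end of a string:

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorCharacters
{
    public static void Main()
    {
        string text = "abc\ndef";

        // In Multiline mode, ^ matches at the start of every line...
        Console.WriteLine(Regex.IsMatch(text, "^def", RegexOptions.Multiline));   // True

        // ...but \A still matches only at the very start of the string.
        Console.WriteLine(Regex.IsMatch(text, @"\Adef", RegexOptions.Multiline)); // False

        // \Z tolerates a trailing \n; \z does not.
        Console.WriteLine(Regex.IsMatch("def\n", @"def\Z")); // True
        Console.WriteLine(Regex.IsMatch("def\n", @"def\z")); // False
    }
}
```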
How to Match Simple Text
The simplest use of regular expressions is to determine whether a string matches a pattern. For example, the regular expression "abc" matches the strings "abc", "abcde", and "yabc", because each of these strings contains the pattern; no wildcards are necessary. To match text that ends at the last character of a string, place a "$" symbol at the end of the regular expression. To exactly match a string, include both "^" and "$". When searching for words, use "\b" to match a word boundary. For example, "car\b" matches "car" or "tocar", but not "carburetor". Similarly, "\B" matches a nonword boundary and can be used to ensure that a character sequence appears in the middle of a word. For example, "car\B" matches "carburetor" but not "tocar".
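The boundary claims above are easy to confirm with a few IsMatch calls (a minimal sketch of my own):

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaries
{
    public static void Main()
    {
        // "car\b": "car" must end at a word boundary.
        Console.WriteLine(Regex.IsMatch("tocar", @"car\b"));      // True
        Console.WriteLine(Regex.IsMatch("carburetor", @"car\b")); // False

        // "car\B": "car" must be followed by another word character.
        Console.WriteLine(Regex.IsMatch("carburetor", @"car\B")); // True
        Console.WriteLine(Regex.IsMatch("tocar", @"car\B"));      // False
    }
}
```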
How to Match Special Characters
Special characters can be matched in regular expressions. Below is a list of the standard character escapes used in regular expressions. Note that in the .NET Framework, characters are always represented as 16-bit Unicode values. A character is represented with an instance of the System.Char structure (a value type), which is pretty simple: it offers two public read-only constant fields, MinValue, defined as '\0', and MaxValue, defined as '\uffff'.
Escape, Character#, and Description
- \0 \u0000 Null
- \a \u0007 Bell
- \b \u0008 Backspace
- \t \u0009 Tab
- \v \u000B Vertical Tab
- \f \u000C Form Feed
- \n \u000A Newline
- \r \u000D Carriage Return
- \e \u001B Escape
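To see a couple of these escapes in action, the sketch below (my own) matches a real tab character and an escaped literal dollar sign, which would otherwise be treated as the end-of-string metacharacter:

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    public static void Main()
    {
        // \t in the pattern matches a real tab character (\u0009).
        Console.WriteLine(Regex.IsMatch("col1\tcol2", @"\t"));  // True

        // $ is a metacharacter, so a literal dollar sign must be escaped.
        Console.WriteLine(Regex.IsMatch("price: $5", @"\$\d")); // True
        Console.WriteLine(Regex.IsMatch("price: 5", @"\$\d"));  // False
    }
}
```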
A Closer Look at Regular Expressions: The Terminology and the Usage
Regular expressions consist of metacharacters, symbols that are interpreted and treated in a special manner by the expression matcher, and literals, which are characters that match input literally without interpretation by the matcher. There are many kinds of metacharacters, including wildcards (a.k.a. character classes) and quantifiers, and most are represented with either a single or an escaped character. For example, examine the illustration below:
Literal: abc Quantifiers: *, {1,10} Alternation: | Backreference: \1
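Alternation and backreferences appear in the illustration but nowhere else in the article, so here is a small sketch of my own demonstrating both:

```csharp
using System;
using System.Text.RegularExpressions;

public static class MetaDemo
{
    public static void Main()
    {
        // Alternation: match either "cat" or "dog" anywhere in the input.
        Console.WriteLine(Regex.IsMatch("hotdog", "cat|dog"));     // True

        // Backreference: (\w) captures one word character, and \1 requires
        // the same character to appear again immediately (a doubled letter).
        Console.WriteLine(Regex.IsMatch("bookkeeper", @"(\w)\1")); // True
        Console.WriteLine(Regex.IsMatch("abcdef", @"(\w)\1"));     // False
    }
}
```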
Now assume that you are accepting input from a web page where the user has to enter his or her social security number. Social security numbers have very precise formatting requirements, so you would probably want to validate this input. Specifically, they consist of three digits, followed by a dash, followed by two digits, followed by a dash, followed by four digits. You could write this down more concretely as the pattern nnn-nn-nnnn, where each n is any number from 0 to 9. This is exactly what regular expressions enable you to do: they let you precisely capture textual formatting details in a computer-readable expression. This brings us to our first incarnation of a regular expression for social security numbers:
[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]
We can use quantifiers to make this expression more precise:
[0-9]{3}-[0-9]{2}-[0-9]{4}
The curly braces modify the expression preceding them and indicate that the pattern must match exactly n occurrences of it, where n is the number found between the braces. With this pattern, we describe the expression just as we would when explaining social security number formatting to another human being. Another slight modification can make the pattern even simpler. The class \d is nearly the same as the custom class [0-9]:
\d{3}-\d{2}-\d{4}
Now let's return to the System.Text.RegularExpressions APIs. The following code uses the Regex class to check a couple of strings for social-security pattern matches:
Regex ssn = new Regex(@"\d{3}-\d{2}-\d{4}");
bool IsMatch1 = ssn.IsMatch("011-01-0111");
bool IsMatch2 = ssn.IsMatch("abc-01-0111");
Now consider this code below:
using System;
using System.Text.RegularExpressions;

public sealed class Program {
    public static void Main()
    {
        string s = "011-01-0111";
        string s2 = "abc-01-0111";
        Regex r1 = new Regex("[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]");
        Console.WriteLine("{0}\r\nMatch? {1}: {2}, {3}: {4}",
            r1.ToString(), s, r1.IsMatch(s), s2, r1.IsMatch(s2));
        Regex r2 = new Regex("[0-9]{3}-[0-9]{2}-[0-9]{4}");
        Console.WriteLine("{0}\r\nMatch? {1}: {2}, {3}: {4}",
            r2.ToString(), s, r2.IsMatch(s), s2, r2.IsMatch(s2));
    }
}
Here is the output:
[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]
Match? 011-01-0111: True, abc-01-0111: False
[0-9]{3}-[0-9]{2}-[0-9]{4}
Match? 011-01-0111: True, abc-01-0111: False
Now let's combine generics with regular expressions. We will build a value type struct named Customer, store instances of it in a generic Dictionary, and use a regular expression with a replacement callback (a MatchEvaluator) to expand a predefined pattern:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

struct Customer
{
    public string Name;
    public string Ssn;
    public string Company;

    public Customer(int i)
    {
        Name = "Silly Willy";
        Ssn = "xxx-xx-xxxx";
        Company = "Acme";
    }
}

public class CustomerEvaluator
{
    private Dictionary<int, Customer> customers = new Dictionary<int, Customer>();

    private Customer LoadFromDb(int id)
    {
        if (customers.ContainsKey(id))
            return customers[id];
        Customer c = new Customer(10);
        customers.Add(id, c);
        return c;
    }

    public string Evaluate(Match match)
    {
        string id = match.Groups["custid"].Value;
        Customer c = LoadFromDb(int.Parse(id));
        string a = match.Groups["attrib"].Value;
        switch (a)
        {
            case "name":
                return c.Name;
            case "ssn":
                return c.Ssn;
            case "company":
                return c.Company;
            default:
                return "?unknown?";
        }
    }

    public string Process(string input)
    {
        Regex r = new Regex(@"\$\{(?<custid>\d+):(?<attrib>\w+)\}");
        return r.Replace(input, Evaluate);
    }
}

public sealed class Program {
    public static void Main()
    {
        string input = "Customer name is ${1011:name}; company is ${1011:company}.";
        CustomerEvaluator ce = new CustomerEvaluator();
        Console.WriteLine(ce.Process(input));
    }
}
Here is what the compiled code outputs:
Customer name is Silly Willy; company is Acme.
Now let's examine encoding.
Understanding Encoding
Although it was not the first encoding type, the American Standard Code for Information Interchange (ASCII) is still the foundation for existing encoding types. ASCII assigned characters to 7-bit values, using the numbers 0 through 127. These characters included English uppercase and lowercase letters, numbers, punctuation, and some special control characters. For example, 0x21 is "!", 0x31 is "1", 0x43 is "C", 0x63 is "c", and 0x7D is "}". Note that 127 in decimal is 0x7F in hexadecimal. While ASCII was sufficient for most English-language communications, it did not include letters used in non-English alphabets. To enable computers to be used in non-English-speaking locations, computer manufacturers made use of the remaining values, 128 through 255, in an 8-bit byte. Since one bit can have one of two possible values, a 0 or a 1, an 8-bit byte allows 2 raised to the 8th power, or 256 (0-255), possible combinations. Over time, different locations assigned different characters to the values greater than 127. Because different locations might have different characters assigned to a single value, transferring documents between different languages created problems. (Note, incidentally, that Windows Notepad does not offer a 7-bit ASCII encoding type.) To help reduce these problems, the American National Standards Institute (ANSI) defined standardized code pages that had standard ASCII values for 0 through 127 and language-specific values for 128 through 255. A code page is a list of selected character codes (characters represented as code points) in a certain order. Code pages are usually defined to support specific languages or groups of languages that share common writing systems. Microsoft Windows code pages contain 256 code points and are zero-based.
If you have ever received an email message or seen a Web page that seems to have box characters or question marks where letters should appear, you have witnessed an encoding problem. Because people create Web pages and e-mails in many different languages, each must be tagged with an encoding type. For example, an e-mail might include one of the following headers:
Content-Type: text/plain; charset=ISO-8859-1
Content-Type: text/plain; charset="Windows-1251"
"ISO-8859-1" corresponds to code page 28591, "Western European (ISO)". If the header had specified "ISO-8859-7", the message could have contained characters from the "Greek (ISO)" code page, number 28597. Similarly, HTML web pages typically include a meta tag such as one of the following:
<meta http-equiv="Content-Type" content = "text/html; charset=iso-8859-1">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
As most computer users have noticed, ASCII and ISO-8859 encoding types are gradually being replaced with Unicode. The .NET Framework uses UTF-16 (Unicode Transformation Format, 16-bit encoding) to represent characters. In some cases, the .NET Framework uses UTF-8 internally. The System.Text namespace provides classes that allow you to encode and decode characters. System.Text encoding support includes the following encodings:
- Unicode UTF-32 encoding encodes all characters as 4 bytes. UTF-32 is available in both little-endian and big-endian variants, so it can be used to convert between the two byte orders.
- Unicode UTF-16 encoding encodes each 16-bit character as 2 bytes. The characters are not transformed at all, and no compression occurs. UTF-16 is likewise available in both little-endian and big-endian variants.
- Unicode UTF-8 encoding encodes some characters as 1 byte, some as 2 bytes, some as 3 bytes, and some as 4 bytes. Characters with a value below 0x0080 (128 in decimal) are compressed to 1 byte, which works very well for text in the United States. Characters between 0x0080 and 0x07FF are converted to 2 bytes, which works well for European and Middle Eastern languages. Characters of 0x0800 and above are converted to 3 bytes, which works well for East Asian languages. Finally, surrogate pairs are written out as 4 bytes.
- ASCII encoding encodes 16-bit characters into ASCII characters; that is, any character with a value less than 0x0080 is converted into a single byte. Any character with a value greater than 0x007F can't be converted, so the character's value is lost.
- ANSI/ISO encodings, which map characters to specific code pages.
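The byte counts described above are easy to observe. The sketch below (my own, with sample characters I chose) also shows the lossy behavior of the ASCII encoding:

```csharp
using System;
using System.Text;

public static class EncodingSizes
{
    public static void Main()
    {
        // UTF-8 byte counts grow with the code point value.
        Console.WriteLine(Encoding.UTF8.GetByteCount("A"));      // 1 (below 0x0080)
        Console.WriteLine(Encoding.UTF8.GetByteCount("\u00E9")); // 2 (é: 0x0080-0x07FF)
        Console.WriteLine(Encoding.UTF8.GetByteCount("\u20AC")); // 3 (€: 0x0800 and above)
        Console.WriteLine(
            Encoding.UTF8.GetByteCount(char.ConvertFromUtf32(0x10348))); // 4 (surrogate pair)

        // ASCII cannot represent values above 0x007F; they degrade to '?'.
        byte[] ascii = Encoding.ASCII.GetBytes("caf\u00E9");
        Console.WriteLine(Encoding.ASCII.GetString(ascii)); // caf?
    }
}
```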
Using the System.Text.Encoding Class
When you need to encode or decode a set of characters, you should obtain an instance of a class derived from System.Text.Encoding. Encoding is an abstract base class that offers several static read-only properties, each of which returns an instance of an Encoding-derived class. Here is an example that encodes and decodes characters by using UTF-8:
using System;
using System.Text;

public static class Program {
    public static void Main() {
        String s = "Hi there.";
        Encoding encodingUTF8 = Encoding.UTF8;
        Byte[] encodedBytes = encodingUTF8.GetBytes(s);
        Console.WriteLine("Encoded bytes: " +
            BitConverter.ToString(encodedBytes));
        String decodedString = encodingUTF8.GetString(encodedBytes);
        Console.WriteLine("Decoded string: " + decodedString);
    }
}
Here is the output:
Encoded bytes: 48-69-20-74-68-65-72-65-2E
Decoded string: Hi there.
And here is code that performs a code page conversion:
using System;
using System.Text;
using System.IO;

public static class Program {
    public static void Main(String[] args) {
        if (args.Length != 4) {
            Console.WriteLine(
                "CodePageConverter <SrcFile> <SrcCodePage> <DstFile> <DstCodePage>{0}{0}" +
                "Examples:{0}" +
                "   CodePageConverter InFile.txt 65001 OutFile.txt 1200{0}" +
                "      => Converts from UTF-8 (codepage 65001) to UTF-16 (Little Endian){0}{0}" +
                "   CodePageConverter InFile.txt 932 OutFile.txt UTF-8{0}" +
                "      => Converts from shift-jis (codepage 932) to UTF-8",
                Environment.NewLine);
            return;
        }

        StreamReader srcText = new StreamReader(args[0], GetEncoding(args[1]));
        StreamWriter dstText = new StreamWriter(args[2], false, GetEncoding(args[3]));
        dstText.Write(srcText.ReadToEnd());
        srcText.Close();
        dstText.Close();
    }

    static Encoding GetEncoding(String s) {
        try {
            return Encoding.GetEncoding(Int32.Parse(s));
        }
        catch (FormatException) {
        }
        return Encoding.GetEncoding(s);
    }
}
And the good old output:
CodePageConverter <SrcFile> <SrcCodePage> <dstfile> <dstcodepage>
Examples:
CodePageConverter InFile.txt 65001 OutFile.txt 1200
=> Converts from UTF-8 (codepage 65001) to UTF-16 (Little Endian)
CodePageConverter InFile.txt 932 OutFile.txt UTF-8
=> Converts from shift-jis (codepage 932) to UTF-8
You use the Encoding class's GetEncoding method to return an encoding object for a specified encoding, and the Encoding.GetBytes method to convert a Unicode string to its byte representation in that encoding. The code example shown below uses Encoding.GetEncoding to create a target encoding object for the Korean code page, then calls Encoding.GetBytes to convert a Unicode string to its byte representation in the Korean encoding. The code then displays the byte representation of the string in the Korean code page:
using System;
using System.Text;

public sealed class Program {
    public static void Main() {
        Encoding e = Encoding.GetEncoding("Korean");
        byte[] encoded;
        encoded = e.GetBytes("Hello, World!");
        for (int i = 0; i < encoded.Length; ++i)
            Console.WriteLine("Byte {0} : {1}", i, encoded[i]);
    }
}
Here is the output:
Byte 0 : 72
Byte 1 : 101
Byte 2 : 108
Byte 3 : 108
Byte 4 : 111
Byte 5 : 44
Byte 6 : 32
Byte 7 : 87
Byte 8 : 111
Byte 9 : 114
Byte 10 : 108
Byte 11 : 100
Byte 12 : 33
The code sample above just demonstrates how to convert text to a different code page; however, you would not normally convert an English-language phrase into a different code page. In most code pages, the code points 0 through 127 represent the same ASCII characters, which allows for continuity and legacy code. The code points 128 through 255, however, differ significantly between code pages. To examine all code pages supported by the .NET Framework, call the Encoding class's static GetEncodings method, which returns an array of EncodingInfo objects. The following code displays the number, official name, and friendly name of each .NET Framework code page:
using System;
using System.Text;

public sealed class Program {
    public static void Main() {
        EncodingInfo[] ei = Encoding.GetEncodings();
        foreach (EncodingInfo e in ei)
            Console.WriteLine("{0}: {1}, {2}", e.CodePage, e.Name, e.DisplayName);
    }
}
And take a look at the output:
37: IBM037, IBM EBCDIC (US-Canada)
437: IBM437, OEM United States
500: IBM500, IBM EBCDIC (International)
708: ASMO-708, Arabic (ASMO 708)
720: DOS-720, Arabic (DOS)
737: ibm737, Greek (DOS)
775: ibm775, Baltic (DOS)
850: ibm850, Western European (DOS)
852: ibm852, Central European (DOS)
855: IBM855, OEM Cyrillic
857: ibm857, Turkish (DOS)
858: IBM00858, OEM Multilingual Latin I
860: IBM860, Portuguese (DOS)
861: ibm861, Icelandic (DOS)
862: DOS-862, Hebrew (DOS)
863: IBM863, French Canadian (DOS)
864: IBM864, Arabic (864)
865: IBM865, Nordic (DOS)
866: cp866, Cyrillic (DOS)
869: ibm869, Greek, Modern (DOS)
870: IBM870, IBM EBCDIC (Multilingual Latin-2)
874: windows-874, Thai (Windows)
875: cp875, IBM EBCDIC (Greek Modern)
932: shift_jis, Japanese (Shift-JIS)
936: gb2312, Chinese Simplified (GB2312)
949: ks_c_5601-1987, Korean
950: big5, Chinese Traditional (Big5)
1026: IBM1026, IBM EBCDIC (Turkish Latin-5)
1047: IBM01047, IBM Latin-1
1140: IBM01140, IBM EBCDIC (US-Canada-Euro)
1141: IBM01141, IBM EBCDIC (Germany-Euro)
1142: IBM01142, IBM EBCDIC (Denmark-Norway-Euro)
1143: IBM01143, IBM EBCDIC (Finland-Sweden-Euro)
1144: IBM01144, IBM EBCDIC (Italy-Euro)
1145: IBM01145, IBM EBCDIC (Spain-Euro)
1146: IBM01146, IBM EBCDIC (UK-Euro)
1147: IBM01147, IBM EBCDIC (France-Euro)
1148: IBM01148, IBM EBCDIC (International-Euro)
1149: IBM01149, IBM EBCDIC (Icelandic-Euro)
1200: utf-16, Unicode
1201: unicodeFFFE, Unicode (Big-Endian)
1250: windows-1250, Central European (Windows)
1251: windows-1251, Cyrillic (Windows)
1252: Windows-1252, Western European (Windows)
1253: windows-1253, Greek (Windows)
1254: windows-1254, Turkish (Windows)
1255: windows-1255, Hebrew (Windows)
1256: windows-1256, Arabic (Windows)
1257: windows-1257, Baltic (Windows)
1258: windows-1258, Vietnamese (Windows)
1361: Johab, Korean (Johab)
10000: macintosh, Western European (Mac)
10001: x-mac-japanese, Japanese (Mac)
10002: x-mac-chinesetrad, Chinese Traditional (Mac)
10003: x-mac-korean, Korean (Mac)
10004: x-mac-arabic, Arabic (Mac)
10005: x-mac-hebrew, Hebrew (Mac)
10006: x-mac-greek, Greek (Mac)
10007: x-mac-cyrillic, Cyrillic (Mac)
10008: x-mac-chinesesimp, Chinese Simplified (Mac)
10010: x-mac-romanian, Romanian (Mac)
10017: x-mac-ukrainian, Ukrainian (Mac)
10021: x-mac-thai, Thai (Mac)
10029: x-mac-ce, Central European (Mac)
10079: x-mac-icelandic, Icelandic (Mac)
10081: x-mac-turkish, Turkish (Mac)
10082: x-mac-croatian, Croatian (Mac)
12000: utf-32, Unicode (UTF-32)
12001: utf-32BE, Unicode (UTF-32 Big-Endian)
20000: x-Chinese-CNS, Chinese Traditional (CNS)
20001: x-cp20001, TCA Taiwan
20002: x-Chinese-Eten, Chinese Traditional (Eten)
20003: x-cp20003, IBM5550 Taiwan
20004: x-cp20004, TeleText Taiwan
20005: x-cp20005, Wang Taiwan
20105: x-IA5, Western European (IA5)
20106: x-IA5-German, German (IA5)
20107: x-IA5-Swedish, Swedish (IA5)
20108: x-IA5-Norwegian, Norwegian (IA5)
20127: us-ascii, US-ASCII
20261: x-cp20261, T.61
20269: x-cp20269, ISO-6937
20273: IBM273, IBM EBCDIC (Germany)
20277: IBM277, IBM EBCDIC (Denmark-Norway)
20278: IBM278, IBM EBCDIC (Finland-Sweden)
20280: IBM280, IBM EBCDIC (Italy)
20284: IBM284, IBM EBCDIC (Spain)
20285: IBM285, IBM EBCDIC (UK)
20290: IBM290, IBM EBCDIC (Japanese katakana)
20297: IBM297, IBM EBCDIC (France)
20420: IBM420, IBM EBCDIC (Arabic)
20423: IBM423, IBM EBCDIC (Greek)
20424: IBM424, IBM EBCDIC (Hebrew)
20833: x-EBCDIC-KoreanExtended, IBM EBCDIC (Korean Extended)
20838: IBM-Thai, IBM EBCDIC (Thai)
20866: koi8-r, Cyrillic (KOI8-R)
20871: IBM871, IBM EBCDIC (Icelandic)
20880: IBM880, IBM EBCDIC (Cyrillic Russian)
20905: IBM905, IBM EBCDIC (Turkish)
20924: IBM00924, IBM Latin-1
20932: EUC-JP, Japanese (JIS 0208-1990 and 0212-1990)
20936: x-cp20936, Chinese Simplified (GB2312-80)
20949: x-cp20949, Korean Wansung
21025: cp1025, IBM EBCDIC (Cyrillic Serbian-Bulgarian)
21866: koi8-u, Cyrillic (KOI8-U)
28591: iso-8859-1, Western European (ISO)
28592: iso-8859-2, Central European (ISO)
28593: iso-8859-3, Latin 3 (ISO)
28594: iso-8859-4, Baltic (ISO)
28595: iso-8859-5, Cyrillic (ISO)
28596: iso-8859-6, Arabic (ISO)
28597: iso-8859-7, Greek (ISO)
28598: iso-8859-8, Hebrew (ISO-Visual)
28599: iso-8859-9, Turkish (ISO)
28603: iso-8859-13, Estonian (ISO)
28605: iso-8859-15, Latin 9 (ISO)
29001: x-Europa, Europa
38598: iso-8859-8-i, Hebrew (ISO-Logical)
50220: iso-2022-jp, Japanese (JIS)
50221: csISO2022JP, Japanese (JIS-Allow 1 byte Kana)
50222: iso-2022-jp, Japanese (JIS-Allow 1 byte Kana - SO/SI)
50225: iso-2022-kr, Korean (ISO)
50227: x-cp50227, Chinese Simplified (ISO-2022)
51932: euc-jp, Japanese (EUC)
51936: EUC-CN, Chinese Simplified (EUC)
51949: euc-kr, Korean (EUC)
52936: hz-gb-2312, Chinese Simplified (HZ)
54936: GB18030, Chinese Simplified (GB18030)
57002: x-iscii-de, ISCII Devanagari
57003: x-iscii-be, ISCII Bengali
57004: x-iscii-ta, ISCII Tamil
57005: x-iscii-te, ISCII Telugu
57006: x-iscii-as, ISCII Assamese
57007: x-iscii-or, ISCII Oriya
57008: x-iscii-ka, ISCII Kannada
57009: x-iscii-ma, ISCII Malayalam
57010: x-iscii-gu, ISCII Gujarati
57011: x-iscii-pa, ISCII Punjabi
65000: utf-7, Unicode (UTF-7)
65001: utf-8, Unicode (UTF-8)
To specify an encoding type when writing a file, use an overloaded StreamWriter constructor that accepts an Encoding object. For example, the following code creates several files with different encoding types:
using System;
using System.IO;
using System.Text;

public sealed class Program {
    public static void Main() {
        StreamWriter swUtf7 = new StreamWriter("utf7.txt", false, Encoding.UTF7);
        swUtf7.WriteLine("Hello, World!");
        swUtf7.Close();

        StreamWriter swUtf8 = new StreamWriter("utf8.txt", false, Encoding.UTF8);
        swUtf8.WriteLine("Hello, World!");
        swUtf8.Close();

        StreamWriter swUtf16 = new StreamWriter("utf16.txt", false, Encoding.Unicode);
        swUtf16.WriteLine("Hello, World!");
        swUtf16.Close();

        StreamWriter swUtf32 = new StreamWriter("utf32.txt", false, Encoding.UTF32);
        swUtf32.WriteLine("Hello, World!");
        swUtf32.Close();
    }
}
No output will appear on the console, but if you run the MS-DOS 'dir' command you will find a group of text files named "utf7.txt", "utf8.txt", "utf16.txt", and "utf32.txt". Take particular notice of the utf32.txt file: opened in an editor that does not understand UTF-32, it appears as "H e l l o ,   W o r l d !" because of the zero bytes between characters. The utf16.txt file, by contrast, displays simply as "Hello, World!". Open up the remaining files and take a look.
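To read such files back correctly, StreamReader can detect the encoding from the byte order mark (BOM) that StreamWriter emits at the start of the file. Here is a small sketch of my own (the file name is arbitrary):

```csharp
using System;
using System.IO;
using System.Text;

public static class RoundTrip
{
    public static void Main()
    {
        // Write a UTF-16 (little-endian) file; StreamWriter prepends a BOM.
        using (StreamWriter sw = new StreamWriter("demo16.txt", false, Encoding.Unicode))
            sw.WriteLine("Hello, World!");

        // Passing true for detectEncodingFromByteOrderMarks lets the reader
        // switch encodings based on the BOM it finds in the stream.
        using (StreamReader sr = new StreamReader("demo16.txt", Encoding.UTF8, true))
        {
            Console.WriteLine(sr.ReadLine());              // Hello, World!
            Console.WriteLine(sr.CurrentEncoding.WebName); // utf-16
        }
    }
}
```

Note that CurrentEncoding reports the detected encoding only after the first read has occurred.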
Encodings: Converting Between Characters and Bytes
In Win32, programmers have to write code to convert between Unicode characters and strings and Multi-Byte Character Set (MBCS) characters and strings. If you have ever worked with C++ and MFC, you have undoubtedly changed a project's properties between Unicode and Multi-Byte Character Set. Encodings allow a managed application to interact with strings created by non-Unicode systems. For example, if you want to produce a file readable by an application running on a Japanese version of Windows 95, you have to save the Unicode text by using the Shift-JIS (code page 932) encoding. Likewise, you'd have to use the Shift-JIS encoding to read a file produced on a Japanese Windows 95 system into the CLR.
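As a rough illustration (my own sketch; the sample text is arbitrary), the same Japanese string occupies different numbers of bytes under different encodings. Shift-JIS itself would store each of these characters in 2 bytes, but obtaining it requires Encoding.GetEncoding(932), which on modern .NET also requires registering a code-pages encoding provider, so this sketch sticks to the Unicode encodings that are always available:

```csharp
using System;
using System.Text;

public static class JapaneseBytes
{
    public static void Main()
    {
        string konnichiwa = "\u3053\u3093\u306B\u3061\u306F"; // こんにちは (5 characters)

        // UTF-16: 2 bytes per character for these code points.
        Console.WriteLine(Encoding.Unicode.GetByteCount(konnichiwa)); // 10

        // UTF-8: these code points fall in the 3-byte range (0x0800 and above).
        Console.WriteLine(Encoding.UTF8.GetByteCount(konnichiwa));    // 15

        // Both round-trip losslessly.
        byte[] utf8 = Encoding.UTF8.GetBytes(konnichiwa);
        Console.WriteLine(Encoding.UTF8.GetString(utf8) == konnichiwa); // True
    }
}
```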
References
- Professional .NET Framework 2.0, written by Joe Duffy
- CLR via C#, written by Jeffrey Richter