Character Encoding and Usage in .NET

7 Feb 2014
Understanding character encoding with the help of simple examples

Introduction 

We rely on character encoding whenever we work on a computer: programming, browsing, reading, writing, watching subtitles in a movie, and so on. How do computers store the text we write? This article answers that question and then shows, with the help of examples, how .NET supports us when we read or write text.

Character Encoding

Computers only understand binary, that is, ‘1’ and ‘0’. They have no notion of the character ‘A’ or the character ‘B’. We therefore convert each character into a binary number, and that conversion is called character encoding. There are many schemes for the conversion, such as ASCII, Unicode and EBCDIC. The reverse process, converting a binary representation back into a character that people can read, is called character decoding.

To illustrate, consider ASCII. In ASCII, a character is represented in binary using 7 bits. For example, the character ‘A’ is 65 in decimal and 100 0001 in binary. The total number of characters that ASCII can represent is therefore 2^7 = 128. That is not enough to cover the accented letters of French and German, let alone Chinese and many other languages. Other encoding schemes have therefore been developed to support a wide range of languages, and one of them is Unicode.
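The 7-bit representation described above can be checked with a few lines of C#. This is only a small sketch: the class name AsciiBits and the helper ToSevenBitBinary are illustrative names, not part of .NET.

```csharp
using System;

public static class AsciiBits
{
    // Returns the 7-bit binary string for an ASCII character.
    // Hypothetical helper written for this illustration.
    public static string ToSevenBitBinary(char c)
    {
        if (c > 127)
            throw new ArgumentOutOfRangeException("c", "Not an ASCII character.");

        // Convert the code to base 2 and pad to the 7 bits that ASCII uses.
        return Convert.ToString((int)c, 2).PadLeft(7, '0');
    }

    public static void Main()
    {
        foreach (char c in "AB")
        {
            // Prints the character, its decimal code and its 7-bit pattern.
            Console.WriteLine("{0} = {1} = {2}", c, (int)c, ToSevenBitBinary(c));
        }
    }
}
```

For ‘A’ the helper returns 1000001, which matches the value 65 given in the text.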

Advantages

One benefit of using an encoding scheme is that it separates styling from the character representation itself. Different fonts draw the same character differently: the encoding says which character it is, the font says how that character looks, and the text is rendered from the two together. This makes encoding schemes portable, since styling is left entirely to fonts.

Another benefit of an encoding scheme is that data can be shared between different operating systems and platforms. Text written on one operating system can be viewed on another, provided that the other operating system supports the encoding scheme.

Examples

Example No. 1

The following example shows how the characters ‘A’ and ‘B’ are stored as bytes. They are represented as 65 and 66 in decimal respectively. The program prints the decimal codes 65, 65, 66, 66, one per line.

C#
using System;
using System.Text;

// Create the encoding scheme.
Encoding ascii = Encoding.ASCII;

string aStr = "AABB";

// Convert the string to its ASCII byte values.
byte[] dataBytes = ascii.GetBytes(aStr);

foreach (byte b in dataBytes)
{
    Console.WriteLine("{0}", b);
}

Example No. 2

The following code shows the range of ASCII. A byte value greater than 127 cannot be decoded, because ASCII uses only 7 bits and the highest decimal number that 7 bits can represent is 127. In .NET, when Encoding.ASCII cannot decode a byte, it substitutes the question mark character (‘?’). In the following example, decoding the values 255 (0xFF) and 128 (0x80) therefore produces question marks. The output of the program is: ?,?,A, (each character is followed by a comma).

C#
using System;
using System.Text;

Encoding ascii = Encoding.ASCII;

// 0xFF and 0x80 are outside the 7-bit ASCII range; 0x41 is 'A'.
char[] result = ascii.GetChars(new byte[] { 0xFF, 0x80, 0x41 }, 0, 3);
foreach (char c in result)
{
    Console.Write("{0},", c);
}

Console.ReadKey();

The code can instead be written to throw an exception when a byte cannot be decoded.
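One way to get an exception rather than a question mark is to request the ASCII encoding with exception fallbacks. The sketch below assumes the Encoding.GetEncoding overload that accepts an EncoderFallback and a DecoderFallback; the class name StrictAscii and the method TryDecode are illustrative names chosen for this example.

```csharp
using System;
using System.Text;

public static class StrictAscii
{
    // Returns true when every byte is valid ASCII, false otherwise.
    // Hypothetical helper written for this illustration.
    public static bool TryDecode(byte[] data, out string text)
    {
        // Same ASCII encoding, but invalid input throws instead of becoming '?'.
        Encoding strict = Encoding.GetEncoding(
            "us-ascii",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            text = strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            // At least one byte was outside the range 0 to 127.
            text = string.Empty;
            return false;
        }
    }

    public static void Main()
    {
        string text;
        bool ok = TryDecode(new byte[] { 0xFF, 0x80, 0x41 }, out text);
        Console.WriteLine(ok ? text : "Input is not valid ASCII.");
    }
}
```

Catching DecoderFallbackException keeps the decision about invalid input in the calling code instead of silently losing data.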

Example No. 3

In Notepad, type the characters “AABBCC”. Then open the File menu and select ‘Save As’. Type a file name. Below the ‘Save as type’ option, you will see an option named Encoding. Leave the Encoding option set to ‘ANSI’ and save the file. Now open the saved file with a hex editor such as Hex Workshop or UltraEdit; you will see the following bytes: 41 41 42 42 43 43. Here 41 is the hexadecimal code for ‘A’, and the codes for the other characters follow in the same way. This is how text is stored on the computer.

Unicode Encoding

We use Unicode because it supports a large number of characters. Other encodings also support large character sets, but most organizations have adopted Unicode and it is widely supported. At the time of writing, Unicode defines more than 110,000 characters covering many languages. Hence Unicode has become the popular choice for encoding.

Unicode assigns each character a numeric code point in the range U+0000 to U+10FFFF, which leaves room for far more characters than ASCII. On computers, these code points are stored as sequences of 8-bit, 16-bit or 32-bit integers. These are called the encoding forms, and there are three of them: UTF-8, UTF-16 and UTF-32, built on 8-bit, 16-bit and 32-bit integers respectively. Here UTF stands for Unicode Transformation Format. Each Unicode transformation format supports all the characters in the Unicode standard.

UTF-8 uses 8 to 32 bits, that is, 1 to 4 bytes, for each character in the Unicode standard. If a character lies at the very start of the range, from 0 to 127, UTF-8 uses only 1 byte (8 bits) to represent it; otherwise it uses more than 1 byte. UTF-8 is also compatible with ASCII: for the codes 00 to 7F (0 to 127), the bytes are identical in both encodings.
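The variable width of UTF-8 can be observed with Encoding.UTF8.GetByteCount. The following sketch uses four sample characters chosen for illustration (they do not come from the article), and the class name Utf8Width is likewise an assumption.

```csharp
using System;
using System.Text;

public static class Utf8Width
{
    // Number of bytes UTF-8 needs for the given text.
    public static int ByteCount(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    public static void Main()
    {
        // Sample characters: U+0041 (A), U+00E9 (e with acute accent),
        // U+20AC (euro sign) and U+1F600 (an emoji outside the 16-bit range).
        string[] samples = { "A", "\u00E9", "\u20AC", "\U0001F600" };
        foreach (string s in samples)
        {
            // Print the code point rather than the character, so the output
            // does not depend on the console's own encoding.
            Console.WriteLine("U+{0:X4}: {1} byte(s)",
                char.ConvertToUtf32(s, 0), ByteCount(s));
        }
    }
}
```

The four samples need 1, 2, 3 and 4 bytes respectively, which covers the whole range of lengths that UTF-8 uses.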

Examples

Example No. 1

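The following code shows the byte order used by Encoding.Unicode, which is UTF-16 little endian, that is, least significant byte first. It also shows that UTF-16 uses 2 bytes for each of these characters (characters outside the Basic Multilingual Plane take 4 bytes). The code prints A,C, (each character is followed by a comma).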

C#
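using System;
using System.Text;

Encoding utf16 = Encoding.Unicode;

// Each character is two bytes, low byte first: 41 00 is 'A', 43 00 is 'C'.
char[] result = utf16.GetChars(new byte[] { 0x41, 0x00, 0x43, 0x00 }, 0, 4);
foreach (char c in result)
{
    Console.Write("{0},", c);
}

Console.ReadKey();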

Example No. 2

The following code uses 1 byte (8 bits) to represent each character. The output of the code is A,G,C,G, (each character is followed by a comma).

C#
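using System;
using System.Text;

Encoding utf8 = Encoding.UTF8;

// All four bytes are in the ASCII range, so each one decodes to one character.
char[] result = utf8.GetChars(new byte[] { 0x41, 0x47, 0x43, 0x47 }, 0, 4);
foreach (char c in result)
{
    Console.Write("{0},", c);
}

Console.ReadKey();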

Notice that in this example 4 bytes represent 4 characters, whereas in the previous example 4 bytes represent only 2 characters. For text that is mostly in the ASCII range, UTF-8 is therefore the more compact format, and I took advantage of this in one of my projects. I had to write text files of about 1 GB for a single user test in my organization. At the time, I did not know that I was writing the text as UTF-16. Once I learned about encoding schemes, I switched the file writing from UTF-16 to UTF-8, which significantly reduced the storage requirement: I now need only about 0.5 GB per user test.
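The size difference can be measured before anything is written to disk. The sketch below compares the byte counts of the two encodings; the class name EncodingSize and the Chinese sample string are illustrative assumptions, included to show that the saving depends on the kind of text.

```csharp
using System;
using System.Text;

public static class EncodingSize
{
    // Bytes needed to store the text as UTF-8.
    public static int Utf8Bytes(string s)
    {
        return Encoding.UTF8.GetByteCount(s);
    }

    // Bytes needed to store the text as UTF-16 (Encoding.Unicode).
    public static int Utf16Bytes(string s)
    {
        return Encoding.Unicode.GetByteCount(s);
    }

    public static void Main()
    {
        // ASCII-range text: UTF-8 needs half the space of UTF-16.
        Console.WriteLine("AGCG: UTF-8 {0}, UTF-16 {1}",
            Utf8Bytes("AGCG"), Utf16Bytes("AGCG"));

        // Two Chinese characters (U+4E2D, U+6587): here UTF-16 is smaller.
        string cjk = "\u4E2D\u6587";
        Console.WriteLine("CJK sample: UTF-8 {0}, UTF-16 {1}",
            Utf8Bytes(cjk), Utf16Bytes(cjk));
    }
}
```

The halving reported above holds for ASCII-range text; for text dominated by characters that take 3 bytes in UTF-8, UTF-16 can be the smaller format.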

Example No. 3

To illustrate further, open Notepad and type the text “mytext”, then open the File menu and choose Save As. In the dialog box, select Unicode as the encoding and save the file under any name. Now open the same file with Hex Workshop or UltraEdit; you will see these bytes:

FF FE 6D 00 79 00 74 00 65 00 78 00 74 00

The first two bytes (FF FE) are a marker specifying that the text that follows is encoded as UTF-16 little endian, which Notepad calls Unicode. These starting bytes are known as the Byte Order Mark (BOM). After them, each pair of bytes is the UTF-16 code unit for one character: ‘m’ (6D 00), then ‘y’ (79 00), and so on. You can see that each of these characters requires 2 bytes in UTF-16. Therefore, if you want to archive or write a large data file of mostly ASCII-range text, UTF-8 is the better choice. UTF-16 and UTF-32 have their own advantages in terms of performance. See the Further Readings section for the advantages and disadvantages of UTF-16 and UTF-32.
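The same bytes can be produced in code, without Notepad or a hex editor. This is a minimal sketch that joins the preamble of Encoding.Unicode with the encoded text; the class name Utf16Dump and the helper HexWithBom are illustrative names.

```csharp
using System;
using System.Text;

public static class Utf16Dump
{
    // Returns the BOM followed by the UTF-16 little endian bytes of the text,
    // formatted as space-separated hex. Hypothetical helper for illustration.
    public static string HexWithBom(string text)
    {
        Encoding utf16 = Encoding.Unicode;    // UTF-16 little endian
        byte[] bom = utf16.GetPreamble();     // the byte order mark
        byte[] body = utf16.GetBytes(text);   // two bytes per character here

        // Join BOM and text bytes into one array, as they appear in the file.
        byte[] all = new byte[bom.Length + body.Length];
        Buffer.BlockCopy(bom, 0, all, 0, bom.Length);
        Buffer.BlockCopy(body, 0, all, bom.Length, body.Length);

        // BitConverter separates bytes with '-', so swap that for a space.
        return BitConverter.ToString(all).Replace("-", " ");
    }

    public static void Main()
    {
        Console.WriteLine(HexWithBom("mytext"));
    }
}
```

Note that GetBytes alone never includes the BOM; it is a separate preamble that file writers such as StreamWriter add at the start of the file.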

Example No. 4

The following code shows how to write in different encoding schemes:

C#
using System.IO;
using System.Text;

// Create (or overwrite) the file and write the text as ASCII.
using (FileStream fs = new FileStream(@"F:\Encoding\aTest3.me", FileMode.Create))
using (StreamWriter sw = new StreamWriter(fs, Encoding.ASCII))
{
    sw.Write("thisismytext");
}

By default, StreamWriter writes text as UTF-8. Other encodings, such as UTF-16 or UTF-32, can be specified by passing them as an argument to the StreamWriter constructor.

StreamWriter writes the BOM itself, depending on the encoding you have selected. The BOM bytes are as follows:

  • UTF-8 EF BB BF
  • UTF-16 Big Endian FE FF
  • UTF-16 Little Endian FF FE
  • UTF-32 Big Endian 00 00 FE FF
  • UTF-32 Little Endian FF FE 00 00
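The list above can be cross-checked against .NET itself, because every Encoding exposes its BOM through GetPreamble. The sketch below is illustrative; the class name BomTable and the helper BomHex are assumptions, and the big endian UTF-32 encoding is created with the UTF32Encoding(bigEndian, byteOrderMark) constructor.

```csharp
using System;
using System.Text;

public static class BomTable
{
    // Formats the preamble (BOM) of an encoding as space-separated hex.
    public static string BomHex(Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetPreamble()).Replace("-", " ");
    }

    public static void Main()
    {
        Console.WriteLine("UTF-8                {0}", BomHex(Encoding.UTF8));
        Console.WriteLine("UTF-16 Big Endian    {0}", BomHex(Encoding.BigEndianUnicode));
        Console.WriteLine("UTF-16 Little Endian {0}", BomHex(Encoding.Unicode));
        Console.WriteLine("UTF-32 Big Endian    {0}", BomHex(new UTF32Encoding(true, true)));
        Console.WriteLine("UTF-32 Little Endian {0}", BomHex(Encoding.UTF32));
    }
}
```

Encoding.ASCII has an empty preamble, so a file written as ASCII starts directly with the text bytes.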

Conclusion

This article is a basic introduction, sufficient to jump-start understanding and using different encodings when you read and write text in your programs. With the help of examples, I have tried to convey what encoding is, which encodings are available, and how to use them.

Further Readings

  1. Unicode Standard
  2. MSDN Page for Encoding Class

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Muhammad Umair
Software Developer (Senior)
Pakistan
