See more: C# ascii
I have an ASCII text file that contains valid 8-bit character codes.
How do I read this file and have the 8-bit chars translated into valid
Unicode? I know I could UTF-8 encode the file, or read the bytes
and then encode them. But this all assumes that I know about the 8-bit
codes beforehand.
 
Is there any method that will read the file and automatically do the
conversion?
 
James Johnson
Posted 29-Jan-12 10:31am
WBurgMo3.1K

Solution 1

Simply read with
 
// Note: Encoding.ASCII is strictly 7-bit; bytes above 0x7F are decoded as '?'
System.IO.StreamReader reader =
   new System.IO.StreamReader(fileName, System.Text.Encoding.ASCII);
or, more universally, auto-detect the encoding from the BOM (without a BOM, this constructor falls back to UTF-8):
System.IO.StreamReader reader =
   new System.IO.StreamReader(fileName, true);
It will give you Unicode string(s) based on your ASCII data. In principle, this is all you need. You can write it back with
bool appendOrNot = false; // false overwrites the file, true appends to it
System.IO.StreamWriter writer =
   new System.IO.StreamWriter(fileName, appendOrNot, System.Text.Encoding.UTF8);
 
As your text data is, generally speaking, always Unicode in memory, prefer one of the Unicode UTFs for output. The only text encoding supported internally for character and string data is UTF-16. All other encodings are supported only for persistence; in memory they are represented as arrays of bytes, with no regard to character boundaries, which can vary: in UTF-8 a character takes 1 to 4 bytes; in UTF-16, one or two 16-bit words (two words form a surrogate pair); in UTF-32, always one 32-bit word. Please see the last two links below.
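The size differences described above can be checked with a small sketch. The class and method names (EncodingSizes, ByteCount) are made up for illustration; only Encoding.GetByteCount and the standard Encoding properties come from the framework.

```csharp
using System;
using System.Text;

public static class EncodingSizes
{
    // Number of bytes the text occupies when persisted in the given encoding
    // (the BOM, if any, is not counted).
    public static int ByteCount(string text, Encoding encoding)
    {
        return encoding.GetByteCount(text);
    }

    public static void Main()
    {
        // One character each: Latin letter, accented letter, euro sign,
        // and a character outside the Basic Multilingual Plane.
        string[] samples = { "A", "\u00E9", "\u20AC", "\U0001F600" };
        foreach (string s in samples)
        {
            Console.WriteLine(
                "U+{0:X4}: UTF-8 {1}, UTF-16 {2}, UTF-32 {3} bytes; {4} char(s) in memory",
                char.ConvertToUtf32(s, 0),
                ByteCount(s, Encoding.UTF8),
                ByteCount(s, Encoding.Unicode),
                ByteCount(s, Encoding.UTF32),
                s.Length);
        }
    }
}
```

The last sample takes two char values in memory, which is the surrogate pair mentioned above.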
 
Please see:
 
http://msdn.microsoft.com/en-us/library/system.io.streamreader.aspx,
http://msdn.microsoft.com/en-us/library/system.io.streamwriter.aspx,
http://msdn.microsoft.com/en-us/library/f5f5x7kt.aspx,
http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx.
 
You also need to understand how Unicode and BOM work:
http://unicode.org/,
http://unicode.org/faq/utf_bom.html.
 
BOM (or its absence) is used for auto-detection of encoding mentioned above.
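As a rough illustration of that detection, the sketch below feeds byte arrays with and without a BOM to a StreamReader and reports which encoding it settled on. BomDemo, DetectedEncodingName and WithBom are hypothetical helper names; the sketch assumes the reader's documented behavior of defaulting to UTF-8 when no BOM is found.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class BomDemo
{
    // Reads the data with BOM detection enabled and returns the name of the
    // encoding the reader ended up using.
    public static string DetectedEncodingName(byte[] data)
    {
        using (var reader = new StreamReader(new MemoryStream(data), true))
        {
            reader.ReadToEnd(); // CurrentEncoding is only updated after the first read
            return reader.CurrentEncoding.WebName;
        }
    }

    // Encodes the text and prepends the encoding's BOM (preamble).
    public static byte[] WithBom(Encoding encoding, string text)
    {
        return encoding.GetPreamble().Concat(encoding.GetBytes(text)).ToArray();
    }

    public static void Main()
    {
        // UTF-16 little-endian with BOM: the BOM identifies the encoding.
        Console.WriteLine(DetectedEncodingName(WithBom(Encoding.Unicode, "hi")));

        // No BOM at all: the reader stays with its UTF-8 default.
        Console.WriteLine(DetectedEncodingName(Encoding.UTF8.GetBytes("hi")));
    }
}
```

Note that a file in an 8-bit code page carries no BOM, so this mechanism cannot recognize it; the reader would simply try to decode it as UTF-8.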
 
[EDIT]
 
Apparently, specifying the encoding explicitly instead of relying on auto-detection by BOM is needed only in one case: the encoding is some Unicode UTF and you know which one, but the BOM is not present. Such things happen. This is also explained in the last Unicode article referenced above.
 
—SA
Comments
Andreas Gieriet at 29-Jan-12 22:42pm
   
You assume 7-bit ASCII here. As WBurgMo refers to "8-bit ASCII" (which does not exist), he must give the code page of the encoding. E.g. in your code, a small modification is needed for that:
 
int codepage = ...; // e.g. 1250 for windows-1250, or 28592 for iso-8859-2
System.IO.StreamReader reader =
   new System.IO.StreamReader(fileName, System.Text.Encoding.GetEncoding(codepage));

 
See also my solution.
 
Cheers
 
Andi

Solution 2

Hello WBurgMo,
 
there is no such thing as a "valid 8-bit ASCII code" (see
http://en.wikipedia.org/wiki/ASCII).
 
If you have plain ASCII 7-bit text, you may use the ASCIIEncoding to read the data. If you have some 8-bit extension of the ASCII 7-bit encoding, you must specify the code page as described in http://msdn.microsoft.com/en-us/library/system.text.encoding.aspx (see the static GetEncoding method that takes the code page as argument).
 
Note: you must supply the code page from outside, i.e. there is no reliable way to deduce from the 8-bit ASCII-extended text itself which code page it uses.
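A minimal sketch of that approach, assuming the caller already knows the code page: CodePageReader, Decode and ConvertToUtf8 are hypothetical names, and code page 28591 (ISO-8859-1) is used only as an example. On .NET Core and later, code pages such as 1250 are, as far as I know, only available after registering CodePagesEncodingProvider from the System.Text.Encoding.CodePages package; on the .NET Framework they work out of the box.

```csharp
using System;
using System.IO;
using System.Text;

public static class CodePageReader
{
    // Decodes raw 8-bit data using the code page supplied by the caller.
    public static string Decode(byte[] data, int codePage)
    {
        return Encoding.GetEncoding(codePage).GetString(data);
    }

    // Re-saves a file of known code page as UTF-8 (both paths are up to the caller).
    public static void ConvertToUtf8(string inputPath, int codePage, string outputPath)
    {
        string text = File.ReadAllText(inputPath, Encoding.GetEncoding(codePage));
        File.WriteAllText(outputPath, text, Encoding.UTF8);
    }

    public static void Main()
    {
        byte[] data = { 0x63, 0x61, 0x66, 0xE9 }; // "café" in ISO-8859-1

        // With the right code page, 0xE9 becomes U+00E9.
        Console.WriteLine(Decode(data, 28591) == "caf\u00E9");

        // With 7-bit ASCII, the byte above 0x7F is replaced by '?'.
        Console.WriteLine(Encoding.ASCII.GetString(data));
    }
}
```

Decoding the same bytes with a different code page would silently yield a different character for 0xE9, which is why the code page has to come from outside.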
 
Cheers
 
Andi

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Web04 | 2.8.140926.1 | Last Updated 29 Jan 2012
Copyright © CodeProject, 1999-2014