I'm seriously struggling to parse a line from a UTF-8 file into an array of strings.
My file has content:
Gȯrecki   John
12345678901234

The first name "John" should start at zero-based index 10. (The 2nd character here is Unicode U+022F.)
In code I need to do
VB
LineRead.Substring(11,4)
to get "John", where it should be
VB
LineRead.Substring(10,4)

with normal characters.

My question is of course how to detect that I need to do a Substring of 11 instead of 10, in this case?
I tried things like
VB
If Not System.Text.Encoding.UTF8.GetCharCount(System.Text.Encoding.UTF8.GetBytes(LineRead)) = System.Text.Encoding.UTF8.GetByteCount(LineRead) Then 
but that's also the case for "à" which counts as only 1 in String.Length but has 2 bytes in UTF-8...

How do I handle common cases like this?
How can I avoid splitting the bytes of one character into several wrong characters, so that I can step through the string character by character and count them?
Thanks in advance!
Updated 11-May-12 1:01am
Comments
phil.o 11-May-12 6:38am    
Not really clear: in your first code region you seem to start the index at 1. Indexes start at 0, which gives you the correct index 10 for "John".
Moreover, the character count should not differ depending on whether the string contains special characters.
0ddball 11-May-12 7:08am    
The numbers in the first code region are just a reference; I know the index starts at 0 ;)
If you count the characters in the first word, it's obviously 7, but String.Length gives you 8 for that word. Also, System.Text.Encoding.UTF8.GetBytes(LineRead) gives you 9 bytes for that word, and System.Text.Encoding.UTF8.GetChars(System.Text.Encoding.UTF8.GetBytes(LineRead)) returns a character array of 8 elements for that word, not 7...
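The counts reported above (7 visible characters, String.Length of 8, 9 UTF-8 bytes) are what you would see if the dotted o is stored as a plain "o" followed by the combining mark U+0307, rather than as the single code point U+022F. That is an assumption about the file, not something confirmed in the thread. A minimal sketch comparing the two forms (the module name is made up for illustration):

```vb
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module CountDemo
    Sub Main()
        ' Assumption: "o" followed by U+0307 (combining dot above).
        Dim decomposed As String = "Go" & ChrW(&H307) & "recki"
        Console.WriteLine(decomposed.Length)                       ' 8 UTF-16 code units
        Console.WriteLine(Encoding.UTF8.GetByteCount(decomposed))  ' 9 bytes

        ' For comparison: the single precomposed code point U+022F.
        Dim composed As String = "G" & ChrW(&H22F) & "recki"
        Console.WriteLine(composed.Length)                         ' 7 UTF-16 code units
        Console.WriteLine(Encoding.UTF8.GetByteCount(composed))    ' 8 bytes
    End Sub
End Module
```

Only the decomposed form matches the 8 and 9 quoted in the comment, which would explain why Substring needs 11 instead of 10.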

After hours of looking, and finally desperately posting it here, I think I found the solution at http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo.aspx
Thanks anyway though!
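For readers landing here, this is roughly how the StringInfo class from that link can be applied to the sample line. It is a sketch, assuming the dotted o is stored as "o" plus the combining mark U+0307; the module name is hypothetical.

```vb
Imports System
Imports System.Globalization
Imports Microsoft.VisualBasic

Module TextElementDemo
    Sub Main()
        ' Assumption: the second letter is "o" + U+0307 (combining dot above).
        Dim lineRead As String = "Go" & ChrW(&H307) & "recki   John"

        ' StringInfo counts text elements (what a reader sees as one character).
        Dim info As New StringInfo(lineRead)
        Console.WriteLine(info.LengthInTextElements)            ' 14
        Console.WriteLine(info.SubstringByTextElements(10, 4))  ' John

        ' Plain Substring counts UTF-16 code units, so it is off by one here.
        Console.WriteLine(lineRead.Substring(10, 4))            ' " Joh"
    End Sub
End Module
```

With SubstringByTextElements the column positions from the fixed-width layout can be used unchanged, whether or not a line contains combining marks.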
 
 
Comments
phil.o 11-May-12 8:07am    
Thanks for posting what you found.
Well, it sounds like you may already have a solution, but here are my thoughts, for what it's worth.

In .NET, strings are stored as a sequence of 2-byte UTF-16 code units, the idea being that you can fit just about any letter into 16 bits. As such, non-ASCII characters in your string should not normally make any difference: indexing will work as expected.

It sounds like, when you read the file, .NET is interpreting it with a single-byte encoding such as ASCII and thus creating a new 2-byte character for every single byte in the file. That is fine unless you have a double-byte UTF-8 character: it will interpret this as two characters, create 4 bytes, and indexing will be out by 1.

The fact that your indexing is out suggests that this is the case; if so, the surname will also come out garbled.

Text files can begin with a short header (the byte order mark) denoting the file encoding, but it is often missing, and the reader then falls back to a default that depends on the API you use (StreamReader defaults to UTF-8; other APIs may differ). So, when you load the file (using StreamReader or whatever), explicitly tell it that the file is UTF-8 encoded. If the file was being misread, the issue should then just disappear.
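As a rough illustration of that advice, the sketch below writes a sample line as UTF-8 and reads it back twice: once with the encoding stated explicitly, and once with a deliberately wrong single-byte encoding. The file name and module name are made up for the example, and the sample uses the single code point U+022F.

```vb
Imports System
Imports System.IO
Imports System.Text
Imports Microsoft.VisualBasic

Module ReadUtf8Demo
    Sub Main()
        ' Hypothetical temp file, written as UTF-8 without a byte order mark.
        Dim filePath As String = Path.Combine(Path.GetTempPath(), "names-utf8.txt")
        File.WriteAllText(filePath, "G" & ChrW(&H22F) & "recki   John", New UTF8Encoding(False))

        ' Explicit UTF-8: the two bytes of U+022F become one character.
        Using reader As New StreamReader(filePath, Encoding.UTF8)
            Console.WriteLine(reader.ReadLine().Substring(10, 4))  ' John
        End Using

        ' Wrong single-byte encoding: each byte becomes its own character.
        Using reader As New StreamReader(filePath, Encoding.GetEncoding("iso-8859-1"))
            Console.WriteLine(reader.ReadLine().Substring(10, 4))  ' " Joh"
        End Using

        File.Delete(filePath)
    End Sub
End Module
```

This only covers the misread-encoding case described in this answer. If the file stores the letter as a base character plus a combining mark, decoding as UTF-8 is still correct but the index shifts anyway.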
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


