See more: C# VB utf8
I'm seriously struggling to parse a line from a UTF-8 file into an array of strings.
My file has content:
Gȯrecki   John
12345678901234
The first name "John" starts at the 10th position. (Here the 2nd character is ȯ, Unicode code point U+022F.)
In code I need to do
LineRead.Substring(11,4)
to get "John", where it should be
LineRead.Substring(10,4)
with normal characters.
 
My question is of course how to detect that I need to do a Substring of 11 instead of 10, in this case?
I tried things like
If Not System.Text.Encoding.UTF8.GetCharCount(System.Text.Encoding.UTF8.GetBytes(LineRead)) = System.Text.Encoding.UTF8.GetByteCount(LineRead) Then 
but that's also the case for "à" which counts as only 1 in String.Length but has 2 bytes in UTF-8...
 
How to handle common cases like this?
How to prevent splitting up bytes of 1 character into several wrong characters? That way I could progress through the string character by character and count them?
Thanks in advance!
Posted 11-May-12 1:34am
0ddball195
Edited 11-May-12 2:01am
v2
Comments
phil.o at 11-May-12 6:38am
   
Not really clear; in your first code region you seem to start the index at 1. Indexes start at 0, which gives you the correct index 10 for "John".
Moreover, the character count should not differ depending on whether the string contains special characters.
0ddball at 11-May-12 7:08am
   
The numbers in the first code region are just a reference; I know the index starts at 0 ;)
If you count the characters in the first word it's obviously 7, but String.Length will give you 8 for that word. Also, System.Text.Encoding.UTF8.GetBytes(LineRead) will give you 9 bytes for that word, and System.Text.Encoding.UTF8.GetChars(System.Text.Encoding.UTF8.GetBytes(LineRead)) will return a character array of 8 elements for that word, not 7...
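
The three counts in this comment can be reproduced with a short sketch. It assumes the surname is stored as a plain "o" followed by the combining dot above (U+0307) rather than as the precomposed U+022F, since that is the only form for which 7 visible characters, a String.Length of 8 and 9 UTF-8 bytes all add up. The module and variable names are illustrative only.

```vb
Imports System
Imports System.Globalization
Imports System.Text

Module CountDemo
    Sub Main()
        ' Assumption: "o" + U+0307 (combining dot above), not the precomposed U+022F.
        Dim surname As String = "Go" & ChrW(&H307) & "recki"

        ' UTF-16 code units (what String.Length counts): 8
        Console.WriteLine(surname.Length)

        ' Bytes once encoded as UTF-8 (U+0307 takes two bytes): 9
        Console.WriteLine(Encoding.UTF8.GetByteCount(surname))

        ' Text elements, i.e. what a reader perceives as characters: 7
        Console.WriteLine(New StringInfo(surname).LengthInTextElements)
    End Sub
End Module
```

Neither the Char count nor the byte count matches the visible character count here; only the text-element count does.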

Solution 2

After hours of looking, and finally desperately posting it on here, I think I found the solution at http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo.aspx
Thanks anyway though!
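
For readers landing here, a minimal sketch of how the StringInfo class from that link might be applied to the line in the question. It assumes the line contains "o" followed by the combining mark U+0307, so that "John" sits at Char index 11 but at text-element index 10. The module and variable names are illustrative and not taken from the original post.

```vb
Imports System
Imports System.Globalization

Module TextElementDemo
    Sub Main()
        ' Assumption: "o" + U+0307, so "John" starts at Char index 11.
        Dim lineRead As String = "Go" & ChrW(&H307) & "recki   John"

        ' Index and length are both counted in text elements here.
        Dim info As New StringInfo(lineRead)
        Console.WriteLine(info.SubstringByTextElements(10, 4))   ' John

        ' Alternative: look up the Char index at which each text element starts.
        ' The length passed to Substring is still counted in Chars.
        Dim starts As Integer() = StringInfo.ParseCombiningCharacters(lineRead)
        Console.WriteLine(lineRead.Substring(starts(10), 4))     ' John
    End Sub
End Module
```

SubstringByTextElements is probably the simpler choice for fixed-width columns, because both the start and the length are measured in visible characters. StringInfo.GetTextElementEnumerator can be used in the same spirit to walk the string one text element at a time.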
Comments
phil.o at 11-May-12 8:07am
   
Thanks for posting what you found.

Solution 3

Well, it sounds like you may already have a solution, but here are my thoughts, for what it's worth.
 
In .NET, strings are represented as a sequence of 2-byte (UTF-16) Unicode characters, the idea being that you can fit just about any set of letters into 16 bits. As such, it shouldn't make any difference if you have non-ASCII characters in your string; indexing will work as expected, at least for characters that fit in a single 16-bit Char (combining sequences and surrogate pairs still take more than one).
 
It sounds like, when you are reading the file, .NET is interpreting it as ASCII and thus creating a new 2-byte character for every single byte in the file. That is fine unless you have a double-byte UTF-8 character: it will interpret this as two characters and create 4 bytes, and indexing will be out by 1.
 
The fact that your indexing is out suggests that this is the case; if so, the surname will also come out wrong.
 
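I believe that in text files, right at the front, there is usually a little header (the byte order mark) denoting the file encoding, but if this is missing the reader has to fall back on a default (UTF-8 for StreamReader, though other readers or an explicitly passed encoding may differ). So, when you load the file (using StreamReader or whatever), you want to explicitly inform it that the file is UTF-8 encoded. If the file was being decoded with the wrong encoding, the issue should then just disappear.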
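
Passing the encoding explicitly, as suggested above, might look like the following sketch. The file name is hypothetical and the module and variable names are illustrative only.

```vb
Imports System
Imports System.IO
Imports System.Text

Module ReadUtf8Demo
    Sub Main()
        ' Hypothetical file name, for illustration only.
        Dim fileName As String = "names.txt"

        ' State the encoding explicitly instead of relying on detection.
        Using reader As New StreamReader(fileName, Encoding.UTF8)
            Dim lineRead As String = reader.ReadLine()
            While lineRead IsNot Nothing
                Console.WriteLine(lineRead)
                lineRead = reader.ReadLine()
            End While
        End Using
    End Sub
End Module
```

Note that decoding the file correctly does not merge a base letter and a separate combining mark into one Char, so a line containing such a sequence may still need text-element indexing.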

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Web01 | 2.8.141223.1 | Last Updated 11 May 2012
Copyright © CodeProject, 1999-2014
All Rights Reserved. Terms of Service