See more: C# VB utf8
I'm seriously struggling to parse a line from a UTF-8 file into an array of strings.
My file has content:
Gȯrecki   John
The first name "John" starts at the 10th position. (Here the 2nd character is UTF-8 U+022F.)
In code I need to do a Substring starting at 11 to get "John", where it should be a Substring starting at 10 with normal characters.

My question is of course how to detect that I need to do a Substring of 11 instead of 10, in this case?
I tried things like
If Not System.Text.Encoding.UTF8.GetCharCount(System.Text.Encoding.UTF8.GetBytes(LineRead)) = System.Text.Encoding.UTF8.GetByteCount(LineRead) Then 
but that's also the case for "à" which counts as only 1 in String.Length but has 2 bytes in UTF-8...

How to handle common cases like this?
How to prevent splitting up bytes of 1 character into several wrong characters? That way I could progress through the string character by character and count them?
Thanks in advance!
Posted 11-May-12 0:34am
Edited 11-May-12 1:01am
phil.o at 11-May-12 6:38am
Not really clear; in your first code region, you seem to start the index at 1. Indexes start at 0, which gives you the correct index 10 for John.
Moreover, the character count should not differ depending on whether the string contains special characters.
0ddball at 11-May-12 7:08am
The numbers in the first code region are just as a reference, I know index starts at 0 ;)
If you count the characters in the first word it's obviously 7, but String.Length will give you 8 for that word. Also, System.Text.Encoding.UTF8.GetBytes(LineRead) will give you 9 bytes for that word, and System.Text.Encoding.UTF8.GetChars(System.Text.Encoding.UTF8.GetBytes(LineRead)) will return a character array of 8 elements for that word, not 7...
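For reference, those 7 / 8 / 9 figures are what you would see if the file stores the dotted o as a plain "o" followed by the combining mark U+0307, rather than as the single character U+022F. That is only an assumption about the file, but it is easy to check. A minimal sketch (the class and method names are made up for illustration):

```csharp
using System;
using System.Globalization;
using System.Text;

static class CountDemo
{
    // Number of text elements (user-perceived characters) in s.
    public static int TextElementCount(string s)
    {
        return new StringInfo(s).LengthInTextElements;
    }

    static void Main()
    {
        // Assumption: the dotted o is stored as 'o' + U+0307 (combining dot above).
        string word = "Go\u0307recki";

        Console.WriteLine(Encoding.UTF8.GetByteCount(word)); // 9 (UTF-8 bytes)
        Console.WriteLine(word.Length);                      // 8 (UTF-16 code units)
        Console.WriteLine(TextElementCount(word));           // 7 (text elements)
    }
}
```

If the same three numbers come out for the word read from the real file, the extra count comes from the combining mark and not from a decoding problem.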

Solution 2

After hours of looking and finally desperately posting it on here I think I found the solution at[^]
Thanks anyway though!
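The link above did not survive, so the following is not necessarily what that page describes. One way to index by visible character instead of by Char is System.Globalization.StringInfo, which reports where each text element starts. A rough sketch, again assuming the line contains "o" + U+0307 (the helper name is hypothetical):

```csharp
using System;
using System.Globalization;

static class TextElementDemo
{
    // Returns the rest of s starting at the given text element
    // (visible character) rather than at the given Char index.
    public static string SubstringFromTextElement(string s, int elementIndex)
    {
        // starts[i] is the Char index at which text element i begins.
        int[] starts = StringInfo.ParseCombiningCharacters(s);
        if (elementIndex >= starts.Length)
        {
            return string.Empty;
        }
        return s.Substring(starts[elementIndex]);
    }

    static void Main()
    {
        // Assumption: decomposed form, so "John" sits at Char index 11.
        string line = "Go\u0307recki   John";
        Console.WriteLine(SubstringFromTextElement(line, 10)); // John
    }
}
```

StringInfo also has a SubstringByTextElements method that does much the same in one call; the version above just makes the index mapping visible.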
phil.o at 11-May-12 8:07am
Thanks for posting what you found.

Solution 3

Well, it sounds like you may already have a solution, but here are my thoughts, for what it's worth.

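In .NET, strings are represented as a sequence of 2-byte UTF-16 code units, the idea being that you can fit just about any set of letters into 16 bits. As such, it shouldn't normally make any difference if you have non-ASCII characters in your string: indexing will work as expected for any character that fits in a single code unit. The exceptions are characters outside the Basic Multilingual Plane, which take two code units, and combining marks, which are stored as separate code units after their base letter.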
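To illustrate the indexing point: if the string holds a base letter plus a combining mark, normalizing it to form C merges the pair into one Char wherever a precomposed character exists, after which plain indexing lines up again. A small sketch, assuming the line holds "o" + U+0307 (names are illustrative only):

```csharp
using System;
using System.Text;

static class NormalizeDemo
{
    // Composes base letter + combining mark pairs where Unicode
    // defines a single precomposed character for them.
    public static string Compose(string s)
    {
        return s.Normalize(NormalizationForm.FormC);
    }

    static void Main()
    {
        string line = "Go\u0307recki   John"; // assumed decomposed input
        string composed = Compose(line);

        Console.WriteLine(line.Length);            // 15
        Console.WriteLine(composed.Length);        // 14
        Console.WriteLine(composed.Substring(10)); // John
    }
}
```

This only helps when a precomposed form exists; for arbitrary combining sequences the Char count and the visible character count can still differ.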

It sounds like when you are reading the file, .NET is interpreting it as ASCII and thus creating a new 2-byte character for every single byte in the file. That will be fine unless you have a double-byte UTF-8 character: it will interpret this as two characters, create 4 bytes, and indexing will be out by 1.
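As a quick way to see this effect in isolation, the sketch below decodes the same two UTF-8 bytes once as UTF-8 and once as ASCII. It uses "à" from the question; the helper name is made up for the example.

```csharp
using System;
using System.Text;

static class DecodeDemo
{
    // Length of the string produced by decoding bytes with the given encoding.
    public static int DecodedLength(byte[] bytes, Encoding encoding)
    {
        return encoding.GetString(bytes).Length;
    }

    static void Main()
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("\u00E0"); // "à" is C3 A0 in UTF-8

        Console.WriteLine(DecodedLength(utf8Bytes, Encoding.UTF8));  // 1
        // ASCII cannot represent either byte, so each one becomes '?'.
        Console.WriteLine(DecodedLength(utf8Bytes, Encoding.ASCII)); // 2
    }
}
```

If the surname in your string shows "?" or other odd characters where the accented letter should be, this is probably what is happening.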

The fact that your indexing is out implies that this is the case, and also the surname will be incorrect.

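I believe that in text files, right at the front, there is usually a little header (the byte order mark, or BOM) denoting the file encoding. If this is missing, what happens depends on how you read the file: StreamReader falls back to UTF-8, but other routes may use ASCII or the system's ANSI code page. So, when you load the file (using StreamReader or whatever) you want to explicitly inform it that the file is UTF-8 encoded. If the wrong encoding was the cause, the issue should then just disappear.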
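Something along these lines is what I mean by passing the encoding explicitly. It is only a sketch: the temp file and the ReadFirstLine helper are there to make the example self-contained, and your own code would open your real file instead.

```csharp
using System;
using System.IO;
using System.Text;

static class ReadDemo
{
    // Reads the first line of a file, decoding it explicitly as UTF-8.
    public static string ReadFirstLine(string path)
    {
        using (var reader = new StreamReader(path, Encoding.UTF8))
        {
            return reader.ReadLine();
        }
    }

    static void Main()
    {
        // Write a small UTF-8 test file without a BOM, then read it back.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "G\u022Frecki   John\r\n", new UTF8Encoding(false));

        string line = ReadFirstLine(path);
        Console.WriteLine(line.Substring(10)); // John

        File.Delete(path);
    }
}
```

Here the test file uses the single character U+022F, so Substring(10) lands on "John". If your file stores the letter as "o" plus a combining mark, the decoding will still be correct but the string will be one Char longer.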

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Last Updated 11 May 2012
Copyright © CodeProject, 1999-2015