Detect UTF-8 double-byte characters

Question

I'm seriously struggling to parse a line from a UTF-8 file into an array of strings.
My file has content:

Gȯrecki   John
12345678901234

The first name "John" starts at zero-based index 10. (The 2nd character here is Unicode U+022F.)
In code I need to do

VB

LineRead.Substring(11,4)

to get "John", where it should be

VB

LineRead.Substring(10,4)

with normal characters.

My question is, of course: how do I detect that I need Substring(11,4) instead of Substring(10,4) in this case?
I tried things like

VB

If Not System.Text.Encoding.UTF8.GetCharCount(System.Text.Encoding.UTF8.GetBytes(LineRead)) = System.Text.Encoding.UTF8.GetByteCount(LineRead) Then

but that's also the case for "à" which counts as only 1 in String.Length but has 2 bytes in UTF-8...

How do I handle common cases like this?
How do I prevent the bytes of one character from being split up into several wrong characters, so that I can step through the string character by character and count them?
Thanks in advance!

Posted 11-May-12 0:34am

0ddball

Updated 11-May-12 1:01am

v2

Comments

phil.o 11-May-12 6:38am

Not really clear; in your first code region you seem to start the index at 1. Indexes start at 0, which gives you the correct index 10 for John.
Moreover, the character count should not differ depending on whether the string contains special characters.

0ddball 11-May-12 7:08am

The numbers in the first code region are just a reference; I know the index starts at 0 ;)
If you count the characters in the first word it's obviously 7, but String.Length will give you 8 for that word. Also, System.Text.Encoding.UTF8.GetBytes(LineRead) will give you 9 bytes for that word, and System.Text.Encoding.UTF8.GetChars(System.Text.Encoding.UTF8.GetBytes(LineRead)) will return a character array of 8 elements for that word, not 7...
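
The counts in the comment above can be reproduced with a small sketch. It assumes the surname is stored in decomposed form, that is, a plain "o" followed by U+0307 (COMBINING DOT ABOVE), because that is the form that yields 8 Chars and 9 UTF-8 bytes; a single precomposed U+022F would yield 7 and 8. The module and helper names are made up for illustration.

```vb
Imports System
Imports System.Globalization
Imports System.Text

Module CountDemo
    ' Assumption: "o" + U+0307 (combining dot above), i.e. decomposed form.
    Public ReadOnly Decomposed As String = "Go" & ChrW(&H307) & "recki"
    ' The same surname with the single precomposed code point U+022F.
    Public ReadOnly Precomposed As String = "G" & ChrW(&H22F) & "recki"

    ' Number of user-perceived characters (text elements) in s.
    Public Function TextElementCount(ByVal s As String) As Integer
        Return New StringInfo(s).LengthInTextElements
    End Function

    ' Prints Char count, UTF-8 byte count and text element count.
    Private Sub Report(ByVal label As String, ByVal s As String)
        Console.WriteLine("{0}: Length={1}, UTF-8 bytes={2}, text elements={3}",
                          label, s.Length, Encoding.UTF8.GetByteCount(s),
                          TextElementCount(s))
    End Sub

    Sub Main()
        Report("decomposed", Decomposed)    ' Length=8, UTF-8 bytes=9, text elements=7
        Report("precomposed", Precomposed)  ' Length=7, UTF-8 bytes=8, text elements=7
    End Sub
End Module
```

Only the text element count is 7 in both cases, which is why comparing byte counts with Char counts cannot tell a combining sequence apart from an ordinary two-byte character such as "à".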

2 solutions

Solution 3

Well, it sounds like you may already have a solution, but here are my thoughts, for what it's worth.

In .NET, strings are represented as a sequence of 2-byte (UTF-16) Unicode characters, the idea being that you can fit just about any set of letters into 16 bits. As such, it shouldn't make any difference if you have non-ASCII characters in your string; indexing will work as expected.

It sounds like .NET is interpreting the file as ASCII when you read it, and thus creating a new 2-byte character for every single byte in the file. That is fine unless you have a double-byte UTF-8 character: it will interpret this as two characters, create 4 bytes, and your indexing will be out by 1.

The fact that your indexing is out implies that this is the case, and also the surname will be incorrect.

I believe that text files often start with a little header denoting the file encoding (a byte order mark), but if this is missing the reader has to fall back on a default encoding, which may not be the one your file uses. So, when you load the file (using StreamReader or whatever), you want to explicitly tell it that the file is UTF-8 encoded. The issue should then just disappear.
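
A minimal sketch of reading with an explicit encoding, assuming StreamReader is used. The module name, function name and fileName parameter are placeholders. As far as I know, StreamReader already falls back to UTF-8 when no encoding is given, so this mostly makes the intent explicit; it cannot change how a combining sequence is counted.

```vb
Imports System.IO
Imports System.Text

Module ReadDemo
    ' Reads the first line of a text file, decoding the bytes as UTF-8.
    ' fileName is a hypothetical path supplied by the caller.
    Public Function ReadFirstLine(ByVal fileName As String) As String
        Using reader As New StreamReader(fileName, Encoding.UTF8)
            ' ReadLine returns Nothing if the file is empty.
            Return reader.ReadLine()
        End Using
    End Function
End Module
```

Calling ReadDemo.ReadFirstLine with the path of the data file returns the line as a .NET string, ready for Substring or StringInfo.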

Posted 11-May-12 1:43am

Rob Philpott

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

0ddball · Accepted Answer · 2012-05-11T01:36:00

Solution 2

After hours of looking, and finally desperately posting it on here, I think I found the solution in the StringInfo class: http://msdn.microsoft.com/en-us/library/system.globalization.stringinfo.aspx
Thanks anyway though!
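
For anyone landing here later, a minimal sketch of how StringInfo can be applied to the line from the question. It assumes the "ȯ" is stored as "o" plus U+0307 (COMBINING DOT ABOVE); the module and the two helper functions are hypothetical wrappers, not part of the .NET API.

```vb
Imports System
Imports System.Globalization

Module TextElementDemo
    ' Substring where start and count are measured in text elements
    ' (user-perceived characters) instead of Char values.
    Public Function SubstringByElements(ByVal s As String, ByVal start As Integer,
                                        ByVal count As Integer) As String
        Return New StringInfo(s).SubstringByTextElements(start, count)
    End Function

    ' Char index at which the given text element starts.
    Public Function CharIndexOfElement(ByVal s As String,
                                       ByVal elementIndex As Integer) As Integer
        ' ParseCombiningCharacters returns the start index of every text element.
        Return StringInfo.ParseCombiningCharacters(s)(elementIndex)
    End Function

    Sub Main()
        ' Assumption: decomposed "o" + U+0307, as the reported counts suggest.
        Dim lineRead As String = "Go" & ChrW(&H307) & "recki   John"
        Console.WriteLine(SubstringByElements(lineRead, 10, 4))  ' John
        Console.WriteLine(CharIndexOfElement(lineRead, 10))      ' 11
    End Sub
End Module
```

The second helper answers the original question directly: text element 10 starts at Char index 11, so LineRead.Substring(11,4) is the equivalent call for this line.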

Posted 11-May-12 1:36am

0ddball

Comments

phil.o 11-May-12 8:07am

Thanks for posting what you found.