How to find out whether a string contains Unicode characters in C#?

I came across a situation where I want to check whether a given string contains a Unicode character or not. How can I do that?

What I have tried:

using System.Linq; // needed for Any()

string input = "Non English";

if (input.Any(c => c > 255))
{
    // unicode
}

Is this the right way?

Posted 20-Jun-16 21:14pm

Member 11589429

Updated 20-Jun-16 21:35pm

Comments

Sinisa Hajnal 21-Jun-16 3:20am

This would work, but I believe it would be faster with string.IndexOf instead of a lambda expression.

Member 11589429 21-Jun-16 3:35am

Thanks for the reply.

Sergey Alexandrovich Kryukov 21-Jun-16 3:21am

This is not a correct question. A character cannot be Unicode or non-Unicode. This is a cultural entity. From the Unicode standpoint, all characters are Unicode characters. For example, ASCII characters are also Unicode characters.

You can only ask such a question if you name some other standard and want to figure out how it is related to Unicode. For example, you can take ASCII and ask: "how to find out if a string contains at least one character which is not supported by ASCII?" You don't need to mention the word "Unicode", because the notion of a .NET string implies that it is always Unicode (the internal representation is UTF-16LE, by the way).

By the way, it is unrelated to English. Do you know that non-ASCII characters are actually used in English? In modern requirements, only pretty illiterate English text can be written without Unicode support. :-)

Can you see the point?

—SA

Member 11589429 21-Jun-16 3:35am

Thanks for the reply.

Sergey Alexandrovich Kryukov 21-Jun-16 3:37am

3 solutions


This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Kornfeld Eliyahu Peter · Answer 1 · 2016-06-20T21:35:00

Solution 3

In .NET (and therefore in C#) every string is Unicode. So there is no such thing as a non-Unicode char in a string, but there is such a thing as chars from a specific Unicode range. In your sample you are asking for chars that have a code point larger than 255, so they are not from Basic Latin and not from Latin-1 Supplement. But that does not mean these are not Unicode chars.
If you want to check whether there are chars from a specific range (a specific language?), see the ranges here and check accordingly:
Unicode Character Ranges
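
A minimal sketch of such a range check, assuming the range of interest is the Cyrillic block (U+0400 to U+04FF). The class and method names are hypothetical, chosen only for illustration, and the check works on UTF-16 code units, so characters outside the Basic Multilingual Plane (stored as surrogate pairs) would need separate handling.

```csharp
using System;
using System.Linq;

public static class UnicodeRangeCheck
{
    // True if any UTF-16 code unit of 'text' lies in [first, last] (inclusive).
    public static bool ContainsCharInRange(string text, char first, char last)
    {
        return text.Any(c => c >= first && c <= last);
    }

    public static void Main()
    {
        // U+0400..U+04FF is the Cyrillic block; "\u0416" is the letter Zhe.
        Console.WriteLine(ContainsCharInRange("abc\u0416", '\u0400', '\u04FF')); // True
        Console.WriteLine(ContainsCharInRange("Hello", '\u0400', '\u04FF'));     // False
    }
}
```

Passing the bounds as parameters keeps the same helper usable for any other block from the ranges table.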

Posted 20-Jun-16 21:35pm

Kornfeld Eliyahu Peter

Comments

Sergey Alexandrovich Kryukov 21-Jun-16 3:39am

No, 128 to 255 should not be used. Please see Solution 2.
—SA

Kornfeld Eliyahu Peter 21-Jun-16 3:42am

I do not understand you. As I read it, OP asked how to tell if a string has chars from specific (non-English) languages. OP called it, wrongly, Unicode, but it is about ranges, as the whole string is Unicode. And the range of 128-255 is a perfectly valid Unicode range.

Sergey Alexandrovich Kryukov 21-Jun-16 3:53am

Yes, but the question is related to the use of this string in a non-Unicode application.
You really don't understand: up to 127, the meaning of all code points is the same; all standards share this range. Not so in 128-255. The interpretation of each byte depends on the encoding. Did you have to support any local languages (not English) before Unicode?
—SA

Kornfeld Eliyahu Peter 21-Jun-16 4:05am

I can't see where OP is talking about a non-Unicode application. In that context, however, I do understand your concerns about the 128-255 range.

Sergey Alexandrovich Kryukov 21-Jun-16 9:26am

He wasn't talking about it. But see my comment to the question: the formulation is incorrect, illogical. But what is the practical concern? Loss of data. Say, you convert text to its ASCII representation and copy it back to Unicode. Some characters will turn into '?', because ASCII doesn't support them. But ASCII is defined up to 127. That's all.

But why ASCII? Look at the Unicode characters above 127. Those values used to be used for characters of different languages in different ways; not the same characters as now in Unicode. It depended on the "code page", in Microsoft's terms. Hence, the result of the round trip depends on the "code page". In other words, when you convert some Unicode text using a non-Unicode encoding, the result is uncertain.
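
To illustrate that the outcome depends on the encoding, here is a rough sketch that round-trips the same characters through two encodings. It assumes ISO-8859-1 (Latin-1) as a stand-in for a legacy single-byte "code page", because it is available through Encoding.GetEncoding without extra setup; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class EncodingDependence
{
    // True if 'text' is unchanged after encoding to bytes and decoding back.
    public static bool RoundTrips(string text, Encoding encoding)
    {
        return encoding.GetString(encoding.GetBytes(text)) == text;
    }

    public static void Main()
    {
        Encoding ascii = Encoding.ASCII;
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        string eAcute = "\u00E9"; // Latin small letter e with acute (code point 233)
        string zhe = "\u0416";    // Cyrillic capital letter Zhe

        Console.WriteLine(RoundTrips(eAcute, ascii));  // False: not representable in ASCII
        Console.WriteLine(RoundTrips(eAcute, latin1)); // True: representable in Latin-1
        Console.WriteLine(RoundTrips(zhe, latin1));    // False: not representable in Latin-1
    }
}
```

The same character survives one encoding and is lost in another, which is why a check against a fixed numeric limit such as 255 says little about data loss in general.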

Please see my update to my answer.

—SA

Sergey Alexandrovich Kryukov · Answer 2 · 2016-06-20T21:34:00

First of all, please see my comment to the question.

No, using 255 is wrong. You need to consider only ASCII: characters with a code point less than or equal to 127. Characters with code points 128 and above generally have different numeric representations in different encodings.

I know what could have confused you into thinking of 255: so-called "Extended ASCII", a popular non-standard encoding that used to be common in MS-DOS applications. Besides, there are several standard encodings based on the same principle. In contrast to Unicode, the characters encoded by bytes with values above 127 did not have an unambiguous interpretation. In addition to the bytes (text), one would need to add the information on which "code page" or encoding is used. Some of these encodings were standardized, some were not. There was a bloody mess in some of the cultures. You cannot rely on any of these values.
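
As a minimal sketch, the check from the question with the limit moved from 255 to 127 could look like this. The class and method names are hypothetical and only serve the illustration.

```csharp
using System;
using System.Linq;

public static class AsciiCheck
{
    // True if at least one character has a code point above 127,
    // i.e. is not part of ASCII.
    public static bool ContainsNonAscii(string text)
    {
        return text.Any(c => c > 127);
    }

    public static void Main()
    {
        Console.WriteLine(ContainsNonAscii("plain text")); // False
        Console.WriteLine(ContainsNonAscii("caf\u00E9"));  // True (U+00E9 is above 127)
    }
}
```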

Now, here is a bonus: do you want to know "Unicode characters" :-) (characters not available in ASCII) used in English text? Oh, there are a lot of them: — – … “ ” ‘ ’, æ, ©, †, °, and more…

[EDIT]

However, checking the numeric values of characters directly would be bad. Good code should abstract out the particular string representation. One way would be this: using the class System.Text.ASCIIEncoding, serialize your string as ASCII, and then deserialize it back to a string. After this round trip, you will have two strings, "before" and "after". They are both "Unicode", but in the new string any characters not supported by ASCII will have turned into '?'. In other words, if the strings are the same, all characters in the source string were in the ASCII range. You can do the same with any other non-Unicode encoding.
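
The round trip described above might be sketched as follows. It uses Encoding.ASCII, the shared instance of the ASCII encoding; the class and method names are hypothetical, and the encoding is passed as a parameter so that the same helper could be tried with another non-Unicode encoding.

```csharp
using System;
using System.Text;

public static class AsciiRoundTrip
{
    // Encodes 'text' to bytes and decodes it back; returns true only if
    // nothing was lost, i.e. every character is supported by 'encoding'.
    public static bool SurvivesRoundTrip(string text, Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes(text);     // "before" -> bytes
        string after = encoding.GetString(bytes);   // bytes -> "after"
        return string.Equals(text, after, StringComparison.Ordinal);
    }

    public static void Main()
    {
        Console.WriteLine(SurvivesRoundTrip("plain text?", Encoding.ASCII));         // True
        Console.WriteLine(SurvivesRoundTrip("\u201Cquoted\u201D", Encoding.ASCII));  // False
    }
}
```

A literal '?' in the source is not a problem, because it encodes and decodes to itself and the two strings still compare equal.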

—SA

Sinisa Hajnal · Answer 3 · 2016-06-20T21:21:00

Solution 1

You could also try with something like this regex:
[^\u0000-\u00ff]

It says: find any character not between 0 and 255. I don't really like regexes, but if you ever need to filter additional characters in or out, this makes it easier than conditionals in "normal" code.
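
For reference, a small sketch of how this pattern could be used from C# with System.Text.RegularExpressions. The second pattern, limited to 0-127, is an added variant for comparison and not part of the suggestion above; the class and member names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexRangeCheck
{
    // Matches any character outside 0..255 (the pattern from this answer).
    private static readonly Regex Above255 = new Regex(@"[^\u0000-\u00ff]");

    // Variant for comparison: matches any character outside 0..127 (non-ASCII).
    private static readonly Regex NonAscii = new Regex(@"[^\u0000-\u007f]");

    public static bool HasCharAbove255(string text) => Above255.IsMatch(text);

    public static bool HasNonAscii(string text) => NonAscii.IsMatch(text);

    public static void Main()
    {
        string cafe = "caf\u00E9"; // ends with U+00E9, code point 233
        Console.WriteLine(HasCharAbove255(cafe)); // False: 233 is within 0..255
        Console.WriteLine(HasNonAscii(cafe));     // True: 233 is above 127
    }
}
```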

Good luck

Posted 20-Jun-16 21:21pm

Sinisa Hajnal

Comments

Sergey Alexandrovich Kryukov 21-Jun-16 3:35am

Sorry, you are making a big mistake. Please see Solution 2.
—SA