I came across a situation where I want to check whether a given string contains a Unicode character. How can I do that?

What I have tried:

string input = "Non English";

// Requires: using System.Linq;
if (input.Any(c => c > 255))
{
    // the string contains at least one character above U+00FF
}

Is this the right way?
Updated 20-Jun-16 21:35pm
Comments
Sinisa Hajnal 21-Jun-16 3:20am    
This would work, but I believe it would be faster with string.IndexOf instead of a lambda expression.
Member 11589429 21-Jun-16 3:35am    
Thanks for reply....
Sergey Alexandrovich Kryukov 21-Jun-16 3:21am    
This is not a correct question. A character cannot be Unicode or non-Unicode. This is a cultural entity. From the Unicode standpoint, all characters are Unicode characters. For example, ASCII characters are also Unicode characters.

You can only ask such a question if you name some other standard and want to figure out how it is related to Unicode. For example, you can take ASCII and ask: "How do I find out whether a string contains at least one character which is not supported by ASCII?" You don't need to mention the word "Unicode", because the notion of a .NET string implies that it is always Unicode (the internal representation is UTF-16LE, by the way).

By the way, this is unrelated to English. Do you know that non-ASCII characters are actually used in English? By modern standards, only pretty illiterate English text can be written without Unicode support. :-)

Can you see the point?

—SA
Member 11589429 21-Jun-16 3:35am    
Thanks for reply.
Sergey Alexandrovich Kryukov 21-Jun-16 3:37am    
See also my solution.
—SA

In .NET (and therefore in C#), every string is Unicode. There is no such thing as a non-Unicode char in a string, but there are chars from a specific Unicode range. In your sample you are asking for chars with a code point larger than 255, that is, chars outside both Basic Latin and Latin-1 Supplement. That does not mean they are not Unicode chars.
If you want to check whether there are chars from a specific range (a specific language?), look up the ranges and check accordingly:
Unicode Character Ranges
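
A range check like the one described above could be sketched as follows. This is only an illustration: the class and method names are made up for this example, and the Cyrillic block (U+0400 to U+04FF) is just one sample range. It compares UTF-16 code units, so it only covers ranges inside the Basic Multilingual Plane; characters above U+FFFF are stored as surrogate pairs and would need different handling.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of the .NET class library.
public static class UnicodeRangeCheck
{
    // Returns true if at least one UTF-16 code unit of 'text'
    // lies in the inclusive range [first, last].
    public static bool ContainsCharInRange(string text, char first, char last)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        return text.Any(c => c >= first && c <= last);
    }
}
```

For example, UnicodeRangeCheck.ContainsCharInRange("Привет", '\u0400', '\u04FF') returns true, while the same call with "Non English" returns false.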
 
Comments
Sergey Alexandrovich Kryukov 21-Jun-16 3:39am    
No, 128 to 255 should not be used. Please see Solution 2.
—SA
Kornfeld Eliyahu Peter 21-Jun-16 3:42am    
I do not understand you. As I read it, the OP asked how to tell whether a string has chars from specific (non-English) languages. The OP called it, wrongly, Unicode, but it is about ranges, as the whole string is Unicode. And the range 128-255 is a perfectly valid Unicode range.
Sergey Alexandrovich Kryukov 21-Jun-16 3:53am    
Yes, but the question is related to the use of this string in non-Unicode application.
You really don't understand: up to 127, the meaning of all code points is the same; all standards share this range. Not so in 128-255. The interpretation of each byte depends on the encoding. Did you ever have to support any local languages (not English) before Unicode?
—SA
Kornfeld Eliyahu Peter 21-Jun-16 4:05am    
I can't see where the OP is talking about a non-Unicode application. In that context, however, I do understand your concerns about the 128-255 range.
Sergey Alexandrovich Kryukov 21-Jun-16 9:26am    
He wasn't talking about it. But see my comment to the question: the formulation is incorrect, illogical. So what is the practical concern? Loss of data. Say you convert text to an ASCII representation and then convert it back to Unicode. Some characters will turn into '?', because ASCII doesn't support them. But ASCII is defined only up to 127. That's all.

But why ASCII? Look at the values above 127. They used to be used for characters of different languages in different ways; not the same characters as now in Unicode. It depended on the "code page", in Microsoft's terms. Hence, the result of the round trip depends on the "code page". In other words, when you convert some Unicode text using a non-Unicode encoding, the result is uncertain.

Please see my update to my answer.

—SA
First of all, please see my comment to the question.

No, using 255 is wrong. You need to consider only ASCII, that is, characters with a code point less than or equal to 127. Characters with code points 128 and above generally have different numeric representations in different encodings.
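
As a minimal sketch of that boundary, the check from the question can be restated with 127 as the limit. The class and method names are hypothetical, chosen only for this illustration.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of the .NET class library.
public static class AsciiCheck
{
    // Returns true if every UTF-16 code unit of 'text' is in the
    // ASCII range U+0000..U+007F. An empty string counts as ASCII-only.
    public static bool IsAsciiOnly(string text)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        return text.All(c => c <= 127);
    }
}
```

With this, AsciiCheck.IsAsciiOnly("Non English") is true and AsciiCheck.IsAsciiOnly("café") is false, because 'é' is U+00E9.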

I know what could have confused you into thinking of 255: so-called "Extended ASCII", a popular non-standard encoding that used to be common in MS-DOS applications. Besides, there are several standard encodings based on the same principle. In contrast to Unicode, the characters encoded by bytes with values above 127 did not have an unambiguous interpretation. In addition to the bytes (text), one would need to add the information on what "code page" or encoding is used. Some of these encodings were standardized, some were not. There was a bloody mess in some of the cultures. You cannot rely on any of these values.

Now, here is a bonus: do you want to know "Unicode characters" :-) (characters not available in ASCII) used in English text? Oh, there are a lot of them: — – … “ ” ‘ ’, æ, ©, †, °, and more…

[EDIT]

However, checking the numeric value of each character directly would be bad. Good code should abstract out the particular string representation. One way would be this: using the class System.Text.ASCIIEncoding, serialize your string as ASCII, and then deserialize it back to a string. After this round trip, you will have two strings, "before" and "after". They are both "Unicode", but in the new string some characters will have turned into '?'. In other words, if the two strings are the same, all characters in the source string were in the ASCII range. You can do the same with any other non-Unicode encoding.

—SA
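
The round trip described in the [EDIT] part of this solution could look roughly like the sketch below. The class and method names are invented for illustration. It relies on the default behavior of Encoding.ASCII, which replaces every character it cannot encode with '?'; an encoding configured with a different fallback would behave differently.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the .NET class library.
public static class EncodingRoundTrip
{
    // Encodes 'text' with the given encoding, decodes it back, and
    // returns true only if the result is identical to the input.
    public static bool SurvivesRoundTrip(string text, Encoding encoding)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (encoding == null) throw new ArgumentNullException(nameof(encoding));

        byte[] bytes = encoding.GetBytes(text);     // "before" -> bytes
        string after = encoding.GetString(bytes);   // bytes -> "after"
        return string.Equals(text, after, StringComparison.Ordinal);
    }
}
```

EncodingRoundTrip.SurvivesRoundTrip("Non English", Encoding.ASCII) returns true, while the same call with "naïve" returns false because 'ï' comes back as '?'. Passing another encoding, for example Encoding.GetEncoding("iso-8859-1"), tests the string against that encoding instead.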
 
You could also try something like this regex:
[^\u0000-\u00ff]

It says: find any character not between 0 and 255. I don't really like regexes, but if you ever need to filter additional characters in or out, this makes it easier than conditionals in "normal" code.
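
Applied in C#, the pattern above might be used as in this sketch. The class and member names are hypothetical. The second pattern, with the upper bound lowered to \u007f, is not from this solution; it is added only to show how the range can be adjusted, for instance to the ASCII limit discussed in Solution 2.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the .NET class library.
public static class RegexRangeCheck
{
    // Matches any character outside U+0000..U+00FF (the pattern from this solution).
    private static readonly Regex Above255 = new Regex(@"[^\u0000-\u00ff]");

    // Assumed variant: matches any character outside the ASCII range U+0000..U+007F.
    private static readonly Regex NonAscii = new Regex(@"[^\u0000-\u007f]");

    public static bool HasCharAbove255(string text)
    {
        return Above255.IsMatch(text);
    }

    public static bool HasNonAscii(string text)
    {
        return NonAscii.IsMatch(text);
    }
}
```

For "café", HasCharAbove255 returns false but HasNonAscii returns true, which shows how much the chosen upper bound matters.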

Good luck
 
Comments
Sergey Alexandrovich Kryukov 21-Jun-16 3:35am    
Sorry, you are making a big mistake. Please see Solution 2.
—SA

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


