How to use UTF encoding in C#?

Question

I have a WinForms application that uses a SQL LocalDB database. The problem is that when I insert an Arabic word, the name column shows the characters "??????". I use this code:

C#

var arabic = Encoding.GetEncoding(864);
var bytes = arabic.GetBytes(libelléTextEdit.Text);
libelléTextEdit.Text = arabic.GetString(bytes);

It does not work for me.
How can I solve this problem?

Posted 2-Jul-15 15:17pm

Mohammed-cd7

Updated 19-Aug-21 8:02am

Comments

Sergey Alexandrovich Kryukov 2-Jul-15 23:12pm

This has nothing to do with UTF, or with encoding "864". Listen, this is such a can of worms...
Can you simply forget all that and start from scratch? You should not use this obsolete encoding; use only Unicode. .NET directly supports only Unicode, and Unicode is all you should use for Arabic. Unicode is not an encoding (there is a very confusing Microsoft term "Unicode", which means UTF-16LE, which is not really Unicode but one of the UTFs...). Unicode is an abstract one-to-one correspondence between a set of mathematical integer values and characters as cultural entities, that is, abstracted from computer representation (from bytes). The UTFs are only needed when you have to persist Unicode in a stream.
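
To make the distinction concrete, here is a minimal sketch (the class and helper names are made up for illustration) showing that one Arabic letter has a single Unicode code point, while the bytes depend on which UTF is used to persist it:

```csharp
using System;
using System.Text;

static class CodePointDemo
{
    // Formats bytes as space-separated hex, e.g. "D9 85".
    public static string Hex(byte[] bytes) =>
        BitConverter.ToString(bytes).Replace("-", " ");

    static void Main()
    {
        string meem = "\u0645"; // ARABIC LETTER MEEM

        // The code point is an abstract integer, independent of any byte layout.
        int codePoint = char.ConvertToUtf32(meem, 0);
        Console.WriteLine($"U+{codePoint:X4}");                           // U+0645

        // The same code point, persisted with different UTFs:
        Console.WriteLine(Hex(Encoding.UTF8.GetBytes(meem)));             // D9 85
        Console.WriteLine(Hex(Encoding.Unicode.GetBytes(meem)));          // 45 06 (UTF-16LE)
        Console.WriteLine(Hex(Encoding.BigEndianUnicode.GetBytes(meem))); // 06 45 (UTF-16BE)
        Console.WriteLine(Hex(Encoding.UTF32.GetBytes(meem)));            // 45 06 00 00
    }
}
```

As long as the text stays in a .NET string, none of these byte layouts matter to the application code.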

In other words, you should learn how Unicode works and, in the database, use Unicode data types. You don't need to touch those UTF bytes at all; Unicode is supported transparently. Tell us exactly what you want to achieve. You need some elementary background to be in a position to understand the help you get...

—SA

Afzaal Ahmad Zeeshan 2-Jul-15 23:20pm

Also, Unicode is supported in the .NET Framework, and the char type represents a Unicode character (strictly, one UTF-16 code unit), not a character in another standard such as ASCII. It is 2 bytes in size, so a single char can hold any character of the Basic Multilingual Plane, which includes Arabic.

It is then the job of the framework (in this example Windows Forms) and the font family to determine the glyph for the character, which in his case is not happening. I suggest that he change the font family to Segoe UI and try that for a while. I have also included an article of mine in the following solution that talks about Unicode, code pages, how characters are shown, and what happens if a character cannot be mapped... Much more is in the article.

Please see Solution 1 also.

Sergey Alexandrovich Kryukov 2-Jul-15 23:31pm

I just explained the problems I can see with your explanation.
By the way, '?' appears when Unicode data (not a particular UTF, but char/string data) is converted to a non-Unicode encoding somewhere in the middle. When such a string is converted back to Unicode, it's too late: the data is already lost.
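
A small sketch of that data loss, using ASCII as a stand-in for whatever non-Unicode encoding sits in the middle (the class and method names are illustrative, not from the original code):

```csharp
using System;
using System.Text;

static class LossyRoundTrip
{
    // Converts text to bytes in the given encoding and back again.
    public static string RoundTrip(string text, Encoding encoding) =>
        encoding.GetString(encoding.GetBytes(text));

    static void Main()
    {
        string arabic = "\u0645\u0631\u062D\u0628\u0627"; // five Arabic letters ("hello")

        // ASCII has no Arabic letters, so each one is replaced with '?'
        // while encoding; decoding afterwards cannot bring them back.
        Console.WriteLine(RoundTrip(arabic, Encoding.ASCII));          // ?????

        // A UTF can represent every character, so nothing is lost.
        Console.WriteLine(RoundTrip(arabic, Encoding.UTF8) == arabic); // True
    }
}
```

The same replacement happens when a Unicode string is stored in a column or passed through a layer that uses a code page lacking those characters.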
—SA

Afzaal Ahmad Zeeshan 2-Jul-15 23:40pm

Isn't it because the characters' glyphs are not present in the font family's character set?

Your latter sentence is true, the data is already lost. But isn't the font family the only reason? The same data can be displayed in Segoe UI but not in Monospace or Consolas (as in my article).

Sergey Alexandrovich Kryukov 2-Jul-15 23:54pm

No. Certainly not. If the font family were the problem, it would show a box, not '?'. And Perso-Arabic is supported nearly everywhere. There is a non-Unicode encoding somewhere in the chain. It could be this damn 864, which is totally inappropriate, or it could be due to, say, varchar. It's simple: if you transcode a Unicode string to an encoding that cannot represent some of its characters, all software generates '?'. As simple as that. So the answer would be: use Unicode and only Unicode. There is no need to use any UTF (it's better to stay away unless a stream is used). For example, nvarchar is automatically mapped to .NET Unicode strings via ADO.NET.
—SA

3 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Afzaal Ahmad Zeeshan · Answer 1 · 2015-07-02T17:15:00

Solution 1

Unicode is supported in the .NET Framework, and you can easily use UTF-8, UTF-16 or the other Unicode encodings in your applications. The thing is that you need to be sure whether your application is able to display the glyphs or not. If your application's font family includes the glyphs (for example) of Arabic, then you can use them. Otherwise, it would always show "????" for each character, or it would show some other similar character.

This does not mean that your application or the .NET Framework doesn't support Unicode; instead, it means that the character was not mapped to a code page that can represent it as a glyph.

I have written an article that fully describes Unicode, how characters are mapped, and why these "???" are shown if a character isn't mapped properly. The solution provided there is to use a font family that supports the character's code page, which in most cases (and this is also my recommendation) is Segoe UI.

Read more on my article: Reading and writing Unicode data in .NET[^]

I hope the article helps you out. :)

Posted 2-Jul-15 17:15pm

Afzaal Ahmad Zeeshan

Updated 2-Jul-15 17:16pm

v2

Comments

Sergey Alexandrovich Kryukov 2-Jul-15 23:28pm

What glyphs are you talking about? The inquirer is trying to use Arabic. This is the Perso-Arabic script, which is supported by nearly all modern systems by default, without any additional installations, and by most major fonts. And, most likely, there is no need to use any UTF. .NET strings are abstracted from the internal bytewise representation of Unicode, and this is very important to understand.

Your article does not clearly explain what Unicode is.
In fact, the "Different Unicodes?" part is very confusing. There is only one Unicode, which has nothing to do with the binary (or per-byte) representation of text. Unicode (excluding the UTF part of the standard) is an abstract mathematical one-to-one mapping, fully abstracted from computer representation. If you explained it, it would help many.

On UTF-16, you don't mention "surrogate pairs", which is a very important point. And you never mention LE vs BE encodings. You do not clearly explain that all UTFs, even UTF-8, support at least 20-bit integers (16 supplementary "planes" of 0x10000 code points each, on top of the basic plane). A careful reader can figure it out, but not all readers are careful enough.

Let me ask you this: can I hope for considerable improvements to your article?

—SA

Afzaal Ahmad Zeeshan 2-Jul-15 23:39pm

That is because the .NET Framework supports mapping Unicode characters to code pages by default, and thus when you write Arabic (or any other language) data, your characters are supported in all applications (running on the .NET Framework).

I am sorry for this; I will update the article and explain Unicode in a little more depth. Also, in "Different Unicodes" I did not mean to suggest that there are different Unicodes, but to clarify that there is no such thing. There is only one Unicode, a standard used by frameworks; UTF-8 and UTF-16 are just the encodings used by different platforms, which determine the bytes used for each character.

I will definitely make changes to the article, add the topics that you have shared with me here, and remove the ambiguous content from the article.

Thank you. :-)

Sergey Alexandrovich Kryukov 3-Jul-15 0:13am

I summarize all the comments in Solution 2, please see.

I understand that "different Unicode" was a figure of speech, but it stops being confusing only if you clearly explain what Unicode is in the narrow sense of the word and isolate this notion from the UTFs. It would be really good if you fixed the article. Please notify me when you do.

—SA

Afzaal Ahmad Zeeshan 3-Jul-15 15:02pm

I have revised the article to cover the topics that you mentioned: surrogate pairs in UTF-16, why they are used and what they are. I have also added endianness to the article.

I have to trouble you about this 20-bit integers thing. I have looked around, but found nothing that (in any way) explains how these encodings support at least 20-bit integers. A little explanation, please?

Sergey Alexandrovich Kryukov 3-Jul-15 15:35pm

I thought you knew it yourself. You mentioned that UTF-8 is multi-byte. It is a canny algorithm for reading N bytes: read one byte, and it tells you whether to read a second one (or not), and so on. So UTF-8 uses not 8 bits, but 1 to 4 bytes. With UTF-16, a surrogate pair takes 32 bits, but not all of those bits are effectively used. The range for characters in the basic plane is less than 0..FFFF, because of the ranges reserved for the low and high words of the surrogate pair. And with UTF-32, 20 is less than 32. :-)

I don't know why the Unicode standard reserved only 16 supplementary planes beyond the basic one (room for 16*0x10000 more characters); technically, it could be more. Maybe it was decided that more won't be needed. But I know that way too many characters still await standardization. Please see https://en.wikipedia.org/wiki/Universal_Character_Set_characters.
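
As a rough illustration of the variable lengths discussed here (a sketch with arbitrarily chosen sample characters; UTF-8 needs up to 4 bytes once a code point lies outside the basic plane):

```csharp
using System;
using System.Text;

static class Utf8AndSurrogates
{
    static void Main()
    {
        // UTF-8 byte count grows with the code point value.
        Console.WriteLine(Encoding.UTF8.GetByteCount("A"));          // 1 (U+0041)
        Console.WriteLine(Encoding.UTF8.GetByteCount("\u0645"));     // 2 (U+0645, Arabic meem)
        Console.WriteLine(Encoding.UTF8.GetByteCount("\u20AC"));     // 3 (U+20AC, euro sign)
        Console.WriteLine(Encoding.UTF8.GetByteCount("\U0001F600")); // 4 (U+1F600, outside the BMP)

        // In UTF-16, and so in a .NET string, U+1F600 takes two char values.
        string face = "\U0001F600";
        Console.WriteLine(face.Length);                              // 2
        Console.WriteLine(char.IsSurrogatePair(face[0], face[1]));   // True
        Console.WriteLine($"{(int)face[0]:X4} {(int)face[1]:X4}");   // D83D DE00

        // The pair decodes back to the single code point.
        Console.WriteLine($"{char.ConvertToUtf32(face, 0):X}");      // 1F600
    }
}
```

Each half of a surrogate pair carries 10 payload bits, which is where the 20-bit figure for the supplementary range comes from.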

—SA

Afzaal Ahmad Zeeshan 3-Jul-15 18:26pm

Thank you very much, I have updated the article post and added the topics that you have told me.

1) An explanation of Unicode as a standard, and of what "Unicode" means in the .NET Framework, which is indeed a UTF,

2) I have added surrogate pairs: what they are, why they are needed, and other things about them,

3) I have added endianness, LE and BE. Apart from that, I have changed the title, since it was nothing but ambiguous.

I have updated the title and added a new section under the Points of Interest section.

Edit

You did mention that most readers are not careful enough, so I might have missed this in the mist. Thank you for clarifying the point for me.

Sergey Alexandrovich Kryukov 3-Jul-15 22:20pm

Sure. :-)
—SA

Sergey Alexandrovich Kryukov · Answer 2 · 2015-07-02T18:11:00

Solution 2

Most of my answer is already in my comments to the question and to Solution 1.

A short summary: 1) don't use anything non-Unicode; 2) don't even use UTFs unless you persist the data in a file or stream, which has nothing to do with databases and ASP.NET (the pages should use UTF-8, but everything except HTTP-EQUIV is done automatically); 3) use only Unicode string data types such as nvarchar instead of varchar; ADO.NET will automatically map such data to .NET strings, but with varchar you will lose data and get '?' instead of Arabic letters (and Eastern Arabic numerals and punctuation).

You are trying to use the Perso-Arabic script, which is supported via Unicode by default by nearly all modern systems; you don't need to install anything.

Finally: you really need to understand what Unicode does.

—SA

Mohammed-cd7 · Answer 3 · 2015-07-04T04:39:00

Solution 3

Thank you for all your solutions. I just added the N prefix to the string literal in the query, and the data was inserted successfully.
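
For readers with the same problem: the N prefix marks a SQL string literal as Unicode, as in `VALUES (N'...')` instead of `VALUES ('...')`. A parameterized query is a common alternative that avoids building the literal by hand. The sketch below assumes the classic System.Data.SqlClient provider and a hypothetical table `Produit` with an `nvarchar(100)` column `Libelle`; neither name comes from the original post.

```csharp
using System.Data;
using System.Data.SqlClient;

static class InsertArabic
{
    // Hypothetical schema: Produit(Libelle NVARCHAR(100)).
    public static SqlCommand BuildInsert(SqlConnection connection, string libelle)
    {
        var command = new SqlCommand(
            "INSERT INTO Produit (Libelle) VALUES (@libelle)", connection);

        // NVarChar makes ADO.NET send the value as Unicode, so no N'...'
        // literal and no manual Encoding calls are needed.
        command.Parameters.Add("@libelle", SqlDbType.NVarChar, 100).Value = libelle;
        return command;
    }
}
```

With an open connection, `InsertArabic.BuildInsert(connection, libelléTextEdit.Text).ExecuteNonQuery()` would then perform the insert. The column itself must also be nvarchar; with a varchar column the letters can still be replaced by '?'.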

Posted 4-Jul-15 4:39am

Mohammed-cd7