How to get exact UTF8 string in C#

Question

I have a UTF-8 string ('Sarandë').
I want to get it exactly, but I only get this string ('Sarandë').
I tried using Encoding, following some examples from the Internet, but they don't work.
Please help.

Detail: I want to get the string value ('Sarandë') from a website and add it to a SQLite database.
My field type is NVARCHAR(100),
but the value I insert is ('Sarandë').
I converted it to byte[] and used Text.Encoding, but it's not working.
The page I want to get data from: http://www.infodriveindia.com/traderesources/port.aspx?&GridInfo=Ports010

Posted 23-Sep-14 6:00am

ductuan_itpro

Updated 24-Sep-14 5:38am

v4

Comments

SteveyJDay 23-Sep-14 12:05pm

The strings look the same to me... Please post the code you are using.

Sergey Alexandrovich Kryukov 23-Sep-14 12:46pm

Chances are, OP wanted to see something like "SarandÃ«".
Please see my answer.
—SA

3 solutions

Solution 2

See http://www.w3schools.com/jsref/jsref_decodeuri.asp.

Posted 24-Sep-14 5:51am

Richard MacCutchan

Comments

ductuan_itpro 24-Sep-14 11:57am

This is not a URI, it's web content. I want to get it exactly.

Richard MacCutchan 24-Sep-14 12:19pm

It doesn't matter where it comes from, the principle is the same.

Solution 3

Your next question was recently auto-removed, due to some abuse reports. I'll answer again.

In that question, you asked about HTML representation using HTML entities. One of the answers explained the API for entities. I added an explanation of the background: in my Solution 1 (on this page) I explained what Unicode standardizes and what a "code point" is.

Now, an HTML character entity has nothing to do with UTFs. Instead of encoding a code point as bytes (UTF-8, for example, uses a variable number of bytes per character, following an intricate algorithm which you don't have to know), an HTML character entity encodes the code point itself, in this case, #235. If you run CharMap.EXE ("Character Map", the application bundled with all versions of Windows) and select code point U+00EB (decimal 235), you will see the character 'ë', "Latin Small Letter E With Diaeresis".
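A minimal sketch of decoding such an entity in C#, assuming the page markup really contains the literal text Sarand&#235; (the thread never shows the raw HTML, so that input is a guess). System.Net.WebUtility.HtmlDecode is a standard .NET call; the class and method names around it are made up for illustration.

```csharp
using System;
using System.Net;

public static class HtmlEntityDemo
{
    // Turns HTML character references such as "&#235;" or "&euml;" back into characters.
    public static string Decode(string html) => WebUtility.HtmlDecode(html);

    public static void Main()
    {
        // Hypothetical fragment of page source; the real markup is not shown in the thread.
        string raw = "Sarand&#235;";
        string decoded = Decode(raw);

        // '\u00EB' is 'ë', written as an escape so the source file's encoding does not matter.
        Console.WriteLine(decoded == "Sarand\u00EB");   // True
        Console.WriteLine((int)decoded[6]);             // 235, the code point named by the entity

        // The named and hexadecimal forms refer to the same code point.
        Console.WriteLine(Decode("Sarand&euml;") == decoded);  // True
        Console.WriteLine(Decode("Sarand&#xEB;") == decoded);  // True
    }
}
```

No System.Text.Encoding call is involved here: the entity is plain ASCII text that names a code point, so it has to be decoded as HTML, not re-encoded as bytes. The decoded string can then be stored in an NVARCHAR column as it is.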

I hope it explains things.

Let's see: I explained the basics of how Unicode works, how UTFs work and how HTML works with character entities. You need to put it all together in your mind, and probably read up on the topic, maybe starting from http://www.Unicode.org.

You need to come to some understanding first, instead of trying to solve some really imaginary problem.

—SA

Posted 24-Sep-14 18:54pm

Sergey Alexandrovich Kryukov

Updated 24-Sep-14 19:47pm

v2

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Sergey Alexandrovich Kryukov · Accepted Answer · 2014-09-23T06:24:00

There is no such thing as a "UTF string" per se. A string in .NET is always a Unicode string, never anything else.

If you have a string and encode it as UTF (any UTF), the result is an array of bytes. Different encodings can give you different bytes from the same string. Unicode defines a one-to-one correspondence between characters (understood as cultural entities, abstracted from their graphical glyphs and other detail) and integers called "code points" (understood in their abstract mathematical sense, without any concern for how they are represented by computers). UTFs define how code points are represented in bytes.

Now, .NET's internal representation is UTF-16LE, but the entire API is fully abstracted from this detail. In other words, I would formulate it as "any program based on an assumption about any particular representation of the string in a System.String object is incorrect".

You need to use System.Text.Encoding. Perhaps you are using it incorrectly. All you need is an understanding of what Unicode is. I hope my explanation helps you sort out your problem.
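To make the "same string, different bytes" point concrete, here is a small sketch that encodes the word from the question with two of the built-in encodings. The class name and the ToHex helper are only for illustration.

```csharp
using System;
using System.Text;

public static class EncodingBytesDemo
{
    // Formats a byte array as space-separated hex, e.g. "C3 AB".
    public static string ToHex(byte[] bytes) => BitConverter.ToString(bytes).Replace("-", " ");

    public static void Main()
    {
        // 'ë' is written as the escape \u00EB so the source file's own encoding does not matter.
        string text = "Sarand\u00EB";

        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        byte[] utf16 = Encoding.Unicode.GetBytes(text); // "Unicode" in .NET means UTF-16LE

        Console.WriteLine(text.Length);   // 7 (chars in the string)
        Console.WriteLine(ToHex(utf8));   // 53 61 72 61 6E 64 C3 AB
        Console.WriteLine(ToHex(utf16));  // 53 00 61 00 72 00 61 00 6E 00 64 00 EB 00

        // Decoding with the encoding that produced the bytes gives the original string back.
        Console.WriteLine(Encoding.UTF8.GetString(utf8) == text);     // True
        Console.WriteLine(Encoding.Unicode.GetString(utf16) == text); // True
    }
}
```

The string itself is the same in both cases; only the byte arrays differ, and each one is meaningful only together with the encoding that produced it.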

[EDIT]

I think you wanted to see something like "SarandÃ«". But why would you need it?
This happens if you have this word as UTF-8 data, save it as a plain text file without a BOM, and then mistakenly open it as ANSI text (for example, Windows-1252).

As the letter 'ë' is represented in UTF-8 by two bytes (0xC3 followed by 0xAB), you can view those two bytes in an ANSI, ASCII or non-standard "extended ASCII" representation. Doing so:
1) makes no sense;
2) is not always possible.

But you can do it by calling Encoding.UTF8.GetBytes on a string (which I think you have): http://msdn.microsoft.com/en-us/library/ds4kkd55%28v=vs.110%29.aspx.

And then interpret each byte the way you want, as ASCII or anything else. But why? :-)
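For what it's worth, here is a sketch of that byte reinterpretation, using ISO-8859-1 (Latin-1) as the "wrong" encoding. That choice is my assumption: Latin-1 maps the bytes 0xC3 and 0xAB to 'Ã' and '«', which matches the garbled form shown above, and it is available through Encoding.GetEncoding without extra setup. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // ISO-8859-1 maps every byte 0x00-0xFF to the code point with the same number.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Encodes as UTF-8, then (wrongly) decodes the bytes as Latin-1.
    public static string MisreadUtf8AsLatin1(string text) =>
        Latin1.GetString(Encoding.UTF8.GetBytes(text));

    // Reverses the mistake: recovers the original bytes, then decodes them as UTF-8.
    public static string RepairLatin1Mojibake(string garbled) =>
        Encoding.UTF8.GetString(Latin1.GetBytes(garbled));

    public static void Main()
    {
        string original = "Sarand\u00EB";                 // "Sarandë"
        string garbled = MisreadUtf8AsLatin1(original);

        // 'ë' became two characters: U+00C3 ('Ã') and U+00AB ('«').
        Console.WriteLine(garbled == "Sarand\u00C3\u00AB");            // True
        Console.WriteLine(garbled.Length);                             // 8

        // If text arrives already garbled this way, the same two steps in reverse repair it.
        Console.WriteLine(RepairLatin1Mojibake(garbled) == original);  // True
    }
}
```

The repair direction is usually the useful one: if a downloaded page shows "SarandÃ«", the bytes were most likely UTF-8 but were decoded with a single-byte encoding, and decoding them as UTF-8 in the first place avoids the problem.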

See also: http://www.unicode.org/faq/utf_bom.html.

—SA