I have a UTF-8 string ('Sarandë').
I want to get it exactly as it is, but I only get this string ('Sarandë').
I tried using Encoding, following some examples from the Internet, but they do not work.
Please help.

Detail: I want to read the string value ('Sarandë') from a website and add it to an SQLite database.
My field type is NVARCHAR(100),
but the value that gets inserted is ('Sarandë').
I converted it to byte[] and used Text.Encoding, but it is not working.
The page I want to get data from: http://www.infodriveindia.com/traderesources/port.aspx?&GridInfo=Ports010
Updated 24-Sep-14 5:38am
Comments
SteveyJDay 23-Sep-14 12:05pm    
The strings look the same to me... Please post the code you are using.
Sergey Alexandrovich Kryukov 23-Sep-14 12:46pm    
Chances are, OP wanted to see something like "SarandÃ«".
Please see my answer.
—SA

There is no such thing as a "UTF string" per se. A string in .NET is always a Unicode string, never anything else.

If you have a string and encode it with a UTF (any UTF), the result is an array of bytes. Different encodings can give you different bytes from the same string. Unicode defines a one-to-one correspondence between characters (understood as cultural entities, abstracted from their graphical glyphs and other details) and integers called "code points" (understood in their abstract mathematical sense, without any concern for how they are represented by computers). UTFs define how code points are represented as bytes.
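To make the distinction between code points and bytes concrete, here is a minimal sketch. It is written in Python only because it is compact; in .NET the rough counterparts would be GetBytes on Encoding.UTF8, Encoding.Unicode (UTF-16LE) and Encoding.UTF32. The variable names are illustrative.

```python
# One string, one sequence of code points, several byte encodings.
text = "Sarandë"

# Code points are plain integers, independent of any byte layout.
code_points = [ord(ch) for ch in text]

# Each UTF turns the same code points into different bytes.
utf8_bytes = text.encode("utf-8")
utf16le_bytes = text.encode("utf-16-le")
utf32le_bytes = text.encode("utf-32-le")

print(hex(code_points[-1]))   # 0xeb
print(utf8_bytes.hex(" "))    # 53 61 72 61 6e 64 c3 ab
print(len(utf8_bytes), len(utf16le_bytes), len(utf32le_bytes))  # 8 14 28
```

Decoding each byte array with the same encoding that produced it gives back the identical string; trouble starts only when the decoder and the encoder disagree.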

Now, .NET's internal representation is UTF-16LE, but the whole API is fully abstracted from this detail. In other words, I would formulate it as "any program based on an assumption of any particular representation of the string in a System.String object is incorrect".

You need to use System.Text.Encoding. Perhaps you are using it incorrectly. All you need is an understanding of what Unicode is. I hope my explanation helps you sort out your problem.

[EDIT]

I think you wanted to see something like "SarandÃ«". But why would you need it?
This happens if you have this word as UTF-8 data, save it as a plain text file without a BOM, and then mistakenly open it as ANSI text (a single-byte code page such as Windows-1252).
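The scenario just described can be reproduced in a few lines. This is only a sketch in Python, and it assumes that "ANSI" means Windows-1252, which is the usual case on Western-European Windows installations; the temporary file is purely illustrative.

```python
import os
import tempfile

text = "Sarandë"

# Create a scratch file and write the text as UTF-8.
# Python's "utf-8" codec writes no BOM.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Wrong: read the UTF-8 bytes as if they were Windows-1252.
    with open(path, "r", encoding="cp1252") as f:
        misread = f.read()

    # Right: read the file with the encoding it was written in.
    with open(path, "r", encoding="utf-8") as f:
        correct = f.read()
finally:
    os.remove(path)

print(misread)   # SarandÃ«
print(correct)   # Sarandë
```

The bytes in the file are the same in both reads; only the assumption about their encoding differs.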

As the letter 'ë' is represented in UTF-8 by two bytes (0xC3 followed by 0xAB), you can view those two bytes as ANSI, ASCII or non-standard "extended ASCII" characters. Doing so:
1) makes no sense;
2) is not always possible.

But you can do it by calling Encoding.UTF8.GetBytes on a string (which I think you have): http://msdn.microsoft.com/en-us/library/ds4kkd55%28v=vs.110%29.aspx.

And then interpret each byte the way you want, as ASCII or anything else. But why? :-)
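As a hedged illustration of both points (reinterpreting the bytes, and why that is "not always possible"), here is a short Python sketch; `encode("utf-8")` plays the role of Encoding.UTF8.GetBytes. The last step, which undoes the damage when a garbled string is all you have, is my own addition and works only if no bytes were lost along the way.

```python
text = "Sarandë"
raw = text.encode("utf-8")   # analogous to Encoding.UTF8.GetBytes in .NET

# Interpreting the bytes as Windows-1252 happens to work for this word.
garbled = raw.decode("cp1252")

# Interpreting them as strict ASCII is not possible: 0xC3 is outside 0..127.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Undo the wrong step: re-encode as Windows-1252, then decode as UTF-8.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)    # SarandÃ«
print(ascii_ok)   # False
print(repaired)   # Sarandë
```

In practice it is better to avoid the repair altogether and decode the downloaded bytes as UTF-8 in the first place.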

See also: http://www.unicode.org/faq/utf_bom.html.

—SA
 
Comments
ductuan_itpro 24-Sep-14 10:45am    
I've already explained my problem; hopefully there will be more solutions.
Sergey Alexandrovich Kryukov 24-Sep-14 12:35pm    
Don't you think I already answered in full? Your follow-up questions will be welcome.
—SA
Afzaal Ahmad Zeeshan 24-Sep-14 15:20pm    
I think you did. What I believe is that the OP wants to learn the basics about the technologies he is using. For you a +5, a nice answer.
Sergey Alexandrovich Kryukov 24-Sep-14 15:40pm    
Thank you, Afzaal.
Hopefully you are helping to convince OP that the root of the problem is just the understanding of how it all works.
—SA
Afzaal Ahmad Zeeshan 24-Sep-14 15:53pm    
Exactly, the only problem here is the basic understanding of Unicode itself, rather than "getting the string".
 
 
Comments
ductuan_itpro 24-Sep-14 11:57am    
This is not a URI, it's web content. I want to get it exactly as it is.
Richard MacCutchan 24-Sep-14 12:19pm    
It doesn't matter where it comes from, the principle is the same.
Your next question was recently auto-removed, due to some abuse reports. I'll answer again.

In that question, you asked about HTML representation using HTML entities. One of the answers explained the API for entities. I added the explanation of the background: in my Solution 1 (on this page) I explained what Unicode standardizes and explained what "code point" is.

Now, an HTML character entity has nothing to do with UTFs. Instead of encoding a code point as bytes (UTF-8, for example, uses a variable number of bytes per character, following an intricate algorithm which you don't have to know), an HTML character entity encodes the code point itself, in this case &#235; (decimal 235). If you run CharMap.EXE ("Character Map", the application bundled with all versions of Windows) and select code point 235 (U+00EB in hexadecimal), you will see the character 'ë', "Latin Small Letter E With Diaeresis".
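For illustration only, the same idea in a few lines of Python, using the standard `html` and `unicodedata` modules; in .NET, System.Net.WebUtility.HtmlDecode should play a similar role. The sample strings are made up for this sketch.

```python
import html
import unicodedata

# "&#235;" names code point 235 (decimal) directly; no byte encoding is involved.
decoded = html.unescape("Sarand&#235;")

# The hexadecimal form and the named entity refer to the same code point.
same_hex = html.unescape("Sarand&#xEB;")
same_named = html.unescape("Sarand&euml;")

last = decoded[-1]
print(decoded)                  # Sarandë
print(ord(last))                # 235
print(f"U+{ord(last):04X}")     # U+00EB
print(unicodedata.name(last))   # LATIN SMALL LETTER E WITH DIAERESIS
```

All three spellings decode to the same .NET-style Unicode string; which bytes it becomes afterwards depends only on the encoding chosen when it is written out.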

I hope it explains things.

Let's see: I explained the basics of how Unicode works, how UTFs work and how HTML works with character entities. You need to put it all together in your mind, and probably read up on the topic, maybe starting from http://www.Unicode.org.

You need to come to some understanding first, instead of trying to solve some really imaginary problem.

—SA
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


