Convert UTF-8 string to Unicode - VB.NET

Question

Hello good people,

As the title says, I have a non-English string that I got from the web (via URLDownloadToFile()), and I am trying to convert it to readable, MySQL-friendly Unicode. This is the code I found on MSDN, but somehow it does not work for me (strLine is the input string). Could anybody please tell me what is wrong? Thanks a lot.

VB

Dim utf8 As Encoding = Encoding.UTF8
Dim unicode As Encoding = Encoding.Unicode
Dim utf8Bytes As Byte() = utf8.GetBytes(strLine)
Dim unicodeBytes As Byte() = Encoding.Convert(utf8, unicode, utf8Bytes)
Dim unicodeChars(unicode.GetCharCount(unicodeBytes, 0, unicodeBytes.Length) - 1) As Char
unicode.GetChars(unicodeBytes, 0, unicodeBytes.Length, unicodeChars, 0)
Dim unicodeString As New String(unicodeChars)

Posted 20-Jun-13 5:12am

g77777

Updated 20-Jun-13 5:48am

Marc A. Brown

v3

Comments

Marc A. Brown 20-Jun-13 11:47am

Please expand. What results are you getting? Something a bit more descriptive than "it fails to do the work for me" please.

g77777 20-Jun-13 12:01pm

Thanks for answering. I don't get anything readable, just gibberish: a bunch of characters, while the original string was letters.

Sergey Alexandrovich Kryukov 20-Jun-13 18:07pm

Nothing in your question shows that this is really UTF-8. It depends on how you got your string.
—SA

Marc A. Brown 21-Jun-13 9:12am

Perhaps so but taking him at his word that it is UTF-8, how're the proposed solutions? :)
And it would appear that there's a phantom univoter about. My solution got univoted by someone with fairly high rep, but without comment. I'm not concerned about my rep points but if my answer is right (or at least close, and the OP's comment to my solution indicates that it is), it sucks to see it downvoted since anyone else searching for a solution to the same issue may discount the solution because of that.

2 solutions

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

TnTinMn · Answer 1 · 2013-06-20T17:59:00

You have a downloaded text file in UTF-8 format. You must have read the file to get that string. So why not set the encoding when you read it and let the reader do the conversion?

VB

' set monospaced font
TextBox1.Font = New System.Drawing.Font("DejaVu Sans Mono", 10, _
                                        System.Drawing.FontStyle.Regular, _
                                        System.Drawing.GraphicsUnit.Point)

Dim fs As IO.FileStream = IO.File.OpenRead(Utf8_FilePath)
Dim sr As New IO.StreamReader(fs, System.Text.Encoding.UTF8)

While sr.Peek <> -1
   TextBox1.AppendText(sr.ReadLine() & vbCrLf)
End While
sr.Close()

Even less code:

VB

' set monospaced font
TextBox1.Font = New System.Drawing.Font("DejaVu Sans Mono", 10, _
                                        System.Drawing.FontStyle.Regular, _
                                        System.Drawing.GraphicsUnit.Point)


' ref: http://msdn.microsoft.com/en-us/library/system.io.file.opentext.aspx
' Opens an existing UTF-8 encoded text file for reading.
Dim sr As IO.StreamReader = IO.File.OpenText(Utf8_FilePath)

While sr.Peek <> -1
   TextBox1.AppendText(sr.ReadLine() & vbCrLf)
End While
sr.Close()

Code was tested against this file: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt
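If the whole file fits comfortably in memory, the same idea can be written without an explicit reader loop. This is only a minimal sketch: `ReadUtf8File` is a hypothetical helper name, and the temporary file exists just to make the example self-contained. `File.ReadAllText` with an explicit encoding is standard .NET API.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module Utf8FileDemo
    ' Hypothetical helper: reads the whole file, decoding its bytes as UTF-8.
    Function ReadUtf8File(filePath As String) As String
        Return File.ReadAllText(filePath, Encoding.UTF8)
    End Function

    Sub Main()
        ' Write a small UTF-8 file (no BOM) so the sketch does not depend on a download.
        Dim original As String = "Ελληνικά, русский"
        Dim filePath As String = Path.GetTempFileName()
        File.WriteAllText(filePath, original, New UTF8Encoding(False))

        Dim content As String = ReadUtf8File(filePath)
        Console.WriteLine(content = original)

        File.Delete(filePath)
    End Sub
End Module
```

The decoding happens once, at the point where bytes become a String, which is the same design choice as passing the encoding to StreamReader.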

Marc A. Brown · Answer 2 · 2013-06-20T07:21:00

Solution 1

Take a look at this post. The poster was originally doing pretty much what you are trying, with poor results, but his answer seems to have solved his problem. His code is in C#, but you should be able to convert it easily enough.

Here's the code from the post:

C#

private byte[] GetRawBytes(string str)
{
  int charcount = str.Length;
  byte[] byttemp = new byte[charcount];
  
  for (int i = 0; i < charcount; i++)
  {
    byttemp[i] = (byte)str[i];
  }

  return byttemp;
}

private string UTF8toUnicode(string str)
{
  byte[] bytUTF8;
  byte[] bytUnicode;
  string strUnicode = String.Empty;

  bytUTF8 = GetRawBytes(str);
  bytUnicode = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, bytUTF8);
  strUnicode = Encoding.Unicode.GetString(bytUnicode);
  return strUnicode;
}

Posted 20-Jun-13 7:21am

Marc A. Brown

Comments

g77777 20-Jun-13 17:35pm

It almost worked. Using "byttemp(i) = CByte(Asc(str(i)))" in VB.NET managed to get all of the original string, but with an additional character that was translated as '?', so MySQL didn't like it and rejected my string. Any suggestions? (BTW, AscW caused an overflow.)

Marc A. Brown 20-Jun-13 18:01pm

I don't do VB any more, so I'm out of practice. :) Do you have to have the Asc() call in there?

Marc A. Brown 20-Jun-13 18:04pm

Instead of doing CByte(Asc(str(i))), try Convert.ToByte(str(i)).
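A note on the suggestion above: `Convert.ToByte` on a Char throws OverflowException for any character above U+00FF, so it would fail wherever AscW overflowed. If the string really was produced by decoding UTF-8 bytes with a single-byte code page, one way to undo the damage is to re-encode it with that same code page and decode the result as UTF-8. The sketch below is built on assumptions the thread never confirms: `RepairMisdecodedUtf8` is a hypothetical helper, Latin-1 stands in for whatever code page was actually used, and the stray '?' is guessed to be a leading byte order mark.

```vbnet
Imports System
Imports System.Linq
Imports System.Text

Module MojibakeRepairDemo
    ' Hypothetical helper. Assumes "garbled" was produced by decoding UTF-8
    ' bytes with the single-byte code page passed as wrongEncoding.
    Function RepairMisdecodedUtf8(garbled As String, wrongEncoding As Encoding) As String
        ' Recover the original bytes by reversing the wrong decoding step.
        Dim rawBytes As Byte() = wrongEncoding.GetBytes(garbled)
        ' Decode those bytes the way they should have been decoded.
        Dim repaired As String = Encoding.UTF8.GetString(rawBytes)
        ' GetString keeps a leading byte order mark as U+FEFF; drop it if present.
        Return repaired.TrimStart(ChrW(&HFEFF))
    End Function

    Sub Main()
        ' Latin-1 maps every byte 0-255 to the code point with the same value.
        Dim latin1 As Encoding = Encoding.GetEncoding("ISO-8859-1")

        ' Simulate the damage: UTF-8 bytes (with BOM) decoded as Latin-1.
        Dim original As String = "Ελληνικά"
        Dim bom As Byte() = Encoding.UTF8.GetPreamble()
        Dim payload As Byte() = Encoding.UTF8.GetBytes(original)
        Dim garbled As String = latin1.GetString(bom.Concat(payload).ToArray())

        Dim repaired As String = RepairMisdecodedUtf8(garbled, latin1)
        Console.WriteLine(repaired = original)
    End Sub
End Module
```

If the text was decoded with the Windows ANSI code page instead (on .NET Framework that is `Encoding.Default`), pass that encoding rather than Latin-1. The repair only works when the wrong decoding step lost no bytes.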

Sergey Alexandrovich Kryukov 21-Jun-13 10:30am

Marc, I'm sorry I did not explain it to you soon enough and sorry for the trouble. I will of course explain it and my vote.

Unfortunately, your solution is confusing at best and based on a misconception.

Here is the idea: there is no such thing as a "UTF-8" string. UTFs are always represented as raw bytes, such as with the "byte[]" type. You should also understand that internally all .NET strings are represented in memory as UTF-16LE (in the weird and incorrect jargon of Microsoft APIs it is named "Unicode", which is extremely confusing, because the term "Unicode" is unrelated to its UTFs as well as to any particular machine representation of characters). But the idea is: one should never use this knowledge in code, as all APIs allow you to use strings without knowing their machine representation, which is a very good abstraction. When you manipulate strings, you always use string types; when you save and load string data, you use UTFs.
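This is also why the MSDN-style chain in the question cannot help. It starts from a String, so encoding to UTF-8, converting to UTF-16 and decoding again simply hands back the same characters, garbled or not. A minimal sketch of that round trip, with a sample string chosen only for illustration:

```vbnet
Imports System
Imports System.Text

Module RoundTripDemo
    Sub Main()
        Dim source As String = "Ελληνικά"

        ' String -> UTF-8 bytes -> UTF-16LE bytes -> String, as in the question.
        Dim utf8Bytes As Byte() = Encoding.UTF8.GetBytes(source)
        Dim utf16Bytes As Byte() = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes)
        Dim result As String = Encoding.Unicode.GetString(utf16Bytes)

        ' The chain only re-encodes and decodes the same characters.
        Console.WriteLine(result = source)   ' prints True
    End Sub
End Module
```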

You probably have in mind the case when the parameter "string str" in your second function was obtained as a string wrongly loaded from some stream with the wrong UTF. That would be good to explain, but, whatever happened, it could not be a "UTF-8" string. And, importantly, it could not solve the problem. It could be too late, as the user already had "?" instead of characters. Anyway, when you typecast a string element to byte in GetRawBytes, you for some reason assume that a character is a byte. This is not so. It could work only if the input text contained only "one-byte characters" (even though each technically occupies a 16-bit word, for code points below 256 only the first byte would be informative), but in this case the problem would not need a solution at all. The problem the OP has is related to some "non-English string", and in this case UTF-8 would use 2-3 bytes per character, and UTF-16 would use 1 or (rarely) 2 16-bit words.
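The sizes mentioned here are easy to check with `GetByteCount`. A small sketch, with code points picked purely for illustration (Latin A, e-acute, Hebrew alef, the euro sign, and one character outside the Basic Multilingual Plane):

```vbnet
Imports System
Imports System.Text

Module ByteCountDemo
    Sub Main()
        ' Illustrative code points: A, e-acute, Hebrew alef, euro sign, an emoji.
        Dim codePoints As Integer() = {&H41, &HE9, &H5D0, &H20AC, &H1F600}

        For Each cp As Integer In codePoints
            Dim s As String = Char.ConvertFromUtf32(cp)
            Dim utf8Count As Integer = Encoding.UTF8.GetByteCount(s)
            Dim utf16Count As Integer = Encoding.Unicode.GetByteCount(s)
            Console.WriteLine("U+{0:X4}: UTF-8 {1} bytes, UTF-16 {2} bytes", cp, utf8Count, utf16Count)
        Next
    End Sub
End Module
```

UTF-8 needs 1, 2, 2, 3 and 4 bytes for these five characters, while UTF-16 needs 2 bytes (one 16-bit word) for the first four and 4 bytes (two words) for the last.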

If you are thinking of Unicode as an encoding (many think it is a 16-bit encoding), this is wrong. It is not an encoding (in the .NET sense) at all. UTFs are. And all UTFs except UTF-32 use variable-size characters, but with different mechanisms.

Now, you would ask me, how about the correct solution? It really depends on the previous steps, on how the OP got the string. In some cases the problem could be solved; in others part of the data could be lost, and then there is no solution. I need to know what happened exactly. Besides, it's sometimes possible to look at the file and make a correct guess.

—SA

Marc A. Brown 21-Jun-13 11:22am

Sergey,
Ah, so *you* were the phantom. :) As I said above, I'm not concerned about the 1-vote for myself but for someone else looking for a solution that appears to have (mostly) worked for the OP. Thanks for the explanation -- that's very helpful.
I actually already knew, from poking around for a solution, that a UTF-8 "string" is actually a sequence of bytes rather than a string. Now, in looking at it further, it would appear that you're correct that, with the exception of characters that correspond to ASCII, this solution fails. It would seem that a proper solution would require a bit-wise reading of the first byte of data to determine the number of bytes for each character, then reading that number of bytes, then repeating until the entire source was read.
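The bit-wise reading described here can be sketched as follows. It is for illustration only: `Encoding.UTF8.GetString` already performs this decoding (and full validation), `Utf8SequenceLength` is a hypothetical helper, and it does not reject overlong or out-of-range lead bytes.

```vbnet
Imports System
Imports System.Text

Module Utf8LeadByteDemo
    ' Hypothetical helper: number of bytes in the UTF-8 sequence that starts
    ' with leadByte, or 0 if the byte cannot start a sequence.
    Function Utf8SequenceLength(leadByte As Byte) As Integer
        If (leadByte And &H80) = 0 Then Return 1      ' 0xxxxxxx: ASCII
        If (leadByte And &HE0) = &HC0 Then Return 2   ' 110xxxxx
        If (leadByte And &HF0) = &HE0 Then Return 3   ' 1110xxxx
        If (leadByte And &HF8) = &HF0 Then Return 4   ' 11110xxx
        Return 0                                      ' continuation or invalid byte
    End Function

    Sub Main()
        ' "A" followed by the euro sign encodes as 41 E2 82 AC.
        Dim bytes As Byte() = Encoding.UTF8.GetBytes("A" & ChrW(&H20AC))

        Dim i As Integer = 0
        While i < bytes.Length
            Dim length As Integer = Utf8SequenceLength(bytes(i))
            If length = 0 Then Exit While   ' malformed input: stop the sketch here
            Console.WriteLine("offset {0}: {1}-byte sequence", i, length)
            i += length
        End While
    End Sub
End Module
```

In practice there is rarely a reason to hand-roll this; decoding the raw bytes with the right Encoding does the same work.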
Simpler would be to declare me emperor of the world and I'll outlaw non-English languages. ;)
--mab

Sergey Alexandrovich Kryukov 21-Jun-13 16:05pm

Yes. Sorry, man, I tried to provide an explanation a bit later. You see, I seriously think that "mostly worked" is much worse than nothing (in this case and many others), because it leads in the wrong direction and creates an illusion of progress. I think we can continue with this discussion only if the OP gives us comprehensive information on the artifact used as input: what its content is, or at least what the steps to get it were.

Beyond that, getting correct content from, say, a Web page could be damn simple: one downloads the Web page as is, in a binary way, and gets a file. One should look at the Content-Type of the original Web page and figure out what it is. WARNING: if one saves it in a file, the file may have no BOM, and if the encoding information is also missing from HTTP-EQUIV, this information is lost, and some trial and error could help. Chances are it is UTF-8, but some local authors use obsolete local encodings (to save space, by the way), and then the situation is somewhat worse. In such cases, some Web browsers have "guess engines" which are very helpful.

Then, when a file is downloaded, one should read it with correct encoding, from the very beginning, otherwise the data can be lost.
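The workflow above (binary download, check Content-Type, decode once with the right encoding) might look like the sketch below. Several things here are assumptions rather than anything the thread specifies: `EncodingFromContentType` is a hypothetical helper, the header value and body are simulated instead of downloaded, and UTF-8 is used as the fallback when no charset is declared.

```vbnet
Imports System
Imports System.Net.Mime
Imports System.Text

Module ContentTypeDemo
    ' Hypothetical helper: pick the decoder named by a Content-Type header
    ' value, falling back to UTF-8 (an assumption) when no charset is given.
    ' A malformed header or an unknown charset name raises an exception.
    Function EncodingFromContentType(headerValue As String) As Encoding
        If String.IsNullOrEmpty(headerValue) Then Return Encoding.UTF8
        Dim parsed As New ContentType(headerValue)
        If String.IsNullOrEmpty(parsed.CharSet) Then Return Encoding.UTF8
        Return Encoding.GetEncoding(parsed.CharSet)
    End Function

    Sub Main()
        ' In a real download these would come from the HTTP response, for
        ' example WebClient.DownloadData and WebClient.ResponseHeaders.
        Dim original As String = "Ελληνικά"
        Dim body As Byte() = Encoding.UTF8.GetBytes(original)
        Dim header As String = "text/html; charset=utf-8"

        ' Decode the raw bytes exactly once, with the declared encoding.
        Dim decoded As String = EncodingFromContentType(header).GetString(body)
        Console.WriteLine(decoded = original)
    End Sub
End Module
```

Keeping the download as a byte array until the encoding is known avoids the lossy intermediate String that caused the trouble in this thread.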

Basically, that's all.

—SA