Hello good people,

As the title says, I have a non-English string I got from the web (via URLDownloadToFile()) and I am trying to convert it to readable, MySQL-friendly Unicode. This is the code I found on MSDN, but somehow it fails to do the work for me (strLine is the input string). Could anybody please tell me what is wrong? Thanks a lot.

VB
Dim utf8 As Encoding = Encoding.UTF8
Dim unicode As Encoding = Encoding.Unicode
Dim utf8Bytes As Byte() = utf8.GetBytes(strLine)
Dim unicodeBytes As Byte() = Encoding.Convert(utf8, unicode, utf8Bytes)
Dim unicodeChars(unicode.GetCharCount(unicodeBytes, 0, unicodeBytes.Length) - 1) As Char
unicode.GetChars(unicodeBytes, 0, unicodeBytes.Length, unicodeChars, 0)
Dim unicodeString As New String(unicodeChars)
Posted
Updated 20-Jun-13 6:48am
v3
Comments
Marc A. Brown 20-Jun-13 11:47am
   
Please expand. What results are you getting? Something a bit more descriptive than "it fails to do the work for me" please.
g77777 20-Jun-13 12:01pm
   
Thanks for answering. I don't get anything readable, just gibberish: a bunch of odd characters where the original string was letters.
Sergey Alexandrovich Kryukov 20-Jun-13 18:07pm
   
Nothing in your question establishes that this is really UTF-8. It depends on how you got your string.
—SA
Marc A. Brown 21-Jun-13 9:12am
   
Perhaps so, but taking him at his word that it is UTF-8, how are the proposed solutions? :)
And it would appear that there's a phantom one-voter about. My solution got one-voted by someone with fairly high rep, but without comment. I'm not concerned about my rep points, but if my answer is right (or at least close, and the OP's comment on my solution indicates that it is), it's a shame to see it downvoted: anyone else searching for a solution to the same issue may discount it because of that.

You have a downloaded text file in UTF-8 format. You must have read that file to get the string, so why not set the encoding when you read it and let the reader do the conversion?

VB
' set monospaced font
TextBox1.Font = New System.Drawing.Font("DejaVu Sans Mono", 10, _
                                        System.Drawing.FontStyle.Regular, _
                                        System.Drawing.GraphicsUnit.Point)

Dim fs As IO.FileStream = IO.File.OpenRead(Utf8_FilePath)
Dim sr As New IO.StreamReader(fs, System.Text.Encoding.UTF8)

While sr.Peek <> -1
   TextBox1.AppendText(sr.ReadLine() & vbCrLf)
End While
sr.Close()
Even less code:
VB
' set monospaced font
TextBox1.Font = New System.Drawing.Font("DejaVu Sans Mono", 10, _
                                        System.Drawing.FontStyle.Regular, _
                                        System.Drawing.GraphicsUnit.Point)


' ref: http://msdn.microsoft.com/en-us/library/system.io.file.opentext.aspx
' Opens an existing UTF-8 encoded text file for reading.
Dim sr As IO.StreamReader = IO.File.OpenText(Utf8_FilePath)

While sr.Peek <> -1
   TextBox1.AppendText(sr.ReadLine() & vbCrLf)
End While
sr.Close()

Code was tested against this file: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt[^]
   
v2
Comments
g77777 22-Jun-13 1:47am
   
Thanks for the answer. It isn't relevant to my case, as I'm supposed to get the string directly from a website (I am writing a web service). The file thing is just for debugging.
Take a look at this[^]. The poster was originally doing pretty much what you are trying, with poor results, but the answer there seems to have solved his problem. His code is in C#, but you should be able to convert it easily enough.

Here's the code from the post:
C#
private byte[] GetRawBytes(string str)
{
  int charcount = str.Length;
  byte[] byttemp = new byte[charcount];
  
  for (int i = 0; i < charcount; i++)
  {
    byttemp[i] = (byte)str[i];
  }

  return byttemp;
}

private string UTF8toUnicode(string str)
{
  byte[] bytUTF8;
  byte[] bytUnicode;
  string strUnicode = String.Empty;

  bytUTF8 = GetRawBytes(str);
  bytUnicode = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, bytUTF8);
  strUnicode = Encoding.Unicode.GetString(bytUnicode);
  return strUnicode;
}
   
Comments
g77777 20-Jun-13 17:35pm
   
It almost worked... using byttemp(i) = CByte(Asc(str(i))) in VB.NET I managed to get all of the original string, but with one additional character, which was translated as '?', so MySQL didn't like it and rejected my string. Any suggestions? (BTW, AscW caused an overflow.)
Marc A. Brown 20-Jun-13 18:01pm
   
I don't do VB any more, so I'm out of practice. :) Do you have to have the Asc() call in there?
Marc A. Brown 20-Jun-13 18:04pm
   
Instead of doing CByte(Asc(str(i))), try Convert.ToByte(str(i)).
Sergey Alexandrovich Kryukov 21-Jun-13 10:30am
   
Marc, I'm sorry I did not explain it to you soon enough and sorry for the trouble. I will of course explain it and my vote.

Unfortunately, your solution is confusing at best and based on a misconception.

Here is the idea: there is no such thing as a "UTF-8 string". UTFs are always represented as raw bytes, such as with the byte[] type. Internally, all .NET strings are stored in memory as UTF-16LE (which Microsoft's API jargon confusingly calls "Unicode"; this is misleading, because Unicode is a character standard, not any particular machine representation of characters), but one should never rely on this knowledge in code: the APIs let you work with strings without knowing their machine representation, which is a very good abstraction. When you manipulate strings, you always use string types; when you save and load string data, you use UTFs.

You are probably considering the case where the parameter string str in your second function was obtained as a string wrongly loaded from some stream with the wrong encoding. It would be good to explain that, but whatever happened, it could not be a "UTF-8 string". And, importantly, this cannot solve the problem: it is already too late, as the user already has '?' in place of the original characters. Moreover, when you cast a string element to byte in GetRawBytes, you assume for some reason that a character is a byte. This is not so. It could only work if the input text contained nothing but "one-byte characters" (even though each technically occupies a 16-bit word, for code points below 256 only the first byte is informative), but in that case the problem would not need solving at all. The OP's problem involves a "non-English string", and in that case UTF-8 uses 2 to 4 bytes per character, while UTF-16 uses 1 or (rarely) 2 16-bit words.
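A quick way to see the mismatch in VB.NET (an illustrative sketch; the sample string is hypothetical):

VB
' "café" has 4 characters; "é" is code point U+00E9.
Dim s As String = "café"

' Correct: let the encoder produce the UTF-8 bytes.
' This yields 5 bytes, because "é" takes 2 bytes in UTF-8.
Dim utf8Bytes As Byte() = System.Text.Encoding.UTF8.GetBytes(s)

' Wrong: casting each Char to Byte keeps only the low 8 bits of the
' UTF-16 code unit, producing 4 bytes and silently mangling any
' character above U+00FF.
Dim truncated(s.Length - 1) As Byte
For i As Integer = 0 To s.Length - 1
    truncated(i) = CByte(AscW(s(i)) And &HFF)
Next

The length difference (5 UTF-8 bytes versus 4 characters) is exactly why a character-to-byte cast cannot stand in for a real encoder.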

If you are thinking of Unicode as an encoding (many think it is a 16-bit encoding), that is wrong. It is not an encoding (in the .NET sense) at all; the UTFs are. And all UTFs except UTF-32 use variable-size character representations, each with a different mechanism.

Now, you may ask: what about the correct solution? It really depends on the previous steps, on how the OP got the string. In some cases the problem can be solved; in others, part of the data is lost and there is no solution. I need to know exactly what happened. Besides, it's sometimes possible to look at the file and take a correct guess.

—SA
Marc A. Brown 21-Jun-13 11:22am
   
Sergey,
Ah, so *you* were the phantom. :) As I said above, I'm not concerned about the 1-vote for myself, but for someone else looking for a solution that appears to have (mostly) worked for the OP. Thanks for the explanation; that's very helpful.
I actually already knew, from poking around for a solution, that a UTF-8 "string" is really a sequence of bytes rather than a string. Looking at it further, it would appear you're correct: except for characters that correspond to ASCII, this solution fails. A proper solution would require a bit-wise reading of the first byte of each sequence to determine the number of bytes in that character, then reading that many bytes, and repeating until the entire source was read.
Simpler would be to declare me emperor of the world so I can outlaw non-English languages. ;)
--mab
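The byte-counting described above can be sketched in VB.NET as follows (illustrative only; in practice System.Text.Encoding.UTF8.GetString does all of this, including validation of continuation bytes):

VB
' Sketch: length of a UTF-8 sequence, determined from its lead byte.
' 0xxxxxxx = 1 byte, 110xxxxx = 2, 1110xxxx = 3, 11110xxx = 4.
Function Utf8SequenceLength(ByVal lead As Byte) As Integer
    If (lead And &H80) = 0 Then Return 1
    If (lead And &HE0) = &HC0 Then Return 2
    If (lead And &HF0) = &HE0 Then Return 3
    If (lead And &HF8) = &HF0 Then Return 4
    Return -1 ' continuation byte (10xxxxxx) or invalid lead byte
End Function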
Sergey Alexandrovich Kryukov 21-Jun-13 16:05pm
   
Yes. Sorry, man, I meant to provide the explanation a bit sooner. You see, I seriously think that "mostly worked" is much worse than nothing (in this case and many others), because it leads in the wrong direction and creates an illusion of progress. I think we can continue this discussion only if the OP gives us comprehensive information on the artifact used as input: what its content is and, at least, what the steps were to get it.

Beyond that, getting correct content from, say, a Web page can be quite simple: one downloads the page as-is, in a binary way, and gets a file. One should look at the Content-Type header of the original Web page and figure out what it is. WARNING: if one saves it to a file, the file may not carry a BOM, and if the encoding information is also missing from the HTTP-EQUIV meta tag, this information may be lost; some trial and error can help. Chances are it is UTF-8, but some authors still use obsolete local encodings (to save space, by the way), in which case the situation is somewhat worse. For such cases, some Web browsers have "guess engines" which are very helpful.

Then, when a file is downloaded, one should read it with correct encoding, from the very beginning, otherwise the data can be lost.
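As an illustration of fetching the content with an explicitly chosen encoding (a sketch with a placeholder URL, assuming the Content-Type check showed the page is UTF-8):

VB
' Sketch: download a page as text, decoding with the encoding you
' determined from the Content-Type header (UTF-8 assumed here).
Dim wc As New System.Net.WebClient()
wc.Encoding = System.Text.Encoding.UTF8
Dim page As String = wc.DownloadString("http://example.com/")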

Basically, that's all.

—SA

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
