What you "know" is not true!
First, how many bytes a character takes is not a pure characteristic of the language. And of course, there is no such thing as a "normal" language. Core Unicode does not define how many bytes each character has; it defines the set of
code points: a correspondence between a character as a cultural phenomenon, abstracted from its concrete glyph, and a set of integer values understood in the mathematical sense, abstracted from their computer representation.
Encodings called UTFs define how to represent each code point in bytes. Only UTF-32 uses a fixed 4 bytes per character. Byte-oriented UTF-8 uses an interesting algorithm in which a character takes 1, 2, 3 or 4 bytes, with the sequence length encoded in the leading byte. And UTF-16 is not a 16-bit code (!): a character can take either 16 or 32 bits, because code points outside the
Basic Multilingual Plane (BMP) are expressed as a surrogate pair, two 16-bit code units. Also, the UTF-16 and UTF-32 encodings can be little-endian or big-endian.
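If you want to see this for yourself, here is a minimal Python 3 sketch; the sample characters are my own choice, picked to cover 1-, 2-, 3- and 4-byte UTF-8 sequences plus a non-BMP code point, and the "-le" codec variants are used so the byte-order mark does not inflate the counts:

# How many bytes does the same character take in each UTF?
for ch in "A", "\u00E9", "\u20AC", "\U0001F600":  # A, é, €, an emoji outside the BMP
    print(
        f"U+{ord(ch):04X}:",
        f"UTF-8 = {len(ch.encode('utf-8'))},",
        f"UTF-16 = {len(ch.encode('utf-16-le'))},",
        f"UTF-32 = {len(ch.encode('utf-32-le'))} bytes",
    )

The emoji (U+1F600) comes out as 4 bytes even in UTF-16: that is the surrogate pair at work.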
Now, about "normal" languages. Which language do you want to consider. American English perhaps? All expressed in ASCII, code points, 0 to 127, right? Think again! It depends on what you consider a "language". How about fully-fledged punctuation used in this language? Consider, for example, correct typography for dash and quotation marks:
—, – “ ”
. Try to type them in your keyboard. The code points are 0x2013, 0x2014, 0x201C and 0x201D. Try to squeeze them in one byte — good luck!
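A quick way to convince yourself, again a Python 3 sketch (the sample string is my own):

# Typographically correct English is already outside ASCII.
quoted = "\u201CHello\u201D \u2013 see?"
try:
    quoted.encode("ascii")
except UnicodeEncodeError as err:
    print("Not ASCII:", err)

for ch in "\u2013", "\u2014", "\u201C", "\u201D":
    print(f"U+{ord(ch):04X} takes {len(ch.encode('utf-8'))} bytes in UTF-8")

Each of these four punctuation characters needs 3 bytes in UTF-8, so even "plain English" text is not a one-byte-per-character affair once the typography is correct.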
See http://unicode.org/ and http://unicode.org/faq/utf_bom.html.
Please don't make false statements; understand things yourself first.
—SA