
UTF-8 UTILITY FUNCTIONS IN C++ (Platform Independent Code)

By Boby Thomas P, 6 Apr 2007
 

Introduction

Recently I needed to check whether some files supplied by an external customer were valid UTF-8. I searched for sample applications and code but could not find any, so I decided to write my own.

UTF-8 Format

UTF-8 (Unicode Transformation Format, 8-bit) is, as the name suggests, a variable-length encoding format for Unicode. Unicode contains the characters required to represent practically all known languages, including most Indian languages such as Malayalam, Bengali, Gujarati, Oriya, Tamil, Telugu and Kannada.

Unicode assigns an integer number (a code point) to each character, but it does not by itself fix how that number is stored or encoded. That is defined by encoding formats such as UCS-2, UTF-7, UTF-8 and UTF-16.

What makes UTF-8 attractive compared to other encodings is that every standard ASCII character is encoded exactly as it is in ASCII. That means code written to handle ASCII characters can remain as it is.

A few points I would like to highlight here:

  • A plain ASCII file (bytes from 0x00 to 0x7F only) is a valid UTF-8 file, because characters in the range 0x00 to 0x7F are encoded in UTF-8 as the same single byte.
  • 0xFE and 0xFF are two byte values that can never appear in a UTF-8 file.
  • From the first byte of a UTF-8 encoded character, we can find out the total number of bytes used for that character.
  • It is possible to encode all 2^31 UCS characters in UTF-8.
  • The first byte of a non-ASCII character (> 0x007F) will be in the range 0xC0 to 0xFD.
  • All the bytes in the sequence for a non-ASCII character are 0x80 or above. That means no ASCII byte can appear inside a multi-byte UTF-8 sequence.
  • The bits of a character are stored most significant first (big-endian order) across the bytes of its sequence, so UTF-8 has no byte-order variants.
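
The points about the first byte can be sketched in a few lines of C++. This is an illustration only, not the article's findLengthUTF implementation; the name utf8SequenceLength is hypothetical, and the sketch follows the original 1-to-6-byte scheme described in this article.

```cpp
#include <cstdint>

// Returns the total number of bytes in the UTF-8 sequence that starts with
// firstByte, using the original 1-to-6-byte scheme. Returns -1 for bytes
// that cannot start a sequence (continuation bytes 0x80-0xBF, 0xFE, 0xFF).
int utf8SequenceLength(std::uint8_t firstByte)
{
    if (firstByte < 0x80) return 1;   // 0xxxxxxx: plain ASCII
    if (firstByte < 0xC0) return -1;  // 10xxxxxx: continuation byte
    if (firstByte < 0xE0) return 2;   // 110xxxxx
    if (firstByte < 0xF0) return 3;   // 1110xxxx
    if (firstByte < 0xF8) return 4;   // 11110xxx
    if (firstByte < 0xFC) return 5;   // 111110xx
    if (firstByte < 0xFE) return 6;   // 1111110x
    return -1;                        // 0xFE and 0xFF never occur in UTF-8
}
```

A validator would call this on the first byte of each character and then check that the following bytes all have the form 10xxxxxx.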

Use this application if you want to convert between UTF-8 and ASCII.

Calculation of Byte Sequence for a Character

Code point range           Byte sequence (binary)
U-00000000 – U-0000007F    0xxxxxxx
U-00000080 – U-000007FF    110xxxxx 10xxxxxx
U-00000800 – U-0000FFFF    1110xxxx 10xxxxxx 10xxxxxx
U-00010000 – U-001FFFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 – U-03FFFFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 – U-7FFFFFFF    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
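
As a rough sketch of how these ranges turn into code, the function below picks the sequence length from the code point and then fills in the x bits. The name encodeUtf8 is hypothetical and this is not the article's convertHex2UTF; it works on integers rather than hex strings and follows the original 6-byte scheme shown above.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Encodes one code point as UTF-8 using the original 1-to-6-byte ranges.
// Returns an empty string for values above 0x7FFFFFFF.
std::string encodeUtf8(std::uint32_t cp)
{
    std::string out;
    if (cp > 0x7FFFFFFFu) return out;

    std::size_t len;
    if (cp < 0x80)           len = 1;
    else if (cp < 0x800)     len = 2;
    else if (cp < 0x10000)   len = 3;
    else if (cp < 0x200000)  len = 4;
    else if (cp < 0x4000000) len = 5;
    else                     len = 6;

    if (len == 1) {
        out.push_back(static_cast<char>(cp));
        return out;
    }

    // Lead-byte markers for lengths 2..6: 110, 1110, 11110, 111110, 1111110.
    static const std::uint8_t prefix[] = {0, 0, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC};

    out.resize(len);
    // Fill continuation bytes from the end: 10xxxxxx, six payload bits each.
    for (std::size_t i = len - 1; i > 0; --i) {
        out[i] = static_cast<char>(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    // Whatever bits remain go into the lead byte after its marker.
    out[0] = static_cast<char>(prefix[len] | cp);
    return out;
}
```

For example, encodeUtf8(0x80) yields the two bytes C2 80 and encodeUtf8(0xFFFD) yields EF BF BD.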

Overlong Sequences

Care must be taken when the input contains an overlong sequence. A UTF-8 decoder must not accept a character encoded with more bytes than necessary.

For example, the character 'A' (0x41) should be encoded as the single byte 0x41. Overlong encodings of the same character are:

0xC1 0x81

0xE0 0x81 0x81

0xF0 0x80 0x81 0x81

0xF8 0x80 0x80 0x81 0x81

0xFC 0x80 0x80 0x80 0x81 0x81 

A naive decoder will decode each of these sequences to 0x41. But they are not permitted in UTF-8 and must be treated as invalid UTF-8 character sequences.
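
One way to reject overlong sequences is to decode the value and then compare it with the smallest value that genuinely needs that many bytes. The sketch below assumes the 6-byte scheme from this article; decodeStrict is a hypothetical name and not one of the functions in the download.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decodes the single UTF-8 sequence held in `bytes` and stores the result in
// `cp`. Returns false if the sequence is malformed or overlong.
bool decodeStrict(const std::string& bytes, std::uint32_t& cp)
{
    if (bytes.empty()) return false;
    const std::uint8_t first = static_cast<std::uint8_t>(bytes[0]);

    std::size_t len;
    std::uint32_t value;
    if (first < 0x80)      { len = 1; value = first; }
    else if (first < 0xC0) { return false; }            // stray continuation byte
    else if (first < 0xE0) { len = 2; value = first & 0x1F; }
    else if (first < 0xF0) { len = 3; value = first & 0x0F; }
    else if (first < 0xF8) { len = 4; value = first & 0x07; }
    else if (first < 0xFC) { len = 5; value = first & 0x03; }
    else if (first < 0xFE) { len = 6; value = first & 0x01; }
    else                   { return false; }            // 0xFE / 0xFF

    if (bytes.size() != len) return false;

    for (std::size_t i = 1; i < len; ++i) {
        const std::uint8_t b = static_cast<std::uint8_t>(bytes[i]);
        if ((b & 0xC0) != 0x80) return false;           // must be 10xxxxxx
        value = (value << 6) | (b & 0x3F);
    }

    // Smallest value that needs `len` bytes; anything lower is overlong.
    static const std::uint32_t minValue[] =
        {0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000};
    if (value < minValue[len]) return false;

    cp = value;
    return true;
}
```

With this check, all five overlong forms of 'A' listed above are rejected, while the single byte 0x41 is accepted.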

Using the Code

The following functions are available:

  1. /*************************************************************************
    * @f Fnct            : convertHex2UTF
    * @r Return            : single character UTF string.
    * Description       : Convert an STL string holding a hex value to the
                            corresponding UTF character string. Do not mistake this
                      function for a stream converter. This function converts only
                      one character.
                      For example
                      "7f" return "7f"
                      "80" return "c280"
                      "fffd" return "efbfbd"
    
    * @author            : Boby thomas
    **************************************************************************/
    string convertHex2UTF(string);
  2. /*************************************************************************
    * @f Fnct            : convertUTF2Hex
    * @r Return            : Hex value corresponding to the UTF character.
                            "error" on invalid character.
    * Description       : Returns the hex value corresponding to a UTF character.
                            Do not mistake this function for a stream converter.
                      This function converts only one UTF-8 character.
                      For example
                      "7f" return "7f"
                      "c280" return "80"
                      "efbfbd" return "fffd"
    
    * @author            : Boby thomas
    **************************************************************************/
    string convertUTF2Hex(string); 
  3. /*************************************************************************
    * @f Fnct            : findLengthUTF
    * @r Return            : Number of bytes in the UTF character.
                            -1 for invalid UTF entry.
    * Description       : Takes a single byte, normally the first byte of a UTF
                            stream, and returns the number of bytes in the UTF
                            character. For example, 0xc2 will return 2 since one
                            more byte following this will constitute the UTF character.
    * @author            : Boby thomas
    **************************************************************************/
    long findLengthUTF(string sUTFFirstByte); 
  4. /*************************************************************************
    * @f Fnct            : generateUTFFileDetails
    * @r Return            : true - file could be a UTF file.
                            (No invalid UTF character in the file)
    * Description       : This function evaluates a file for validity. Returns false
                            if there is a single occurrence of an impossible character.
                            Writes a file utfdetails_<filename> with all the UTF
                            character details.
    * @author            : Boby thomas
    **************************************************************************/
    bool generateUTFFileDetails(string sFileName);
  5. /*************************************************************************
    * @f Fnct            : hex2binary
    * @r Return            : Binary string.
    * Description       : Convert an STL string of hex values to a binary string.
    * @author            : Boby thomas
    **************************************************************************/
    
    string hex2binary(string sAscii); 
  6. /*************************************************************************
    * @f Fnct            : binary2hex
    * @r Return            : Hexadecimal string.
    * Description       : Convert an STL binary string to a string of hex values.
                            Accepts a binary string of any length.
    * @author            : Boby thomas
    **************************************************************************/
    string binary2hex(string sBinary);
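
The bodies of these functions are in the download rather than in the article. To give an idea of what the last two helpers do, here is an independent sketch. The names hexToBinarySketch and binaryToHexSketch are hypothetical, and the error handling (an empty string for bad hex input, lowercase output) is an assumption rather than the behaviour of the real hex2binary and binary2hex.

```cpp
#include <cstddef>
#include <string>

// Expands each hex digit into four '0'/'1' characters, e.g. "c2" -> "11000010".
// Returns an empty string if the input contains a non-hex character.
std::string hexToBinarySketch(const std::string& hex)
{
    std::string bits;
    for (char c : hex) {
        int v;
        if (c >= '0' && c <= '9')      v = c - '0';
        else if (c >= 'a' && c <= 'f') v = c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') v = c - 'A' + 10;
        else return std::string();
        for (int bit = 3; bit >= 0; --bit)
            bits.push_back(((v >> bit) & 1) ? '1' : '0');
    }
    return bits;
}

// Packs a '0'/'1' string into lowercase hex digits. Input of any length is
// accepted: it is left-padded with zeros to a multiple of four bits.
// Characters other than '1' are treated as '0' in this sketch.
std::string binaryToHexSketch(const std::string& binary)
{
    static const char digits[] = "0123456789abcdef";
    std::string padded((4 - binary.size() % 4) % 4, '0');
    padded += binary;

    std::string hex;
    for (std::size_t i = 0; i < padded.size(); i += 4) {
        int v = 0;
        for (std::size_t j = 0; j < 4; ++j)
            v = (v << 1) | (padded[i + j] == '1' ? 1 : 0);
        hex.push_back(digits[v]);
    }
    return hex;
}
```

Working on binary strings is slower than working on integers, but it makes the bit patterns of the encoding table easy to inspect while debugging.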
    

Conclusion

This article gives a basic introduction to UTF-8 and provides some utility functions. Search the web for more details about UTF-8. Please send your suggestions and comments to bobypt@gmail.com.

History

  • 6th April, 2007: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Boby Thomas P
Australia

Comments and Discussions

 
UTF8 to Integer | Ricky Gai | 23 Sep '08 - 21:49
Hello,
 
From a HEX dump, the following three bytes
form a single Chinese character '版':
 
0xE7 0x89 0x88 = 版 = &#29256; (in XHTML)
 
How do I get the value 29256?
 
Please advise.
 
Regards,
Ricky Gai.
Re: UTF8 to Integer | Boby Thomas P | 23 Sep '08 - 22:42
In UTF-8, the character length is variable because the number of characters that can be encoded is huge. The lengths are given by the following table.
 
Unicode              Byte1     Byte2     Byte3     Byte4
U+000000-U+00007F    0xxxxxxx
U+000080-U+0007FF    110xxxxx  10xxxxxx
U+000800-U+00FFFF    1110xxxx  10xxxxxx  10xxxxxx
U+010000-U+10FFFF    11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
 

From the above table it is clear that the first byte in the stream determines how many bytes make up the character.
 
In this example, the first byte starts with E, or 1110, indicating a three-byte sequence. Drop the marker bits (the leading 1110 of the first byte and the leading 10 of each remaining byte) and write out the remaining bits as one binary number to get the Unicode hex value.
 
1110 0111 1000 1001 1000 1000 becomes
xxxx 0111 xx00 1001 xx00 1000 becomes
 
0111 00 1001 00 1000 rearranging
0111 0010 0100 1000 -> 0x7248 -> 29256
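 
The same bit manipulation can be sketched in C++ with masks and shifts. The function name decodeThreeBytes is hypothetical, and the sketch handles only the three-byte case from this example, with no validation of the input.

```cpp
#include <cstdint>

// Decodes a three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx) by
// dropping the marker bits and concatenating the remaining payload bits.
// The caller is assumed to pass a well-formed three-byte sequence.
std::uint32_t decodeThreeBytes(std::uint8_t b1, std::uint8_t b2, std::uint8_t b3)
{
    return (static_cast<std::uint32_t>(b1 & 0x0F) << 12)  // low 4 bits of byte 1
         | (static_cast<std::uint32_t>(b2 & 0x3F) << 6)   // low 6 bits of byte 2
         |  static_cast<std::uint32_t>(b3 & 0x3F);        // low 6 bits of byte 3
}
```

Calling decodeThreeBytes(0xE7, 0x89, 0x88) gives 0x7248, which is 29256 in decimal.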
 

Hope it helps.
 
Regards,
Boby
 

utf_util | Albrecht | 8 Dec '07 - 23:01
I was able to compile utf-util on XP with MinGW and STLport, after adding to utf_functions.h and changing line 138 in utf_functions.cpp to lVal += pow((double)2,(double)lPower));
 
A minor problem is the byte order of the UCS-2 hex bytes. UCS-2 comes in two flavors, marked by FFFE and FEFF.
 
Thanks for the useful software.
 
Kind regards
 
Klaus

just a minor point | peterchen | 28 Aug '08 - 3:35
lVal += pow((double)2,(double)lPower));
 
it is recommended to use
1 << lPower
instead.
 
using floating point arithmetics is not only a few orders of magnitude slower, you also run into rounding problems for 64 bit values.
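 
A minimal sketch of the shift-based approach follows. The function names are hypothetical and this is not the code from utf_functions.cpp. Note that a plain 1 << lPower is computed as an int, so the sketch uses 1LL to keep the full 64-bit range.

```cpp
#include <string>

// 2 to the given power using an integer shift. Valid for power < 63.
long long powerOfTwo(unsigned power)
{
    return 1LL << power;   // 1LL, not 1: the shift must happen in 64 bits
}

// Converts a string of '0'/'1' characters to an integer using shifts only,
// with no floating point involved. Assumes at most 63 significant bits.
long long binaryToInteger(const std::string& bits)
{
    long long value = 0;
    for (char c : bits)
        value = (value << 1) | (c == '1' ? 1 : 0);
    return value;
}
```

Accumulating the value bit by bit also removes the need to track the power of each position separately.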
 


Any2UTF8 | TomazZ | 9 Apr '07 - 21:34
Hello.
 

I am searching for a function like Any2UTF8:
 
--------------------------
CStringUTF8 sStr;
sStr.AnyToUTF8((bytes8bit)"abcščžćđ", Encoding8bit::WINDOWS_1250);
--------------------------
 

Any suggestions?
 

Thank you in advance.
TomazZ
Re: Any2UTF8 [modified] | Boby Thomas P | 10 Apr '07 - 0:04
If you want to convert from any code page to UTF, you need details of all the characters in the codepage.
 
For example, the byte 0xfc (German ü) in code page Windows-1250 maps to the Unicode value 0xfc, which is c3bc in UTF-8.
 
But if the text is encoded in code page IBM-PC 860, the same byte 0xfc stands for 0x207F, which is e281bf in UTF-8.
 
That means a program that receives 0xfc needs to know what it stands for in that particular code page.
 
If you want to convert only from Windows-1250 to utf8 programmatically, you can keep the details somewhere and code for it. But if you want to generalise, you need complete info about all the codepages.
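 
As a sketch of what "keeping the details somewhere" could look like, the code below uses a lookup table per code page followed by a UTF-8 encoding step. Everything here is illustrative: the names CodePage, toCodePoint, encodeBmp and byteToUtf8 are hypothetical, and the tables contain only the single 0xfc entry from the example above, whereas a real converter needs every byte from 0x80 to 0xFF for each code page.

```cpp
#include <cstdint>
#include <map>
#include <string>

enum class CodePage { Windows1250, Ibm860 };

// Maps one byte to its Unicode value for the given code page.
// Unmapped bytes become U+FFFD (the replacement character) in this sketch.
std::uint32_t toCodePoint(CodePage page, std::uint8_t byte)
{
    if (byte < 0x80) return byte;   // ASCII range is shared by both code pages
    static const std::map<std::uint8_t, std::uint32_t> win1250 = {{0xFC, 0x00FC}};
    static const std::map<std::uint8_t, std::uint32_t> ibm860  = {{0xFC, 0x207F}};
    const auto& table = (page == CodePage::Windows1250) ? win1250 : ibm860;
    const auto it = table.find(byte);
    return it == table.end() ? 0xFFFD : it->second;
}

// Encodes a value up to U+FFFF as UTF-8 (enough for single-byte code pages).
std::string encodeBmp(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}

std::string byteToUtf8(CodePage page, std::uint8_t byte)
{
    return encodeBmp(toCodePoint(page, byte));
}
```

With these tables, byteToUtf8(CodePage::Windows1250, 0xFC) gives c3bc and byteToUtf8(CodePage::Ibm860, 0xFC) gives e281bf, matching the two cases above.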
 
But if you just want to convert files (NOT programmatically) from one code page to UTF-8, I would suggest using Perl. The latest Perl releases support most code pages.

 

-- modified at 3:09 Wednesday 11th April, 2007
 
Regards,
Boby

Re: Any2UTF8 | TomazZ | 10 Apr '07 - 0:23
Thanx Boby.
 
I know how Any2UTF8 should be written.
So I am searching for an Any2UTF8 function that is already written.
 
I'll check opensource web spiders.
They should convert any HTML encodings to UTF8 before indexing pages.
 

Regards,
TomazZ
Re: Any2UTF8 | Denis Kiriaev | 10 Apr '07 - 21:04
Under Windows you can use the pair of functions MultiByteToWideChar and WideCharToMultiByte to do this. For Linux/Unix, I believe you can find implementations of these functions in the Wine sources.
 
Regards,
Denis Kiryaev

sf.net library at rescue ! | Kochise | 17 Apr '07 - 21:57
http://sourceforge.net/projects/libcharguess/
 
Kochise
 
In Code we trust !

UTF-8 can have max 4 bytes | Mihai Nita | 6 Apr '07 - 6:59
This has been the case since Unicode 3.1 (2000). It is now 2007 and the version is 5.0.
 
http://unicode.org/versions/corrigendum1.html
The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for the use of five- and six-byte sequences to encode characters that are outside the range of the Unicode character set; those five- and six-byte sequences are illegal for the use of UTF-8 as a transformation of Unicode characters.

Re: UTF-8 can have max 4 bytes | Boby Thomas P | 6 Apr '07 - 7:08
Thanks for the info. The point here is that UTF-8 can encode Unicode and more. :)
But since a UTF encoder/decoder does not care what the incoming byte stream stands for, the tool holds good.
Anyway thanks for the info.
 

 
Regards,
Boby

Re: UTF-8 can have max 4 bytes | Mihai Nita | 6 Apr '07 - 20:45
Boby Thomas P wrote:
the tool holds good

 
It is not about whether the tool holds or not; it is about respecting the standard.
I would argue that the tool should treat that as an error.
UTF-8 means *Unicode* ..., so accepting input or generating output outside the Unicode range is an error.
 
See http://unicode.org/faq/utf_bom.html
"A conformant process must not interpret illegal or ill-formed byte sequences as characters"
"the Unicode Technical Committee has tightened the definition of UTF-8 over time to more strictly enforce unique sequences and to prohibit encoding of certain invalid characters"
 
"More flexible" in this case is a bad thing, because it means "non conformant"
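 
For comparison, a conformant check along the lines argued here could look like the following sketch. The function names are hypothetical and this is not code from the article's tool. It limits sequences to 4 bytes, rejects the lead bytes 0xC0 and 0xC1 (which can only start overlong forms), and treats values above U+10FFFF and the surrogate range U+D800 to U+DFFF as invalid.

```cpp
#include <cstdint>

// True if cp is a valid Unicode scalar value: at most U+10FFFF and not a
// UTF-16 surrogate (U+D800..U+DFFF).
bool isUnicodeScalar(std::uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Sequence length from the lead byte under the 4-byte limit; -1 if the byte
// cannot start a well-formed sequence.
int strictSequenceLength(std::uint8_t lead)
{
    if (lead < 0x80) return 1;
    if (lead < 0xC2) return -1;  // continuation bytes, and 0xC0/0xC1 (always overlong)
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    if (lead < 0xF5) return 4;   // 0xF4 is the last lead byte that can reach U+10FFFF
    return -1;                   // 0xF5-0xFF: beyond U+10FFFF, or 5/6-byte forms
}
```

A full validator would still need to check the continuation bytes and the decoded value, since a lead byte of 0xF4 can also start a sequence that decodes above U+10FFFF.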

Re: UTF-8 can have max 4 bytes | Boby Thomas P | 9 Apr '07 - 3:52
This tool considers characters outside unicode also as valid UTF character. Length can be up to 6 bytes.

 

 
Regards,
Boby

Re: UTF-8 can have max 4 bytes | Mihai Nita | 9 Apr '07 - 17:16
Boby Thomas P wrote:
This tool considers characters outside unicode also as valid UTF character

 
This is exactly the point.
Doing that is wrong, and the tool is non-compliant.

Re: UTF-8 can have max 4 bytes | _Olivier_ | 15 Apr '07 - 6:59
Mihai Nita wrote:
Doing that is wrong, and the tool is non-compliant.

 
+1 !
especially as Boby presents those tools in the introduction as "validating" UTF-8 output !
 
longer-than-necessary UTF-8 encodings are invalid
 

Last Updated 6 Apr 2007
Article Copyright 2007 by Boby Thomas P
Everything else Copyright © CodeProject, 1999-2013