
UTF-8 UTILITY FUNCTIONS IN C++ (Platform Independent Code)

6 Apr 2007, CPOL, 2 min read
This article describes the basics of UTF-8 and provides some utility functions for handling UTF-8. The code can be compiled for Windows as well as Linux.

Introduction

Recently I needed to check whether some files provided by an external customer were valid UTF-8. I searched for sample applications and code but could not find any, so I decided to write one myself.

UTF-8 Format

UTF-8 (Unicode Transformation Format, 8-bit) is, as the name suggests, a variable-length encoding format for Unicode. Unicode contains the characters required to represent practically all known written languages, including most of the Indian languages such as Malayalam, Bengali, Gujarati, Oriya, Tamil, Telugu and Kannada.

Unicode assigns an integer number to each character, but it does not define how that number is stored or encoded. That is the job of the encoding formats, such as UCS-2, UTF-7, UTF-8 and UTF-16.

What makes UTF-8 attractive compared to other encodings is that all the standard ASCII characters are encoded exactly as they are in ASCII. That means code written to handle ASCII characters will continue to work on the ASCII portion of UTF-8 text without change.

A few points I would like to highlight here:

  • A plain ASCII file (bytes from 0x00 to 0x7F) is a valid UTF-8 file, because characters in the range 0x00 to 0x7F are encoded as themselves in UTF-8.
  • 0xFE and 0xFF are two byte values that can never appear in a UTF-8 file.
  • The first byte of a UTF-8 encoded character tells us the total number of bytes in that character's sequence.
  • It is possible to encode all 2^31 UCS characters in UTF-8.
  • The first byte of a non-ASCII character (above 0x007F) is in the range 0xC0 to 0xFD.
  • Every byte in the sequence for a non-ASCII character is 0x80 or above (its high bit is set). That means no ASCII byte ever appears inside a multi-byte UTF-8 sequence.
  • Byte sequences are stored in big-endian order: the most significant bits of the character come first.
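
To illustrate the point about the first byte, here is a minimal sketch of how the sequence length could be derived from it. The name sequenceLength is hypothetical and this is not the article's findLengthUTF; it assumes the original 31-bit scheme with sequences of up to six bytes.

```cpp
#include <cstdint>

// Hypothetical helper (an illustration, not the article's findLengthUTF).
// Returns the total number of bytes in the UTF-8 sequence introduced by
// `first`, or -1 if `first` cannot start a sequence.
int sequenceLength(std::uint8_t first) {
    if (first < 0x80) return 1;   // 0xxxxxxx: plain ASCII, one byte
    if (first < 0xC0) return -1;  // 10xxxxxx: continuation byte, not a start
    if (first < 0xE0) return 2;   // 110xxxxx
    if (first < 0xF0) return 3;   // 1110xxxx
    if (first < 0xF8) return 4;   // 11110xxx
    if (first < 0xFC) return 5;   // 111110xx
    if (first < 0xFE) return 6;   // 1111110x
    return -1;                    // 0xFE and 0xFF never occur in UTF-8
}
```

For example, sequenceLength(0xC2) gives 2 and sequenceLength(0xFF) gives -1.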

Use this application if you want to convert between UTF-8 and ASCII.

Calculation of Byte Sequence for a Character

Character range              Byte sequence (binary)
U-00000000 – U-0000007F      0xxxxxxx
U-00000080 – U-000007FF      110xxxxx 10xxxxxx
U-00000800 – U-0000FFFF      1110xxxx 10xxxxxx 10xxxxxx
U-00010000 – U-001FFFFF      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 – U-03FFFFFF      111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 – U-7FFFFFFF      1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
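
The byte layouts above can be turned into an encoder fairly directly. The following is only a sketch under that assumption: encodeUtf8 is a hypothetical name, not one of the article's functions, and it takes the character as an integer rather than as a hex string.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes one character value (0 to 0x7FFFFFFF) into
// its UTF-8 byte sequence, following the ranges and bit layouts listed above.
std::string encodeUtf8(std::uint32_t cp) {
    assert(cp <= 0x7FFFFFFFu);  // precondition: at most 31 bits
    std::string out;
    if (cp < 0x80) {            // one byte: 0xxxxxxx
        out += static_cast<char>(cp);
        return out;
    }
    int extra;                  // number of continuation bytes
    std::uint32_t lead;         // marker bits of the first byte
    if (cp < 0x800)          { extra = 1; lead = 0xC0; }
    else if (cp < 0x10000)   { extra = 2; lead = 0xE0; }
    else if (cp < 0x200000)  { extra = 3; lead = 0xF0; }
    else if (cp < 0x4000000) { extra = 4; lead = 0xF8; }
    else                     { extra = 5; lead = 0xFC; }
    // First byte: marker bits plus the most significant payload bits.
    out += static_cast<char>(lead | (cp >> (6 * extra)));
    // Continuation bytes: 10xxxxxx, six payload bits each, high bits first.
    for (int i = extra - 1; i >= 0; --i)
        out += static_cast<char>(0x80 | ((cp >> (6 * i)) & 0x3F));
    return out;
}
```

With this sketch, 0x80 encodes to the bytes C2 80 and 0xFFFD encodes to EF BF BD.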

Overlong Sequences

Take care when the input contains an overlong sequence. A UTF-8 decoder must not accept a character encoded with more bytes than necessary.

For example, the character 'A' (0x41) must be encoded as the single byte 0x41. The possible overlong encodings of the same character are:

0xC1 0x81

0xE0 0x81 0x81

0xF0 0x80 0x81 0x81

0xF8 0x80 0x80 0x81 0x81

0xFC 0x80 0x80 0x80 0x81 0x81 

A naive decoder will decode each of these sequences to 0x41. However, they are not permitted in UTF-8 and must be treated as invalid UTF-8 character sequences.
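
One way to reject such input is to decode the sequence and then compare the result against the smallest value that genuinely needs that many bytes. The sketch below takes this approach; decodeStrict is a hypothetical name and this is not necessarily how the article's code performs the check.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decodes the single UTF-8 sequence at the start of `s`.
// Returns the character value, or -1 if the sequence is truncated,
// malformed, or overlong.
long decodeStrict(const std::string& s) {
    if (s.empty()) return -1;
    const std::uint8_t first = static_cast<std::uint8_t>(s[0]);
    if (first < 0x80) return first;        // ASCII: always the shortest form

    int len;                               // total bytes in the sequence
    std::uint32_t minCp;                   // smallest value needing `len` bytes
    if (first < 0xC0) return -1;           // stray continuation byte
    else if (first < 0xE0) { len = 2; minCp = 0x80; }
    else if (first < 0xF0) { len = 3; minCp = 0x800; }
    else if (first < 0xF8) { len = 4; minCp = 0x10000; }
    else if (first < 0xFC) { len = 5; minCp = 0x200000; }
    else if (first < 0xFE) { len = 6; minCp = 0x4000000; }
    else return -1;                        // 0xFE and 0xFF are never valid

    if (s.size() < static_cast<std::size_t>(len)) return -1;  // truncated

    std::uint32_t cp = first & (0x7F >> len);  // payload bits of first byte
    for (int i = 1; i < len; ++i) {
        const std::uint8_t b = static_cast<std::uint8_t>(s[i]);
        if ((b & 0xC0) != 0x80) return -1;     // must look like 10xxxxxx
        cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < minCp) return -1;             // overlong: fits in fewer bytes
    return static_cast<long>(cp);
}
```

Under this sketch, the single byte 0x41 decodes to 0x41, while every overlong form listed above is rejected with -1.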

Using the Code

The following functions are available:

  1. /*************************************************************************
    * @f Fnct           : convertHex2UTF
    * @r Return         : Single-character UTF string.
    * Description       : Converts an STL string holding a hex character value
    *                     to the corresponding UTF character string. Do not
    *                     mistake this function for a stream converter; it
    *                     converts only one character.
    *                     For example
    *                     "7f" returns "7f"
    *                     "80" returns "c280"
    *                     "fffd" returns "efbfbd"
    *
    * @author           : Boby thomas
    **************************************************************************/
    string convertHex2UTF(string);
  2. /*************************************************************************
    * @f Fnct           : convertUTF2Hex
    * @r Return         : Hex value corresponding to the UTF character.
    *                     "error" on invalid character.
    * Description       : Returns the hex value corresponding to a UTF
    *                     character. Do not mistake this function for a stream
    *                     converter; it converts only one UTF-8 character.
    *                     For example
    *                     "7f" returns "7f"
    *                     "c280" returns "80"
    *                     "efbfbd" returns "fffd"
    *
    * @author           : Boby thomas
    **************************************************************************/
    string convertUTF2Hex(string);
  3. /*************************************************************************
    * @f Fnct           : findLengthUTF
    * @r Return         : Number of bytes in the UTF character.
    *                     -1 for invalid UTF entry.
    * Description       : Takes a single character, normally the first
    *                     character of a UTF stream, and returns the number of
    *                     bytes in the UTF character it starts. For example,
    *                     0xc2 returns 2, since it and the one byte following
    *                     it together constitute the UTF character.
    * @author           : Boby thomas
    **************************************************************************/
    long findLengthUTF(string sUTFFirstByte);
  4. /*************************************************************************
    * @f Fnct           : generateUTFFileDetails
    * @r Return         : true - file could be a UTF file.
    *                     (No invalid UTF character in the file)
    * Description       : Evaluates a file for validity. Returns false if
    *                     there is even a single occurrence of an impossible
    *                     character. Writes a file utfdetails_<filename> with
    *                     all the UTF character details.
    * @author           : Boby thomas
    **************************************************************************/
    bool generateUTFFileDetails(string sFileName);
  5. /*************************************************************************
    * @f Fnct           : hex2binary
    * @r Return         : Binary string.
    * Description       : Converts an STL string of hex values to a binary
    *                     string.
    * @author           : Boby thomas
    **************************************************************************/
    string hex2binary(string sAscii);
  6. /*************************************************************************
    * @f Fnct           : binary2hex
    * @r Return         : Hexadecimal string.
    * Description       : Converts an STL binary string to a string of hex
    *                     values. Accepts a binary string of any length.
    * @author           : Boby thomas
    **************************************************************************/
    string binary2hex(string sBinary);
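
The declarations above do not say exactly what a "binary string" looks like. As a rough illustration only, the sketch below assumes it means text made of '0' and '1' characters. The names hexToBits and bitsToHex are hypothetical stand-ins, not the article's hex2binary and binary2hex, and the "error" return value is borrowed from the convention documented for convertUTF2Hex.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Assumption: a "binary string" is text made of '0' and '1' characters.
// Hypothetical helper: expands each hex digit into four binary digits.
std::string hexToBits(const std::string& hex) {
    std::string bits;
    for (char c : hex) {
        const unsigned char uc = static_cast<unsigned char>(c);
        if (!std::isxdigit(uc)) return "error";
        const int v = std::isdigit(uc) ? uc - '0'
                                       : std::tolower(uc) - 'a' + 10;
        for (int bit = 3; bit >= 0; --bit)
            bits += ((v >> bit) & 1) ? '1' : '0';
    }
    return bits;
}

// Hypothetical helper: packs binary digits into hex digits. Input of any
// length is accepted; it is left-padded with zeros to a multiple of four.
std::string bitsToHex(std::string bits) {
    while (bits.size() % 4 != 0) bits.insert(bits.begin(), '0');
    static const char digits[] = "0123456789abcdef";
    std::string hex;
    for (std::size_t i = 0; i < bits.size(); i += 4) {
        int v = 0;
        for (std::size_t j = 0; j < 4; ++j) {
            const char c = bits[i + j];
            if (c != '0' && c != '1') return "error";
            v = (v << 1) | (c - '0');
        }
        hex += digits[v];
    }
    return hex;
}
```

Under these assumptions, hexToBits("c2") yields "11000010", which makes the 110xxxxx marker of a two-byte sequence easy to see, and bitsToHex("11000010") yields "c2" again.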

Conclusion

This article gives a basic introduction to UTF-8 and provides some utility functions. Search online for more details about UTF-8. Please send your suggestions and comments to bobypt@gmail.com.

History

  • 6th April, 2007: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Software Developer (Senior) DWS
Australia
