Posted 6 Apr 2007



UTF-8 UTILITY FUNCTIONS IN C++ (Platform Independent Code)

This article describes the basics of UTF-8 and provides some utility functions for handling UTF-8. The code can be compiled for Windows as well as Linux.


Recently I needed to check whether some files provided by an external customer were valid UTF-8. I Googled for sample applications and code but couldn't find any, so I decided to write one myself.

UTF-8 Format

UTF-8 (Unicode Transformation Format, 8-bit) is, as the name suggests, a variable-length encoding format for Unicode. Unicode contains the characters required to represent practically all known languages, including most Indian languages such as Malayalam, Bengali, Gujarati, Oriya, Tamil, Telugu and Kannada.

Unicode assigns an integer number (a code point) to each character, but it does not by itself define how that number is stored or encoded. That is defined by encoding formats such as UCS-2, UTF-7, UTF-8 and UTF-16.

What makes UTF-8 attractive compared to other encodings is that all standard ASCII characters are encoded exactly as they are in ASCII. That means code written to handle ASCII characters continues to work unchanged for those characters.

A few points I would like to highlight here:

  • A plain ASCII file (bytes in the range 0x00 to 0x7F) is a valid UTF-8 file, because the UTF-8 encoding of each character in the range 0x00 to 0x7F is the same single byte.
  • 0xFE and 0xFF are two byte values that can never appear in a UTF-8 file.
  • From the first byte of a UTF-8 encoded character, we can find out the total number of bytes in that character.
  • The scheme described here can encode all 2^31 UCS characters. The current UTF-8 definition (RFC 3629) limits the range to U+10FFFF, so sequences there are at most four bytes long.
  • The first byte of a non-ASCII character (> 0x7F) is in the range 0xC0 to 0xFD.
  • Every byte in the sequence for a non-ASCII character is 0x80 or higher. That means no ASCII byte ever appears inside a multi-byte UTF-8 sequence.
  • The bytes of a multi-byte sequence are stored in big-endian order (most significant bits first).
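The point about the first byte can be sketched in a few lines. This is only an illustration of the idea, following the original 1-to-6-byte scheme; `sequenceLength` is a hypothetical name and is not the article's `findLengthUTF`.

```cpp
#include <cstdint>

// Returns the total number of bytes in the UTF-8 sequence that starts with
// `lead`, or -1 if `lead` cannot start a sequence (a continuation byte
// 0x80..0xBF, or 0xFE / 0xFF). Hypothetical helper for illustration only.
int sequenceLength(std::uint8_t lead) {
    if (lead < 0x80) return 1;   // 0xxxxxxx: plain ASCII
    if (lead < 0xC0) return -1;  // 10xxxxxx: continuation byte, not a lead
    if (lead < 0xE0) return 2;   // 110xxxxx
    if (lead < 0xF0) return 3;   // 1110xxxx
    if (lead < 0xF8) return 4;   // 11110xxx
    if (lead < 0xFC) return 5;   // 111110xx
    if (lead < 0xFE) return 6;   // 1111110x
    return -1;                   // 0xFE and 0xFF never occur in UTF-8
}
```

Only the leading bits are inspected; whether the following bytes really are continuation bytes has to be checked separately.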

Use this application if you want to convert between UTF-8 and ASCII.

Calculation of Byte Sequence for a Character

Character range              Byte sequence (binary)

U-00000000 – U-0000007F      0xxxxxxx
U-00000080 – U-000007FF      110xxxxx 10xxxxxx
U-00000800 – U-0000FFFF      1110xxxx 10xxxxxx 10xxxxxx
U-00010000 – U-001FFFFF      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 – U-03FFFFFF      111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 – U-7FFFFFFF      1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
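As a rough sketch of how the table above turns into code, the function below encodes one character value into its byte sequence. The name `encodeCodePoint` and its error convention (an empty string) are assumptions made for this example; they are not taken from the article's source.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Encodes one code point (0 .. 0x7FFFFFFF) into the byte layout of the table.
// Returns an empty string for values outside that range.
// Hypothetical helper for illustration only.
std::string encodeCodePoint(std::uint32_t cp) {
    if (cp > 0x7FFFFFFFu) return std::string();
    if (cp < 0x80) return std::string(1, static_cast<char>(cp));

    // Pick the sequence length from the range the code point falls in.
    const std::size_t len = cp < 0x800 ? 2 : cp < 0x10000 ? 3
                          : cp < 0x200000 ? 4 : cp < 0x4000000 ? 5 : 6;

    std::string out(len, '\0');
    // Fill continuation bytes from the end: 10xxxxxx, six payload bits each.
    for (std::size_t i = len - 1; i > 0; --i) {
        out[i] = static_cast<char>(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    // Lead byte: `len` one-bits, a zero bit, then the remaining payload bits.
    const std::uint8_t prefix = static_cast<std::uint8_t>(0xFF << (8 - len));
    out[0] = static_cast<char>(prefix | cp);
    return out;
}
```

For instance, the value 0x80 falls in the second row of the table, so it is written as the two bytes 0xC2 0x80.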

Overlong Sequences

Care must be taken with input that contains overlong sequences. A UTF-8 decoder must not accept a character encoded with more bytes than necessary.

For example, the character 'A' (0x41) must be encoded as the single byte 0x41. The overlong encodings of the same character are:

0xC1 0x81

0xE0 0x81 0x81

0xF0 0x80 0x81 0x81

0xF8 0x80 0x80 0x81 0x81

0xFC 0x80 0x80 0x80 0x81 0x81 

A naive decoder will decode each of these sequences to 0x41. However, they are not permitted in UTF-8 and must be treated as invalid UTF-8 character sequences.
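One way to reject overlong input is to decode the sequence and then compare the result against the smallest value that genuinely needs that many bytes. The sketch below assumes the original 1-to-6-byte scheme; `decodeStrict` and its -1 error value are hypothetical and not part of the article's code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decodes the single UTF-8 character at the start of `s`. Returns the code
// point, or -1 if the sequence is truncated, malformed, or overlong.
// Hypothetical helper for illustration only.
long long decodeStrict(const std::string& s) {
    if (s.empty()) return -1;
    const std::uint8_t lead = static_cast<std::uint8_t>(s[0]);

    std::size_t len;
    if      (lead < 0x80) return lead;   // ASCII, one byte
    else if (lead < 0xC0) return -1;     // stray continuation byte
    else if (lead < 0xE0) len = 2;
    else if (lead < 0xF0) len = 3;
    else if (lead < 0xF8) len = 4;
    else if (lead < 0xFC) len = 5;
    else if (lead < 0xFE) len = 6;
    else return -1;                      // 0xFE / 0xFF
    if (s.size() < len) return -1;       // truncated sequence

    // Payload bits of the lead byte: its low (7 - len) bits.
    std::uint32_t cp = lead & (0x7Fu >> len);
    for (std::size_t i = 1; i < len; ++i) {
        const std::uint8_t b = static_cast<std::uint8_t>(s[i]);
        if ((b & 0xC0) != 0x80) return -1;   // must look like 10xxxxxx
        cp = (cp << 6) | (b & 0x3F);
    }

    // Smallest code point that legitimately needs `len` bytes.
    static const std::uint32_t kMin[] = {0, 0, 0x80, 0x800, 0x10000,
                                         0x200000, 0x4000000};
    if (cp < kMin[len]) return -1;           // overlong encoding
    return cp;
}
```

With this check, 0x41 decodes to 0x41, while 0xC1 0x81 and the longer forms listed above are all rejected.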

Using the Code

The following functions are available:

  1. /*************************************************************************
    * @f Fnct           : convertHex2UTF
    * @r Return         : Single-character UTF string.
    * Description       : Converts an STL string holding the hex value of a
                          character to the corresponding UTF character string.
                          Do not mistake this function for a stream converter.
                          It converts only one character.
                          For example
                          "7f" returns "7f"
                          "80" returns "c280"
                          "fffd" returns "efbfbd"
    * @author           : Boby Thomas
    *************************************************************************/
    string convertHex2UTF(string);
  2. /*************************************************************************
    * @f Fnct           : convertUTF2Hex
    * @r Return         : Hex value corresponding to the UTF character.
                          "error" on invalid character.
    * Description       : Returns the hex value corresponding to a UTF character.
                          Do not mistake this function for a stream converter.
                          It converts only one UTF-8 character.
                          For example
                          "7f" returns "7f"
                          "c280" returns "80"
                          "efbfbd" returns "fffd"
    * @author           : Boby Thomas
    *************************************************************************/
    string convertUTF2Hex(string);
  3. /*************************************************************************
    * @f Fnct           : findLengthUTF
    * @r Return         : Number of bytes in the UTF character.
                          -1 for invalid UTF entry.
    * Description       : Takes a single byte, normally the first byte of a
                          UTF stream, and returns the number of bytes in that
                          UTF character. For example, 0xc2 returns 2, since
                          one more byte following it completes the UTF
                          character.
    * @author           : Boby Thomas
    *************************************************************************/
    long findLengthUTF(string sUTFFirstByte);
  4. /*************************************************************************
    * @f Fnct           : generateUTFFileDetails
    * @r Return         : true - file could be a UTF file.
                          (No invalid UTF character in the file)
    * Description       : Evaluates a file for validity. Returns false if there
                          is even a single occurrence of an impossible
                          character. Writes a file utfdetails_<filename> with
                          all the UTF character details.
    * @author           : Boby Thomas
    *************************************************************************/
    bool generateUTFFileDetails(string sFileName);
  5. /*************************************************************************
    * @f Fnct           : hex2binary
    * @r Return         : Binary string.
    * Description       : Converts an STL string of hex values to a binary string.
    * @author           : Boby Thomas
    *************************************************************************/
    string hex2binary(string sAscii);
  6. /*************************************************************************
    * @f Fnct           : binary2hex
    * @r Return         : Hexadecimal string.
    * Description       : Converts an STL binary string to a string of hex
                          values. Accepts a binary string of any length.
    * @author           : Boby Thomas
    *************************************************************************/
    string binary2hex(string sBinary);
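The article's implementations of the last two helpers are not reproduced here. As a rough idea of what a hex/binary string conversion of this kind might look like, here is a minimal sketch. The names `hexToBits` and `bitsToHex`, the lowercase output, the left padding, and the empty-string error value are all assumptions made for this example.

```cpp
#include <string>

// Expands each hex digit of `hex` into four '0'/'1' characters.
// Returns an empty string if a non-hex character is found.
std::string hexToBits(const std::string& hex) {
    const std::string digits = "0123456789abcdef";
    std::string bits;
    for (char c : hex) {
        if (c >= 'A' && c <= 'F') c = static_cast<char>(c - 'A' + 'a');
        const std::string::size_type v = digits.find(c);
        if (v == std::string::npos) return std::string();
        for (int shift = 3; shift >= 0; --shift)
            bits += ((v >> shift) & 1) ? '1' : '0';
    }
    return bits;
}

// Packs a string of '0'/'1' characters into lowercase hex digits. The input
// is left-padded with zeros to a multiple of four bits. Returns an empty
// string if a character other than '0' or '1' is found.
std::string bitsToHex(const std::string& bits) {
    const std::string digits = "0123456789abcdef";
    std::string padded((4 - bits.size() % 4) % 4, '0');
    padded += bits;
    std::string hex;
    for (std::string::size_type i = 0; i < padded.size(); i += 4) {
        unsigned v = 0;
        for (std::string::size_type j = 0; j < 4; ++j) {
            const char c = padded[i + j];
            if (c != '0' && c != '1') return std::string();
            v = (v << 1) | static_cast<unsigned>(c - '0');
        }
        hex += digits[v];
    }
    return hex;
}
```

Working on bit strings makes it easy to see the 110xxxxx 10xxxxxx patterns directly: "c280" expands to "1100001010000000". Note that in this sketch an empty input and an invalid input both produce an empty string.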


The above article gives a basic introduction to UTF-8 and provides some utility functions. Google for more details about UTF-8. Please send me your valuable suggestions and comments at


  • 6th April, 2007: Initial post


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Boby Thomas P
Software Developer (Senior) DWS
Australia
