
Introduction

Many C++ developers miss an easy and portable way of handling Unicode encoded strings. The original C++ Standard (known as C++98 or C++03) is Unicode agnostic, and while some work is being done to introduce Unicode to the next incarnation called C++0x, for the moment, nothing of the sort is available. In the meantime, developers use third party libraries like ICU, OS specific capabilities, or simply roll out their own solutions.

In order to easily handle UTF-8 encoded Unicode strings, I came up with a small generic library. For anybody used to working with STL algorithms and iterators, it should be easy and natural to use. The code is freely available for any purpose - check out the license at the beginning of the utf8.h file. If you run into bugs or performance issues, please let me know and I'll do my best to address them.

The purpose of this article is not to offer an introduction to Unicode in general, or to UTF-8 in particular. If you are not familiar with Unicode, be sure to check out the Unicode Home Page or some other source of information about Unicode. Also, it is not my aim to advocate the use of UTF-8 encoded strings in C++ programs; if you want to handle UTF-8 encoded strings from C++, I am sure you have good reasons for it.

Examples of use

Introductory sample

To illustrate the use of the library, let's start with a small but complete program that opens a file containing UTF-8 encoded text, reads it line by line, checks each line for invalid UTF-8 byte sequences, and converts it to UTF-16 encoding and back to UTF-8:

#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>
#include "utf8.h"
using namespace std;
int main(int argc, char** argv)
{
    if (argc != 2) {
        cout << "\nUsage: docsample filename\n";
        return 0;
    }

    const char* test_file_path = argv[1];
    // Open the test file (contains UTF-8 encoded text)
    ifstream fs8(test_file_path);
    if (!fs8.is_open()) {
        cout << "Could not open " << test_file_path << endl;
        return 0;
    }

    unsigned line_count = 1;
    string line;
    // Play with all the lines in the file
    while (getline(fs8, line)) {
        // check for invalid utf-8 (for a simple
        // yes/no check, there is also utf8::is_valid function)
        string::iterator end_it = 
          utf8::find_invalid(line.begin(), line.end());
        if (end_it != line.end()) {
            cout << "Invalid UTF-8 encoding detected at line " 
                 << line_count << "\n";
            cout << "This part is fine: " 
                 << string(line.begin(), end_it) << "\n";
        }

        // Get the line length (at least for the valid part)
        int length = utf8::distance(line.begin(), end_it);
        cout << "Length of line " << line_count 
             << " is " << length <<  "\n";

        // Convert it to utf-16
        vector<unsigned short> utf16line;
        utf8::utf8to16(line.begin(), end_it, back_inserter(utf16line));

        // And back to utf-8
        string utf8line; 
        utf8::utf16to8(utf16line.begin(), utf16line.end(), 
                       back_inserter(utf8line));

        // Confirm that the conversion went OK:
        if (utf8line != string(line.begin(), end_it))
            cout << "Error in UTF-16 conversion at line: " 
                 << line_count << "\n";        

        line_count++;
    }
    return 0;
}

In the previous code sample, for each line, we performed a detection of invalid UTF-8 sequences with find_invalid; the number of characters (more precisely - the number of Unicode code points, including the end of line and even BOM if there is one) in each line was determined with the use of utf8::distance; finally, we have converted each line to UTF-16 encoding with utf8to16 and back to UTF-8 with utf16to8.

Checking if a file contains valid UTF-8 text

Here is a function that checks whether the content of a file is valid UTF-8 encoded text without reading the content into memory:

bool valid_utf8_file(const char* file_name)
{
    ifstream ifs(file_name);
    if (!ifs)
        return false; // even better, throw here

    istreambuf_iterator<char> it(ifs.rdbuf());
    istreambuf_iterator<char> eos;

    return utf8::is_valid(it, eos);
}

Because the function utf8::is_valid() works with input iterators, we were able to pass an istreambuf_iterator to it and read the contents of the file directly without loading it into memory first.
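
To illustrate why single-pass input iterators are enough for this kind of processing, here is a minimal sketch that does not use the library at all. The names count_code_points and count_code_points_in_stream are hypothetical, and the sketch assumes the input is already valid UTF-8: it simply counts every octet that is not a continuation octet.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical helper, not part of the library: counts code points in a
// range that is assumed to hold valid UTF-8. Each element is read exactly
// once, so a single-pass input iterator is sufficient.
template <typename InputIt>
std::size_t count_code_points(InputIt first, InputIt last)
{
    std::size_t count = 0;
    for (; first != last; ++first) {
        const unsigned char octet = static_cast<unsigned char>(*first);
        // Continuation octets have the bit pattern 10xxxxxx; every other
        // octet starts a new code point.
        if ((octet & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

// Reads directly from the stream buffer without copying the content into
// a string first.
std::size_t count_code_points_in_stream(std::istream& in)
{
    std::istreambuf_iterator<char> it(in.rdbuf());
    std::istreambuf_iterator<char> eos;
    return count_code_points(it, eos);
}
```

Any std::istream works here, so the same function can be fed an ifstream or, for a quick experiment, an istringstream.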

Note that other functions that take input iterator arguments can be used in a similar way. For instance, to read the contents of a UTF-8 encoded text file and convert the text to UTF-16, just do something like:

utf8::utf8to16(it, eos, back_inserter(u16string));

Ensure that a string contains valid UTF-8 text

If we have some text that "probably" contains UTF-8 encoded text and we want to replace any invalid UTF-8 sequence with a replacement character, something like the following function may be used:

void fix_utf8_string(std::string& str)
{
    std::string temp;
    utf8::replace_invalid(str.begin(), str.end(), back_inserter(temp));
    str = temp;
}

The function will replace any invalid UTF-8 sequence with a Unicode replacement character. There is an overloaded function that enables the caller to supply their own replacement character.

Reference

Functions from the utf8 namespace

utf8::append

Available in version 1.0 and later.

Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence to a UTF-8 string.

template <typename octet_iterator>
octet_iterator append(uint32_t cp, octet_iterator result);

Example of use:

unsigned char u[5] = {0,0,0,0,0};
unsigned char* end = append(0x0448, u);
assert (u[0] == 0xd1 && u[1] == 0x88 && u[2] == 0 
                     && u[3] == 0 && u[4] == 0);

Note that append does not allocate any memory - it is the burden of the caller to make sure there is enough memory allocated for the operation. To make things more interesting, append can add anywhere between 1 and 4 octets to the sequence. In practice, you would most often want to use std::back_inserter to ensure that the necessary memory is allocated.

In case of an invalid code point, a utf8::invalid_code_point exception is thrown.
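
For readers curious about what "between 1 and 4 octets" means in practice, the following is a rough sketch of the UTF-8 encoding rules written against an output iterator. It is not the library's implementation: encode_utf8 and encode_to_string are hypothetical names, and std::invalid_argument stands in for the library's own exception type.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical sketch: encodes one code point as 1 to 4 UTF-8 octets and
// writes them through an output iterator.
template <typename OutputIt>
OutputIt encode_utf8(std::uint32_t cp, OutputIt out)
{
    // Surrogates and values above U+10FFFF are not valid code points.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("invalid code point");

    if (cp < 0x80) {                    // 1 octet:  0xxxxxxx
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800) {            // 2 octets: 110xxxxx 10xxxxxx
        *out++ = static_cast<char>(0xC0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {          // 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xE0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else {                            // 4 octets: 11110xxx followed by three 10xxxxxx
        *out++ = static_cast<char>(0xF0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Convenience wrapper: std::back_inserter grows the string as needed, so
// the caller does not have to reserve room for the octets.
std::string encode_to_string(std::uint32_t cp)
{
    std::string result;
    encode_utf8(cp, std::back_inserter(result));
    return result;
}
```

Writing through std::back_inserter, as the wrapper does, avoids the buffer-size question entirely, at the cost of tying the code to a container type.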

utf8::next

Available in version 1.0 and later.

Given the iterator to the beginning of the UTF-8 sequence, it returns the code point and moves the iterator to the next position.

template <typename octet_iterator> 
uint32_t next(octet_iterator& it, octet_iterator end);

Example of use:

char* twochars = "\xe6\x97\xa5\xd1\x88";
char* w = twochars;
int cp = next(w, twochars + 6);
assert (cp == 0x65e5);
assert (w == twochars + 3);

This function is typically used to iterate through a UTF-8 encoded string.

In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown.
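
As an illustration of what decoding a single sequence involves, here is a simplified sketch. The name decode_next is hypothetical and the standard exception types are placeholders for the library's own. The sketch checks only the lead octet, the continuation octets, and the end of the range; unlike a full validator, it does not reject overlong forms or surrogates.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical sketch: decodes the code point that starts at `it` and
// leaves `it` just past the decoded sequence. If an exception is thrown,
// `it` may already have been moved.
template <typename OctetIt>
std::uint32_t decode_next(OctetIt& it, OctetIt end)
{
    if (it == end)
        throw std::out_of_range("no octets left");

    const unsigned char lead = static_cast<unsigned char>(*it);
    std::uint32_t cp = 0;
    int trailing = 0;  // number of continuation octets still expected
    if (lead < 0x80)                { cp = lead;        trailing = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; trailing = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; trailing = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; trailing = 3; }
    else throw std::invalid_argument("invalid lead octet");

    ++it;
    for (; trailing > 0; --trailing, ++it) {
        if (it == end)
            throw std::out_of_range("truncated sequence");
        const unsigned char octet = static_cast<unsigned char>(*it);
        if ((octet & 0xC0) != 0x80)
            throw std::invalid_argument("invalid continuation octet");
        // Each continuation octet contributes six payload bits.
        cp = (cp << 6) | (octet & 0x3F);
    }
    return cp;
}
```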

utf8::peek_next

Available in version 2.1 and later.

Given the iterator to the beginning of the UTF-8 sequence, it returns the code point for the following sequence without changing the value of the iterator.

template <typename octet_iterator> 
uint32_t peek_next(octet_iterator it, octet_iterator end);

Example of use:

char* twochars = "\xe6\x97\xa5\xd1\x88";
char* w = twochars;
int cp = peek_next(w, twochars + 6);
assert (cp == 0x65e5);
assert (w == twochars);

In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown.

utf8::prior

Available in version 1.02 and later.

Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it decreases the iterator until it hits the beginning of the previous UTF-8 encoded code point and returns the 32 bit representation of the code point.

template <typename octet_iterator> 
uint32_t prior(octet_iterator& it, octet_iterator start);

Example of use:

char* twochars = "\xe6\x97\xa5\xd1\x88";
char* w = twochars + 3;
int cp = prior (w, twochars);
assert (cp == 0x65e5);
assert (w == twochars);

This function has two purposes: one is to iterate backwards through a UTF-8 encoded string. Note that it is usually a better idea to iterate forward instead, since utf8::next is faster. The second purpose is to find the beginning of a UTF-8 sequence if we have a random position within a string.

The iterator it will typically point to the beginning of a code point, and start will point to the beginning of the string to ensure we don't go backwards too far. The iterator is decreased until it points to a lead UTF-8 octet, and then the UTF-8 sequence beginning with that octet is decoded to a 32 bit representation and returned.

In case start is reached before a UTF-8 lead octet is hit, or if an invalid UTF-8 sequence is started by the lead octet, an invalid_utf8 exception is thrown.
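
The backward scan described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: step_back_to_lead is a hypothetical name, it only repositions the iterator (decoding is left out), and std::out_of_range stands in for the library's exception.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical sketch: moves `it` backwards until it points to an octet
// that is not a continuation octet (10xxxxxx), i.e. to the lead octet of
// the previous sequence, or of the current one if `it` started in the
// middle of a sequence. Never moves before `start`.
template <typename OctetIt>
void step_back_to_lead(OctetIt& it, OctetIt start)
{
    do {
        if (it == start)
            throw std::out_of_range("reached start before a lead octet");
        --it;
    } while ((static_cast<unsigned char>(*it) & 0xC0) == 0x80);
}
```

Because every continuation octet carries the same 10xxxxxx marker, the scan needs no knowledge of the sequence length, which is what makes it usable from an arbitrary position in the string.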

utf8::previous

Deprecated in version 1.02 and later.

Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it decreases the iterator until it hits the beginning of the previous UTF-8 encoded code point and returns the 32 bit representation of the code point.

template <typename octet_iterator> 
uint32_t previous(octet_iterator& it, octet_iterator pass_start);

Example of use:

char* twochars = "\xe6\x97\xa5\xd1\x88";
char* w = twochars + 3;
int cp = previous (w, twochars - 1);
assert (cp == 0x65e5);
assert (w == twochars);

utf8::previous is deprecated, and utf8::prior should be used instead, although existing code can continue using this function. The problem is the parameter pass_start, which points to the position just before the beginning of the sequence. Standard containers have no concept of "pass start", so the function cannot be used with their iterators.

The iterator it will typically point to the beginning of a code point, and pass_start will point to the octet just before the beginning of the string to ensure we don't go backwards too far. The iterator is decreased until it points to a lead UTF-8 octet, and then the UTF-8 sequence beginning with that octet is decoded to a 32 bit representation and returned.

In case pass_start is reached before a UTF-8 lead octet is hit, or if an invalid UTF-8 sequence is started by the lead octet, an invalid_utf8 exception is thrown.

utf8::advance

Available in version 1.0 and later.

Advances an iterator by the specified number of code points within a UTF-8 sequence.

template <typename octet_iterator, typename distance_type> 
void advance (octet_iterator& it, distance_type n, octet_iterator end);

Example of use:

char* twochars = "\xe6\x97\xa5\xd1\x88";
char* w = twochars;
advance (w, 2, twochars + 6);
assert (w == twochars + 5);

This function works only "forward". In case of a negative n, there is no effect.

In case of an invalid code point, a utf8::invalid_code_point exception is thrown.

utf8::distance

Available in version 1.0 and later.

Given the iterators to two UTF-8 encoded code points in a sequence, returns the number of code points between them.

template <typename octet_iterator> 
typename std::iterator_traits<octet_iterator>::difference_type distance (
         octet_iterator first, octet_iterator last);

Example of use:

char* twochars = "\xe6\x97\xa5\xd1\x88";
size_t dist = utf8::distance(twochars, twochars + 5);
assert (dist == 2);

This function is used to find the length (in code points) of a UTF-8 encoded string. The reason it is called distance, rather than, say, length is mainly because developers are used to length as an O(1) function. Computing the length of a UTF-8 string is a linear operation, and it looked better to model it after the std::distance algorithm.

In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown. If last does not point to the past-of-end of a UTF-8 sequence, a utf8::not_enough_room exception is thrown.

utf8::utf16to8

Available in version 1.0 and later.

Converts a UTF-16 encoded string to UTF-8.

template <typename u16bit_iterator, typename octet_iterator>
octet_iterator utf16to8 (u16bit_iterator start, 
      u16bit_iterator end, octet_iterator result);

Example of use:

unsigned short utf16string[] = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e};
vector<unsigned char> utf8result;
utf16to8(utf16string, utf16string + 5, back_inserter(utf8result));
assert (utf8result.size() == 10);

In case of an invalid UTF-16 sequence, a utf8::invalid_utf16 exception is thrown.

utf8::utf8to16

Available in version 1.0 and later.

Converts a UTF-8 encoded string to UTF-16.

template <typename u16bit_iterator, typename octet_iterator>
u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, 
                          u16bit_iterator result);

Example of use:

char utf8_with_surrogates[] = "\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e";
vector <unsigned short> utf16result;
utf8to16(utf8_with_surrogates, utf8_with_surrogates + 9, 
         back_inserter(utf16result));
assert (utf16result.size() == 4);
assert (utf16result[2] == 0xd834);
assert (utf16result[3] == 0xdd1e);

In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown. If end does not point to the past-of-end of a UTF-8 sequence, a utf8::not_enough_room exception is thrown.
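
The last code point in the example above, U+1D11E, lies outside the Basic Multilingual Plane, which is why it occupies two UTF-16 units. The arithmetic behind that can be sketched as follows; to_surrogates and from_surrogates are hypothetical helpers for illustration, not library functions, and they do not validate their input.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical sketch: splits a code point in the range U+10000..U+10FFFF
// into a UTF-16 surrogate pair (lead, trail). The range is not checked.
std::pair<std::uint16_t, std::uint16_t> to_surrogates(std::uint32_t cp)
{
    const std::uint32_t v = cp - 0x10000;  // 20-bit value
    const std::uint16_t lead =
        static_cast<std::uint16_t>(0xD800 + (v >> 10));    // top 10 bits
    const std::uint16_t trail =
        static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));  // low 10 bits
    return std::make_pair(lead, trail);
}

// The inverse operation: recombines a surrogate pair into a code point.
std::uint32_t from_surrogates(std::uint16_t lead, std::uint16_t trail)
{
    return 0x10000
         + ((static_cast<std::uint32_t>(lead) - 0xD800) << 10)
         + (static_cast<std::uint32_t>(trail) - 0xDC00);
}
```

For U+1D11E this gives the pair 0xd834, 0xdd1e, matching the two assertions in the example.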

utf8::utf32to8

Available in version 1.0 and later.

Converts a UTF-32 encoded string to UTF-8.

template <typename octet_iterator, typename u32bit_iterator>
octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, 
                         octet_iterator result);

Example of use:

int utf32string[] = {0x448, 0x65E5, 0x10346, 0};
vector<unsigned char> utf8result;
utf32to8(utf32string, utf32string + 3, back_inserter(utf8result));
assert (utf8result.size() == 9);

In case of an invalid UTF-32 string, a utf8::invalid_code_point exception is thrown.

utf8::utf8to32

Available in version 1.0 and later.

Converts a UTF-8 encoded string to UTF-32.

template <typename octet_iterator, typename u32bit_iterator>
u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, 
                          u32bit_iterator result);

Example of use:

char* twochars = "\xe6\x97\xa5\xd1\x88";
vector<int> utf32result;
utf8to32(twochars, twochars + 5, back_inserter(utf32result));
assert (utf32result.size() == 2);

In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown. If end does not point to the past-of-end of a UTF-8 sequence, a utf8::not_enough_room exception is thrown.

utf8::find_invalid

Available in version 1.0 and later.

Detects an invalid sequence within a UTF-8 string.

template <typename octet_iterator> 
octet_iterator find_invalid(octet_iterator start, octet_iterator end);

Example of use:

char utf_invalid[] = "\xe6\x97\xa5\xd1\x88\xfa";
char* invalid = find_invalid(utf_invalid, utf_invalid + 6);
assert (invalid == utf_invalid + 5);

This function is typically used to make sure a UTF-8 string is valid before processing it with other functions. It is especially important to call it before doing any of the unchecked operations on it.

utf8::is_valid

Available in version 1.0 and later.

Checks whether a sequence of octets is a valid UTF-8 string.

template <typename octet_iterator> 
bool is_valid(octet_iterator start, octet_iterator end);

Example of use:

char utf_invalid[] = "\xe6\x97\xa5\xd1\x88\xfa";
bool bvalid = is_valid(utf_invalid, utf_invalid + 6);
assert (bvalid == false);

is_valid is a shorthand for find_invalid(start, end) == end;. You may want to use it to make sure that a byte sequence is a valid UTF-8 string without the need to know where it fails if it is not valid.
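
To make the relationship between the two functions concrete, here is a simplified, library-independent sketch of a validity scan. The names find_first_invalid and is_valid_sketch are hypothetical, the code requires at least forward iterators, and it should be read as an approximation of the UTF-8 well-formedness rules rather than as the library's actual logic.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: returns an iterator to the start of the first
// ill-formed sequence, or `last` if the whole range is well formed.
template <typename OctetIt>
OctetIt find_first_invalid(OctetIt first, OctetIt last)
{
    while (first != last) {
        const OctetIt seq_start = first;
        const unsigned char lead = static_cast<unsigned char>(*first);
        std::uint32_t cp = 0;      // decoded code point
        std::uint32_t min_cp = 0;  // smallest value allowed for this length
        int trailing = 0;          // continuation octets still expected
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; trailing = 1; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; trailing = 2; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; trailing = 3; min_cp = 0x10000; }
        else return seq_start;     // not a valid lead octet

        ++first;
        for (; trailing > 0; --trailing, ++first) {
            if (first == last)
                return seq_start;  // sequence is cut off
            const unsigned char octet = static_cast<unsigned char>(*first);
            if ((octet & 0xC0) != 0x80)
                return seq_start;  // expected a continuation octet
            cp = (cp << 6) | (octet & 0x3F);
        }
        // Reject overlong encodings, surrogates, and out-of-range values.
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return seq_start;
    }
    return last;
}

// The yes/no check is then a one-liner on top of the scan.
template <typename OctetIt>
bool is_valid_sketch(OctetIt first, OctetIt last)
{
    return find_first_invalid(first, last) == last;
}
```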

utf8::replace_invalid

Available in version 2.0 and later.

Replaces all invalid UTF-8 sequences within a string with a replacement marker.

template <typename octet_iterator, typename output_iterator>
output_iterator replace_invalid(octet_iterator start, octet_iterator end, 
                output_iterator out, uint32_t replacement);
template <typename octet_iterator, typename output_iterator>
output_iterator replace_invalid(octet_iterator start, 
                octet_iterator end, output_iterator out);

Example of use:

char invalid_sequence[] = "a\x80\xe0\xa0\xc0\xaf\xed\xa0\x80z";
vector<char> replace_invalid_result;
replace_invalid (invalid_sequence, invalid_sequence + 
        sizeof(invalid_sequence), back_inserter(replace_invalid_result), '?');
bool bvalid = is_valid(replace_invalid_result.begin(), 
                  replace_invalid_result.end());
assert (bvalid);
char* fixed_invalid_sequence = "a????z";
assert (std::equal(replace_invalid_result.begin(), 
        replace_invalid_result.end(), fixed_invalid_sequence));

replace_invalid does not perform in-place replacement of invalid sequences. Rather, it produces a copy of the original string with the invalid sequences replaced with a replacement marker. Therefore, out must not be in the [start, end] range.

If end does not point to the past-of-end of a UTF-8 sequence, a utf8::not_enough_room exception is thrown.

utf8::is_bom

Available in version 1.0 and later.

Checks whether a sequence of three octets is a UTF-8 byte order mark (BOM).

template <typename octet_iterator> 
bool is_bom (octet_iterator it);

Example of use:

unsigned char byte_order_mark[] = {0xef, 0xbb, 0xbf};
bool bbom = is_bom(byte_order_mark);
assert (bbom == true);

The typical use of this function is to check the first three bytes of a file. If they form the UTF-8 BOM, we want to skip them before processing the actual UTF-8 encoded text.
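
A minimal sketch of the "check and skip" step, written without the library so it can be tried in isolation, might look like this. The function skip_utf8_bom is a hypothetical helper that works on a std::string already read from the file.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the offset of the first octet after an
// optional UTF-8 BOM (0xEF 0xBB 0xBF), or 0 if the text has no BOM.
std::size_t skip_utf8_bom(const std::string& text)
{
    const bool has_bom =
        text.size() >= 3 &&
        static_cast<unsigned char>(text[0]) == 0xEF &&
        static_cast<unsigned char>(text[1]) == 0xBB &&
        static_cast<unsigned char>(text[2]) == 0xBF;
    return has_bom ? 3 : 0;
}
```

Processing can then start at text.begin() + skip_utf8_bom(text) instead of text.begin().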

Types from the utf8 namespace

utf8::iterator

Available in version 2.0 and later.

Adapts the underlying octet iterator to iterate over the sequence of code points, rather than raw octets.

template <typename octet_iterator>
class iterator;
Member functions

Example of use:

char* threechars = "\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88";
utf8::iterator<char*> it(threechars, threechars, threechars + 9);
utf8::iterator<char*> it2 = it;
assert (it2 == it);
assert (*it == 0x10346);
assert (*(++it) == 0x65e5);
assert ((*it++) == 0x65e5);
assert (*it == 0x0448);
assert (it != it2);
utf8::iterator<char*> endit (threechars + 9, 
                threechars, threechars + 9);
assert (++it == endit);
assert (*(--it) == 0x0448);
assert ((*it--) == 0x0448);
assert (*it == 0x65e5);
assert (--it == utf8::iterator<char*>(threechars, 
                    threechars, threechars + 9));
assert (*it == 0x10346);

The purpose of the utf8::iterator adapter is to enable easy iteration as well as the use of STL algorithms with UTF-8 encoded strings. Increment and decrement operators are implemented in terms of the utf8::next() and utf8::prior() functions.

Note that the utf8::iterator adapter is a checked iterator. It operates on the range specified in the constructor; any attempt to go out of that range will result in an exception. Even the comparison operators require both iterator objects to be constructed against the same range; otherwise, an exception is thrown. Typically, the range will be determined by the sequence container functions begin and end, i.e.:

std::string s = "example";
utf8::iterator<std::string::iterator> i (s.begin(), s.begin(), s.end());

Functions from the utf8::unchecked namespace

utf8::unchecked::append

Available in version 1.0 and later.

Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence to a UTF-8 string.

template <typename octet_iterator>
octet_iterator append(uint32_t cp, octet_iterator result);

Example of use:

unsigned char u[5] = {0,0,0,0,0};
unsigned char* end = unchecked::append(0x0448, u);
assert (u[0] == 0xd1 && u[1] == 0x88 && u[2] == 0 
                     && u[3] == 0 && u[4] == 0);

This is a faster but less safe version of utf8::append. It does not check for validity of the supplied code point, and may produce an invalid UTF-8 sequence.

utf8::unchecked::next

Available in version 1.0 and later.

Given the iterator to the beginning of a UTF-8 sequence, it returns the code point and moves the iterator to the next position.

template <typename octet_iterator>
uint32_t next(octet_iterator& it);

Example of use:

char* twochars = "\xe6\x97\xa5\xd1\x88";
char* w = twochars;
int cp = unchecked::next(w);
assert (cp == 0x65e5);
assert (w == twochars + 3);

This is a faster but less safe version of utf8::next. It does not check for validity of the supplied UTF-8 sequence.

utf8::unchecked::peek_next

Available in version 2.1 and later.

Given the iterator to the beginning of a UTF-8 sequence, it returns the code point.

template <typename octet_iterator>
uint32_t peek_next(octet_iterator it);

Example of use:

char* twochars = "\xe6\x97\xa5\xd1\x88";
char* w = twochars;
int cp = unchecked::peek_next(w);
assert (cp == 0x65e5);
assert (w == twochars);

This is a faster but less safe version of utf8::peek_next. It does not check for validity of the supplied UTF-8 sequence.

utf8::unchecked::prior

Available in version 1.02 and later.

Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it decreases the iterator until it hits the beginning of the previous UTF-8 encoded code point, and returns the 32 bit representation of the code point.

template <typename octet_iterator>
uint32_t prior(octet_iterator& it);

Example of use:

char* twochars = "\xe6\x97\xa5\xd1\x88";
char* w = twochars + 3;
int cp = unchecked::prior (w);
assert (cp == 0x65e5);
assert (w == twochars);

This is a faster but less safe version of utf8::prior. It does not check for validity of the supplied UTF-8 sequence, and offers no boundary checking.

utf8::unchecked::previous (deprecated, see utf8::unchecked::prior)

Deprecated in version 1.02 and later.

Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it decreases the iterator until it hits the beginning of the previous UTF-8 encoded code point and returns the 32 bit representation of the code point.

template <typename octet_iterator>
uint32_t previous(octet_iterator& it);

Example of use:

char* twochars = "\xe6\x97\xa5\xd1\x88";
char* w = twochars + 3;
int cp = unchecked::previous (w);
assert (cp == 0x65e5);
assert (w == twochars);

The reason this function is deprecated is just the consistency with the "checked" versions, where prior should be used instead of previous. In fact, unchecked::previous behaves exactly the same as unchecked::prior.

This is a faster but less safe version of utf8::previous. It does not check for validity of the supplied UTF-8 sequence, and offers no boundary checking.

utf8::unchecked::advance

Available in version 1.0 and later.

Advances an iterator by the specified number of code points within a UTF-8 sequence.

template <typename octet_iterator, typename distance_type>
void advance (octet_iterator& it, distance_type n);

Example of use:

char* twochars = "\xe6\x97\xa5\xd1\x88";
char* w = twochars;
unchecked::advance (w, 2);
assert (w == twochars + 5);

This function works only "forward". In case of a negative n, there is no effect.

This is a faster but less safe version of utf8::advance. It does not check for validity of the supplied UTF-8 sequence, and offers no boundary checking.

utf8::unchecked::distance

Available in version 1.0 and later.

Given the iterators to two UTF-8 encoded code points in a sequence, returns the number of code points between them.

template <typename octet_iterator>
typename std::iterator_traits<octet_iterator>::difference_type 
         distance (octet_iterator first, octet_iterator last);

Example of use:

char* twochars = "\xe6\x97\xa5\xd1\x88";
size_t dist = utf8::unchecked::distance(twochars, twochars + 5);
assert (dist == 2);

This is a faster but less safe version of utf8::distance. It does not check for validity of the supplied UTF-8 sequence.

utf8::unchecked::utf16to8

Available in version 1.0 and later.

Converts a UTF-16 encoded string to UTF-8.

template <typename u16bit_iterator, typename octet_iterator>
octet_iterator utf16to8 (u16bit_iterator start, 
               u16bit_iterator end, octet_iterator result);

Example of use:

unsigned short utf16string[] = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e};
vector<unsigned char> utf8result;
unchecked::utf16to8(utf16string, utf16string + 5, 
                    back_inserter(utf8result));
assert (utf8result.size() == 10);

This is a faster but less safe version of utf8::utf16to8. It does not check for validity of the supplied UTF-16 sequence.

utf8::unchecked::utf8to16

Available in version 1.0 and later.

Converts a UTF-8 encoded string to UTF-16.

template <typename u16bit_iterator, typename octet_iterator>
u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, 
                          u16bit_iterator result);

Example of use:

char utf8_with_surrogates[] = "\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e";
vector <unsigned short> utf16result;
unchecked::utf8to16(utf8_with_surrogates, utf8_with_surrogates + 9, 
                    back_inserter(utf16result));
assert (utf16result.size() == 4);
assert (utf16result[2] == 0xd834);
assert (utf16result[3] == 0xdd1e);

This is a faster but less safe version of utf8::utf8to16. It does not check for validity of the supplied UTF-8 sequence.

utf8::unchecked::utf32to8

Available in version 1.0 and later.

Converts a UTF-32 encoded string to UTF-8.

template <typename octet_iterator, typename u32bit_iterator>
octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, 
                         octet_iterator result);

Example of use:

int utf32string[] = {0x448, 0x65e5, 0x10346, 0};
vector<unsigned char> utf8result;
unchecked::utf32to8(utf32string, utf32string + 3, back_inserter(utf8result));
assert (utf8result.size() == 9);

This is a faster but less safe version of utf8::utf32to8. It does not check for validity of the supplied UTF-32 sequence.

utf8::unchecked::utf8to32

Available in version 1.0 and later.

Converts a UTF-8 encoded string to UTF-32.

template <typename octet_iterator, typename u32bit_iterator>
u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, 
                          u32bit_iterator result);

Example of use:

char* twochars = "\xe6\x97\xa5\xd1\x88";
vector<int> utf32result;
unchecked::utf8to32(twochars, twochars + 5, back_inserter(utf32result));
assert (utf32result.size() == 2);

This is a faster but less safe version of utf8::utf8to32. It does not check for validity of the supplied UTF-8 sequence.

Types from the utf8::unchecked namespace

utf8::unchecked::iterator

Available in version 2.0 and later.

Adapts the underlying octet iterator to iterate over the sequence of code points, rather than raw octets.

template <typename octet_iterator>
class iterator;
Member functions

Example of use:

char* threechars = "\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88";
utf8::unchecked::iterator<char*> un_it(threechars);
utf8::unchecked::iterator<char*> un_it2 = un_it;
assert (un_it2 == un_it);
assert (*un_it == 0x10346);
assert (*(++un_it) == 0x65e5);
assert ((*un_it++) == 0x65e5);
assert (*un_it == 0x0448);
assert (un_it != un_it2);
utf8::unchecked::iterator<char*> un_endit (threechars + 9);
assert (++un_it == un_endit);
assert (*(--un_it) == 0x0448);
assert ((*un_it--) == 0x0448);
assert (*un_it == 0x65e5);
assert (--un_it == utf8::unchecked::iterator<char*>(threechars));
assert (*un_it == 0x10346);

This is an unchecked version of utf8::iterator. It is faster in many cases, but offers no validity or range checks.

Points of interest

Design goals and decisions

The library was designed to be:

  1. Generic: for better or worse, there are many C++ string classes out there, and the library should work with as many of them as possible.
  2. Portable: the library should be portable both across different platforms and compilers. The only non-portable code is a small section that declares unsigned integers of different sizes: three typedefs. They can be changed by the users of the library if they don't match their platform. The default setting should work for Windows (both 32 and 64 bit), and most 32 bit and 64 bit Unix derivatives.
  3. Lightweight: follow the "pay only for what you use" guideline.
  4. Unintrusive: avoid forcing any particular design or even programming style on the user. This is a library, not a framework.
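
The portability point about the three typedefs can be illustrated with a short sketch. The mapping below is an assumption about what such a section typically looks like on common platforms, placed in a hypothetical namespace sketch rather than in the library's own namespace; the compile-time checks show one way to catch a platform where the assumed sizes do not hold (static_assert needs a C++11 compiler).

```cpp
#include <cassert>
#include <climits>

namespace sketch {
    // Assumed mapping for mainstream 32 and 64 bit platforms; these are
    // the lines a user would change if the sizes were different.
    typedef unsigned char  uint8_t;
    typedef unsigned short uint16_t;
    typedef unsigned int   uint32_t;
}

// Fail the build early if the assumptions above are wrong.
static_assert(sizeof(sketch::uint8_t)  * CHAR_BIT == 8,  "uint8_t must be 8 bits");
static_assert(sizeof(sketch::uint16_t) * CHAR_BIT == 16, "uint16_t must be 16 bits");
static_assert(sizeof(sketch::uint32_t) * CHAR_BIT == 32, "uint32_t must be 32 bits");
```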

Alternatives

In case you want to look into other means of working with UTF-8 strings from C++, here is the list of solutions I am aware of:

  1. ICU Library. It is very powerful, complete, feature-rich, mature, and widely used. It is also big, intrusive, non-generic, and doesn't play well with the Standard Library. I definitely recommend looking at ICU even if you don't plan to use it.
  2. Glib::ustring. A class specifically made to work with UTF-8 strings, and also feels like std::string. If you prefer to have yet another string class in your code, it may be worth a look. Be aware of the licensing issues, though.
  3. Platform dependent solutions: Windows and POSIX have functions to convert strings from one encoding to another. That is only a subset of what my library offers, but if that is all you need, it may be good enough, especially given the fact that these functions are mature and tested in production.

Conclusion

Until Unicode becomes officially recognized by the C++ Standard Library, we need to use other means to work with UTF-8 strings. Template functions I describe in this article may be a good step in this direction.

References

  1. The Unicode Consortium.
  2. ICU Library.
  3. UTF-8 at Wikipedia.
  4. UTF-8 and Unicode FAQ for Unix/Linux.
General: convert_from_utf32
N a v a n e e t h
2:12 27 Oct '09  
While I like the append(uint32_t , octet_iterator ) method, it would be helpful if there is a method that returns a string from the code point rather than requesting an iterator and appending to it. This will help users to get a string directly without worrying about the available memory or using std::back_inserter. Something like:
std::string convert_from_utf32(uint32_t cp)
{
std::string str;
append(cp, std::back_inserter(str));
return str;
}
Also is there any reason behind using unsigned char?

Best wishes,
Navaneeth

General: Re: convert_from_utf32
Nemanja Trifunovic
4:43 27 Oct '09  
N a v a n e e t h wrote:
it would be helpful if there is a method that returns a string from the code point rather than requesting an iterator and appending to it.

I agree, but remember there are many libraries/frameworks that use their own string classes and I don't want to force them into including std::string. Maybe I could add a separate helper header file with convenience functions like that.


N a v a n e e t h wrote:
Also is there any reason behind using unsigned char?

You mean unsigned int for code points? It does not really matter, because Unicode needs only 21 bits, but the nature of the beast is that all code points are positive and I wanted to reflect this in the function signature.


General: Re: convert_from_utf32
N a v a n e e t h
8:59 27 Oct '09  
Thanks. That makes sense.

Best wishes,
Navaneeth

General: only convert between UTFs?
edger
21:51 13 Jul '09  
if I understand it correctly, this library deals with issues under the constraints of UNICODE context: the std::string is assumed to be UTF-8, while std::wstring is assumed to be UTF-16/32.

if the std::string holds non-UNICODE bytes, say GB2312, this library is irrelevant, right?
General: Re: only convert between UTFs?
mokanel
23:32 13 Jul '09  
Hello there,

C++ streams have no concept of encoding; each element is treated as an independent entity.
For example, a wstring stores fixed-width wide characters (16 bits wide on Windows and 32 bits on Linux, depending on the compiler options).
In no way can you put an equals sign between std::string and UTF-8; they are different things.

Search the internet, or have a quick look here regarding conversion between UTF encoded strings and C++ standard strings:
http://www.cplusplus.com/forum/beginner/7233/
General: Re: only convert between UTFs?
Nemanja Trifunovic
1:55 14 Jul '09  
edger wrote:
if I understand it correctly, this library deals with issues under the constraints of UNICODE context

Correct. To be more precise, it focuses on the UTF-8 encoding and enables operations on UTF-8 encoded strings, such as iteration, finding the length, etc. There is also conversion to and from UTF-16/32.


edger wrote:
if the std::string holds non-UNICODE bytes, say GB2312, this library is irrelevant. right?

That is correct.


GeneralVersion 2.2 released.
Nemanja Trifunovic
4:59 7 Jul '09  
Please take a look at the project page at SourceForge.net (link at the top of the article). With this version, stream and stream buffer iterators can be used with some of the functions, such as conversions utf8<>utf16 and utf8<>utf32 and checks for validity.

I sent the updated article text to the CP editors and hope to see it soon.

Enjoy.


GeneralRe: Version 2.2 released.
N a v a n e e t h
18:33 8 Jul '09  
Great one Nemanja. I enjoyed reading your code.


GeneralRe: Version 2.2 released.
Nemanja Trifunovic
7:15 9 Jul '09  
Thanks!


QuestionWarning in Visual C++
koloko
7:44 4 May '09  
Hi Nemanja,

First thank you for this great library, it helps a lot. I'd like to make two additions:

1) In your example, you should update your main function with int main(int argc, char *argv[])
2) These two functions work perfectly :

wstring WSTRING2UTF8 (wstring data)
{
wstring temp;
utf8::utf16to8(data.begin() , data.end() , back_inserter(temp));
data = temp;
return data;
}

wstring UTF82WSTRING (wstring data)
{
wstring temp;
utf8::utf8to16(data.begin() , data.end() , back_inserter(temp));
data = temp;
return data;
}

But the second one gives me this warning in Visual C++ :

z:\user\documents\visual studio 2008\projects\library\utf8/checked.h(148) : warning C4244: 'argument' : conversion from 'wchar_t' to 'utf8::uint8_t', possible loss of data
1> z:\user\documents\visual studio 2008\projects\library\utf8/checked.h(228) : see reference to function template instantiation 'utf8::uint32_t utf8::next(octet_iterator &,octet_iterator)' being compiled
1> with
1> [
1> octet_iterator=std::_String_iterator,std::allocator>
1> ]
1> z:\user\documents\visual studio 2008\projects\bookshelves\bookshelves\Fonctions.h(169) : see reference to function template instantiation 'u16bit_iterator utf8::utf8to16,std::_String_iterator<_Elem,_Traits,_Alloc>>(octet_iterator,octet_iterator,u16bit_iterator)' being compiled
1> with
1> [
1> u16bit_iterator=std::back_insert_iterator,
1> _Container=std::wstring,
1> _Elem=wchar_t,
1> _Traits=std::char_traits,
1> _Alloc=std::allocator,
1> octet_iterator=std::_String_iterator,std::allocator>
1> ]





AnswerRe: Warning in Visual C++
Nemanja Trifunovic
8:15 4 May '09  
Hi, and thanks for your feedback.


koloko wrote:
1) In your example, you should update your main function with int main(int argc, char *argv[])

Ooops - thanks for the catch :)


koloko wrote:
But the second one gives me this warning in Visual C++ :

As far as I can see the problem is that you are storing utf-8 encoding in wstring. You should really use string, i.e. something like (untested):

string WSTRING2UTF8 (const wstring& data)
{
string ret;
utf8::utf16to8(data.begin() , data.end() , back_inserter(ret));
return ret;
}

wstring UTF82WSTRING (const string& data)
{
wstring ret;
utf8::utf8to16(data.begin() , data.end() , back_inserter(ret));
return ret;
}



GeneralRe: Warning in Visual C++
koloko
18:13 4 May '09  
Hey, thanks a lot. No more warning !
Generalutf16to8 and next() seem not to play well together (VS2005)
Bill Davy
23:17 5 Jan '09  
//
// Make a UTF8 string.
//
unsigned short utf16string[] = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e};
vector<unsigned char> utf8result;

utf8::utf16to8(utf16string, utf16string + 5, back_inserter(utf8result));

assert (utf8result.size() == 10);

//
// Process the UTF8 string, character by character.
//
vector<unsigned char>::iterator pByte = utf8result.begin();
vector<unsigned char>::iterator pEndByte = utf8result.end();

for(size_t i=0; i < 5; ++i)
{
utf8::uint32_t cp;
try {
cp = utf8::next(pByte, pEndByte);
assert (cp == utf16string[i]); // <<< Fails here
}
catch(...)
{
cerr << "Caught exception\n";
return;
}
if ( cp == 0 )
break;
}

When the indicated assertion fails, i==3, cp==0x0001d11e

Also, it is best to turn off 64-bit checking, as std::advance() in validate_next() causes an argument warning (perhaps octet_difference_type is not a good choice).
GeneralRe: utf16to8 and next() seem not to play well together (VS2005)
Nemanja Trifunovic
4:55 6 Jan '09  
Hi Bill.

The problem with your test code is that you assume there are 5 characters in utf16string, and there are really four: 0xd834 and 0xdd1e compose a surrogate pair, which encodes the single code point U+1D11E (the cp value you see when i==3). To determine the length of a string after you convert it to UTF-8, please use the utf8::distance() function.
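
To see where the count of four comes from, here is a minimal stand-alone sketch that combines surrogate pairs into code points by hand, using only the standard library. It is not the library's own code: the name decode_utf16 is made up for illustration, and unpaired surrogates are simply passed through, which a real decoder would reject.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper for illustration; not part of the utf8 library.
// Decodes UTF-16 code units into Unicode code points. Unpaired surrogates
// are copied through as-is (an assumption that keeps the sketch short).
std::vector<unsigned long> decode_utf16(const unsigned short* first,
                                        const unsigned short* last)
{
    std::vector<unsigned long> code_points;
    while (first != last) {
        unsigned long cp = *first++;
        const bool is_lead = (cp >= 0xD800 && cp <= 0xDBFF);
        // A lead surrogate followed by a trail surrogate forms one code point.
        if (is_lead && first != last && *first >= 0xDC00 && *first <= 0xDFFF) {
            const unsigned long trail = *first++;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (trail - 0xDC00);
        }
        code_points.push_back(cp);
    }
    return code_points;
}

// Runs the sequence from the test code above through the sketch.
void check_surrogate_pair()
{
    const unsigned short utf16string[] = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e};
    const std::vector<unsigned long> cps = decode_utf16(utf16string, utf16string + 5);
    assert(cps.size() == 4);    // five code units, but four code points
    assert(cps[3] == 0x1D11E);  // 0xd834 0xdd1e combine into one code point
}
```

A loop over the decoded text should therefore run four times, not five, and compare against code points rather than against the raw UTF-16 code units.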

As for the warning you see - would you send me the exact text of the warning and the compiler you use?

Thanks!


QuestionCan I use this with VC6?
Neville Franks
20:23 4 Mar '08  
Hi Nemanja,
I'm trying to use your UTF-8 code with VC6 and am getting a lot of compile errors such as:

-----
d:\libs\utf8_v2_1\source\utf8\core.h(88) : error C2039: 'difference_type' : is not a member of 'iterator_traits<_It>'
d:\libs\utf8_v2_1\source\utf8\core.h(88) : error C2146: syntax error : missing ';' before identifier 'sequence_length'
----

Am I right in assuming it won't work with VC6?

Neville Franks, Author of Surfulater www.surfulater.com "Save what you Surf" and ED for Windows www.getsoft.com

GeneralRe: Can I use this with VC6?
Nemanja Trifunovic
4:22 5 Mar '08  
Neville Franks wrote:
Am I right in assuming it won't work with VC6?

Hi Neville.

While I don't support VC6 (don't even have it anywhere), it should be pretty easy to tweak the library to work with VC6. See this reply from the SourceForge thread:

I don't have Visual C++ 6.0, so I am afraid I can't fix it. However, if
the only problem is the lack of std::iterator::difference_type, I guess you
can work around it simply by using int instead of it.

Define a preprocessor macro, say utf8_difference_type, and define it as
int if _MSC_VER < 1300 and std::iterator::difference_type otherwise. Then
just replace every occurrence of std::iterator::difference_type with
utf8_difference_type in the code.

Now, when I think of it, it would be probably even better to use ptrdiff_t instead of int.
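
A rough sketch of that workaround might look like the following. It is untested on VC6 and rests on a few assumptions: the macro name UTF8_DIFFERENCE_TYPE and the helper steps_between are made up for illustration, and the macro takes the iterator type as a parameter so that newer compilers can still go through std::iterator_traits.

```cpp
#include <cassert>
#include <iterator>
#include <stddef.h>
#include <vector>

// Hypothetical macro, not part of the library. _MSC_VER values below 1300
// identify compilers older than Visual C++ 7.0, i.e. VC6 and earlier.
#if defined(_MSC_VER) && _MSC_VER < 1300
#define UTF8_DIFFERENCE_TYPE(Iter) ptrdiff_t
#else
#define UTF8_DIFFERENCE_TYPE(Iter) typename std::iterator_traits<Iter>::difference_type
#endif

// Hypothetical example of code written against the macro: counts the
// number of increments needed to get from first to last.
template <typename Iter>
UTF8_DIFFERENCE_TYPE(Iter) steps_between(Iter first, Iter last)
{
    UTF8_DIFFERENCE_TYPE(Iter) n = 0;
    for (; first != last; ++first)
        ++n;
    return n;
}

// Small self-check for the non-VC6 branch.
void check_steps_between()
{
    std::vector<int> v(3);
    assert(steps_between(v.begin(), v.end()) == 3);
}
```

On a conforming compiler the macro expands to the usual iterator_traits lookup, so the change should be invisible there.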

In any case, if you make any changes, I strongly recommend you run my unit-tests that can be found at the SVN repository of the SourceForge project.

Hope it helps.


GeneralRe: Can I use this with VC6?
Neville Franks
13:42 5 Mar '08  
Hi Nemanja,
Thanks for replying. For now I've found another solution with code I already had, so I'm ok. Your advice will be useful should I use your code in the future though.

Neville Franks, Author of Surfulater www.surfulater.com "Save what you Surf" and ED for Windows www.getsoft.com

GeneralComparing characters/strings...
milaks
17:24 18 Jan '07  
Hello.
Although I'm new to using UTF-8 in my programs, I've been looking for a simple, portable and lightweight library, and this may be it.
But can someone please tell me how I can use this library to compare UTF-8 characters (I guess I cannot use the char type, since a UTF-8 character can be 1-4 bytes long) and also UTF-8 strings? (If it's not too much to ask, a few examples would be nice :) )

Thanks in advance.
AnswerRe: Comparing characters/strings...
Nemanja Trifunovic
3:04 19 Jan '07  
To compare two UTF-8 encoded characters, the easiest way would be to use the utf8::next() function:

if (next(ch1, end1) == next (ch2, end2))...

For strings, if you are testing for equality only, it is enough to make sure their memory representation is identical - so you can use std::equal for this.
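
As a minimal sketch of that byte-wise equality test, using only the standard library (the helper name utf8_bytes_equal is made up for illustration): note that it compares the encoded bytes only, so two strings that look the same but use different code point sequences (for example a precomposed letter versus a base letter plus a combining mark) will compare as different.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical helper for illustration; not part of the utf8 library.
// Two UTF-8 strings are treated as equal when their bytes are identical.
bool utf8_bytes_equal(const std::string& a, const std::string& b)
{
    // std::equal with three iterators assumes the second range is long
    // enough, so the sizes are compared first.
    return a.size() == b.size() && std::equal(a.begin(), a.end(), b.begin());
}

void check_bytes_equal()
{
    const std::string sha1 = "\xD1\x88";  // U+0448, Cyrillic letter sha
    const std::string sha2 = "\xD1\x88";
    assert(utf8_bytes_equal(sha1, sha2));

    // U+00E9 versus "e" followed by U+0301: same appearance, different bytes.
    assert(!utf8_bytes_equal("\xC3\xA9", "e\xCC\x81"));
}
```

For std::string, operator== performs the same byte-wise comparison; std::equal is useful when the bytes live in some other container.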

Version 2 of the utf-8 cpp library contains iterator adapters which simplify these operations quite a bit, but it is only in Beta stage now, and I don't recommend using it for production code.


GeneralRe: Comparing characters/strings...
milaks
5:51 19 Jan '07  
Excuse me, but I don't completely understand. Will next(ch1, end1) return the first or the second character? That is, will it, in the above example, test the first character as well?

As for the strings, how can I test not only whether two strings are equal but also whether one string contains the other, something like strncmp(str1, str2, strlen(str2))?

Thanks.
GeneralRe: Comparing characters/strings...
Nemanja Trifunovic
11:05 19 Jan '07  

milaks wrote:
Will next(ch1, end1) return the first or the second character? That is, will it, in the above example, test the first character as well?

next(ch1, end1) will return the Unicode code point (if you are not familiar with the terminology, be sure to check out unicode.org) of the character starting at ch1, and advance ch1 past that character. By comparing the code points you can tell whether the two characters are different.
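
To make the "decode one character and move the iterator forward" behaviour concrete, here is a simplified stand-in written with the standard library only. It is not the library's utf8::next(): the names decode_next and first_code_point are made up, and the sketch assumes well-formed UTF-8, checking only for running off the end of the input.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for illustration; NOT the library's utf8::next().
// Assumes well-formed UTF-8 and only guards against running off the end.
unsigned long decode_next(std::string::const_iterator& it,
                          std::string::const_iterator end)
{
    if (it == end)
        throw std::out_of_range("no input left");
    const unsigned char lead = static_cast<unsigned char>(*it++);
    unsigned long cp;
    int trailing;
    if (lead < 0x80)      { cp = lead;        trailing = 0; }  // 1 byte (ASCII)
    else if (lead < 0xE0) { cp = lead & 0x1F; trailing = 1; }  // 2-byte sequence
    else if (lead < 0xF0) { cp = lead & 0x0F; trailing = 2; }  // 3-byte sequence
    else                  { cp = lead & 0x07; trailing = 3; }  // 4-byte sequence
    for (; trailing > 0; --trailing) {
        if (it == end)
            throw std::out_of_range("truncated sequence");
        // Each continuation byte contributes its low six bits.
        cp = (cp << 6) | (static_cast<unsigned char>(*it++) & 0x3F);
    }
    return cp;
}

// Code point of the first character in s.
unsigned long first_code_point(const std::string& s)
{
    std::string::const_iterator it = s.begin();
    return decode_next(it, s.end());
}

void check_decode_next()
{
    const std::string s = "A\xD1\x88";    // 'A' followed by U+0448
    std::string::const_iterator it = s.begin();
    assert(decode_next(it, s.end()) == 0x41);   // first call: first character
    assert(decode_next(it, s.end()) == 0x448);  // second call: second character
    assert(it == s.end());                      // iterator has been advanced
}
```

So the first call yields the first character; calling it again on the same iterator yields the next one.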

milaks wrote:
something like strncmp(str1, str2, strlen(str2))?

strncmp works just fine with UTF-8 strings; you don't need my library for it. Just be aware that strlen returns the number of bytes, not characters, which is exactly what strncmp needs here.
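
A small sketch of that prefix test, using only the C standard library (the helper name utf8_starts_with is made up for illustration). It assumes both arguments are valid, null-terminated UTF-8; under that assumption a byte-wise prefix match is also a character-wise prefix match.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper for illustration; not part of the utf8 library.
// Returns true when str begins with the bytes of prefix.
bool utf8_starts_with(const char* str, const char* prefix)
{
    // strlen counts bytes, which is the unit strncmp compares.
    return std::strncmp(str, prefix, std::strlen(prefix)) == 0;
}

void check_starts_with()
{
    const char* word   = "\xD1\x88\xD1\x83\xD0\xBC\xD0\xB0";  // four Cyrillic letters
    const char* prefix = "\xD1\x88\xD1\x83";                  // the first two of them
    assert(std::strlen(prefix) == 4);        // two characters, four bytes
    assert(utf8_starts_with(word, prefix));
    assert(!utf8_starts_with(prefix, word));
}
```

To search for one string anywhere inside another, rather than only at the start, strstr can be used in the same byte-wise way.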


GeneralRe: Comparing characters/strings...
milaks
13:08 19 Jan '07  
Thank you for your help.
This is enough for me to know to start using it :)

Хвала на одличној библиотеци (Thanks for the excellent library) ;)
GeneralRe: Comparing characters/strings...
Nemanja Trifunovic
14:03 19 Jan '07  
milaks wrote:
Хвала на одличној библиотеци

Молим лепо (You're welcome). ;)


GeneralRe: Comparing characters/strings...
milaks
15:35 19 Mar '07  
After a while, one more question about comparison :)

In UTF-8 encoded strings I can, for example, use C's strncmp() function to compare characters/bytes from strings, but what should I use if I need to perform a case-insensitive comparison of characters or strings?

I guess that the toupper() and tolower() macros from cctype won't work.

Best regards.
GeneralRe: Comparing characters/strings...
Nemanja Trifunovic
3:24 20 Mar '07  
milaks wrote:
I guess that the toupper() and tolower() macros from cctype won't work.

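Not sure, but toupper() and tolower() are locale-dependent, so you can try setting the locale for your system to UTF-8. Keep in mind, though, that they look at one char at a time, so at best they can handle the single-byte (ASCII) characters of a UTF-8 string.

For case-insensitive comparison, you can try strcasecmp (on Windows, stricmp IIRC) after you set the right (UTF-8) locale, with the same caveat for multi-byte characters.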



Last Updated 8 Jul 2009