This article is about reading and writing Unicode text to character streams in UTF-8 encoding, and consequently about an often misunderstood aspect of the C++ STL / Iostream library: locales.
The documentation that comes with the STL itself, although technically accurate, does not help much in understanding the relations between the objects involved in even a simple expression like a_stream >> variable;, partly because some of the details are hidden by the underlying logic.
Also, the behavior of the STL and its relation with the operating system are not always evident, leaving certain operations somewhat "mysterious".
This article goes into some aspects of Unicode encoding, STL locales, and their relation with Windows.
The code we refer to is designed to serve VC8 (I used Visual C++ 2005 Express) as well as MinGW (and how this compiler is distributed with respect to Unicode is far from obvious).
Summary
- Introduction
- A little bit about Unicode
- In the beginning
- The incoming of Code-pages
- Unicode
- Popular Unicode encodings
- UTF-8 encoding scheme
- UTF-16 encoding schemes
- Windows and Unicode
- Windows and localization
- C and Unicode
- C++ and Unicode
- Streams buffers and locales
- Going to UTF-8
- MinGW declarations
- gel::stdx::utf8cvt<bool>
- Invalid characters
- Trivial functions
- do_in
- do_out
- Using the facet
- The supplied code
- Testing sequence
- A practical sample
- Other MinGW and VC8 notable differences
- When to use
A little bit about Unicode
The history of character encoding is not as linear as it could be: a number of assumptions, made at a certain time and reverted later, created - and still create - some confusion.
In the beginning
In the beginning there were typewriters (mechanical machines to type characters on paper) and teletypes, which were, in essence, typewriters with wires between the keyboard and the paper.
To grant interoperability between the two halves of these machines, the ANSI committee defined the ASCII character set.
This set was designed to provide a binary representation for the 26 Latin letters, in both small and capital form, some punctuation and accents, and some "commands" of typical use in teletyping like CR, LF, FF, etc. It was designed to fit into 7 bits, to let hardware manufacturers keep the 8th bit available for error checking (parity).
The lack of accented characters was not a big deal, since teletypes overprint: an "à" was simply written as "a BS '" (BS is the backspace).
Supporting different countries was also not strictly necessary: ASCII is the "American Standard Code ...". Whoever was not American (or not comfortable with the American standard) just used another encoding scheme.
Interactive computers (with a keyboard and a monitor) complicated this aspect somewhat:
- The circulation of software was no longer a country-limited affair, and...
- The nature of a display required a character matrix associating a code with the displayed "character" at a given position, hence...
- Tricks like using backspace to obtain composed characters were no longer effective.
A number of attempts to extend the ASCII character set were made, first by hardware manufacturers.
IBM, when introducing the first PC, came up with an 8-bit char-set of 256 symbols, matching the ASCII ones from 32 to 126 (the "ASCII printable" range) and adding some accented letters, some mathematical symbols, and some semi-graphics.
All of those "some"s were the result of a compromise that, in fact, didn't match everyone's needs: it just attempted to satisfy 90% of the users of 90% of the countries IBM served at that time. But it fit into 8 bits.
The incoming of Code-pages
To better solve the problem, the concept of the code-page was introduced.
Essentially, the correspondence between codes and glyphs was made configurable, so that every country could configure the second half of the char-set with the characters it needed most.
Interoperability was assured only by the first 128 codes.
Later DOS versions - and early Windows - used 8-bit ANSI codes, with a number of code-pages for a variety of "editions".
The drawback of this method was that it was essentially impossible to hold texts mixing very heterogeneous languages: mixing Arabic and Japanese was practically impossible.
And reading a French text on an Arabic PC was sometimes a "pleasure"; even going from French to Italian led to strange mis-writings, because the same accented characters had different codings.
Also, a problem was still present for languages that require more than 128 specific symbols (think of Chinese): for them, multi-byte code-pages were introduced, giving MBCS.
Unicode
Unicode was introduced mainly to try to clean up all of this mess: assuming that the world cannot fit into 8 bits, it gave a distinct ID to every encoded symbol.
This is known as the UCS - Universal Character Set.
In its first definition it contained fewer than 65536 characters, and this made many software developers confident that 16 bits were enough to represent them all.
This is known as UCS-2.
The current situation sees a UCS defined up to 0x10FFFF (although with many still-unassigned elements), thus requiring 21 bits.
UCS-4 (4 bytes), using unsigned int (often typedef-ed as dchar_t) as the character type, certainly fits everything, but for many languages it is a waste of space.
Also, many communication channels carry bytes, not shorts or ints, and the way bytes are ordered into shorts and ints depends on the architecture of processors and machines; hence a pure binary dump of dchars or wchars is not practicable for files intended for communication or interoperation between different machines or devices.
Popular Unicode encodings
To address the above problems, a number of encodings, attempting to preserve interoperability with legacy environments, have been deployed for 7, 8, 16 and 32 bit environments.
In particular, in Windows environments, where Unicode was originally deployed as UCS-2 (16 bits) and communications still work on bytes, the 8 and 16 bit encodings are particularly useful and convenient.
These encodings are known as UTF-8 and UTF-16:
- The first is important for files, where interchangeability with non-Windows environments is also required, since it does not depend on machine endianness.
- The second is important since it replaced UCS-2 inside the Win32 APIs, thus granting support also for characters not fitting the first 65536 codes, by using a "multi-word" scheme. (Note that UCS-2 and UTF-16 are not the same: they coincide only for the first 65536-2048 code-points.)
UTF-8 encoding scheme
The encoding used to represent Unicode as bytes is based on rules that define how to break up the bit string representing a UCS code-point into bytes.
- If a UCS code-point fits 7 bits, it is coded as 0xxxxxxx. This makes ASCII characters represented by themselves.
- If a UCS code-point fits 11 bits, it is coded as 110xxxxx 10xxxxxx.
- If a UCS code-point fits 16 bits, it is coded as 1110xxxx 10xxxxxx 10xxxxxx.
- If a UCS code-point fits 21 bits, it is coded as 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
- If a UCS code-point fits 26 bits, it is coded as 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx.
- If a UCS code-point fits 31 bits, it is coded as 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx.
As far as the actual UCS space is concerned, no encoding should exist for more than 21 bits, hence the last two rules have no practical application and, in fact, current Unicode specifications consider them invalid.
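To make the byte-splitting concrete, here is a minimal sketch of an encoder for the rules above, limited to the 21 bits of actual Unicode. The function name is purely illustrative; it is not part of the facet developed later in this article.

#include <string>

// Minimal sketch of the byte-splitting rules above, limited to 21-bit code-points.
std::string encode_utf8(unsigned long ucs)
{
    std::string out;
    if (ucs < 0x80)                      // 7 bits: 0xxxxxxx
        out += char(ucs);
    else if (ucs < 0x800)                // 11 bits: 110xxxxx 10xxxxxx
    {
        out += char(0xC0 | (ucs >> 6));
        out += char(0x80 | (ucs & 0x3F));
    }
    else if (ucs < 0x10000)              // 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
    {
        out += char(0xE0 | (ucs >> 12));
        out += char(0x80 | ((ucs >> 6) & 0x3F));
        out += char(0x80 | (ucs & 0x3F));
    }
    else                                 // 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    {
        out += char(0xF0 | (ucs >> 18));
        out += char(0x80 | ((ucs >> 12) & 0x3F));
        out += char(0x80 | ((ucs >> 6) & 0x3F));
        out += char(0x80 | (ucs & 0x3F));
    }
    return out;
}

For example, encode_utf8(0x20AC) (the Euro sign) yields the three bytes E2 82 AC.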
It is clearly an encoding that privileges the low codes, which get shorter encodings, over the high codes.
Wikipedia has a good article about UTF-8 that shows the trade-offs against UCS-4 and UTF-16.
It must however be taken into account that all the markup used to represent text in pages (think of HTML) or data in messages (think of XML) is ASCII. This may balance the longer encoding of, for example, Chinese text strings.
Also, endianness is irrelevant, since all the codes are bytes, with no need to define an order.
For those reasons, UTF-8 became popular as a format to store texts across the Internet, since they remain the same independently of who reads or writes them.
UTF-16 encoding schemes
This encoding was introduced after the definition of UCS-2, which was in turn the "as-is" representation of the UCS up to 16 bits, essentially after the discovery that 16 bits were not enough to encode everything.
UTF-16, in essence, takes advantage of an unassigned "band" in the UCS (from 0xD800 to 0xDFFF) to represent what cannot fit into 16 bits.
Of course, there is a strong suspicion that such an unassigned band was left there after discovering that no space remained to encode what was in the process of being encoded.
In essence, characters are encoded as follows (see the sketch after this list):
- If a UCS code-point fits 16 bits, it is encoded as itself (note that codes in the range described above must be avoided completely).
- If a UCS code-point fits 21 bits, 0x10000 is subtracted (thus giving 20 bits), and the result is broken into two 10-bit sequences to be OR-ed with 0xD800 (most significant) and 0xDC00 (least significant).
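As a sketch of the arithmetic just described (illustrative only, assuming a 16-bit wchar_t as on Windows):

#include <string>

// Sketch of the surrogate arithmetic described above.
void encode_utf16(unsigned long ucs, std::wstring& out)
{
    if (ucs < 0x10000)                    // fits 16 bits: stored as itself
        out += wchar_t(ucs);
    else                                  // 21 bits: split into a surrogate pair
    {
        unsigned long v = ucs - 0x10000;            // at most 20 bits now
        out += wchar_t(0xD800 | (v >> 10));         // 10 most significant bits
        out += wchar_t(0xDC00 | (v & 0x3FF));       // 10 least significant bits
    }
}

For example, U+1D11E becomes the surrogate pair 0xD834 0xDD1E.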
It is important to note how UTF-8 can be wider than actual Unicode (it can go up to 31 bits), while UTF-16 is stuck at 21 bits as a maximum. It will be interesting to see what inventions will be devised if Unicode ever becomes wider than the currently specified 21 bits.
Windows and Unicode
The Windows operating systems evolved from the original IBM character set (or better, the OEM char-set, since different manufacturers may have differentiated it) towards the ANSI char-set and code-pages.
This refers to 8-bit characters and language-dependent encodings, and was used mainly in Win16.
With the arrival of Win32, Unicode was adopted, first as pure UCS-2, and then extended to support UTF-16 surrogates.
The API binaries that manipulate characters were doubled and renamed by adding an "A" for "ANSI" and a "W" for "Wide" (for example MessageBoxA and MessageBoxW), the former taking char-based parameters and the latter taking wchar_t-based parameters.
A number of preprocessor "magics" are then defined in <tchar.h> where, depending on the definition of the UNICODE and MBCS preprocessor symbols, the traditional API names are mapped onto the corresponding A or W versions, as illustrated below.
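For illustration, this is (in a much simplified form) what that mapping amounts to; the real headers are considerably more involved:

#include <windows.h>
#include <tchar.h>

int main()
{
    // With UNICODE/_UNICODE defined, MessageBox expands to MessageBoxW and
    // _T("...") produces a wide literal; without them, MessageBoxA and a narrow literal.
    ::MessageBox(0, _T("Hello"), _T("tchar demo"), MB_OK);
    return 0;
}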
Windows and localization
To take care of the differences various countries and cultures have in representing numbers, dates, currency, etc., Windows introduced the concept of a Locale: a set of information that can be retrieved through APIs and that is user-customizable, to help programs adapt to user habits.
Unfortunately this is sometimes misused, causing not only text but also structured data to be represented in localized form even on communication and storage media, with all the resulting problems of misinterpretation of dates, etc. (what date is 11/10 ... or should it be 10/11?).
All this information is stored in the system registry. The OS provides a set of default values for the various countries, but users can override them by providing their own specifications.
For example, Italy uses '.' as the thousands separator and ',' as the decimal separator.
It is however frequent for Italian users to replace the '.' with an apostrophe, so that a number like 10'000 is less prone to reading errors than 10.000, especially where it is not clear where the text comes from (and hence ... it could be just 10).
Inside Windows, UTF-8 to UTF-16 and UTF-16 to UTF-8 conversions are possible through the MultiByteToWideChar and WideCharToMultiByte functions respectively, by specifying CP_UTF8 as the codepage parameter.
It is, in any case, a string-to-string conversion, not the encoding/decoding of a stream.
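For example, a UTF-16 to UTF-8 string conversion can be sketched like this (the helper name is just illustrative; note that for CP_UTF8 the last two parameters of WideCharToMultiByte must be NULL):

#include <windows.h>
#include <string>

// Sketch: in-memory UTF-16 -> UTF-8 conversion through the Win32 API.
std::string to_utf8(const std::wstring& ws)
{
    if (ws.empty())
        return std::string();
    // First call computes the required size, second call performs the conversion.
    int len = ::WideCharToMultiByte(CP_UTF8, 0, ws.c_str(), (int)ws.size(), 0, 0, 0, 0);
    std::string s(len, '\0');
    ::WideCharToMultiByte(CP_UTF8, 0, ws.c_str(), (int)ws.size(), &s[0], len, 0, 0);
    return s;
}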
C and Unicode
The C language completely pre-dates both Unicode and Windows and, in fact, does not provide any direct support for Unicode, but a number of library functions have been adapted to take care of internationalization.
In this environment, a number of character-oriented functions like atof gained a corresponding _wtof and, with the same preprocessor magic of <tchar.h>, a _ttof is defined as one or the other depending on the definition of the UNICODE or MBCS symbol.
What char and wchar_t effectively represent depends on the code-page used, which, together with a "locale", defines the way numbers are represented and how characters are encoded.
Unfortunately, the way the C library is implemented defines the "locale" characteristics based on a set of static data, selectable with the setlocale function.
Such data has nothing to do with the data provided by the operating system's concept of a "Locale" and is not user-customizable (think of the case of the thousands separator replaced by an apostrophe).
There is, however, the possibility to convert a UTF-8 string into UTF-16 with the mbstowcs function, by specifying a locale having a UTF-8 codepage.
That's far easier said than done, since library documentation is not so generous with this kind of information.
For example, you can discover that an encoding can be specified in fopen, e.g.:
fopen("newfile.txt", "rw, ccs=<encoding>");
where <encoding> can be "UTF-8", although this is not documented as standard.
But as you move to C++, it is practically impossible to find similar functionality in fstreams.
C++ and Unicode
The C++ approach to I/O is based on the "stream" concept. How streams relate to files is not so obvious, since the STL documentation does not provide a plain description of that. You have to read a number of details about a variety of classes before having a clue about the architecture behind them.
So let's go into the details just enough to understand where the key point is.
Streams buffers and locales
These classes play collaborative roles in performing input and output.
- Streams are responsible for providing the interface for the insertion and extraction operators and the formatting manipulators.
- Buffers are responsible for providing the transit storage for the elements read from and written to the external source, and for managing that reading and writing.
- Locales provide the functionality to handle "representation" and "conversion". They do so through a number of "facets" that deal with the conversion of numbers to and from character strings (the num_get and num_put facets) and with the translation of characters between the program representation and the external representation (the codecvt facet); see the short sketch after this list.
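For instance, the facets attached to a stream can be inspected through its locale. A minimal sketch (on VC8; as discussed below, MinGW without wchar_t support lacks this facet entirely):

#include <iostream>
#include <locale>
#include <cwchar>

// Sketch: every stream carries a locale, and the codecvt facet is looked up in it.
void inspect(const std::ios_base& stream)
{
    std::locale loc = stream.getloc();   // by default this is the global locale
    const std::codecvt<wchar_t, char, std::mbstate_t>& cvt =
        std::use_facet< std::codecvt<wchar_t, char, std::mbstate_t> >(loc);
    std::cout << "max_length: " << cvt.max_length() << std::endl;
}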
All this forms a set of class families that manage different character representations (char or wchar_t) and different kinds of external streams (files or strings).
In their abstract definition, streams are rooted in a virtual ios_base (character-type independent), then come basic_istream<.> and basic_ostream<.>, and then those two join into basic_iostream<.>, thus giving this hierarchy:
All streams must hold a "buffer" derived from basic_streambuf<.> and a locale, initialized by default to the C++ global locale (in turn initialized to the classic "C" locale).
In particular, file streams are nothing more than basic streams initialized with a basic_filebuf<.>, which overrides the basic_streambuf<.> virtual functions to manage file I/O, plus some pass-through functions like open, close, etc.
Similarly, string streams are basic streams initialized with a basic_stringbuf<.>.
The template parameter defines the type of "elements" used by the stream internally to the program. In Windows environments it is normally char for an ANSI-oriented character representation and wchar_t for a Unicode (UTF-16) oriented representation.
But something strange happens with file streams: try this:
#include <fstream>
#include <windows.h>
#pragma comment(lib, "kernel32.lib")
#pragma comment(lib, "user32.lib")
#pragma comment(lib, "gdi32.lib")
int main()
{
std::wofstream fs("testout.txt");
const wchar_t* txt = L"some Unicode text òàè逧";
MessageBoxW(0,txt,L"verify",MB_OK);
fs << txt << std::flush;
MessageBoxW(0,fs.good()? L"Good": L"Bad",L"verify",MB_OK);
return 0;
}
The call to the Unicode MessageBoxW confirms the proper string (it should end with the § symbol, and have the Euro glyph as second-to-last).
Here's the dump of txt from the debugger:
0x0041770C 73 00 6f 00 6d 00 65 00 20 00 55 00 6e 00 69 00 s.o.m.e. .U.n.i.
0x0041771C 63 00 6f 00 64 00 65 00 20 00 74 00 65 00 78 00 c.o.d.e. .t.e.x.
0x0041772C 74 00 20 00 f2 00 e0 00 e8 00 e9 00 ac 20 a7 00 t. .ò.à.è.é.¬ §.
That's Unicode represented as UTF-16 in LE form (73-00 in WORD format is 0x0073, just plain ASCII 0x73 ('s'), while AC-20 is 0x20AC, the Euro symbol €, which cannot be represented as a single byte).
Now look at the output file content with a hex editor. You should get this (I used Notepad++ with the HexEdit plug-in):
"000000000 73 6F 6D 65 20 55 6E 69-63 6F 64 65 20 74 65 78 |some Unicode tex|"
"000000010 74 20 F2 E0 E8 E9 |t òàèé |"
That's ANSI, with the text truncated at the € symbol (and fs.good() is false).
This, at least, with VC8.
Doing the same test with another compiler (MinGW 3.4.5, I used CodeLite as IDE) with the same source (note that MinGW uses UTF-8 for sources while VC8 uses ANSI, so string literals don't survive and a different file is needed), things are even worse:
t1.cpp: In function `int main()':
t1.cpp:9: error: `wofstream' is not a member of `std'
t1.cpp:9: error: expected `;' before "fs"
t1.cpp:10:23: converting to execution character set:
t1.cpp:12: error: `fs' was not declared in this scope
In fact, all the wchar_t-related stuff is under conditional compilation driven by the _GLIBCXX_USE_WCHAR_T symbol.
By introducing this workaround (essentially defining the missing types in the regular std namespace), it compiles:
#ifdef __MINGW32_VERSION
#ifndef _GLIBCXX_USE_WCHAR_T
namespace std
{
typedef basic_ios<wchar_t> wios;
typedef basic_streambuf<wchar_t> wstreambuf;
typedef basic_istream<wchar_t> wistream;
typedef basic_ostream<wchar_t> wostream;
typedef basic_iostream<wchar_t> wiostream;
typedef basic_stringbuf<wchar_t> wstringbuf;
typedef basic_istringstream<wchar_t> wistringstream;
typedef basic_ostringstream<wchar_t> wostringstream;
typedef basic_stringstream<wchar_t> wstringstream;
typedef basic_filebuf<wchar_t> wfilebuf;
typedef basic_ifstream<wchar_t> wifstream;
typedef basic_ofstream<wchar_t> wofstream;
typedef basic_fstream<wchar_t> wfstream;
}
#endif
#endif
But running it still shows fs going bad and no output being produced (the file is created, but remains empty).
Debugging shows that the basic_streambuf::xsputn function catches an exception and sets the stream as bad. That exception is produced here:
template<typename _Facet>
inline const _Facet&
__check_facet(const _Facet* __f)
{
if (!__f)
__throw_bad_cast();
return *__f;
}
where the actual type for _Facet is std::codecvt<wchar_t,char,int>. char?! Where does that come from?
Going back to VC, we find this strange note in the documentation of basic_filebuf (the basic_streambuf derivation for file streams):
Objects of type basic_filebuf are
created with an internal buffer of type char * regardless of
the char_type
specified by the type parameter Elem. This means that a Unicode string
(containing wchar_t
characters) will be converted to an ANSI string (containing char characters)
before it is written to the internal buffer. To store Unicode strings
in the buffer, create a new buffer of type wchar_t and set it
using the basic_streambuf::pubsetbuf()
method. To see an example that demonstrates this behavior, see below.
In essence, it seems nobody wants to say clearly that, independently of what we use in a program, output going to a FILE (yes, the old C FILE: that's what basic_filebuf writes to and reads from, there is no magic behind that) is by default always converted - or at least the conversion is always attempted - into chars, using the facets of the current global locale.
In VC8 this happens through a std::codecvt<wchar_t,char,mbstate_t> facet that's part of the default global locale, operating on an internal char buffer (as stated in the note); in MinGW no such facet is declared, hence the locale cannot provide it (hence the exception).
Going to UTF-8
... And that's the key to going towards UTF-8: let's provide the basic_filebuf with the facet it wants.
Not another facet with its own type and id (locales can support any number of "facets"), since that's not what the buffer class is looking for.
We have to derive from the proper std::codecvt<wchar_t,char,mbstate_t> and, where no such type is defined, we have to define it.
MinGW declarations
If we assume we are working with MinGW with no wchar_t support enabled, we have to make the base facet exist.
Since all facets are defined as templates, that's easy:
#ifdef __MINGW32_VERSION
#ifndef _GLIBCXX_USE_WCHAR_T
namespace std
{
template<>
class codecvt<wchar_t,char,mbstate_t>:
public __codecvt_abstract_base<wchar_t,char,mbstate_t>
{
protected:
explicit codecvt(size_t refs=0)
:__codecvt_abstract_base<wchar_t,char,mbstate_t>(refs)
{}
public:
static locale::id id;
};
typedef basic_ios<wchar_t> wios;
typedef basic_streambuf<wchar_t> wstreambuf;
typedef basic_istream<wchar_t> wistream;
typedef basic_ostream<wchar_t> wostream;
typedef basic_iostream<wchar_t> wiostream;
typedef basic_stringbuf<wchar_t> wstringbuf;
typedef basic_istringstream<wchar_t> wistringstream;
typedef basic_ostringstream<wchar_t> wostringstream;
typedef basic_stringstream<wchar_t> wstringstream;
typedef basic_filebuf<wchar_t> wfilebuf;
typedef basic_ifstream<wchar_t> wifstream;
typedef basic_ofstream<wchar_t> wofstream;
typedef basic_fstream<wchar_t> wfstream;
}
#endif
#endif
We are defining a specialization of codecvt<InnerType,OuterType,StateType> for <wchar_t,char,mbstate_t>, which is exactly what the compiler is looking for.
And we supply a locale::id static object, as required by the STL implementation. This requires a cpp file to instantiate the static object (<rant>I hate globals ...</rant>).
At this point we can, for both MinGW and VC8, derive from codecvt<wchar_t,char,mbstate_t>, overriding the virtual functions to implement a wchar_t to char conversion where wchar_t is UTF-16 and char is UTF-8.
gel::stdx::utf8cvt<bool>
This is the codecvt derivation, implemented as a UTF-16 <-> UCS <-> UTF-8 translator, using the mbstate_t parameter as a carry between function invocations.
Invalid characters
First of all, we have to decide what to do in case of invalid characters: sequences that may be present in the input but that are not valid UTF or not even valid Unicode code-points.
According to the Unicode specifications, invalid characters or sequences must be treated as "errors", but what "treated" means is left open to many interpretations.
If our purpose is to validate the input, we will probably want something that makes us aware that something is going wrong; but if we are just reading a text, we are probably better served by something that doesn't stop reading just because of a miswritten character.
That's what the bool template parameter is for: if set to true, the implementation has a strict behavior and, on every reading or writing of erroneous or illegal sequences or characters, throws a gel::stdx::utf_error exception, derived from std::runtime_error.
The STL implementation of basic_streambuf should catch this and set the owning stream as "bad", thus blocking it. The logic will therefore be no different from the one normally used in regular stream processing.
If the bool parameter is set to false, the Unicode restrictions are relaxed and invalid sequences are processed consistently with the algorithm.
It is thus possible to support code-points up to 28 bits (we need 4 bits to manage the conversion steps), and to read overlong UTF-8 sequences as if they were "legal".
Trivial functions
Overriding codecvt is trivial for at least three functions:
- do_always_noconv always returns false, since a conversion always needs to be done.
- do_max_length returns 6, since this is the longest UTF-8 possibility. Proper Unicode will never produce more than 4 chars, but an arbitrary wchar_t sequence can go beyond that.
- do_encoding always returns -1, since the conversion is state dependent.
More complex is do_length: one would almost have to decode the sequence to figure out the length of the converted sequence. We found it simpler to return a conservative value (min(_Len2, (size_t)(_Last1-_First1))).
The consequence is probably a wider buffer allocation by the file buffer classes but, as can be verified experimentally, it seems that neither the MinGW nor the VC8 buffer implementations call this function.
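As a sketch, the trivial overrides look like this (a fragment from inside the facet class, with signatures as in the standard; the exact declarations may differ slightly between library implementations, and the supplied "stdutif.h" is the authoritative version):

// Fragment (inside the facet class): the trivial overrides described above.
virtual bool do_always_noconv() const throw()
{ return false; }                        // a conversion is always needed

virtual int do_max_length() const throw()
{ return 6; }                            // longest UTF-8 sequence handled here

virtual int do_encoding() const throw()
{ return -1; }                           // the encoding is state dependent

virtual int do_length(mbstate_t&, const char* _First1, const char* _Last1, size_t _Len2) const
{ return (int)((_Len2 < (size_t)(_Last1 - _First1)) ? _Len2 : (size_t)(_Last1 - _First1)); }  // conservative value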
do_in
UTF-8 (external) to UTF-16 (internal) conversion is done using _Next1 and _Next2 as iterators walking from _First1 and _First2 to _Last1 and _Last2 (how could it be otherwise?), incremented as something is read or written. If a buffer end is reached, the function returns partial if there is a residual state, or ok otherwise.
The _State value is used as follows (see the masks sketched after this list):
- The 28 least significant bits store the incoming character (Unicode requires 21, hence we can handle more than actual Unicode).
- Bits 28-29-30 contain the count of the remaining chars forming a character; this is the carry of the UTF-8 to UCS conversion.
- Bit 31 is used as a carry for the UCS to UTF-16 conversion (in case of a surrogate).
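Treated as a 32-bit integer, the layout can be pictured with these masks (the names are just illustrative, not the ones used in the supplied source):

// Illustrative masks for the _State layout described above.
const unsigned long UCS_MASK       = 0x0FFFFFFF;  // bits 0..27: code-point being assembled
const unsigned long PENDING_MASK   = 0x70000000;  // bits 28..30: UTF-8 continuation bytes still expected
const unsigned long SURROGATE_FLAG = 0x80000000;  // bit 31: low surrogate still to be emitted
// For a state value s:
//   s & UCS_MASK              -> the (partial) code-point
//   (s & PENDING_MASK) >> 28  -> how many continuation bytes are still missing
//   s & SURROGATE_FLAG        -> a UCS to UTF-16 split is half done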
The first part of the function (under if(!(_State & 0x80000000))) is executed to convert UTF-8 into UCS (the if protects against a UCS to UTF-16 conversion still in progress).
The second part (under if(!(_State & 0x70000000))) is executed when no UTF-8 residual exists (hence _State & 0x0FFFFFFF contains a complete UCS code-point) and saves the UCS either as a plain 16-bit value or as a 21-bit surrogate pair (10 + 10 bits + 0x10000), by saving the first part, setting _State bit 31, and, given that setting, saving the second part on the next loop.
Note that the concept of "next loop" depends on the policy adopted by the basic_streambuf inside the STL implementation.
It may be an iteration of the main for loop, or it may be a further invocation of the function. Since we use _State as a carry between loops, and the state is allocated and maintained outside the function (it is passed by the caller), we essentially don't have to care about the residual length of the buffers.
do_out
UTF-16 (internal) to UTF-8 (external) conversion is done similarly.
_State bit 31 is used as a carry flag indicating a partial read / write.
When a complete UCS code-point has been read (from UTF-16), the first UTF-8 byte is saved. Subsequent bytes are written with a call to unshift, which is also called by the basic_streambuf itself to complete the output when no more input is present.
Using the facet
We need a std::locale in which the codecvt<wchar_t,char,mbstate_t> facet is replaced with a gel::stdx::utf8cvt, to be imbue-d into the stream buffer.
This very cryptic assertion merely means this:
std::locale utf8_locale(std::locale(), new gel::stdx::utf8cvt<true>);
std::wfstream fs;
fs.imbue(utf8_locale);
fs.open(yourfile, mode);
utf8_locale is, in this case, built from the global locale, with a utf8cvt given to it.
The stream is just whatever wide stream is needed (a wfstream in this example, which means basic_fstream<wchar_t>).
Of course, instead of imbue, we can replace the
global locale by calling
std::locale::global(utf8_locale);
just after the utf8_locale
declaration.
We can also start from another specific locale when appropriate, by simply creating a locale like
std::locale utf8_it(std::locale("It"), new gel::stdx::utf8cvt<true>);
thus obtaining a UTF-8 Italian locale.
Given the frequent use of the UTF-8 neutral locale, I declare a gel::stdx::utf8_locale<true> as an always accessible global object.
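Putting the pieces together, a minimal end-to-end sketch could look like this (the header name follows the supplied code described in the next section; error handling omitted):

#include <fstream>
#include <string>
#include "gel/stdutif.h"    // header from the supplied "gel" directory

int main()
{
    std::locale utf8_locale(std::locale(), new gel::stdx::utf8cvt<true>);

    std::wifstream in;
    in.imbue(utf8_locale);           // imbue before open, so the facet is active from the start
    in.open("testerin.txt");

    std::wstring line;
    while (std::getline(in, line))
    {
        // 'line' now holds wchar_t (UTF-16) characters decoded from the UTF-8 file
    }
    return 0;
}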
The supplied code
The "gel" directory contains both the header ("stdutif.h") and the cpp source ("gel.cpp") needed to properly instantiate the global and static variables.
The parent directory contains a tester program ("tester.cpp") and all the material to arrange both a VS8 project and a CodeLite project (to also test a MinGW compilation), and the grandparent directory contains the VS8 solution as well as a CodeLite workspace.
The tester program is a console application accepting up to two filenames on the command line (the first as the input file and the second as the output).
The default value for the input is testerin.txt and for the output is testerout.txt.
This application performs a number of consistency tests on reading and writing with the UTF-8 facet, logging its activity on the standard output (std::cout) and using the ANSI and Unicode Win32 APIs MessageBox[A/W] to show how the read file is displayed.
Testing sequence
The testing program proceeds as follows:
- The file is read as an std::ifstream (thus, char based) into a std::string, and displayed by MessageBoxA. Since this is all ANSI, UTF-8 will be displayed correctly only for the ASCII-compatible characters.
- The file is read as a std::wifstream (thus wchar_t based) into a std::wstring using the normal locale and displayed by MessageBoxW. This is treated by VC8 as an ANSI to Unicode conversion, and by MinGW as an "impossible to read" stream.
- The file is read as a std::wifstream imbue-d with a gel::stdx::utf8_locale and displayed by MessageBoxW, and an output file is written as a std::wofstream, imbue-d with a gel::stdx::utf8_locale, with the string just read.
- Both files are read as raw byte sequences (as std::ifstream) and compared byte by byte to check that there are no differences.
A practical sample
As a sample, the file "testerin.txt" is provided in the tester project directory, with the following content:
Tester text:
grade paragraph eacute egrave euro
°§èé€
The file has been created with Notepad++ and saved as UTF-8 without BOM.
Running the tester program, step 1 completes with the following message box:
Note the bytes read and displayed as ANSI, representing the 5 Unicode characters as 2, 2, 2, 2, 3 bytes.
Step 2 completes in the following way:
VC8 version | MinGW version
Note that MinGW reports a blank display for the reason described in the article (the missing default codecvt<wchar_t,char,mbstate_t>), while VC8 attempts an ANSI to Unicode conversion. It keeps the ASCII part safe, but the non-ASCII translation is codepage (hence system) dependent.
Step 3 completes as follows:
That's the correct display of the UTF-8 file.
Step 4, at this point, checks that the correctly read file is also correctly written, by comparing the two byte sequences:
The entire execution is logged in the console like this:
(Note: if something differed between the input and output files, the "=" would be replaced by an "x".
Also, note how the UTF-8 reading appears shorter than the ANSI reading, because of the compacting of the 2 and 3 byte characters.)
Other MinGW and VC8 notable differences
Debugging the two versions, the do_in function reveals a different calling pattern in the two implementations of the STL.
VC8 calls the function repeatedly, passing it one byte per call (thus reflecting the while(fs.get(c)) in the tester loop).
MinGW, in turn, calls the function once, supplying the entire sequence of bytes (and letting the function's inner loop do the job).
When to use
Finally some recommendations, based on my opinion and experience:
- This is not a panacea to solve everything, nor is it professional implementation code. It just wants to fill a gap that I hope somebody else will soon fill with better structured code, maybe even inside the STL implementation itself.
- Yes, Boost has this, I know. But this is just 10KB long, and Boost doesn't document such a facet (it seems to have been included because of a private need of an author, who needed it to do something else: you have to dig through Boost code to discover it).
- This facet can make your code slower, of course, because of the cost of the conversion. If you don't need to understand the text contained in a stream, but just to move it around, don't convert it: just treat it as "raw bytes". If you have to parse ASCII-delimited markup (like HTML or XML) with no need to interact with the marked-up text, don't convert it: just treat UTF-8 as "false ASCII".
- If you have to get some UTF-8 text and pass it to Windows APIs for user interaction, then yes, you have to convert it into Unicode, since ANSI cannot correctly represent it. And this may be a way to do that. The other is reading it as raw bytes and converting the strings once in memory with MultiByteToWideChar (specifying UTF-8 as the codepage), as sketched after this list.
- The ideal situation for this kind of approach (and that's what it has been designed for) is the reading / writing of relatively small configuration files saved as UTF-8 for interoperability between systems, or the simple encoding/decoding of network messages across sockets.
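For completeness, the "raw bytes plus in-memory conversion" alternative mentioned above can be sketched like this (illustrative helper, error handling omitted):

#include <windows.h>
#include <fstream>
#include <sstream>
#include <string>

// Sketch: read the file as raw bytes, then convert the whole buffer in memory.
std::wstring load_utf8(const char* path)
{
    std::ifstream in(path, std::ios::binary);     // no conversion at stream level
    std::ostringstream ss;
    ss << in.rdbuf();
    std::string bytes = ss.str();
    if (bytes.empty())
        return std::wstring();

    int len = ::MultiByteToWideChar(CP_UTF8, 0, bytes.c_str(), (int)bytes.size(), 0, 0);
    std::wstring ws(len, L'\0');
    ::MultiByteToWideChar(CP_UTF8, 0, bytes.c_str(), (int)bytes.size(), &ws[0], len);
    return ws;
}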