
Unicode Output to the Windows Console

By swuk, 25 Mar 2009
 

Who Should Read this Article?

If you create Windows console programs and want to be able to print wide strings properly, this is something for you.

More than the actual proficiency in C++, it is important that you understand what Unicode is and what wide strings are.

Introduction

It's hard to emphasize enough the importance of making Unicode aware applications.

Novices in C/C++ should be taught from the beginning not to use main, strlen, printf, etc. It should be pointed out to them that modern Windows systems work internally with 16-bit Unicode (UTF-16), aka wide strings. Therefore wmain, wcslen, wprintf, etc. (or even better: the TCHAR paradigm) should be used instead.

When new C++ projects are created in Visual Studio, they follow the TCHAR paradigm. It means that, instead of the above, _tmain, _tcsclen, _tprintf, etc. are used. They are macros that resolve to the narrow or the wide variant depending on the character set chosen in the project settings. This paradigm was created so that the same code could be built for old (Windows 95, Windows 98) and new versions of Windows (NT, XP and newer). Since programming for these old Windows versions does not make sense any more, we could simply use the wide versions of the functions. Yet, following the TCHAR paradigm still makes sense, because it can make the code more portable to operating systems that do not use wide strings, like Linux.
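
A minimal sketch of how the generic-text names resolve. It assumes the Microsoft tchar.h header on Windows and uses _tcslen, the generic-text counterpart of strlen/wcslen. The #else branch holds hand-written stand-ins (they are not part of any real header) so that the same source also builds where tchar.h is absent, and CountUnits is a hypothetical helper name.

```cpp
#include <cstddef>
#include <cstring>

#ifdef _WIN32
#include <tchar.h>   // real generic-text mappings; controlled by _UNICODE / _MBCS
#else
// Hypothetical stand-ins for platforms without <tchar.h>: everything is narrow.
typedef char TCHAR;
#define _T(x) x
#define _tcslen std::strlen
#endif

// Counts TCHAR units (not user-perceived characters) before the terminator.
// With _UNICODE defined this goes through wcslen, otherwise through strlen.
std::size_t CountUnits(const TCHAR* text) {
    return _tcslen(text);
}
```

For a plain ASCII literal such as _T("wmain") the count is 5 in either build; only the type behind TCHAR changes.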

All this works fine. The problem arises when you write a console application. The application can read wide command line arguments properly. I do not know if input of wide string via standard input works OK because I never needed to use it. But I needed to output them and it did not work. I tried CRT functions like wprintf and STL objects like wcout. Neither of them worked. I searched for a suitable solution and could not find it.

I set up the cmd window to use Lucida Console font (and you should do it too, otherwise any attempt to see Unicode characters in it is bound to fail!). I realized that it is possible to print wide strings directly to the console using functions from conio.h (_cputts, _tcprintf, etc.). Very nice!

Yet... When someone is using a console application, she/he expects to be able to redirect its output. It does not work if output goes directly to the console. It must go to stdout or stderr.

It seems Microsoft was not consistent in this. While the whole system works with wide strings, the console output does not, and in .NET, the default output code page is UTF-8! But it gave me the idea. I also noticed that text files encoded in UTF-8 can be properly printed to the console (using `type` for example), provided the console code page is set to UTF-8 using the command `chcp 65001`. Now I wanted to use UTF-8 from C++.

Using the Code

Setting and Resetting the Codepage

We must prepare the console for UTF-8. We first store the current console output codepage in a variable:

UINT oldcp = GetConsoleOutputCP();

Then we change the console output codepage to UTF-8, which is the equivalent of `chcp 65001`:

SetConsoleOutputCP(CP_UTF8);

Before exiting the program, we should be polite and restore the console to the state it was in before:

SetConsoleOutputCP(oldcp);
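
One way to make sure the old codepage is restored even on an early return or an exception is to wrap the two calls in a small RAII object. This is only a sketch: Utf8OutputGuard and changed() are names invented here, not taken from the download, and outside Windows the class deliberately does nothing.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Switches the console output codepage to UTF-8 and restores it on destruction.
class Utf8OutputGuard {
public:
    Utf8OutputGuard() : oldcp_(0), changed_(false) {
#ifdef _WIN32
        oldcp_ = GetConsoleOutputCP();   // returns 0 on failure
        changed_ = (oldcp_ != 0) && (SetConsoleOutputCP(CP_UTF8) != 0);
#endif
    }

    ~Utf8OutputGuard() {
#ifdef _WIN32
        if (changed_) SetConsoleOutputCP(oldcp_);   // put the old codepage back
#endif
    }

    // True only if the codepage was actually switched.
    bool changed() const { return changed_; }

    Utf8OutputGuard(const Utf8OutputGuard&) = delete;
    Utf8OutputGuard& operator=(const Utf8OutputGuard&) = delete;

private:
    unsigned int oldcp_;   // same underlying type as UINT on Windows
    bool changed_;
};
```

Declaring `Utf8OutputGuard guard;` at the top of main (or wmain) would keep the UTF-8 codepage active for the rest of that scope.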

Printing Wide Strings

Suppose we have a wide string containing Unicode characters, say:

wchar_t s[] = L"èéøÞǽлљΣæča";

If you write that in Visual Studio, when you attempt to save the file you will be prompted to save it in some Unicode format. "Unicode - Codepage 1200" will be OK.

We convert it to UTF-8:

First we call WideCharToMultiByte with the 6th argument set to zero. That way, the function will tell us how many bytes it is going to need to store the converted string.

int bufferSize = WideCharToMultiByte(CP_UTF8, 0, s, -1, NULL, 0, NULL, NULL);

We allocate a buffer: 

char* m = new char[bufferSize]; 

The second call to WideCharToMultiByte does the actual conversion:

WideCharToMultiByte(CP_UTF8, 0, s, -1, m, bufferSize, NULL, NULL);

Print it to stdout. Notice the capital S. It tells the wprintf function to expect a narrow string:

wprintf(L"%S", m); 

Release the buffer: 

delete[] m; 

Now the output goes to stdout. If redirected to a file, the file will be encoded as UTF-8.
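
The steps above can also be collected into two small helpers. This is a sketch under stated assumptions, not the code from the download: WideToUtf8 and WriteUtf8 are hypothetical names, std::string replaces the new[]/delete[] buffer, and the bytes are written with fwrite rather than wprintf. The #else branch is a hand-written encoder that assumes a 32-bit wchar_t and is there only so the sketch builds outside Windows.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Converts a wide string to UTF-8; returns an empty string on failure.
std::string WideToUtf8(const std::wstring& wide) {
    if (wide.empty()) return std::string();
#ifdef _WIN32
    const int wideLen = static_cast<int>(wide.size());
    // First call: ask how many bytes the UTF-8 form needs.
    const int bytes = WideCharToMultiByte(CP_UTF8, 0, wide.data(), wideLen,
                                          NULL, 0, NULL, NULL);
    if (bytes <= 0) return std::string();
    std::string utf8(static_cast<std::size_t>(bytes), '\0');
    // Second call: do the actual conversion into the buffer.
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), wideLen,
                        &utf8[0], bytes, NULL, NULL);
    return utf8;
#else
    // Assumption: each wchar_t holds one full code point (UTF-32).
    std::string utf8;
    for (std::size_t i = 0; i < wide.size(); ++i) {
        const unsigned long cp = static_cast<unsigned long>(wide[i]);
        if (cp < 0x80) {
            utf8 += static_cast<char>(cp);
        } else if (cp < 0x800) {
            utf8 += static_cast<char>(0xC0 | (cp >> 6));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            utf8 += static_cast<char>(0xE0 | (cp >> 12));
            utf8 += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            utf8 += static_cast<char>(0xF0 | (cp >> 18));
            utf8 += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            utf8 += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            utf8 += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return utf8;
#endif
}

// Writes the UTF-8 bytes of `wide` to `out`; true if every byte was written.
bool WriteUtf8(std::FILE* out, const std::wstring& wide) {
    const std::string utf8 = WideToUtf8(wide);
    return std::fwrite(utf8.data(), 1, utf8.size(), out) == utf8.size();
}
```

A call such as `WriteUtf8(stdout, s);` would take the place of the allocate, convert, print and release steps, once the console output codepage has been set to UTF-8.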

This Is It 

It is not a big deal and cannot be compared to the articles that require much more work. Yet I hope it can be useful because it tries to solve a problem that is widely neglected. Last time I checked, I could not find the solution for this problem in Java either.

In my example code, I packed everything I spoke about here in small ostream and wostream overrides. They are not perfect and I'm pretty sure they could be coded better. I would do it if I knew more about iostream programming. Yet they can be useful for those who want the solution out of the box and easy to use. But it should be pointed out that they are not thread safe. There are more comments in the code.

History 

This article is completely rewritten, mainly because the comments of Member 2901525 made me understand that the code is not perfect enough to be offered without some more explanation. The article itself was very short, looked sketchy and earned some low marks. I forgot to mention that Lucida Console font must be used in the cmd window. Member 2901525 noticed a weak point in the code and I changed this. Otherwise there are no significant changes in the code.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

swuk
Croatia
Member
No Biography provided

Comments and Discussions

 
Question: redirecting? | member osirisgothra | 30 Nov '12 - 6:51
OK, this works great for outputting and redirecting in MBCS, but suppose I wanted to change the encoding depending on whether or not the output is being redirected to a file. Do you know how to detect this easily (or not so easily)? The output is fine as it is, but I would like it to be UTF-16 with a BOM when it goes to a file, which would require some intervention. The only problem is detecting the redirection; if I knew that, I would know how to do the rest.
So how does one know when the stream is being redirected away from the console (I think most refer to it as "CON")?
-O-
 
Paradisim Studio of the Arts
http://paradisim.net
 
InverseGoogle: Simple Black Combined Torrent/Google Pic/Google Web Search on one page: http://inversegoogle.paradisim.net

Question: Set The Font First | member Qwertie | 22 Oct '12 - 12:18
Something strange and surprising about Unicode output is that you must set the font to "Lucida Console" BEFORE the program actually outputs anything. If the font is not "Lucida Console" when your code writes to the console, the UTF-8 code page will be ignored and the output will be written byte-by-byte (so a single UTF-8 character can come out as 2, 3 or 4 garbage characters.)
 
If you set the font too late, your output is still garbage (at least on my machine, Windows 7 x64.)

Question: Console Input? | member peterchen | 16 Jan '12 - 3:49
Hi, old article, but I thought I'd try:
 
How can I read unicode input from the console?
 
Even after setting the code page to UTF8, using
 
   SetConsoleCP(65001);
   SetConsoleOutputCP(65001);
 
as soon as I enter non-ASCII characters, the input string is empty (I tried std::cin.getline and _getts_s)

General: Major problem with the code | member smartnut007 | 21 Apr '09 - 11:53
I have only read the article and not perused the source code.
Thank you for attempting to solve the issue.
But here are the problems:
 
1) In Windows, wchar_t is a maximum of 16 bits. It cannot handle UTF-8.
Example: L"www.原來我不帥.tw"
Here the UTF-8 encoded non-Latin characters are three bytes each.
2) The article claims Unicode output to the console. I would prefer a more specific title: UTF-8 output.
 
hi there .. am a student at USC

General: Re: Major problem with the code | member swuk | 22 Apr '09 - 5:38
Thank you for your feedback :)
 
Try to redirect output to a file:
unicode_console.exe > tw.txt
 
Then open tw.txt in Notepad++ and choose "Format / Encode in UTF8".
You will see your string www.原來我不帥.tw correctly.
 
This proves that the characters are correctly printed to stdout but the console can not show them correctly. This usually happens if the glyphs for the particular language are not properly installed. Please, try to play with "Control Panel / Regional and Language Options / Advanced / Code page conversion tables", and let me know if that helped.
 
I did not go into those details in the article - I just mentioned that the user should set the console to use Lucida Console - because the subject of the article is not setting up the console but the C++ code to correctly print to it. There are surely many articles on the Web that explain that in detail.

General: Another approach | member Anna-Jayne Metcalfe | 26 Mar '09 - 1:06
Here's another approach, using the built-in character conversion capabilities of ATL::CString (available in VS2002 onwards):
 
LPCWSTR pszMsg = _T("Whatever");
 
std::cout << CStringA(pszMsg).GetString() << std::endl;
 
One of the CStringA constructors automatically converts wide strings to ANSI, whereupon a GetString() call will return a pointer to an ANSI string - which std::cout can of course use without modification.
 
Note however that this will only produce meaningful output for text which has a valid ANSI representation - so for multibyte character languages such as Japanese the approach in the article is probably your best bet.
 
Anna Rose | [Rose]
 
Having a bad bug day?
 
Tech Blog | Anna's Place | Tears and Laughter
 
"If mushy peas are the food of the devil, the stotty cake is the frisbee of God"

General: Re: Another approach [modified] | member swuk | 26 Mar '09 - 9:36
> LPCWSTR pszMsg = _T("Whatever");
 
This does not make much sense. It should be one of the following:
LPCSTR pszMsg = "Whatever";
LPCWSTR pszMsg = L"Whatever";
LPCTSTR pszMsg = _T("Whatever");
> for text which has a valid ANSI representation
 
If I try to print some text which is Croatian and therefore has a valid ANSI representation in codepage 1250:
std::cout << CStringA(L"Žeđam kad pojedem ćevapčiće.").GetString() << std::endl;
to see the correct output I would first have to change my console output codepage to 1250. But on your computer even that would probably not help because, I suppose, CStringA is performing the translation to ANSI according to your ANSI system codepage, which is probably 1252.
For the same reason if I would try to print this Italian text on my computer (which has 1250 ANSI system codepage):
std::cout << CStringA(L"Lei è più fragile di me, non sopporterà la ... ").GetString() << std::endl;
not even changing the codepage to 1252 would help.
 
Alternatively to changing the console output codepage, converting from ANSI to OEM could help. But, again, what works for me wouldn't work for you and vice versa.
 
Also, for the text not to have a valid ANSI representation it is not necessary to be Japanese or alike. It's enough to have mixed language content, for example Croatian and Italian in the same text.
 
In short:
ANSI is a very poor extension to ASCII and leads to problems with every language except English. It should never be used. Neither in console nor in GUI applications.
 
---------
 
I am happy that you have considered looking at the code, now that the article is rewritten.
;)
 
modified on Thursday, March 26, 2009 4:53 PM

General: Re: Another approach | member Anna-Jayne Metcalfe | 26 Mar '09 - 11:39
As I said, it depends on the text - and for Unicode apps written in English (my native tongue) it will work fine - otherwise you will need to take code pages into consideration or use a more elaborate approach.
 
I agree ANSI is crap - which is why I always compile to Unicode, so you can substitute _T("") for L"" in the above example.
 

General: Re: Another approach | member swuk | 27 Mar '09 - 10:35
> for Unicode apps written in English
Your apps are not "written in English". They are written in C++. When you say "written in English" you probably mean "written *for* English" i.e. "apps that are written to (be able to) handle only English text, only English path names etc.", right? Don't you see that there is no point to use Unicode in such an application? You are only wasting memory.
 
So you can translate "Unicode apps written in English" as "Unicode apps that use Unicode just to waste memory".
 
Of course you do not need "to take code pages into consideration or use a more elaborate approach" for such applications. You do not need anything for them because they should not be written at all.
 
Sorry for being harsh, but as I said: "It's hard to emphasize enough the importance of making Unicode aware applications." And it seems I was right because some people (and it seems that you are one of them) simply miss the point. "Unicode aware applications" are apps that use Unicode with a purpose - the purpose Unicode was designed for: internationalization. "Unicode apps written in English" do not fall into that category.
 
;)

General: Re: Another approach | member Anna-Jayne Metcalfe | 27 Mar '09 - 11:13
swuk wrote:
> Your apps are not "written in English". They are written in C++. When you say "written in English" you probably mean "written *for* English" i.e. "apps that are written to (be able to) handle only English text, only English path names etc.", right? Don't you see that there is no point to use Unicode in such an application? You are only wasting memory.
>
> So you can translate "Unicode apps written in English" as "Unicode apps that use Unicode just to waste memory".
>
> Of course you do not need "to take code pages into consideration or use a more elaborate approach" for such applications. You do not need anything for them because they should not be written at all.

 
First of all, don't be so pedantic. I know they are written in C++, and so do you.
 
Secondly, Unicode is the way Windows (since NT), handles strings internally (for example any ANSI strings you pass to Windows will be converted to Unicode by the OS, and back to ANSI in the other direction) so it makes sense from a performance perspective to use it over ANSI regardless - even if you don't plan to localise yet. The memory overhead is, for most applications, not at all significant compared to other data structures.
 
That's the approach I take, so if (or when) the time comes to localise any of our products less work will be required since everything is already written to be Unicode compatible. FWIW the one product we're likely to localise (to German) in due course does not use the console anyway.
 
swuk wrote:
> Sorry for being harsh, but as I said: "It's hard to emphasize enough the importance of making Unicode aware applications." And it seems I was right because some people (and it seems that you are one of them) simply miss the point. "Unicode aware applications" are apps that use Unicode with purpose - with purpose Unicode is designed for: internationalization. "Unicode apps written in English" do not fall into that category.

 
Disagree totally. Sorry, but I believe in taking the long view as I've explained.
 

Last Updated 25 Mar 2009
Article Copyright 2009 by swuk