Hi everyone, I'm in need of big help. I have a file that I want to open, but the problem is that the file can be in a Unicode directory (the path can be Cyrillic or Latin). I did an extensive search, read and tried almost ten Stack Overflow solutions, but came up empty; at this point I am really desperate.

Here is my exact problem:

I get a path, for example:
čovećž/test_file.txt

The way I can open this is with _wfopen, but the problem is that this function takes a wchar_t string.
It works if I hard-code the path with Unicode escapes:
wchar_t path[100] = _T("\u010d\u006f\u0076\u0065\u0107\u017e/test_file.txt");


Once I knew I needed a wide (wchar_t) string, I tried converting it.

Things I tried are below:


I am asking anyone to help me out with this: either convert the string to a wide string or use some other function (not _wfopen)!
You can also use the Boost library; I already have it set up.

Targeted platform is: Windows only!

I would appreciate a coded example, because links to articles won't do much; I think I have read EVERYTHING there is on this topic. :(

So basically I need this:
https://www.branah.com/unicode-converter

Thank you in advance.

What I have tried:

Manual conversion:
I tried converting it manually, walking through the string and replacing characters with their Unicode code points. But in my C/C++ build most of these characters looked the same to the compiler (for example ć = č = š), so this did not work.

Then I turned to Stack Overflow:

I tried this function:

#include <windows.h>
#include <string>

// Note: CP_ACP interprets the input in the system ANSI code page,
// not as UTF-8.
std::wstring s2ws(const std::string& s)
{
    int slength = (int)s.length() + 1; // include the terminating NUL
    int len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0);
    wchar_t* buf = new wchar_t[len];
    MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, buf, len);
    std::wstring r(buf);
    delete[] buf;
    return r;
}
std::wstring stemp = s2ws(x);
LPCWSTR result = stemp.c_str();

It did not work. Then someone suggested this:
#include <codecvt>
#include <iostream>
#include <iomanip>
#include <string>

int main() {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;

    std::string s = "test";

    std::cout << std::hex << std::setfill('0');
    std::cout << "Input `char` data: ";
    for (char c : s) {
      std::cout << std::setw(2) << static_cast<unsigned>(static_cast<unsigned char>(c)) << ' ';
    }
    std::cout << '\n';

    std::wstring ws = convert.from_bytes(s);

    std::cout << "Output `wchar_t` data: ";
    for (wchar_t wc : ws) {
      std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
    }
    std::cout << '\n';
}


But this only works for ASCII; when I put in a Unicode character (ć) it converts to 003f, when it should be 0107.

So at this point I started thinking it wasn't possible in C++, so I installed Boost and tried some things with wpath:
boost::filesystem::wpath dirPath

And it did not work.
Updated 11-Feb-16 5:32am
Comments
Jochen Arndt 11-Feb-16 10:56am
The simplest solution would be to build your application as Unicode (which is the standard nowadays). Then you don't have to bother with conversions (at least for path names and Windows API functions).
Marko Ilic 11-Feb-16 11:00am
What do you mean, build it as Unicode? Not as a console app?
Jochen Arndt 11-Feb-16 11:11am
With Visual Studio select Unicode at 'Character Set' in the general project settings. Then the Unicode versions of the API functions are called. But you must still tell the compiler about the encoding of string literals in your code (using TCHAR, LPCTSTR, LPCSTR):

// An ASCII/CP string:
char *cpString = "ASCII/CP";

// A Unicode string:
wchar_t *wString = L"Unicode";

// A string using the project encoding:
TCHAR *tString = _T("Project defined encoding");
Marko Ilic 11-Feb-16 11:52am
It is already set to Unicode (Use Unicode Character Set); if it weren't, _T and the other functions would fail...
Jochen Arndt 11-Feb-16 12:01pm
_T() would not fail. It creates the string as Unicode or ANSI depending on the project setting.

However, then I don't see your problem. Just use _T() for string literals and call the corresponding _t functions:

TCHAR *path1 = _T("literal_path");
FILE *f1 = _tfopen(path1, _T("r"));
LPCTSTR path2 = getPathFromSomewhere();
FILE *f2 = _tfopen(path2, _T("r"));

See also http://www.codeproject.com/Articles/76252/What-are-TCHAR-WCHAR-LPSTR-LPWSTR-LPCTSTR-etc
Marko Ilic 11-Feb-16 14:12pm
As I said in the example above, this doesn't work:
TCHAR filename[MAX_PATH] = _T("ilić/test.txt");
FILE *pfile = _wfopen(filename, _T("r"));
if(!pfile) printf("Error \n");
else printf("Finally working \n");

But when I write the path like this, it works:
TCHAR filename[MAX_PATH] = _T("ili\u0107/test.txt"); // U+0107 = ć
Jochen Arndt 11-Feb-16 14:29pm
OK. Now I see your problem.

Your source files are probably not Unicode but ANSI. You can either keep specifying characters by code or change the encoding of your source files. See https://msdn.microsoft.com/en-us/library/xwy0e8f2.aspx about supported encodings for compilers and linkers.

To change the encoding of a source file, choose File - Advanced Save Options, and select a Unicode Encoding.
Marko Ilic 11-Feb-16 15:08pm
I can't change that; it varies from user to user... But I tried making a UTF-8 file with Notepad++ and it still did not work.

Solution 1

The problem was that I was saving the .cpp file as ANSI... I had to convert it to UTF-8. I tried this before posting, but VS 2015 kept turning it back into ANSI; I had to change it inside VS to get it working.

I tried opening the .cpp file with Notepad++ and changing the encoding there, but when I reopened it in VS it automatically reverted. Then I looked for an encoding option under Save As, but there is none. Finally I found it, in Visual Studio 2015:

File -> Advanced Save Options, then in the Encoding dropdown change it to Unicode.

Image of the window

One thing that is still strange to me: how did VS display the characters normally, while when I opened the file in N++ there were ?s (as was to be expected, because of ANSI)?

Solution 2

This is mainly a solution to the question raised in solution 1.

When a file is opened in an editor, the editor tries to identify the encoding. Unicode files may begin with a Byte order mark (see Byte order mark - Wikipedia, the free encyclopedia). If one is present, the editor knows the encoding.

If there is no BOM, the editor may try to identify Unicode files by further checks on the bytes they contain (0x00, or non-ASCII bytes 0x80 to 0xFF). But this depends on the editor. If you, for example, create a UTF-8 file without a BOM in Notepad++, other editors like the VS editor might not detect this and will assume the file is encoded with the current code page.

If the file has not been identified as Unicode, Windows treats it as ASCII/ANSI using the current code page.

So it is always a good idea to include a BOM when creating Unicode files. If you look at the encodings offered in the VS Advanced Save Options dialog, you will note that all Unicode encodings come with a BOM except UTF-16LE (called 'Unicode' there). This indicates that the VS editor has a detection method for that encoding (which is not difficult, because ASCII, ANSI, and UTF-8 text does not contain zero bytes, while UTF-16 files usually do).
   

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



