Opening a file with a Unicode path.

Question

Hi everyone, I am in need of some big help. I have a file that I want to open, but the problem is that its path can contain non-ASCII characters (the directory name can be in Cyrillic or Latin script). I did an extensive search, and read and tried almost 10 Stack Overflow solutions, but came up empty. At this point I am really desperate.

Here is my exact problem:

I am given a path, for example:

čovećž/test_file.txt

I can open this with _wfopen, but the problem is that this function takes a wchar_t string.
It works if I write the path with Unicode escapes:

C++

wchar_t path[100] = _T("\u010d\u006f\u0076\u0065\u0107\u017e/test_file.txt");

Once I knew I needed a Unicode wchar_t string, I tried converting my path to one.

The things I tried are listed below.

I am asking anyone to help me out with this: either convert the string to Unicode or use some other function (not _wfopen)!
You can also use the Boost library; I already have it set up.

Targeted platform is: Windows only!

I would like it if somebody could code an example. Links to articles won't help much, because I think I have read everything there is on this topic. :(

So basically I need what this converter does:
https://www.branah.com/unicode-converter

Thank you in advance.

What I have tried:

Manual conversion:
I tried converting it manually, going through the string and replacing characters with their Unicode codes. But in my C/C++ code most of these characters came out the same (for example ć = č = š), so this did not work.

Then I turned to Stack Overflow:

I tried this function:

C++

std::wstring s2ws(const std::string& s)
{
    int len;
    int slength = (int)s.length() + 1;
    len = MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, 0, 0);
    wchar_t* buf = new wchar_t[len];
    MultiByteToWideChar(CP_ACP, 0, s.c_str(), slength, buf, len);
    std::wstring r(buf);
    delete[] buf;
    return r;
}

// Usage, where x is the narrow std::string holding the path:
std::wstring stemp = s2ws(x);
LPCWSTR result = stemp.c_str();

That did not work. Then someone suggested this:

C++

#include <codecvt>
#include <iostream>
#include <iomanip>
#include <string>

int main() {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;

    std::string s = "test";

    std::cout << std::hex << std::setfill('0');
    std::cout << "Input `char` data: ";
    for (char c : s) {
      std::cout << std::setw(2) << static_cast<unsigned>(static_cast<unsigned char>(c)) << ' ';
    }
    std::cout << '\n';

    std::wstring ws = convert.from_bytes(s);

    std::cout << "Output `wchar_t` data: ";
    for (wchar_t wc : ws) {
      std::cout << std::setw(4) << static_cast<unsigned>(wc) << ' ';
    }
    std::cout << '\n';
}

But this only works for ASCII. When I put in a non-ASCII character (ć), it is converted to 003f, but it should be 0107.
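
For reference, the expected value can be checked without any Windows API. The following is a minimal sketch of a hand-written UTF-8 to UTF-16 decoder. The name utf8_to_utf16 is hypothetical and does not come from any of the answers, and the code assumes the input bytes really are UTF-8 (it does no overlong or range validation). It illustrates that ć only becomes 0107 when it arrives as the two UTF-8 bytes C4 87; a lone byte taken from an ANSI code page is not valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-16 code units.
// Malformed or truncated sequences become U+FFFD; overlong forms and
// out-of-range values are NOT rejected (sketch only).
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray byte

        if (in.size() - i <= extra) {  // sequence cut off at end of input
            out.push_back(0xFFFD);
            break;
        }
        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }
        i += extra + 1;

        if (cp >= 0x10000) {  // encode as a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

For example, utf8_to_utf16("\xC4\x87") yields the single code unit 0x0107. On Windows, MultiByteToWideChar with CP_UTF8 performs this kind of conversion, and its result can be passed to _wfopen.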

So at this point I started thinking it wasn't possible in C++, so I installed Boost and tried some things with wpath:

C++

boost::filesystem::wpath dirPath

And it did not work.

Posted 11-Feb-16 4:27am

Marko Ilic

Updated 11-Feb-16 4:32am

v3

Comments

Jochen Arndt 11-Feb-16 10:56am

The simplest solution would be to build your application as Unicode (which is the standard nowadays). Then you don't have to bother with conversions (at least for path names and Windows API functions).

Marko Ilic 11-Feb-16 11:00am

What do you mean by building it as Unicode? That it should not be a console application?

Jochen Arndt 11-Feb-16 11:11am

With Visual Studio select Unicode at 'Character Set' in the general project settings. Then the Unicode versions of the API functions are called. But you must still tell the compiler about the encoding of string literals in your code (using TCHAR, LPCTSTR, LPCSTR):

// An ASCII/code page string:
const char *cpString = "ASCII/CP";

// A Unicode string:
const wchar_t *wString = L"Unicode";

// A string using the project encoding:
const TCHAR *tString = _T("Project defined encoding");

Marko Ilic 11-Feb-16 11:52am

It is already set to Unicode (Use Unicode Character Set); if it weren't, _T and other functions would fail...

Jochen Arndt 11-Feb-16 12:01pm

_T() would not fail. It creates the string as Unicode or ANSI depending on the project setting.

However, then I don't see your problem. Just use _T() for string literals and call the corresponding _t functions:

const TCHAR *path1 = _T("literal_path");
FILE *f1 = _tfopen(path1, _T("r"));
// getPathFromSomewhere() stands for any function that returns a TCHAR path.
LPCTSTR path2 = getPathFromSomewhere();
FILE *f2 = _tfopen(path2, _T("r"));

See also http://www.codeproject.com/Articles/76252/What-are-TCHAR-WCHAR-LPSTR-LPWSTR-LPCTSTR-etc

Marko Ilic 11-Feb-16 14:12pm

As I said, the example above doesn't work:
TCHAR filename[MAX_PATH] = _T("ilić/test.txt");
FILE *pfile = _wfopen(filename, _T("r"));
if(!pfile) printf("Error \n");
else printf("Finally working \n");

But when I write the path like this it works:
TCHAR filename[MAX_PATH] = _T("ili\u0107/test.txt"); // U+0107 = ć

Jochen Arndt 11-Feb-16 14:29pm

OK. Now I see your problem.

Your source files are probably not Unicode but ANSI. You can either keep specifying the characters by code (the \u escapes) or change the encoding of your source files. See https://msdn.microsoft.com/en-us/library/xwy0e8f2.aspx about supported encodings for compilers and linkers.

To change the encoding of a source file, choose File - Advanced Save Options, and select a Unicode Encoding.

Marko Ilic 11-Feb-16 15:08pm

I can't change that; it differs from user to user... But I tried making a UTF-8 file with Notepad++ and it still did not work.

2 solutions

Solution 2

This is mainly a solution to the question raised in solution 1.

When a file is opened in an editor, the editor tries to identify its encoding. Unicode files may start with a byte order mark (BOM). If one is present, the editor knows the encoding.

If there is no BOM, the editor may try to identify Unicode files by further checks if the file contains non-ASCII bytes (codes 0x00 and 0x80 to 0xFF). But this depends on the editor. If you, for example, create a UTF-8 file without a BOM in Notepad++, this might not be detected by other editors such as the VS editor, which then assumes the file to be encoded with the current code page.

If the file has not been identified as Unicode, it is treated as ASCII/ANSI using the current code page on Windows.

So it is always a good idea to use a BOM when creating Unicode files. If you have a look at the allowed encodings in the VS Advanced Save Options dialog, you will note that all Unicode encodings are with BOM except UTF-16LE (called Unicode there). This indicates that the VS editor has a detection method for this encoding (which is not difficult, because ASCII, ANSI, and UTF-8 files do not contain zero bytes while UTF-16 files usually do).
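
The BOM check described above can be sketched as follows. This is an illustration only, not how the VS editor or any other editor actually implements detection. The name detect_bom is hypothetical, and only the UTF-8 and UTF-16 marks are covered (UTF-32 is left out).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: name the encoding announced by a leading
// byte order mark, or return "none" if the data starts with no BOM.
std::string detect_bom(const std::string& bytes) {
    // True if `bytes` begins with the n bytes pointed to by sig.
    auto starts_with = [&bytes](const char* sig, std::size_t n) {
        return bytes.size() >= n && bytes.compare(0, n, sig, n) == 0;
    };
    if (starts_with("\xEF\xBB\xBF", 3)) return "UTF-8";     // EF BB BF
    if (starts_with("\xFF\xFE", 2))     return "UTF-16LE";  // FF FE
    if (starts_with("\xFE\xFF", 2))     return "UTF-16BE";  // FE FF
    return "none";  // caller must fall back to heuristics or a code page
}
```

Passing the first few bytes of a source file to detect_bom tells whether the file carries a mark. A result of "none" is the case where an editor has to guess, which is where a BOM-less UTF-8 file can be mistaken for ANSI.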

Posted 11-Feb-16 23:28pm

Jochen Arndt

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Marko Ilic · Accepted Answer · 2016-02-11T10:58:00

The problem was that I was saving the CPP file as ANSI... I had to convert it to UTF-8. I tried this before posting, but VS 2015 turned it back into ANSI; I had to change the encoding inside VS to get it working.

I tried opening the .cpp file with Notepad++ and changing the encoding, but when I opened it in VS it automatically reverted. So I looked at the Save As option, but there is no encoding option there. Finally I found it in Visual Studio 2015:

File -> Advanced Save Options; in the Encoding dropdown, change it to Unicode.

One thing is still strange to me: how did VS display the characters normally, while Notepad++ showed ? when I opened the file (as it was supposed to, because of ANSI)?