How to input a custom character to a file using fstream in C++

Question

Can anyone tell me whether there is a way to write a custom character, say ซ, to a file? I tried it, but after writing it from my program and opening the file, I found it had been converted to the ASCII control character (VT).

What I have tried:

Here is the code I use to open the file and write my character:

C++

fstream write_on;
fstream Read_From;
const char *Write_File_Name = "C:\\users\\Username\\Desktop\\pic1.txt";
wchar_t const buf[] = L"ซ";

write_on.open(Write_File_Name, ios::binary | ios::out);
write_on.write(buf, 1);
write_on.close();

thanks

Posted 17-Jun-16 13:28pm

Member 11593571

Updated 18-Jun-16 6:54am

v2

Comments

Sergey Alexandrovich Kryukov 17-Jun-16 20:16pm

Nothing is really "converted". What encoding do you need? (If this is Unicode, please don't answer "Unicode". What UTF?) With BOM or not? If there is no BOM, some software, such as text editor may "think" it's ASCII/ANSI, even if actually it is not. You don't have to use the BOM; you just should not trust each and every piece of software.

By the way, why having a character in a PNG file?

—SA
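To illustrate the BOM point, here is a minimal sketch that writes UTF-16LE text preceded by the byte order mark (FF FE), which many Windows text editors use to detect the encoding. The helper name write_utf16le_with_bom is hypothetical, and the sketch uses char16_t rather than the wchar_t from the question so that the bytes do not depend on the platform.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: writes `text` as UTF-16LE preceded by the BOM (FF FE).
// Bytes are emitted one by one, so the result does not depend on
// sizeof(wchar_t) or on the byte order of the machine.
bool write_utf16le_with_bom(const std::string& path, const std::u16string& text)
{
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;

    out.put('\xFF');  // BOM, low byte
    out.put('\xFE');  // BOM, high byte
    for (char16_t unit : text) {
        out.put(static_cast<char>(unit & 0xFF));         // low byte first
        out.put(static_cast<char>((unit >> 8) & 0xFF));  // then high byte
    }
    return static_cast<bool>(out);
}
```

Calling write_utf16le_with_bom("pic1.txt", u"\u0E0B") should produce the four bytes FF FE 0B 0E. Whether a given editor then shows ซ still depends on that editor and on the font it uses.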

Member 11593571 18-Jun-16 11:51am

well if you think it has something to do with text editor, this time i wrote the character and read it through my code no text editor needed, but still no difference, and thanks for your reply

Sergey Alexandrovich Kryukov 18-Jun-16 12:43pm

All right, this is something more certain. I can assure you, if you write data in file and read it in symmetric manner, you will get identical data. Unfortunately, you did not show how you read it. It's important that you are not losing any data. You are losing it, because of the size 1. This is a bug. Please see Solution 1.

You need to understand that wchar_t is implementation-dependent. On Windows in particular, it is oriented to character representation using UTF-16LE. It means that a character with a code point within the BMP is represented as two bytes, and the other characters use a pair of 16-bit words, called a surrogate pair. So a character has a 2- or 4-byte representation; your case is 2 bytes.

—SA

2 solutions

Solution 2

In addition to Solution 1:

Please see my comment to the question.

You need to understand that wchar_t is implementation-dependent. On Windows in particular, it is oriented to character representation using UTF-16LE, one of the Unicode encodings. Formally speaking, it does not have to be any particular encoding; it could be just some arbitrary data of the size of this type. Everything depends on how this data is interpreted.

It means that a character with a code point within the BMP is represented as two bytes, and the other characters use a pair of 16-bit words, called a surrogate pair. So a character has a 2- or 4-byte representation; your case is 2 bytes, so you don't need an array of wchar_t, but in the general case you would need it. Then you would need to write all the elements of this array to your file and read them back accordingly.
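The 2-byte versus 4-byte distinction can be sketched as a small encoder that turns one Unicode code point into UTF-16 code units. The function name to_utf16 is hypothetical, and the sketch assumes the input is a valid code point (not a surrogate, at most 0x10FFFF).

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-16 code units.
// Assumes `cp` is a valid scalar value (not a surrogate, at most 0x10FFFF).
std::u16string to_utf16(char32_t cp)
{
    std::u16string units;
    if (cp < 0x10000) {
        // Inside the BMP: one 16-bit unit, i.e. two bytes.
        units.push_back(static_cast<char16_t>(cp));
    } else {
        // Outside the BMP: a surrogate pair, i.e. four bytes.
        const char32_t v = cp - 0x10000;
        units.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        units.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return units;
}
```

For ซ (U+0E0B) this yields the single unit 0x0E0B; for a code point such as U+1F600 it yields the pair 0xD83D, 0xDE00.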

See also:
Wide character — Wikipedia, the free encyclopedia,
BMP (Unicode) — Wikipedia, the free encyclopedia,
Roadmap to the BMP,
https://www.gnu.org/software/libunistring/manual/html_node/The-wchar_005ft-mess.html,
FAQ - UTF-8, UTF-16, UTF-32 & BOM.

—SA

Posted 18-Jun-16 6:54am

Sergey Alexandrovich Kryukov

Comments

Member 11593571 18-Jun-16 19:05pm

I read all the links you gave me above. They were useful but didn't help me fix my problem. As you said, my character is two bytes and wchar_t can store two bytes too, so there is no problem with the way I write it, right? And as I said above, this is how I read the data: Read_From.open(Read_File_Name, ios::binary | ios::in); Read_From.read(buf, sizeof(wchar_t)); but no luck there. I really don't know what I am doing wrong or what to do. One thing, though: if I use an ASCII character, it works perfectly.

Sergey Alexandrovich Kryukov 18-Jun-16 19:25pm

How so? Did you fix your bug with writing 1?
—SA

Member 11593571 18-Jun-16 20:39pm

Yeah, I changed it to sizeof(wchar_t) or strlen(L"ซ"). I think it fixes the bug but doesn't fix my problem, so do you have any idea how I can write ซ in my text file?

Sergey Alexandrovich Kryukov 18-Jun-16 20:53pm

Or? These are two different things. Excuse me, what is the exact line after the fix? What is your platform?
—SA

Member 11593571 19-Jun-16 8:56am

i am using windows

Sergey Alexandrovich Kryukov 19-Jun-16 10:04am

All right, then see Solution 1. See also comments to it.
—SA

Member 11593571 19-Jun-16 12:58pm

i have tried solution 1 and 2 over and over but no changes, and i commented to Sergey Alexandrovich Kryukov reply if you would read it.

Member 11593571 19-Jun-16 15:18pm

oh boy, even if i do it just by cout << "ซ" << endl; it prints α╕ï. so what would you say now?

Sergey Alexandrovich Kryukov 20-Jun-16 9:04am

Here is the whole thing: what does it mean, "it prints"?
—SA

Andreas Gieriet 20-Jun-16 9:30am

There are several components involved in the whole story:
1) your editor: what character encoding do you use? You cannot trust that what you see is what the compiler gets. I.e. try to enter the character as a hex code L'\x..\x..' etc.
2) The console where your printf goes to: the console gets the byte sequence from the printf output and has to make sense of the bytes. Make sure that your console knows how to properly display the respective byte sequences. E.g. my console only knows 7-bit ASCII - other characters are displayed as "garbage".
Cheers
Andi
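One plausible reading of the α╕ï output, assuming the source file was saved as UTF-8 and the console uses the OEM code page 437 (neither is confirmed in the thread): the narrow literal "ซ" holds the three UTF-8 bytes E0 B8 8B, and code page 437 displays those bytes as α, ╕ and ï. The sketch below, with the hypothetical helper to_utf8, shows where those bytes come from.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8 bytes.
// Assumes `cp` is a valid scalar value (not a surrogate, at most 0x10FFFF).
std::string to_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}
```

to_utf8(0x0E0B) returns the bytes E0 B8 8B. The data is intact in that case; only the console's interpretation of the bytes differs.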

Sergey Alexandrovich Kryukov 20-Jun-16 10:09am

It's all about separation of concerns, taking all unrelated factors out of equation. Everyone can make, for example, a round trip, write data, read it back and write to another file, to make sure nothing is lost...
—SA

Sergey Alexandrovich Kryukov 20-Jun-16 9:07am

I guess nearly everyone knows. I would guarantee if I write something, you read it.
You've never shown your full final version of the write with the bug fixed, and never how you read.
—SA

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Jochen Arndt · Accepted Answer · 2016-06-18T06:12:00

Solution 1

I don't think it is converted to ASCII. But you are writing only one byte to the file (the lower byte of your wide character). To write all the bytes of the character, use:

write_on.write(buf, sizeof(wchar_t));

[EDIT]
See comments. The universal solution should be:

write_on.write((char*)buf, wcslen(buf) * sizeof(wchar_t));

With your example string "ซ" a single Unicode character (the Thai character SO SO) is written to file. Assuming your platform uses UTF-16LE encoding for wide characters (like Windows), the Unicode code point is 0x0E0B and the binary file content will be 0x0B followed by 0x0E.

When reading such files later you must know what kind of encoding is used. Or more general: You must know how to interpret the file content with each file you want to read.

If you interpret the file as ASCII (or some 8-bit text), you will read the ASCII control characters 0x0B and 0x0E (VT and SO). But if you interpret it as UTF-16LE, you will get the code point 0x0E0B.
[/EDIT]
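To check what actually lands in the file without relying on a text editor, the bytes can be dumped in hex. The following is a minimal sketch; write_wide and hex_dump are hypothetical helper names, and the write call mirrors the universal solution above, using reinterpret_cast in place of the C-style cast.

```cpp
#include <cstdio>
#include <cwchar>
#include <fstream>
#include <string>

// Writes the wide string as raw bytes, mirroring the solution above.
bool write_wide(const std::string& path, const wchar_t* text)
{
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(text),
              static_cast<std::streamsize>(std::wcslen(text) * sizeof(wchar_t)));
    return static_cast<bool>(out);
}

// Returns the file content as space-separated hex bytes.
std::string hex_dump(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::string hex;
    char byte;
    while (in.get(byte)) {
        char buf[4];
        std::snprintf(buf, sizeof buf, "%02X ",
                      static_cast<unsigned>(static_cast<unsigned char>(byte)));
        hex += buf;
    }
    if (!hex.empty()) hex.pop_back();  // drop the trailing space
    return hex;
}
```

After write_wide(path, L"\u0E0B"), hex_dump(path) should give 0B 0E on Windows (16-bit wchar_t, little-endian). On platforms with a 32-bit wchar_t, such as typical Linux systems, four bytes are written instead.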

Posted 18-Jun-16 6:12am

Jochen Arndt

Updated 19-Jun-16 21:15pm

v2

Comments

Sergey Alexandrovich Kryukov 18-Jun-16 13:00pm

This is most certainly so, but, strictly speaking, not in 100% cases, so this is almost correct. Therefore, I up-voted the answer with almost 5 :-).

The only missing part is the mention that wchar_t is implementation dependent and proper explanation of that "wide" casualty. In general case, wchar_t is not always 16 bit, and the array not always would have 1 element. Your solution is correct under some assumptions: for example, when the implied encoding is UTF-16, and wchar_t is 16 bit, so 16 bits needed for representation of this particular character can be fit in one wchar_t word.

Please see solution 2 for further explanations.

—SA

Jochen Arndt 19-Jun-16 4:37am

Thank you Sergey. You are right, I should have explained that in more detail.

I used the sizeof() operator to take into account that the size is platform-dependent. So my answer is correct in all cases for a single character (as used in the question). For multiple characters it is of course

write_on.write(buf, wcslen(buf) * sizeof(wchar_t));

Sergey Alexandrovich Kryukov 19-Jun-16 10:00am

Right. You better add this factor wcslen(buf) to your solution.
—SA

Member 11593571 19-Jun-16 12:57pm

First of all, without typecasting buf to (char*) it doesn't compile, so I had to do it this way: write_on.write((char*)buf, wcslen(buf) * sizeof(wchar_t)); but there is still no change, and I am still getting VT and SO instead of my custom character.

Jochen Arndt 20-Jun-16 2:58am

You have a wide string consisting of one wide character. Assuming your platform uses UTF-16LE (like Windows) the Unicode code point of your character is 0x0E0B (the Thai character SO SO).

This is written to file (it will write the value byte oriented: 0x0B followed by 0x0E).

If you now want to read the file, you must know how to interpret the file content (as with any file you are reading). If you interpret the file content as ASCII characters you will see the control characters SO and VT. But if you interpret the content as UTF-16LE you will get the character 0x0E0B.

So when you want your character back, you must write code that performs the reverse operation:

- Get the file size
- Allocate a wchar_t buffer with (size / sizeof(wchar_t)) wide characters (usually add one for a terminating NULL)
- Read the file content into the buffer (binary mode)
- You will have your character(s) back in your allocated buffer
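The steps above can be sketched as follows. The helper name read_wide is hypothetical, and the sketch assumes the file was written on the same platform, with the same wchar_t size and byte order. It uses std::wstring instead of a manually allocated buffer, so the terminating null is handled by the string itself.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper following the steps above.
// Assumes the file was written on the same platform
// (same sizeof(wchar_t), same byte order).
std::wstring read_wide(const std::string& path)
{
    // Open at the end so that tellg() reports the file size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return std::wstring();

    const std::streamoff size = in.tellg();
    if (size <= 0) return std::wstring();

    // Number of whole wide characters stored in the file.
    const std::size_t count = static_cast<std::size_t>(size) / sizeof(wchar_t);
    std::wstring text(count, L'\0');

    // Read the raw bytes straight into the wide buffer.
    in.seekg(0, std::ios::beg);
    in.read(reinterpret_cast<char*>(&text[0]),
            static_cast<std::streamsize>(count * sizeof(wchar_t)));
    if (!in) return std::wstring();
    return text;
}
```

Comparing the returned string with the original wide string gives the round-trip check mentioned in the comments: if both are equal, nothing was lost on the way to the file and back.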

Member 11593571 18-Jun-16 14:32pm

OK, I just changed 1 to sizeof(wchar_t). Now it writes one more character than the buffer itself contains. Before, with write_on.write(buf, 1);, it was converting my custom character to (VT), but with sizeof(wchar_t) as you suggested it adds (SO) on top for no reason. This is how I read the data:
Read_From.open(Read_File_Name, ios::binary | ios::in); Read_From.read(buf, sizeof(buf)); I have also tried Read_From.read(buf, 1);

Jochen Arndt 19-Jun-16 4:28am

There is no conversion.

A wchar_t represents a single character but contains 2 or 4 bytes depending on the implementation (see Sergey's answer). Therefore I chose sizeof(wchar_t), because that is the implementation-dependent size of a wchar_t in bytes.

The write function writes a number of bytes from the specified buffer. The buffer parameter is of type char*. But that does not mean that it writes printable characters. It writes bytes.
See http://www.cplusplus.com/reference/ostream/ostream/write/:
"This function simply copies a block of data, without checking its contents: The array may contain null characters, which are also copied without stopping the copying process."