Can anyone tell me whether there is a way to write a custom character, say ซ, to a file? I tried it, but after writing it through my program and opening the file, I found it had been converted to ASCII (VT).

What I have tried:

Here is the code I use to open the file and write my character:
C++
fstream write_on;
fstream Read_From;
const char *Write_File_Name = "C:\\users\\Username\\Desktop\\pic1.txt";
wchar_t const buf[] = L"ซ";

write_on.open(Write_File_Name, ios::binary | ios::out);
write_on.write(buf, 1);
write_on.close();


thanks
Posted
Updated 18-Jun-16 6:54am
v2
Comments
Sergey Alexandrovich Kryukov 17-Jun-16 20:16pm    
Nothing is really "converted". What encoding do you need? (If this is Unicode, please don't answer "Unicode". Which UTF?) With a BOM or without? If there is no BOM, some software, such as a text editor, may "think" the file is ASCII/ANSI even if it actually is not. You don't have to use the BOM; you just should not trust each and every piece of software.

By the way, why having a character in a PNG file?

—SA
Member 11593571 18-Jun-16 11:51am    
Well, if you think it has something to do with the text editor: this time I wrote the character and read it back through my own code, no text editor needed, but there was still no difference. Thanks for your reply.
Sergey Alexandrovich Kryukov 18-Jun-16 12:43pm    
All right, this is something more certain. I can assure you, if you write data in file and read it in symmetric manner, you will get identical data. Unfortunately, you did not show how you read it. It's important that you are not losing any data. You are losing it, because of the size 1. This is a bug. Please see Solution 1.

You need to understand that wchar_t is implementation-dependent. On Windows in particular, it is geared to character representation in UTF-16LE. This means that a character whose code point lies within the BMP is represented by two bytes, while all other characters use a pair of 16-bit words called a surrogate pair. So a character has a 2- or 4-byte representation; in your case it is 2 bytes.

—SA

In addition to Solution 1:

Please see my comment to the question.

You need to understand that wchar_t is implementation-dependent. On Windows in particular, it is geared to character representation in UTF-16LE, one of the Unicode encodings. Formally speaking, it does not have to be any particular encoding; it could be just some arbitrary data of the size of this type. Everything depends on how this data is interpreted.

This means that a character whose code point lies within the BMP is represented by two bytes, while all other characters use a pair of 16-bit words called a surrogate pair. So a character has a 2- or 4-byte representation. In your case it is 2 bytes, so you don't need an array of wchar_t, but in the general case you would need one. Then you would have to write all the elements of this array to your file and read them back accordingly.
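To make the 2-byte versus 4-byte point concrete, here is a minimal sketch. It uses the C++11 type char16_t instead of wchar_t so that the result does not depend on the platform's wchar_t size; the function names are made up for this illustration and are not part of the question's code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of 16-bit code units in a UTF-16 string.
// A BMP character takes one unit, any other character takes two
// (a surrogate pair).
std::size_t utf16_code_units(const std::u16string& s) {
    return s.size();
}

// Hypothetical helper: number of bytes the string occupies as UTF-16,
// assuming sizeof(char16_t) is 2, as on common platforms.
std::size_t utf16_byte_count(const std::u16string& s) {
    return s.size() * sizeof(char16_t);
}
```

Called with u"\u0E0B" (the character from the question), these report one code unit and two bytes; called with a character outside the BMP, such as u"\U0001F600", they report two code units and four bytes.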

See also:
Wide character — Wikipedia, the free encyclopedia,
BMP (Unicode) — Wikipedia, the free encyclopedia,
Roadmap to the BMP,
https://www.gnu.org/software/libunistring/manual/html_node/The-wchar_005ft-mess.html,
FAQ - UTF-8, UTF-16, UTF-32 & BOM[^].

—SA
 
Comments
Member 11593571 18-Jun-16 19:05pm    
I read all the links you gave me above. They were useful but didn't help me fix my problem. As you said, my character is two bytes and wchar_t can store two bytes too, so there is no problem with the way I write it, right? And as I said above, this is how I read the data: Read_From.open(Read_File_Name, ios::binary | ios::in); Read_From.read(buf, sizeof(wchar_t)); but no luck there. I really don't know what I am doing wrong or what to do. One more thing: if I use an ASCII character, it works just fine.
Sergey Alexandrovich Kryukov 18-Jun-16 19:25pm    
How so? Did you fix your bug with writing 1?
—SA
Member 11593571 18-Jun-16 20:39pm    
Yeah, I changed it to sizeof(wchar_t) or strlen(L"ซ"). I think it fixes the bug but it doesn't fix my problem, so do you have any idea how I can write ซ to my text file?
Sergey Alexandrovich Kryukov 18-Jun-16 20:53pm    
Or? These are two different things. Excuse me, what is the exact line after the fix? What is your platform?
—SA
Member 11593571 19-Jun-16 8:56am    
I am using Windows.
I don't think it is converted to ASCII. But you are writing only one byte to the file (the lower byte of your wide character). To write all the bytes of the character, use:
write_on.write(reinterpret_cast<const char*>(buf), sizeof(wchar_t));


[EDIT]
See comments. The universal solution should be:
write_on.write((char*)buf, wcslen(buf) * sizeof(wchar_t));


With your example string "ซ" a single Unicode character (the Thai character SO SO) is written to file. Assuming your platform uses UTF-16LE encoding for wide characters (like Windows), the Unicode code point is 0x0E0B and the binary file content will be 0x0B followed by 0x0E.

When reading such files later, you must know which encoding was used. Or, more generally: you must know how to interpret the content of each file you want to read.

If you interpret the file as ASCII (or some other 8-bit text), you will read the ASCII control characters 0x0B and 0x0E (VT and SO). But if you interpret it as UTF-16LE, you will get the code point 0x0E0B.
[/EDIT]
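As an illustration of the byte layout described above, here is a sketch that writes the UTF-16LE bytes explicitly, low byte first, so the file content does not depend on the platform's wchar_t size or byte order. The helper name write_utf16le and the optional BOM parameter are assumptions made for this example; they are not part of the question's code.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: writes `text` to `path` as UTF-16LE, optionally
// preceded by a byte order mark (U+FEFF, stored as FF FE).
// Returns false if the file cannot be opened or a write fails.
bool write_utf16le(const std::string& path, const std::u16string& text,
                   bool with_bom) {
    std::ofstream out(path, std::ios::binary | std::ios::out);
    if (!out) {
        return false;
    }
    // Emit one 16-bit code unit as two bytes, low byte first.
    auto put_unit = [&out](char16_t unit) {
        out.put(static_cast<char>(unit & 0xFF));
        out.put(static_cast<char>((unit >> 8) & 0xFF));
    };
    if (with_bom) {
        put_unit(0xFEFF);
    }
    for (char16_t unit : text) {
        put_unit(unit);
    }
    out.flush();
    return out.good();
}
```

Calling write_utf16le with u"\u0E0B" and no BOM produces a two-byte file containing 0x0B followed by 0x0E. With the BOM enabled, the file starts with 0xFF 0xFE, which gives a text editor a chance to recognize the encoding instead of showing VT and SO.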
 
v2
Comments
Sergey Alexandrovich Kryukov 18-Jun-16 13:00pm    
This is most certainly so, but, strictly speaking, not in 100% cases, so this is almost correct. Therefore, I up-voted the answer with almost 5 :-).

The only missing part is a mention that wchar_t is implementation-dependent, plus a proper explanation of what "wide" means here. In the general case, wchar_t is not always 16 bits, and the array would not always have 1 element. Your solution is correct under some assumptions: for example, when the implied encoding is UTF-16 and wchar_t is 16 bits, so that the 16 bits needed to represent this particular character fit in one wchar_t word.

Please see solution 2 for further explanations.

—SA
Jochen Arndt 19-Jun-16 4:37am    
Thank you Sergey. You are right, I should have explained that in more detail.

I used the sizeof() operator to take into account that the size is platform-dependent. So my answer is correct in all cases for a single character (as used in the question). For multiple characters it is of course

write_on.write(buf, wcslen(buf) * sizeof(wchar_t));
Sergey Alexandrovich Kryukov 19-Jun-16 10:00am    
Right. You had better add this factor wcslen(buf) to your solution.
—SA
Member 11593571 19-Jun-16 12:57pm    
First of all, it doesn't compile without casting buf to (char*), so I had to do it this way: write_on.write((char*)buf, wcslen(buf) * sizeof(wchar_t)); but nothing changed, and I am still getting VT and SO instead of my custom character.
Jochen Arndt 20-Jun-16 2:58am    
You have a wide string consisting of one wide character. Assuming your platform uses UTF-16LE (like Windows) the Unicode code point of your character is 0x0E0B (the Thai character SO SO).

This is written to file (it will write the value byte oriented: 0x0B followed by 0x0E).

If you now want to read the file, you must know how to interpret the file content (as with any file you are reading). If you interpret the file content as ASCII characters you will see the control characters SO and VT. But if you interpret the content as UTF-16LE you will get the character 0x0E0B.

So when you want your character back, you must write code that performs the reverse operation:

- Get the file size
- Allocate a wchar_t buffer with (size / sizeof(wchar_t)) wide characters (usually add one for a terminating null character)
- Read the file content into the buffer (binary mode)
- You will have your character(s) back in your allocated buffer
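The steps above can be sketched as follows. This assumes the file was written on the same platform with write((char*)buf, wcslen(buf) * sizeof(wchar_t)), so the bytes can be copied straight back into wchar_t elements. The function name read_wide_file is made up for this example, and a std::wstring is used in place of a manually allocated buffer, so no extra element for the terminator is needed.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: reads a whole file of raw wchar_t data back
// into a std::wstring. Returns an empty string on any failure.
std::wstring read_wide_file(const std::string& path) {
    // Open in binary mode, positioned at the end, to get the file size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) {
        return std::wstring();
    }
    const std::streamoff size = in.tellg();
    if (size <= 0) {
        return std::wstring();
    }
    in.seekg(0, std::ios::beg);

    // One wchar_t per sizeof(wchar_t) bytes; trailing odd bytes are ignored.
    std::wstring text(static_cast<std::size_t>(size) / sizeof(wchar_t), L'\0');
    in.read(reinterpret_cast<char*>(&text[0]),
            static_cast<std::streamsize>(text.size() * sizeof(wchar_t)));
    if (!in) {
        return std::wstring();
    }
    return text;
}
```

After writing L"ซ" as shown in the solution, read_wide_file on the same file should return a one-character string whose only element has the value 0x0E0B. Displaying that character correctly is a separate matter that depends on the console or editor used.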

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


