Unicode string handling in C++

Question

OK, I can now read the Unicode file, but the entire file ends up in a single string and I am unable to break it into lines and then into words. I had a previous post, but since the problem is different now I am posting a new question.

Objective:
- I want to filter some words from the file (e.g. words enclosed in double quotes).
- I have read the Unicode (UTF-16) file, and its whole content arrives in a single string.
- I need to break it into lines and then, using wcstok, break each line into words.

Platform: Windows, Visual Studio 2010, Unicode: UTF-16. If you have different suggestions, I am open to changing the code. It would also be great if you could paste sample code to help me understand.

Pasting the code below:

C++

#include <codecvt>
#include <cwchar>
#include <fstream>
#include <locale>
#include <string>

using namespace std;

wifstream fin("profiles.txt", ios_base::binary);   // open the input file
wofstream fout("out.txt", ios_base::binary);       // receives the parsing output

fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
fout.imbue(std::locale(fout.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));

wstring line;
getline(fin, line);  // <-- here the entire file ends up in the wstring line

// Need suggestions on how to handle the code below

while (!fin.eof())
{
    // read an entire line into memory
    // wchar_t buf[MAX_CHARS_PER_LINE];
    
    //fin.getline(buf, MAX_CHARS_PER_LINE);
    
    // parse the line into blank-delimited tokens
    int n = 0; // a for-loop index
    
    // array to store memory addresses of the tokens in buf
    const wchar_t* token[MAX_TOKENS_PER_LINE] = {}; // initialize to 0
    
    // parse the line
    token[0] = wcstok(buf, DELIMITER); // first token
    
    if (token[0]) // zero if line is blank
    {
    
        for (n = 0; n < MAX_TOKENS_PER_LINE; n++)   // setting n=0 as we want to ignore the first token
        {
            token[n] = wcstok(0, DELIMITER); // subsequent tokens
    
            if (!token[n]) break; // no more tokens
    
            std::wstring str2 =token[n];
         }
    }
}

Posted 4-Dec-13 19:00pm

nxc121

Updated 4-Dec-13 22:53pm

Richard MacCutchan

v2

Comments

Pablo Aliskevicius 5-Dec-13 4:45am

Did you look at this?

http://www.codeproject.com/Articles/23198/C-String-Toolkit-StrTk-Tokenizer

Richard MacCutchan 5-Dec-13 4:54am

So what is your problem?

nv3 5-Dec-13 5:11am

Have you checked that your file contains newline characters at the end of each line? And those locale manipulations you do look pretty unnecessary to me (for what you are trying to achieve).

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

nxc121 · Answer 1 · 2013-12-04T23:35:00

OK, I finally figured it out: the problem was endianness. The file is little-endian UTF-16, and std::codecvt_utf16 decodes as big-endian unless the little_endian flag is set, so the file was being misread. Since I was getting some output, I never thought to look at this. The solution was just to replace the imbue call on the input stream with:

C++

fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff,
        std::codecvt_mode(std::little_endian | std::consume_header)>));

After fixing this, the rest of the code worked as expected.
Thanks @Richard, @nv3, @pablo for the responses. Much appreciated.
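
For anyone with the same objective, here is a minimal sketch of the splitting and filtering step, assuming the text has already been decoded into a std::wstring. It is not the code from the question: it swaps wcstok for std::getline on a std::wistringstream plus std::wstring::find, and the helper names quoted_words and filter_quoted are made up for illustration.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns every substring enclosed in double quotes
// on a single line. An unmatched opening quote ends the search.
std::vector<std::wstring> quoted_words(const std::wstring& line)
{
    std::vector<std::wstring> result;
    std::wstring::size_type open = line.find(L'"');
    while (open != std::wstring::npos) {
        std::wstring::size_type close = line.find(L'"', open + 1);
        if (close == std::wstring::npos)
            break;                                   // no closing quote
        result.push_back(line.substr(open + 1, close - open - 1));
        open = line.find(L'"', close + 1);           // look for the next pair
    }
    return result;
}

// Hypothetical helper: splits already-decoded text into lines and collects
// the quoted words of each line, in order of appearance.
std::vector<std::wstring> filter_quoted(const std::wstring& text)
{
    std::vector<std::wstring> all;
    std::wistringstream lines(text);
    std::wstring line;
    while (std::getline(lines, line)) {
        // A stream opened in binary mode keeps the '\r' of a "\r\n" line
        // ending, so strip it here.
        if (!line.empty() && line[line.size() - 1] == L'\r')
            line.erase(line.size() - 1);
        std::vector<std::wstring> found = quoted_words(line);
        all.insert(all.end(), found.begin(), found.end());
    }
    return all;
}
```

With a correctly imbued stream, the same loop should work directly on fin (std::getline(fin, line)) instead of on a std::wistringstream. Using find rather than wcstok avoids modifying the buffer and keeps quoted text that contains spaces in one piece.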
