OK, I can now read the Unicode file, but I get the entire file in one string and I am unable to break it into lines and then into words. I am very confused. I had a previous post, but since the problem is different now I am posting a new question.

Objective:
- I want to filter some words from the file (e.g. words enclosed in double quotes).
- I have read the Unicode (UTF-16) file, and it ends up in a single string.
- I need to break it up line by line and then, using wcstok, break each line into words.

Platform: Windows, Visual Studio 2010, Unicode: UTF-16. If you have a different suggestion, I am open to changing the code. It would also be great if you could paste sample code to help me understand.

Pasting the code below:

C++
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
using namespace std;

wifstream fin("profiles.txt", ios_base::binary);  // open the input file (UTF-16)
wofstream fout("out.txt", ios_base::binary);      // receives the parsing output

fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
fout.imbue(std::locale(fin.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));

wstring line;
getline(fin, line);  // <-- here the entire file ends up in the wstring "line"

// Need suggestions on below code on how to handle

while (!fin.eof())
{
    // read an entire line into memory
    // wchar_t buf[MAX_CHARS_PER_LINE];
    
    //fin.getline(buf, MAX_CHARS_PER_LINE);
    
    // parse the line into blank-delimited tokens
    int n = 0; // a for-loop index
    
    // array to store memory addresses of the tokens in buf
    const wchar_t* token[MAX_TOKENS_PER_LINE] = {}; // initialize to 0
    
    // parse the line
    token[0] = wcstok(buf, DELIMITER); // first token
    
    if (token[0]) // zero if line is blank
    {
    
        for (n = 0; n < MAX_TOKENS_PER_LINE; n++)   // n starts at 0 on purpose: token[0] is overwritten because we want to ignore the first token
        {
            token[n] = wcstok(0, DELIMITER); // subsequent tokens
    
            if (!token[n]) break; // no more tokens
    
            std::wstring str2 =token[n];
         }
    }
}
Updated 4-Dec-13 22:53pm
Comments
Pablo Aliskevicius 5-Dec-13 4:45am    
Did you look at this?

http://www.codeproject.com/Articles/23198/C-String-Toolkit-StrTk-Tokenizer
Richard MacCutchan 5-Dec-13 4:54am    
So what is your problem?
nv3 5-Dec-13 5:11am    
Have you checked that your file contains newline characters at the end of each line? And those locale manipulations you do look pretty unnecessary to me (for what you are trying to achieve).

1 solution

OK, I finally figured it out: the problem was little-endianness, which garbled the file as it was read (std::codecvt_utf16 assumes big-endian unless told otherwise). Since I was getting some output, I never thought to look at it. The solution was simply to replace the imbue call on the input stream with:

C++
fin.imbue(std::locale(fin.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff,
        std::codecvt_mode(std::little_endian|std::consume_header)>));
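For anyone landing here later, this is roughly how that call could sit inside a line-by-line read. It is only a minimal sketch, not the exact code from my project: the helper name read_utf16le_lines is made up for illustration, and it assumes the file really is UTF-16 little-endian (with or without a BOM). Note that std::codecvt_utf16 is deprecated from C++17 onwards, so newer compilers may warn about it.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): reads a UTF-16 LE
// text file and returns its lines as wide strings.
std::vector<std::wstring> read_utf16le_lines(const char* path)
{
    // Binary mode keeps the runtime from touching the 0x0D/0x0A bytes
    // before the codecvt facet has decoded them.
    std::wifstream fin(path, std::ios_base::binary);

    // little_endian sets the byte order, consume_header skips a leading BOM.
    // The locale takes ownership of the facet created with new.
    fin.imbue(std::locale(fin.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff,
            std::codecvt_mode(std::little_endian | std::consume_header)>));

    std::vector<std::wstring> lines;
    std::wstring line;

    // Loop on getline itself instead of testing eof() up front.
    while (std::getline(fin, line))
    {
        // In binary mode a Windows "\r\n" ending leaves a trailing '\r'.
        if (!line.empty() && line[line.size() - 1] == L'\r')
            line.erase(line.size() - 1);
        lines.push_back(line);
    }
    return lines;
}
```

Looping on getline directly also avoids the while (!fin.eof()) pattern from the question, which never terminates if nothing is read inside the loop.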


After fixing this, the rest of the code worked as expected.
Thanks @Richard, @nv3, @pablo for the responses. Much appreciated.
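As for the original objective (breaking a line into words and picking out the ones in double quotes), here is a small sketch of one way to do it with std::wstring instead of wcstok. The function names split_words and quoted_words are hypothetical, and the sketch assumes quotes are plain ASCII double quotes with no escaping.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits a line into whitespace-delimited words.
std::vector<std::wstring> split_words(const std::wstring& line)
{
    std::vector<std::wstring> words;
    std::wistringstream in(line);
    std::wstring word;
    while (in >> word)          // operator>> skips runs of blanks and tabs
        words.push_back(word);
    return words;
}

// Hypothetical helper: returns the text found between each pair of
// double quotes on the line (no escape handling).
std::vector<std::wstring> quoted_words(const std::wstring& line)
{
    std::vector<std::wstring> found;
    std::wstring::size_type open = line.find(L'"');
    while (open != std::wstring::npos)
    {
        std::wstring::size_type close = line.find(L'"', open + 1);
        if (close == std::wstring::npos)
            break;              // unmatched opening quote: ignore the rest
        found.push_back(line.substr(open + 1, close - open - 1));
        open = line.find(L'"', close + 1);
    }
    return found;
}
```

Unlike wcstok, neither function modifies the input string, so the same line can be passed to both.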
This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)