Hi All,

I have to read a CSV file that was created as a Google Doc and contains some French accented characters. When I read this file using the fgets() function, the French characters are replaced with garbage values.

I am new to C++ and do not know how to read a Unicode file.
Please guide me toward a solution. It would be even better if you could provide sample code.


Thanks a lot in advance.


Rajesh

The first thing is to determine which character set the text file uses. The "garbage values" are the bytes that encode the French characters in that character set. From your comments I am pretty sure it is not Unicode, but if you posted the values (in hex) of the string that you read in, it would be much easier to be sure. I suspect the text uses an MBCS (multi-byte character set), in which case you may need to use the correct code page to display it properly on Windows.
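
To make the "post the values in hex" step concrete, here is a minimal sketch of a hex dump for a line that has already been read. The helper name toHex is hypothetical and not part of the original answer. As a rule of thumb, an "é" stored as the single byte E9 points to a Latin-1 / Windows-1252 style code page, while the two bytes C3 A9 point to UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): returns the bytes of `text`
// as space-separated uppercase hex pairs so they can be posted or inspected.
std::string toHex(const std::string& text)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        unsigned char byte = static_cast<unsigned char>(text[i]);
        if (i != 0)
            out += ' ';
        out += digits[byte >> 4];
        out += digits[byte & 0x0F];
    }
    return out;
}
```

Calling toHex on the buffer filled by fgets() and printing the result shows exactly which bytes came from the file, independent of how the debugger chooses to display them.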

BTW The "r" mode is correct as fopen defaults to text mode, but you can use "rt" if you want. In any case reading a text file as text or binary makes little difference (except for the way lines are terminated).
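
As a sketch of the line-termination difference: on Windows, text mode translates CR LF pairs to a single LF on input, whereas binary mode leaves both bytes in the buffer. A small helper (the name stripLineEnding is an assumption for illustration) can remove whichever ending is present, so the rest of the CSV handling does not depend on the mode.

```cpp
#include <string>

// Hypothetical helper (not from the thread): removes any trailing
// '\n' and '\r' characters from a line read with fgets().
std::string stripLineEnding(std::string line)
{
    while (!line.empty() &&
           (line[line.size() - 1] == '\n' || line[line.size() - 1] == '\r'))
    {
        line.erase(line.size() - 1);
    }
    return line;
}
```

The accented characters themselves are not affected by the mode; only the end-of-line bytes differ.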
 
 
From your comment:
I am observing the garbage characters inside Visual Studio IDE while debugging. I am using VS6.0.

OK, so I think this is a character set problem. As Andrew said, your text file is probably not a Unicode text file, and since your computer is not set to a French locale, the characters cannot be displayed properly. (Visual Studio takes the current locale settings and loads the corresponding character set for 8-bit characters.)
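
One quick way to test the "probably not Unicode" assumption is to look for a byte order mark (BOM) at the very start of the file. The sketch below is only an illustration; detectBom is a hypothetical name, and a missing BOM does not prove the file is not UTF-8, since many tools write UTF-8 without one.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): names the encoding suggested
// by a leading byte order mark, or "none" if no known BOM is present.
// Note: a UTF-32LE BOM (FF FE 00 00) would be reported as "UTF-16LE" here.
std::string detectBom(const std::string& data)
{
    const std::size_t n = data.size();
    const unsigned char b0 = n > 0 ? static_cast<unsigned char>(data[0]) : 0;
    const unsigned char b1 = n > 1 ? static_cast<unsigned char>(data[1]) : 0;
    const unsigned char b2 = n > 2 ? static_cast<unsigned char>(data[2]) : 0;

    if (n >= 3 && b0 == 0xEF && b1 == 0xBB && b2 == 0xBF)
        return "UTF-8";
    if (n >= 2 && b0 == 0xFF && b1 == 0xFE)
        return "UTF-16LE";
    if (n >= 2 && b0 == 0xFE && b1 == 0xFF)
        return "UTF-16BE";
    return "none";
}
```

The input should be the first few bytes of the file read in binary mode, so that no translation is applied before the check.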

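I suggest that you convert your input strings into Unicode strings using these functions:
C++
//just to set a max size for the string buffers
#define MAX_SIZE 1000
//change this value to use a different character set
//(1252 is Western European, which covers French; 1250 is Central European)
//Note: if the file is UTF-8, use CP_UTF8 and pass 0 instead of MB_PRECOMPOSED
#define CODE_PAGE 1252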

//converts an ansi string (8 bits per character)
//into a unicode string (16 bits per character)
//using the code page provided by constant CODE_PAGE
//Note: do not delete or free the returning pointer!
WCHAR* AnsiToUnicode(LPCSTR ansiString)
{
    static WCHAR unicodeString[MAX_SIZE];
    MultiByteToWideChar(
        CODE_PAGE,          // code page
        MB_PRECOMPOSED,     // character-type options
        ansiString,         // address of string to map
        -1,                 // number of bytes in string
        unicodeString,      // address of wide-character buffer
        MAX_SIZE            // size of buffer
    );
    return unicodeString;
}

//converts a unicode string (16 bits per character)
//into an ansi string (8 bits per character)
//using the code page provided by constant CODE_PAGE
//Note: do not delete or free the returning pointer!
char* UnicodeToAnsi(LPCWSTR unicodeString)
{
    static char ansiString[MAX_SIZE];
    WideCharToMultiByte(
        CODE_PAGE,      // code page
        0,              // performance and mapping flags
        unicodeString,  // address of wide-character string
        -1,             // number of characters in string
        ansiString,     // address of buffer for new string
        MAX_SIZE,       // size of buffer
        NULL,           // address of default for unmappable characters
        NULL            // address of flag set when default
    );
    return ansiString;
}

//test
void test()
{
     FILE *fp;
     char str[100];
     fp = _tfopen(_T("D:\\myfile.csv"), _T("rt"));
     if (fp == NULL)
          return; //the file could not be opened
     while (fgets(str, sizeof(str), fp))
     {
          //convert the string
          WCHAR* wstr = AnsiToUnicode(str);
 
          //do something......
     }
     fclose(fp);
}


Or you may use cleaner versions of these functions:
C++
int AnsiToUnicode(LPCSTR ansiString, LPWSTR unicodeString, int maxSize)
{
    return MultiByteToWideChar(
        CODE_PAGE,          // code page
        MB_PRECOMPOSED,     // character-type options
        ansiString,         // address of string to map
        -1,                 // number of bytes in string
        unicodeString,      // address of wide-character buffer
        maxSize             // size of buffer
    );
}

int UnicodeToAnsi(LPCWSTR unicodeString, char* ansiString, int maxSize)
{
    return WideCharToMultiByte(
        CODE_PAGE,      // code page
        0,              // performance and mapping flags
        unicodeString,  // address of wide-character string
        -1,             // number of characters in string
        ansiString,     // address of buffer for new string
        maxSize,        // size of buffer
        NULL,           // address of default for unmappable characters
        NULL            // address of flag set when default
    );
}
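
To illustrate what such a conversion amounts to without the Win32 API, here is a portable sketch for one specific case: ISO-8859-1 (Latin-1), where every byte value equals its Unicode code point. This is a stand-in for illustration, not a replacement for MultiByteToWideChar; the name latin1ToWide is hypothetical, and Windows-1252 differs from Latin-1 in the range 0x80 to 0x9F (for example "œ" at 0x9C), so those bytes would need a lookup table.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the thread): widens ISO-8859-1 bytes to
// wide characters. Valid only for Latin-1, where byte value == code point.
std::wstring latin1ToWide(const std::string& bytes)
{
    std::wstring wide;
    wide.reserve(bytes.size());
    for (std::size_t i = 0; i < bytes.size(); ++i)
    {
        // Cast through unsigned char first so 0xE9 becomes 233, not -23.
        wide += static_cast<wchar_t>(static_cast<unsigned char>(bytes[i]));
    }
    return wide;
}
```

Unlike the static-buffer versions above, this returns its own std::wstring, so there is no fixed size limit and no shared buffer to overwrite on the next call.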


And don't forget to enable Unicode string display in Visual Studio:
To set your debugger options to display Unicode strings, open the Tools menu, click Options, click Debug, then check the Display Unicode Strings check box.
 
 
Comments
Harrison H 1-Mar-11 18:04pm    
I'd give you more than five for the sheer effort you've put into resolving the OP's problem.
Olivier Levrey 2-Mar-11 3:42am    
Thank you :)
Sometimes it takes time just because OP can't explain properly what he/she really wants...
This function should work properly (well, it does for me!).
Are you perhaps trying to read a French file on a non-French version of Windows?

Try changing the locale before reading the file:

C
//required for the locale function
#include <locale.h>

void yourFunction()
{
    //set the C runtime locale to French (this applies to the whole process, not only the current thread)
    setlocale(LC_ALL, "French");
    //then open and read your text file
    //...
}
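
Since setlocale() returns a null pointer when the requested locale cannot be honored, it is worth checking the result before assuming the switch happened. A minimal sketch, assuming a hypothetical wrapper named trySetLocale:

```cpp
#include <clocale>
#include <cstddef>

// Hypothetical helper (not from the thread): tries to switch the C runtime
// locale for all categories and reports whether the request was accepted.
bool trySetLocale(const char* name)
{
    return std::setlocale(LC_ALL, name) != NULL;
}
```

Locale names are platform specific: "French" is a Microsoft C runtime name, so trySetLocale("French") may well return false on other systems. A successful call also does not change the bytes that fgets() reads from the file.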
 
 
Comments
Member 7660635 28-Feb-11 5:08am    
Thanks for your response. I tried this but it did not work. Here is my code:
//required for the locale function
#include <locale.h>

void yourFunction()
{
//changed locale settings for the current thread only to french
setlocale(LC_ALL, "French");
FILE *fp = fopen("D:\\myfile.csv","r");
while(fgets(str,1000,fp))
{
// do something
}
}

"str" still contains garbage values.
Olivier Levrey 28-Feb-11 5:19am    
Use fopen with "rt" and not "r" otherwise you will get binary data
I believe you should use the Unicode version of fgets().
 
 
fgets() reads the file as ANSI (8-bit) characters. If you are using TCHARs, you should use _fgetts(). Otherwise, to read a Unicode file, you should use fgetws(), and the character buffer should be WCHAR, not char.
 
 
Comments
Member 7660635 28-Feb-11 5:10am    
I tried both _fgetts() and fgetws() but could not resolve it.
Hans Dietrich 28-Feb-11 5:17am    
Show us your code.
Member 7660635 28-Feb-11 8:13am    
Please find the code below:

FILE *fp;
TCHAR str[100];
fp = _tfopen("D:\\myfile.csv", _T("rb"));
while( _fgetts( str, 100, fp ))
{
//do something......
}
In the code you sent me, you use fopen with "r". If you want to read text, you should use "rt".
 
 
Comments
Member 7660635 28-Feb-11 5:53am    
Thanks for your quick response.
I tried this but could not resolve the problem.
Olivier Levrey 28-Feb-11 6:10am    
With all the answers you have there, it should work.

Where are you observing the garbage characters? Inside Visual Studio IDE while debugging? Inside a dialog box you created? Inside another text file you wrote?
Provide your full code or tell us more details about where you see the problem, because it SHOULD work.
Member 7660635 28-Feb-11 8:07am    
I am observing the garbage characters inside Visual Studio IDE while debugging. I am using VS6.0.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


