See more: C++ Windows Unicode utf8
I am trying to read a Unicode (UTF-8) file one character at a time, but I don't know how to read a single character. What is the easiest way to do that?
Posted 6-Jan-12 16:34pm
Edited 9-Jan-12 3:09am
Comments
johny10151981 at 9-Jan-12 8:09am
   
Correcting Title

Solution 4

Because UTF-8 encoded characters have a variable length, you have to check each byte as you read it. A possible solution (using a file handle opened in binary mode) would be:
 
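#include <cstdio>

typedef struct {
    int nLen;
    unsigned char cByte[6];
} utf8char_t;
 
// Read UTF-8 char into struct
// Return number of UTF-8 bytes read (0 upon EOF, -1 upon invalid codes)
int read_utf8_char(FILE *f, utf8char_t& tChar)
{
    tChar.nLen = 0;
    // fgetc() returns an int so that EOF can be told apart from a 0xFF byte.
    // feof() only becomes true after a read has failed, so test the result instead.
    int ch = fgetc(f);
    if (ch == EOF)
        return 0;
    unsigned char c = tChar.cByte[0] = static_cast<unsigned char>(ch);
    if (c & 0x80)
    {
        // The number of leading 1 bits in the first byte is the sequence length
        while (c & 0x80)
        {
            ++tChar.nLen;
            c <<= 1;
        }
        // A single leading 1 bit marks a continuation byte, which cannot start
        // a character; lengths of 6 and above are rejected as invalid
        if (tChar.nLen < 2 || tChar.nLen >= 6)
            return -1;
        for (int i = 1; i < tChar.nLen; ++i)
        {
            ch = fgetc(f);
            if (ch == EOF)
                return 0;   // input ended in the middle of a sequence
            tChar.cByte[i] = static_cast<unsigned char>(ch);
            // Every following byte must have the form 10xxxxxx
            if ((tChar.cByte[i] & 0xC0) != 0x80)
                return -1;
        }
    }
    else
        tChar.nLen = 1;
    return tChar.nLen;
}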
 
Please note that this example does not check for all possible invalid UTF-8 sequences.
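Some of the checks that the example leaves out can be added once the bytes have been collected. The following is only a sketch, not part of the answer above: `decode_utf8` is a hypothetical helper name, and it assumes the stricter rules of RFC 3629 (at most 4 bytes per character, no overlong encodings, no surrogates, nothing above U+10FFFF). It turns the bytes of one sequence into a code point, or returns -1 if the sequence is not acceptable.

```cpp
#include <cstddef>

// Hypothetical helper (not from the original answer).
// Decodes one UTF-8 sequence of `len` bytes into a Unicode code point.
// Returns -1 if the sequence is malformed under the RFC 3629 rules.
long decode_utf8(const unsigned char* bytes, std::size_t len)
{
    if (bytes == nullptr || len == 0 || len > 4)
        return -1;
    if (len == 1)
        return (bytes[0] < 0x80) ? static_cast<long>(bytes[0]) : -1;

    // Indexed by sequence length (entries 0 and 1 are unused).
    static const unsigned char kLeadMask[5] = { 0, 0, 0xE0, 0xF0, 0xF8 };
    static const unsigned char kLeadBits[5] = { 0, 0, 0xC0, 0xE0, 0xF0 };
    // Smallest code point that legitimately needs 2, 3 and 4 bytes.
    static const long kMinValue[5] = { 0, 0, 0x80, 0x800, 0x10000 };

    // The lead byte must announce exactly `len` bytes.
    if ((bytes[0] & kLeadMask[len]) != kLeadBits[len])
        return -1;

    // Payload bits of the lead byte, then 6 bits per continuation byte.
    long cp = bytes[0] & (0xFF ^ kLeadMask[len]);
    for (std::size_t i = 1; i < len; ++i)
    {
        if ((bytes[i] & 0xC0) != 0x80)
            return -1;                      // not of the form 10xxxxxx
        cp = (cp << 6) | (bytes[i] & 0x3F);
    }

    if (cp < kMinValue[len])
        return -1;                          // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return -1;                          // UTF-16 surrogate range
    if (cp > 0x10FFFF)
        return -1;                          // beyond the Unicode range
    return cp;
}
```

With the struct from the answer, a call could look like `decode_utf8(tChar.cByte, tChar.nLen)` after `read_utf8_char` has returned a positive length.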
Comments
johny10151981 at 8-Jan-12 22:24pm
   
Dude, the OP said Unicode.
 
Unicode is 2 bytes long, UTF-8 is variable length
Jochen Arndt at 9-Jan-12 4:03am
   
He said Unicode in the title and stated more precisely UTF-8 in the question.

Solution 3

After reading the good article referenced by DrBones69, you can also use the sample code from this thread: Read unicode file into wstring

Solution 2

Maybe this article will get you started in the right direction.
Comments
Emilio Garavaglia at 7-Jan-12 13:16pm
   
:-O

Solution 1

There are several options, depending on the type of stream you are using: fgetc, ReadFile, or the fstream >> operator, for example.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Last Updated 9 Jan 2012
Copyright © CodeProject, 1999-2014
All Rights Reserved. Terms of Service