I am trying to read a Unicode (UTF-8) file one character at a time, but I don't know how to read a single character. Can you tell me the easiest way to read a single character?
Posted 6-Jan-12 15:34pm
Updated 9-Jan-12 2:09am
v2
Comments
johny10151981 9-Jan-12 8:09am
   
Correcting Title

Solution 2

Maybe this article will get you started in the right direction.
Comments
Emilio Garavaglia 7-Jan-12 13:16pm
   
:-O

Solution 1

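There are several options depending on the type of stream you're using, such as fgetc, ReadFile, or the operator>> of fstream. Note that all of these read bytes, so a UTF-8 character that is longer than one byte has to be assembled from several reads.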
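As a rough illustration of the stream-based option, here is a minimal sketch that pulls one UTF-8 character at a time out of any std::istream. The names next_utf8_char and split_utf8 are made up for this example and do not come from any library. It only checks the structure of each sequence (lead byte plus continuation bytes) and assumes the modern limit of four bytes per character; it does not reject overlong forms or surrogates.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: reads the bytes of one UTF-8 encoded character
// from `in` into `out`.
// Returns true on success, false on end of stream or a malformed sequence.
bool next_utf8_char(std::istream& in, std::string& out)
{
    out.clear();
    char first;
    if (!in.get(first))
        return false;                          // end of stream
    unsigned char lead = static_cast<unsigned char>(first);

    int len;
    if      (lead < 0x80)           len = 1;   // 0xxxxxxx: ASCII
    else if ((lead & 0xE0) == 0xC0) len = 2;   // 110xxxxx
    else if ((lead & 0xF0) == 0xE0) len = 3;   // 1110xxxx
    else if ((lead & 0xF8) == 0xF0) len = 4;   // 11110xxx
    else return false;                         // stray continuation or invalid lead byte

    out.push_back(first);
    for (int i = 1; i < len; ++i)
    {
        char next;
        if (!in.get(next))
            return false;                      // truncated sequence
        if ((static_cast<unsigned char>(next) & 0xC0) != 0x80)
            return false;                      // not a 10xxxxxx continuation byte
        out.push_back(next);
    }
    return true;
}

// Hypothetical convenience wrapper: splits an in-memory UTF-8 string into
// its characters, stopping at the first malformed sequence.
std::vector<std::string> split_utf8(const std::string& bytes)
{
    std::istringstream in(bytes);
    std::vector<std::string> chars;
    std::string ch;
    while (next_utf8_char(in, ch))
        chars.push_back(ch);
    return chars;
}
```

For a file, you would pass a std::ifstream opened with std::ios::binary to next_utf8_char and call it in a loop until it returns false.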

Solution 3

After reading the good article referenced above by DrBones69, you can also use the sample code from this thread: Read unicode file into wstring

Solution 4

Because UTF-8 encoded characters have a variable length, you have to check each byte you read. A possible solution (using a file handle opened in binary mode) would be:

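#include <cstdio>

typedef struct {
    int nLen;
    unsigned char cByte[6];
} utf8char_t;
 
// Read UTF-8 char into struct
// Return number of UTF-8 bytes read (0 upon EOF, -1 upon invalid codes)
int read_utf8_char(FILE *f, utf8char_t& tChar)
{
    tChar.nLen = 0;
    // feof() only becomes true after a read has failed, so test fgetc() itself
    int ch = fgetc(f);
    if (ch == EOF)
        return 0;
    unsigned char c = tChar.cByte[0] = static_cast<unsigned char>(ch);
    if (c & 0x80)
    {
        // The number of leading 1 bits in the first byte is the sequence length
        while (c & 0x80)
        {
            ++tChar.nLen;
            c <<= 1;
        }
        // A single leading 1 bit marks a stray continuation byte;
        // lengths of six or more are rejected as invalid
        if (tChar.nLen < 2 || tChar.nLen >= 6)
            return -1;
        for (int i = 1; i < tChar.nLen; ++i)
        {
            ch = fgetc(f);
            if (ch == EOF)
                return 0;
            tChar.cByte[i] = static_cast<unsigned char>(ch);
            // Every following byte must have the form 10xxxxxx
            if ((tChar.cByte[i] & 0xC0) != 0x80)
                return -1;
        }
    }
    else
        tChar.nLen = 1;
    return tChar.nLen;
}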


Please note that this example does not check for all possible invalid UTF-8 sequences.
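If you need the Unicode code point rather than the raw bytes, the bytes collected in tChar.cByte can be combined afterwards. The following is only a sketch under the assumption that the sequence has already been validated and is at most four bytes long; the function name utf8_to_code_point is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: combine the bytes of one already validated UTF-8
// sequence (1 to 4 bytes) into a Unicode code point.
// Returns 0xFFFFFFFF if len is outside 1..4.
std::uint32_t utf8_to_code_point(const unsigned char* bytes, int len)
{
    if (len < 1 || len > 4)
        return 0xFFFFFFFFu;
    if (len == 1)
        return bytes[0];                        // plain ASCII
    // Strip the length prefix from the lead byte; the remaining payload
    // mask is 0x1F for 2 bytes, 0x0F for 3 bytes, 0x07 for 4 bytes.
    std::uint32_t cp = bytes[0] & (0xFFu >> (len + 1));
    // Each continuation byte contributes its low 6 bits.
    for (int i = 1; i < len; ++i)
        cp = (cp << 6) | (bytes[i] & 0x3Fu);
    return cp;
}
```

After a successful read you could call it as utf8_to_code_point(tChar.cByte, tChar.nLen).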
Comments
johny10151981 8-Jan-12 22:24pm
   
Dude, the OP said Unicode.

Unicode is 2 bytes long; UTF-8 is variable length.
Jochen Arndt 9-Jan-12 4:03am
   
He said Unicode in the title and stated more precisely UTF-8 in the question.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Copyright © CodeProject, 1999-2017