Handling simple text files in C/C++






Recent questions on reading ANSI vs Unicode text prompted the following.
This code handles a block of text read from a text file in various formats and preprocesses it into the form required by a program. The source text can be ANSI, UTF-8, Unicode or Unicode Big Endian. The code below will convert the text into Unicode or UTF-8 as appropriate for the project settings, whether compiled for Unicode or MBCS support.
The encoding of the source text is identified by the presence of a Byte Order Mark (BOM) at the beginning of the buffer. In the absence of a BOM it is assumed that the data is pure ANSI, although there are other tools and Win32 API functions that can help in the determination, as described in the MSDN topic Unicode and Character Sets[^].
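As one cheap sanity check on that assumption, a BOM-less buffer can be scanned for bytes above 0x7F before it is treated as plain ANSI. The sketch below is only an illustration. The function name `IsSevenBitAscii` is hypothetical and is not part of the original code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the original code): returns true when every
// byte in the buffer lies in the 7-bit range 0x00 to 0x7F.
bool IsSevenBitAscii(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    for (std::size_t i = 0; i < size; ++i)
    {
        if (data[i] > 0x7F)
        {
            // High bit set: either part of a UTF-8 sequence or a
            // code-page specific character, so not plain 7-bit text.
            return false;
        }
    }
    return true;
}
```

A `false` result does not say which encoding the data uses. It only indicates that the "pure ANSI" assumption deserves a closer look.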
Character types and Byte Order Marks are defined as follows:
- ANSI: No signature; single-byte characters, here taken to be in the range 0x00 to 0x7F (plain ASCII).
- UTF-8: Signature = 3 bytes: 0xEF 0xBB 0xBF, followed by multi-byte characters as described at UTF Information[^].
- UTF-16 LE (Little Endian): Used by Windows and other operating systems, and typically called "Unicode". Signature = 2 bytes: 0xFF 0xFE (or 1 word 0xFEFF), followed by 16-bit words: 0x0000 to 0x007F for the normal 0-127 ASCII characters, and 0x0080 to 0xFDFF for the extended set.
- UTF-16 BE (Big Endian): Traditionally used on Macintosh operating systems. Signature = 2 bytes: 0xFE 0xFF (or 1 word 0xFFFE when read on a little-endian machine), followed by words as for UTF-16 LE but with the bytes reversed.
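The signatures listed above can be checked byte by byte, which avoids any dependence on the byte order of the machine doing the check. The following is a minimal sketch of that idea. The `TextEncoding` enum and the `DetectBom` function are hypothetical names introduced for illustration and do not appear in the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical names for illustration; not part of the original code.
enum class TextEncoding { AnsiOrUnknown, Utf8, Utf16LE, Utf16BE };

// Inspect the first bytes of a buffer and report which BOM, if any, is present.
TextEncoding DetectBom(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    if (size >= 3 && std::memcmp(data, "\xEF\xBB\xBF", 3) == 0)
        return TextEncoding::Utf8;          // UTF-8 signature
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return TextEncoding::Utf16LE;       // UTF-16 Little Endian signature
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return TextEncoding::Utf16BE;       // UTF-16 Big Endian signature
    return TextEncoding::AnsiOrUnknown;     // no signature found
}
```

Comparing individual bytes rather than one 16-bit word means the same test gives the same answer on little-endian and big-endian hosts.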
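The input buffer passed to the function below must end with two null bytes to signify the end of the text block (even if it is ANSI). Also, the calling routine is responsible for disposing of both buffers (the input buffer and the returned one) when they are no longer required.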
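One way to satisfy the two-null-byte requirement is to append the terminator while loading the file. The sketch below uses only the standard library. The helper name `LoadTextFile` is hypothetical, and padding an odd-sized file to an even length is an extra precaution assumed here so that a 16-bit scan always meets a whole zero word.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper (not from the original code): load a whole file and
// terminate it so it can be scanned either as bytes or as 16-bit words.
std::vector<unsigned char> LoadTextFile(const std::string& path)
{
    assert(!path.empty());
    std::vector<unsigned char> buffer;
    std::ifstream file(path, std::ios::binary);
    if (file)
    {
        buffer.assign(std::istreambuf_iterator<char>(file),
                      std::istreambuf_iterator<char>());
    }
    // Assumed precaution: pad odd-sized data so the terminator below
    // starts on a 16-bit word boundary.
    if (buffer.size() % 2 != 0)
        buffer.push_back(0);
    // Two null bytes: one 16-bit terminator, which also ends a char string.
    buffer.push_back(0);
    buffer.push_back(0);
    return buffer;
}
```

With this approach the vector owns the input buffer, so `buffer.data()` can be handed to the conversion routine and only the returned text buffer needs an explicit `delete[]`.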
#include <windows.h>
#include <string.h>
#include <wchar.h>

PTSTR Normalise(PBYTE pBuffer)
{
    PTSTR ptText;   // pointer to the text: char* or wchar_t* depending on the UNICODE setting
    PWSTR pwStr;    // pointer to a wchar_t buffer
    int   nLength;  // buffer length (characters or bytes, as noted below)

    // obtain a wide character pointer to check BOMs
    pwStr = reinterpret_cast<PWSTR>(pBuffer);

    // check if the first word is a Unicode Byte Order Mark
    if (*pwStr == 0xFFFE || *pwStr == 0xFEFF)
    {
        // Yes, this is Unicode data
        if (*pwStr++ == 0xFFFE)
        {
            // BOM says this is Big Endian so we need
            // to swap bytes in each word of the text
            while (*pwStr)
            {
                // swap bytes in each word of the buffer
                WCHAR wcTemp = *pwStr >> 8;
                wcTemp |= *pwStr << 8;
                *pwStr = wcTemp;
                ++pwStr;
            }
            // point back to the start of the text
            pwStr = reinterpret_cast<PWSTR>(pBuffer + 2);
        }
#if !defined(UNICODE)
        // This is a non-Unicode project so we need
        // to convert wide characters to multi-byte

        // get calculated buffer size
        nLength = WideCharToMultiByte(CP_UTF8, 0, pwStr, -1, NULL, 0, NULL, NULL);
        // obtain a new buffer for the converted characters
        ptText = new TCHAR[nLength];
        // convert to multi-byte characters
        nLength = WideCharToMultiByte(CP_UTF8, 0, pwStr, -1, ptText, nLength, NULL, NULL);
#else
        nLength = static_cast<int>(wcslen(pwStr)) + 1;  // if Unicode, then copy the input text
        ptText = new WCHAR[nLength];                    // to a new output buffer
        nLength *= sizeof(WCHAR);                       // adjust to size in bytes
        memcpy_s(ptText, nLength, pwStr, nLength);
#endif
    }
    else
    {
        // The text data is UTF-8 or ANSI.
        // Skip the UTF-8 BOM, if present, so that it is not
        // carried into the output in either build configuration.
        if (memcmp(pBuffer, "\xEF\xBB\xBF", 3) == 0)
        {
            // UTF-8
            pBuffer += 3;
        }
#if defined(UNICODE)
        // This is a Unicode project so we need to convert
        // multi-byte or ANSI characters to Unicode.

        // get calculated buffer size
        nLength = MultiByteToWideChar(CP_UTF8, 0, reinterpret_cast<PCSTR>(pBuffer), -1, NULL, 0);
        // obtain a new buffer for the converted characters
        ptText = new TCHAR[nLength];
        // convert to Unicode characters
        nLength = MultiByteToWideChar(CP_UTF8, 0, reinterpret_cast<PCSTR>(pBuffer), -1, ptText, nLength);
#else
        // This is a non-Unicode project so the UTF-8 or ANSI
        // text is simply copied to a new output buffer
        nLength = static_cast<int>(strlen(reinterpret_cast<PSTR>(pBuffer))) + 1;
        ptText = new char[nLength];
        memcpy_s(ptText, nLength, pBuffer, nLength);
#endif
    }

    // return pointer to the (possibly converted) text buffer.
    return ptText;
}
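The byte swap applied to Big Endian input can be tried in isolation. The sketch below mirrors that loop but uses the fixed-width `std::uint16_t` in place of `WCHAR`, so it builds without the Windows headers. The function name `SwapUtf16Bytes` is hypothetical and is not part of the original code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the original code): swap the two bytes of
// every 16-bit word in a zero-terminated buffer, in place.
void SwapUtf16Bytes(std::uint16_t* text)
{
    assert(text != nullptr);
    for (; *text != 0; ++text)
    {
        // The high byte moves down and the low byte moves up; the cast
        // discards the bits shifted beyond 16 by integer promotion.
        *text = static_cast<std::uint16_t>((*text >> 8) | (*text << 8));
    }
}
```

As in the original loop, the scan stops at the first zero word, which is why the terminating null bytes on the input buffer matter.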