This code takes a block of text read from a file in one of several encodings and converts it into the form required by a program. The source text can be ANSI, UTF-8, UTF-16 Little Endian (which Windows calls "Unicode") or UTF-16 Big Endian. The code below converts the text to UTF-16 or UTF-8 as appropriate for the project settings, that is, whether the project is compiled for Unicode or MBCS support.
The encoding of the source text is identified by the presence of a Byte Order Mark (BOM) at the beginning of the buffer. In the absence of a BOM the data is assumed to be pure ANSI, although other tools and Win32 API functions can help in the determination, as described in the MSDN article "Unicode and Character Sets".
Character types and Byte Order Marks are defined as follows:
- ANSI.
No signature, single byte characters in the range 0x00 to 0x7F.
- UTF-8.
Signature = 3 bytes: 0xEF 0xBB 0xBF
followed by multi-byte characters.
- UTF-16 LE (Little Endian), used for Windows and other operating systems. Typically called "Unicode".
Signature = 2 bytes: 0xFF 0xFE (or 1 word 0xFEFF)
followed by words:
0x0000 to 0x007F for normal 0-127 ASCII chars.
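0x0080 to 0xFFFF for the extended set, with characters beyond that range stored as surrogate pairs (two words in the range 0xD800 to 0xDFFF).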
- UTF-16 BE (Big Endian). This was used on big-endian platforms such as older Macintosh systems.
Signature = 2 bytes: 0xFE 0xFF (or 1 word 0xFFFE when read on a little-endian machine)
followed by words as for UTF-16 LE but with the bytes reversed.
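The signatures listed above can be checked byte by byte, which avoids depending on the byte order of the machine doing the check. The following is a minimal, portable sketch and not part of the Normalise code; the names TextEncoding and DetectBom are made up for illustration, and a buffer without a signature is simply reported as ANSI, matching the assumption stated earlier.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical names, used only for this illustration.
enum class TextEncoding { Ansi, Utf8, Utf16LE, Utf16BE };

// Inspect the first bytes of a buffer and report which signature, if any,
// is present. Comparing individual bytes gives the same answer whatever
// the byte order of the host machine.
TextEncoding DetectBom(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);

    if (size >= 3 && std::memcmp(data, "\xEF\xBB\xBF", 3) == 0)
        return TextEncoding::Utf8;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return TextEncoding::Utf16LE;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return TextEncoding::Utf16BE;

    // No signature: assumed to be ANSI (this sketch does not inspect the content).
    return TextEncoding::Ansi;
}
```

Only the signature is examined here, so UTF-8 text saved without a BOM would also be reported as ANSI.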
Following the comments from MilanA below, I have modified the code to always
return a newly allocated buffer, even when no conversion has taken place.
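The input buffer into which the text is read must be followed by two zero bytes to mark the end of the text block (even if the text is ANSI), since a UTF-16 terminator is a full zero word. The calling routine is also responsible for disposing of both buffers when they are no longer required; the returned buffer is allocated with new[], so it must be released with delete[].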
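One way to satisfy the two-byte terminator requirement is to append the zero bytes as the data is read. The sketch below is an assumption about how a caller might do this, not code from the original project; ReadWithTerminator is a hypothetical helper, and it takes any std::istream so that a std::ifstream opened with std::ios::binary or an in-memory stream can be used.

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <sstream>   // only needed for the in-memory usage shown at the end
#include <vector>

// Hypothetical helper: read every byte from the stream and append the two
// zero bytes that mark the end of the text block.
std::vector<unsigned char> ReadWithTerminator(std::istream& in)
{
    std::vector<unsigned char> buffer((std::istreambuf_iterator<char>(in)),
                                      std::istreambuf_iterator<char>());
    buffer.push_back(0);
    buffer.push_back(0);
    assert(buffer.size() >= 2);   // the terminator is always present
    return buffer;
}

// Usage with an in-memory stream:
//   std::istringstream in("hi");
//   std::vector<unsigned char> buffer = ReadWithTerminator(in);   // 'h', 'i', 0, 0
```

With this approach buffer.data() could be passed to Normalise, and the vector releases the input buffer automatically, leaving only the returned text for the caller to delete[].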
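#include <windows.h>
#include <string.h>

PTSTR Normalise(PBYTE pBuffer)    // input text, terminated by two zero bytes
{
    PTSTR ptText; PWSTR pwStr; int nLength;

    // use a wide character pointer to check for a UTF-16 BOM
    pwStr = reinterpret_cast<PWSTR>(pBuffer);
    if (*pwStr == 0xFFFE || *pwStr == 0xFEFF)
    {
        // UTF-16 data; the post-increment steps over the BOM
        if (*pwStr++ == 0xFFFE)
        {
            // Big Endian: swap the bytes in each word of the text
            while (*pwStr)
            {
                WCHAR wcTemp = *pwStr >> 8;
                wcTemp |= *pwStr << 8;
                *pwStr = wcTemp;
                ++pwStr;
            }
            // point back to the start of the text
            pwStr = reinterpret_cast<PWSTR>(pBuffer + 2);
        }
#if !defined(UNICODE)
        // MBCS project: convert the wide characters to UTF-8
        nLength = WideCharToMultiByte(CP_UTF8, 0, pwStr, -1, NULL, 0, NULL, NULL);
        ptText = new TCHAR[nLength];
        nLength = WideCharToMultiByte(CP_UTF8, 0, pwStr, -1, ptText, nLength, NULL, NULL);
#else
        // Unicode project: copy the text to a new buffer
        nLength = wcslen(pwStr) + 1;
        ptText = new WCHAR[nLength];
        nLength *= sizeof(WCHAR);    // size in bytes
        memcpy_s(ptText, nLength, pwStr, nLength);
#endif
    }
    else
    {
        // UTF-8 or ANSI data: skip the UTF-8 BOM, if present, in either build
        if (memcmp(pBuffer, "\xEF\xBB\xBF", 3) == 0)
            pBuffer += 3;
#if defined(UNICODE)
        // Unicode project: convert the multi-byte or ANSI characters to UTF-16
        nLength = MultiByteToWideChar(CP_UTF8, 0, reinterpret_cast<PCSTR>(pBuffer), -1, NULL, 0);
        ptText = new TCHAR[nLength];
        nLength = MultiByteToWideChar(CP_UTF8, 0, reinterpret_cast<PCSTR>(pBuffer), -1, ptText, nLength);
#else
        // MBCS project: copy the text to a new buffer
        nLength = strlen(reinterpret_cast<PSTR>(pBuffer)) + 1;
        ptText = new char[nLength];
        memcpy_s(ptText, nLength, pBuffer, nLength);
#endif
    }
    // return the (possibly converted) text; the caller releases it with delete[]
    return ptText;
}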
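The byte swap applied to Big Endian text can be tried in isolation on any platform. This is a small sketch of that one step only, using std::uint16_t in place of WCHAR so that it does not depend on the Windows headers; SwapBytesInPlace is a hypothetical name, and the text is assumed to end with a zero word as described above.

```cpp
#include <cassert>
#include <cstdint>

// Swap the two bytes of every 16-bit word, stopping at the zero terminator
// (which is the same value in either byte order).
void SwapBytesInPlace(std::uint16_t* text)
{
    assert(text != nullptr);
    for (; *text != 0; ++text)
    {
        // The shifts are done on promoted ints; the cast keeps the low 16 bits.
        *text = static_cast<std::uint16_t>((*text >> 8) | (*text << 8));
    }
}
```

For example, the words 0x4800 0x6900, which are "Hi" in Big Endian order as seen by a little-endian machine, become 0x0048 0x0069 after the call.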