|
|
Comments and Discussions
|
|
 |
|

|
Just wondering if you are aware of any performance issues with the ReadUnicodeString(CString&) method in this class. I'm investigating a performance problem where this class is used to process a Unicode file. Profiling shows that about 90% of the processing time is spent in this method. Additionally, if I manually convert the Unicode file to an ANSI file, my processing time drops from 5 minutes per file to 15 seconds per file. My suspicion is that the W2T conversion is the bottleneck, but I haven't isolated it yet. Do either of you have any metrics comparing this class's ReadString(CString&) with the parent's ReadString(CString&)? While I can see there being a difference, I wouldn't have expected it to be so significant. Thanks.
Chris Meech
I am Canadian. [heard in a local bar]
In theory there is no difference between theory and practice. In practice there is. [Yogi Berra]
|
|
|
|

|
I finally tracked down the difficulty. To some extent it is dependent upon the data that the file contains. In my case, I'm working with tab-delimited records that have an average record length of something less than 100 characters. If I change the constant ACCUMULATOR_CHAR_COUNT that is declared in the .h file to 128 instead of 2048, all performance issues disappear. As a suggestion, perhaps the call to CStdioFile::Read() in the LoadAccumulator() method could be modified to use a member variable that specifies a record size and limits how much it reads, i.e. have it read much less than ACCUMULATOR_CHAR_COUNT, so that the actual read operations are tuned to some assigned average record length.
Chris Meech
I am Canadian. [heard in a local bar]
In theory there is no difference between theory and practice. In practice there is. [Yogi Berra]
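One way to see why the accumulator size matters so much: in the code posted later in this thread, ReadUnicodeString calls Seek() after a CR/LF pair, and Seek() clears the accumulator, so every line read refills it from scratch. The sketch below is a simplified model of that behaviour in standard C++, not the class itself. The function name charsLoaded and its parameters are hypothetical, and it assumes fixed-length records and an accumulator that is discarded after every line.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical model: counts how many characters get copied into the
// accumulator while every line of a file is read once.
// Assumes fixed-length records and a refill from scratch after each line.
std::size_t charsLoaded(std::size_t records, std::size_t recordLen, std::size_t chunk)
{
    assert(chunk > 0 && recordLen > 0);
    const std::size_t total = records * recordLen; // file size in characters
    std::size_t pos = 0;     // logical position of the next unread line
    std::size_t loaded = 0;  // characters pushed into the accumulator so far
    while (pos < total) {
        const std::size_t lineEnd = pos + recordLen;
        std::size_t filePos = pos; // after the seek, the file position equals pos
        // Load chunks until the accumulator covers the end of the line.
        while (filePos < lineEnd) {
            const std::size_t n = std::min(chunk, total - filePos);
            loaded += n;
            filePos += n;
        }
        pos = lineEnd; // the seek throws away whatever is left over
    }
    return loaded;
}
```

Under these assumptions, 1000 records of 100 characters cost roughly sixteen times more copying with a chunk of 2048 than with a chunk of 128, which is at least in the same direction as the timings reported above.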
|
|
|
|

|
For example, if I have a string like:
SOME_KEY=Курсив
the ReadString function reads only "SOME_KEY=" and then fails to read any of the strings after it. I have tested this many times.
|
|
|
|

|
As far as I can see, this happens because CStdioFile::Read in the LoadAccumulator function returns the wrong byte count.
|
|
|
|

|
Hi Sunriser,
sunrizer104 wrote: function returns wrong read byte count.
I believe the issue arises from assumptions on the code page and the whole SB/MB -> UTF16 codepoint mappings.
I'll get the corrected code uploaded shortly.
Jeff
|
|
|
|

|
Solution found: just replace the call to CStdioFile::Read with CFile::Read in LoadAccumulator.
modified on Wednesday, March 25, 2009 10:36 AM
|
|
|
|

|
Sorry, I was too hasty with that solution: it works very unreliably.
|
|
|
|
|

|
I've solved the problem by reopening the file with the CFile::typeBinary flag when Unicode has been detected (and for a UNICODE build). Something like this works fine for me:
CUTF16File in;
in.Open(L"file.txt", CFile::modeRead | CFile::shareDenyNone);
if (in.IsUnicodeFile())
{
    in.Close();
    in.Open(L"file.txt", CFile::modeRead | CFile::typeBinary | CFile::shareDenyNone);
}
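A plausible explanation for why binary mode helps with the "SOME_KEY=Курсив" case: the Cyrillic letter К is U+041A, so its UTF-16LE encoding starts with the byte 0x1A (Ctrl-Z), which the Microsoft C runtime treats as end-of-file when a stream is read in text mode. That this is what happened here is an assumption, not something confirmed in the thread. The sketch below uses only standard C++, with hypothetical helper names toUtf16LeBytes and firstCtrlZ, to show where that byte falls in the string.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Encode UTF-16 code units as little-endian bytes (no BOM).
std::vector<unsigned char> toUtf16LeBytes(const std::u16string& text)
{
    std::vector<unsigned char> bytes;
    for (char16_t unit : text) {
        bytes.push_back(static_cast<unsigned char>(unit & 0xFF)); // low byte first
        bytes.push_back(static_cast<unsigned char>(unit >> 8));   // then high byte
    }
    assert(bytes.size() == text.size() * 2);
    return bytes;
}

// Index of the first 0x1A (Ctrl-Z) byte, or bytes.size() if there is none.
std::size_t firstCtrlZ(const std::vector<unsigned char>& bytes)
{
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (bytes[i] == 0x1A) {
            return i;
        }
    }
    return bytes.size();
}
```

For u"SOME_KEY=\u041A\u0443\u0440\u0441\u0438\u0432" the first 0x1A byte sits at offset 18, immediately after the nine characters of "SOME_KEY=", which matches the point where the read stopped.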
|
|
|
|

|
Another bug has been found: m_dwCurrentFilePointer wasn't initialized in the Open function, which caused problems after reopening the file. It should be:
BOOL CUTF16File::Open(LPCTSTR lpszFileName, UINT nOpenFlags, CFileException* pError )
{
    m_dwCurrentFilePointer = 0;
    ...
}
|
|
|
|

|
Hello SkyNight
I changed the underlying CUTF16File class to always open the file in binary mode. It meant changing the way text is written.
General note to all users: if your project specifies Unicode, then writing any text via CUTF16File::WriteString(LPCTSTR lpsz, BOOL bAsUnicode /*= FALSE */) will always create and write a Unicode text file, regardless of the value of the bAsUnicode flag. If anyone objects to this and feels that CUTF16File should always write ANSI text when bAsUnicode is FALSE, please say so. But I don't believe that the parent MFC CStdioFile class does this, and I've tried to keep behaviour as close to CStdioFile as possible and appropriate.
I have sent an update to this article to the CodeProject admins and, subject to approval, it should appear here within the week.
|
|
|
|

|
The resulting change:
In CUTF16File.h, add
virtual DWORD GetLength() const { return m_fileLength; }
and the member
DWORD m_fileLength;
In CUTF16File(LPCTSTR lpszFileName, UINT nOpenFlags) in CUTF16File.cpp, change
if(CFile::GetLength() < 2) { return; }
to
if((m_fileLength = CFile::GetLength()) < 2) { return; }
Likewise, where the result is returned as bResult, change
if(CFile::GetLength() < 2) { return bResult; }
to
if((m_fileLength = CFile::GetLength()) < 2) { return bResult; }
A CUTF16File object can then call GetLength() to retrieve the file length at any time.
|
|
|
|

|
Hi beniiiiiiiiiiiiiiiiiiiiii01
What is the bug with GetLength?
I ran the test (with the change you suggested in the previous post) and opened output1 (Unicode) and 2 (ANSI). Immediately after opening them, GetLength returned 0.
Then I wrote the text to them and called GetLength, and it returned 62 and 27 respectively.
So it all seems to be fine to me.
Remember that CFile::GetLength() returns the number of bytes, as per MSDN.
|
|
|
|

|
CUTF16File::CUTF16File(LPCTSTR lpszFileName, UINT nOpenFlags) :
    CStdioFile(lpszFileName, nOpenFlags),
    m_bIsUnicode(FALSE),
    m_bByteSwapped(FALSE),
    m_dwCurrentFilePointer(0)
{
    //char uchBOM[2] = {0};
    BYTE uchBOM[2] = {0};
    ...
}
Change char to BYTE; otherwise
if(uchBOM[0] == UNICODE_BOM[0] && uchBOM[1] == UNICODE_BOM[1])
{
    m_bIsUnicode = TRUE;
    m_bByteSwapped = FALSE;
}
will fail to work.
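The reason the char version fails can be shown in isolation. When plain char is signed, as it is by default with Visual C++, the byte 0xFF is stored as -1, and the integer promotions applied by == then compare -1 with 255. The sketch below is standard C++ with hypothetical helper names; it uses signed char explicitly so that the behaviour does not depend on the compiler's default for char.

```cpp
// Same values as the BOM constant in the class header.
const unsigned char UNICODE_BOM[2] = {0xFF, 0xFE};

// Bytes held in a signed buffer: 0xFF is stored as -1 and promotes to int -1,
// while UNICODE_BOM[0] promotes to int 255, so the comparison is false.
bool matchesBomSigned(const signed char* buf)
{
    return buf[0] == UNICODE_BOM[0] && buf[1] == UNICODE_BOM[1];
}

// Bytes held in an unsigned buffer (BYTE is unsigned char): both sides
// promote to the same non-negative int, so the comparison works as intended.
bool matchesBomUnsigned(const unsigned char* buf)
{
    return buf[0] == UNICODE_BOM[0] && buf[1] == UNICODE_BOM[1];
}
```

Reading the same two bytes FF FE into each kind of buffer gives false from matchesBomSigned and true from matchesBomUnsigned.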
|
|
|
|

|
Yep, you're right.
I'm posting a revised version as we speak.
|
|
|
|

|
I set up an MFC dialog project with VC 6.0, added this UTF-16 class to the project and copied the test code into it. But wcBOM always equals 0 when reading unicode_write.txt. Why? By the way, my test project is built with the Unicode flag set.
|
|
|
|

|
I like c++
|
|
|
|

|
Any chance you could send me your file....so I can check it out?
|
|
|
|

|
This class can't read the character "?"
I like c++
|
|
|
|

|
I've found when reading from a text file in the following format (hope this pastes okay):
01.wav|01 康定情歌(琵琶演奏).wav
02.wav|02 敖包相会(中胡演奏).wav
03.wav|03 二月里来(古筝演奏).wav
04.wav|04 送别(古筝演奏).wav
I find that the "accumulator" gets emptied before it should. I've edited UTF16File.cpp, method LoadAccumulator, and changed:
for (UINT i = 0; i < uCount / 2; i++)
to
for (UINT i = 0; i < uCount; i++)
and everything now seems to work fine.
Has anyone any idea why the "divide by two" was there? I know that Unicode characters occupy two bytes, but the buffer is being read using a WCHAR type, so this is automatically handling the double-byte issue.
-- modified at 13:25 Monday 26th June, 2006
|
|
|
|

|
I think the /2 is there because uCount is the number of 1-byte chars (note the sizeof(WCHAR) in its initialisation). Since the for-loop increments the WCHAR pointer pwszBuffer, then we only want to iterate for half the number of 1-byte chars.
I am at a loss as to why this did not work properly for Unicode builds, but it was ok for non-Unicode ones.
Jordan
Ashes to ashes, DOS to DOS.
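The byte-count versus character-count distinction discussed above can be sketched without MFC. The function below is a hypothetical stand-in for the conversion step in LoadAccumulator, not the class's code: it takes a byte count, as Read() would return, and produces byteCount / 2 UTF-16 code units, optionally swapping the bytes of each unit. It assumes the default (unswapped) layout is little-endian, as on Windows.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Convert a raw byte buffer into UTF-16 code units.
// byteCount is a number of bytes, so the result has byteCount / 2 units;
// a trailing odd byte is ignored.
std::vector<char16_t> decodeUtf16(const unsigned char* bytes,
                                  std::size_t byteCount,
                                  bool byteSwapped)
{
    std::vector<char16_t> units;
    units.reserve(byteCount / 2);
    for (std::size_t i = 0; i + 1 < byteCount; i += 2) {
        unsigned lo = bytes[i];     // first byte in the file
        unsigned hi = bytes[i + 1]; // second byte in the file
        if (byteSwapped) {
            std::swap(lo, hi);      // big-endian file: the first byte is the high one
        }
        units.push_back(static_cast<char16_t>(lo | (hi << 8)));
    }
    return units;
}
```

Iterating uCount times over a WCHAR pointer, as in the change proposed earlier, would walk twice as far as the data that was read, which is why the loop bound has to be expressed in code units rather than bytes.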
|
|
|
|

|
Hello everybody.
A few months ago I said that I'd made some changes that allowed the code to work with non-Unicode builds. I made a couple of other mods I believe to get it fully working.
I invited the author to contact me so that he could post the new version up - and he did. But he apparently had problems with CodeProject themselves and it never got done.
Every now and then I get emails from people asking for my new version. I don't mind this, but of course there is a delay in my replies. If you're like me, when you look for something you want it now, not after the time it takes to write a request and get an email reply.
So I'm pasting the contents of the header and implementation files in this message so you can just go ahead and copy it straight off.......
1. First UTF16File.h
#if !defined(AFX_UTF16File_H__32BEF8AC_25E0_482F_8B00_C40775BCDB81__INCLUDED_)
#define AFX_UTF16File_H__32BEF8AC_25E0_482F_8B00_C40775BCDB81__INCLUDED_
#if _MSC_VER > 1000
#pragma once
#endif

#pragma warning(push, 3)
#include <list>
#pragma warning(pop)

const unsigned char UNICODE_BOM[2] = {unsigned char(0xFF), unsigned char(0xFE)};
const unsigned char UNICODE_RBOM[2] = {unsigned char(0xFE), unsigned char(0xFF)};
const INT ACCUMULATOR_CHAR_COUNT = 2048;

class CUTF16File: public CStdioFile
{
public:
    CUTF16File();
    CUTF16File(LPCTSTR lpszFileName, UINT nOpenFlags);
    virtual BOOL Open(LPCTSTR lpszFileName, UINT nOpenFlags, CFileException* pError = NULL);
    virtual BOOL ReadString(CString& rString);
    virtual LPTSTR ReadString(LPTSTR lpsz, UINT nMax);
    virtual VOID WriteString(LPCTSTR lpsz, BOOL bAsUnicode = FALSE);
    virtual LONG Seek(LONG lOff, UINT nFrom);
    BOOL IsUnicodeFile() { return m_bIsUnicode; }

protected:
    BOOL ReadUnicodeString(CString& szString);
    LPTSTR ReadUnicodeString(LPTSTR lpsz, UINT nMax);
    virtual VOID WriteANSIString(LPCSTR lpsz);
    virtual VOID WriteUnicodeString(LPCWSTR lpsz);
    BOOL m_bIsUnicode;
    BOOL m_bByteSwapped;

private:
    BOOL LoadAccumulator();
    std::list<WCHAR> m_Accumulator;
    DWORD m_dwCurrentFilePointer;
};
#endif
2. Second UTF16File.cpp
#include "stdafx.h"
#include "UTF16File.h"
#include <atlconv.h>

#ifdef _DEBUG
#undef THIS_FILE
static char THIS_FILE[]=__FILE__;
#define new DEBUG_NEW
#endif

CUTF16File::CUTF16File(): CStdioFile(),
    m_bIsUnicode(FALSE),
    m_bByteSwapped(FALSE),
    m_dwCurrentFilePointer(0)
{
}
CUTF16File::CUTF16File(LPCTSTR lpszFileName, UINT nOpenFlags) :
    CStdioFile(lpszFileName, nOpenFlags),
    m_bIsUnicode(FALSE),
    m_bByteSwapped(FALSE),
    m_dwCurrentFilePointer(0)
{
    char uchBOM[2] = {0};
    if(CFile::modeWrite == (nOpenFlags & CFile::modeWrite)) { return; }
    if(CFile::GetLength() < 2) { return; }
    m_dwCurrentFilePointer += CStdioFile::Read(reinterpret_cast<LPVOID>(uchBOM), 2);
    if(uchBOM[0] == UNICODE_BOM[0] && uchBOM[1] == UNICODE_BOM[1])
    {
        m_bIsUnicode = TRUE;
        m_bByteSwapped = FALSE;
    }
    if(uchBOM[0] == UNICODE_RBOM[0] && uchBOM[1] == UNICODE_RBOM[1])
    {
        m_bIsUnicode = TRUE;
        m_bByteSwapped = TRUE;
    }
    if(FALSE == m_bIsUnicode)
    {
        m_dwCurrentFilePointer = 0;
        CStdioFile::Seek(0, CFile::begin);
    }
    m_Accumulator.clear();
}
BOOL CUTF16File::Open(LPCTSTR lpszFileName, UINT nOpenFlags, CFileException* pError )
{
    BOOL bResult = FALSE;
    unsigned char uchBOM[3] = {0};
    bResult = CStdioFile::Open(lpszFileName, nOpenFlags, pError);
    if(CFile::modeWrite == (nOpenFlags & CFile::modeWrite)) { return bResult; }
    if(CFile::GetLength() < 2) { return bResult; }
    if(TRUE == bResult)
    {
        m_dwCurrentFilePointer += CStdioFile::Read(reinterpret_cast<LPVOID>(uchBOM), 2);
        if(uchBOM[0] == UNICODE_BOM[0] && uchBOM[1] == UNICODE_BOM[1])
        {
            m_bIsUnicode = TRUE;
            m_bByteSwapped = FALSE;
        }
        if(uchBOM[0] == UNICODE_RBOM[0] && uchBOM[1] == UNICODE_RBOM[1])
        {
            m_bIsUnicode = TRUE;
            m_bByteSwapped = TRUE;
        }
        if(FALSE == m_bIsUnicode)
        {
            m_dwCurrentFilePointer = 0;
            CStdioFile::Seek( 0, CFile::begin );
        }
    }
    m_Accumulator.clear();
    return bResult;
}
BOOL CUTF16File::ReadString( CString& rString )
{
    if(TRUE == m_bIsUnicode)
    {
        return ReadUnicodeString(rString);
    }
    return CStdioFile::ReadString(rString);
}

LPTSTR CUTF16File::ReadString(LPTSTR lpsz, UINT nMax)
{
    if(TRUE == m_bIsUnicode)
    {
        return ReadUnicodeString(lpsz, nMax);
    }
    return CStdioFile::ReadString(lpsz, nMax);
}
BOOL CUTF16File::ReadUnicodeString(CString& rString)
{
    USES_CONVERSION;
    BOOL bRead = FALSE;
    WCHAR c[2] = {0};
    rString.Empty();
    LoadAccumulator();
    while(FALSE == m_Accumulator.empty())
    {
        bRead = TRUE;
        c[0] = m_Accumulator.front();
        m_Accumulator.pop_front();
        if(L'\r' == c[0] || L'\n' == c[0])
        {
            m_dwCurrentFilePointer += 2;
            c[0] = m_Accumulator.front();
            m_Accumulator.pop_front();
            if(L'\r' == c[0] || L'\n' == c[0])
            {
                m_dwCurrentFilePointer += 2;
                Seek(m_dwCurrentFilePointer, CFile::begin);
            }
            break;
        }
        m_dwCurrentFilePointer += 2;
        rString += W2T(c);
        if(TRUE == m_Accumulator.empty())
        {
            LoadAccumulator();
        }
    }
    return bRead;
}
LPTSTR CUTF16File::ReadUnicodeString( LPTSTR lpsz, UINT nMax )
{
    USES_CONVERSION;
    BOOL bRead = FALSE;
    LPTSTR p = lpsz;
    WCHAR c[2] = {0};
    ASSERT(lpsz != NULL);
    ASSERT(AfxIsValidAddress(lpsz, nMax));
    ASSERT(m_pStream != NULL);
    if(nMax <= 1) { return lpsz; }
    LoadAccumulator();
    while(FALSE == m_Accumulator.empty() && --nMax)
    {
        bRead = TRUE;
        c[0] = m_Accumulator.front();
        m_dwCurrentFilePointer += 2;
        *p++ = *(W2T(c));
        m_Accumulator.pop_front();
        if(L'\r' == c[0] || L'\n' == c[0])
        {
            m_dwCurrentFilePointer += 2;
            c[0] = m_Accumulator.front();
            m_Accumulator.pop_front();
            if(L'\r' == c[0] || L'\n' == c[0])
            {
                m_dwCurrentFilePointer += 2;
                Seek(m_dwCurrentFilePointer, CFile::begin);
            }
            break;
        }
        if(TRUE == m_Accumulator.empty())
        {
            LoadAccumulator();
        }
    }
    *p = L'\0';
    return TRUE == bRead ? lpsz : NULL;
}
VOID CUTF16File::WriteString( LPCTSTR lpsz, BOOL bAsUnicode )
{
    USES_CONVERSION;
    if(TRUE == bAsUnicode)
    {
        WriteUnicodeString(T2W(lpsz));
    }
    else
    {
        WriteANSIString(lpsz);
    }
}
BOOL CUTF16File::LoadAccumulator()
{
    BYTE cbBuffer[ACCUMULATOR_CHAR_COUNT * sizeof(WCHAR)];
    UINT uCount = CStdioFile::Read(cbBuffer, ACCUMULATOR_CHAR_COUNT * sizeof(WCHAR));
    WCHAR* pwszBuffer = reinterpret_cast<WCHAR*>(cbBuffer);
    for(UINT i = 0; i < uCount / 2; i++)
    {
        WCHAR c = *pwszBuffer++;
        if(TRUE == m_bByteSwapped)
        {
            BYTE b1 = BYTE(c >> 8);
            BYTE b2 = BYTE(c & 0xFF);
            c = WCHAR(b1 | (b2 << 8));
        }
        m_Accumulator.push_back(c);
    }
    return 0 == uCount;
}
LONG CUTF16File::Seek(LONG lOff, UINT nFrom)
{
    LONG lResult = CStdioFile::Seek(lOff, nFrom);
    m_dwCurrentFilePointer = CStdioFile::Seek(0, CFile::current);
    m_Accumulator.clear();
    return lResult;
}

VOID CUTF16File::WriteANSIString( LPCSTR lpsz )
{
    CStdioFile::WriteString(lpsz);
}

VOID CUTF16File::WriteUnicodeString(LPCWSTR lpsz)
{
    if(0 == CFile::GetPosition())
    {
        CFile::Write(static_cast<LPVOID>(LPVOID(UNICODE_BOM)), sizeof(UNICODE_BOM));
    }
    CFile::Write(lpsz, wcslen(lpsz) * sizeof(WCHAR));
}
/////////////////////////////////////////////////////////////////////////////
That's it. Hope you find this useful.
Jordan
Ashes to ashes, DOS to DOS!
-- modified at 4:55 Sunday 28th May, 2006
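One thing worth noting about the line-reading loops above: after a '\r' or '\n' is popped, the next character is taken with front() without checking whether the accumulator is empty, and it is consumed even when it is not part of the line ending. The sketch below shows one possible way to split lines that avoids both, written in standard C++ rather than MFC. The name popLine is hypothetical, and it assumes the whole input is already in the queue, so it does not deal with a CR/LF pair split across two reads.

```cpp
#include <deque>
#include <string>

// Pop one line from the front of acc into line, without the line terminator.
// Accepts CR, LF and CR/LF endings. Returns false only when acc is empty.
bool popLine(std::deque<char16_t>& acc, std::u16string& line)
{
    line.clear();
    if (acc.empty()) {
        return false;
    }
    while (!acc.empty()) {
        const char16_t c = acc.front();
        acc.pop_front();
        if (c == u'\r') {
            // Swallow the LF of a CR/LF pair, but only if one is really there.
            if (!acc.empty() && acc.front() == u'\n') {
                acc.pop_front();
            }
            return true;
        }
        if (c == u'\n') {
            return true;
        }
        line.push_back(c);
    }
    return true; // last line had no terminator
}
```

Called repeatedly on the text "ab\r\ncd\nef\r", this yields "ab", "cd" and "ef", then returns false.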
|
|
|
|

|
Great Article. Thanks for the non-Unicode version! I found that with Visual Studio 2005 I had to modify the first line of the constructor
from
char uchBOM[2] = {0};
to
unsigned char uchBOM[2] = {0};
...in order to have this class correctly read/recognize the encoding BOM.
robo
|
|
|
|

|
Hi Jordan,
Merry Christmas! I finally got your changes incorporated. I made a few changes to get a clean compile under VS 2002. I marked them with a comment.
Jeff
|
|
|
|
A UTF-16 class derived from CStdioFile for reading and writing Unicode files
| Type | Article |
| Licence | CPOL |
| First Posted | 9 Feb 2005 |
| Views | 106,753 |
| Downloads | 821 |
| Bookmarked | 54 times |
|
|