Enhance your dynamic memory allocation with an undocumented MFC class

lano1106

4.88/5 (76 votes)

Jan 5, 2006

13 min read

126392

1016

This article describes how to improve dynamic allocation performance on a class basis in a non-intrusive way with an undocumented MFC utility class.

Download source - 47.2 Kb

Introduction
Why use alternatives to malloc/free, new/delete
Limitations of other alternatives
Anatomy of CFixedAlloc
How to use CFixedAlloc
The demo program
Suggestions
Conclusion
Bibliography
History

Introduction

I never stop to be amazed by what you can find by browsing the MFC source code files. My latest discovery is rather spectacular. It consists of a small utility class that allows modifying how objects for a given class are dynamically allocated. This class is CFixecAlloc. It is used inside the famous CString class to allocate string buffers. Using CFixedAlloc with the existing code requires very minimal changes and strives surprising improvements in performance. In this article, I will discuss CRT dynamic allocation overheads, usual alternatives to CRT shortcomings, the anatomy of the CFixedAlloc class, how to use CFixedAlloc in your programs, and finally, I will present a demo application so you can see by yourself what CFixedAlloc can do for your programs.

Why use alternatives to malloc/free, new/delete

It is a well known fact that dynamic memory allocation is expensive. It incurs performance overhead both in execution time and in space. If you need to be convinced, just take a peak in the malloc.c file provided with the CRT source code files. malloc() function is a complex function that takes some time to be executed. Concerning the space overhead, the subject is a bit complex. Here is an overview. Depending on the version of VC++ you are using when you compile your program, and the Windows version where your program is running, a different version of heap will be used. This is out of scope for this article but if you are curious about how CRT chooses the heap version, you can find the function __heap_select() in the heapinit.c file. In the CRT winheap.h header file, three heap versions are defined:

__SYSTEM_HEAP
__V5_HEAP
__V6_HEAP

V5 and V6 heap versions map to heap implementations private to the CRT library while the system heap maps directly to the Win32 heap services (HeapCreate(), HeapAlloc(), etc.). What is interesting about the V5 and V6 versions is that you can know exactly what is the space overhead for these heaps by looking at the CRT source code. For example, you can find this statement for the V6 heap:

// add 8 bytes entry overhead and round up to next para size
sizeEntry = (intSize + 2 * (int)sizeof(int) + 
            (BYTES_PER_PARA - 1)) & ~(BYTES_PER_PARA - 1);

BYTES_PER_PARA equals 16 bytes. This is a huge overhead. Consider the case where you would request twelve bytes. The CRT would reserve 32 bytes for this request. This is more than the double of what you requested! Unfortunately (or fortunately, it depends on your perspective), the V5 and the V6 heap are nowadays rarely used and we cannot know for sure what is the HeapAlloc() overhead because we do not have access to this source code. However, it is a very good assumption that there is an overhead. Microsoft states this fact in the HeapCreate() documentation:

The system uses memory from the private heap to store heap support structures, so not all of the specified heap size is available to the process. For example, if the HeapAlloc function requests 64 kilobytes (K) from a heap with a maximum size of 64K, the request may fail because of system overhead.

Limitations of other alternatives

Some people have written memory pool classes to workaround the CRT heap overhead. The excellent article from Uri Twig is an example of such a class. However, most of them require massive code changes to replace all new and delete invocations to some calls to the buffer pool class member functions. Another common potential limitation (potential because it depends on your needs) is that they lack thread safety. CFixedAlloc addresses these limitations and it will be explained how in the next section.

Anatomy of CFixedAlloc

Let's start by viewing the class declaration:

class CFixedAlloc
{
// Constructors
public:
    CFixedAlloc(UINT nAllocSize, UINT nBlockSize = 64);

// Attributes
    UINT GetAllocSize() { return m_nAllocSize; }

// Operations
public:
    void* Alloc();      // return a chunk of 
                        // memory of nAllocSize
    void Free(void* p); // free chunk of memory 
                        // returned from Alloc
    void FreeAll();     // free everything allocated 
                        // from this allocator

// Implementation
public:
    ~CFixedAlloc();

protected:
    struct CNode
    {
        CNode* pNext;   // only valid when in 
                        // free list
    };

    UINT m_nAllocSize;  // size of each block 
                        // from Alloc
    UINT m_nBlockSize;  // number of blocks to 
                        // get at a time
    CPlex* m_pBlocks;   // linked list of blocks 
                        // (is nBlocks*nAllocSize)
    CNode* m_pNodeFree; // first free node 
                        // (NULL if no free nodes)
    CRITICAL_SECTION m_protect;
};

As you can see, it is a very simple class. The first thing that we can notice is that the class provides thread safety with a critical section. Next, m_AllocSize contains the object size of the class that will be using its service, and m_blockSize specifies the number of objects each fixed block of memory can contain. Both data members are set at the construction. The remaining unknown is the mysterious CPlex pointer. I will not go in to the details of this class but just know that it is the class that does the actual CRT dynamic allocation calls. It only contains a pointer to another CPlex object in order to create a linked list of CPlex so its size is only 4 bytes. When allocating memory, it requests:

m_allocSize*m_blockSize+sizeof(CPlex)

If you are interested in knowing more about CPlex, I recommend you to read the book MFC Internals which discusses CPlex in more details. With this information, we can already compare the memory overhead difference between CRT and CFixedAlloc. By using CRT, you will have the CRT overhead for each object, while with CFixedAlloc, it is the CRT overhead plus the size of CPlex (4 bytes) divided by the number of allocated objects. When the number of objects is big, the memory overhead is very close to zero, and as you will see with the demo, it can make a huge difference of many MBs. Now, let's look more closely at the two more important CFixedAlloc functions:

void* CFixedAlloc::Alloc()
{
    EnterCriticalSection(&m_protect);
    if (m_pNodeFree == NULL)
    {
        CPlex* pNewBlock = NULL;
        TRY
        {
            // add another block
            pNewBlock = CPlex::Create(m_pBlocks, 
                        m_nBlockSize, m_nAllocSize);
        }
        CATCH_ALL(e)
        {
            LeaveCriticalSection(&m_protect);
            THROW_LAST();
        }
        END_CATCH_ALL

        // chain them into free list
        CNode* pNode = (CNode*)pNewBlock->data();
        // free in reverse order to make it 
        // easier to debug
        (BYTE*&)pNode += 
            (m_nAllocSize * m_nBlockSize) - m_nAllocSize;
        for (int i = m_nBlockSize-1; i >= 0; i--, 
                     (BYTE*&)pNode -= m_nAllocSize)
        {
            pNode->pNext = m_pNodeFree;
            m_pNodeFree = pNode;
        }
    }
    ASSERT(m_pNodeFree != NULL); // we must have something

    // remove the first available node from the free list
    void* pNode = m_pNodeFree;
    m_pNodeFree = m_pNodeFree->pNext;

    LeaveCriticalSection(&m_protect);
    return pNode;
}

void CFixedAlloc::Free(void* p)
{
    if (p != NULL)
    {
        EnterCriticalSection(&m_protect);

        // simply return the node to the free list
        CNode* pNode = (CNode*)p;
        pNode->pNext = m_pNodeFree;
        m_pNodeFree = pNode;
        LeaveCriticalSection(&m_protect);
    }
}

In Alloc(), the code checks if the pool of free memory is empty, if it is then it creates a new block of memory (CPlex) and sets up a list of free nodes. It is interesting to note that there is no wasted space to place node information because it is placed directly into each block. Of course, in order for this scheme to work, m_nAllocSize must be bigger than sizeof(CNode). This is why it is checked in the constructor:

ASSERT(nAllocSize >= sizeof(CNode));

If you want to see the CRT code, with a glimpse of an eye, you can tell that CFixedAlloc is much lighter. From an algorithmic point of view, the main factor for this simplicity is because the memory block's size is fixed (thus the name of the class). In a heap that allows variable size allocation, after many allocations and deallocations, the heap gets fragmented and the heap code has to scan the heap once in a while to merge adjacent small free blocks into big ones to make an efficient use of memory. With my demo program, I got the allocation time shred by a factor of 4 to 5.

And finally, let me introduce to you macros that help to use CFixedAlloc with your classes:

// DECLARE_FIXED_ALLOC -- used in class definition
#define DECLARE_FIXED_ALLOC(class_name) \
public: \
    void* operator new(size_t size) \
    { \
        ASSERT(size == s_alloc.GetAllocSize()); \
        UNUSED(size); \
        return s_alloc.Alloc(); \
    } \
    void* operator new(size_t, void* p) \
        { return p; } \
    void operator delete(void* p) { s_alloc.Free(p); } \
    void* operator new(size_t size, LPCSTR, int) \
    { \
        ASSERT(size == s_alloc.GetAllocSize()); \
        UNUSED(size); \
        return s_alloc.Alloc(); \
    } \
protected: \
    static CFixedAlloc s_alloc \

// IMPLEMENT_FIXED_ALLOC -- used in class implementation file
#define IMPLEMENT_FIXED_ALLOC(class_name, block_size) \
CFixedAlloc class_name::s_alloc(sizeof(class_name), block_size) \

The DECLARED_FIXED_ALLOC() macro overloads the new and delete operators of your class and thus allows you to use CFixedAlloc with absolutely no code change. With the IMPLEMENT_FIXED_ALLOC(), this is where you specify the block size. To complete this section, I will add that you can find more information on this type of allocation scheme in the excellent book of Scott Meyer, Effective C++.

How to use CFixedAlloc

Include "fixalloc.h" in the header file that contains the class definition that you want to modify.
Add the DECLARE_FIXED_ALLOC() macro in the class declaration.
Add the IMPLEMENT_FIXED_ALLOC() macro in the CPP file that contains the class definition.
Since CFixedAlloc is a private MFC class, you have to add an additional include directory in the compiler options to point to the MFC source code directory. (I am assuming that you installed it during Visual Studio setup. It is installed by default, I think.)
Recompile.
Fine tune the size of the block size.

Now, you are done. Your class is upgraded for using the MFC fixed allocation service. One word of caution, if the define _DEBUG is present during the compilation, the CFixedAlloc macros will expand to nothing and the result will be that your class will behave as if you did no change to it. During the development of the demo program, I got this small problem where because I choose to link with the debug version of the run-time library, the _DEBUG macro was implicitly set. Be aware of that. To figure that out, I added garbage in this block that MFC automatically adds when it creates a new file:

#ifdef _DEBUG
#define new DEBUG_NEW
#undef THIS_FILE
static char THIS_FILE[] = __FILE__;
#endif

Also, here is a small comment on the last step. It is very important that you provide a good value for the block size. If it is too big, you will be wasting memory, and if it is too small then you will not get as much performance improvements as you could get. However, even if the block size value is too small, you will still reduce the number of CRT allocation calls by a factor equal to the block size which is non negligible. The ideal value for the block size is the exact number of objects that will be allocated like in the demo program, but of course it is usually not possible to know that value.

Visual C++ 2005 warnings

It has been reported that when compiling the demo program with VC++2005, the user is getting the following warnings:

.\CFixedAllocDemoDlg.cpp(237) : warning C4995: 
    'CFixedAlloc': name was marked as #pragma deprecated
.\CFixedAllocDemoDlg.cpp(240) : warning C4995: 
    'CFixedAlloc': name was marked as #pragma deprecated

Since that I am not aware of any MFC class that would have superseded CFixedAlloc and after reading this article, you should know what the class CFixedAlloc does and how, my recommendation, if using CFixedAlloc in your program makes a difference, is that you can safely ignore the warnings. If you feel extra cautious and you really want to make sure that your program will compile with future versions of MFC in case Microsoft removes the CFixedAlloc class from MFC, nothing stops you from making a private copy of the CFixedAlloc files.

This is true that Microsoft is slowly moving away from CFixedAlloc. In MFC6, CFixedAlloc is used with every temporary handle maps objects classes (i.e. CWnd and CGdiObject). Here is an example:

class CTempGdiObject : public CGdiObject
{
    DECLARE_DYNCREATE(CTempGdiObject)
    DECLARE_FIXED_ALLOC(CTempGdiObject);
};

Such classes have been removed in MFC7 (VC++2003). I am really curious about what motivates the MFC team to do so and if someone knows, I would really appreciate if you could share that information.

Memory deallocation

CFixedAlloc returns a node in a "free list" when an object is deleted but the "free list" is inside the same allocated CPlex. Since plexes are single linked, they will remain allocated for the entire duration of the program (until their head is deleted). In fact, CFixedAlloc is more than an allocator, it is a recycler: it allocates on need but does not release. It just keeps apart for eventual subsequent allocation requests.

This results in better performance speed in case of frequent alternate new / delete / new calls (since you will mostly reuse the system resource you already own), but may create problems in those situations when a huge amount of memory is required only for a limited time: the program will retain the memory until the head of the plexes is deleted (and since it is static ... at the end of the program) unless the user explicitly releases the memory with CFixedAlloc::FreeAll() once he is done with the memory. You can call CFixedAlloc::FreeAll() anytime to free yourself the memory if you have previously deleted all the allocated objects. Otherwise, the memory will be freed when the CFixedAlloc object is destroyed. Here is a little reminder: Global objects are created in startup code before entering WinMain(). Global objects are destroyed after the program has exited WinMain() and right before the program gets unloaded.

CFixedAlloc and inheritance

When using CFixedAlloc with the derived class, there are a few things that you must be aware of. Let's say you have:

class Base
{
    DECLARE_FIXED_ALLOC(Base);
};

class Child : public Base
{
private:
    // Some data members declared.
};

If you do:

Child *p = new Child;

Base::operator new() will be called and because the class Child does not have the same size as the class Base, CFixedAlloc will not work. This problem is discussed with great details in the book Effective C++ and to prevent that problem, this is also why there is an ASSERT() statement in the overloaded operator new in the DECLARE_FIXED_ALLOC() macro. However, do not rely on the ASSERT() macros to catch potential errors in your code because they are useless! Since the ASSERT() macro is active only in debug mode they are of no use because in debug mode the macro DECLARE_FIXED_ALLOC() also expands to nothing. That being said, nothing stops you from using CFixedAlloc in a Base class *and* its derived classes:

class Base
{
    DECLARE_FIXED_ALLOC(Base);
};

class Child : public Base
{
    DECLARE_FIXED_ALLOC(Child);
};

Of course, it doesn't make sense to declare the Base class as FIXED_ALLOC if the class is abstract. One thing to be aware about using CFixedAlloc in class hierarchy is that the operator delete is static and not virtual. Consider the following example:

Base *p = new Child;
delete p;

In this case, Base::operator delete() would be called (I haven't tested it, but I'm 99% confident that this is what would happen) and that would be a serious problem. To workaround that problem here is a potential solution:

class Base
{
public:
    virtual void Destroy() {delete this;} 
    DECLARE_FIXED_ALLOC(Base);
};

class Child : public Base
{
public:
    virtual void Destroy() {delete this;} 
    DECLARE_FIXED_ALLOC(Child);
};

Base *p = new Child;
p->Destroy();

The demo program

The demo program is very simple. It just declares two almost identical classes: one without CFixedAlloc and the other using it. Then, a bunch of objects is created. This operation is timed, and once completed, the result is displayed. For computing the size overhead of the class without CFixedAlloc, I used the overhead you would have if your program was using the __V6_HEAP which is very unlikely and thus the computed value might not be totally exact. However, from experience, it seems to be quite accurate:

#define ITER_NUM 1000*1024

class A
{
public:
    A( A *next ) : m_next(next) {}
    A  *m_next;
    int dummy1;
    int dummy2;
};

class B
{
public:
    B( B *next ) : m_next(next) {}
    B  *m_next;
    int dummy1;
    int dummy2;

    DECLARE_FIXED_ALLOC(B);
};

IMPLEMENT_FIXED_ALLOC(B,ITER_NUM);

void CCFixedAllocDemoDlg::OnTestTimebutton()
{
    // TODO: Add your control notification 
    // handler code here
    A *curAObj, *firstAObj = NULL;
    B *curBObj, *firstBObj = NULL;
    DWORD startA, endA, startB, endB;
    register int i;
    {
        CWaitCursor wait;

        startA = GetTickCount();

        for( i = 0; i < ITER_NUM; i++ )
        {
            firstAObj = new A(firstAObj);
        }
        while( firstAObj )
        {
            curAObj = firstAObj->m_next;
            delete firstAObj;
            firstAObj = curAObj;
        }

        startB = endA = GetTickCount();

        for( i = 0; i < ITER_NUM; i++ )
        {
            firstBObj = new B(firstBObj);
        }
        while( firstBObj )
        {
            curBObj = firstBObj->m_next;
            delete firstBObj;
            firstBObj = curBObj;
        }

        endB = GetTickCount();
    }

    displayResult( endA-startA,endB-startB );
}

#define BYTES_PER_PARA 16

void CCFixedAllocDemoDlg::displayResult( DWORD timeA, 
                                          DWORD timeB )
{
    TCHAR buf[1024];

    /*
     * Each object A takes 32 bytes. See article for details
     */
    int overheadA = (32-sizeof(A))*ITER_NUM;

    /*
     * First compute the allocated size then substract the 
     * requested size
     */
    int overheadB = 
      (8+sizeof(B)*ITER_NUM + sizeof(CPlex) + (BYTES_PER_PARA-1))
                    & ~(BYTES_PER_PARA - 1);
    overheadB    -= sizeof(B)*ITER_NUM;

    wsprintf( buf, __TEXT("Creating and destroying %d objects\n")
                   __TEXT("without CFixedAlloc\t: %4d ms\n")
                   __TEXT("with CFixedAlloc\t: %4d ms\n")
                   __TEXT("You saved %d bytes with CFixedAlloc"),
                   ITER_NUM, timeA, timeB,
                   overheadA - overheadB );
    MessageBox(buf,__TEXT("Results"));
}

Suggestions

You could make a single thread version of CFixedAlloc just by making a private copy of the class and remove the critical section stuff to further increase the performance gain. Johan Strand pointed out that the single thread support has already been added to MFC7 (VC++.NET). If you want to use the single thread version of CFixedAlloc, you can use the following macros:

// DECLARE_FIXED_ALLOC -- used in class definition
#define DECLARE_FIXED_ALLOC_NOSYNC(class_name) \
public: \
    void* operator new(size_t size) \
    { \
        ASSERT(size == s_alloc.GetAllocSize()); \
        UNUSED(size); \
        return s_alloc.Alloc(); \
    } \
    void* operator new(size_t, void* p) \
        { return p; } \
    void operator delete(void* p) { s_alloc.Free(p); } \
    void* operator new(size_t size, LPCSTR, int) \
    { \
        ASSERT(size == s_alloc.GetAllocSize()); \
        UNUSED(size); \
        return s_alloc.Alloc(); \
    } \
protected: \
    static CFixedAllocNoSync s_alloc \

// IMPLEMENT_FIXED_ALLOC_NOSYNC -- used in class implementation file
#define IMPLEMENT_FIXED_ALLOC_NOSYNC(class_nbame, block_size) \
CFixedAllocNoSync class_name::s_alloc(sizeof(class_name), block_size) \

There is a mistake in the IMPLEMENT_FIXED_ALLOC_NOSYNC() macro. The macro parameter class_nbame should be class_name. These macros do not appear to be used by MFC which makes sense since there is a syntax error in the macro. This means that the no sync version has probably not been tested by the MFC team. However, once the syntax error is fixed and since the change was trivial, it should work fine. You can fix the syntax error and use the fixed fixalloc.h in your program without having to recompile MFC (the macro isn't used anywhere in MFC). I have modified the demo program so that you can test out the single thread version. All you have to do is:

Fix the syntax error in your fixalloc.h file.
Uncomment the following define statement:

/*
 * Uncomment the next define statement 
 * to use the single thread version of
 * CFixedAlloc. Prior to recompilation, 
 * you must fix a syntax error in
 * fixalloc.h. The IMPLEMENT_FIXED_ALLOC_NOSYNC() 
 * macro parameter class_nbame
 * should be rename to class_name.
 *
 * Note: The single thread version is only 
 * available in VC++.NET and up.
 */
//#define _USE_SINGLE_THREAD

#ifndef _USE_SINGLE_THREAD
class B
{
public:
    B( B *next ) : m_next(next) {}
    B  *m_next;
    int dummy1;
    int dummy2;

    DECLARE_FIXED_ALLOC(B);
};

IMPLEMENT_FIXED_ALLOC(B,ITER_NUM);
#else
class B
{
public:
    B( B *next ) : m_next(next) {}
    B  *m_next;
    int dummy1;
    int dummy2;

    DECLARE_FIXED_ALLOC_NOSYNC(B);
};

IMPLEMENT_FIXED_ALLOC_NOSYNC(B,ITER_NUM);
#endif

My testing of the no sync version has shown a speedup improvement of 40% over the original CFixedAlloc version. The bottom line is use the no sync version if you are not developing a multithread application.

Conclusion

That is it! I hope you enjoyed this article, and if you did and found it useful, please take a few seconds to rank it. You can do so right at the bottom of the article. Also, I invite you to visit my website to check for any updates of this article.

Bibliography

Platform SDK: Memory Management MSDN Library.
Uri Twig, Buffer Pool\Object Pool
Charles Petzold, Programming Windows, Fifth Edition, Microsoft Press, 1999
Jeff Prosise, Programming Windows with MFC, Second Edition Microsoft Press, 1999
George Shepherd, Scot Wingo, MFC Internals, Addison Wesley, 1996
Scott Meyer, Effective C++ (3rd edition), Addison Wesley, 2005

History

01-27-2006
- Included comments in VC++2005 warnings reported by Allan Braun.
- Table of content added.
01-20-2006
- Expanded the "How to use" section based on the comments from readers.
- Expanded the "Suggestions" section based on the comments from Johan Strand.
11-28-2005
- Original article.

Enhance your dynamic memory allocation with an undocumented MFC class

Contents