Custom fixed block allocators are specialized memory managers used to solve performance problems with the global heap. In the article "An Efficient C++ Fixed Block Memory Allocator", I implemented an allocator class to improve speed and eliminate the possibility of a fragmented heap memory fault. In this latest article, the
Allocator class is used as a basis for the
xallocator implementation to replace
Unlike most fixed block allocators, the
xallocator implementation is capable of running in a completely dynamic fashion without advanced knowledge of block sizes or block quantity. The allocator takes care of all the fixed block management for you. It is completely portable to any PC-based or embedded system. In addition, it offers insight into your dynamic usage with memory statistics.
In this article, I replace the C library
free with alternative fixed memory block versions
xfree(). First, I'll briefly explain the underlying
Allocator storage recycling method, then present how
The basic philosophy of the memory management scheme is to recycle memory obtained during object allocations. Once storage for an object has been created, it's never returned to the heap. Instead, the memory is recycled, allowing another object of the same type to reuse the space. I've implemented a class called
Allocator that expresses the technique.
When the application deletes using
Allocator, the memory block for a single object is freed for use again but is not actually released back to the memory manager. Freed blocks are retained in a linked list, called the free-list, to be doled out again for another object of the same type. On every allocation request,
Allocator first checks the free-list for an existing memory block. Only if none are available is a new one created. Depending on the desired behavior of
Allocator, storage comes from either the global heap or a static memory pool with one of three operating modes:
- Heap blocks
- Heap pool
- Static pool
Heap vs. Pool
Allocator class is capable of creating new blocks from the heap or a memory pool whenever the free-list cannot provide an existing one. If the pool is used, you must specify the number of objects up front. Using the total objects, a pool large enough to handle the maximum number of instances is created. Obtaining block memory from the heap, on the other hand, has no such quantity limitations – construct as many new objects as storage permits.
The heap blocks mode allocates from the global heap a new memory block for a single object as necessary to fulfill memory requests. A deallocation puts the block into a free-list for later reuse. Creating fresh new blocks off the heap when the free-list is empty frees you from having to set an object limit. This approach offers dynamic-like operation since the number of blocks can expand at run-time. The disadvantage is a loss of deterministic execution during block creation.
The heap pool mode creates a single pool from the global heap to hold all blocks. The pool is created using operator new when the
Allocator object is constructed.
Allocator then provides blocks of memory from the pool during allocations.
The static pool mode uses a single memory pool, typically located in static memory, to hold all blocks. The static memory pool is not created by
Allocator but instead is provided by the user of the class.
The heap pool and static pool modes offers consistent allocation execution times because the memory manager is never involved with obtaining individual blocks. This makes a new operation very fast and deterministic.
Allocator constructor controls the mode of operation.
Allocator(size_t size, UINT objects=0, CHAR* memory=NULL, const CHAR* name=NULL);
Refer to "An Efficient C++ Fixed Block Memory Allocator" for more information on
xallocator module has six main APIs:
xmalloc() is equivalent to
malloc() and used in exactly the same way. Given a number of bytes, the function returns a pointer to a block of memory the size requested.
void* memory1 = xmalloc(100);
The memory block is at least as large as the user request, but could actually be more due to the fixed block allocator implementation. The additional over allocated memory is called slack but with fine-tuning block size, the waste is minimized, as I'll explain later in the article.
xfree() is the CRT equivalent of
free(). Just pass
xfree() a pointer to a previously allocated
xmalloc() block to free the memory for reuse.
xrealloc() behaves the same as
realloc() in that it expands or contracts the memory block while preserving the memory block contents.
char* memory2 = (char*)xmalloc(24);
strcpy(memory2, "TEST STRING");
memory2 = (char*)xrealloc(memory2, 124);
xalloc_stats() outputs allocator usage statistics to the standard output stream. The output provides insight into how many
Allocator instances are being used, blocks in use, block sizes, and more.
xalloc_init() must be called one time before any worker threads start, or in the case of an embedded system, before the OS starts. On a C++ application, this function is called automatically for you. However, it is desirable to call
xalloc_init() manually is some cases, typically on an embedded system to avoid the small memory overhead involved with the automatic
xalloc_destroy() call mechanism.
xalloc_destroy() is called when the application exits to clean up any dynamically allocated resources. On a C++ application, this function is called automatically when the application terminates. You must never call
xalloc_destroy() manually except in programs that use
xallocator only within C files.
Now, when to call
xalloc_destroy() within a C++ application is not so easy. The problem arises with
static objects. If
xalloc_destroy() is called too early,
xallocator may still be needed when a
static object destructor get called at program exit. Take for instance this class:
memory = xmalloc(100);
Now create a
static instance of this class at file scope.
static MyClassStatic myClassStatic;
Since the object is
MyClassStatic constructor will be called before
main(), which is okay as I’ll explain in the “Porting issues” section below. However, the destructor is called after
main() exits which is not okay if not handled correctly. The problem becomes how to determine when to destroy the
xallocator dynamically allocated resources. If
xalloc_destroy() is called before
xallocator will already be destroyed when
~MyClassStatic() tries to call
xfree() causing a bug.
The key to the solution comes from a guarantee in the C++ Standard:
“Objects with static storage duration defined in namespace scope in the same translation unit and dynamically initialized shall be initialized in the order in which their definition appears in the translation unit.”
In other words,
static object constructors are called in the same order as defined within the file (translation unit). The destruction will reverse that order. Therefore, xallocator.h defines a
XallocInitDestroy class and creates a
static instance of it.
static INT refCount;
static XallocInitDestroy xallocInitDestroy;
The constructor keeps track of the total number of
static instances created and calls
xalloc_init() on the first construction.
INT XallocInitDestroy::refCount = 0;
if (refCount++ == 0)
The destructor calls
xalloc_destroy() automatically when the last instance is destroyed.
if (--refCount == 0)
When including xallocator.h in a translation unit,
xallocInitDestroy will be declared first since the
#include comes before user code. Meaning any other
static user classes relying on
xallocator will be declared after
#include “xallocator.h”. This guarantees that
~XallocInitDestroy() is called after all user
static classes destructors are executed. Using this technique,
xalloc_destroy() is safely called when the program exits without danger of having
xallocator destroyed prematurely.
XallocInitDestroy is an empty class and therefore is 1-byte in size. The cost of this feature is then 1-byte for every translation unit that includes xallocator.h with the following exceptions.
- On an embedded system where the application never exits, the technique is not required except if
STATIC_POOLS mode is used. All references to
XallocInitDestroy can be safety removed and
xalloc_destroy() need never be called. However, you must now call
xalloc_init() manually in
main() before the
xallocator API is used.
xallocator is included within a C translation unit, a
static instance of
XallocInitDestroy is not created. In this case, you must call
xalloc_init() in main() and
xalloc_destroy() before main() exits.
To enable or disable automatic
xallocator initialization and destruction, use the
On a PC or similarly equipped high RAM system, this 1-byte is insignificant and in return ensures safe
xallocator operation in
static class instances during program exit. It also frees you from having to call
xalloc_destroy() as this is handled automatically.
Overload new and delete
To make the
xallocator really easy to use, I've created a macro to overload the
delete within a class and route the memory request to
xfree(). Just add the macro
XALLOCATOR anywhere in your class definition.
Using the macro, a
delete of your class routes the request to
xallocator by way of the overloaded
MyClass* myClass = new MyClass();
A neat trick is to place
XALLOCATOR within the base class of an inheritance hierarchy so that all derived classes allocate/deallocate using
xallocator. For instance, say you had a GUI library with a base class.
GuiBase derived class (buttons, widgets, etc...) now uses
delete is called without having to add
XALLOCATOR to every derived class. This is a powerful means to enable fixed block allocations for an entire hierarchy with a single macro statement.
xallocator relies upon multiple
Allocator instances to manage the fixed blocks; each
Allocator instance handles one block size. Like
xallocator is designed to operate in heap blocks or static pool modes. The mode is controlled by the
STATIC_POOLS define within
#define STATIC_POOLS // Static pools mode enabled
In heap blocks mode,
xallocator creates both
Allocator instances and new blocks dynamically at runtime based upon the requested block sizes. By default,
xallocator uses powers of two block sizes. 8, 16, 32, 64, 128, etc... This way,
xallocator doesn't need to know the block sizes in advance and offers the utmost flexibility.
The maximum number of
Allocator instances dynamically created by
xallocator is controlled by
MAX_ALLOCATORS. Increase or decrease this number as necessary for your target application.
#define MAX_ALLOCATORS 15
In static pools mode,
xallocator relies upon
Allocator instances created during dynamic initialization (before entering
main()) and static memory pools to satisfy memory requests. This eliminates all heap access with the tradeoff being the block sizes and pools are of fixed size and cannot expand at runtime.
Allocator initialization in STATIC_POOLS mode is tricky. The problem again lies with user class static constructors which might call into the
xallocator API during construction/destruction. The C++ standard does not guarantee the order of static constructor calls between translation units during dynamic initialization. Yet,
xallocator must be initialized before any APIs are executed. Therefore, the first part of the solution is to preallocate enough static memory for each
Allocator instance. Of course, each allocator can use a different
MAX_BLOCKS value as required. Using this mode, the global heap is never called.
#define MAX_ALLOCATORS 12
#define MAX_BLOCKS 32
CHAR* _allocator8 [sizeof(AllocatorPool<CHAR, MAX_BLOCKS>)];
CHAR* _allocator16 [sizeof(AllocatorPool<CHAR, MAX_BLOCKS>)];
CHAR* _allocator32 [sizeof(AllocatorPool<CHAR, MAX_BLOCKS>)];
CHAR* _allocator64 [sizeof(AllocatorPool<CHAR, MAX_BLOCKS>)];
CHAR* _allocator128 [sizeof(AllocatorPool<CHAR, MAX_BLOCKS>)];
CHAR* _allocator256 [sizeof(AllocatorPool<CHAR, MAX_BLOCKS>)];
CHAR* _allocator396 [sizeof(AllocatorPool<CHAR, MAX_BLOCKS>)];
CHAR* _allocator512 [sizeof(AllocatorPool<CHAR, MAX_BLOCKS>)];
CHAR* _allocator768 [sizeof(AllocatorPool<CHAR, MAX_BLOCKS>)];
CHAR* _allocator1024 [sizeof(AllocatorPool<CHAR, MAX_BLOCKS>)];
CHAR* _allocator2048 [sizeof(AllocatorPool<CHAR, MAX_BLOCKS>)];
CHAR* _allocator4096 [sizeof(AllocatorPool<CHAR, MAX_BLOCKS>)];
static Allocator* _allocators[MAX_ALLOCATORS];
xalloc_init() is called during dynamic initalization (via
new is used to initialize each
Allocator instance into the static memory previously reserved.
extern "C" void xalloc_init()
new (&_allocator8) AllocatorPool<CHAR, MAX_BLOCKS>();
new (&_allocator16) AllocatorPool<CHAR, MAX_BLOCKS>();
new (&_allocator32) AllocatorPool<CHAR, MAX_BLOCKS>();
new (&_allocator64) AllocatorPool<CHAR, MAX_BLOCKS>();
new (&_allocator128) AllocatorPool<CHAR, MAX_BLOCKS>();
new (&_allocator256) AllocatorPool<CHAR, MAX_BLOCKS>();
new (&_allocator396) AllocatorPool<CHAR, MAX_BLOCKS>();
new (&_allocator512) AllocatorPool<CHAR, MAX_BLOCKS>();
new (&_allocator768) AllocatorPool<CHAR, MAX_BLOCKS>();
new (&_allocator1024) AllocatorPool<CHAR, MAX_BLOCKS>();
new (&_allocator2048) AllocatorPool<CHAR, MAX_BLOCKS>();
new (&_allocator4096) AllocatorPool<CHAR, MAX_BLOCKS>();
_allocators = (Allocator*)&_allocator8;
_allocators = (Allocator*)&_allocator16;
_allocators = (Allocator*)&_allocator32;
_allocators = (Allocator*)&_allocator64;
_allocators = (Allocator*)&_allocator128;
_allocators = (Allocator*)&_allocator256;
_allocators = (Allocator*)&_allocator396;
_allocators = (Allocator*)&_allocator512;
_allocators = (Allocator*)&_allocator768;
_allocators = (Allocator*)&_allocator1024;
_allocators = (Allocator*)&_allocator2048;
_allocators = (Allocator*)&_allocator4096;
At application exit, the destructor for each
Allocator is called manually.
extern "C" void xalloc_destroy()
for (INT i=0; i<MAX_ALLOCATORS; i++)
_allocators[i] = 0;
for (INT i=0; i<MAX_ALLOCATORS; i++)
if (_allocators[i] == 0)
_allocators[i] = 0;
Hiding the Allocator Pointer
When deleting memory,
xallocator needs the original
Allocator instance so the deallocation request can be routed to the correct
Allocator instance for processing. Unlike
xfree() does not take a size and only uses a
void* argument. Therefore,
xmalloc() actually hides a pointer to the allocator within an unused portion of the memory block by adding an additional 4-bytes (typical
sizeof(Allocator*)) to the request. The caller gets a pointer to the block’s client region where the hidden allocator pointer is not overwritten.
extern "C" void *xmalloc(size_t size)
Allocator* allocator = xallocator_get_allocator(size);
void* blockMemoryPtr = allocator->Allocate(size);
void* clientsMemoryPtr = set_block_allocator(blockMemoryPtr, allocator);
xfree() is called, the allocator pointer is extracted from the memory block so the correct
Allocator instance can be called to deallocate the block.
extern "C" void xfree(void* ptr)
if (ptr == 0)
Allocator* allocator = get_block_allocator(ptr);
void* blockPtr = get_block_ptr(ptr);
xallocator is thread-safe when the locks are implemented for your target platform. The code provided has Windows locks. For other platforms, you'll need to provide lock implementations for the four functions within xallocator.cpp:
When selecting a lock, use the fastest OS lock available to ensure
xallocator operates as efficiently as possible within a multi-threaded environment. If your system is single threaded, then leave the implementation for each of the above functions empty.
Depending on how
xallocator is used, it may be called before
main(). This means
lock_release() can be called before
lock_init(). Since the system is single threaded at this point, the locks aren’t necessary until the OS kicks off. However, just make sure
lock_release() behaves correctly if
lock_init() isn’t called first. For instance, the check for
_xallocInitialized below ensures the correct behavior by skipping the lock until
lock_init() is called.
static void lock_get()
if (_xallocInitialized == FALSE)
xallocator may return block sizes larger than the requested amount and the additional unused memory is called slack. For instance, for a request of 33 bytes,
xallocator returns a block of 64 bytes. The additional memory (64 – (33 + 4) = 27 bytes) is slack and goes unused. Remember, if 33 bytes is requested an additional 4-bytes are required to hold the block size. So if a client requests 64-bytes, really the 128-byte allocator is used because 68-bytes are needed.
Adding additional allocators to handle block sizes other than powers of two offers more block sizes to minimize waste. Run your application and profile your
xmalloc() requested sizes with a bit of temporary debug code. Then add allocator block sizes for specifically handling those cases where a large number of blocks are being used.
In the code below, an
Allocator instance is created with a block size of 396 when a block between 257 and 396 is requested. Similarly, a block request of between 513 and 768 results in an
Allocator to handle 768-byte blocks.
size_t blockSize = size + sizeof(Allocator*);
if (blockSize > 256 && blockSize <= 396)
blockSize = 396;
else if (blockSize > 512 && blockSize <= 768)
blockSize = 768;
blockSize = nexthigher<size_t>(blockSize);
With a minor amount of fine-tuning, you can reduce wasted storage due to slack based on your application's memory usage patterns. If no tuning is required and using blocks solely based on powers of two is acceptable, the only lines of code required from the snippet above are:
blockSize = nexthigher<size_t>(size + sizeof(Allocator*));
xalloc_stats(), it’s easy to find which allocators are being used the most.
xallocator Block Size: 128 Block Count: 10001 Blocks In Use: 1
xallocator Block Size: 16 Block Count: 2 Blocks In Use: 2
xallocator Block Size: 8 Block Count: 1 Blocks In Use: 0
xallocator Block Size: 32 Block Count: 1 Blocks In Use: 0
Allocator vs. xallocator
The advantage of using
Allocator is that the allocator block size exactly the size of the object and the minimum block size is only 4-bytes. The disadvantage is that the
Allocator instance is
private and only usable by that class. This means that the fixed block memory pool can't easily be shared with other instances of similarly sized blocks. This can waste storage due to the lack of sharing between memory pools.
xallocator, on the other hand, uses a range of different block sizes to satisfy requests and is thread-safe. The advantage is that the various sized memory pools are shared via the
xfree interface, which can save storage, especially if you tune the block sizes for your specific application. The disadvantage is that even with block size tuning, there will always be some wasted storage due to slack. For small objects, the minimum block size is 8-bytes, 4-bytes for the free-list pointer and 4-bytes to hold the block size. This can become a problem with a large number of small objects.
An application can mix
xallocator usage in the same program to maximize efficient memory utilization as the designer sees fit.
xallocator performance vs. the global heap on a Windows PC shows just how fast it is. An basic test of allocating and deallocating 20000 4096 and 2048 sized blocks in a somewhat interleaved fashion tests the speed improvement. All tests run with maximum compiler speed optimizations. See the attached source code for the exact algorithm.
Allocation Times in Milliseconds
|Allocator||Mode||Run||Benchmark Time (mS)|
|Global Heap||Debug Heap||1||1247|
|Gobal Heap||Debug Heap||2||1640|
|Global Heap||Debug Heap||3||1650|
|Global Heap||Release Heap||1||32.9|
|Global Heap||Release Heap||2||33.0|
|Global Heap||Release Heap||3||27.8|
Windows uses a debug heap when executing within the debugger. The debug heap adds extra safety checks slowing its performance. The release heap is much faster as the checks are disabled. The debug heap can be disabled within Visual Studio by setting
_NO_DEBUG_HEAP=1 in the Debugging > Environment project option.
The debug global heap is predictably the slowest at about 1.6 seconds. The release heap is much faster at ~30mS. This benchmark test is very simplistic and a more realistic scenario with varying blocks sizes and random new/delete intervals might produce different results. However, the basic point is illustrated nicely; the memory manager is slower than allocator and highly dependent on the platform's implementation.
xallocator running heap blocks mode very fast once the free-list is populated with blocks obtained from the heap. Recall that the heap blocks mode relies upon the global heap to get new blocks, but then recycles them into the free-list for later use. Run 1 shows the allocation hit creating the memory blocks at 17mS. Subsequent benchmarks clock in a very fast 5mS since the free-list is fully populated.
As the benchmarking shows, the
xallocator is highly efficient and about five times faster than the global heap on a Windows PC. On an ARM STM32F4 CPU built using a Keil compiler I've see well over a 10x speed increase.
A medical device I worked on had a commercial GUI library that utilized the heap extensively. The size and frequency of the memory requests couldn’t be predicted or controlled. Using the heap is such an uncontrolled fashion is a no-no on a medical device, so a solution was needed. Luckily, the GUI library had a means to replace
free() with our own custom implementation.
xallocator solved the heap speed and fragmentation problem making the GUI framework a viable solution on that product.
If you have an application that really hammers the heap and is causing slow performance, or if you’re worried about a fragmented heap fault, integrating
xallocator may help solve those problems.
- 11th March, 2016
- 13th March, 2016
- Updated Benchmarking section of article
xallocator to increase performance
- Updated the attached source code
- 28th March, 2016
- Updated attached source code to fix
STATIC_POOLS mode and help with porting
- Updated Code Implementation section to reflect the new design
- 3rd April, 2016
- Created a reference articles section
- 9th April, 2016
- Minor bug correction. New source code attached.
- 11th April, 2016
- Fixed bug based on feedback. New source code attached.
- 15th April, 2016
- Updated the "Overload new and delete" section and code snippets.