A Fast Efficient Allocator for Small Blocks of Memory

znrobinson

2.85/5 (11 votes)

Jan 8, 2007

CPOL

9 min read

55764

738

Describes building a fast and efficient small block memory allocator (with full source).

Download source files - 3.6 KB

Introduction

Dynamic memory allocation is a funny thing. I mean, I think most people call malloc/free without thinking about the overhead associated with it. When it comes to heap-based memory allocation, managing blocks of memory so that memory can be reclaimed and reused entails a significant amount of bookkeeping, and this bookkeeping costs CPU cycles. The first rule when trying to write fast chunks of code: touch the allocator as little as possible. But malloc/free is there for a reason, it's a tool like any other and just has to be used appropriately. But can we make it cheaper to use?

A few guys and myself, in an astonishing display of what too much testosterone can drive grown men to do, took up the small block challenge. The goal to write the fastest, most "efficient" small block allocator, and the chance to obtain everlasting bragging rights over the others. So OK... it wasn't just macho posturing, the reason it came up at all was that it turns out that a UI kit that we were all working on allocated an astounding amount of memory at an even more astounding rate, and much of it consisted of very small memory requests under 1024 bytes. We just had some gentlemanly disagreements on who was bigger... err... I mean who could write a better allocator. Hence the contest, and its ground rules:

Two functions with the following signatures: void* alloc(long size); and void free(void*p);
Must support > 1024 allocation (speed and efficiency testing are on allocations <= 1024)
Must be "naturally" aligned to the next power of 2, to a maximum of 8 byte alignment
Must not crash on NULL free
Must support allocation of 0 (return a valid ptr)
Must call blockalloc()/blockfree() (essentially malloc and free) for "underlying" pool allocations

The major components of the scoring were speed and efficiency, with the lack of efficiency being measured by how much waste you have during the benchmark.

efficiency = (amount of memory requested of your allocator) / 
             (amount of memory that you have requested from blockalloc)

Given efficiency, the score was essentialy calculated as follows:

score = (time in ms) / (efficiency * efficiency * efficiency)

The lowest score wins. From this, you can see that there is a very high penalty for not being as 'efficient' as possible. In our tests for performance and efficiency, it shows my allocator beating Visual C's malloc/free by 25 times on speed, and more than 13% in efficiency.

No claim is made to establish this as the fastest possible small block allocator, although this implementation is pretty fast. I have lots of ideas on how it could be made faster, and I am sure there are implementations out there that blow this away. But it is pretty interesting how easy it was to beat Microsoft's out-of-the-box implementation, and it serves as a good starting point for those of you out there who are interested in taking it further.

A good axiom is to make things as simple as possible and only introduce complexity as the need arises. My first couple of allocators were very complicated beasts, mostly because my focus was on the wrong problems (e.g., minimizing block header sizes etc...). In the end, my allocator became extremely simple, essentially a fixed block allocator that managed 129 separate heaps. With each heap managing a specific fixed allocation size starting at 4 bytes, and then 8 bytes, and then in 8 byte increments up to 1024 bytes. Technically, this is a suballocator; it uses malloc and free to allocate/free larger blocks of memory. This allocator manages these memory blocks and uses them to allocate smaller blocks of memory. In our testing, this allocator wins by managing these smaller allocations more efficiently than the more general-purpose malloc/free.

A fixed block allocator, just like it sounds, is an allocator than can only allocate blocks of a fixed or given size. By only needing to deal with blocks of a given size, the amount of code and the complexity of the data structures needed to manage memory is minimized, which maps directly to performance.

Allocating Memory

Let's get started by looking at the rtAllocator::alloc method:

void* rtAllocator::alloc(long ls);

The first thing that happens here is that we need to identify which of the 129 separate heaps is appropriate to satisfy the request. First, the size (ls) of the allocation is checked to see if it is > 1024 bytes. If the request is for larger than 1024 bytes, the request is simply thrown to the general purpose allocator (malloc), since we weren't concerned with allocations of this size. If the size of the request is <= 1024, then we need to determine which of the 129 fixed size heaps should be used to satisfy the request. To do this quickly and efficiently, a lookup table of 1024 elements is used. This lookup table has been initialized with a number that indicates which of the 129 heaps should be used to satisfy the request. Looking at the code for this:

void* rtAllocator::alloc(long ls)
{
  if (ls == 0) ls = 1;
  int bdIndex = -1;
  if (ls <= 1024) bdIndex = mBDIndexLookup[ls];

The first line is used to handle the special case of attempting to allocate zero bytes; in this case, we simply treat it as an allocation of one byte by changing the size to one. In the next line, we initialize the index bdIndex to -1, and if the allocation is within our target range (<= 1024), the lookup table is used, with the size as the offset into the table, to determine which of the 129 heaps to use.

If the allocation request is larger than 1024, then the index bdIndex will be -1, and the request is simply passed through to the general purpose allocator. Given the following code:

if (bdIndex < 0)
{
  // Not handling blocks of this size throw to blockalloc
  INCALLOCCOUNTER(bdCount);
  return ALLOCJR_ALLOC(ls);
}

Note: The macro ALLOCJR_ALLOC is used to wrap malloc so that we can instrument and record allocation statistics. ALLOCJR_FREE is used for a similar purpose to wrap calls to the free function.

At this point in the code, we know we are handling the allocation (<=1024), and we know which of the 129 fixed size heaps will be used to satisfy the request. So the next thing to do is to check to see if we already have a block with the necessary space to satisfy the request. Each heap maintains a double-linked list of free blocks (blocks that have room for at least one more allocation). If there are no free blocks, then we allocate a block (using malloc) and link it into our linked list of free blocks with the following code:

if (!mFreeBlocks[bdIndex])
{
  INCBLOCKCOUNTER();
  block* b = (block*)ALLOCJR_ALLOC(
    block::getAllocSize(bd[bdIndex].fixedAllocSize, bd[bdIndex].chunks));
  if (b)
  {
    b->init(bd[bdIndex].fixedAllocSize, bdIndex, bd[bdIndex].chunks);

    addBlockToArray(b);
    mFreeBlocks[bdIndex] = b;
  }
}

At this point, there should be at least one block available that can be used to satisfy the allocation request. The allocation request is then forwarded to block::alloc to allocate the memory from within the available free block. Each block has a number of chunks, with each chunk being large enough to satisfy one allocation request. In the block data structure, a single-linked list of free "chunks"is maintained within the interior of the block.

In order to avoid the cost of setting up the linked list when we initialize a newly allocated block, a fence pointer is maintained and points at the first uninitialized chunk. When allocating memory, we look first to the linked list to see if the list contains any free chunks. If it does not, we see if there are chunks left in the block that have yet to be included in the list of free chunks by examining the fence pointer. If there is room, we increment mInitCursor to the next chunk and utilize the chunk that was previously pointed to by mInitCursor.

inline void* alloc()
{
  void* result;

  if (mFreeChunk)
  {
    result = mFreeChunk;
    mFreeChunk = (void**)*mFreeChunk;
  }
  else
  {
    result = mInitCursor;
    mInitCursor += mFixedAllocSize;
  }

  mAllocCount++;
  return result;
}

After we return from block::alloc, we need to see if the block is completely full. This is done through a call to block::isFull. If it is full, we remove it from the double-linked free list so that we won't consider when looking for space to satisfy future requests. When the block is removed from the free list, a sentinel value is assigned to the block's mNextFreeBlock pointer so that we can easily determine that the block is full. Given the following code:

block *b = mFreeBlocks[bdIndex];

if (b->mNextFreeBlock != ALLOCJR_FULLBLOCK && b->isFull())
{
  // Unlink from freelist
  if (b->mNextFreeBlock)
  {
    b->mNextFreeBlock->mPrevFreeBlock = b->mPrevFreeBlock;
  }
  if (b->mPrevFreeBlock)
  {
    b->mPrevFreeBlock->mNextFreeBlock = b->mNextFreeBlock;
  }
  mFreeBlocks[bdIndex] = b->mNextFreeBlock;
  // special value means removed from free list
  b->mNextFreeBlock = ALLOCJR_FULLBLOCK; 
  b->mPrevFreeBlock = ALLOCJR_FULLBLOCK;
}

At this point, we have successfully allocated a block of the requested size. Now that we have walked through the process of allocating memory, I will now walk you through the process of freeing memory.

Freeing Memory

The process of freeing memory begins with a call to rtAllocator::free.

void rtAllocator::free(void* p)

The first thing that this method does is to check to see if we have been passed a null pointer reference; if so, we return as follows:

if (!p) return;

If the pointer is non-null, a check is then made to see if the memory is managed by us or had been retrieved from a pass-through to malloc. In order to do this, we do a binary search through the array of block pointers that we maintain to see if it is one of ours. This is done through a call to the following function:

block* b = findBlockInArray(p);

If the block is one of ours, then b will be non-null, otherwise we know that the pointer we are freeing is not ours and we pass it directly to free. If it is ours, then a call to block::free is made to the block that was returned. In the block::free method, we simply return the "chunk" to the free list within the block so that the chunk can be reused.

inline void free(void* p)
{
  void **pp = (void**)p;
  *pp = mFreeChunk;
  mFreeChunk = (void**)p;
  mAllocCount--;
}

In the first line of this function, we coerce the pointer into a void **, which allows us to easily write the current linked list head pointer to the front of the chunk. Then we do exactly that in the second line. In the third line, we complete inserting the chunk into the free list for the block by setting the head pointer to point to the newly inserted chunk. The last thing we do in this function is decrement the counter for how many allocated chunks that are currently in the block. Upon returning from the call to block::free, we still have to do a couple of more checks before we are done. First, we check to see if the last call to block::free ended up making the block empty. We do this as follows:

if (b->isEmpty())

If the block is now completely empty, we unlink the block from the double-linked list of free blocks, and return the block to the system by calling free with the block, as follows:

if (b->isEmpty())
{
  // Unlink from freelist and return to the system
  if (b->mNextFreeBlock)
  {
    b->mNextFreeBlock->mPrevFreeBlock = b->mPrevFreeBlock;
  }
  if (b->mPrevFreeBlock)
  {
    b->mPrevFreeBlock->mNextFreeBlock = b->mNextFreeBlock;
  }
  if (mFreeBlocks[b->mBDIndex] == b) 
    mFreeBlocks[b->mBDIndex] = b->mNextFreeBlock;

  removeBlockFromArray(b);

  DECBLOCKCOUNTER();
  ALLOCJR_FREE(b);
}

If the block is not empty, we make sure that the block is included in the double-linked free block list since we now know that there is at least one free chunk available in the block. This check is made by comparing the mNextFreeBlock member against an invalid pointer constant, ALLOCJR_FULLBLOCK, which is assigned to the mNextFreeBlock block whenever it becomes full; if this check succeeds, then we link it back into the list as follows:

// need to see if block is not in free list if not add it back
if (b->mNextFreeBlock == ALLOCJR_FULLBLOCK)
{
  b->mPrevFreeBlock = NULL;
  b->mNextFreeBlock = mFreeBlocks[b->mBDIndex];
  if (mFreeBlocks[b->mBDIndex]) 
  mFreeBlocks[b->mBDIndex]->mPrevFreeBlock = b;
  mFreeBlocks[b->mBDIndex] = b;
}

At this point, the memory has been reclaimed and can be reused.

Conclusion

In wrapping up, I just want to say that it was a lot of fun hacking on this allocator, and I hope that you enjoyed reading though my description of how it works. Have fun looking at the code!

The author maintains a blog at http://www.liquidthought.com with more programming tips, snippets, and software.