Have you ever wanted to
free memory from a shared memory segment across process boundaries? Have you ever wanted to roll your own memory manager? There are situations where two processes need to allocate memory from the shared memory segment in an efficient manner? This article describes a fast and efficient algorithm to allocate and free memory from any pre-created memory block, including a shared memory segment.
There are several techniques to create a heap manager using a pre-allocated block of memory. This article describes two algorithms, and provides source code for the second algorithm.
Fixed Size Partitioning Algorithm
This technique divides the memory blocks into fixed size blocks, and pushes the blocks into a stack. Memory allocation (
malloc) is accomplished by popping a block out of the stack, and de-allocation pushes the freed memory block into the stack.
For example: if an application allocates three classes of objects from the heap, and the first class of objects has size less than 32 bytes, the second class of objects has size less than 128 bytes, and the third kind of objects have size less than 512 bytes. In such a scenario, the algorithm will first divide the pre-allocated memory block into three contiguous blocks, and then sub-divide the blocks into a list of fixed size sub-blocks (lists of blocks with 32 bytes, 128 bytes, and 512 bytes in sizes, respectively). Sub-blocks are then pushed into three separate stacks. The stacks need not be a separate data structure; you can always build the stack in place while sub-dividing the memory blocks. When a memory request for 32 bytes or less is made, the heap manager pops a block out of the first stack and returns it to the caller. When a memory request for 128 bytes or less is made, the heap manager pops a block out of the second stack and returns it to the caller, and so on. Freeing a memory block is nothing but pushing the block back into the correct stack.
Advantages of this algorithm are listed below:
- Memory allocation and de-allocation are extremely fast.
- Heap manager can be fine tuned by the developer by carefully selecting the heap block size.
- Very little memory is wasted for heap header information.
- Thread level heap management can be implemented by creating heaps per thread basis. A thread level heap can be stored in the Thread Local Storage, and accessed without having a global lock.
Disadvantages of this algorithm are listed below:
- Heap initialization can be a little slower as all the heap partitioning must be done in advance.
- If the stacks are built in place, the 'RESERVED but not COMMITTED' feature of virtual memory cannot be used as all the memory must be accessible to build the in-place stack.
- Developer must choose the heap partitioning size carefully to avoid wastage of memory.
- Memory will be wasted if a large number of requests fall in the lower boundary of the partition size.
- Large blocks of memory (larger than the largest partition size) cannot be allocated.
Dynamic Heap Allocation Algorithm
This algorithm behaves more like
free. It uses a pair of binary trees to build a heap in the pre-allocated block of memory. Allocation is accomplished by splaying the trees to get the best matching tree node and returning the memory represented by the tree node. De-allocation is done by inserting the freed memory block into the trees. Each tree node has a key, and the value of the key is used to splay the tree to locate the best matching node.
Splay tree is a self adjusting binary tree with nodes having unique keys. Splaying the tree involves rearranging the tree in such a way that the root node of the tree is the node with the right key value. If a node with the right key value cannot be located in the tree, then the root node will be a node with the closest key value. An example is shown below.
If you splay the tree with a key of 3, the resulting tree will look like:
When the tree is splayed with a key of 5, the resulting tree will look like:
Length Tree (LENGTH_TREE)
Length tree balances using the size of the free memory block as the key value. Each tree node represents a block of 'free' memory and the node's key stores the size of the block. Since there can be more than one free memory blocks with the same size, each node in this tree is a linked list containing other nodes.
Pointer Tree (POINTER_TREE)
The pointer tree balances using the memory address as the key value. Each tree node represents a block of 'free' memory and the node's key stores the address of the block. Since each free block of memory will have a unique address, this tree is a proper binary tree. The address key is the offset of the memory address with respect to the beginning of the memory segment.
The heap is initialized by creating both
POINTER_TREE in the heap. Each tree will contain exactly one node representing the entire memory block.
The parameter for the
malloc is the size of the requested memory block. Therefore, we need to splay the
LENGTH_TREE using the request length to get the best matching tree node. The allocation algorithm is described below:
- Splay the
LENGTH_TREE to get the best matching free memory block (or tree node)
- Remove the node from the
LENGTH_TREE and remove the corresponding address node from the
- If the memory block is larger than the requested size, split the memory block and insert the remainder into the
- Return the pointer of the block (minus the heap block header)
ALLOC (int len)
TREE_NODE node = SPAY_TREE (LENGTH_TREE, len);
// Remove the node from the trees
REMOVE_NODE (LENGTH_TREE, node);
REMOVE_NODE (POINTER_TREE, node);
// Split the block if the requested length is less
// than block size
If (node.KEY > len)
TREE_NODE rem = SPLIT_BLOCK (node);
INSERT_NODE (LENGTH_NODE, rem);
INSERT_NODE (POINTER_NODE, rem);
Return ADDRESS_OFF (node);
The de-allocation is accomplished by inserting the freed block back into the trees. Simply inserting the freed block into the tree will cause the memory to fragment, resulting in performance slowdown. The best way to avoid memory fragmentation is to merge the freed block with the successor memory block as well as with the predecessor memory block, and thus creating a contiguous block of memory.
The parameter for
free is the pointer of the memory block to be freed. This pointer's header contains the length of the memory block. The successor node's address can be computed by adding the length of the block to the address of the block. Search the
POINTER_TREE to see if the successor block is in the free list. If the successor block is in the free list, merge the current block with it. If the predecessor block is in the free list, merge the current block with the predecessor:
- Get the successor memory block (
CURRENT_BLOCK + the length of the block)
- Splay the
POINTER_TREE to check if the successor block is in the free list or not
- If the successor is in the free list, merge the current node with it
- Check if the left node of the successor node is actually the predecessor node of the current node
- If a predecessor is not found, merge the current node with it
- Insert the merged node into the
- Insert the merged node into the
FREE (void* ptr)
TREE_NODE node = ptr â€“ sizeof (BLOCK_SIZE);
TREE_NODE next = node + SIZE_OF (node);
TREE_NODE tmp = SPAY_TREE (POINTER_TREE, next);
// Try to merge with the successor and predecessor
If (tmp != 0)
IS_PREDECESSOR (tmp.left, node)
MERGE_PREDECESSOR (node, tmp.left);
MERGE_SUCCESSOR (node, next);
INSERT_NODE (POINTER_TREE, node);
INSERT_NODE (LENGTH_TREE, node);
Shared memory is shared between multiple processes, and one of the issues with a shared memory segment (in any OS, Windows, Linux, OSX etc.) is that the memory segment may not be mapped into the same address location in all the processes. This behavior mandates that all the data structures stored in the shared memory should use the pointer offsets rather than the actual pointers. That means the
POINTER_TREE must not store any pointers in their nodes. The
RIGHT attributes of the tree nodes are the memory offsets of the left and right tree nodes.
The algorithm supports the 'RESERVED but not COMMITTED' feature of Windows virtual memory. This feature is very useful to fine tune the memory usage of the application.
There are two test applications: one written in MFC and the other is a plain console based application. The test application has two modes. In the first mode, it measures the performance of the algorithm with respect to
malloc. In the second mode, the test application launches three more instances of it and allocates and frees memory from the shared heap.
Performance Testing Mode
Performance testing can be done by invoking the program with the following command line parameters:
- /t: The program measures the performance numbers by creating a local heap. The local heap does not use any lock.
- /tc: The program measures the performance numbers by creating a local heap and uses a
CRITICAL_SECTION object (or
pthread_mutex on Linux) as the heap locking mechanism.
- /tm: The program creates a shared memory heap and uses a
MUTEX object (or file locker on Linux) as the locking mechanism.
Inter-process Allocation and De-Allocation Mode
In this mode, the test program creates four stacks in the shared memory. Later, it will launch three more instances of itself. Each instance will create objects in the shared memory and push those objects into one of the shared stack. Each instance will pop objects out of the other shared stacks, and will display the data on the screen and delete the objects. This mode demonstrates how a process can free memory allocated by other processes.
Performance is measured using the local heap, and the memory is allocated and freed 50,000 times in random fashion. The same logic is used to measure the performance of
free as well. Performance measurement is done in the Windows and Linux Operating Systems. The
malloc C-API is directly mapped to the
HeapAlloc API in Windows. In Linux,
free are well optimized code. Therefore, the debug version of the test program will not perform as well as
The algorithm performs extremely well when compared to
malloc in a prolonged usage scenario. It performs much better than
malloc if you allocate and free large amounts of memory frequently. On the other hand, if the application allocates and frees only a small amount of memory (say, less than 4MB) and less frequently, then
malloc performs better.
The source code tree is arranged as follows:
- /common folder: Contains the algorithm implementation and support classes
- /heap folder: Contains the test program
- Heap.vcproj is the VS2005 project file for building the GUI test application
- Heap_c.vcproj is the VS2005 project file for building the console test application
- The GUI binary is located in the /heap/Release directory
- The console binary is located in the /heap/Release_c directory