In this article, you will learn about a bi-partite circular buffer for high performance buffering. You will also see where it comes from, and why you'd want to use it.
The Bip-Buffer is like a circular buffer, but slightly different. Instead of keeping one head and one tail pointer to the data in the buffer, it maintains two revolving regions, allowing for fast data access without having to worry about wrapping at the end of the buffer. Buffer allocations are always returned as contiguous blocks, allowing the buffer to be used in a highly efficient manner with API calls, and also reducing the amount of copying which needs to be performed to put data into the buffer. Finally, a two-phase allocation system allows the user to pessimistically reserve an area of buffer space, and then trim back the reservation to commit only the space which was actually used.
Let's cover a little history first. If you don't already know why a circular buffer can be implemented really efficiently in hardware, or why that makes them the buffer of choice in most electronics, here's why.
Back in Days of Old...
Once upon a time, computers were much simpler. They didn't have 64 bit data buses. Heck, they didn't even have real 16 bit registers - although you could occasionally convince a pair of them to sub in for that purpose. These were simpler times, where Real Men programmed in assembly language, and laughed at anyone who didn't know how to use the carry flag for all kinds of nefarious purposes.
With simpler times came elegant hacks to eke the most power out of every instruction cycle available. Take, for example, a simple terminal communications program. Newer RS232 serial controllers had things like automatic handling of RTS and CTS signal lines to control the flow of data - but this came at a cost. Namely, the connection would be stopping and starting all the time, instead of streaming along. So in between the controller card and the system, would often be found a FIFO. This simple circular buffer was often no more than a couple of bytes long, but it meant that the system could run smoothly along without polling to see if data had arrived, or being hammered by constant interrupts from the serial controller.
Most FIFOs started out on-chip, but people also added their own in their code - the idea being that if you had some really gnarly dancing that you had to do on the incoming data, you may as well batch it all up into one lump and do it infrequently... giving spare time to the system to do other things. Like scroll the console, or decode GIFs.
As I said, a FIFO is a very simple circular buffer. Most are implemented very simply as well; they're typically 2^n bytes in size, which allows the pointers to simply overflow to get back around to the other end of the buffer. The FIFO logic can tell that the FIFO is empty because the head and tail values are the same, and that it's full if the head is one greater than the tail.
Implementing these in software was easy on the old 8-bit systems. Take a 16-bit register pair. Decide on a location in memory (a multiple of 256) to store the FIFO data in. Then, after setting the register pair to the start of the buffer, don't touch the high register - just increment the low register. This gives you a 256-byte long buffer which you can walk through with a single instruction (on the Zilog Z80, 4 cycles - the smallest execution unit available on that system). You can never go out of the bounds of your buffer, because the low register acts as an index with a value from 0 to 255. When you hit what would have been index 256, the register overflows and clocks back over to zero.
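Just for illustration (this sketch is mine, not code from any driver of the era), here's the same trick in C++: make the indices a single unsigned byte, and the wrap-around comes for free, exactly like the Z80's low register.

#include <cstdint>

// A 256-byte FIFO whose uint8_t indices overflow from 255 back to 0,
// just like incrementing only the low register of a Z80 register pair.
struct ByteFifo
{
    uint8_t data[256];
    uint8_t head = 0; // next slot to write
    uint8_t tail = 0; // next slot to read

    bool push(uint8_t b)
    {
        if (uint8_t(head + 1) == tail) return false; // full (one slot kept empty)
        data[head++] = b; // head wraps 255 -> 0 automatically
        return true;
    }

    bool pop(uint8_t& b)
    {
        if (head == tail) return false; // empty: head and tail are the same
        b = data[tail++];
        return true;
    }
};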
The Modern Day
Unfortunately, there is no solution available to Windows programmers today quite as elegant as that simple old 8-bit one. Sure, you can dive down into assembly language (provided you can work out how the compiler maps registers to values - something I've never seen a good enough explanation of to get my head around), but most people don't have time for assembly language any more. And besides, we're dealing with 32-bit registers now - incrementing just one low-order byte from inside a register isn't really all that kosher any more. It can lead to cache flushing, pipeline stalling, printer fires, rains of frogs, etc.
If you can't just clock the low-order register to walk through the buffer, you have to start worrying about things like checking to see how much buffer you have filled before the end, making sure that you remember to copy the rest of the data from the start of the buffer, and all kinds of other bookkeeping headaches.
My first attempt at implementing something like this relied on the vague hope that the virtual memory system could be tricked into setting things up in such a way that you could set up a mirror of a section of memory right next to the original. The idea being that you could still use the rotating allocation of data; a copy operation could go at full speed without any checking to see if you'd walked off the end of the buffer - because as far as your process's address space is concerned, the end of your buffer is also the beginning of your buffer.
Now, this mirroring technique may actually work. Due to some restrictions, I decided not to implement it myself (yet - I'm sure I'll find a use for it some day). The idea behind it is that first one reserves two areas of virtual memory, side by side. One then maps the same temporary file into both virtual memory sections. Voila! Instant mirroring, and a nice large buffered expanse one can copy data from willy-nilly.
Unfortunately, while it should (again, I've not tried it) indeed work, there is another problem - namely, that files can only be mapped on 64kb boundaries (possibly larger on larger memory systems). This means that your buffer has to be a minimum of 64kb in size, and will take up 128kb of your virtual address space. Depending on your application, this may be a valid technique. However, I don't see writing a server application with 1000s of sockets being a valid prospect here.
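For the record, here's roughly what that trick looks like in Win32. I'll stress that this is a sketch of the idea under the constraints above, not code I've shipped or tested; the helper name is mine, and the reserve-then-map dance can race with other threads touching the address space, hence the retry loop.

#include <windows.h>

// Map the same pagefile-backed section at two adjacent addresses, so that
// buffer[size] aliases buffer[0]. size must be a multiple of the 64kb
// allocation granularity.
static BYTE* CreateMirroredBuffer(DWORD size)
{
    HANDLE hMap = CreateFileMapping(INVALID_HANDLE_VALUE, NULL,
                                    PAGE_READWRITE, 0, size, NULL);
    if (hMap == NULL) return NULL;

    for (int attempt = 0; attempt < 16; ++attempt)
    {
        // Find a free stretch of address space big enough for both views...
        BYTE* pBase = (BYTE*)VirtualAlloc(NULL, size * 2, MEM_RESERVE, PAGE_NOACCESS);
        if (pBase == NULL) break;
        VirtualFree(pBase, 0, MEM_RELEASE); // ...release it, and hope we win the race.

        BYTE* pLo = (BYTE*)MapViewOfFileEx(hMap, FILE_MAP_ALL_ACCESS, 0, 0, size, pBase);
        BYTE* pHi = (BYTE*)MapViewOfFileEx(hMap, FILE_MAP_ALL_ACCESS, 0, 0, size, pBase + size);

        // On success, the section handle is deliberately kept open for the
        // lifetime of the buffer; writes past pLo + size - 1 wrap to pLo[0].
        if (pLo && pHi) return pLo;

        if (pLo) UnmapViewOfFile(pLo);
        if (pHi) UnmapViewOfFile(pHi);
    }
    CloseHandle(hMap);
    return NULL;
}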
So what to do? If mirroring won't work, how close can we get to using a circular buffer in our code? Heck, even if we can get close, why would we want to?
The Advantages of the Circular Buffer
There are a number of key advantages to using a circular buffer for the temporary storage of data.
When one puts data into a block of memory, one also has to take it out again to make use of it (or one can use it in place). It is useful to be able to make use of the data in the buffer while more data is being appended to it. However, with a simple linear buffer, as one consumes data from the head of the buffer, the space it occupied is no longer usable - unless one copies all of the data which has not yet been consumed back to the beginning of the buffer. That copy frees up space at the end of the buffer, allowing more data to be appended.
There are a couple of ways around this; one can simply copy the data (which is a reasonably expensive proposition), or one can extend the buffer to allow more data to be appended (a massively expensive process).
With a circular buffer, the free space in the buffer is always available to have data appended into it; the data is written, the pointer adjusted, and that's that. No extra copying, no reallocation, no worries. The buffer is allocated once, and then remains useful for its entire life.
A Fly in the Ointment
One could simply implement a circular buffer by allocating a chunk of memory, and maintaining pointers. When one walked off the end of the buffer, the pointer would be adjusted - and this operation would be reflected in every operation that is performed, whether copying data into the buffer or removing it. Length calculations are slightly more complicated than normal, but not overly so - simple inline functions take care of that problem with ease, sweeping it under the rug.
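To make that concrete, here's a minimal sketch (mine, for illustration only) of the bookkeeping involved - note how the write path has to break in two at the wrap point:

#include <cstring>

// A bare-bones circular buffer of the kind described above.
struct RingBuffer
{
    unsigned char* data; // allocated once, capacity bytes long
    int capacity;
    int head = 0;        // index of the next byte to write
    int tail = 0;        // index of the next byte to read
    int used = 0;        // bytes currently in the buffer

    int GetCommitted() const { return used; }
    int GetFree() const      { return capacity - used; }

    // Copy len bytes in (caller must check GetFree() first), splitting
    // the copy in two when it crosses the end of the buffer.
    void Write(const unsigned char* src, int len)
    {
        int untilWrap = capacity - head;
        if (len <= untilWrap)
        {
            memcpy(data + head, src, len);
        }
        else
        {
            memcpy(data + head, src, untilWrap);
            memcpy(data, src + untilWrap, len - untilWrap); // the wrapped remainder
        }
        head = (head + len) % capacity;
        used += len;
    }
};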
Unfortunately, most API calls don't believe in circular buffers. You have to pass them a single contiguous block of memory which they can access, and there is no way for you to modify their write behavior to adjust pointers when they cross the end of the buffer. So what to do? Well, this is where the Bip-Buffer comes in.
Enter the Bip-Buffer
If one cannot pass a circular buffer into an API function, then one needs an alternative that will work - preferably with many of the same advantages as the circular buffer. It is possible to build a circular buffer which is split into two regions - or which is bi-partite (and that's how you get the Bip in a Bip-Buffer). Each of the two regions moves through the buffer, starting at the left and ending up at the right-hand side. When one runs out of space for appending data and there is only one region, a new region is created at the beginning of the buffer (if space is available there). The diagram below shows how it works in more detail.
The buffer starts out empty, with no regions present (figure 1). (For example, immediately after calling AllocateBuffer.)
Then, when data is first put into the buffer, a single region (the 'A' region) is created (figure 2). (Say, by calling Reserve followed by Commit.)
Data is added to the region, extending it to the right (figure 3).
For the first time now, we remove data from the buffer (figure 4). (See the DecommitBlock call described below)
This continues until the region reaches the end of the buffer. Once more free space is available to the left of region A than to the right of it, a second region (comically named "region B") is created in that space. The choice to create a new region when more space is available on the left is made to maximize the potential free space available for use in the buffer. The upshot of all this leaves us with something which looks rather like figure 5.
If we now use up more of the buffer space, we end up with figure 6, with new space only being allocated from the end of region B. If we eventually allocate enough data to use up all of the free space between regions A and B (figure 7), we no longer have any usable space in the buffer, and no more reservations can be performed until we read some data out of it.
If we then read more data out of the buffer (say, the entire remaining contents of region A), we exhaust it entirely. At that point, as region A is completely empty, we no longer need to track two separate regions; all of region B's internal data is copied over region A's internal data, and region B is entirely cleared (figure 8).
If we read a little more data out of the buffer, we now end up with something a lot like figure 4, and the cycle continues.
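The whole dance above boils down to a single decision at reservation time. Here's a minimal sketch of it - my reconstruction in terms of the index/size members listed further below, not the shipped code:

// Where does the next reservation start? Region A lives at [ixa, ixa+sza),
// region B (if szb > 0) at [ixb, ixb+szb), in a buffer of buflen bytes.
int ChooseReserveStart(int ixa, int sza, int ixb, int szb,
                       int buflen, int& freespace)
{
    if (szb > 0)
    {
        // Once region B exists, all reservations extend region B.
        freespace = ixa - (ixb + szb);
        return ixb + szb;
    }
    int spaceAfterA  = buflen - (ixa + sza); // to the right of region A
    int spaceBeforeA = ixa;                  // to the left of region A
    if (spaceAfterA >= spaceBeforeA)
    {
        freespace = spaceAfterA;  // keep extending region A rightwards
        return ixa + sza;
    }
    freespace = spaceBeforeA;     // start region B at the front of the buffer
    return 0;
}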
Characteristics of the Bip-Buffer
The upshot of all of this is that, on average, the buffer keeps close to the maximum amount of free space available for use, while not requiring any data copying or reallocation to free up space at the end of the buffer.
The biggest implementation difference between a regular circular buffer and the Bip-Buffer is that the Bip-Buffer only ever hands out contiguous blocks. With a circular buffer, you need to worry about wrapping at the end of the buffer area - which is why, for example, if you look at Larry Antram's Fast Ring Buffer implementation, you'll see that you pass data into the buffer as a pointer and a length, and the data is then copied byte by byte into the buffer to take the wrapping at the edges into account.
Another possibility which was brought up on the bulletin board (and the person who brought it up shall remain nameless, if just because they... erm... are nameless) was that of just splitting the calls across wraps. This is one way of working around the wrapping problem, but it has an unfortunate side-effect: as your buffer fills, the block sizes you can hand out at the end of the buffer shrink all the way down to a single byte - even if you've got another 128kb of free space back at the beginning of the buffer. The Bip-Buffer neatly sidesteps this issue by leaving the space at the end of the buffer alone if the amount you request is larger than what remains there. When writing networking code, this is very useful; you always want to try to receive as much data as possible, but you can never guarantee how much you're going to get. (For best results, I'd recommend allocating a buffer which is some multiple of your MTU size.)
Yes, you are going to lose some of what would have been free space at the end of the buffer. It's a small price to pay for playing nicely with the API.
Use of this buffer does require that one check twice to see whether the buffer has been emptied, as one has to deal with the possibility that there are two regions currently in use. However, the flexibility and performance gains outweigh this minor inconvenience.
The BipBuffer class (full source code provided in the link) tracks its state with a pointer to the buffer memory, plus a handful of private integer members:
int ixa, sza, ixb, szb, buflen, ixResrv, szResrv;
That is: the index and size of region A, the index and size of region B, the total length of the buffer, and the index and size of the current reservation.
The constructor initializes the internal variables for tracking regions, and the memory pointer, to null; it does not allocate any memory for the buffer, in case one needs to use the class in an environment where exception handling cannot be used.
The destructor simply frees any memory which has been allocated to the buffer.
bool AllocateBuffer(int buffersize = 4096);
AllocateBuffer allocates a buffer from virtual memory. The size of the buffer is rounded up to the nearest full page size. The function returns true if successful, or false if the buffer cannot be allocated.
void FreeBuffer();
FreeBuffer frees any memory allocated to the buffer by the call to AllocateBuffer, and releases any regions allocated within the Bip-Buffer.
bool IsInitialized() const;
IsInitialized returns true if the buffer has had memory allocated to it (by calling AllocateBuffer), or false if there is no memory allocated to the buffer.
int GetBufferSize() const;
GetBufferSize returns the total size (in bytes) of the buffer. This may be greater than the value passed into AllocateBuffer, if that value was not a multiple of the system's page size.
void Clear();
Clear ... well... clears the buffer. It does not free any memory allocated to the buffer; it merely resets the region pointers back to null, making the full buffer usable for new data again.
BYTE* Reserve(int size, OUT int& reserved);
Now to the nitty-gritty. Allocating data in the Bip-Buffer is a two-phase operation. First, an area is reserved by calling the Reserve function; then, that area is committed by calling the Commit function. This allows one to, say, reserve memory for an IO call, and when that IO call fails, pretend it never happened. Or alternatively, in a call to an overlapped WSARecv() function, it allows one to advertise how much memory is available to the network stack for incoming data, and then adjust the amount of space used based on how much data was actually read in (which may be less than the requested amount).
To use Reserve, pass in the size of the block requested. The function will return, in the reserved parameter you passed in, the size of the largest free block available which is less than or equal to size in length. It will also return a BYTE* pointer to the area of the buffer which you have reserved. In the case where the buffer has no space available, Reserve will return a NULL pointer, and reserved will be set to zero.
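In other words (a hypothetical snippet - FillWithData stands in for whatever produces your bytes):

int reserved;
BYTE* p = buffer.Reserve(4096, reserved); // ask for up to 4kb...
if (p != NULL)
{
    // ...but we may have been handed less; reserved says how much.
    int used = FillWithData(p, reserved);
    buffer.Commit(used); // keep only what we actually wrote
}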
Note: you cannot nest calls to Reserve; after calling Reserve, you must call Commit before calling Reserve again.
void Commit(int size);
Here's the other half of the allocation. Commit takes a size parameter, which is the number of bytes (starting at the BYTE* you were passed back from Reserve) which you have actually used and want to keep in the buffer. If you pass in zero for this size, the reservation will be completely released, as if you had never reserved any space at all. Alternatively, in a debug build, if one passes in a value greater than the original reservation, an assert will fire. (In a release build, the original reservation size will be used, and no one will be any the wiser.) Committing data to the buffer makes it available for routines which take data back out of the buffer.
The diagram above shows how Reserve and Commit work. When you call Reserve, it will return a pointer to the beginning of the gray area above (figure 1). Say you then only use as much of that buffer as the blue section (figure 2). It'd be a shame to leave the rest allocated and going to waste, so you can call Commit with only as much data as you used, which gives you figure 3 - namely, the committed space extends to fill just the part you needed, leaving the rest free.
int GetReservationSize() const;
If at any time you need to find out whether you have a pending reservation, or need to find out that reservation's size, you can call GetReservationSize to find the amount reserved. No reservation? You'll get a zero back.
BYTE* GetContiguousBlock(OUT int& size);
Well, after all this work to put stuff into the buffer, we'd better have a way of getting it out again.
First of all, what if you need to work out how much data (total) is available to be read from the buffer?
int GetCommittedSize() const;
One method is to call GetCommittedSize, which will return the total length of data in the buffer - that's the total size of both regions combined. I would not recommend relying on this number, because it's very easy to forget that you have two regions in the Bip-Buffer if you do. And that would be a bad thing (as several weeks of painful debugging experience has proved to me). As an alternative, you can call:
BYTE* GetContiguousBlock(OUT int& size);
... which will return a BYTE* pointer to the first (as in FIFO, not left-most) contiguous region of committed data in the buffer. The size parameter is also updated with the length of the block. If no data is available, the function returns NULL (and the size parameter is set to zero).
In order to fully empty the buffer, you may wish to loop around, calling GetContiguousBlock until it returns NULL. If you're feeling miserly, you can call it only twice. However, I'd recommend the former; it means you can forget that there's two regions, and just remember that there's more than one.
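Putting that together with DecommitBlock (described next), a typical drain loop looks something like this sketch - ProcessData is a stand-in for whatever consumes your bytes:

int blocksize;
BYTE* pBlock;
while ((pBlock = buffer.GetContiguousBlock(blocksize)) != NULL)
{
    ProcessData(pBlock, blocksize);  // consume the block in place
    buffer.DecommitBlock(blocksize); // then release it, in FIFO order
}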
void DecommitBlock(int size);
So what do you do after you've consumed data from the buffer? Well, in keeping with the spirit of the aforementioned Commit calls, you then call DecommitBlock to release data from it. Data is released in FIFO order, from the first contiguous block only - so if you're going to call DecommitBlock, you should do it pretty shortly after calling GetContiguousBlock. If you pass in a size greater than the length of the contiguous block, then the entire block is released - but none of the other block (if present) is released at all. This is a deliberate design choice, to remind you that there is more than one block and that you should act accordingly. (If you really need to be able to discard data from blocks you've not read yet, it's not too difficult to copy the DecommitBlock function and modify it so that it operates on both blocks; just unwrap the if statement, and adjust the size parameter after the first clause. Implementation of this is left as the dreaded you-know-what.)
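For the impatient, here's a sketch of what that modification might look like - my reconstruction in terms of the index/size members listed above, not the shipped code:

// DecommitBlock, unwrapped so it spills into the second block as described.
// Assumes the ixa/sza/ixb/szb members from the class listing above.
void DecommitAcrossBlocks(int size)
{
    if (size < sza)
    {
        ixa += size;          // the usual case: trim region A from the front
        sza -= size;
        return;
    }
    size -= sza;              // region A is fully consumed...
    ixa = ixb; sza = szb;     // ...so region B becomes the new region A
    ixb = 0; szb = 0;
    if (size >= sza) { ixa = 0; sza = 0; }         // discarded everything
    else             { ixa += size; sza -= size; } // ...then trim the new A too
}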
And that's the Bip-Buffer implementation done. A short example of how to use it is provided below.
if (!buffer.AllocateBuffer(8192)) return false;
bool readEOF = false;
s = socket(...
... do something else ...
// Reserve as much of the buffer as possible for the incoming data.
int space;
BYTE* pData = buffer.Reserve(buffer.GetBufferSize(), space);
if (pData == NULL) return;
int recvcount = recv(s, (char*)pData, space, 0);
if (recvcount == SOCKET_ERROR) { buffer.Commit(0); return; } // release the reservation
if (recvcount == 0) readEOF = true;
buffer.Commit(recvcount); // keep only the bytes that actually arrived
// Drain the buffer one contiguous block at a time.
int allocated;
while ((pData = buffer.GetContiguousBlock(allocated)) != NULL)
{
    fwrite(pData, allocated, 1, outputfile);
    buffer.DecommitBlock(allocated);
}
And so we reach the end. I hope that you find this code and way of managing data useful; I've certainly found that it comes in very handy for writing networking code. If you do find it useful, or use it in any of your code, all that I ask in return is that you drop me an email and let me know how the code is being used (what kind of project, what company, etc). Be vague if NDAs would get in the way - it's nice to know that it's out there, alive, and doing cool things.
Closing Notes (July 2020)
I hate to have to write this, but it turns out that there are a lot of weird people out there who seem to revel in sniping at others, and some of them have popped up recently.
First, to my knowledge, I invented this algorithm as presented here before anyone else, and independently from anyone else (I needed it to accelerate some UDP networking I was doing for control programs for a mass spectrometer I was helping to prototype software for). I'm not presenting anyone else's invention. And everything is obvious after it's published, with 20/20 hindsight.
The principles behind the API design are definitely mine - I've never seen anything else like it. The Commit(n) pattern for writes, and the Decommit(n) pattern for reads, was a fantastic epiphany, and made using it much better than I'd hoped; I liked its elegance. I've tried to keep similar ideas in mind when working on other APIs since (including some of the more recent APIs for the Xbox, where I was part of the API design review board).
It looks like I was about 2 years too late to invent the virtual memory "perfect" ring buffer (that credit seems to go to Phil Howard in 2001 - see here), but I've since tried to popularize its use on the Xbox One in one of the tech talks I've given, and was able to work with people inside of the Windows Base team at Microsoft while I was there to make creating one much more robust from both Win32 and UWP. (If I remember correctly, it wasn't possible to use this trick on the PowerPC-based Xbox 360 because the cache was based on virtual addresses, not physical addresses, meaning that cache-coherency became a potential issue, making that trick unworkable).
Since this article was first written, a lot of other people have gone on to use this code in a variety of different places. It has since shown up in the codebases for Age of Empires Online, PuTTY, and a bunch of others. (If I get the chance, I'll put together a full list at some point.)
Andrea Lattuada and James Munns have done an amazing job of taking the chicken-scratch notes I wrote in 2003 and using them to create a lock-free implementation of a BipBuffer for high-throughput multithreaded scenarios. You can find their versions here and here.
Kevin Raowulf also has a Rust version here, but there's a few more out there if you go looking, such as this one.
(Clearly, Rust is a language I should pick up - which is why I've got a couple of books on it waiting for me in my "office").
It's also been cited in a few patents (both by Sanford):
- US20110320733A1 - Cache management and acceleration of storage media
- US9323659B2 - Cache management including solid state device virtualization
(Are you a professional software or electronic engineer? Don't go reading them - that's an information hazard. Leave it to the lawyers to worry about.)
- 20th July, 2020: Added note re: origin of the idea
- 10th May, 2003: Updated source code to fix the bug PeterChen noticed, fixed spelling, added abstract + more information based on comments. Changed diagram. Just a general tidy up.
- 6th January, 2003: First published