Fast IPC - New implementation
New (faster) implementation of shared memory IPC
Introduction
For a few years I used the excellent shared memory IPC implementation by studio_ukc: http://www.codeproject.com/Articles/14740/Fast-IPC-Communication-Using-Shared-Memory-and-Int.
While it is really fast, it has several drawbacks. Also, many members were routinely asking for working code for that class - the version available for download needs to be fixed (although some fixes are available in the comments section).
I decided to go ahead and upload my own implementation, based on the same idea but with most of the drawbacks eliminated. Please read the original article before proceeding with this one.
Background
So, what are the advantages of my implementation compared to the original?
1. Block size and block count were fixed - they were template parameters. Now you can specify them during IPC instance creation (see the code snippet right after this list).
2. A client IPC instance does not need to know anything about the master, except the name of the connection point (the name of the shared memory file). Once connected, it will read all other information from the shared memory.
3. Once the connection point is created, all connected IPC instances (including the one that created the connection point) become equal - the creator can disconnect at any moment, and all others will continue to function.
4. One more convenience: you don't have to have the connection point created before you try to attach to it - the attach function will wait for the specified timeout until the connection point appears.
5. The block itself was a struct with the user data as its last member. Now, instead of that struct, a raw "void*" is used. Also, each block previously carried 32 bytes of service information in addition to the user data; for user data of small sizes (like pointers) that was a significant waste. Now there are only 4 bytes of service information per block, allocated past all blocks so that you can use any alignment you want.
6. In the original implementation you had to have separate instances of different classes to send and receive data. Now this can be done using the same instance:
CSimpleIPC ipc;
ipc.createConnectionPoint(blockSize, blockCount, maxExpansions, preAllocate, minExpansionDistance);
ipc.write(&somedata, sizeof(somedata));
ipc.read(&somedata, sizeof(somedata));
7. This implementation is "multiple readers / multiple writers", i.e. completely transparent, just like the original one.
8. Added the ability to stop read/write waits even if they were specified as "infinite" ones - just call requestStopCommunications() and the waits will be immediately abandoned. Once you decide to continue, call requestResumeCommunications(). To be exact, these calls affect waits only - if you have read or write in a loop, then the read/write calls will continue to execute, simply ignoring the timeout parameter. If you want to get out of the loop when a stop was requested, check the result of the isCommunicationStopped() function (see the sketch at the end of this section), i.e.:
while (!ipc.isCommunicationStopped() && <some other condition>)
{
    ...
    ipc.write(&somedata, sizeof(somedata));
    ...
}
9. As an additional convenience, there is an ability to stop a specific read/write wait: you can supply an (optional) handle to an event, and signal it from another thread if you need to stop the wait.
CSimpleIPC ipc;
ipc.createConnectionPoint(blockSize, blockCount, maxExpansions, preAllocate, minExpansionDistance);
As you can see, block size and block count are now function parameters, not template ones. But that's not all: the blockCount that you pass is in fact the initial number of blocks. If the value of maxExpansions is over 1, then the IPC will increase the number of blocks in use when needed. For example, you can start with 256 blocks. When the IPC needs more, it will allocate 256 more, and so on - as many times as you specify in maxExpansions. Obviously, the value of maxExpansions cannot be less than 1. The last parameter - minExpansionDistance - controls how expansion happens: once the number of available blocks becomes less than or equal to minExpansionDistance, an expansion happens - if maxExpansions has not been reached yet. You probably noticed one more parameter - preAllocate - but I will cover it later in the Tweaks section.
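To illustrate how these pieces fit together, here is a minimal, hedged sketch of a sender loop being stopped from another thread. It is not part of the download: the parameter values are arbitrary, std::thread is used only for brevity (the supplied project targets VC++ 2010, where you would use CreateThread or similar), and only the CSimpleIPC calls already shown in this article are assumed:
#include <chrono>
#include <cstdint>
#include <thread>
#include "SimpleIpc.h"

// Sender loop: keeps writing until a stop is requested on this instance.
static void senderLoop(CSimpleIPC& ipc)
{
    uint32_t counter = 0;
    while (!ipc.isCommunicationStopped())
    {
        ++counter;
        if (!ipc.write(&counter, sizeof(counter))) // false - block could not be written
            break;
    }
}

int main()
{
    CSimpleIPC ipc;
    // Illustrative values: 4-byte blocks, 256 initial blocks, up to 4 expansions,
    // no pre-allocation, expand when 16 or fewer free blocks remain.
    if (!ipc.createConnectionPoint(sizeof(uint32_t), 256, 4, false, 16))
        return 1;

    std::thread sender([&ipc]() { senderLoop(ipc); });

    // Let the sender run for a moment, then abandon all waits on this instance.
    std::this_thread::sleep_for(std::chrono::seconds(1));
    ipc.requestStopCommunications();   // any "infinite" waits are abandoned immediately
    sender.join();

    ipc.requestResumeCommunications(); // allow waits again if communication should continue
    return 0;
}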
Using the code
The download contains 6 files:
- SimpleIpc.h: the actual IPC implementation; it includes the two following files:
- SimpleThreadDefines.h: some helpful defines (mostly to overcome a compiler bug in MSVC++ 2010 SP1);
- SimpleCriticalSection.h: a simple, fast critical section implementation (it allows lightweight exclusive or shared locks);
- IPCTest.cpp: a small test and usage example. You can build it using the next supplied file:
- IPCTest.vcxproj: project file for VC++ 2010;
- ipc.h: the original (fixed) implementation by studio_ukc (with some usability enhancements, like using the "std::wstring" class for the names of shared memory connection points).
So, for usage in your project you'll need only the first 3 files.
Please excuse the code in IPCTest.cpp - it's quite crude and "not pretty", but it serves well for the purposes of testing and benchmarking. You can also look at it for a usage example, but here is the general idea:
//start with:
CSimpleIPC ipcMain;
if (!ipcMain.createConnectionPoint(blockSize, blockCount, maxExpansions, preAllocate, minExpansionDistance))
    //<signal error, decide on what to do in case of failure>

// in some other place or thread:
CSimpleIPC ipcClient1, ipcClient2, .. ipcClientN;
if (!ipcClient1.attachToConnectionPoint(ipcMain.getConnectionPoint()))
    //<signal error, ...>
....
if (!ipcClientN.attachToConnectionPoint(ipcMain.getConnectionPoint()))
    //<signal error, ...>

// yet somewhere else - any thread:
while (!ipcClientX.isCommunicationStopped() && <some other condition>)
{
    ...
    if (!ipcClientX.write(&somedata, sizeof(somedata)))
        //<data was not written, do something>
    ...
    if (!ipcClientY.read(&somedata, sizeof(somedata)))
        //<data was not read, do something>
}
Now, to the important question: speed. Of course, the extra features do not come for free. But how big is the degradation? I've tested the free-flowing (i.e. just write/read, simplest processing) functionality on my desktop (i7 3930 / 64GB DDR3-1600) and my laptop (T7100 / 4GB).
The numbers were pretty close - about 4.6 million (laptop) and about 5.9 million (desktop) blocks per second in the best-case scenario, going down to 2.4 million blocks per second (laptop) and ~1 million (desktop) in the worst case (on how to avoid it, see the Tweaks section). Here I specifically mention the number of blocks transferred per second rather than bytes. The heaviest load for any implementation happens when the block size is very small - a few bytes. In that case the actual memory copy takes about 0.1% of the overall processing time. But the bigger the blocks, the more time is spent in memcpy - up to 99.9% - so for larger blocks you are basically looking at raw memory speed and can say nothing about how well the IPC is implemented.
Now, how does this compare to the speed of the original implementation by studio_ukc? I was quite surprised to see that my implementation is about 50-60% faster on the laptop and 30-40% faster on the desktop (in the best-case scenario) - kind of an unexpected result, given that it's more complex. I included the original implementation's performance test as part of IPCTest.cpp, so you can compare the results yourself.
Tweaks
Unfortunately, it is impossible to write an implementation that gives the best results in all situations. So, there are several flags and values that you can set to tailor it to your specific needs. Here is an explanation of what they are and when to use them:
- Flag that is set via doSingleThread (bool). You can call the read/write functions of one IPC instance from any number of threads. Unfortunately, this comes at a price: the IPC somehow needs to protect its internal array of pointers to shared memory while it is getting expanded. But this also leads to the need to use a shared lock for all read access, which, as you've guessed, is done via that lightweight lock, and that involves 2 calls to the interlocked functions (one to InterlockedIncrement and one to InterlockedDecrement) in the best case. So, if you know that you will be calling read/write on one IPC instance only from one thread (i.e. if you have one IPC instance per thread), OR if your max number of expansions is 1 (i.e. you never expand), then you can set this flag to "true", thus gaining ~15-20%. The default initial value of this flag is "false". This flag can be set at any time from any thread, and affects only the instance you call it on.
- Flag that is set during creation of the connection point via the parameter preAllocate (remember "advantage #1"? I promised to explain what it is). If set to "true", it pre-allocates the internal array of pointers to the shared memory for each instance. Not the shared memory itself, but the array of pointers - that's an important difference. Each array element is only 12 bytes in a 32-bit build, or just 24 bytes in a 64-bit build. For example, you can set the initial number of blocks to 2048, each block 8 bytes, so one memory file takes 24K (don't forget the service information of 4 bytes per block). Also, let's say you set the maximum expansion count to 1024. So, your minimum usage is 24K, and your potential maximum usage is 24MB. Now, if you pre-allocate the array of pointers, it will grab those 12 bytes * 1024 and allocate 12K per instance. Once that's done, the IPC does not need to use the shared lock during reads, so you gain about 50% in speed if you access one instance from all threads. This flag can be set only during creation, and affects all instances connected to this connection point. Also, it prevents modification of the maxExpansions value later on (i.e. the setMaxExpansionsCount() function will do nothing when called).
- You can also play with the function setMaxSpinLocksCount(uint32_t blockAccessSpinLock, uint32_t expansionAccessSpinLock). These are spinlock counts that control how many tries the IPC makes before resorting to a WaitForXXX function (or a ::Sleep(0) loop). The first parameter (blockAccessSpinLock) controls after how many tries the IPC will stop trying to get a new available block (for read or write) and skip to WaitForXXX, waiting for someone to signal block availability. The default value is 12. The second parameter (expansionAccessSpinLock) controls shared read access to blocks when neither the single-thread flag nor preAllocate is set. The default value is 40.
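To make the preAllocate numbers above concrete, here is a hedged sketch of creating and tuning a connection point for that scenario. The block size, block count and expansion count are the ones from the example above; the minExpansionDistance value and the exact parameter types are my own assumptions - check SimpleIpc.h for the real signatures (the single-thread flag is omitted because its setter is not spelled out in this article):
#include <cstdint>
#include "SimpleIpc.h"

int main()
{
    CSimpleIPC ipc;

    // 2048 initial blocks of 8 bytes each:
    //   data: 2048 * 8 = 16K, plus 4 bytes of service information per block: 2048 * 4 = 8K,
    //   so one shared memory file is ~24K.
    // With maxExpansions = 1024 the worst case is ~24K * 1024 = ~24MB of shared memory,
    // while the pre-allocated pointer array costs only 12 bytes * 1024 = 12K per instance
    // (32-bit build; 24 bytes per element in a 64-bit build).
    const uint32_t blockSize            = 8;
    const uint32_t blockCount           = 2048;
    const uint32_t maxExpansions        = 1024;
    const bool     preAllocate          = true;  // removes the shared lock on reads
    const uint32_t minExpansionDistance = 16;    // illustrative: expand when <= 16 free blocks remain

    if (!ipc.createConnectionPoint(blockSize, blockCount, maxExpansions,
                                   preAllocate, minExpansionDistance))
        return 1;

    // Optional: tune the spinlock counts (12 and 40 are the documented defaults).
    ipc.setMaxSpinLocksCount(12, 40);

    // ... share the connection point with the worker threads and start communicating ...
    return 0;
}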
So, when and why do you need to use these flags? The fastest way to use this IPC is to use one IPC instance per thread. Say, you create the connection point in the main thread, and then in every worker thread you create a new IPC instance, connect it to that connection point and start sending/receiving data. Why not always use this scenario? Well, everything comes at a price. In this case - too much memory. When you create an IPC instance and connect it to the connection point, it maps the shared memory file into its own memory space (to be exact, into the memory space of that thread or process). So, if you allocated a paltry 1MB in one shared file, and then spawned 1000 threads, each of them using its own IPC instance, then your process(es) now use 1GB of memory. This is Not Good. But, if you can control the number of threads/processes connected to the connection point, then you'd better use one IPC per thread.
Now, if the above scenario is not possible, i.e. you can't predict/limit the number of threads connected, or you want to use a pretty large number of blocks (block count * expansion count), then you need to retreat to the second line of defence: use preAllocate. Now you can use just one IPC instance per process, and have all threads use it with almost the same performance as if they had their own IPC instances. Yes, you pay for it with extra allocated space that might never be used, but 12K or even 120K is much less than the potential gigabytes of memory that you would suddenly need to allocate. In fact, I recommend setting this flag (and then using a single IPC instance from all threads) in almost any situation - unless you are either highly pressed on space usage (which is quite unlikely nowadays, especially under Windows), or you need the very best performance, where even 1% counts. In general, using preAllocate and then accessing one instance from all threads is about 3-4% slower than having one IPC per thread.
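For completeness, here is a rough sketch of the "one IPC instance per thread" pattern described above. It is my own illustrative code, not part of the download: the thread count, block parameters and payload are arbitrary, and std::thread is used only for brevity (the supplied project targets VC++ 2010):
#include <cstdint>
#include <thread>
#include <vector>
#include "SimpleIpc.h"

int main()
{
    // The connection point is created once, in the "main" thread (illustrative parameters).
    CSimpleIPC ipcMain;
    if (!ipcMain.createConnectionPoint(sizeof(uint32_t), 256, 4, false, 16))
        return 1;

    const uint32_t kThreads         = 4;
    const uint32_t kWritesPerThread = 100;

    // "One IPC instance per thread": each worker attaches its own CSimpleIPC
    // to the same connection point, so no instance is ever shared between threads.
    std::vector<std::thread> workers;
    for (uint32_t t = 0; t < kThreads; ++t)
    {
        workers.emplace_back([&, t]()
        {
            CSimpleIPC ipc;
            if (!ipc.attachToConnectionPoint(ipcMain.getConnectionPoint()))
                return; // could not attach - give up in this sketch
            for (uint32_t i = 0; i < kWritesPerThread; ++i)
            {
                uint32_t value = t * kWritesPerThread + i;
                ipc.write(&value, sizeof(value));
            }
        });
    }

    // The creator's instance is just another peer - use it to drain the data.
    for (uint32_t i = 0; i < kThreads * kWritesPerThread; ++i)
    {
        uint32_t value = 0;
        ipcMain.read(&value, sizeof(value));
    }

    for (size_t w = 0; w < workers.size(); ++w)
        workers[w].join();
    return 0;
}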
Keep in mind that if neither of these flags is set, i.e. you are using the default settings, performance is 4-5 times lower than optimal, falling to about 1-1.2 million blocks per second in the worst-case scenario (a single instance used by multiple threads). Interestingly enough, the performance of my desktop in this case is half that of the laptop.
And one more thing for advanced users. Instead of calling the "write" function that copies the passed buffer to the shared memory (to be read, via a copy again, by the "read" function), you can use 2 additional functions with the following use pattern:
BlockWriteDescriptor descriptor;
if (void* buf = ipc.acquireBlock(descriptor))
{
    if (!(new (buf) MyType(<....>)))
        //<Error. Mark this block somehow as unusable, i.e. zero it out>
    ipc.releaseLockedBlock(descriptor);
}
//....
// somewhere else:
BlockReadDescriptor descriptor;
if (void* buf = ipc.acquireBlock(descriptor))
{
    MyType* myVar = (MyType*)descriptor.result;
    <... do something with myVar ...>
    myVar->~MyType();
    ipc.releaseLockedBlock(descriptor);
}
Keep in mind that for the whole time between acquireBlock and releaseLockedBlock the IPC might have other threads waiting, so use extreme caution and use this ability only if you know what you are doing. You might want to use it to avoid memory fragmentation and save time on copying - this way you have only one memory manipulation (during the call to "new") instead of 2 copies. But again - you have been warned: this is extremely dangerous functionality, use with care.
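One way to reduce the risk of forgetting releaseLockedBlock (for example, on an early return) is to wrap the acquire/release pair in a small RAII guard. This is my own sketch, not part of the download - it only assumes the acquireBlock/releaseLockedBlock calls and the BlockWriteDescriptor type shown above:
#include "SimpleIpc.h"

// Minimal RAII guard: acquires a block in the constructor and guarantees
// that releaseLockedBlock() is called when the guard goes out of scope.
class ScopedWriteBlock
{
public:
    explicit ScopedWriteBlock(CSimpleIPC& ipc)
        : m_ipc(ipc), m_buf(ipc.acquireBlock(m_descriptor)) {}
    ~ScopedWriteBlock() { if (m_buf) m_ipc.releaseLockedBlock(m_descriptor); }

    void* get() const { return m_buf; } // NULL if acquisition failed

private:
    ScopedWriteBlock(const ScopedWriteBlock&);            // non-copyable
    ScopedWriteBlock& operator=(const ScopedWriteBlock&);

    CSimpleIPC&          m_ipc;
    BlockWriteDescriptor m_descriptor;
    void*                m_buf;
};

// Usage:
//     ScopedWriteBlock block(ipc);
//     if (block.get())
//         new (block.get()) MyType(/* ... */);   // construct in place
//     // releaseLockedBlock() runs automatically here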
Also, this code is suited for VC++ 2010 and newer. If you want to use older compilers, please do the following:
- remove include for <stdint.h>
- add the following define (of course, inside #ifndef/#endif):
#define uint32_t unsigned long
- replace "to_wstring" with some other function (like "sprintf", for example).
That's it, and happy coding!
Updates:
07/15/2013: Version 1.1 - 2 bugfixes, both of which were prominent when an INFINITE timeout was specified:
1) the optional handle "stopOn" was ignored when an INFINITE timeout was specified;
2) due to a race condition the writer was not always informing the reader that a write operation was completed, thus causing the reader to hang forever on the last unread block when an INFINITE timeout was specified.