This article compares different allocation and copy methods for large byte arrays in managed code.
Sometimes you have to deal with frequently allocated and copied large byte arrays. Especially in video processing: an RGB24 frame of a 320x240 picture needs a byte array of 320*240*3 = 230,400 bytes. Choosing the right memory allocation and memory copying strategies may be vital for your project.
Using the Code
In my current project, I have to handle hundreds of uncompressed RGB24-Frames on multi core servers in real time. To be able to choose the best architecture for my project, I compared different memory allocations and copy mechanisms.
Because I know how difficult it is to find a good way to measure, I decided to do a really simple test and get a raw comparable result. I simply run a loop for 10 seconds and count the number of loops.
Looking around, I found 5 different methods to allocate large byte arrays:

new byte[]
Marshal.AllocHGlobal()
Marshal.AllocCoTaskMem()
CreateFileMapping() // This is shared memory
stackalloc
Here is a typical loop, showing the new byte[] variant:

private static void newbyte()
{
    Console.Write("new byte: ");
    long start = DateTime.UtcNow.Ticks;
    int i = 0;
    while ((start + duration) > DateTime.UtcNow.Ticks)
    {
        byte[] buf = new byte[bufsize];
        i++;
    }
    Console.WriteLine(i);
}
new byte[] is completely managed code.
Marshal.AllocHGlobal() allocates memory from the unmanaged memory of the process.
IntPtr p = Marshal.AllocHGlobal(bufsize);
Marshal.AllocHGlobal() returns an IntPtr; the call itself does not require unsafe code. But when you want to access the allocated memory, you most often need unsafe code, and you have to release the buffer yourself with Marshal.FreeHGlobal().
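A minimal sketch of the pattern (bufsize follows the article; the try/finally with Marshal.FreeHGlobal() is my addition, since the garbage collector does not track this memory):

```csharp
// Allocate bufsize bytes of unmanaged process memory.
IntPtr p = Marshal.AllocHGlobal(bufsize);
try
{
    unsafe
    {
        // Accessing the buffer needs unsafe code.
        byte* buf = (byte*)p.ToPointer();
        buf[0] = 0xFF; // write to the unmanaged buffer
    }
}
finally
{
    Marshal.FreeHGlobal(p); // not garbage collected - free it yourself
}
```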
Marshal.AllocCoTaskMem() allocates a block of memory of the specified size from the COM task memory allocator.
IntPtr p = Marshal.AllocCoTaskMem(bufsize);
Same need for unsafe code as with Marshal.AllocHGlobal(); here the buffer is released with Marshal.FreeCoTaskMem().
For using shared memory in a managed code project, I wrote my own little helper class wrapping CreateFileMapping().
Using shared memory is quite simple:
using (SharedMemory mem = new SharedMemory("abc", bufsize, true))
{
    // ... use the buffer ...
}
mem has a void* pointer to the buffer and a length property. From inside another process, you can get access to the same memory by simply passing false in the constructor (and using the same name). SharedMemory uses unsafe code.
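The helper class itself is not shown here. A hedged sketch of the Win32 calls such a class wraps (my reconstruction, not the author's code; error handling and cleanup via UnmapViewOfFile/CloseHandle omitted for brevity):

```csharp
using System;
using System.Runtime.InteropServices;

static class SharedMemSketch
{
    const uint PAGE_READWRITE = 0x04;
    const uint FILE_MAP_ALL_ACCESS = 0x000F001F;
    static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr CreateFileMapping(IntPtr hFile, IntPtr lpAttributes,
        uint flProtect, uint dwMaximumSizeHigh, uint dwMaximumSizeLow, string lpName);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr MapViewOfFile(IntPtr hFileMappingObject,
        uint dwDesiredAccess, uint dwFileOffsetHigh, uint dwFileOffsetLow,
        UIntPtr dwNumberOfBytesToMap);

    public static IntPtr Create(string name, uint size)
    {
        // Backed by the page file (hFile = INVALID_HANDLE_VALUE) and named,
        // so a second process can map the same region by name.
        IntPtr hMap = CreateFileMapping(INVALID_HANDLE_VALUE, IntPtr.Zero,
            PAGE_READWRITE, 0, size, name);
        return MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, (UIntPtr)size);
    }
}
```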
stackalloc allocates a byte array on the stack. Therefore, it is freed automatically when you return from the current method. Using the stack may result in stack overflows when you don't do it wisely.
unsafe static void stack()
{
    byte* buf = stackalloc byte[bufsize];
}
stackalloc requires using unsafe, too.
I don't want to talk about single/multi-core, NUMA/non-NUMA architectures and so on. Therefore, I just print some interesting results. Feel free to run the test on your machines!
Running the test in Debug and Release offers dramatic differences in the number of loops in 10 seconds:
Release:
new byte: 425340907 100%
Marshal.AllocHGlobal: 19680751 5%
Marshal.AllocCoTaskMem: 21062645 5%
stackalloc: 341525631 80%
SharedMemory: 792007 0.2%

Debug:
new byte: 71004 0.3%
Marshal.AllocHGlobal: 22660829 89%
Marshal.AllocCoTaskMem: 25557756 100%
stackalloc: 558497 2%
SharedMemory: 785470 3%
As you can see, new byte[] and stackalloc dramatically depend on the debug/release switch, while the other three do not. This may be because they are mainly kernel-managed. new byte[] and stackalloc are the fastest in managed code in release mode and the slowest in debug mode. But remember that the garbage collector has to handle every new byte[] allocation, too.
These two runs were done on my PC (Intel dual-core, Vista 64-bit). So let's compare it to a typical server (dual Xeon quad-core, Windows Server 2008 64-bit) in release mode (server first, PC second):
new byte: 553541729 425340907
Marshal.AllocHGlobal: 26460746 19680751
Marshal.AllocCoTaskMem: 28294494 21062645
stackalloc: 466980755 341525631
SharedMemory: 817317 792007
Because the test is single-threaded, the number of cores does not matter. Remember, the garbage collector has its own thread.
Let's compare 32-bit to 64-bit (release mode, x86 first, x64 second):
new byte: 1046577767 516441931
Marshal.AllocHGlobal: 21034715 25152330
Marshal.AllocCoTaskMem: 23467574 27787971
stackalloc: 83956017 416630753
SharedMemory: 728858 793750
SharedMemory is a little bit faster on x64. new byte[] is up to twice as fast on x86 as on x64. And stackalloc is 5 times faster on x64 than on x86. I didn't expect this result! The same pattern shows on my server.
So think twice before you decide which allocation method and target-platform you choose!
And now let's look at some memcopy variants. I use the same algorithm to measure: let one thread run a loop copying one byte array to another byte array for 10 seconds, and count the number of copies (release first, debug second):
Array.Copy: 360741 361740
Marshal.Copy: 360680 359712
Kernel32NativeMethods.CopyMemory: 361314 358927
Buffer.BlockCopy: 375440 374004
OwnMemCopyInt: 217736 33833
OwnMemCopyLong: 295372 54601
As expected, only my own MemCopy is a lot slower in debug mode. Let's take a look at my own MemCopy:
static readonly int sizeOfInt = Marshal.SizeOf(typeof(int));

static public unsafe void MemCopy(IntPtr pSource, IntPtr pDest, int Len)
{
    int size = sizeOfInt;
    int count = Len / size;
    int rest = Len % size; // remainder of the element size, not of count
    int* ps = (int*)pSource.ToPointer(), pd = (int*)pDest.ToPointer();
    for (int n = 0; n < count; n++)
    {
        *pd++ = *ps++;
    }
    if (rest > 0)
    {
        // copy the remaining bytes one by one
        byte* ps1 = (byte*)ps;
        byte* pd1 = (byte*)pd;
        for (int n = 0; n < rest; n++)
        {
            *pd1++ = *ps1++;
        }
    }
}
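The OwnMemCopyLong variant from the tables is not shown in the article. A sketch of what it presumably looks like, copying in 8-byte strides under the same structure (my reconstruction, not the author's code):

```csharp
static public unsafe void MemCopyLong(IntPtr pSource, IntPtr pDest, int Len)
{
    int size = sizeof(long); // 8-byte strides instead of 4
    int count = Len / size;
    int rest = Len % size;
    long* ps = (long*)pSource.ToPointer(), pd = (long*)pDest.ToPointer();
    for (int n = 0; n < count; n++)
    {
        *pd++ = *ps++;
    }
    if (rest > 0)
    {
        // copy the remaining bytes one by one
        byte* ps1 = (byte*)ps;
        byte* pd1 = (byte*)pd;
        for (int n = 0; n < rest; n++)
        {
            *pd1++ = *ps1++;
        }
    }
}
```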
Even when you use unchecked unsafe code, the built-in copy functions perform much faster in debug mode than doing the copy in a loop yourself. Comparing x86 to x64 in release mode (x86 first, x64 second):
Array.Copy: 230788 360741
Marshal.Copy: 460061 360680
Kernel32NativeMethods.CopyMemory: 365850 361314
Buffer.BlockCopy: 368212 375440
OwnMemCopyInt: 218438 217736
OwnMemCopyLong: 286321 295372
In 32-bit x86 code, Marshal.Copy is significantly faster than in 64-bit code. Array.Copy is much slower in 32-bit than in 64-bit. My own memcopy loop uses 32-bit integers and therefore runs at the same speed on both. And the kernel method is not affected by this setting.
It is a good idea to use the built-in copy methods; in these tests, they beat the hand-written loop in every configuration.
Points of Interest
Try the source on your machine and compare the results.
- 9th February, 2009: Initial post