C# is a great language. It allows you to be extremely productive and efficient by eliminating the need for manual memory management, and providing fast compile times, an extensive standard library, and various other handy features. However, for applications requiring heavy number crunching, its performance can be less than adequate. In this article, I will show you how C# can call into C++ functions when necessary, and provide an analysis of its performance.
Given a large number of random rectangles lying between (0, 0) and (2, 2), let us find the percentage of these rectangles that lie between (0, 0) and (1, 1). We will solve this problem using brute force in order to stress the CPU. The algorithm is straightforward: we generate four random numbers in the interval (0, 2) for each rectangle and assign them to its corner points, then count how many of the rectangles lie entirely within the desired region. All tests are carried out with 10 million rectangles on a Q6600 with 4 GB of RAM.
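The brute-force approach described above can be sketched in C++ as follows. This is only an illustration under assumptions: the helper names (insideUnitSquare, makeRandomBoxes, percentInside), the seed parameter, and the choice of std::mt19937 are not taken from the article's source. Since each of the four coordinates independently falls in (0, 1) about half the time, the result should hover around 1/16, or 6.25%.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Four floats per rectangle: two corner points (x1, y1) and (x2, y2).
struct BBox {
    float x1, y1, x2, y2;
};

// True when both corners lie strictly inside the square (0,0)-(1,1).
// Hypothetical helper name, chosen for this sketch.
bool insideUnitSquare(const BBox& b) {
    return b.x1 > 0 && b.x1 < 1 && b.y1 > 0 && b.y1 < 1 &&
           b.x2 > 0 && b.x2 < 1 && b.y2 > 0 && b.y2 < 1;
}

// Fills `count` rectangles with coordinates drawn uniformly from [0, 2).
// The generator and seed are assumptions made for reproducibility.
std::vector<BBox> makeRandomBoxes(std::size_t count, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> coord(0.0f, 2.0f);
    std::vector<BBox> boxes(count);
    for (BBox& b : boxes) {
        b.x1 = coord(rng);
        b.y1 = coord(rng);
        b.x2 = coord(rng);
        b.y2 = coord(rng);
    }
    return boxes;
}

// Brute force: test every rectangle and return the share inside, in percent.
float percentInside(const std::vector<BBox>& boxes) {
    if (boxes.empty()) return 0.0f;
    std::size_t hits = 0;
    for (const BBox& b : boxes) {
        if (insideUnitSquare(b)) ++hits;
    }
    return static_cast<float>(hits) / static_cast<float>(boxes.size()) * 100.0f;
}
```

Calling percentInside(makeRandomBoxes(10000000, 42u)) would reproduce the workload size used in the tests, although the timings will of course differ from the C# figures quoted here.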
C# Reference Solution
This is all C#, and is used as a reference to measure relative performance. For 10 million rectangles, this method takes around 146 ms.
Interop Take 1: Marshaling
This is the easiest way to do interop. The C# side code for this is:
[DllImport("DllFuncs.dll", CallingConvention = CallingConvention.Cdecl, EntryPoint = "nativef")]
public static extern float getPercentBBMarshal(
    [MarshalAs(UnmanagedType.LPArray, SizeParamIndex = 1)] BBox[] boxes, int size);
Basically, this tells the runtime to look for a function named nativef, using the cdecl calling convention, in the native library called "DllFuncs.dll". It also tells the runtime to transparently convert the C# array of BBoxes into a C++ array. We need to pass the size, as arrays in C++ are unaware of their length. The corresponding C++ function is this:
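struct BBox {
    float x1, y1, x2, y2;
    // The member's name did not survive in this listing; "inside" is a placeholder.
    bool inside() const {
        return (x1 < 1) && (x2 < 1) && (y1 < 1) && (y2 < 1) && (x1 > 0) &&
               (x2 > 0) && (y1 > 0) && (y2 > 0);
    }
};
// extern "C" keeps the exported name unmangled so EntryPoint = "nativef" resolves.
extern "C" __declspec(dllexport) float __cdecl nativef(BBox * boxes, int size) {
    int sum = 0;
    for (int i = 0; i < size; i++)
        sum += boxes[i].inside();  // loop body reconstructed from context
    return (float)sum/(float)size * 100;
}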
Measuring the performance of this function with a timer, we see an interesting result. The native function takes 341 ms for 10 million elements, which is more than twice the time taken by the C# equivalent! Moreover, for 1000 elements, marshaling takes 239.3 ms, which is way above the 0.054 ms taken by pure C#. Clearly, marshaling adds a huge overhead, whose relative importance diminishes as the amount of work grows. To see where this overhead comes from, we need to know how marshaling works:
- Allocate a C++ equivalent of the C# array passed to the function.
- Copy the values from the C# array to the C++ array.
- Call the C++ function.
- Copy the return value of the C++ function into the C# equivalent.
- Return control to the C# assembly.
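The steps above can be mimicked in plain C++ to see where the cost comes from. The sketch below is not how the CLR marshaler is implemented; it only mirrors the listed steps, and the names callWithCopy and percentInside are hypothetical stand-ins introduced for this illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

struct BBox {
    float x1, y1, x2, y2;
};

// Stand-in for the native function: percentage of boxes inside (0,0)-(1,1).
float percentInside(const BBox* boxes, int size) {
    if (size <= 0) return 0.0f;
    int sum = 0;
    for (int i = 0; i < size; i++) {
        const BBox& b = boxes[i];
        sum += (b.x1 > 0 && b.x1 < 1 && b.y1 > 0 && b.y1 < 1 &&
                b.x2 > 0 && b.x2 < 1 && b.y2 > 0 && b.y2 < 1) ? 1 : 0;
    }
    return static_cast<float>(sum) / static_cast<float>(size) * 100.0f;
}

// Hypothetical illustration of copy-based marshaling.
float callWithCopy(const BBox* managed, int size) {
    if (size <= 0) return 0.0f;
    // Step 1: allocate a native equivalent of the caller's array.
    std::vector<BBox> native(static_cast<std::size_t>(size));
    // Step 2: copy every element across (16 bytes per box).
    std::memcpy(native.data(), managed, sizeof(BBox) * native.size());
    // Step 3: call the native function on the copy.
    float result = percentInside(native.data(), size);
    // Steps 4 and 5: hand the return value back; the temporary copy is
    // released when `native` goes out of scope.
    return result;
}
```

Both functions return the same value for the same input; the only difference is the extra allocation and copy, which is exactly the work that direct pointer access avoids.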
Now, we can easily see the reason for the bad performance. We are essentially allocating 10 million rectangles and copying the values of each of these. At four 4-byte floats (16 bytes) per rectangle, that is allocating and moving around 160 MB of data! No wonder performance is horrible. You may be asking yourself why the whole exercise of copying is necessary in the first place. There are two reasons for this:
- The Memory Layout of a structure in C# may not match that of the same structure in C++. Hence, a simple pointer assignment may not do the trick, as C++ may interpret this memory area differently than C#.
- The Garbage Collector in C# is free to move data physically in memory, in order to do compacting garbage collection. Hence, a pointer passed from C# to C++ may not be valid by the time control reaches C++, as the GC may have already moved the underlying memory to another physical location.
So, is it possible to get around these problems? Let's find out!
Interop Take 2: Direct pointer access
The memory layout problem is simple to deal with. It's just a matter of telling the runtime to lay out the structure in memory just like C++ would. C++ lays out the data sequentially using certain alignment rules, which C# can mimic using the following:
[StructLayout(LayoutKind.Sequential)]
public struct BBox { public float x1, y1, x2, y2; }
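On the C++ side, the layout that the C# structure has to match can be pinned down with compile-time checks. The following is a minimal sketch, assuming 4-byte floats and a BBox that holds nothing but the four coordinates; bytesFor is a hypothetical helper added to show the arithmetic behind the 160 MB figure.

```cpp
#include <cstddef>
#include <type_traits>

struct BBox {
    float x1, y1, x2, y2;
};

// A standard-layout struct stores its members in declaration order,
// which is what a sequential layout on the managed side has to mirror.
static_assert(std::is_standard_layout<BBox>::value, "BBox must be standard layout");
static_assert(sizeof(float) == 4, "this sketch assumes 4-byte floats");
static_assert(sizeof(BBox) == 16, "four packed floats, no padding expected");
static_assert(offsetof(BBox, x1) == 0, "x1 at offset 0");
static_assert(offsetof(BBox, y1) == 4, "y1 at offset 4");
static_assert(offsetof(BBox, x2) == 8, "x2 at offset 8");
static_assert(offsetof(BBox, y2) == 12, "y2 at offset 12");

// Hypothetical helper: number of bytes occupied by `count` rectangles.
constexpr std::size_t bytesFor(std::size_t count) {
    return count * sizeof(BBox);
}
```

If any of these assertions failed on a given compiler, a raw pointer to the C# array could not be reinterpreted as a BBox array without adjusting one side or the other.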
The garbage collector problem can be taken care of using the fixed statement. This statement makes sure that the memory is not moved around by the GC for the lifetime of the statement. The fixed statement can only be used from unsafe functions in assemblies compiled with the /unsafe option.
[DllImport("DllFuncs.dll", CallingConvention = CallingConvention.Cdecl)]
public static extern unsafe float nativef(IntPtr p, int size);
public static unsafe float getPercentBBInterop(BBox[] boxes)
{
    float result;
    fixed (BBox* p = boxes)  // pin the array so the GC cannot move it during the call
        result = nativef((IntPtr)p, boxes.Length);
    return result;
}
Pointer Access Performance
The function returns in 115 ms, which is around 26% faster than the C# equivalent. The performance gain is likely to increase with the complexity of the functions delegated to native code.
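The article does not show how its timer was set up, so the following is only one way such measurements could be taken on the native side. It is a sketch that assumes a best-of-several-runs policy to reduce noise; bestOfMs is a hypothetical helper, not part of the article's code.

```cpp
#include <chrono>
#include <limits>

// Runs `work` `runs` times and returns the fastest wall-clock time in
// milliseconds. Returns infinity when `runs` is zero or negative.
template <typename Work>
double bestOfMs(int runs, Work&& work) {
    double best = std::numeric_limits<double>::infinity();
    for (int i = 0; i < runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        if (ms < best) best = ms;
    }
    return best;
}
```

A call such as bestOfMs(5, [&] { nativef(boxes, size); }) would time the native function alone, without the interop transition, which is useful for separating the cost of the computation from the cost of getting the data across the boundary.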
Charts displaying the performance with respect to various numbers of elements processed are shown below:
The source code for this is hosted on GitHub:
Make sure to check it out.