Introduction

Skip to the results

There is a performance penalty when using P/Invoke to cross the managed/unmanaged boundary. But how serious is this penalty? Can this penalty be reduced by not using P/Invoke, but writing a C++/CLI library that exposes functions from a traditional API?

In this article, we will look into the performance of P/Invoke compared to a C++/CLI wrapper.

Background

I am currently working on the new version of my OpenGL wrapper and Scene Graph called SharpGL (http://www.codeproject.com/KB/openGL/sharpgl.aspx). OpenGL is a very 'talkative' API - the functions are called many thousands of times per second. Whilst working on this library I wondered, would it be faster to write a C++/CLI class library to expose OpenGL functions or would it be faster to P/Invoke them directly? A brief Google suggested a C++/CLI wrapper but I wanted to look into this further.

I have written a tiny API called 'TraditionalAPI' which exposes three functions - this project invokes the functions a number of times using different methods.

Part 1: The Traditional API

The traditional API exposes three basic functions:

Test Function 1: IncrementCounter

This is the most basic function I could come up with, testing this function should be a good way of testing the overhead of a P/Invoke call:

//    A global counter.
unsigned int g_uCounter = 0;

TRADITIONALAPI_API void __stdcall TA_IncrementCounter()
{
    g_uCounter++;
}

Test Function 2: Square Root

The second function calculates the square root of a double. No complicated marshalling should be required:

//    A slightly more complex function, find the square root of a double.
TRADITIONALAPI_API double __stdcall TA_CalculateSquareRoot(double dValue)
{
    return ::sqrt(dValue);
}

Test Function 3: Dot Product

The next function calculates the dot product of two three-tuples. This function takes two arrays - so in the managed world we will have to pin memory to marshal this:

TRADITIONALAPI_API double __stdcall TA_DotProduct(
               double arThreeTuple1[], double arThreeTuple2[])
{
    return arThreeTuple1[0] * arThreeTuple2[0] + arThreeTuple1[1] * 
           arThreeTuple2[1] + arThreeTuple1[2] * arThreeTuple2[2];
}

Part 2: The C++/CLI Wrapper

The second part of the solution is a C++/CLI wrapper that wraps each function:

Test Function 1 Wrapper

void IncrementCounter()
{
    //    Call the unmanaged function.
    ::TA_IncrementCounter();
}

Nothing special here - this is a C++/CLI class so we will be able to call IncrementCounter from another .NET application.

Test Function 2 Wrapper

double CalculateSquareRoot(double value)
{
    //    Call the unmanaged function.
    return ::TA_CalculateSquareRoot(value);
}

Again, nothing complicated is required for this function.

Test Function 3 Wrapper

double DotProduct(array<double>^ threeTuple1, array<double>^ threeTuple2)
{
    //    Pin the arrays.
    pin_ptr<double> p1(&threeTuple1[0]);
    pin_ptr<double> p2(&threeTuple2[0]);
            
    //    Call the unmanaged function.
    return TA_DotProduct(p1, p2);
}

Now in this case, we actually have to do some work - pinning the managed arrays so that we can access them directly in the unmanaged API.

Part 3: The C# Test Application

The final part of the solution is a C# WPF application that runs the tests. The TraditionalAPI DLL can run each test function individually or run each test a number of times. Because of this, we can compare the following:

The time taken to run x tests directly in TraditionalAPI
The time taken to run x tests via the C++/CLI interface
The time taken to run x tests via P/Invoke

The Results

Below we have the results of running each test 10000 times:

And the results of running each test 100000 times:

Conclusion

Certainly not what I would have expected. According to my research, I was expecting to see the C++/CLI interface be at least an order of magnitude faster - as it does less error checking than a P/Invoke call. Even in the case of a million function calls, the C++/CLI interface is barely faster than using P/Invoke.

As we would expect, the cost of calling any native function from a CLI application is very high if we are calling it many times - in the case of using a very talkative API, it may even be worth writing a second C++ API that takes an aggregated set of parameters and calls the functions many times - the managed to unmanaged boundary is expensive.

Further Research

Has anyone found a case where a C++/CLI wrapper really does give a solid performance boost? Is there another way to do this that I have overlooked? Please provide any suggestions and I will update the article as necessary.