
How to use CPU instructions in C# to gain performance

26 Feb 2011

Introduction

Today, the .NET Framework and C# are commonly used to build even the most complex applications with ease. I remember that before we got our hands on C# in 2002, we used all kinds of programming languages for different purposes, ranging from Assembly and C++ to PowerBuilder. But I also remember the power of using Assembly or C++ to squeeze every drop of performance out of the hardware. Every once in a while, I land on a project where regular framework functionality keeps my computer on the grill for a couple of days to calculate something. In those cases, I go back to good old C++ or Assembly routines to use the full power of my computer. In this post, I will show you the simplest way to take advantage of your hardware without introducing any code complexity.

Background

I believe that samples are the best teacher; therefore, I'll use a sample CPU instruction from the Streaming SIMD Extensions (SSE). SSE is just one of the many instruction set extensions to the x86 architecture. I'll use an instruction named PTEST from the SSE4 instructions (specifically SSE4.1), which is available on almost all Intel and AMD CPUs today. Check the processor documentation for the list of supported CPUs. PTEST performs a bitwise comparison between two 128-bit operands. I picked it because it is a good sample that also involves data structures. You can easily look up other instructions online to match your project requirements.
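Because running an SSE4.1 instruction on a CPU that lacks it raises an illegal-instruction fault, it can help to check for support at run time. The following is a minimal sketch, not part of the original sample: the function name HasSse41 is hypothetical, and it assumes either the MSVC __cpuid intrinsic or the GCC/Clang __builtin_cpu_supports builtin on an x86 target. On any other target it conservatively reports no support.

```cpp
#if defined(_MSC_VER)
#include <intrin.h>   // __cpuid (MSVC)
#endif

// Hypothetical helper: returns true when the CPU reports SSE4.1 support.
bool HasSse41()
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4] = { 0, 0, 0, 0 };      // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);                  // CPUID leaf 1: feature flags
    return (regs[2] & (1 << 19)) != 0; // ECX bit 19 indicates SSE4.1
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("sse4.1") != 0;
#else
    return false;                      // unknown target: assume unsupported
#endif
}
```

A caller could use the result to pick between the PTEST path and a plain loop.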

We will write unmanaged C++, wrap it in managed C++, and call it from C#. Don't worry, it is easier than it sounds.

Thanks to MSDN, all the necessary information is available in the Alphabetical Listing of Intrinsic Functions. The PTEST intrinsic _mm_testc_si128 is documented at http://msdn.microsoft.com/en-us/library/bb513983.aspx. If we were using plain C++, we would end up with the code from the MSDN link:

C++
#include <stdio.h>
#include <smmintrin.h>

int main ()
{
    __m128i a, b;

    a.m128i_u64[0] = 0xAAAA55551111FFFF;
    b.m128i_u64[0] = 0xAAAA55551111FFFF;

    a.m128i_u64[1] = 0xFEDCBA9876543210;
    b.m128i_u64[1] = 0xFEDCBA9876543210;

    int res1 = _mm_testc_si128(a, b);

    a.m128i_u64[0] = 0xAAAA55551011FFFF;

    int res2 = _mm_testc_si128(a, b);

    printf_s("First result should be 1: %d\nSecond result should be 0: %d\n",
                res1, res2);

    return 0;
}
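To see why the sample prints 1 and then 0, here is a portable sketch of what _mm_testc_si128 computes. It is only a scalar model under stated assumptions: the Bits128 struct and the TestCModel function are hypothetical names, and no SSE instruction is emitted. The intrinsic returns 1 when (~a & b) is all zeros, that is, when every bit set in b is also set in a.

```cpp
#include <cstdint>

// Hypothetical stand-in for a 128-bit value: two 64-bit halves.
struct Bits128
{
    std::uint64_t lo; // bits 0..63
    std::uint64_t hi; // bits 64..127
};

// Scalar model of the carry-flag result returned by _mm_testc_si128:
// 1 when every bit set in b is also set in a, otherwise 0.
int TestCModel(const Bits128& a, const Bits128& b)
{
    bool lowCovered  = (~a.lo & b.lo) == 0;
    bool highCovered = (~a.hi & b.hi) == 0;
    return (lowCovered && highCovered) ? 1 : 0;
}
```

With the values from the MSDN sample, a and b start out identical, so the model returns 1. Changing the low half of a to 0xAAAA55551011FFFF clears one bit that is still set in b, so the second call returns 0.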

I would like to point out that there are many ways to develop software, and the one I'm providing here may not be the best solution for your requirements. I'm just showing one way to accomplish the task; it is up to you to fit it into your solution.

OK, let's start with a fresh new solution. (I'm assuming that you have Visual Studio 2010 with the C++ language installed.)

  1. Add a new C# console application to your solution, for testing purposes.
  2. Add a new Visual C++ > CLR > Class Library project to your solution, named TestCPUInt.
  3. In your console application, add a reference and select TestCPUInt from the projects.
    Now we are ready to code.
  4. Open the TestCPUInt.h file, if it is not already open.
  5. Insert the following code right on top of the document:
    C++
    #include <smmintrin.h>
    #include <memory.h>
    
    #pragma unmanaged
    
    class SSE4_CPP
    {
    //Code here
    };
    
    #pragma managed

    This is the infrastructure for our unmanaged C++ code, which will make the SSE4 call. As you can see, I placed the unmanaged code between #pragma unmanaged and #pragma managed. I think it is a great feature to be able to write unmanaged and managed code together. You may prefer to put the unmanaged code in another file, but I used one file for clarity. We included two header files here, smmintrin.h and memory.h: the first is for the SSE4 instructions and the second is for memcpy, which I use to copy memory.

  6. Now paste the following code at the //Code here location:
    C++
    public:
    	int PTEST( __int16* bufferA, __int16* bufferB)
    	{
    		__m128i a, b;
    
    		//transfer the buffers to the __m128i data type, 
    		//because we do not want to deal with that in managed code
    		memcpy(a.m128i_i16, bufferA, sizeof(a.m128i_i16));
    		memcpy(b.m128i_i16, bufferB, sizeof(b.m128i_i16));
    
    		//Call the SSE4 PTEST instructions
    		return _mm_testc_si128(a, b);
    	}

    The _mm_testc_si128 intrinsic emits the SSE4 PTEST instruction. Right before it, a small memory operation fills the __m128i data structures in the C++ code: I used memcpy to transfer the data from the bufferA and bufferB arguments into the __m128i structures that are passed to PTEST. I preferred to do this here to keep the whole SSE4-specific implementation in one place. I could also have passed __m128i values to the PTEST method, but that would be more complex.

    As I mentioned before, in this example I used the PTEST sample with a data structure. You may run into other instructions that require only a pointer; in that case you don't need the memcpy operation. There might be some challenges if you are not familiar with C++, especially since IntelliSense is not available for managed C++ in Visual Studio 2010, but you can always search online for answers. For example, the __m128i data structure is declared in the emmintrin.h file, which is located under [Program Files]\Microsoft Visual Studio 10.0\VC\include. Or you can check the list of fundamental data types if you are not sure what to use instead of __int16*.

  7. Now paste the following code at the top of your managed C++ code, which is at the bottom of your TestCPUInt.h file in the namespace section.
    C++
    namespace TestCPUInt {
    
    	public ref class SSE4
    	{
    	public:
    		int PTestWPointer(__int16* pBufferA, __int16* pBufferB)
    		{
    			//stack-allocated native object, so there is nothing to delete
    			SSE4_CPP sse4_cpp;
    			return sse4_cpp.PTEST(pBufferA, pBufferB);
    		}
    	};
    }

    What we did here is pass the pointers pBufferA and pBufferB, which we will supply from C#, on to the unmanaged code. For those who are not familiar with pointers, the * sign declares a pointer: __int16* means a pointer to a 16-bit integer. In our case, that is the address of the first element of an array.
    There are also ways to call a dynamic library without managed C++, but as I mentioned before, I'm showing only the simplest way for a C# developer.

  8. Let’s go to our C# code in the console application to use this functionality.
  9. First, we have to allow unsafe code in the application. To do that, go to the project properties and check “Allow unsafe code” on the Build tab.
  10. Add the following code to your Program.cs file (it needs a using directive for the TestCPUInt namespace):
    C#
    static int TestCPUWithPointer(short[] bufferA, short[] bufferB)
    {
    	SSE4 sse4 = new SSE4();
    	unsafe
    	{
    		//fix the buffer variables in memory to prevent from 
    		//getting moved by the garbage collector
    		fixed (short* pBufferA = bufferA)
    		{
    			fixed (short* pBufferB = bufferB)
    			{
    				return sse4.PTestWPointer(pBufferA, pBufferB);
    			}
    		}
    	}
    }

If you have never used unsafe code before, you can check out unsafe (C# Reference). The logic is fairly simple: PTestWPointer requires a pointer to an array, and the simplest way to get such a pointer in C# is the fixed statement. The fixed statement pins the buffer array in memory to prevent the garbage collector from moving it around. But that comes with a cost: in one of my projects, the system slowed down because of too many pinned objects in memory. You may have to experiment for your own project.
That's it!

But we will not stop here. For comparison, I implemented the same operation in C#, as seen below:

C#
static int TestCLR(short[] bufferA, short[] bufferB)
{
	//We want to test if all bits set in bufferB are also set in bufferA
	for (int i = 0; i < bufferA.Length; i++)
	{
		if ((bufferA[i] & bufferB[i]) != bufferB[i])
			return 0;
	}
	return 1;
}

Here, I simply check whether every bit set in bufferB is also set in bufferA; PTEST does the same.

In the rest of the application, I compared the performance of these two methods. Below is the code that does the comparison, for the sake of testing (Stopwatch lives in the System.Diagnostics namespace):

C#
static void Main(string[] args)
{
	int testCount = 10000000;
	short[] buffer1 = new short[8];
	short[] buffer2 = new short[8];

	for (int i = 0; i < 8; i++)
	{
		buffer1[i] = 32100;
		buffer2[i] = 32100;
	}

	Stopwatch sw = new Stopwatch();
	sw.Start();
	int testResult = 0;
	for (int i = 0; i < testCount; i++)
		testResult = TestCPUWithPointer(buffer1, buffer2);
	sw.Stop();
	Console.WriteLine("SSE4 PTEST took {0:G} and returned {1}", 
		sw.Elapsed, testResult);

	//Restart resets the elapsed time; Start alone would add to the first measurement
	sw.Restart();
	for (int i = 0; i < testCount; i++)
		testResult = TestCLR(buffer1, buffer2);
	sw.Stop();
	Console.WriteLine("C# Test took {0:G} and returned {1}", 
		sw.Elapsed, testResult);

	Console.ReadKey();
}

In my environment, I gained 20% in performance. On some of my projects, I gained up to a 20-fold performance improvement.
The last thing I would like to do is show you how to move the fixed usage from C# to managed C++. That makes your C# code a little cleaner, like the one below:

C#
static int TestCPU(short[] bufferA, short[] bufferB)
{
	SSE4 sse4 = new SSE4();
	return sse4.PTest(bufferA, bufferB);
}

As you can see, it is only a method call. To make this work, we have to add the following code to the SSE4 class in the TestCPUInt.h file:

C++
int PTest(array<__int16>^ bufferA, array<__int16>^ bufferB)
{
	pin_ptr<__int16> pinnedBufferA = 
		&bufferA[0]; // pin the array via its first element
	__int16* pBufferA = 
		(__int16*)pinnedBufferA; // native pointer to the first element
	pin_ptr<__int16> pinnedBufferB = 
		&bufferB[0]; // pin the array via its first element
	__int16* pBufferB = 
		(__int16*)pinnedBufferB; // native pointer to the first element

	SSE4_CPP sse4_cpp; // stack-allocated, so there is nothing to delete
	return sse4_cpp.PTEST(pBufferA, pBufferB);
}

This time, our method takes managed array objects instead of __int16 pointers and pins them in memory, like we did with fixed in C#.
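The unmanaged PTEST method shown earlier copies both buffers into __m128i variables with memcpy. As a variation, the data can be loaded straight from the pointers with the _mm_loadu_si128 intrinsic. The sketch below is an assumption-laden illustration rather than the article's implementation: the function name PTestNoCopy and the macro SKETCH_HAS_SSE41 are hypothetical, the SSE path is compiled only when the compiler signals SSE4.1 support or targets x64 with MSVC, and a plain loop is used otherwise. Both buffers must hold at least eight 16-bit values.

```cpp
#include <cstdint>

#if defined(__SSE4_1__) || defined(_M_X64)
#include <smmintrin.h>       // _mm_testc_si128, also pulls in _mm_loadu_si128
#define SKETCH_HAS_SSE41 1   // hypothetical macro, local to this sketch
#endif

// Returns 1 when every bit set in the 8 values of bufferB is also set in bufferA.
int PTestNoCopy(const std::int16_t* bufferA, const std::int16_t* bufferB)
{
#if defined(SKETCH_HAS_SSE41)
    // Unaligned 128-bit loads directly from the caller's memory.
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(bufferA));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(bufferB));
    return _mm_testc_si128(a, b);
#else
    // Scalar fallback with the same result, used when SSE4.1 is not enabled.
    for (int i = 0; i < 8; ++i)
    {
        if ((bufferA[i] & bufferB[i]) != bufferB[i])
            return 0;
    }
    return 1;
#endif
}
```

Whether this is measurably faster than the memcpy version depends on the compiler and its optimization settings, so it is worth timing both in your own project.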

Conclusion

I believe that no matter how high-level the frameworks we use become, there will always be situations where we have to use our hardware resources more wisely. Sometimes these performance improvements save us large hardware investments.

Please go ahead and look into CPU instructions, parallel computing, GPGPU, etc., and understand the hardware; knowing your tools better will make you a better software architect.


License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
United States
I am an MCAD, MCSD, MCTS, MCPD, MCT and Certified CUDA Programmer.
You can find more technical articles on my blog at http://www.adnanboz.com.
