Head-to-head benchmark: C++ vs .NET

Qwertie

4.64/5 (336 votes)

Jun 17, 2011

LGPL3

56 min read

1305062

2353

How fast is C++ compared to C#? Let's compare code ported directly between the two languages.

Download BenchmarkCppVsDotNetV2.zip - 919 KB

Introduction

I love the .NET Framework, C# and IntelliSense, but a lot of people are still saying it's slower than C++, pointing out that video games and "major applications" are still written in C++. Naysayers often point out that apps like Office, Internet Explorer, and Windows Media Player are still written in C++, apparently implying that C++ is still the best choice for performance. But wait, isn't this mainly because those applications are older than the .NET Framework itself, or because the .NET Framework has lousy support for DirectShow, or less-than-ideal support for Direct3D?

Well, I've been using C# for years and rarely had any complaint about its performance--that is, until I wrote a program for a Pocket PC and was shocked to discover that it required 35 seconds to start up.

With this article, I am attempting to make a fair comparison between the performance of C# and C++ (unmanaged), in both desktop and mobile (Windows CE) scenarios. In the first version of this article, I wrote most of the code in C# first, and manually ported it, line-by-line, to C++, replacing standard .NET classes with STL equivalents (some of the code was ported in the other direction). Because this is labor-intensive, however, the variety and scope of those benchmarks were limited. For this updated version, I added three extra cross-language benchmarks written by two different people, plus a variation on the hashtable benchmark using sorted maps.

My main goal is to measure the performance difference in the compiler/JIT and parts of the standard library that are used most often. I'm largely limiting my tests to the C++ standard library anyway, so there's no way we can compare the speed of things like XML parsing or JPG loading, which the .NET Framework BCL can do but the C++ standard libraries cannot. No, my benchmarks will be limited to the following areas:

String handling: std::string vs System.String
Hashtables: hash_map<K,V> vs Dictionary<K,V>
Binary trees: map<K,V> vs SortedDictionary<K,V>
Simple structures: in my work I often end up creating small performance-critical structures, such as a fixed-point type with a single integer field.
Mathematical generics: You have to go to quite some effort to write math code using .NET Generics. Do you also suffer a performance hit?
Simple arithmetic: add and subtract; multiply, divide and modulo by a constant, for different data types. I also try an integer square root algorithm (and its floating-point equivalent, for completeness).
64-bit integers: some compilers deal with these pretty poorly.
Text file scanning: how fast can we read text files line by line?
Sorting
P/Invoke and no-op methods (C# only)

Updates to this article

The original article compiled the code in Visual Studio 2008; beside those results I have added results for Visual Studio 2010 (including .NET 4) and Mono, so you can see the difference. Microsoft dropped support for Windows CE in VS 2010, which is the reason I used VS 2008 in the first place. Of course, I welcome readers to try compiling the code with other compilers or platforms (such as GCC or Windows Phone 7). Be aware that any differences you observe may be caused by using a different processor as well as a different compiler.

My test workstation is a Quad-core 64-bit Intel Q9500 @ 2.83 GHz (but all tests are single-threaded) on Windows 7. The mobile device is a Mentor Ranger 4 with a 600MHz ARM processor, and .NET Compact Framework 3.5.

Numerous individuals gave the original version of this article a "1" vote. Most of them were either unwilling or unable to state specifically why I deserved the lowest possible score, but let me address some of the specific criticisms. My sincere thanks goes out to those people who found imperfections but didn't punish me for it!

In C++, you didn't disable checked iterators: An oversight. I didn't realize that checked iterators were enabled in Release builds. It turns out that VC9 does two different things to slow down your C++ code. One is called "iterator debugging" (enabled in debug builds only), which can sometimes make code ridiculously slow; the other one, called "iterator checking" or "checked iterators" has a smaller effect, but is enabled in Release builds by default. (One 1-voter reports they are disabled by default in VS2010.) I knew about the first, and conflated it with the second. Sorry. This time they will be disabled in every scenario except one.
You ported the code line-by-line between languages: So? Don't y'all have anything specific to say? Did any of you look at the code? I chose a subset of C++ and C# in such a way that it was perfectly reasonable to do the same task in the same manner in both languages. I received only one specific criticism, about the way GenericSum was written in C++. I tried to explain that C++ compilers are equipped with something called an "optimizer" that can perform a trick called "inlining", which (as my graph showed) causes the benchmark results to be just as good as they would be without the unnecessary function call. But, judging by the votes, some people were not convinced. So this time I have changed the C++ code to eliminate the extra function call, but I was forced to remove one of the C++ tests (which shows what happens to performance when the function is virtual). Of course, there are still two other function calls in there, but I think it's pretty standard C++ code at this point, and the optimizer still exists.
In C++, you didn't enable SSE2 or the "fast" floating point model: True. I must admit, my main goal was to benchmark ARM performance, which offers no magic compiler switches to make code run faster. This time I will do separate VC9 tests, once in x86 with default compiler switches and again with architecture=SSE2, floating point model="fast", and no checked iterators, so y'all can see the difference, and a third time with x64 and all bells and whistles. I still think it's reasonable to benchmark without SSE2, since some older computers cannot run an executable that relies on SSE2 (by the way, I have not been able to determine when non-SSE2 processors stopped being manufactured or what percent marketshare SSE2 has.) However, as far as I know all x64 systems have SSE2 (correct me if I'm wrong!). So when you make x64 builds, you may as well enable SSE2. Likewise if you're limiting yourself to current-gen systems (or if you distribute separate builds of your program for older PCs), it is a good idea to enable SSE2.
You didn't use VS2010: you gave me a 1 just for that? Eww. Fine, I'll benchmark VS2010 too. And I'll enable SSE2 and the fast floating point model for all VC10 tests. But you are uninvited from my birthday party.
What about profile guided optimization (PGO): Well, I'm now up to 6 different sets of C++ benchmark results, and I'm not sure how realistic PGO results would be. After all, the input data is the same in every trial, and some of the benchmarks are artificial. So, PGO may not provide the same performance increase on these benchmarks as on real-world code. I have a more realistic (but closed source) C++ benchmark I could run with and without PGO; leave a comment if interested, or simply run PGO on the existing benchmarks yourself.
Compare languages and not implementation of their libraries: As if everyone writes their own hashtables, lists, file management code and string classes from scratch! And don't forget, memory management is part of the standard libraries; are memory allocations are off-limits too? Look, some of my benchmarks did not use any standard libraries, and most make few or no allocations. Ignore the ones that use standard libraries, if they make you uncomfortable for some reason.
Use more complex real life algorithms: You want me to write a large and complex real-life algorithm twice in two different languages? Sure! Just name the algorithm and tell me how much you'll pay for my time.... Tell you what, I found some 3rd-party multi-language benchmarks. They are not ultra-complex, but is it okay if I use those?

Some non-1-voters suggested that ifstream should be avoided in C++ due to its poor performance, and a poor design that prevents high performance. Given the results, they were clearly correct! For this benchmark I added FILE* tests (not direct Win32 API calls, since I prefer portable code.)

There were some other 1-vote criticisms which I couldn't possibly use to improve the article, e.g.:

...Technical fellows at Microsoft (highest technical position in the company!) say that .NET is not conducive to system programming: So don't write your next OS in C#. What does this have to do with me?
Too subjective: why, because I admit to liking C#? So what? Can you name anything specifically wrong with my code?
Serious problems in the methodology: specifically?!?
Your article contains a large number of false and misleading statements: I asked you to name one, but you didn't.
A point for effort: Riiiight.
You really have a twisted mind: I tried not to let it affect the results. I would vote for the Ron Paul/Dennis Kucinich ticket, though.
It is pretty well known that native code is faster: Oh, of course, it was wrong of me to measure the difference! Well, I'm sorry if my CPU betrayed your trust. But my results showed that C++ was slightly faster in general (if you choose good-quality C++ libraries), and dramatically faster on ARM. So we agree with each other, and you gave me a 1 anyway! Plus, C++ wins the new Sudoku test by a wide margin, and I found some new causes of sluggishness in the x64 JIT. Does this mean the C++ lovers will now give me a 5, and C# lovers will suddenly switch their vote to 1?
Marketing stuff from Microsoft?: No.
Please consider the numerous comments and update your benchmark: Please consider the numerous updates and update your vote.
and my favorite, learning C++ might be a good starting point before trying to use it in a benchmark: I've been programming in C++ since I bought my first PC in 1994. Among other things, I wrote a SNES emulator and a GPS system for ARM with custom-optimized drawing primitives, from scratch in C++. Next I suppose you'll tell me I'm not a "real" C++ programmer unless I write my code in assembly...

Here's hoping for a more reasonable discussion this time around.

Before I start, I should mention that there's no easy way to match up garbage collections with benchmarks, so I do a GC before every C# benchmark, and I do not attempt to include that GC time in the totals. I would have liked to print out total GC time at the end, but apparently it is not possible. There is no property or performance counter for total GC time; there is an "instantaneous" measurement for "% GC time" but no obvious way to aggregate it over time. I could disable inter-test GCs, but then some GC time would be misattributed to incorrect places. I could add forced-GC time to the measured time of each test, but that would artificially inflate the C# time numbers (since forced gen-2 collections waste time compared to "natural" GCs). However, GC time shouldn't matter much here, because most of these benchmarks do not allocate a large number of heap objects (except the tests involving strings and SortedDictionary).

Test scenarios

It has become clear that when it comes to desktop software, the target CPU, target JIT (in C#) and compiler switches (in C++) can sometimes affect performance in a big way. So for this new version of the article, I'll be testing several different scenarios (11 in total) so you can see the difference that compiler switches make. I could test more scenarios, but the graph is getting crowded.

"x86 VC9 default": optimized Release build with default compiler switches, x86, VC++ 2008
"x86 VC9 NCI x86": no checked iterators, Release build, default compiler switches, x86, VC++ 2008
"x86 VC9 SSE2 x86": Release build, no checked iterators, SSE2, fast FP model, full optimization, x86
"x64 VC9 SSE2 x64": Release build, no checked iterators, SSE2, fast FP model, full optimization, x64
"x86 VC10 SSE2 x64": Release build, no checked iterators, SSE2, fast FP model, full optimization, x64, VC++ 2010
"x64 VC10 SSE2 x64": Release build, no checked iterators, SSE2, fast FP model, full optimization, x64, VC++ 2010
"x86 Mono": Built in VS2008, run with Mono with default compiler switches
"x86 .NET3": VS2008, Microsoft .NET Framework 3.5, x86 (JIT is the same as .NET 2.0)
"x64 .NET3": VS2008, Microsoft .NET Framework 3.5, x64 (JIT is the same as .NET 2.0)
"x86 .NET4": VS2010, Microsoft .NET Framework 4.0, x64
"x64 .NET4": VS2010, Microsoft .NET Framework 4.0, x64

I also benchmarked with the "VC9 NCI x86" settings plus /Ox (Full Optimization) to see if it was significantly different from the default /O2 (Maximize speed). The difference was tiny in every benchmark, and not worth including as a separate bar on the bar charts.

A couple of people wanted to know whether Mono would be faster if the LLVM JIT (--llvm command-line switch) would be faster. I found that LLVM made no significant difference in the current version for Windows (Mono 2.10.2). Some tests ran slightly faster; the majority ran slightly slower. And there was only a small difference between the first run (when JIT occurs) and subsequent runs, indicating that JIT time is not the reason for the lack of speed increase. The difference with and without --llvm was small enough that I felt it was not worthwhile to include it in the bar charts.

I'll get to the "real" benchmarking in a minute. But first...

Debug versus Release builds

Some C++ developers end up distributing Debug builds so that "assert" statements continue to work, or so that they can debug the production machine if they have to. Unfortunately, this tends to kill performance, as the graph below shows

(Note: I did not update these graphs for the new article. I don't think new graphs would teach us anything new about Debug builds.)

BenchmarkCppVsDotNet/DebugVsReleaseC__.png

In this graph, the scale has been adjusted so that the x86 Release build takes one unit of time. Some operations, most notably function calls and anything involving STL, are dramatically slower in desktop Debug builds. I suspect "Generic sum" is so much slower mainly because the STL vector it scans is slower; and you may find that hash_map is virtually unusable in a debug build, as you'll see.

When I first ran the x86 Debug benchmark, I noticed that it seemed to be "hung". I stopped it in the debugger and found it was running the "string hash_map: 1 Adding items" test. This test fills the hashtable with 2.5 million items, clears the list, and fills it again three more times. Well, when inserting the first item after a clear() command, hash_map gets stuck doing who-knows-what for around 4 minutes, for a total runtime of 16 minutes and 20 seconds (78 times longer than the Release build). I left it running overnight, and discovered that the removal test is even worse. This test tries to remove 10 million items, but the list size is limited to 2.5 million so most of those requests fail. For whatever reason, this test took 10 hours and 37 minutes, which is 9400 times longer than the Release build!

This turned out to be caused by the Visual C++ "iterator debugging" mis-feature. #define _HAS_ITERATOR_DEBUGGING 0 makes it realistically possible to run the benchmark, but the hashtable tests are still extremely slow in debug builds.

This graph also shows that the same piece of code can run at many different speeds depending on your platform and compiler settings. For example, some tasks, especially those involving 64-bit integers (or to a lesser extent, floating point), run faster when targeting x64. Once in awhile, x64 handles a task more slowly (for some reason, reading a file with ifstream::read() is much slower in 64-bit).

The C# results are more consistent:

BenchmarkCppVsDotNet/DebugVsReleaseC_.png

In fact, only a few operations are significantly slower in a C# Debug build compared to a Release build. One reason for this is that the .NET standard library, known as the BCL (Base Class Library), is optimized even when you are running a Debug build. This explains why the speed of the Dictionary<K,V>, file and string parsing tests (which mainly stress the BCL) are almost the same in Debug and Release builds. Additional optimizations seem to be enabled when C# Debug builds run outside the debugger, as these benchmarks did. In fact, the C# x86 Debug build outperformed the C++ x86 Debug build in all tests except two.

But of course, what we really care about is Release builds. The rest of this article will examine Release builds only (using Visual Studio's default compiler settings).

3rd-party benchmarks

To add more "realism" to my benchmark, I incorporated three benchmarks from two other people. I refactored their code slightly, to make the code fit better into my benchmarking framework and (in two benchmarks) to improve the C# version by eliminating 2-dimensional arrays, because performance-sensitive C# code should avoid them. I did not have time to prepare ARM bar charts for these new tests (or any other tests, for that matter).

The first two benchmarks I added were written by a Mr. Attractive Chaos. There are actually 4 benchmarks in his test, but one tests dictionaries, and I already have a benchmark for that, and one tests a particular C RegEx library, whereas I'd prefer to stick with standard libraries here. The first benchmark I kept is a straightforward O(n^3) matrix multiplication, and the second is a high-quality but (to me) hard-to-follow Sudoku solver. Then again, I've never played Sudoku.

Both benchmarks were originally written to use 2-dimensional arrays in C#. Now, writing about the matrix test, the author stated: "this is the first time I write C#". More seasoned C# developers have learned that .NET's multidimensional arrays are no good for performance. They weren't designed to be fast. Instead, it may be better to simulate 2D arrays using either with 1D arrays or "jagged" 2D arrays (double[][] instead of double[,]). The easiest way to eliminate a multidimensional array from your code is to introduce a wrapper such as this:

public struct Array2D<T> // use "class" if you prefer
{
  readonly T[] _a;
  readonly int _cols, _rows;
  
  public Array2D(int rows, int cols)
  {
    _a = new T[rows * cols];
    _cols = cols;
    _rows = rows;
  }
  public Array2D(T[] array, int cols)
  {
    _a = array;
    _cols = cols;
    _rows = _a.Length / cols;
  }
  
  public T[] InternalArray { get { return _a; } }
  public int Cols { get { return _cols; } }
  public int Rows { get { return _rows; } }
  
  // Add a range check on "col" if you want to sacrifice performance for safety
  public T this[int row, int col]
  {
    get { return _a[row * _cols + col]; }
    set { _a[row * _cols + col] = value; }
  }
};

In some conditions, the wrapper will be much faster than a standard .NET 2-dimensional array (type[n,n]), but as the benchmark will show, this is no longer a reliable optimization, and you'll often get better results from a jagged array.

The matrix multiplication test

In this benchmark, two monster 1000x1000 matrixes are multiplied. The C-style multiplication routine is straightforward and hopefully uncontroversial; the inner loop is optimized with "a_i" and "c_j" to avoid unnecessary multiplications, and the "b" matrix is transposed so it can be accessed sequentially. The original version could multiply only square matrixes, but I took the liberty of extending it to handle rectangular ones (untested, mind you).

double* MultiplyMatrix(double* a, double* b, int aRows, int aCols, int bCols)
{
  // int m = aRows, n = bCols, p = aCols; 
  // Note: as is conventional in C#/C/C++, a and b are in row-major order.
  // Note: bRows (the number of rows in b) must equal aCols.
  int bRows = aCols;
  double* x = new double[aRows * bCols]; // result
  double* c = new double[bRows * bCols];

  for (int i = 0; i < aCols; ++i) // transpose (large-matrix optimization)
    for (int j = 0; j < bCols; ++j)
      c[j*bRows + i] = b[i*bCols + j];

  for (int i = 0; i < aRows; ++i) {
    double* a_i = &a[i*aCols];
    for (int j = 0; j < bCols; ++j)
    {
      double* c_j = &c[j*bRows];
      double s = 0.0;
      for (int k = 0; k < aCols; ++k)
        s += a_i[k] * c_j[k];
      x[i*bCols + j] = s;
    }
  }
  delete[] c;
  return x;
}

C# variations

In C# there are various ways the same task could be accomplished. The original version used a multidimensional array (double[n,n]); I added three more methods, in increasing order by speed:

Array2D<double>: As described above, a single-dimensional array that acts two-dimensional.
double[n][n]: A "jagged" array, which is simply an array of arrays. Thus, it requires more memory (16 extra bytes per row compared to Array2D<double> under x86), and more initialization time, but can be faster overall.
double[n*n]: A single-dimensional array accessed via pointer arithmetic, just like the C++ version. This obtains good performance without the need for extra memory.

The method you choose makes a big difference:

As you can see, the version based on pointer arithmetic (double[n*n]) is the fastest, and the jagged (double[n][n]) version is just barely slower.

When you look at x86 results (ignoring Mono), double[,] is clearly the slowest at about 5.8 seconds. Array2D is nearly twice as fast at 3.2 seconds. And double[n][n] is twice as fast again, at 1.6 seconds. However, the x64 results are downright weird. Evidently Microsoft optimized multidimensional arrays for x64, so they are even faster than Array2D. At the same time, there is a "glitch" in x64/.NET4 that makes Array2D slow on that specific version of the JIT.

Note: from now on when I refer to "x86", I'm excluding Mono unless otherwise noted. Mono isn't doing very well so far, and it will continue to do poorly for a lot of these tests. In fact, originally I used a dark gray color for Mono. But those extremely long Mono bars were becoming a distraction, so I changed it to light gray. Like I said, if you don't use Mono, just ignore those long gray bars. If you do use Mono, well, you see that Array2D bar that goes right off the end of the chart? It's 11.75 seconds.

The one constant is that the jagged array is faster than the multidimensional array. I haven't figured out how to look at the assembly to see what's going on yet, but I can think of three reasons why the jagged array would be faster. First, .NET does range checks on all array accesses, except inside a for-loop that iterates from 0 to the end of the array, which the jagged implementation is (mostly) able to do. Second, when using a 2D array, a JIT that is not smart will fail to factor out the multiplication from the inner loop. Third, even if it does factor out the multiplication, it may fail to precompute a pointer to the part of the array being examined (indeed, the GC may make it hard to do that). In contrast, the jagged version needs no multiplications, and we can explicitly cache a reference to the current row of the matrix, like the C++ version does.

The version that uses pointers may avoid a single additional range check in the inner loop, so it should be slightly faster.

This whole discussion points to a general problem with C# performance compared to C++: .NET JITs aren't as good at optimization as C++ compilers. This is natural, because JITs execute at run-time so they can't optimize as aggressively (for fear that the program will start too slowly). In the future one could imagine smarter JITs that produce a slow version of the code and then (after a loop has executed several thousand times) take a time-out to optimize the code as well as a C++ compiler. They already have that kind of optimization technique for interface dispatch. But for everything else? No.

So, if you need fast C# code, you may have to optimize it yourself: manually factor out expressions from the inner loop, switch to jagged arrays, or even introduce pointers. Manual optimization can be a huge pain though, as in this case where the inner-loop multiplication is hidden inside a call to the indexer. So instead you could try NGEN, the results of which have not been included in this article. Maybe next time.

Generics vs templates

Now, the main matrix multiply test uses double arithmetic, but not all programmers are concerned primarily with the speed of double arithmetic. Therefore, I also wrote generic versions of the tests, so you could see the effect of different data types, specifically int and float.

In the first version of this article I tested C# generics to see if the JIT could optimize them as well as a C++ compiler optimizes templates. The results seemed to say yes, but the test was very simple. For this extension of the matrix multiplication test I compared two data types (double and int) with 5 different JIT engines, and in several cases the JIT botches the job. Have a look (<angle brackets> represent generics):

(Note: I threw in a generic float test so you can compare with the C++ float test, but I was too lazy to manually write its non-generic version.)

When you look at the results for x86 using integers, they are fine: generic <int> is exactly the same speed as non-generic int. However, if you look at the results for x64/.NET3, you'll see that the generic double test takes 2.14x longer, and the generic int test takes 3.45x longer. The .NET 4 JIT isn't so bad, but it flubs the integer test a little. And Mono seriously bombs the generic tests, especially involving floating point (the result is 9.7 seconds for <double>, and 11.3 seconds for <float>.)

C++, on the other hand, reveals absolutely no difference between the template <double> version and the original non-template version:

Notice that the bars for double[n*n] and <double>[n*n] are exactly the same. The C++ compiler evaluates templates at compile-time using token substitution, so there is no reason to expect any difference. Therefore, I didn't bother with a non-template int or float version; the numbers are just there so you can compare them with C# if you want.

Note: for the generic tests I eliminated fractional numbers from the matrix in case the data type is "int". Integer overflows do occur during the multiplication, but hopefully this does not mess up the results.

C# vs C++

Setting aside the issue of generics, and avoiding those unpredictable 2D arrays, let's see how C# compares with C++. This chart shows the original C++ version and its direct C# equivalent, plus the [n][n] version which is more appropriate in C#:

The results are clear. First of all, C++ wins in every case. If you use the x86 JIT, C# is only a little slower than C++. But the x64 and Mono JITs perform quite poorly on this test. Note that the jagged array version (double[n][n]) is the most reasonable way to do large-matrix multiplication in C#, because pointer arithmetic is frowned upon in C# culture. You only loose a little speed by using jagged-array version instead of the pointer arithmetic version, but the memory cost (16B per row and slower GCs) should be kept in mind.

Sudoku

The code of this, the second benchmark by Attractive Chaos, is too long to post here. Basically, it involves a lot of array accesses, if-statements and loops, but very little arithmetic and no floating-point. Thus, this test stresses a compiler/JIT's ability to optimize array access, register allocation, and flow control.

This test uses numerous one-dimensional arrays and two two-dimensional arrays. As before, I switched the C# 2D arrays with Array2D wrappers, which increased x86 C# speed from 6 seconds to 5.2 seconds, but might have hurt the x64 speed (I didn't check). here are the results:

In theory, C++ should have an edge in this test because it uses arrays heavily, and it does not simply scan arrays from beginning to end like the Matrix test does. In cases like this, .NET inserts a range check for all array accesses. Also, C++ supports only fixed-size minor array dimensions, while C# supports only dynamic minor array dimensions, so, for example, .NET can't optimize the code on the knowledge that the array called "C" is always 16 bytes per row.

Anyway, VC++ wins big this time, typically running twice as fast as .NET, if not more (and Mono brings up the rear again, at 8.6 seconds). My experience suggests that array range checks can't account for such a big difference, but investigation is hindered because I still don't know how to look at C#'s optimized assembly. Go ahead and laugh at me. Ha!

Polynomials

The third benchmark is by A.D. Corlan. It's a very simple test which computes the value of a 100-degree polynomial using float math.

The first time I ran this "Polynomials" benchmark in C# and C++ (VS 2008, x86), the result was that C++ finished in 0.034 seconds, and C# in 7.085 seconds, making C++ was 208 times faster! Of course, other than the C++ DEBUG hash_map test that took over 10 hours, this was the most extreme result I had ever seen. The C++ result seemed impossible: the outer loop does 10 million iterations and the two inner loops do 100 iterations each, for a total of 2 billion iterations. There was no way so much work could be done in 0.034 seconds on a 2.83 GHz processor. Indeed, looking at the assembly of the outer loop, you can see that the compiler had done something amazing:

		for(int i=0; i<Iterations; i++)
			pu += dopoly(x);
00CCB740  push        ecx  
00CCB741  fld         dword ptr [__real@3e4ccccd (0CD2D54h)] 
00CCB747  mov         dword ptr [esp+40h],0 
00CCB74F  fstp        dword ptr [esp] 
00CCB752  call        Polynomials::dopoly (0CCB650h) 
00CCB757  fstp        dword ptr [esp+40h] 
00CCB75B  fld         dword ptr [esp+40h] 
00CCB75F  add         esp,4 
00CCB762  mov         eax,989680h 
00CCB767  sub         eax,1 
00CCB76A  fld         st(0) 
00CCB76C  fadd        dword ptr [esp+38h] 
00CCB770  fstp        dword ptr [esp+38h] 
00CCB774  jne         Polynomials::Test+37h (0CCB767h)

Notice that jne branches to 0x0CCB767, which is after the call to Polynomials::dopoly; the function is called only once! The C++ optimizer somehow figured out that dopoly() is functional--that is, it always returns the same result given the same input--so it factored out the function call from the loop!

While I am truly impressed with the optimizer's cleverness, this just won't do for a benchmark. The idea was to figure out how long 10 million calls to dopoly() would take, not just one call. So I made a small change: there is an array declared on the stack inside dopoly(). If I move that array to the calling function instead, and pass it in as a parameter, the optimizer no longer believes it can factor out dopoly(), so it is called 10 million times as desired. This change increases running time by a factor of 289, from 0.034 seconds to 9.813 seconds.

Anyway, here's the graph:

The average C++ SSE2 version completes in about 24% less time than the average C# version (ignoring Mono's sad 13 seconds.) The C++ x87 version, though, is pretty slow. While I was investigating the "clever optimizer" issue above, I think I saw the reason why. I'm no x87 expert, but look at this. The compiler unrolled the second loop of dopoly() 10 times, but the unrolled loop looks like this:

		for (j=0; j<100; j++)
			s = x*s + pol[j];
000DB696  fld         dword ptr [esp] 
000DB699  add         eax,28h 
000DB69C  sub         ecx,1 
000DB69F  fmul        st,st(1) 
000DB6A1  fadd        dword ptr [eax-30h] 
000DB6A4  fstp        dword ptr [esp] 
000DB6A7  fld         dword ptr [esp] 
000DB6AA  fmul        st,st(1) 
000DB6AC  fadd        dword ptr [eax-2Ch] 
000DB6AF  fstp        dword ptr [esp] 
000DB6B2  fld         dword ptr [esp] 
000DB6B5  fmul        st,st(1) 
000DB6B7  fadd        dword ptr [eax-28h] 
000DB6BA  fstp        dword ptr [esp] 
000DB6BD  fld         dword ptr [esp] 
000DB6C0  fmul        st,st(1) 
000DB6C2  fadd        dword ptr [eax-24h] 
000DB6C5  fstp        dword ptr [esp] 
...

It's repeatedly storing and loading the loop variable, s. Oops!

C++ vs C#: Hash tables

This test pits two types of .NET Dictionary against the equivalent Microsoft C++ hash_maps (or unordered_map in VS2010). First I tried an <int, int> hashtable, then a <string, string> hashtable.

The string tests use the same keys as the integer tests, but converted to strings. Each test converts 10 million integers to strings (which is all that test "0" does), so you may want to mentally subtract test "0" from all the others.

The results may surprise you:

With an occasional exception, C# wins by a landslide! One particular C++ scenario, x86 VC10 with SSE2, runs better than the others and when handling strings, outperforms C#. However, in most situations VC++ loses big time. The first test, for example, runs over 9 times faster in .NET4 x64 compared to C++ VC10 SSE2.

Why is the difference so large? Well, I suspect Microsoft's (Dinkumware) hash_map is simply terrible. I don't know how it's implemented; the code is so ugly I can't stand to even look at it. Honestly, who indents code like this?

		// TEMPLATE CLASS _Hash
template<class _Traits>
	class _Hash
		: public _Traits	// traits serves as base class
	{	// hash table -- list with vector of iterators for quick access
public:
	typedef _Hash<_Traits> _Myt;
	typedef typename _Traits::_Val_type _Val_type;

	typedef typename _Traits::key_type key_type;
	typedef typename _Traits::key_compare key_compare;
	typedef typename _Traits::value_compare value_compare;
	enum
		{	// various constants
		_Bucket_size = key_compare::bucket_size,
		_Min_buckets = 8,	// min_buckets = 2 ^^ N, 0 < N
		_Multi = _Traits::_Multi};
	typedef list<typename _Traits::value_type,
		typename _Traits::allocator_type> _Mylist;

At least they added some comments to the VS 2010 version.

Anyway, I can say with certainty that it's not the C++ compiler's fault. I know this because I wrote my own version of hash_map for my company, and my implementation runs roughly ten times faster (with int keys). I knew I got better performance from my hashtable, but I didn't realize how dramatic the difference could be until I saw the numbers. Here are the results based on my implementation:

As you can see, C++ is much more competitive now. It wins some races, and loses others. My hash_map is heavily inspired by the design of .NET's Dictionary, and I am unable to guess why my version outperformed C# while doing int queries and removals, yet not inserts.

When it comes to strings, it is worth noticing first of all that they are dramatically slower than integers, regardless of the programming language you use, and for longtime programmers this will come as no surprise. I believe C++ is at a fundamental disadvantage whenever strings are stored in collections. The reason is that C# uses immutable strings, and no work is required to construct or destruct a "copy" of a string. The C++ STL, however, uses mutable strings, which require extra work to copy a string--depending on how std::string is designed, either bookkeeping is required to track how many std::strings point to the same string (this approach makes modifying a string more expensive), or the entire string must be copied (so changing a string is cheaper, but copying is more expensive). In any case, when adding a string like "12345" to the hashtable, some work must be done to make a copy.

On the other hand, the STL does have a slight performance advantage, in that its template classes are specialized and optimized for every specific data type at compile-time. Thus, the compiler can specifically optimize hash_map<string,string> for strings. In contrast, the .NET Framework specializes generics at run-time if they are based on value types (such as Dictionary<int,int>) but not if they are based on reference types (such as Dictionary<string,string>). Thus, there is a slight overhead when the .NET hashtable calls string.GetHashCode and string.Equals. In general I believe the C# design is a better balance, since it avoids code bloat--some types of Dictionaries can share the same machine language code--while still offering higher performance for simple types like "int" and user-defined structures.

Finally, let's see how the Compact Framework performed against VC++ for ARM:

BenchmarkCppVsDotNet/HashtablesArm.png

Yikes! In the integer case, the hashtable queries and removals are over 3 times faster in C++, although insertion is still mysteriously slower. The string tests are where things get bad, and somehow I am not surprised. The .NETCF program I mentioned at the start, the one that needed 35 seconds to start, was busy parsing strings and adding them to a list. String test 1 runs 3.5x slower in C#; test 2 is 9.1x slower; and test 3 is 7 times slower.

Disclaimer: the above tests use sequential integer keys. Therefore the hashtables enjoy virtually no collisions, which means that these test results are probably better than real-world results.

C# vs C++: Sorted maps

Someone pointed out that map<K,V> is more commonly used in C++ than hash_map<K,V>. That's true, because for the longest time C++ didn't offer a hashtable. Now in theory a hashtable should be faster, because it should support O(1) constant-time operations, in contrast to a map which is sorted and necessarily requires O(log N)-time operations.

Strangely, however, the MS STL's map<K,V> is actually faster than its hashtable (hash_map or unordered_map), although not as fast as my hash_map. In the .NET world, SortedDictionary is slower than Dictionary as you would expect... but does it have to be this much slower?

C++ wins by a landslide! C++'s map<K,V> is quite fast, while C#'s SortedDictionary<K,V> is quite slow. Now, I've partly developed a C# B+tree, which may end up being faster than SortedDictionary; leave a comment if interested (I suspect nothing helps me write code like peer pressure!)

C# vs C++: Generic sum

C++ is still widely used for numerical applications--apps in which most of the time is spent in the application code, not the standard library or waiting on I/O--because its compilers are considered very good at optimization. As a bonus, C++ template programming lets you write a numerical algorithm that supports multiple numeric data types.

The generic sum test in C# is based on an enhanced version of a library written in 2004 by Rüdiger Klaehn, which observes that with some extra developer effort, it should be possible (although not especially easy) to write and use C# generics to write fast math code that supports arbitrary number types.

Earlier in this article I talked about the generic version of the matrix multiplication benchmark and the fact that C# doesn't always optimize generics as well as you'd hope. I actually wrote this sum benchmark for the original version of this article, and it may seem a little redundant in version 2, but there are still some interesting observations we can make. For instance, I'm testing fixed-point structures here, which I rely on heavily in my C++ ARM code.

In addition to int, int64, float and double, I tested with FPI8 and FPL16. These are fixed-point data types that I created myself. Many, if not most, mobile devices have poor floating-point support, so I often use fixed-point instead of floating-point on such devices. A fixed-point type in C++ or C# is simply a struct with an integer inside it (FPI8 contains an int, FPL16 contains a 64-bit int). Some of the low bits are treated as fractional. The "8" in FPI8 means there are 8 fraction bits, so FPI8 can store numbers between -16777216 and 16777215.996, with a resolution of about 0.004. The fact that a "struct" is involved is an important part of the benchmark: historically, compilers could not handle a structure that contains a single value as well as that value by itself (outside the structure), but these days most compilers do a good job.

I tested the following generic C# sum method:

public static T GenericSum<T, Math>(List<T> list, Math M) where Math : IMath<T>
{
	T sum = M.Zero;
	for (int i = 0; i < list.Count; i++)
		sum = M.Add(sum, list[i]);
	return sum;
}

Of course, in C++ the natural implementation is simpler:

template<typename T>
T GenericSum(const vector<T>& list)
{
	T sum = (T)0;
	for(int i = 0; i < (int)list.size(); i++)
		sum += list[i];
	return sum; 
}

However, many C++ template libraries (Boost::GIL comes to mind) rely heavily on the compiler to inline tiny functions and eliminate empty structures, so at first I directly ported the convoluted C# code to C++ with the confidence that the compiler would make it fast. However, in an effort to avoid future hooliganism from certain 1-voters, I changed the code to what you see above (although I had to eliminate a virtual-function test as a side-effect). There are still two function calls there that the 1-voters probably didn't think twice about. You could eliminate them by using raw arrays instead of the STL, but why bother? The compiler does optimize it. Let it do its job.

Because adding a number is such simple work, I did 10x as many iterations for this test as for most others. For reference, I also included a non-generic integer-summing method in both languages. Here are the desktop results:

Note: I'll give the ARM results down below.

Note: I just noticed that I forgot to do this test with FPI16. Sorry, I don't have time to re-run all 11 benchmarks.

Funny story. While I was preparing version 2 of this article, I noticed something weird about all the C++ Generic Sum test results: they were all 0.000! WTF? In the debugger, it became clear that the optimizer had completely eliminated the loop that called GenericSum()! This was a reasonable thing for the optimizer to do, because the outer loop ignored the return value. But the puzzle was, why did the optimizer's behavior change? One of my tests, the non-generic sum test (which operates on vector<int> instead of <T>), had not been changed. I downloaded the old code and checked to be sure. No, the non-generic sum code was exactly the same as before, but now it ran in zero time. What had changed? I was using the same compiler (VC9). I checked all the compiler switches--identical! At that point I literally gave up and went home.

The next morning it occurred to me. Checked iterators! I had disabled them, but I had not noticed for several days that my GenericSum results were zero. Why not? Well, the benchmarks were run in random order, but I failed to randomize the random number seed. As luck would have it, the first GenericSum test came quite late among the benchmarks and, apparently, I just never ran the program long enough to see the results!

Anyway, somehow, disabling checked iterators enabled the optimizer to delete my benchmark completely. To fix this, I changed the outer loop to compute the sum of all sums and return it, so that the optimizer couldn't be so aggressive. Then I had a new "problem": the benchmark ran incredibly fast, less than 0.05 seconds for 100 million additions. What was that about? I checked the assembly. This time the results turned out to be legitimate. Disabling checked iterators more than doubled the speed of this benchmark.

By disabling checked iterators, you eliminate the range check on vector accesses. Apparently, this lets the optimizer be more agressive and the test runs really, really fast. Thus, provided checked iterators are off, the C++ version totally blows away C#. However, if you look at the "C++ x86 VC9 default" bar, you'll see what happens if you don't turn them off. In that case, some of the C# tests aren't that much worse than C++.

Something to notice about the C# results is that they are highly variable. The Microsoft x86 JITs don't do too badly, provided that you are adding up integers or FPI8 structures; but as soon as you try double, all 5 of the JITs suck. The x64 .NET3 JIT is especially bad for all data types, and both x64 JITs do worse than their x86 counterparts. And, for some reason, Mono bombs every test including the non-generic one. Hmm, I suspect Mono has some trouble optimizing code that involves structures, since all generic math relies on a special "Math" structure, and it handles the FPI8 structure much worse than plain "int" (the Mono FPI8 test, which overflows the chart range, takes 1.57 seconds.).

Although C++ wins this test by a wide margin, it should be noted that most of the C++ tests are executed with no range check on the array accesses, while the C# tests are run with a range check. I did a quick test to see what happens if NonGenericSum is written to use int[] instead of List<int>:

In this case .NET optimizes out the range check, and C# runs more than twice as fast (more than four times as fast in x64!) and not much slower than the C++ version. However, it should be noted that in real life you can't always use an array instead of List<T>, nor can you always loop through the entire array, and when you don't, .NET won't optimize out the range check. The only other way to eliminate the range check is to rewrite the code to access the array through a pointer.

In C++ it's much more convenient, since you can use the STL and just #define _SECURE_SCL 0; however, this also carries a disadvantage that range checks are disabled throughout the entire program (note that it's not legal to change _SECURE_SCL per-cpp-file, which would violate the One Definition Rule). Hey Microsoft, if you're listening, you could solve this problem in C# with a special intrinsic to perform the range-check explicitly before the loop:

Array.RangeCheck(array1, a, b); // JIT would see this as an optimization directive
Array.RangeCheck(array2, a, b);
for (int i = a; i < b; i++) 
    array1[i] = array2[i]; // No range checks
for (int i = b-1; i >= a; i--)
    ... // No range checks in a reverse loop either

Combined with a smart optimizer and a List<T>.RangeCheck(), List<T> could achieve high speeds too. And that's for verifiable code; within "unsafe" methods, a special attribute could disable all range checks. Of course, MS is probably not listening to me. But if you are, I demand to know why MS .NET still lacks SIMD structures! (VC++ has them, although they are foolishly processor-specific.) But I digress...

C# vs C++: Simple Arithmetic

The "Simple arithmetic" test, admittedly, was a little too simple; two compilers of generally similar quality may optimize small pieces of code differently, so you shouldn't trust these results too much. This is what I did for each numeric type:

double total = 0;
for (int i = 0; i < Iterations; i++)
	total = total - total / 4.0 + (double)i % 100000 * 3.0;

Part of this test is to see whether the compilers can optimize multiplication and modulus by a constant. Here are the desktop results:

There's no clear winner for this test. Generally, C# did a better job in the "double" test (even when C++ is allowed to use SSE2), but a worse job in the FPL8 and FPL16 tests. C# is tied with C++ in the float test, until you enable SSE which causes C++ to win. In the int64 and FPL16 tests, the instruction set (x64 or x86) has a larger effect than the programming language. And then there's Mono. When it comes to handling int64 or the FPI8/FPI16 structures, it does pretty badly, but in the double test it does better than everybody else. What gives, Mono?

Square roots

I bet you've never tried calculating a square root without any floating-point math. But on ARM it's the preferred way. Here's the algorithm:

public static uint Sqrt(uint value)
{
	if (value == 0)
		return 0;

	uint g = 0;
	int bshft = Log2Floor(value) >> 1;
	uint b = 1u << bshft;
	do {
		uint temp = (g + g + b) << bshft;
		if (value >= temp)
		{
			g += b;
			value -= temp;
		}
		b >>= 1;
	} while (bshft-- > 0);
	return g;
}

It's a pretty typical mix of arithmetic and flow control. For reference, I've included the standard double square root (for which x87 and SSE2 offer specialized instructions). Here are the desktop results:

C++ wins this test, but not by a wide margin (unless you count Mono). In the 64-bit tests, the instruction set (x86 or x64) matters more than the programming language.

Notice that "Square root: FPL16" is slower than uint64 (ulong), in both C++ and C#. The reason for this is not that ulong is wrapped in a FPL16 structure. Rather, the reason is that FPL16 has 16 fraction bits and uint64 has none, so the square root algorithm (which operates on a raw uint64) requires more iterations to compute the square root.

ARM arithmetic test

Like I said, I didn't update the ARM graphs for V2 of this article. It didn't feel like there was much point, since the results are so conclusive:

BenchmarkCppVsDotNet/NumericCodeArm.png

Unfortunately for me, the Compact Framework performs dramatically worse in all three arithmetic tests.

For this graph I added ratio-bars to the right side, so you can see just how badly the CF is doing (not easy; I had to work around a bug in Excel.)

In fact, C++ beats the Compact Framework by a huge margin on most of these tests. The "best" results (where .NETCF is only slightly slower) involve the double floating point type, but this is only because most of the time is spent in the system floating-point emulation library, regardless of your programming language. This is a huge disappointment; it suggests that the Compact Framework is incapable of doing math or algorithms quickly, and the results are even worse when using FPI8 or FPL16. Plus, it might not optimize generics very well; notice that the C# "int (non-generic)" test is faster than the "int" (generic) test.

File reading, parsing and sorting

In this test I try out two different ways of reading a 400KB text file with about 17000 lines: either by reading the entire file at once into a buffer, or by reading it line-by-line. The file has been loaded before so that it is in the disk cache. This benchmark does not measure disk performance, nor do I want it to.

In C++ I originally used a std::ifstream, but it turned out that ifstream is slow, so I've added good old FILE* as an alternative in this second edition of the benchmark. In C#, I used System.IO.File wrapped in a StreamReader.

The text file contains "dictionary entries": key-value pairs separated by ":=". The value can span multiple lines; if a line does not contain := then it is assumed to be a continuation of the value on the previous line. The dictionary entries are collected into a C# List or C++ vector, then randomized a bit (to ensure sorting requires some work) and sorted by key in a case-insensitive manner. First, let's see how fast the two languages can read text files:

The "x20" indicates that the file is scanned 20 times, just to eat up time on the clock.

Clearly, C++ ifstream does a terrible job at reading lines from the file one-by-one. So instead let's just sweep that under the rug, and pay attention only to the FILE* result. By the way, VC10 and x64 do this task better than VC9 x86. SSE2 optimizations don't get the credit, since x86 VC9 with SSE2 enjoys no improvement.

Evidently, C++ is faster than C# if you read the whole file as one block. This makes sense, because .NET actually has to do more work in these tests: it is decoding UTF-8 to UTF-16. Standard C++, in contrast, doesn't support conversion between text encodings, and these tests use "char" strings (not wchar_t), so the only extra work C++ has to do is to convert line endings (\r\n to \n), work which C# does too.

It is therefore peculiar that C# wins the second test, reading line-by-line (ReadLine() in C# vs getline() or fgets() in C++). It occurs to me I could have optimized the FILE* version better; for example, fgets() does not return the string length. So, to eliminate the '\n' at the end of the string (necessary to make the code's behavior equivalent to the other two versions), I call strlen() to get the string length, then change the last character to '\0'. It might have been faster to convert to std::string first (which determines the string length) and then remove the last character.

Anyway, let's see the results for parsing and sorting:

Like any clever C++ coder, I optimized tests 2 and 3 by adding empty strings to the end of the vector, and then creating the strings in-place so that it is not necessary to copy them into the vector. Even so, C# is quite competitive in those tests. The winner of the "Parse" test (which involves calling IndexOf, Substring and the C++ equivalents in std::string) depends on the platform (x86 or x64). The one place where C++ definitely wins is during the case-insensitive string sort (std::sort vs List.Sort). I wonder whether the C#'s ability to sort Unicode (É < õ), which _stricmp presumably cannot do, has something to do with this result.

If anyone's keeping score, Mono takes 5.666 seconds to sort the strings.

The Compact Framework may perform better than ifstream:

BenchmarkCppVsDotNet/SimpleParsingAndSortingArm.png

But again, compared to C++, the parse test takes 8 times longer in the Compact Framework, and the sorting test takes a whopping 15 times longer. It would have been nice to look at the disassembly to see why it's so slow, but the debugger reports that disassembly is not supported in the Compact Framework.

P/Invoke

After seeing the dismal results of the Compact Framework tests, the question came up: what if some of the code were written in C++ and called from C#? This could compensate for some of the slowness of the Compact Framework. If you use P/Invoke a lot, though, it’s important to consider how long it takes to cross the boundary between C# and C++.

I tested P/Invoke performance by calling a very simple DLL that I made myself. The functions in that DLL do something trivially simple, such as adding two numbers or returning the first character of a string. Below you can see both .NET desktop and Compact frameworks.

As my tests show, crossing that boundary isn’t what I’d call cheap, even if the arguments and return values are simple integers.

BenchmarkCppVsDotNet/PInvoke.png

On my workstation, you can call Add(int,int) about 25 million times per second in the x86 and x64 versions of .NET (112 cycles per call at 2.83 GHz). Mono is more than twice as fast (45 cycles), and the Compact Framework manages 2.74 million calls per second (219 cycles per call at 600 MHz). Now, these results are not terrible. But non-P/Invoke calls are much faster, as you’ll see below.

Passing strings or structures makes P/Invoke calls much slower in some cases (but not all cases). If the C++ method returns a string then .NET must convert it to a System.String, which of course takes extra time. If the C++ string has type char*, then additionally a type conversion occurs; but since .NET does not support UTF-8 conversion, this conversion is probably a simple truncation from 2 bytes to 1 byte (or expanding 1 bytes to 2 bytes on return). However, you can pass strings to C++ very fast using wchar_t*, since the string is not copied. (Note: if the C++ function modifies the string, be sure to use StringBuilder instead of string, or the immutable string will be modified, which is illegal and dangerous.)

I tested how well P/Invoke handles structures by marshaling between .NET’s System.Drawing.Point structure and Win32’s POINT structure. MakePoint() takes two integers and returns a Point, while GetPointY is passed a Point and returns its Y coordinate. The interesting thing about these two methods is that the x86 marshaller handles them very slowly, but all other CLR versions are fast. The MakePointIndirect() function returns a pointer to a Point (a static variable inside the function), which the C# caller must unpack using Marshal.PtrToStructure. You can see that this is very slow, too. If you want to make sure that a structure is passed or returned to C++ code quickly, be sure to pass it by reference, which the GetPointYIndirect() function does, and do not return a structure by value.

The .NET Compact Framework has some marshaling limitations:

Passing or returning floating point values is not allowed (if I’m not mistaken, this limitation makes no sense, because floating point values are passed no differently than integer values on ARM.)
Passing or returning ANSI (1-byte-per-char) strings is not allowed.
Passing or returning structures is not allowed.

If you have to pass floating point values or structures, you must pass them by reference, not by value. That’s why I made an extra test for adding a pair of double*. Since there is an extra cost for converting a returned double* to a double, you’re better off using an "out double" parameter instead of an IntPtr (i.e. double*) return value; the same guideline applies to structures.

So, how much faster are normal, non-P/Invoke calls? Look how fast do-nothing NoOp() functions can be called:

BenchmarkCppVsDotNet/TrivialMethodCalls.png

You can see here that 10 million non-inlined calls take around 24 milliseconds on the desktop, as compared to about 400 milliseconds for NextInt(), a trivial C function that increments and returns a counter, or 160 milliseconds if NextInt() is called from Mono. Thus, simple P/Invoke calls are about 16 times slower than non-inlined method calls in .NET (and 18-23 times slower than virtual method calls), about 6 times slower in Mono, and 9 times slower in the Compact Framework (380 ms vs 42 ms).

This means that you can never "optimize" .NET code by using P/Invoke to call a function that does very little work. For example, if the C++ version of your code is twice as fast as the C# version, and P/Invoke uses about 110 clock cycles minimum (210 in CF), each C++ function call will have to do (on average) 110 (or 210) clock cycles of work just to “break even” with your C# speed.

And what about inlining? If a method does very little work, these results show that it will be much faster still if it is inlined. The “static no-op” test allows inlining of a static function that does nothing at all, which means that the test effectively just measures an empty for-loop. Clearly, one iteration of the for-loop is much faster than a non-inlined method call (except in the Compact Framework).

Actually, the inlined x64 version seems almost impossibly fast, at 0.001 seconds for 10 million calls. Just to be sure, I temporarily did 100 times as many loop iterations, and found that .NET x64 does one billion iterations in 89.5 milliseconds, or 11.2 billion iterations per second, which is 4 iterations per clock cycle. Maybe the x64 JIT does some fancy loop unrolling, but the debugger wouldn’t let me see the assembly for some reason.

In this "trivial method calls" test I actually started by testing an Add(i, i) method (where i is the loop counter), but it turns out that the difference between Add() and a no-op is very small (2-3 ms on the desktop). In other words, the call itself is much slower than passing "int" arguments or adding the arguments together. So I got rid of the arguments, so that you can see the basic call overhead by itself, with nothing but a for-loop getting in the way.

The "no-inline" test uses the [MethodImplAttribute(MethodImplOptions.NoInlining)] attribute to ensure that a static no-op method is not inlined. Strangely, the method that uses this attribute (which is non-virtual) is consistently slower than a "virtual" no-op method in all desktop CLRs.

Only the Compact Framework agrees with the widely-believed notion that virtual functions are slower, and wow, virtual functions are really very slow on that platform! The results seem to suggest that the Compact Framework uses the same mechanism for virtual dispatch and interface dispatch, which is sad because its interface dispatch is clearly quite slow (9 million calls per second, or 66.6 cycles per call, compared with 25.2 cycles for a non-virtual no-op).

Relative performance between CLRs

I made one more graph that you might find interesting. It’s the relative performance of different implementations of the CLR. All the benchmarks are scaled so that .NET x86 uses 1 unit of time (I must scale the results somehow in order to present all the tests on the same graph.)

I am happy to report that all benchmarks ran without modification under Mono. The Mono results are only shown for x86 because the download Mono page does not offer an x64 version for Windows (or an ARM version for WinCE.)

BenchmarkCppVsDotNet/RelativePerformanceBetweenCLRs.png

Of course, the Compact Framework is not directly comparable as it runs on a different processor, but I included it for completeness, with 10% of the workload (I estimate from a lot of benchmarking experience that the processor itself is, at best, 1/7 as fast as my workstation for single-threaded code, or slower when running complex or memory-intensive code.)

As I mentioned before, some of the P/Invoke tests are very slow under the x86 CLR, so the time measurements for the other platforms are tiny in comparison. x64 outperforms x86 in other areas too, such as "long" arithmetic and string handling, but for some reason the "Generic sum" tests are slower under x64 and Mono, and floating-point handling seems a little slower too.

Mono does okay in general, and does P/Invoke especially well, but it falls behind Microsoft’s x86 CLR in most other tests. The case-insensitive sort, FPI8 arithmetic, long arithmetic and Generic Sum tests run fairly slowly under Mono, but the hashtable (Dictionary) and Square root tests gave decent results.

The Compact Framework, of course, gave a lot of poor results. Even ignoring the floating-point tests (which use software emulation), three tests went right off the right side of the graph. For example, the Sort test took 13.0 times longer than x86, but since the mobile benchmark performs 10% as many sorts, this really means that a C# string Sort would takes 130 times longer on the mobile device than the same C# on my workstation. In fact, I suspect it was a string sorting operation that was the main cause for the 35-second startup time I mentioned at the beginning of this article. Only the P/Invoke tests, and the conversion of "Ints to strings", were relatively quick in the Compact Framework.

Conclusion

In this second edition of the benchmark we can make some new conclusions. The original article tested VS2008 with default optimizations, and the results seemed clear: if you're developing software for the desktop, and (like me) you are equally skilled in C# and C++, it's safe to take advantage of the simpler syntax, rich standard libraries, stellar IntelliSense and reduced development effort that C# offers. Those original results showed only a small performance penalty for C# versus C++ with STL. In the second edition this conclusion needs some modifications. First, let's talk C++. How important are the compile switches and special #defines in VC++?

Once in a while a loop will give you dramatically better performance if you enable SSE/SSE2 (e.g. the Polynomials test). However, usually it doesn't make much difference. I didn't test the "fast" floating-point model separately from the default "precise" model, but it looks like there isn't much difference to worry about.
/Ox performs pretty much the same as the default optimization level, /O2.
_HAS_ITERATOR_DEBUGGING=1 can cause horrendous performance in debug builds (it is disabled by default in Release builds).
_SECURE_SCL=0 can boost performance dramatically in certain cases, particularly in tight loops that iterate over elements of a vector. It makes hash_map/unordered_map faster, but still slower than regular map<K,V>, because Microsoft has given us a really lousy hashtable. Of course, this removes protection against out-of-bounds indexes and iterators, but it puts hair on your chest, right?

Now what about C#? There aren't any special #defines or switches to make C# code run faster, but of course, the way you write your code can make it run faster or slower.

In general, 2-dimensional arrays should be avoided (they are fairly fast on x64 but not x86). It's generally better to use jagged arrays or (if your code isn't partial-trust, of course) pointer arithmetic when you need the best performance. Note that pointers should be used sparingly, not only because it's against C# culture but also because the garbage collector can't move an array while any C-style pointers are pointing to it (remember that you use a "fixed" statement to get the pointer in the first place.) Edit: new versions of the .NET Framework handle two-dimensional arrays faster.
You can get better performance from arrays than List<T>, because (1) if you access an array from 0 to Length-1, range checks are eliminated, and (2) if you need more speed you can use pointers. Such optimizations are the C# equivalent of _SECURE_SCL=0. It's possible to get the best of List<T> and T[] at the same time; I wrote a struct called InternalList<T> for that purpose (not included in the benchmark).
The .NET optimizers aren't as smart as the C++ ones. You may get better performance by doing common-subexpression factoring and loop-unrolling manually. On the plus side, all C# compilers are required to do constant folding.
.NET cannot reliably optimize generics used for math. The x86 JITs do a better job than the x64 ones, but in general, if performance matters you'll have to write your code against a specific numeric data type. It's still possible to hack around this with LeMP or T4 templates.
Memory management concerns (not benchmarked here) are different in C# than C++. In particular, C# memory allocations are extremely fast, but the garbage collector is a wildcard: it may be fast or slow depending on your memory allocation patterns. And Microsoft has not provided any good way to measure GC performance. There are various guidelines to optimize GC: objects should be very short-lived or long-lived (medium-lifetime objects are the slowest), avoid thick pointer trees (e.g. linked lists and SortedDictionary), and so on; but it's really outside the scope of this article.

The most curious new result is the Sudoku test, whose C++ version runs fully twice as fast as C#. Dunno why.

Again, certain "bad" parts of the Microsoft (Dinkumware) STL, like hash_map/unordered_map, are dramatically slower than the equivalent C# and should be avoided. In the C# world, it could be argued that matters are worse, because while you can work around a poor library by writing (or finding) another library, it's harder to work around a compiler that doesn't optimize very well. Now, .NET JITs are not bad, but they can't quite catch up to C++, and certain JITs don't handle certain tasks well (e.g. the x64 JIT was much slower on some tests, but faster than x86 at handling 2D arrays of double). You can sometimes get better performance by fiddling around with pointers, but not always, and who wants to do that anyway?

Meanwhile, the Compact Framework is always slow, with no recourse other than native code. Why so slow? This article by Ryan Chapman sheds some light on the subject. Particularly noteworthy is that the .NET Compact Framework apparently inserts three instructions that are executed on every loop iteration. I'm just speculating, but manual loop unrolling might therefore improve performance significantly. It's not that the CF isn't optimized at all, but you can see that it misses some important optimizations like keeping the loop constant in a register, or combining "add" and "shift" into a single instruction.

Of course, C# coders tend to program in different ways than C++ coders: they use LINQ-to-objects (which can be very slow if used carelessly), they may not tune their algorithms, they may use high-overhead libraries like WPF, prefer XML files to plain text files, use reflection heavily, and so on. These differences can lead to slower programs, but I believe a performance-conscious C# programmer can write programs whose performance is comparable to similarly well-written C++ code. .NET is clearly not as fast as C++ yet, but in principle I see no reason why it couldn't be, given NGen and a lot of TLC from Microsoft.

If you use Mono there will be an extra speed penalty, but depending on what you're doing, it's not too bad. A lot of these tests were hard on Mono, because they tested things it does poorly, such as structure handling and generic Math. Possibly the most realistic tests here are the Square root method, the Sudoku solver and the matrix multiplier. In these tests, Mono was roughly half as fast as C++, and maybe 50% slower than MS .NET, except for Sudoku where Mono was about 27% as fast as C++. On the other hand, it did pretty well in the Dictionary<int,int> test (almost as well as MS .NET).

If you happen to be developing for Windows CE, you'll generally suffer a big performance penalty by using .NET. The Compact Framework can do some tasks well enough (such as reading from a file), but string processing, ordinary loops and arithmetic, structure manipulation and other basic tasks seem to be quite unoptimized. C# would still be acceptable if you won't be doing any complex algorithms or heavy string manipulation. Probably a hybrid solution, where Compact Framework code calls performance-critical C++ over P/Invoke, should be used to overcome its abysmal performance. Mono built for ARM might also offer an improvement, but there is no version currently available for Windows CE.

History

June 17, 2011: Initial version
July 4, 2011: Second version. Added four new benchmarks (Matrix multiply, Sudoku, Polynomials, sorted map). Added Five new platforms. Added over a dozen new graphs. Added/edited the majority of the commentary. Lived in a cave for six more days, one with a comfortable chair and Google.