Introduction
I love the .NET Framework, C# and IntelliSense, but a lot of people still
say it's slower than C++. Naysayers often point out that video games and "major
applications" such as Office, Internet Explorer, and Windows Media Player are
still written in C++, apparently implying
that C++ is still the best choice for performance. But wait, isn't this mainly because
those applications are older than the .NET Framework itself, or because the .NET
Framework has lousy support for DirectShow, or less-than-ideal support for Direct3D?
Well, I've been using C# for years and rarely had any complaint about its performance--that
is, until I wrote a program for a Pocket PC and was shocked to discover that it
required 35 seconds to start up.
With this article, I am attempting to make a fair comparison between the performance
of C# and C++ (unmanaged), in both desktop and mobile (Windows CE) scenarios. In
the first version of this article, I wrote most of the code in C# first, and manually
ported it, line-by-line, to C++, replacing standard .NET classes with STL equivalents
(some of the code was ported in the other direction). Because this is labor-intensive,
however, the variety and scope of those benchmarks were limited. For this updated
version, I added three extra cross-language benchmarks written by two different
people, plus a variation on the hashtable benchmark using sorted maps.
My main goal is to measure the performance difference in the compiler/JIT and
parts of the standard library that are used most often. Since I'm limiting my
tests to what the C++ standard library offers, there's no way to compare the speed
of things like XML parsing or JPG loading, which the .NET Framework BCL can do but
the C++ standard libraries cannot. My benchmarks are limited to the following
areas:
- String handling:
std::string vs System.String
- Hashtables:
hash_map<K,V> vs Dictionary<K,V>
- Binary trees:
map<K,V> vs SortedDictionary<K,V>
- Simple structures: in my work I often end up creating small performance-critical
structures, such as a fixed-point type with a single integer field.
- Mathematical generics: You have to go to
quite some effort
to write math code using .NET Generics. Do you also suffer a performance hit?
- Simple arithmetic: add and subtract; multiply, divide and modulo by a constant,
for different data types. I also try an integer square root algorithm (and its
floating-point equivalent, for completeness).
- 64-bit integers: some compilers deal with these pretty poorly.
- Text file scanning: how fast can we read text files line by line?
- Sorting
- P/Invoke and no-op methods (C# only)
Updates to this article
The original article compiled the code in Visual Studio 2008; alongside those results
I have added results for Visual Studio 2010 (including .NET 4) and Mono, so you
can see the difference. Microsoft dropped support for Windows CE in VS 2010, which
is the reason I used VS 2008 in the first place. Of course, I welcome readers to
try compiling the code with other compilers or platforms (such as GCC or Windows
Phone 7). Be aware that any differences you observe may be caused by using a different
processor as well as a different compiler.
My test workstation is a Quad-core 64-bit Intel Q9500 @ 2.83 GHz (but all tests
are single-threaded) on Windows 7. The mobile device is a
Mentor Ranger 4
with a 600MHz ARM processor, and .NET Compact Framework 3.5.
Numerous individuals gave the original version of this article a "1" vote. Most
of them were either unwilling or unable to state specifically why I deserved the
lowest possible score, but let me address some of the specific criticisms. My sincere
thanks goes out to those people who found imperfections but didn't punish me for
it!
- In C++, you didn't disable checked iterators: An oversight.
I didn't realize that checked iterators were enabled in Release builds. It turns
out that VC9 does two different things to slow down your C++ code. One is called
"iterator debugging" (enabled in debug builds only), which can sometimes make
code ridiculously slow; the other one, called "iterator checking" or "checked
iterators" has a smaller effect, but is enabled in Release builds by default.
(One 1-voter reports they are disabled by default in VS2010.) I knew about the
first, and conflated it with the second. Sorry. This time they will be disabled
in every scenario except one.
- You ported the code line-by-line between languages: So?
Don't y'all have anything specific to say? Did any of you look at the code?
I chose a subset of C++ and C# in such a way that it was perfectly reasonable
to do the same task in the same manner in both languages. I received only one
specific criticism, about the way GenericSum was written in C++. I
tried to explain that C++ compilers are equipped with something called an "optimizer"
that can perform a trick called "inlining", which (as my graph showed) causes
the benchmark results to be just as good as they would be without the unnecessary
function call. But, judging by the votes, some people were not convinced. So
this time I have changed the C++ code to eliminate the extra function call,
but I was forced to remove one of the C++ tests (which shows what happens to
performance when the function is virtual). Of course, there are still two other
function calls in there, but I think it's pretty standard C++ code at this point,
and the optimizer still exists.
- In C++, you didn't enable SSE2 or the "fast" floating point model:
True. I must admit, my main goal was to benchmark ARM performance, which offers
no magic compiler switches to make code run faster. This time I will do separate
VC9 tests, once in x86 with default compiler switches and again with architecture=SSE2,
floating point model="fast", and no checked iterators, so y'all can see the
difference, and a third time with x64 and all bells and whistles. I still think
it's reasonable to benchmark without SSE2, since some older computers cannot
run an executable that relies on SSE2 (by the way, I have not been able to determine
when non-SSE2 processors stopped being manufactured or what percentage of the market
has SSE2.) However, as far as I know all x64 systems have SSE2 (correct me if
I'm wrong!). So when you make x64 builds, you may as well enable SSE2. Likewise
if you're limiting yourself to current-gen systems (or if you distribute separate
builds of your program for older PCs), it is a good idea to enable SSE2.
- You didn't use VS2010: you gave me a 1 just for that? Eww.
Fine, I'll benchmark VS2010 too. And I'll enable SSE2 and the fast floating
point model for all VC10 tests. But you are uninvited from my birthday party.
- What about profile-guided optimization (PGO): Well, I'm
now up to 6 different sets of C++ benchmark results, and I'm not sure how realistic
PGO results would be. After all, the input data is the same in every trial,
and some of the benchmarks are artificial. So, PGO may not provide the same
performance increase on these benchmarks as on real-world code. I have a more
realistic (but closed source) C++ benchmark I could run with and without PGO;
leave a comment if interested, or simply run PGO on the existing benchmarks
yourself.
- Compare languages and not implementation of their libraries:
As if everyone writes their own hashtables, lists, file management code and
string classes from scratch! And don't forget, memory management is part of
the standard libraries; are memory allocations off-limits too? Look,
some of my benchmarks did not use any standard libraries, and
most make few or no allocations. Ignore the ones that use standard libraries,
if they make you uncomfortable for some reason.
- Use more complex real life algorithms: You want me to write
a large and complex real-life algorithm twice in two different languages? Sure!
Just name the algorithm and tell me how much you'll pay for my time.... Tell
you what, I found some 3rd-party multi-language benchmarks. They are not ultra-complex,
but is it okay if I use those?
Some non-1-voters suggested that ifstream should be avoided in C++
because its design prevents high performance. Given
the results, they were clearly correct! For this benchmark I added FILE*
tests (not direct Win32 API calls, since I prefer portable code.)
There were some other 1-vote criticisms which I couldn't possibly use to improve
the article, e.g.:
- ...Technical fellows at Microsoft (highest technical position in
the company!) say that .NET is not conducive to system programming:
So don't write your next OS in C#. What does this have to do with me?
- Too subjective: why, because I admit to liking C#? So what?
Can you name anything specifically wrong with my code?
- Serious problems in the methodology: specifically?!?
- Your article contains a large number of false and misleading statements:
I asked you to name one, but you didn't.
- A point for effort: Riiiight.
- You really have a twisted mind: I tried not to let it affect
the results. I would vote for the Ron Paul/Dennis Kucinich ticket, though.
- It is pretty well known that native code is faster: Oh,
of course, it was wrong of me to measure the difference! Well, I'm sorry if
my CPU betrayed your trust. But my results showed that C++ was slightly faster
in general (if you choose good-quality C++ libraries), and dramatically
faster on ARM. So we agree with each other, and you gave me a 1 anyway! Plus,
C++ wins the new Sudoku test by a wide margin, and I found some new causes of
sluggishness in the x64 JIT. Does this mean the C++ lovers will now give me
a 5, and C# lovers will suddenly switch their vote to 1?
- Marketing stuff from Microsoft?: No.
- Please consider the numerous comments and update your benchmark:
Please consider the numerous updates and update your vote.
- and my favorite, learning C++ might be a good starting point before
trying to use it in a benchmark: I've been programming in C++ since
I bought my first PC in 1994. Among other things, I wrote a SNES emulator and
a GPS system for ARM with custom-optimized drawing primitives, from scratch
in C++. Next I suppose you'll tell me I'm not a "real" C++ programmer unless
I write my code in assembly...
Here's hoping for a more reasonable discussion this time around.
Before I start, I should mention that there's no easy way to match up garbage
collections with benchmarks, so I do a GC before every C# benchmark, and I do not
attempt to include that GC time in the totals. I would have liked to print out total
GC time at the end, but apparently it is not possible. There is no property or performance
counter for total GC time; there is an "instantaneous" measurement for "% GC time"
but no obvious way to aggregate it over time. I could disable inter-test GCs, but
then some GC time would be misattributed to incorrect places. I could add forced-GC
time to the measured time of each test, but that would artificially inflate the
C# time numbers (since forced gen-2 collections waste time compared to "natural"
GCs). However, GC time shouldn't matter much here, because most of these benchmarks
do not allocate a large number of heap objects (except the tests involving strings
and SortedDictionary).
Test scenarios
It has become clear that when it comes to desktop software, the target CPU, target
JIT (in C#) and compiler switches (in C++) can sometimes affect performance in a
big way. So for this new version of the article, I'll be testing several different
scenarios (11 in total) so you can see the difference that compiler switches make.
I could test more scenarios, but the graph is getting crowded.
- "x86 VC9 default": optimized Release build with default compiler switches,
x86, VC++ 2008
- "x86 VC9 NCI": Release build, default compiler switches except no checked
iterators, x86, VC++ 2008
- "x86 VC9 SSE2": Release build, no checked iterators, SSE2, fast FP model,
full optimization, x86, VC++ 2008
- "x64 VC9 SSE2": Release build, no checked iterators, SSE2, fast FP model,
full optimization, x64, VC++ 2008
- "x86 VC10 SSE2": Release build, no checked iterators, SSE2, fast FP
model, full optimization, x86, VC++ 2010
- "x64 VC10 SSE2": Release build, no checked iterators, SSE2, fast FP
model, full optimization, x64, VC++ 2010
- "x86 Mono": built in VS2008 with default compiler switches, run with Mono
- "x86 .NET3": VS2008, Microsoft .NET Framework 3.5, x86 (JIT is the same
as .NET 2.0)
- "x64 .NET3": VS2008, Microsoft .NET Framework 3.5, x64 (JIT is the same
as .NET 2.0)
- "x86 .NET4": VS2010, Microsoft .NET Framework 4.0, x86
- "x64 .NET4": VS2010, Microsoft .NET Framework 4.0, x64
I also benchmarked with the "x86 VC9 NCI" settings plus /Ox (Full Optimization)
to see if it was significantly different from the default /O2 (Maximize speed).
The difference was tiny in every benchmark, and not worth including as a separate
bar on the bar charts.
A couple of people wanted to know whether Mono would be faster with the LLVM JIT
(the --llvm command-line switch). I found that LLVM made no significant
difference in the current version for Windows (Mono 2.10.2). Some tests ran slightly
faster; the majority ran slightly slower. And there was only a small difference
between the first run (when JIT occurs) and subsequent runs, indicating that JIT
time is not the reason for the lack of speed increase. The difference with and without
--llvm was small enough that I felt it was not worthwhile to include it in the bar
charts.
I'll get to the "real" benchmarking in a minute. But first...
Debug versus Release builds
Some C++ developers end up distributing Debug builds so that "assert" statements
continue to work, or so that they can debug the production machine if they have
to. Unfortunately, this tends to kill performance, as the graph below shows
(Note: I did not update these graphs for the new article. I
don't think new graphs would teach us anything new about Debug builds.)

In this graph, the scale has been adjusted so that the x86 Release build takes
one unit of time. Some operations, most notably function calls and anything involving
STL, are dramatically slower in desktop Debug builds. I suspect "Generic sum" is
so much slower mainly because the STL vector it scans is slower; and as you'll see,
hash_map is virtually unusable in a debug build.
When I first ran the x86 Debug benchmark, I noticed that it seemed to be "hung".
I stopped it in the debugger and found it was running the "string hash_map: 1 Adding
items" test. This test fills the hashtable with 2.5 million items, clears the list,
and fills it again three more times. Well, when inserting the first item after a
clear() command, hash_map gets stuck doing who-knows-what
for around 4 minutes, for a total runtime of 16 minutes and 20 seconds (78 times
longer than the Release build). I left it running overnight, and discovered that
the removal test is even worse. This test tries to remove 10 million items, but
the list size is limited to 2.5 million so most of those requests fail. For whatever
reason, this test took 10 hours and 37 minutes, which is 9400 times longer than
the Release build!
This turned out to be caused by the Visual C++ "iterator debugging" mis-feature.
#define _HAS_ITERATOR_DEBUGGING 0 makes it realistically possible to
run the benchmark, but the hashtable tests are still extremely slow in debug builds.
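For reference, disabling both mis-features on VC9 amounts to a couple of preprocessor definitions. This is a sketch; the macro spellings below apply to VC9/VC10 and are simply ignored by other compilers:

```cpp
// These must be defined before ANY standard header is included -- defining
// them project-wide in the preprocessor settings is the safest approach.
#define _HAS_ITERATOR_DEBUGGING 0 // disables "iterator debugging" (Debug builds)
#define _SECURE_SCL 0             // disables "checked iterators" (Release builds)
#include <vector>                 // headers included after this point comply
```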
This graph also shows that the same piece of code can run at many different speeds
depending on your platform and compiler settings. For example, some tasks, especially
those involving 64-bit integers (or to a lesser extent, floating point), run faster
when targeting x64. Once in a while, x64 handles a task more slowly (for some reason,
reading a file with ifstream::read() is much slower in 64-bit).
The C# results are more consistent:

In fact, only a few operations are significantly slower in a C# Debug build compared
to a Release build. One reason for this is that the .NET standard library, known
as the BCL (Base Class Library), is optimized even when you are running a Debug
build. This explains why the speed of the Dictionary<K,V>, file and
string parsing tests (which mainly stress the BCL) are almost the same in Debug
and Release builds. Additional optimizations seem to be enabled when C# Debug builds
run outside the debugger, as these benchmarks did. In fact, the C# x86 Debug build
outperformed the C++ x86 Debug build in all tests except two.
But of course, what we really care about is Release builds. The rest of this
article will examine Release builds only (using Visual Studio's default compiler
settings).
3rd-party benchmarks
To add more "realism" to my benchmark, I incorporated three benchmarks from two
other people. I refactored their code slightly, to make the code fit better into
my benchmarking framework and (in two benchmarks) to improve the C# version by eliminating
2-dimensional arrays, because performance-sensitive C# code should avoid them. I
did not have time to prepare ARM bar charts for these new tests (or any other tests,
for that matter).
The first two benchmarks
I added were written by a Mr. Attractive Chaos. There are actually 4 benchmarks
in his test, but one tests dictionaries, and I already have a benchmark for that,
and one tests a particular C RegEx library, whereas I'd prefer to stick with standard
libraries here. The first benchmark I kept is a straightforward O(n^3) matrix multiplication,
and the second is a high-quality but (to me) hard-to-follow Sudoku solver. Then
again, I've never played Sudoku.
Both benchmarks were originally written to use 2-dimensional arrays in C#. Now,
writing about the matrix test, the author stated: "this is the first time I
write C#". More seasoned C# developers have learned that .NET's multidimensional
arrays are no good for performance. They weren't designed to be fast. Instead, it
may be better to simulate 2D arrays using either 1D arrays or "jagged" 2D arrays
(double[][] instead of double[,]). The easiest way to eliminate a multidimensional
array from your code is to introduce a wrapper such as this:
public struct Array2D<T> {
    readonly T[] _a;
    readonly int _cols, _rows;

    public Array2D(int rows, int cols)
    {
        _a = new T[rows * cols];
        _cols = cols;
        _rows = rows;
    }
    public Array2D(T[] array, int cols)
    {
        _a = array;
        _cols = cols;
        _rows = _a.Length / cols;
    }
    public T[] InternalArray { get { return _a; } }
    public int Cols { get { return _cols; } }
    public int Rows { get { return _rows; } }
    public T this[int row, int col]
    {
        get { return _a[row * _cols + col]; }
        set { _a[row * _cols + col] = value; }
    }
}
In some conditions, the wrapper will be much faster than a standard .NET 2-dimensional
array (type[n,n]), but as the benchmark will show, this is no longer a reliable
optimization, and you'll often get better results from a jagged array.
The matrix multiplication test
In this benchmark, two monster 1000x1000 matrixes are multiplied. The C-style multiplication
routine is straightforward and hopefully uncontroversial; the inner loop is optimized
with "a_i" and "c_j" to avoid unnecessary multiplications, and the "b" matrix is
transposed so it can be accessed sequentially. The original version could multiply
only square matrixes, but I took the liberty of extending it to handle rectangular
ones (untested, mind you).
double* MultiplyMatrix(double* a, double* b, int aRows, int aCols, int bCols)
{
    int bRows = aCols;
    double* x = new double[aRows * bCols];
    double* c = new double[bRows * bCols];
    for (int i = 0; i < aCols; ++i)
        for (int j = 0; j < bCols; ++j)
            c[j*bRows + i] = b[i*bCols + j];
    for (int i = 0; i < aRows; ++i) {
        double* a_i = &a[i*aCols];
        for (int j = 0; j < bCols; ++j)
        {
            double* c_j = &c[j*bRows];
            double s = 0.0;
            for (int k = 0; k < aCols; ++k)
                s += a_i[k] * c_j[k];
            x[i*bCols + j] = s;
        }
    }
    delete[] c;
    return x;
}
C# variations
In C# there are various ways the same task could be accomplished. The original version
used a multidimensional array (
double[n,n]); I added three more methods,
in increasing order by speed:
- Array2D<double>: As described above, a single-dimensional array
that acts two-dimensional.
- double[n][n]: A "jagged" array, which is simply an array of
arrays. Thus, it requires more memory (16 extra bytes per row compared to Array2D<double>
under x86), and more initialization time, but can be faster overall.
- double[n*n]: A single-dimensional array accessed via pointer
arithmetic, just like the C++ version. This obtains good performance without
the need for extra memory.
The method you choose makes a big difference:
As you can see, the version based on pointer arithmetic (double[n*n])
is the fastest, and the jagged (double[n][n]) version is just barely
slower.
When you look at x86 results (ignoring Mono), double[,] is clearly
the slowest at about 5.8 seconds. Array2D is nearly twice as fast at
3.2 seconds. And double[n][n] is twice as fast again, at 1.6 seconds.
However, the x64 results are downright weird. Evidently Microsoft optimized multidimensional
arrays for x64, so they are even faster than Array2D. At the same time, there is
a "glitch" in x64/.NET4 that makes Array2D slow on that specific version
of the JIT.
Note: from now on when I refer to "x86", I'm excluding Mono
unless otherwise noted. Mono isn't doing very well so far, and it will continue
to do poorly for a lot of these tests. In fact, originally I used a dark gray color
for Mono. But those extremely long Mono bars were becoming a distraction, so I changed
it to light gray. Like I said, if you don't use Mono, just ignore those long gray
bars. If you do use Mono, well, you see that Array2D bar that goes
right off the end of the chart? It's 11.75 seconds.
The one constant is that the jagged array is faster than the multidimensional
array. I haven't figured out how to look at the assembly to see what's going on
yet, but I can think of three reasons why the jagged array would be faster. First,
.NET does range checks on all array accesses, except inside a for-loop
that iterates from 0 to the end of the array, which the jagged implementation is
(mostly) able to do. Second, when using a 2D array, a JIT that is not smart will
fail to factor out the multiplication from the inner loop. Third, even if it does
factor out the multiplication, it may fail to precompute a pointer to the part of
the array being examined (indeed, the GC may make it hard to do that). In contrast,
the jagged version needs no multiplications, and we can explicitly cache a reference
to the current row of the matrix, like the C++ version does.
The version that uses pointers may avoid a single additional range check in the
inner loop, so it should be slightly faster.
This whole discussion points to a general problem with C# performance compared
to C++: .NET JITs aren't as good at optimization as C++ compilers. This is natural,
because JITs execute at run-time so they can't optimize as aggressively (for fear
that the program will start too slowly). In the future one could imagine smarter
JITs that produce a slow version of the code and then (after a loop has executed
several thousand times) take a time-out to optimize the code as well as a C++ compiler.
They already have that kind of optimization technique for interface dispatch. But
for everything else? No.
So, if you need fast C# code, you may have to optimize it yourself: manually
factor out expressions from the inner loop, switch to jagged arrays, or even introduce
pointers. Manual optimization can be a huge pain though, as in this case where the
inner-loop multiplication is hidden inside a call to the indexer. So instead you
could try
NGEN,
the results of which have not been included in this article. Maybe next time.
Generics vs templates
Now, the main matrix multiply test uses double arithmetic, but not all programmers
are concerned primarily with the speed of double arithmetic. Therefore, I also wrote
generic versions of the tests, so you could see the effect of different data types,
specifically int and float.
In the first version of this article I tested C# generics to see if the JIT could
optimize them as well as a C++ compiler optimizes templates. The results seemed
to say yes, but the test was very simple. For this extension of the matrix multiplication
test I compared two data types (double and int) with 5
different JIT engines, and in several cases the JIT botches the job. Have a look
(<angle brackets> represent generics):
(Note: I threw in a generic float test so you can compare with the C++ float
test, but I was too lazy to manually write its non-generic version.)
When you look at the results for x86 using integers, they are fine: generic <int>
is exactly the same speed as non-generic int. However, if you look at the results
for x64/.NET3, you'll see that the generic double test takes 2.14x longer, and the
generic int test takes 3.45x longer. The .NET 4 JIT isn't so bad, but it flubs the
integer test a little. And Mono seriously bombs the generic tests, especially involving
floating point (the result is 9.7 seconds for <double>, and 11.3 seconds for <float>.)
C++, on the other hand, reveals absolutely no difference between the template
<double> version and the original non-template version:
Notice that the bars for double[n*n] and <double>[n*n]
are exactly the same. The C++ compiler evaluates templates at compile-time using
token substitution, so there is no reason to expect any difference. Therefore, I
didn't bother with a non-template int or float version; the numbers are just there
so you can compare them with C# if you want.
Note: for the generic tests I eliminated fractional numbers from the matrix in
case the data type is "int". Integer overflows do occur during the multiplication,
but hopefully this does not mess up the results.
C# vs C++
Setting aside the issue of generics, and avoiding those unpredictable 2D arrays,
let's see how C# compares with C++. This chart shows the original C++ version and
its direct C# equivalent, plus the [n][n] version which is more appropriate
in C#:
The results are clear. First of all, C++ wins in every case. If you use the x86
JIT, C# is only a little slower than C++. But the x64 and Mono JITs perform quite
poorly on this test. Note that the jagged array version (double[n][n])
is the most reasonable way to do large-matrix multiplication in C#, because pointer
arithmetic is frowned upon in C# culture. You only lose a little speed by using the
jagged-array version instead of the pointer-arithmetic version, but the memory cost
(16B per row and slower GCs) should be kept in mind.
Sudoku
The code of this, the second benchmark by Attractive Chaos, is too long to post
here. Basically, it involves a lot of array accesses, if-statements and loops, but
very little arithmetic and no floating-point. Thus, this test stresses a compiler/JIT's
ability to optimize array access, register allocation, and flow control.
This test uses numerous one-dimensional arrays and two two-dimensional arrays.
As before, I replaced the C# 2D arrays with Array2D wrappers, which increased x86
C# speed from 6 seconds to 5.2 seconds, but might have hurt the x64 speed (I didn't
check). Here are the results:
In theory, C++ should have an edge in this test because it uses arrays heavily,
and it does not simply scan arrays from beginning to end like the Matrix test does.
In cases like this, .NET inserts a range check for all array accesses. Also, C++
supports only fixed-size minor array dimensions, while C# supports only dynamic
minor array dimensions, so, for example, .NET can't optimize the code on the knowledge
that the array called "C" is always 16 bytes per row.
Anyway, VC++ wins big this time, typically running twice as fast as .NET, if
not more (and Mono brings up the rear again, at 8.6 seconds). My experience suggests
that array range checks can't account for such a big difference, but investigation
is hindered because I still don't know how to look at C#'s optimized assembly. Go
ahead and laugh at me. Ha!
Polynomials
The third benchmark is by A.D.
Corlan. It's a very simple test which computes the value of a 100-degree polynomial
using float math.
The first time I ran this "Polynomials" benchmark in C# and C++ (VS 2008, x86),
the result was that C++ finished in 0.034 seconds, and C# in 7.085 seconds, making
C++ 208 times faster! Other than the C++ Debug hash_map test that
took over 10 hours, this was the most extreme result I had ever seen. The C++ result
seemed impossible: the outer loop does 10 million iterations and the two inner loops
do 100 iterations each, for a total of 2 billion iterations. There was no way so
much work could be done in 0.034 seconds on a 2.83 GHz processor. Indeed, looking
at the assembly of the outer loop, you can see that the compiler had done something
amazing:
for (int i = 0; i < 10000000; i++)
    pu += dopoly(x);
00CCB740 push ecx
00CCB741 fld dword ptr [__real@3e4ccccd (0CD2D54h)]
00CCB747 mov dword ptr [esp+40h],0
00CCB74F fstp dword ptr [esp]
00CCB752 call Polynomials::dopoly (0CCB650h)
00CCB757 fstp dword ptr [esp+40h]
00CCB75B fld dword ptr [esp+40h]
00CCB75F add esp,4
00CCB762 mov eax,989680h
00CCB767 sub eax,1
00CCB76A fld st(0)
00CCB76C fadd dword ptr [esp+38h]
00CCB770 fstp dword ptr [esp+38h]
00CCB774 jne Polynomials::Test+37h (0CCB767h)
Notice that jne branches to 0x0CCB767, which is after the call to Polynomials::dopoly;
the function is called only once! The C++ optimizer somehow figured out that dopoly()
is functional--that is, it always returns the same result given the same
input--so it factored out the function call from the loop!
While I am truly impressed with the optimizer's cleverness, this just won't do
for a benchmark. The idea was to figure out how long 10 million calls to dopoly()
would take, not just one call. So I made a small change: there is an array declared
on the stack inside dopoly(). If I move that array to the calling function instead,
and pass it in as a parameter, the optimizer no longer believes it can factor out
dopoly(), so it is called 10 million times as desired. This change increases running
time by a factor of 289, from 0.034 seconds to 9.813 seconds.
Anyway, here's the graph:
The average C++ SSE2 version completes in about 24% less time than the average
C# version (ignoring Mono's sad 13 seconds.) The C++ x87 version, though, is pretty
slow. While I was investigating the "clever optimizer" issue above, I think I saw
the reason why. I'm no x87 expert, but look at this. The compiler unrolled the second
loop of dopoly() 10 times, but the unrolled loop looks like this:
for (j=0; j<100; j++)
s = x*s + pol[j];
000DB696 fld dword ptr [esp]
000DB699 add eax,28h
000DB69C sub ecx,1
000DB69F fmul st,st(1)
000DB6A1 fadd dword ptr [eax-30h]
000DB6A4 fstp dword ptr [esp]
000DB6A7 fld dword ptr [esp]
000DB6AA fmul st,st(1)
000DB6AC fadd dword ptr [eax-2Ch]
000DB6AF fstp dword ptr [esp]
000DB6B2 fld dword ptr [esp]
000DB6B5 fmul st,st(1)
000DB6B7 fadd dword ptr [eax-28h]
000DB6BA fstp dword ptr [esp]
000DB6BD fld dword ptr [esp]
000DB6C0 fmul st,st(1)
000DB6C2 fadd dword ptr [eax-24h]
000DB6C5 fstp dword ptr [esp]
...
It's repeatedly storing and loading the accumulator, s. Oops!
C++ vs C#: Hash tables
This test pits two types of .NET Dictionary against the equivalent
Microsoft C++ hash_maps (or unordered_map in VS2010).
First I tried an <int, int> hashtable, then a <string, string>
hashtable.
The string tests use the same keys as the integer tests, but converted to strings.
Each test converts 10 million integers to strings (which is all that test "0" does),
so you may want to mentally subtract test "0" from all the others.
The results may surprise you:
With an occasional exception, C# wins by a landslide! One particular C++ scenario,
x86 VC10 with SSE2, runs better than the others and, when handling strings, outperforms
C#. However, in most situations VC++ loses big time. The first test, for example,
runs over 9 times faster in .NET4 x64 compared to C++ VC10 SSE2.
Why is the difference so large? Well, I suspect Microsoft's (Dinkumware)
hash_map is simply terrible. I don't know how it's implemented; the
code is so ugly I can't stand to even look at it. Honestly, who indents code like
this?
template<class _Traits>
class _Hash
: public _Traits { public:
typedef _Hash<_Traits> _Myt;
typedef typename _Traits::_Val_type _Val_type;
typedef typename _Traits::key_type key_type;
typedef typename _Traits::key_compare key_compare;
typedef typename _Traits::value_compare value_compare;
enum
{ _Bucket_size = key_compare::bucket_size,
_Min_buckets = 8, _Multi = _Traits::_Multi};
typedef list<typename _Traits::value_type,
typename _Traits::allocator_type> _Mylist;
At least they added some comments to the VS 2010 version.
Anyway, I can say with certainty that it's not the C++ compiler's fault. I know
this because I wrote my own version of hash_map for my company, and
my implementation runs roughly ten times faster (with int keys). I
knew I got better performance from my hashtable, but I didn't realize how dramatic
the difference could be until I saw the numbers. Here are the results based on my
implementation:
As you can see, C++ is much more competitive now. It wins some races, and loses
others. My hash_map is heavily inspired by the design of .NET's
Dictionary, and I am unable to guess why my version outperformed C#
while doing int queries and removals, yet not inserts.
When it comes to strings, it is worth noting first of all that they are dramatically
slower than integers, regardless of the programming language you use; for longtime
programmers this will come as no surprise. I believe C++ is at a fundamental disadvantage
whenever strings are stored in collections. The reason is that C# uses immutable
strings, so no work is required to construct or destruct a "copy" of a string.
The C++ STL, however, uses mutable strings, which require extra work to copy. Depending
on how std::string is designed, either bookkeeping is required to track how many
std::strings point to the same string (this approach makes modifying a string more
expensive), or the entire string must be copied (so changing a string is cheaper,
but copying is more expensive). In either case, when adding a string like
"12345" to the hashtable, some work must be done to make a copy.
On the other hand, the STL does have a slight performance advantage, in that
its template classes are specialized and optimized for every specific data type
at compile-time. Thus, the compiler can specifically optimize hash_map<string,string>
for strings. In contrast, the .NET Framework specializes generics at run-time if
they are based on value types (such as Dictionary<int,int>) but not
if they are based on reference types (such as Dictionary<string,string>).
Thus, there is a slight overhead when the .NET hashtable calls string.GetHashCode
and string.Equals. In general I believe the C# design is a better balance,
since it avoids code bloat--some types of Dictionaries can share the same machine
language code--while still offering higher performance for simple types like "int"
and user-defined structures.
Finally, let's see how the Compact Framework performed against VC++ for ARM:
Yikes! In the integer case, the hashtable queries and removals are over 3 times
faster in C++, although insertion is still mysteriously slower. The string tests
are where things get bad, and somehow I am not surprised. The .NETCF program I mentioned
at the start, the one that needed 35 seconds to start, was busy parsing strings
and adding them to a list. String test 1 runs 3.5x slower in C#; test 2 is 9.1x
slower; and test 3 is 7 times slower.
Disclaimer: the above tests use sequential integer keys. Therefore
the hashtables enjoy virtually no collisions, which means that these test results
are probably better than real-world results.
C# vs C++: Sorted maps
Someone pointed out that map<K,V> is more commonly used in C++ than
hash_map<K,V>. That's true, because for the longest time C++ didn't
offer a hashtable. Now in theory a hashtable should be faster, because it should
support O(1) constant-time operations, in contrast to a map which is
sorted and necessarily requires O(log N)-time operations.
Strangely, however, the MS STL's map<K,V> is actually faster than
its hashtable (hash_map or unordered_map), although not
as fast as my hash_map. In the .NET world, SortedDictionary
is slower than Dictionary as you would expect... but does it have to
be this much slower?
C++ wins by a landslide! C++'s map<K,V> is quite fast, while C#'s
SortedDictionary<K,V> is quite slow. Now, I've partly developed a C#
B+tree, which may end up being faster than SortedDictionary; leave
a comment if interested (I suspect nothing helps me write code like peer pressure!)
C# vs C++: Generic sum
C++ is still widely used for numerical applications--apps in which most of the
time is spent in the application code, not the standard library or waiting on I/O--because
its compilers are considered very good at optimization. As a bonus, C++ template
programming lets you write a numerical algorithm that supports multiple numeric
data types.
The generic sum test in C# is based on an enhanced version of
a library written
in 2004 by Rüdiger Klaehn, which observes that with some extra developer effort,
it is possible (although not especially easy) to use C# generics to write fast
math code that supports arbitrary number types.
Earlier in this article I talked about the generic version of the matrix multiplication
benchmark and the fact that C# doesn't always optimize generics as well as you'd
hope. I actually wrote this sum benchmark for the original version of this article,
and it may seem a little redundant in version 2, but there are still some interesting
observations we can make. For instance, I'm testing fixed-point structures here,
which I rely on heavily in my C++ ARM code.
In addition to int, int64, float and
double, I tested with FPI8 and FPL16. These
are fixed-point data types that I created myself. Many, if not most, mobile devices
have poor floating-point support, so I often use fixed-point instead of floating-point
on such devices. A fixed-point type in C++ or C# is simply a struct with an integer
inside it (FPI8 contains an int, FPL16 contains a 64-bit
int). Some of the low bits are treated as fractional. The "8" in FPI8
means there are 8 fraction bits, so FPI8 can store numbers between
-16777216 and 16777215.996, with a resolution of about 0.004. The fact that a "struct"
is involved is an important part of the benchmark: historically, compilers could
not handle a structure that contains a single value as well as that value by itself
(outside the structure), but these days most compilers do a good job.
I tested the following generic C# sum method:
public static T GenericSum<T, Math>(List<T> list, Math M) where Math : IMath<T>
{
T sum = M.Zero;
for (int i = 0; i < list.Count; i++)
sum = M.Add(sum, list[i]);
return sum;
}
Of course, in C++ the natural implementation is simpler:
template<typename T>
T GenericSum(const vector<T>& list)
{
T sum = (T)0;
for(int i = 0; i < (int)list.size(); i++)
sum += list[i];
return sum;
}
However, many C++ template libraries (Boost::GIL
comes to mind) rely heavily on the compiler to inline tiny functions and eliminate
empty structures, so at first I directly ported the convoluted C# code to C++ with
the confidence that the compiler would make it fast. However, in an effort to avoid
future hooliganism from certain 1-voters, I changed the code to what you see above
(although I had to eliminate a virtual-function test as a side-effect). There are
still two function calls there that the 1-voters probably didn't think twice about.
You could eliminate them by using raw arrays instead of the STL, but why bother?
The compiler does optimize it. Let it do its job.
Because adding a number is such simple work, I did 10x as many iterations for
this test as for most others. For reference, I also included a non-generic integer-summing
method in both languages. Here are the desktop results:
Note: I'll give the ARM results down below.
Note: I just noticed that I forgot to do this test with FPI16.
Sorry, I don't have time to re-run all 11 benchmarks.
Funny story. While I was preparing version 2 of this article, I noticed something
weird about all the C++ Generic Sum test results: they were all 0.000! WTF? In the
debugger, it became clear that the optimizer had completely eliminated the loop
that called GenericSum()! This was a reasonable thing for
the optimizer to do, because the outer loop ignored the return value. But the puzzle
was, why did the optimizer's behavior change? One of my tests, the non-generic
sum test (which operates on vector<int> instead of <T>),
had not been changed. I downloaded the old code and checked to be sure. No, the
non-generic sum code was exactly the same as before, but now it ran in zero time.
What had changed? I was using the same compiler (VC9). I checked all the compiler
switches--identical! At that point I literally gave up and went home.
The next morning it occurred to me. Checked iterators! I had disabled them, but
I had not noticed for several days that my GenericSum results were zero. Why not?
Well, the benchmarks were run in random order, but I failed to randomize the random
number seed. As luck would have it, the first GenericSum test came quite late among
the benchmarks and, apparently, I just never ran the program long enough to see
the results!
Anyway, somehow, disabling checked iterators enabled the optimizer to delete
my benchmark completely. To fix this, I changed the outer loop to compute the sum
of all sums and return it, so that the optimizer couldn't be so aggressive. Then
I had a new "problem": the benchmark ran incredibly fast, less than 0.05 seconds
for 100 million additions. What was that about? I checked the assembly. This time
the results turned out to be legitimate. Disabling checked iterators more than
doubled the speed of this benchmark.
By disabling checked iterators, you eliminate the range check on vector accesses.
Apparently, this lets the optimizer be more aggressive, and the test runs really,
really fast. Thus, provided checked iterators are off, the C++ version totally blows
away C#. However, if you look at the "C++ x86 VC9 default" bar, you'll see what
happens if you don't turn them off. In that case, some of the C# tests aren't that
much worse than C++.
Something to notice about the C# results is that they are highly variable. The
Microsoft x86 JITs don't do too badly, provided that you are adding up integers
or FPI8 structures; but as soon as you try double, all 5 of the JITs suck. The x64
.NET3 JIT is especially bad for all data types, and both x64 JITs do worse than
their x86 counterparts. And, for some reason, Mono bombs every test including the
non-generic one. Hmm, I suspect Mono has some trouble optimizing code that involves
structures, since all generic math relies on a special "Math" structure, and it
handles the FPI8 structure much worse than plain "int" (the Mono FPI8 test, which
overflows the chart range, takes 1.57 seconds).
Although C++ wins this test by a wide margin, it should be noted that most of
the C++ tests are executed with no range check on the array accesses, while the
C# tests are run with a range check. I did a quick test to see what happens
if NonGenericSum is written to use int[] instead of List<int>:
In this case .NET optimizes out the range check, and C# runs more than twice
as fast (more than four times as fast in x64!) and not much slower than the C++
version. However, it should be noted that in real life you can't always use an array
instead of List<T>, nor can you always loop through the entire array,
and when you don't, .NET won't optimize out the range check. The only other way
to eliminate the range check is to rewrite the code to access the array through
a pointer.
In C++ it's much more convenient, since you can use the STL and just #define
_SECURE_SCL 0; however, this also carries a disadvantage that range checks
are disabled throughout the entire program (note that it's
not legal to change _SECURE_SCL per-cpp-file, which would violate
the One Definition Rule). Hey Microsoft, if you're listening, you could solve this
problem in C# with a special intrinsic to perform the range-check explicitly before
the loop:
Array.RangeCheck(array1, a, b);
Array.RangeCheck(array2, a, b);
for (int i = a; i < b; i++)
    array1[i] = array2[i];
for (int i = b-1; i >= a; i--)
    ...
Combined with a smart optimizer and a List<T>.RangeCheck(), List<T>
could achieve high speeds too. And that's for verifiable code; within "unsafe" methods,
a special attribute could disable all range checks. Of course, MS is probably
not listening to me. But if you are, I demand to know why MS .NET still lacks
SIMD structures! (VC++ has them, although they are foolishly processor-specific.)
But I digress...
C# vs C++: Simple Arithmetic
The "Simple arithmetic" test, admittedly, was a little too simple; two compilers
of generally similar quality may optimize small pieces of code differently, so you
shouldn't trust these results too much. This is what I did for each numeric type:
double total = 0;
for (int i = 0; i < Iterations; i++)
total = total - total / 4.0 + (double)i % 100000 * 3.0;
Part of this test is to see whether the compilers can optimize multiplication
and modulus by a constant. Here are the desktop results:
There's no clear winner for this test. Generally, C# did a better job in
the "double" test (even when C++ is allowed to use SSE2), but a worse job in
the FPI8 and FPL16 tests. C# is tied with C++ in the
float test, until you enable SSE which causes C++ to win. In the int64
and FPL16 tests, the instruction set (x64 or x86) has a larger
effect than the programming language. And then there's Mono. When it comes to
handling int64 or the FPI8/FPI16 structures, it does
pretty badly, but in the double test it does better than everybody else. What
gives, Mono?
Square roots
I bet you've never tried calculating a square root without any floating-point
math. But on ARM it's the preferred way. Here's the algorithm:
public static uint Sqrt(uint value)
{
if (value == 0)
return 0;
uint g = 0;
int bshft = Log2Floor(value) >> 1;
uint b = 1u << bshft;
do {
uint temp = (g + g + b) << bshft;
if (value >= temp)
{
g += b;
value -= temp;
}
b >>= 1;
} while (bshft-- > 0);
return g;
}
It's a pretty typical mix of arithmetic and flow control. For reference,
I've included the standard double square root (for which x87 and
SSE2 offer specialized instructions). Here are the desktop results:
C++ wins this test, but not by a wide margin (unless you count Mono). In
the 64-bit tests, the instruction set (x86 or x64) matters more than the programming
language.
Notice that "Square root: FPL16" is slower than uint64 (ulong),
in both C++ and C#. The reason for this is not that ulong
is wrapped in an FPL16 structure. Rather, the reason is that
FPL16 has 16 fraction bits and uint64 has none, so
the square root algorithm (which operates on a raw uint64) requires
more iterations to compute the square root.
ARM arithmetic test
Like I said, I didn't update the ARM graphs for V2 of this article. It didn't
feel like there was much point, since the results are so conclusive:

Unfortunately for me, the Compact Framework performs dramatically worse in all
three arithmetic tests.
For this graph I added ratio-bars to the right side, so you can see just how
badly the CF is doing (not easy; I had to work around a bug in Excel).
In fact, C++ beats the Compact Framework by a huge margin on most of these tests.
The "best" results (where .NETCF is only slightly slower) involve the double floating
point type, but this is only because most of the time is spent in the system floating-point
emulation library, regardless of your programming language. This is a huge disappointment;
it suggests that the Compact Framework is incapable of doing math or algorithms
quickly, and the results are even worse when using FPI8 or FPL16.
Plus, it might not optimize generics very well; notice that the C# "int (non-generic)"
test is faster than the "int" (generic) test.
File reading, parsing and sorting
In this test I try out two different ways of reading a 400KB text file with about
17000 lines: either by reading the entire file at once into a buffer, or by reading
it line-by-line. The file has been loaded before so that it is in the disk cache.
This benchmark does not measure disk performance, nor do I want it to.
In C++ I originally used a std::ifstream, but it turned out that
ifstream is slow, so I've added good old FILE* as an alternative
in this second edition of the benchmark. In C#, I used System.IO.File
wrapped in a StreamReader.
The text file contains "dictionary entries": key-value pairs separated by ":=".
The value can span multiple lines; if a line does not contain := then
it is assumed to be a continuation of the value on the previous line. The dictionary
entries are collected into a C# List or C++ vector, then
randomized a bit (to ensure sorting requires some work) and sorted by key in a case-insensitive
manner. First, let's see how fast the two languages can read text files:
The "x20" indicates that the file is scanned 20 times, just to eat up time on
the clock.
Clearly, C++ ifstream does a terrible job at reading lines from
the file one-by-one. So instead let's just sweep that under the rug, and pay attention
only to the FILE* result. By the way, VC10 and x64 do this task better
than VC9 x86. SSE2 optimizations don't get the credit, since x86 VC9 with SSE2 enjoys
no improvement.
Evidently, C++ is faster than C# if you read the whole file as one block. This
makes sense, because .NET actually has to do more work in these tests: it is decoding
UTF-8 to UTF-16. Standard C++, in contrast, doesn't support conversion between text
encodings, and these tests use "char" strings (not wchar_t), so the
only extra work C++ has to do is to convert line endings (\r\n to \n), work which
C# does too.
It is therefore peculiar that C# wins the second test, reading line-by-line (ReadLine()
in C# vs getline() or fgets() in C++). It occurs to me
I could have optimized the FILE* version better; for example,
fgets() does not return the string length. So, to eliminate the '\n'
at the end of the string (necessary to make the code's behavior equivalent to the
other two versions), I call strlen() to get the string length, then
change the last character to '\0'. It might have been faster to convert to
std::string first (which determines the string length) and then
remove the last character.
Anyway, let's see the results for parsing and sorting:
Like any clever C++ coder, I optimized tests 2 and 3 by adding empty strings
to the end of the vector, and then creating the strings in-place so
that it is not necessary to copy them into the vector. Even so, C#
is quite competitive in those tests. The winner of the "Parse" test (which involves
calling IndexOf, Substring and the C++ equivalents in
std::string) depends on the platform (x86 or x64). The one place where
C++ definitely wins is during the case-insensitive string sort (std::sort
vs List.Sort). I wonder whether C#'s ability to sort Unicode (É
< õ), which _stricmp presumably cannot do, has something to do with
this result.
If anyone's keeping score, Mono takes 5.666 seconds to sort the strings.
The Compact Framework may perform better than ifstream:

But again, compared to C++, the parse test takes 8 times longer in the Compact
Framework, and the sorting test takes a whopping 15 times longer. It would have
been nice to look at the disassembly to see why it's so slow, but the debugger reports
that disassembly is not supported in the Compact Framework.
P/Invoke
After seeing the dismal results of the Compact Framework tests, the question
came up: what if some of the code were written in C++ and called from C#? This could
compensate for some of the slowness of the Compact Framework. If you use P/Invoke
a lot, though, it’s important to consider how long it takes to cross the boundary
between C# and C++.
I tested P/Invoke performance by calling a very simple DLL that I made myself.
The functions in that DLL do something trivially simple, such as adding two numbers
or returning the first character of a string. Below you can see both .NET desktop
and Compact frameworks.
As my tests show, crossing that boundary isn’t what I’d call cheap, even if the
arguments and return values are simple integers.

On my workstation, you can call Add(int,int) about 25 million times
per second in the x86 and x64 versions of .NET (112 cycles per call at 2.83 GHz).
Mono is more than twice as fast (45 cycles), and the Compact Framework manages 2.74
million calls per second (219 cycles per call at 600 MHz). Now, these results are
not terrible. But non-P/Invoke calls are much faster, as you’ll see below.
Passing strings or structures makes P/Invoke calls much slower in some cases
(but not all cases). If the C++ method returns a string then .NET must convert it
to a System.String, which of course takes extra time. If the C++ string
has type char*, then additionally a type conversion occurs; but since
.NET does not support UTF-8 conversion, this conversion is probably a simple truncation
from 2 bytes to 1 byte (or expanding 1 byte to 2 bytes on return). However, you
can pass strings to C++ very fast using wchar_t*, since the string
is not copied. (Note: if the C++ function modifies the string, be sure to use
StringBuilder instead of string, or the
immutable string will be modified, which is illegal and dangerous.)
I tested how well P/Invoke handles structures by marshaling between .NET’s
System.Drawing.Point structure and Win32’s POINT structure.
MakePoint() takes two integers and returns a Point, while
GetPointY is passed a Point and returns its Y coordinate.
The interesting thing about these two methods is that the x86 marshaller handles
them very slowly, but all other CLR versions are fast. The MakePointIndirect()
function returns a pointer to a Point (a static variable inside the function), which
the C# caller must unpack using Marshal.PtrToStructure. You can see
that this is very slow, too. If you want to make sure that a structure is passed
or returned to C++ code quickly, be sure to pass it by reference, which the
GetPointYIndirect() function does, and do not return a structure by
value.
The .NET Compact Framework has some marshaling limitations:
- Passing or returning floating point values is not allowed (if I’m not mistaken,
this limitation makes no sense, because floating point values are passed no
differently than integer values on ARM.)
- Passing or returning ANSI (1-byte-per-char) strings is not allowed.
- Passing or returning structures is not allowed.
If you have to pass floating point values or structures, you must pass them by
reference, not by value. That’s why I made an extra test for adding a pair of
double*. Since there is an extra cost for converting a returned
double* to a double, you’re better off using an "out
double" parameter instead of an IntPtr (i.e. double*)
return value; the same guideline applies to structures.
So, how much faster are normal, non-P/Invoke calls? Look how fast do-nothing
NoOp() functions can be called:
You can see here that 10 million non-inlined calls take around 24 milliseconds
on the desktop, as compared to about 400 milliseconds for NextInt(),
a trivial C function that increments and returns a counter, or 160 milliseconds
if NextInt() is called from Mono. Thus, simple P/Invoke
calls are about 16 times slower than non-inlined method calls in .NET (and 18-23
times slower than virtual method calls), about 6 times slower in Mono, and 9 times
slower in the Compact Framework (380 ms vs 42 ms).
This means that you can never "optimize" .NET code by using P/Invoke to call
a function that does very little work. For example, if the C++ version of your code
is twice as fast as the C# version, and P/Invoke uses about 110 clock cycles minimum
(210 in CF), each C++ function call will have to do (on average) 110 (or 210) clock
cycles of work just to “break even” with your C# speed.
And what about inlining? If a method does very little work, these results show
that it will be much faster still if it is inlined. The “static no-op” test allows
inlining of a static function that does nothing at all, which means that the test
effectively just measures an empty for-loop. Clearly, one iteration of the for-loop
is much faster than a non-inlined method call (except in the Compact Framework).
Actually, the inlined x64 version seems almost impossibly fast, at 0.001 seconds
for 10 million calls. Just to be sure, I temporarily did 100 times as many loop
iterations, and found that .NET x64 does one billion iterations in 89.5 milliseconds,
or 11.2 billion iterations per second, which is 4 iterations per clock cycle. Maybe
the x64 JIT does some fancy loop unrolling, but the debugger wouldn’t let me see
the assembly for some reason.
In this "trivial method calls" test I actually started by testing an Add(i,
i) method (where i is the loop counter), but it turns out that the difference
between Add() and a no-op is very small (2-3 ms on the desktop). In
other words, the call itself is much slower than passing "int" arguments
or adding the arguments together. So I got rid of the arguments, so that you can
see the basic call overhead by itself, with nothing but a for-loop getting in the
way.
The "no-inline" test uses the [MethodImplAttribute(MethodImplOptions.NoInlining)]
attribute to ensure that a static no-op method is not inlined. Strangely, the method
that uses this attribute (which is non-virtual) is consistently slower than a "virtual"
no-op method in all desktop CLRs.
Only the Compact Framework agrees with the widely-believed notion that
virtual functions are slower, and wow, virtual functions are really very
slow on that platform! The results seem to suggest that the Compact Framework uses
the same mechanism for virtual dispatch and interface dispatch, which is sad because
its interface dispatch is clearly quite slow (9 million calls per second, or 66.6
cycles per call, compared with 25.2 cycles for a non-virtual no-op).
Relative performance between CLRs
I made one more graph that you might find interesting. It’s the relative performance
of different implementations of the CLR. All the benchmarks are scaled so that .NET
x86 uses 1 unit of time (I must scale the results somehow in order to present all
the tests on the same graph.)
I am happy to report that all benchmarks ran without modification under Mono. The
Mono results are only shown for x86 because the Mono download page does not offer
an x64 version for Windows (or an ARM version for WinCE).

Of course, the Compact Framework is not directly comparable as it runs on a different
processor, but I included it for completeness, with 10% of the workload (I estimate
from a lot of benchmarking experience that the processor itself is, at best, 1/7
as fast as my workstation for single-threaded code, or slower when running complex
or memory-intensive code.)
As I mentioned before, some of the P/Invoke tests are very slow under the x86
CLR, so the time measurements for the other platforms are tiny in comparison. x64
outperforms x86 in other areas too, such as "long" arithmetic and string handling,
but for some reason the "Generic sum" tests are slower under x64 and Mono, and floating-point
handling seems a little slower too.
Mono does okay in general, and does P/Invoke especially well, but it falls behind
Microsoft’s x86 CLR in most other tests. The case-insensitive sort, FPI8
arithmetic, long arithmetic and Generic Sum tests run fairly slowly
under Mono, but the hashtable (Dictionary) and Square root tests gave decent results.
The Compact Framework, of course, gave a lot of poor results. Even ignoring the
floating-point tests (which use software emulation), three tests went right off
the right side of the graph. For example, the Sort test took 13.0 times
longer than x86, but since the mobile benchmark performs 10% as many sorts, this
really means that a C# string Sort would take 130 times longer on
the mobile device than the same C# code on my workstation. In fact, I suspect
it was a string sorting operation that was the main cause for the 35-second startup
time I mentioned at the beginning of this article. Only the P/Invoke tests, and
the conversion of "Ints to strings", were relatively quick in the Compact Framework.
Conclusion
In this second edition of the benchmark we can make some new conclusions. The original
article tested VS2008 with default optimizations, and the results seemed clear:
if you're developing software for the desktop, and (like me) you are equally skilled
in C# and C++, it's safe to take advantage of the simpler syntax, rich standard
libraries, stellar IntelliSense and reduced development effort that C# offers. Those
original results showed only a small performance penalty for C# versus C++ with
STL. In the second edition this conclusion needs some modifications. First, let's
talk C++. How important are the compile switches and special #defines in VC++?
- Once in a while a loop will give you dramatically better performance if
you enable SSE/SSE2 (e.g. the Polynomials test). However, usually it doesn't
make much difference. I didn't test the "fast" floating-point model separately
from the default "precise" model, but it looks like there isn't much difference
to worry about.
- /Ox performs pretty much the same as the default optimization level, /O2.
- _HAS_ITERATOR_DEBUGGING=1 can cause horrendous performance in debug builds
(it is disabled by default in Release builds).
- _SECURE_SCL=0 can boost performance dramatically in certain cases, particularly
in tight loops that iterate over elements of a vector. It makes hash_map/unordered_map
faster, but still slower than regular map<K,V>, because Microsoft has given
us a really lousy hashtable. Of course, this removes protection against out-of-bounds
indexes and iterators, but it puts hair on your chest, right?
Now what about C#? There aren't any special #defines or switches to make C# code
run faster, but of course, the way you write your code can make it run faster or
slower.
- In general, 2-dimensional arrays should be avoided (they are fairly fast
on x64 but not x86). It's generally better to use jagged arrays or (if your
code isn't partial-trust, of course) pointer arithmetic when you need the best
performance. Note that pointers should be used sparingly, not only because it's
against C# culture but also because the garbage collector can't move an array
while any C-style pointers are pointing to it (remember that you use a "fixed"
statement to get the pointer in the first place.)
- You can get better performance from arrays than List<T>, because (1) if
you access an array from 0 to Length-1, range checks are eliminated, and (2)
if you need more speed you can use pointers. Such optimizations are the C# equivalent
of _SECURE_SCL=0. It's possible to get the best of List<T> and T[] at the same
time; I wrote a struct called InternalList<T> for that purpose (not included
in the benchmark).
- The .NET optimizers aren't as smart as the C++ ones. You may get better
performance by doing common-subexpression factoring and loop-unrolling manually.
On the plus side, all C# compilers are required to do constant folding.
- .NET cannot reliably optimize generics used for math. The x86 JITs do a
better job than the x64 ones, but in general, if performance matters you'll
have to write your code against a specific numeric data type. It's still possible
to hack around this with
T4 templates... and for that matter, T4 could be a useful alternative or
supplement to C++ template metaprogramming.
- Memory management concerns (not benchmarked here) are different in C# than
C++. In particular, C# memory allocations are extremely fast, but the garbage
collector is a wildcard: it may be fast or slow depending on your memory allocation
patterns. And Microsoft has not provided any good way to measure GC performance.
There are various guidelines to optimize GC: let objects die young where possible,
avoid thick pointer trees (e.g. linked lists or SortedDictionary, a red-black
tree), and so on; but it's really outside the scope of this article.
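The manual-factoring point is easiest to see with a polynomial, like the benchmark mentioned below. Here is a sketch, written in C++ for brevity (the function names are mine, and the same transformation applies verbatim in C#). A good C++ optimizer may do some of this factoring for you; the .NET JITs generally will not:

```cpp
// Naive evaluation: x^i is recomputed from scratch for every term,
// so the inner loop repeats the same multiplications over and over.
double poly_naive(const double* c, int n, double x) {
    double sum = 0;
    for (int i = 0; i < n; ++i) {
        double term = c[i];
        for (int j = 0; j < i; ++j)
            term *= x;          // common subexpression recomputed each term
        sum += term;
    }
    return sum;
}

// Manually factored (Horner) form: each partial power of x is reused,
// leaving one multiply and one add per coefficient.
double poly_horner(const double* c, int n, double x) {
    double sum = 0;
    for (int i = n - 1; i >= 0; --i)
        sum = sum * x + c[i];
    return sum;
}
```

Both functions compute the same value; only the amount of redundant work differs, which is exactly the kind of optimization you may have to do by hand in C#.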
The most curious new result is the Sudoku test, whose C++ version runs fully
twice as fast as C#. One of these days I should figure out why.
Again, certain "bad" parts of the Microsoft (Dinkumware) STL, like hash_map/unordered_map,
are dramatically slower than their C# equivalents and should be avoided. In the C#
world, it could be argued that matters are worse, because while you can work around
a poor library by writing (or finding) another library, it's harder to work around
a compiler that doesn't optimize very well. Now, .NET JITs are not bad, but they
can't quite catch up to C++, and certain JITs don't handle certain tasks well (e.g.
the x64 JIT was much slower on some tests, but faster than x86 at handling 2D arrays
of double). You can sometimes get better performance by fiddling around
with pointers, but not always, and who wants to do that anyway?
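One way to work around a poor library hashtable in C++ is a flat map: keys kept in a sorted vector and looked up with binary search (similar in spirit to Boost's flat_map). This is a hedged sketch of the idea, not the article's benchmark code; it assumes a read-heavy workload, since each insert shifts elements in O(n):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Minimal "flat map": contiguous, cache-friendly storage with
// O(log n) lookups. The class name and interface are illustrative.
class FlatMap {
    std::vector<std::pair<int, int>> items_;  // kept sorted by key
    // Find the first slot whose key is >= k.
    std::vector<std::pair<int, int>>::iterator lower(int k) {
        return std::lower_bound(items_.begin(), items_.end(), k,
            [](const std::pair<int, int>& p, int key) { return p.first < key; });
    }
public:
    void insert(int key, int value) {
        auto it = lower(key);
        if (it != items_.end() && it->first == key)
            it->second = value;               // overwrite existing key
        else
            items_.insert(it, {key, value});  // O(n) shift; fine if reads dominate
    }
    const int* find(int key) {
        auto it = lower(key);
        return (it != items_.end() && it->first == key) ? &it->second : nullptr;
    }
};
```

For lookup-heavy benchmarks like the ones here, this kind of structure can beat a slow library hash_map despite its worse big-O on insertion.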
Meanwhile, the Compact Framework is always slow, with no recourse other
than native code. Why so slow?
This article by Ryan Chapman sheds some light on the subject. Particularly noteworthy
is that the .NET Compact Framework apparently inserts three instructions that are
executed on every loop iteration. I'm just speculating, but manual loop unrolling
might therefore improve performance significantly. It's not that the CF isn't optimized
at all, but you can see that it misses some important optimizations like keeping
the loop constant in a register, or combining "add" and "shift" into a single instruction.
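To make the loop-unrolling speculation concrete, here is a sketch (in C++ for brevity; the same shape works in C# on the Compact Framework). The four-way unroll factor is arbitrary, and the remainder loop handles lengths that are not a multiple of four:

```cpp
#include <cstddef>

// Straightforward sum: the loop-control work (compare, increment,
// branch) runs once per element.
int sum(const int* a, std::size_t n) {
    int s = 0;
    for (std::size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// Unrolled by four: the per-iteration loop overhead is amortized over
// four additions. On a JIT that emits redundant loop-control code on
// every iteration, this transformation can pay off noticeably.
int sum_unrolled(const int* a, std::size_t n) {
    int s = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; ++i)   // remainder (n not a multiple of 4)
        s += a[i];
    return s;
}
```

Whether this helps on a given JIT is an empirical question; on an optimizing C++ compiler the two versions often generate similar code.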
Of course, C# coders tend to program differently from C++ coders: they
use LINQ-to-objects (which can be very slow if used carelessly), they may not tune
their algorithms, they may use high-overhead libraries like WPF, prefer XML files
to plain text files, use reflection heavily, and so on. These differences can lead
to slower programs, but I believe a performance-conscious C# programmer
can write programs whose performance is comparable to similarly well-written C++
code. .NET is clearly not as fast as C++ yet, but in principle I see no reason why
it couldn't be, given enough TLC from Microsoft or
Xamarin.
If you use Mono there will be an extra speed penalty, but depending on what you're
doing, it's not too bad. A lot of these tests were hard on Mono, because they tested
things it does poorly, such as structure handling and generic Math. Possibly the
most realistic tests here are the Square root method, the Sudoku solver and the
matrix multiplier. In these tests, Mono was roughly half as fast as C++, and maybe
50% slower than MS .NET, except for Sudoku where Mono was about 27% as fast as C++.
On the other hand, it did pretty well in the Dictionary<int,int> test (almost as
well as MS .NET).
If you happen to be developing for Windows CE, you'll generally suffer a big
performance penalty by using .NET. The Compact Framework can do some tasks well
enough (such as reading from a file), but string processing, ordinary loops and
arithmetic, structure manipulation and other basic tasks seem to be quite unoptimized.
C# may still be acceptable if you won't be doing any complex algorithms or heavy
string manipulation. Otherwise, a hybrid solution, in which Compact Framework code
calls performance-critical C++ via P/Invoke, is probably the best way to overcome
its abysmal performance.
Mono built for ARM might also offer an improvement, but there is no version currently
available for Windows CE.
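A minimal sketch of the native half of such a hybrid follows. All names here (SumSquares, NativeMath.dll) are hypothetical, and the __declspec(dllexport) a real Windows CE DLL would need is omitted so the snippet compiles anywhere:

```cpp
// Native side of a hypothetical hybrid solution: the hot routine is
// compiled into a DLL and called from Compact Framework code via
// P/Invoke. extern "C" prevents C++ name mangling so the managed
// side can find the entry point by name.
extern "C" int SumSquares(const int* data, int count) {
    int sum = 0;
    for (int i = 0; i < count; ++i)
        sum += data[i] * data[i];
    return sum;
}

// Hypothetical C# caller (not part of this file):
//   [DllImport("NativeMath.dll")]
//   static extern int SumSquares(int[] data, int count);
```

The marshaling cost of each P/Invoke call is nontrivial, so this pays off when each native call does a substantial batch of work, not per-element calls.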
History
- June 17, 2011: Initial version
- July 4, 2011: Second version. Added four new benchmarks (Matrix multiply,
Sudoku, Polynomials, sorted map). Added five new platforms. Added over a dozen
new graphs. Added/edited the majority of the commentary. Lived in a cave for
six more days, one with a comfortable chair and Google.