 |
|
 |
It's actually not that surprising once you've used .NET languages for a while.
IL code is compiled for the executing processor model at runtime by the JIT compiler.
C++ apps are by default compiled for any processor within the family, and so can't take advantage of any processor-specific features without explicit cases.
You can locate the VS processor pack for your system's processor and compile your code for it, but of course the app then won't run on all platforms.
This also explains why users are finding such wide variations in results.
I also wouldn't be surprised if the IL compiler is actually using SSE intrinsics anyway.
The C# and JIT compilers together are more sophisticated than the VS C/C++ compiler. This is helped by the fact that all true .NET languages (i.e. not CLI managed C++, which is notably slower than the others) are very simple in comparison to native C/C++.
Anyone who has ever tried optimising C# or VB code will know that you can spend a lot of time implementing techniques that would result in substantial speed increases in C++ but make little or no difference in C#.
As for those citing .NET's memory management as a reason that code should be slower on .NET than in C++, just remember that C++ will execute instructions sequentially on one core unless otherwise instructed. It can do very little or no optimisation of the heap without programmer intervention.
Conversely, .NET manages memory through a separate thread and does very cunning things like reusing objects and heap-allocated memory.
|
|
|
|
 |
|
 |
Don't use std containers for your access, since you know the number of elements you will manipulate before using the container. Use a native array with pre-allocation and pointer notation instead, then compare the results; you will be surprised (about 4 times faster).
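A minimal C++ sketch of the comparison suggested above, so that both variants can be measured side by side. The function names `dot_vector` and `dot_raw` are hypothetical, and the sums are accumulated in `long long` (an assumption, to avoid overflow). How large the gap is depends on the compiler and build settings, so it is worth measuring rather than assuming a fixed factor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product over std::vector using operator[] (hypothetical helper name).
long long dot_vector(const std::vector<int>& a, const std::vector<int>& b)
{
    assert(a.size() == b.size());
    long long sum = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += static_cast<long long>(a[i]) * b[i];
    return sum;
}

// Same computation over pre-allocated native arrays with pointer notation.
long long dot_raw(const int* pa, const int* pb, std::size_t n)
{
    long long sum = 0;
    for (const int* end = pa + n; pa != end; ++pa, ++pb)
        sum += static_cast<long long>(*pa) * *pb;
    return sum;
}
```

Both functions can be fed the same data, for example `dot_raw(v1.data(), v2.data(), v1.size())`, which makes it easy to confirm that they agree before timing them.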
regards, Nicky
|
|
|
|
 |
|
 |
You might be better off using System.Diagnostics.Stopwatch for durations, just to avoid the P/Invoke cost. It would also be more convenient to have the C++ test code included, as you did with the C# code.
|
|
|
|
 |
|
 |
Just follow the link in the Background section.
chesnokov
|
|
|
|
 |
|
 |
I'm not sure what you're getting at but I posted code here[^]. You shouldn't have too much trouble getting it to work!
By the way, as mentioned by another poster, the C# version of your code does ***NOT*** use the results of the calculation which means the compiler could totally optimise it away: in C++ this does in fact happen.
Using the programs in the link above with and without printing the result of the calculations:
C++ with result:
FWI: Timer overhead
Overhead = 0.000169084 ms
Starting...Time = 20.8352 ms
Result = 508741359
C++ without result:
FWI: Timer overhead
Overhead = 0.000164702 ms
Starting...Time = 0.000207977 ms
C# with result:
FWI: Timer overhead (due to interop?): Overhead = 0.000426191637145254 ms
Starting...Time = 23.5183278270907 ms
Result = 1232686097
C# without result:
FWI: Timer overhead (due to interop?): Overhead = 0.000405066074911445 ms
Starting...Time = 21.780728814234 ms.
In short, my code does not agree with your conclusions.
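One way to make sure the timed loop cannot be optimised away, sketched in C++ under a few assumptions: the names `g_sink` and `time_dot_ms` are hypothetical, and `std::chrono::steady_clock` is used here instead of the QueryPerformanceCounter-based timer from the posted programs. The result is stored into a volatile variable, which the compiler has to treat as an observable side effect.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// A store to a volatile object is an observable side effect, so the value
// written into it has to be computed. (Hypothetical name.)
volatile long long g_sink = 0;

// Times one dot product in milliseconds and keeps the result alive.
double time_dot_ms(const std::vector<int>& a, const std::vector<int>& b)
{
    assert(a.size() == b.size());
    const auto start = std::chrono::steady_clock::now();
    long long sum = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += static_cast<long long>(a[i]) * b[i];
    g_sink = sum; // consume the result before stopping the clock
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Printing the result after the timed region, as the programs above do, serves the same purpose; the volatile store is just an alternative that does not add console output to the run.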
Steve
|
|
|
|
 |
|
 |
See the updates: it now uses the computation results.
Otherwise it is your hardware that does not agree with my results.
chesnokov
|
|
|
|
 |
|
 |
Chesnokov Yuriy wrote: You will not get more precise time counter.
I beg to differ: http://www.codeproject.com/KB/cs/highperformancetimercshar.aspx[^]
I don't know which hardware your timer is using; maybe it's the same one, but in that case, QueryPerformanceCounter will get you higher precision with the same accuracy[^].
Also, for both your tests and mine I used the same default value of 1000000. Recall that your code and mine yielded the same results: ints are 2msec in unsafe C# and 2msec in C++, doubles are 4msec in unsafe C# and 4msec in C++, 4msec in SSE optimized C++. This is of course taking mihasik's observation[^] into consideration.
Chesnokov Yuriy wrote: and puting bad votes on my paper.
Less drama please. I never voted on your article. Anonymous unexplained voting is among the stupidest things that the internet offers.
Finally, for the record, here are the results with 5000000 in C# using my timer:
Size: 5000000
Not accessing c, safe C#
shorts: 14.1431383038906 ms
ints: 8.24797565053659 ms
longs: 70.4025232257172 ms
floats: 7.82417877132429 ms
doubles: 7.679747006952 ms
decimals: 2580.26788320862 ms
Accessing c, safe C#
shorts: 19.0085611439443 ms c: -17676
ints: 20.3701867136745 ms c: 5037165
longs: 69.7890374335286 ms c: 9905054
floats: 25.0149111130046 ms c: 168.0631
doubles: 26.9442065960897 ms c: 257.050370715542
decimals: 2532.3405628369 ms c: 357.72364894431147243694461501
Accessing c, unsafe C#
doubles: 26.4025176384149 ms c: 68.4500436369281
And your C++ binary:
O:\bin>conv.exe 5000000
chars processing time: 5 ms
-26
shorts processing time: 17 ms
23418
shorts sse2 processing time: 5 ms
23418
ints processing time: 10 ms
1108903
floats processing time: 24 ms
1083117
sse2 intrin processing time: 10 ms
1083117
sse3 assembly processing time: 10 ms
1083117
doubles processing time: 19 ms
1272979
doubles sse2 processing time: 21 ms
1272979
And the C# source code[^] and the timer source code[^].
You can see here that for this particular operation unfortunately, the MSVC compiler optimizes better than the JIT. I'm becoming a little worried that they never will improve the JIT, considering the .NET team doesn't seem to give a damn about performance[^] more than what they've already accomplished.
Which is quite a lot, of course. And I think for most purposes, there's no reason to resort to C++ if the only reason is performance. But according to Rico[^] the performance guy, there's technically no reason that it couldn't be even better.
|
|
|
|
 |
|
 |
Interesting article. I have some thoughts to share too. In fact that's quite a lot of text, so if you don't feel like reading everything, just scroll to the bottom.
If you want to be certain about the results, please ensure that you have AMD CPU drivers installed.
I remember that when I was evaluating CUDA performance on an AMD machine (AMD X2, XP 32-bit), I was getting weird, unpredictable results, including negative running times. Of course that's just wrong.
When I installed the drivers and the dual core optimizer, which I downloaded from the AMD site, the results became more reliable. The variance decreased a lot and I finally started to obtain the expected running time readouts. For time checking in .NET 1.1, I was using a simple wrapper around DllImport and the QueryPerformanceCounter & QueryPerformanceFrequency combo; in .NET 2.0 and later there's System.Diagnostics.Stopwatch, which does more or less the same.
Another thing that may increase precision is calling Thread.Sleep just before the benchmark. Even Thread.Sleep(0) should help. The program gets CPU access for some time, and after that period another process may get the CPU while your program is suspended. Calling Thread.Sleep makes your test more likely to start right at the beginning of this "CPU is yours" time interval. You can also set the thread priority to high or even real time to further decrease the probability that something interrupts your test. This only decreases the probability (and in fact the frequency) of your program getting interrupted. The number of loop iterations has to be high enough that small delays in CPU access won't disturb the results too much. These delays might come from other programs wanting something to do, I/O, etc. With a significant loop count, you can also forget about timer calibration.
On multi-core processors, you might also consider locking thread affinity. Core switching slows things down considerably and may result in erroneous benchmarks, especially with such short running times.
Finally, CPU optimizations. It's not uncommon that if the compiler finds that some computations don't depend on each other, they're modified to be done in parallel. If you're doing vector or matrix operations, such an optimization would be present in both the real-world program and your benchmark, but if you compare a heavily optimized test to a program that runs sequentially, the results may differ a lot. Obviously, the best practice is to make the test mimic the production environment. It's always recommended to check the executable with Reflector. It won't tell you exactly what the machine code will look like, but it's still really useful.
People often blame .NET for being slow, while usually it's just the high startup time and a poorly written program with tons of functions that appear to be simple but in fact are not.
To sum things up:
Install AMD drivers (and the dual core optimizer), lock affinity, use high/real-time priority, call Thread.Sleep, use a huge loop count, and think about optimizations.
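The "huge loop count" advice can be combined with repeated runs. Below is a minimal C++ sketch, assuming a hypothetical helper named `best_time_ms`: it runs the work once as a warm-up, then times it several times and keeps the fastest run, which is less affected by interruptions from other processes than a single reading. It does not set thread priority or affinity, since those calls are platform-specific.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Runs `work` once untimed, then `reps` timed runs, and returns the
// fastest run in milliseconds. (Hypothetical helper name.)
template <typename Work>
double best_time_ms(Work work, int reps)
{
    assert(reps > 0);
    work(); // warm-up: touch memory and caches before timing
    double best = std::numeric_limits<double>::max();
    for (int r = 0; r < reps; ++r) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        best = std::min(best, ms); // keep the least-disturbed measurement
    }
    return best;
}
```

Any callable can be passed in, for example a lambda that performs the dot product and stores its result somewhere visible so the work is not optimised away.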
|
|
|
|
 |
|
 |
That's not convolution at all; that's just a scalar product.
The difference is: the first is an O(N^2) algorithm (unless you use the FFT, but that's another story...). This one is O(N).
Convolution is bound by processor speed, and there you will notice large differences. A scalar product is bound by memory access speed.
Furthermore, any optimizer will remove the unused loop unless you take care to *use* (typically: print) the result of your computation. Actually, I do not know what you are measuring at all. Other posts have pointed this out.
35ms for 5000000 doubles is ridiculous: that is 1.4 MB/s memory bandwidth, or 0.17 MFLOPS... that's rather tiny for a processor theoretically capable of some GIGAFLOPS.
I suggest you check your results better before posting them again, and make them comparable with existing benchmarks (cf. ATLAS or Intel MKL). You will never reach the results of the dgemm routines (highly optimized cache usage), but you should get close, at least with C++, to the daxpy result.
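To make the distinction concrete, here is a small C++ sketch of both operations. The function names `dot` and `convolve` are hypothetical, and the convolution is the plain direct form (no FFT): the scalar product makes one pass and returns a single number, while the convolution has a nested loop and returns a vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar (dot) product: one pass, O(N), result is a single number.
double dot(const std::vector<double>& a, const std::vector<double>& b)
{
    assert(a.size() == b.size());
    double c = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        c += a[i] * b[i];
    return c;
}

// Full discrete convolution, direct form: O(N*M) multiply-adds,
// result is a vector of N + M - 1 samples.
std::vector<double> convolve(const std::vector<double>& x,
                             const std::vector<double>& h)
{
    assert(!x.empty() && !h.empty());
    std::vector<double> y(x.size() + h.size() - 1, 0.0);
    for (std::size_t i = 0; i < x.size(); ++i)
        for (std::size_t j = 0; j < h.size(); ++j)
            y[i + j] += x[i] * h[j];
    return y;
}
```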
|
|
|
|
 |
|
 |
Scalar product or dot product:
http://hyperphysics.phy-astr.gsu.edu/Hbase/vsca.html
a * b = |a|*|b|*cos(angle between vectors)
I do measure the time needed to perform c = a * b'. I presumed that should be obvious from my code.
    tic();
    for (int i = 0; i < size; i++)
        c += a[i] * b[i];
    int time = toc();
    print(c);
Please do compile that code to get GIGAFLOPS. I need that operation, the multiplication of 2 vectors, c = a * b'.
I've seen only messages suggesting the article's purpose is ridiculous, but no real results from better code.
Do post here a function doing c = a * b' in a few ms for 5000000-element float or double vectors before you post further unsupported comments.
chesnokov
|
|
|
|
 |
|
 |
Chesnokov Yuriy wrote: a * b = |a|*|b|*cos(angle between vectors)
= sum(a[i]*b[i]) = your code
Your title is wrong. You are not timing a convolution algorithm, just a scalar product. That's a mathematical definition. A convolution of two vectors yields a vector. The convolution of an image with a kernel yields an image, not a "scalar" number.
A very good practical reference is Numerical Recipes: http://www.nr.com
On the site you can also browse the old versions, but it's a book that is really worth having.
I have to apologize because my estimates of the numbers were wrong, but you or anybody else should at least have challenged them! These numbers are the reference for any benchmark. So:
22 ms for 5000000 doubles (that's my best C++ result) with c += a[i]*b[i] (2 floating-point operations per iteration) gives 5000000*2/0.022 = 454 MFLOPS.
As for memory access, c += a[i]*b[i] (2 memory accesses per iteration, 8-byte operands) gives 5000000*8*2/0.022 = 3.6 GB/s.
These numbers make perfect sense for the platform you are using (which you forgot to mention in the article... if it were an 8088, my wrong numbers would be reasonable).
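The same arithmetic written out as a small C++ sketch, so it can be reapplied to other timings. The function names `dot_flops` and `dot_bytes_per_second` are hypothetical; the model is the one used above, one multiply and one add per element and two operand reads per element.

```cpp
#include <cassert>
#include <cstddef>

// Floating-point operations per second for c += a[i] * b[i]:
// one multiply and one add per element.
double dot_flops(std::size_t n, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * static_cast<double>(n) / seconds;
}

// Bytes read per second: two operands of `elem_bytes` each per element.
double dot_bytes_per_second(std::size_t n, std::size_t elem_bytes, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * static_cast<double>(n) * static_cast<double>(elem_bytes) / seconds;
}
```

For 5000000 doubles in 0.022 s, `dot_flops(5000000, 0.022)` gives about 4.5e8 (454 MFLOPS) and `dot_bytes_per_second(5000000, 8, 0.022)` gives about 3.6e9 (3.6 GB/s).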
I did not want to blame you or make fun of you! I just wanted to contribute, partly correctly and partly wrongly. You need not say I am kidding here. I was honestly convinced of what I had written. Challenge it! Discuss it! And accept the criticism.
Regards,
Sigismondo
|
|
|
|
 |
|
 |
Did anyone try unrolling the loops in C++? It usually gives a huge gain, and I am sure the C# JIT does exactly that for the inner multiply-add loops.
|
|
|
|
 |
|
 |
You want to unroll 5000000 loops? Good luck!
Steve
|
|
|
|
 |
|
 |
I assume you meant that as a joke. But just in case, I will clarify what I meant:
replace:
    for (int i = 0; i < size; i++)
        c += a[i] * b[i];
by this:
    int i;
    // main loop: four multiply/adds per test of the loop condition
    for (i = 0; i < (size-4); i += 4)
        c += a[i+0] * b[i+0] +
             a[i+1] * b[i+1] +
             a[i+2] * b[i+2] +
             a[i+3] * b[i+3];
    // tail loop: the remaining elements
    for (; i < size; i++)
        c += a[i] * b[i];
After that simple change the routine can do 4 multiply/adds without having to evaluate the (i < size) test every time. You could probably unroll it to do 32 or 64 operations, but that depends on your CPU's level 1 cache size.
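A self-contained C++ sketch of the same idea, with a reference loop to check the unrolled version against. The names `dot_simple` and `dot_unrolled4` are hypothetical, the sums are kept in `long long` (an assumption, to avoid overflow), and the main loop condition is written as `i + 4 <= n` so that a length that is an exact multiple of four is handled entirely by the unrolled loop. Optimizing compilers often unroll such loops on their own, so the gain should be measured rather than assumed.

```cpp
#include <cassert>
#include <cstddef>

// Reference loop: one multiply-add and one bounds test per element.
long long dot_simple(const int* a, const int* b, std::size_t n)
{
    long long c = 0;
    for (std::size_t i = 0; i < n; ++i)
        c += static_cast<long long>(a[i]) * b[i];
    return c;
}

// Unrolled by four: one bounds test per four multiply-adds,
// followed by a tail loop for the 0 to 3 leftover elements.
long long dot_unrolled4(const int* a, const int* b, std::size_t n)
{
    assert(a != nullptr && b != nullptr);
    long long c = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        c += static_cast<long long>(a[i + 0]) * b[i + 0]
           + static_cast<long long>(a[i + 1]) * b[i + 1]
           + static_cast<long long>(a[i + 2]) * b[i + 2]
           + static_cast<long long>(a[i + 3]) * b[i + 3];
    for (; i < n; ++i)
        c += static_cast<long long>(a[i]) * b[i];
    return c;
}
```

Comparing the two functions on the same input before timing them is a cheap way to catch off-by-one mistakes in the tail handling.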
|
|
|
|
 |
|
 |
It is important to keep i local: for(int i=0; i < (size-4); i+=4)
otherwise a lot of time is spent in C# on fetching/storing the loop variable i.
unrolled - original results
shorts 4loop: 14 ms
ints 4loop: 16 ms
longs 4loop: 75 ms
floats 4loop: 15 ms
doubles 4loop: 24 ms
decimals 4loop: 2611 ms
rolled up code - as above
shorts 4loop: 3 ms
ints 4loop: 3 ms
longs 4loop: 16 ms
floats 4loop: 3 ms
doubles 4loop: 4 ms
decimals 4loop: 333 ms
|
|
|
|
 |
|
 |
My loop times were wrong; it is still faster, but only marginally.
shorts 4loop: 14 ms
ints 4loop: 14 ms
longs 4loop: 67 ms
floats 4loop: 12 ms
doubles 4loop: 18 ms
decimals 4loop: 1297 ms
|
|
|
|
 |
|
 |
ints are 2msec in unsafe C# and 2msec in C++, doubles are 4msec in unsafe C# and 4msec in C++, 4msec in SSE optimized C++. This is of course taking mihasik's observation[^] into consideration.
Need a higher precision timer to get meaningful results on my machine. Managed C# performs about the same too.
Here's the code for doubles:
// Needs an unsafe context (compile with /unsafe) because of the fixed pointers.
static unsafe void doubles()
{
    double[] a = new double[size];
    double[] b = new double[size];
    Random rnd = new Random();
    for (int i = 0; i < size; i++)
    {
        a[i] = rnd.NextDouble() - 0.5;
        b[i] = rnd.NextDouble() - 0.5;
    }
    // Pin both arrays so the GC cannot move them while we hold raw pointers.
    fixed (double* _pa = a)
    fixed (double* _pb = b)
    {
        double* pa = _pa;
        double* pb = _pb;
        PerformanceCounter.Tic();
        double c = 0.0;
        for (int i = 0; i < size; i++)
            c += *(pa++) * *(pb++);
        // Note: this WriteLine runs before Toc(), so it is included in the timing.
        Console.WriteLine("c: " + c);
    }
    Console.WriteLine(String.Format(" doubles: {0} ms", PerformanceCounter.Toc()));
    a = null;
    b = null;
}
modified on Saturday, April 5, 2008 5:25 PM
|
|
|
|
 |
|
 |
In your C# application you are not using the result of the calculation anywhere. So the JIT compiler generates machine code taking this fact into account (probably some of the calculations are not executed; I don't know). If you print the result of your calculations (variable "c"), you will get a different picture. And it will be slower than C++.
|
|
|
|
 |
|
 |
You're right.
If I remember correctly, the JIT chops out operations that don't do anything to any used variables.
|
|
|
|
 |
|
 |
Right you are. Here's the results from the unaltered program:
shorts: 3 ms
Result = -8605
ints: 4 ms
Result = -2131161
longs: 15 ms
Result = 3379471
floats: 6 ms
Result = 82.94669
doubles: 8 ms
Result = -27.5339920794054
decimals: 506 ms
Result = -81.40936806228595463257761890
Now here's the same program with the lines that output the results removed:
shorts: 3 ms
ints: 1 ms
longs: 14 ms
floats: 2 ms
doubles: 1 ms
decimals: 496 ms
Steve
|
|
|
|
 |
|
 |
The C# version:
using System;
using System.Runtime.InteropServices;

namespace InnerProductCS
{
    internal class Timer
    {
        static Timer()
        {
            long freq;
            QueryPerformanceFrequency(out freq);
            m_ToMS = 1000.0 / freq;
        }

        public void Tick()
        {
            QueryPerformanceCounter(out m_Then);
        }

        public void Tock()
        {
            long now;
            QueryPerformanceCounter(out now);
            m_Time = now - m_Then;
        }

        public double Time()
        {
            return m_Time * m_ToMS;
        }

        [DllImport("Kernel32.dll")]
        private static extern bool QueryPerformanceCounter(out long lpPerformanceCount);

        [DllImport("Kernel32.dll")]
        private static extern bool QueryPerformanceFrequency(out long lpFrequency);

        private static double m_ToMS;
        private long m_Then;
        private long m_Time;
    }

    class Program
    {
        static void Main(string[] args)
        {
            Timer t = new Timer();
            Console.Write("FWI: Timer overhead (due to interop?): ");
            double total = 0;
            const int reps = 1000;
            for (int i = 0; i < reps; ++i)
            {
                t.Tick();
                t.Tock();
                total += t.Time();
            }
            Console.WriteLine("Overhead = " + total/reps + " ms");

            const int array_size = 5000000;
            int[] nums1 = new int[array_size];
            int[] nums2 = new int[array_size];
            Random rand = new Random();
            for (int i = 0; i < array_size; ++i)
            {
                nums1[i] = rand.Next();
                nums2[i] = rand.Next();
            }

            Console.Write("Starting...");
            t.Tick();
            int sum = 0;
            for (int i = 0; i < array_size; ++i)
            {
                sum += nums1[i] * nums2[i];
            }
            t.Tock();
            Console.WriteLine("Time = " + t.Time() + " ms");
            Console.WriteLine("Result = " + sum);
        }
    }
}
The C++ version:
The pre-compiled header:
#pragma once
#include "targetver.h"
#include <stdlib.h>
#include <iostream>
#include <algorithm>
#include <functional>
#include <numeric>
#include <tchar.h>
#include <windows.h>
And the program itself:
#include "stdafx.h"
using namespace std;

namespace {
    class Timer
    {
    public:
        static void StaticInit()
        {
            LARGE_INTEGER freq;
            QueryPerformanceFrequency(&freq);
            m_ToMS = 1000.0 / freq.QuadPart;
        }
        void Tick()
        {
            QueryPerformanceCounter(&m_Then);
        }
        void Tock()
        {
            LARGE_INTEGER now;
            QueryPerformanceCounter(&now);
            m_Time.QuadPart = now.QuadPart - m_Then.QuadPart;
        }
        double Time() const
        {
            return m_Time.QuadPart * m_ToMS;
        }
    private:
        static double m_ToMS;
        LARGE_INTEGER m_Then;
        LARGE_INTEGER m_Time;
    };
    double Timer::m_ToMS;
}

int _tmain(int argc, _TCHAR* argv[])
{
    Timer::StaticInit();
    Timer t;
    cout << "FWI: Timer overhead" << endl;
    double total = 0;
    static const int reps = 1000;
    for (int i = 0; i < reps; ++i)
    {
        t.Tick();
        t.Tock();
        total += t.Time();
    }
    cout << "Overhead = " << total/reps << " ms" << endl;

    static const size_t array_size = 5000000;
    int *pInts1 = new int[array_size];
    int *pInts2 = new int[array_size];
    generate_n(pInts1, array_size, &rand);
    generate_n(pInts2, array_size, &rand);

    cout << "Starting...";
    t.Tick();
    int sum = 0;
    for (size_t i = 0; i < array_size; ++i)
    {
        sum += pInts1[i] * pInts2[i];
    }
    t.Tock();
    cout << "Time = " << t.Time() << " ms" << endl;
    cout << "Result = "<< sum << endl;

    // Arrays allocated with new[] must be released with delete[].
    delete[] pInts1;
    delete[] pInts2;
    return 0;
}
And the results:
C#:
FWI: Timer overhead (due to interop?): Overhead = 0.000463770704693501 ms
Starting...Time = 26.6975530025226 ms
Result = -391167045
C++:
FWI: Timer overhead
Overhead = 0.000208534 ms
Starting...Time = 21.4013 ms
Result = 508741359
Steve
|
|
|
|
 |
|
 |
???
As you can see from my article, for 5000000-element vectors:
C++
ints processing time: 16 ms
C#
ints: 7 ms
Maybe you're using an illegal VS version.
The code is as plain as can be. If you do not believe my timers, check them with sleep().
chesnokov
|
|
|
|
 |
|
 |
The code I used is as simple as it gets: it only uses one test case; it actually uses the result of the calculation to ensure the entire thing isn't optimised away and it doesn't use an external timer dll. I used Visual Studio 2008 for both the C++ program and the C# version. This fact remains - the huge difference in your programs doesn't occur in mine!
Steve
|
|
|
|
 |