Introduction
This article shows you how to measure the cost, in CPU cycles, of a given algorithm, function, or machine instruction. No more 'high precision' or 'less
than accurate' figures. The only things we can't measure are instructions that execute in less than 1 cycle. Note that I don't own a 64-bit machine,
so these functions have only been tested on 32-bit x86 machines. Please let me know how they work on a 64-bit machine.
Background
Back in 2005, I was given the task of optimizing our code. It would not have mattered if we hadn't had to run it on old servers, but we did.
QueryPerformanceCounter and GetTickCount didn't help because their results were inconsistent. We solved it in the end,
and it was Thomas' article that helped me out.
Although his code is good enough for most of you out there, it wasn't for us, so I had to simplify it to suit our needs.
Understanding the code
Let's have a look at the API functions:
#define StartPerfCounter(__start) \
__asm pusha \
__asm xor eax, eax \
__asm cpuid \
__asm rdtsc \
__asm mov dword ptr [__start + 0], eax \
__asm mov dword ptr [__start + 4], edx
#define StopPerfCounter(__stop) \
__asm rdtsc \
__asm mov dword ptr [__stop + 0], eax \
__asm mov dword ptr [__stop + 4], edx \
__asm popa
static __forceinline unsigned __int64 CalcOverhead(
unsigned __int64 __start, unsigned __int64 __stop)
{
StartPerfCounter(__start);
StopPerfCounter(__stop);
return __stop - __start;
}
static __forceinline unsigned __int64 CalcPerf(unsigned __int64 __start,
unsigned __int64 __stop, __int64 __overhead)
{
return __stop - __start - __overhead;
}
Both StartPerfCounter and StopPerfCounter are inline assembly macros. There are two reasons for this:
- Calling a non-inline function sets up a stack frame, and the cost of the call and the frame setup is not constant, so it would distort the measurement.
- An inline function is not reliable either, because the compiler decides whether the function actually gets inlined.
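If you look at the first line of StartPerfCounter and the last line of StopPerfCounter, you will see the pusha and popa instructions.
cpuid overwrites eax, ebx, ecx, and edx, and rdtsc overwrites eax and edx, so pusha and popa save and restore the general-purpose registers.
They are there to prevent bugs in the optimized application. Let's move on to xor and cpuid in StartPerfCounter.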
__asm xor eax, eax
__asm cpuid
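The xor and cpuid instructions inside StartPerfCounter are there to flush the pipeline and prevent out-of-order execution. The xor selects cpuid leaf 0,
and cpuid is a serializing instruction: the processor completes all earlier instructions before it continues, which helps rdtsc read the counter at a well-defined point.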
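VC++ does not accept __asm blocks when targeting x64, so the macros in this article cannot be used there as written. The following is a minimal sketch of the same cpuid-then-rdtsc sequence for other setups, not the code from this article. ReadSerializedTsc and CombineHalves are hypothetical names. The VC++ branch assumes the __cpuid and __rdtsc intrinsics from <intrin.h>, the GCC/Clang branch uses extended inline assembly, and the last branch is only a std::chrono stand-in (not a cycle counter) so that the sketch still compiles on non-x86 targets.

```cpp
#include <cstdint>

// rdtsc returns the low 32 bits in eax and the high 32 bits in edx.
inline std::uint64_t CombineHalves(std::uint32_t lo, std::uint32_t hi)
{
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
// Assumption: VC++ intrinsics stand in for the __asm block.
inline std::uint64_t ReadSerializedTsc()
{
    int regs[4];
    __cpuid(regs, 0);   // leaf 0, like "xor eax, eax" followed by "cpuid"
    return __rdtsc();
}
#elif defined(__i386__) || defined(__x86_64__)
// Assumption: GCC/Clang extended asm; cpuid and rdtsc stay in one volatile block.
inline std::uint64_t ReadSerializedTsc()
{
    unsigned int lo, hi, ebxOut, ecxOut;
    __asm__ __volatile__("cpuid\n\t"
                         "rdtsc"
                         : "=a"(lo), "=b"(ebxOut), "=c"(ecxOut), "=d"(hi)
                         : "a"(0u)
                         : "memory");
    (void)ebxOut;   // cpuid results are not needed, only its serializing effect
    (void)ecxOut;
    return CombineHalves(lo, hi);
}
#else
#include <chrono>
// Fallback only: steady_clock ticks, NOT CPU cycles.
inline std::uint64_t ReadSerializedTsc()
{
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
}
#endif
```

Unlike the macros, this version is a function, so the compiler decides whether it is inlined. Treat it as a starting point for experiments on 64-bit machines rather than a drop-in replacement.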
Now you may wonder why I didn't add another xor and cpuid before the second rdtsc instruction. Doing so causes a nasty bug,
which I found on old Pentiums (Pentium 3 and Pentium 4): the measured overhead becomes inconsistent.
Removing the second cpuid fixes it. For reference, this is the variant with the second cpuid:
#define StopPerfCounter(__stop) \
__asm xor eax, eax \
__asm cpuid \
__asm rdtsc \
__asm mov dword ptr [__stop + 0], eax \
__asm mov dword ptr [__stop + 4], edx \
__asm popa
If you want to use this performance counter method in your application, test the variant with the second cpuid on your own machine first. If it gives a consistent
overhead there, use it if you want to. Just remember: do not assume that it will work on other machines.
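One way to run that test is to take the back-to-back reading many times and compare the smallest and largest deltas. The sketch below is an illustration under assumptions, not part of the original code: OverheadSpread, MeasureOverheadSpread, and the readCounter parameter are hypothetical names, and the counter is passed in as a callable so that either StopPerfCounter variant can be wrapped and plugged in.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical result type: smallest and largest back-to-back delta seen.
struct OverheadSpread
{
    std::uint64_t min;
    std::uint64_t max;
};

// Reads the counter twice in a row, `trials` times, and records the spread.
// `readCounter` is any callable that returns a 64-bit counter value.
template <typename ReadCounter>
OverheadSpread MeasureOverheadSpread(ReadCounter readCounter, int trials)
{
    assert(trials > 0);
    OverheadSpread spread = { std::numeric_limits<std::uint64_t>::max(), 0 };
    for (int i = 0; i < trials; ++i)
    {
        const std::uint64_t start = readCounter();
        const std::uint64_t stop = readCounter();
        const std::uint64_t delta = stop - start;
        if (delta < spread.min) spread.min = delta;
        if (delta > spread.max) spread.max = delta;
    }
    return spread;
}
```

Wrapping the counter in a callable adds some cost of its own, so the interesting signal is the gap between min and max, not the absolute value. A small gap suggests the variant behaves consistently on that machine; a large one suggests it does not.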
Using the code
Here is how to use it:
unsigned __int64 start = 0, stop = 0, overhead = 0, perf = 0;
overhead = CalcOverhead(start,stop);
The first line declares four 64-bit variables. Don't forget to initialize them, or you will get warnings. Apparently VC++ does not recognize that a variable
is assigned inside inline assembly code.
Measuring your function is pretty straightforward. You just need to call your function between StartPerfCounter and StopPerfCounter.
StartPerfCounter(start);
StopPerfCounter(stop);
perf = CalcPerf(start,stop,overhead);
printf("No operation costs %I64u cpu cycle(s) \n",perf);
That's it! It can't be easier than that. By the way, if you want to measure your functions in a dynamic way then you may want to read articles about
function pointers, functors, and callbacks.
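As a rough illustration of the dynamic approach, the sketch below takes the work to be measured as a callable. It is not the article's code: MeasureCycles, readCounter, and work are hypothetical names, and the counter is injected as a callable instead of using the macros directly. It also clamps the result at zero, because subtracting an overhead larger than the measured delta would wrap around in unsigned arithmetic.

```cpp
#include <cstdint>

// Hypothetical helper: times `work` with the injected `readCounter`, then
// subtracts a previously measured overhead. The result is clamped at zero so
// the unsigned subtraction cannot wrap around.
template <typename ReadCounter, typename Work>
std::uint64_t MeasureCycles(ReadCounter readCounter, Work work, std::uint64_t overhead)
{
    const std::uint64_t start = readCounter();
    work();                                   // the code under measurement
    const std::uint64_t stop = readCounter();
    const std::uint64_t elapsed = stop - start;
    return elapsed > overhead ? elapsed - overhead : 0;
}
```

Any function pointer, functor, or lambda can be passed as work. Keep in mind that the call itself may or may not be inlined, so the overhead should be measured through the same path with an empty callable.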
Tests to verify the method validity
I used rdtsc to test this method. The first test measures nothing at all. The second test measures one rdtsc instruction, the third test
measures two rdtsc instructions, and so on.
unsigned __int64 start = 0, stop = 0, overhead = 0, perf = 0;
overhead = CalcOverhead(start,stop);
StartPerfCounter(start);
StopPerfCounter(stop);
perf = CalcPerf(start,stop,overhead);
printf("No operation costs %I64u cpu cycle(s) \n",perf);
StartPerfCounter(start);
__asm rdtsc
StopPerfCounter(stop);
perf = CalcPerf(start,stop,overhead);
printf("1 rdtsc instruction costs %I64u cpu cycle(s) \n",perf);
StartPerfCounter(start);
__asm rdtsc
__asm rdtsc
StopPerfCounter(stop);
perf = CalcPerf(start,stop,overhead);
printf("2 rdtsc instructions cost %I64u cpu cycle(s) \n",perf);
StartPerfCounter(start);
__asm rdtsc
__asm rdtsc
__asm rdtsc
StopPerfCounter(stop);
perf = CalcPerf(start,stop,overhead);
printf("3 rdtsc instructions cost %I64u cpu cycle(s) \n",perf);
On my machine, the test results are good. I get a consistent increment of 80 CPU cycles from each test to the next.
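Checking for a consistent increment amounts to subtracting each test result from the next one. A minimal sketch of that arithmetic follows. Increments is a hypothetical helper that is not part of the downloadable code, and it expects the results in test order (0, 1, 2, ... repeated instructions).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: given results for 0, 1, 2, ... repeated instructions,
// returns the cost added by each extra instruction. A delta can come out
// negative if a measurement was noisy, hence the signed result type.
std::vector<std::int64_t> Increments(const std::vector<std::uint64_t>& results)
{
    std::vector<std::int64_t> deltas;
    for (std::size_t i = 1; i < results.size(); ++i)
    {
        deltas.push_back(static_cast<std::int64_t>(results[i]) -
                         static_cast<std::int64_t>(results[i - 1]));
    }
    return deltas;
}
```

If every entry comes out the same, the overhead subtraction is behaving consistently. Entries that vary from run to run point at an unstable overhead or at interference from the rest of the system.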


Note that modern processors execute most of their individual instructions in less than 1 cycle. Your performance results depend on your pipeline architecture,
data dependencies, and processor architecture.
The most awesome thing we can measure is the infamous Hello World function:

I won't deny CodeProject has been a great resource for me. It's always there when I need it. I'm gonna be honest. It took me years before I registered.
And it also took me 2 years before I decided to write an article (hmm.. was it article(s) with s?). And finally I'm here (with no more lazy attitude) and able
to complete it before another 2 years go by.
History
- 11/26/2011
- Fixed data type.
- Added more details.
- Added a warning for bug on old Pentiums.
- Changed inline to __forceinline to support VC++.
- Uploaded code and demo with bug fixes.
- 11/25/2011
- Added pusha and popa instructions to prevent bugs in the optimized application.
- Uploaded code and demo with bug fixes.