Benchmark Lessons Learned

I have spent far too much time benchmarking different code recently, so here are some lessons learned! In no particular order:
- Make sure all power saving systems on your machine are turned off: The graph shows a benchmark running in a single process. The process is running the same calculation in JVM COBOL over and over again. Although the process was occupying a single core of the machine at near 100%, this only represented a 50% load on the CPU as a whole. The OS was not clever enough to figure this out, and so it kept scaling back the CPU clock speed. When some spurious IO operation or other activity caused the CPU usage to drop further, the scaling cut in even more. This trashes the benchmark results, especially when comparing dissimilar software technologies with different CPU load characteristics.
- Make sure nothing else is running on your machine: OK, that is pretty much impossible on a modern machine. However, I shudder to think how many times a benchmark has gone horribly wrong for me because Outlook chose 'that very moment' to update all its folders. A nasty one to check for is your web browser. Flash animations and AJAX pages can randomly consume lots of CPU or disk and you'd never know it. Check using 'top' on *nix or Process Explorer on Windows.
- Test real scenarios: It is very easy to assume a simple loop will be enough to measure the performance of a language or technology. It is not! Most modern compilers have optimizers which specifically target loops. Simple loops get optimized massively or optimized away altogether. However, the more complex loops which occur in real life do not get optimized away. So, a simple loop can produce results which are in no way similar to real world performance.
Here is a COBOL program with a simple loop:
123456$set sourceformat(variable)
       01 my-group.
           03 counter pic s9(9) comp-5.
           03 a pic s9(9) comp-5.
           03 b pic s9(9) comp-5.
           03 r pic s9(9) comp-5.
           move 123456789 to a b r
           perform varying counter from 1 by 1
                   until counter = 1000000
               compute r = (a + b) / (a - b)
               compute r = (r + b) / (a - b)
               compute r = (r + b) / (a - b)
               compute r = (r + b) / (a - b)
               compute r = (r + b) / (a - b)
           end-perform
           .
We can see that a and b are invariant and that r is a loop constant (it is the same after 1 or 1000 iterations). Further, the counter has a fixed value at the end of the loop which can be deduced at compile time. In my post on getting to understand JVM performance, I showed a benchmarking technique using JavaScript. Here are the results for 64 bit Windows with 64 bit COBOL (with the opt option) and a 64 bit Sun/Oracle JVM:
Launch Overhead:
Results:
=========
JVM
Maximum Time: 250
Minimum Time: 0
Mean Time: 7.875
Total Time: 252
Native
Maximum Time: 112
Minimum Time: 48
Mean Time: 53.28125
Total Time: 1705
Simple Loop:
Results:
=========
JVM
Maximum Time: 283
Minimum Time: 78
Mean Time: 86.59375
Total Time: 2771
Native
Maximum Time: 58
Minimum Time: 49
Mean Time: 50.4375
Total Time: 1614
Launch overhead is the time taken when COBOL does nothing at all (a program which says 'goback.'). All times are in milliseconds. It is clear from the results that for native Micro Focus, the difference between doing nothing at all and performing the loop is less than the error in the benchmarking technique. The variation is probably due to other processes on the machine using CPU and memory. This is an impressive bit of optimization from the Micro Focus native code generator.
JVM COBOL does not do as well, because it does not optimize away this loop. It has trivial launch overhead, so in this test, it is tens if not hundreds of times slower than the native COBOL. Is this result representative of real world programs? We can show that it is not, because constant value loops do not occur very much in production code (why have a loop which does nothing?).
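The shape of the results above (maximum, minimum, mean and total over a series of runs) can be reproduced with a very small harness. The following is only a sketch in Python, not the JavaScript harness from my earlier post: it assumes the program under test can be started as an external command, and the function name `time_launches` and the placeholder command are made up for illustration. The default of 32 runs matches what the totals and means above imply.

```python
import statistics
import subprocess
import sys
import time


def time_launches(command, runs=32):
    """Run `command` `runs` times; return wall-clock statistics in milliseconds."""
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True makes a failing run raise instead of silently
        # contributing a misleadingly short timing.
        subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "max": max(timings_ms),
        "min": min(timings_ms),
        "mean": statistics.mean(timings_ms),
        "total": sum(timings_ms),
    }


if __name__ == "__main__":
    # Placeholder command (an assumption): a Python interpreter that does
    # nothing, standing in for the 'goback.' program used to measure
    # launch overhead.
    stats = time_launches([sys.executable, "-c", "pass"], runs=5)
    for name, value in stats.items():
        print(f"{name.capitalize()} Time: {value:.1f}")
```

Reporting the minimum and maximum alongside the mean is what exposes interference from other processes: a maximum far above the mean usually means something else grabbed the machine during one of the runs.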
What happens if the loop is less predictable? Will the difference in the relative performance of JVM and native change?
Now we can look at a slightly more complex program:
123456$set sourceformat(variable)
       01 my-group.
           03 counter pic s9(9) comp-5.
           03 a pic s9(9) comp-5.
           03 b pic s9(9) comp-5.
           03 r pic s9(9) comp-5.
       01 resuts.
           03 rr pic x(40).
           03 rc pic x(40).
           move 123456788 to b
           move 100 to r
           perform varying counter from 1 by 1
                   until counter = 1000000
               move 123456789 to a
               compute r = counter / 100
               compute r = counter - r * 100
               compute a = a + r
               compute r = (a + b) / (a - b)
               compute r = (r + b) / (a - b)
               compute r = (r + b) / (a - b)
               compute r = (r + b) / (a - b)
               compute r = (r + b) / (a - b)
               if r = 0
                   exit perform
               end-if
           end-perform
           move r to rr
           move counter to rc
           .
In this program, a and r are no longer loop constants. Also, the loop can exit before counter = 1000000. This means that the optimizers can no longer optimize away the loop. This sort of complex branching logic is much more typical of the way business logic runs in real programs.
Here is the result:
Results:
=========
JVM
Maximum Time: 449
Minimum Time: 189
Mean Time: 202.9375
Total Time: 6494
Native
Maximum Time: 167
Minimum Time: 145
Mean Time: 148.8125
Total Time: 4762
After subtracting the launch overhead from the total times (about 1.7 seconds for native and 0.25 seconds for JVM), this benchmark shows native COBOL running only some 2.0 times faster than JVM COBOL.
From this, it is abundantly clear that simple loops are in no way appropriate for benchmarking!
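The launch overhead adjustment is easy to get wrong by eye, so here is the arithmetic spelled out as a short Python sketch. The totals are copied from the 64 bit results above (in milliseconds); the variable names are only illustrative.

```python
# Total times in milliseconds over all runs, from the 64 bit results above.
launch_total_ms = {"JVM": 252, "Native": 1705}     # 'goback.' only
complex_total_ms = {"JVM": 6494, "Native": 4762}   # the more complex loop

# Subtract the launch overhead to leave only the time spent in the loop.
jvm_net = complex_total_ms["JVM"] - launch_total_ms["JVM"]           # 6242
native_net = complex_total_ms["Native"] - launch_total_ms["Native"]  # 3057

raw_ratio = complex_total_ms["JVM"] / complex_total_ms["Native"]
net_ratio = jvm_net / native_net

print(f"Raw totals:     JVM / native = {raw_ratio:.2f}")
print(f"Minus launches: JVM / native = {net_ratio:.2f}")
```

Comparing the raw totals would understate the gap, because most of the launch cost sits on the native side; only the adjusted figure reflects the work done inside the loop.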
- Pick representative hardware: Do not benchmark a database application on a machine with one SATA drive and expect that to resemble running on a machine with a serial SCSI RAID! Even the performance using the x86 and x86_64 instruction sets on the same machine can be radically different.
Below are the results for the benchmark I discussed above, but this time using 32 bit x86 code on the 64 bit machine:
Results:
=========
JVM
Maximum Time: 467
Minimum Time: 204
Mean Time: 219.125
Total Time: 7012
Native
Maximum Time: 247
Minimum Time: 216
Mean Time: 224.15625
Total Time: 7173
This time the launch overhead is lower as well (1.4 seconds for native and 0.2 seconds for JVM). So we have 5.7 seconds for native and 6.8 seconds for JVM COBOL, meaning on 32 bit the native COBOL is running only 1.2 times quicker than JVM, which is substantially different from the 64 bit results on the exact same machine.
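To put the two architectures side by side, the same adjustment can be applied to both sets of totals. This is a rough Python sketch: the 64 bit launch overheads come from the measured totals earlier, while the 32 bit ones are the rounded figures quoted above (1.4 and 0.2 seconds), so the 32 bit ratio is only approximate. The helper name `native_speedup` is made up for illustration.

```python
# (jvm_total_ms, jvm_launch_ms, native_total_ms, native_launch_ms)
figures = {
    "64 bit": (6494, 252, 4762, 1705),
    # 32 bit launch overheads are rounded values from the text (assumption:
    # 0.2 s and 1.4 s are treated as exactly 200 ms and 1400 ms).
    "32 bit": (7012, 200, 7173, 1400),
}


def native_speedup(jvm_total, jvm_launch, native_total, native_launch):
    """How many times faster native runs than JVM once launch cost is removed."""
    return (jvm_total - jvm_launch) / (native_total - native_launch)


speedups = {arch: native_speedup(*row) for arch, row in figures.items()}

for arch, speedup in speedups.items():
    print(f"{arch}: native is about {speedup:.2f} times faster than JVM")
```

Note that on the raw 32 bit totals the JVM actually looks slightly faster than native (7012 against 7173); it is only after removing the launch overhead that native comes out ahead.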
- Know your enemy, spend time with it.