Benchmarking Lessons Learned the Hard Way

alex turner

29 Mar 2011, CC (ASA 2.5), 4 min read

Benchmark Lessons Learned

Having spent far too much time benchmarking different code recently, here are some lessons learned! In no particular order:

Make sure all power saving systems on your machine are turned off: The graph shows a benchmark running in a single process, which runs the same calculation in JVM COBOL over and over again. Although the process occupied a single core of the machine at nearly 100%, this represented only a 50% load on the CPU as a whole. The OS was not clever enough to figure this out, so it kept scaling back the CPU clock speed. When some spurious IO operation or other activity caused the CPU usage to drop further, the scaling cut in even more. This trashes the benchmark results, especially when comparing dissimilar software technologies with different CPU load characteristics.
Make sure nothing else is running on your machine: OK, that is pretty much impossible on a modern machine. However, I shudder to think how many times a benchmark has gone horribly wrong for me because Outlook chose 'that very moment' to update all its folders. A nasty one to check for is your web browser: Flash animations and AJAX pages can randomly consume lots of CPU or disk and you would never know it. Check using 'top' on *nix or Process Explorer on Windows.
Test real scenarios: It is very easy to assume a simple loop will be enough to measure the performance of a language or technology. It is not! Most modern compilers have optimizers which specifically target loops. Simple loops get optimized massively or optimized away altogether. However, the more complex loops which occur in real life do not get optimized away. So, a simple loop can produce results which are in no way similar to real world performance.

Here is a COBOL program with a simple loop:

123456$set sourceformat(variable)

   01 my-group.
       03 counter pic s9(9) comp-5.
       03 a       pic s9(9) comp-5.
       03 b       pic s9(9) comp-5.
       03 r       pic s9(9) comp-5.

   move 123456789 to a b r
   perform varying counter from 1 by 1 
               until counter = 1000000
        compute r = (a + b) / (a - b)
        compute r = (r + b) / (a - b)
        compute r = (r + b) / (a - b)
        compute r = (r + b) / (a - b)
        compute r = (r + b) / (a - b)
   end-perform
   .

We can see that a and b are invariant and that r is a loop constant (it has the same value after 1 or 1000 iterations). Further, counter has a fixed value at the end of the loop, which can be deduced at compile time. In my post on getting to understand JVM performance, I showed a benchmarking technique using JavaScript. Here are the results for 64 bit Windows with 64 bit COBOL (with the opt option) and a 64 bit Sun/Oracle JVM. Times are in milliseconds, and each total covers 32 runs:

Launch Overhead:

Results:
=========
JVM
    Maximum Time: 250
    Minimum Time: 0
    Mean    Time: 7.875
    Total   Time: 252
Native
    Maximum Time: 112
    Minimum Time: 48
    Mean    Time: 53.28125
    Total   Time: 1705
 
Simple Loop:

Results:
=========
JVM
    Maximum Time: 283
    Minimum Time: 78
    Mean    Time: 86.59375
    Total   Time: 2771
Native
    Maximum Time: 58
    Minimum Time: 49
    Mean    Time: 50.4375
    Total   Time: 1614

Launch overhead is the time taken when the COBOL program does nothing at all (a program which says 'goback.'). It is clear from the results that, for native Micro Focus, the difference between doing nothing at all and performing the loop is less than the error in the benchmarking technique. The variation is probably due to other processes on the machine using CPU and memory. This is an impressive bit of optimization from the Micro Focus native code generator.
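The Maximum, Minimum, Mean and Total figures come from timing the same program launch many times. My own harness was written in JavaScript and is not reproduced here; what follows is only a minimal Python sketch of the same idea. The names `time_runs`, `summarize` and `launch` are invented for this illustration, and the default of 32 runs is simply what the totals and means in the tables imply.

```python
import subprocess
import time


def time_runs(action, runs=32):
    """Call `action` `runs` times and return each wall-clock time in milliseconds."""
    times_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    return times_ms


def summarize(times_ms):
    """Reduce a list of timings to the four figures shown in the result tables."""
    return {
        "max": max(times_ms),
        "min": min(times_ms),
        "mean": sum(times_ms) / len(times_ms),
        "total": sum(times_ms),
    }


def launch(command):
    """Return an action that starts `command` as a child process and waits for it."""
    return lambda: subprocess.run(command, check=True)
```

To time a real program, pass its command line as a list to `launch` and hand the result to `time_runs`. Timing a do-nothing program in the same way gives the launch overhead to subtract.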

JVM COBOL does not do as well, because it does not optimize away this loop. It has trivial launch overhead, so once that overhead is subtracted, it is tens if not hundreds of times slower than the native COBOL in this test. Is this result representative of real world programs? We can show that it is not, because constant value loops do not occur very much in production code (why have a loop which does nothing?).
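To make the comparison concrete, the launch overhead can be subtracted from the simple loop timings. This is a small sketch using the mean values from the two result tables above, assuming they are milliseconds per run; the function name `net_loop_cost` is made up for the example.

```python
# Mean times per run, copied from the result tables above (assumed milliseconds).
LAUNCH_MEAN_MS = {"JVM": 7.875, "Native": 53.28125}
SIMPLE_LOOP_MEAN_MS = {"JVM": 86.59375, "Native": 50.4375}


def net_loop_cost(platform):
    """Mean time attributable to the loop itself, after removing launch overhead."""
    return SIMPLE_LOOP_MEAN_MS[platform] - LAUNCH_MEAN_MS[platform]


for platform in ("JVM", "Native"):
    print(f"{platform}: {net_loop_cost(platform):.2f} ms per run for the loop")
```

The native figure comes out slightly negative, which is another way of saying the loop cost is lost in the measurement noise, while the JVM figure is dominated by the loop.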

What happens if the loop is less predictable? Will the difference in the relative performance of JVM and native change?

Now we can look at a slightly more complex program:

123456$set sourceformat(variable)
 
       01 my-group.
           03 counter pic s9(9) comp-5.
           03 a       pic s9(9) comp-5.
           03 b       pic s9(9) comp-5.
           03 r       pic s9(9) comp-5.
       01 results.
           03 rr      pic x(40).
           03 rc      pic x(40).
 
       move 123456788 to b
       move 100       to r
       perform varying counter from 1 by 1 
                until counter = 1000000
            move 123456789 to a
            compute r = counter     / 100
            compute r = counter - r * 100
            compute a = a + r
            compute r = (a + b) / (a - b)
            compute r = (r + b) / (a - b)
            compute r = (r + b) / (a - b)
            compute r = (r + b) / (a - b)
            compute r = (r + b) / (a - b)
            if r = 0
                exit perform            
            end-if
       end-perform
       move r       to rr
       move counter to rc
       .

In this program, a and r are no longer loop constants. Also, the loop can exit before counter = 1000000. This means that the optimizers can no longer optimize away the loop. This sort of complex branching logic is much more typical of the way business logic runs in real programs.
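To see why r now depends on the loop counter, here is a rough Python transliteration of the loop. It is only an illustration: it assumes that each COBOL compute into an integer field truncates the same way as Python's floor division does for the positive values involved, and it says nothing about how the COBOL runtime actually executes the code. The names `loop_body` and `run` are invented for the sketch.

```python
A_BASE = 123456789  # value moved to a at the top of every iteration
B = 123456788       # b never changes


def loop_body(counter):
    """One pass through the loop body; returns the value left in r."""
    r = counter // 100        # compute r = counter / 100 (integer result)
    r = counter - r * 100     # i.e. counter modulo 100
    a = A_BASE + r            # a now varies with counter
    r = (a + B) // (a - B)
    for _ in range(4):        # the four identical compute statements
        r = (r + B) // (a - B)
    return r


def run(limit=1_000_000):
    """Mimic 'perform varying counter from 1 by 1 until counter = limit'."""
    counter = 1
    r = 100
    while counter != limit:
        r = loop_body(counter)
        if r == 0:            # exit perform
            break
        counter += 1
    return r, counter


if __name__ == "__main__":
    print(loop_body(99), loop_body(100))  # different counters give different r
    print(run(limit=200))
```

Because the divisor a - b changes with counter modulo 100, the value of r has to be recomputed on every pass, which is what stops a compiler from folding the loop into a constant.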

Here is the result:

Results:
=========
JVM
    Maximum Time: 449
    Minimum Time: 189
    Mean    Time: 202.9375
    Total   Time: 6494
Native
    Maximum Time: 167
    Minimum Time: 145
    Mean    Time: 148.8125
    Total   Time: 4762

Accounting for launch overhead (about 1.7 seconds in total for native and about 0.25 seconds for JVM), this benchmark shows native COBOL running only about 2.0 times faster than JVM COBOL.
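The arithmetic behind that ratio can be sketched as follows, using the Total Time figures from the launch overhead table and from the result above (assumed to be milliseconds over the same number of runs). The function names are illustrative only.

```python
# Total times copied from the tables above (assumed milliseconds).
LAUNCH_TOTAL_MS = {"JVM": 252, "Native": 1705}
COMPLEX_LOOP_TOTAL_MS = {"JVM": 6494, "Native": 4762}


def net_total_ms(platform):
    """Total time spent in the program itself, with launch overhead removed."""
    return COMPLEX_LOOP_TOTAL_MS[platform] - LAUNCH_TOTAL_MS[platform]


def native_speedup():
    """How many times faster native runs than JVM, net of launch overhead."""
    return net_total_ms("JVM") / net_total_ms("Native")


print(f"JVM net:    {net_total_ms('JVM')} ms")
print(f"Native net: {net_total_ms('Native')} ms")
print(f"Native is about {native_speedup():.1f} times faster")
```

Without the subtraction, the raw totals would suggest a ratio of only about 1.4, so the launch overhead matters a great deal for a run this short.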

From this, it is abundantly clear that simple loops are in no way appropriate for benchmarking!

Pick representative hardware: Do not benchmark a database application on a machine with one SATA drive and expect that to resemble running on a machine with a serial SCSI RAID! Even the performance of the x86 and x86_64 instruction sets on the same machine can be radically different.

Below are the results for the benchmark I discussed above, but this time using 32 bit x86 code on the 64 bit machine:

Results:
=========
JVM
    Maximum Time: 467
    Minimum Time: 204
    Mean    Time: 219.125
    Total   Time: 7012
Native
    Maximum Time: 247
    Minimum Time: 216
    Mean    Time: 224.15625
    Total   Time: 7173

This time the launch overhead is lower as well (1.4 seconds for native and 0.2 seconds for JVM). So we have 5.7 seconds for native and 6.8 seconds for JVM COBOL, meaning that on 32 bit the native COBOL is running only 1.2 times quicker than JVM, which is substantially different from the 64 bit results on the exact same machine.

Know your enemy: spend time with it.

For discussion on this topic and others, please visit my personal site!

License

This article, along with any associated source code and files, is licensed under The Creative Commons Attribution-ShareAlike 2.5 License