At the core of a virtual machine is a mechanism which resolves the current opcode and executes the appropriate operation. While there are numerous methods to accomplish this task, typical tutorials on the subject elect to use a switch-on-opcode implementation. For some, this choice is made simply to limit the complexity of the example. Still others, however, argue the switch-on-opcode method is preferable for the sake of performance.
The intent of this article is to:
- enumerate typical methods of resolving opcodes
- implement each in a manner consistent with usage
- measure and compare the performance of each
In so doing, this article demonstrates:
- the switch-on-opcode method is not as desirable as it would first appear
- the performance impact of a virtual function call is negligible (at least in this context)
- write first, measure and optimize later, is a good strategy
I recently had the opportunity to design and implement a virtual machine for a project upon which I was working. Not wanting to begin a design plagued with performance issues, a list of possible opcode resolution methods was made and an implementation of each tested. The four possible methods where identified as:
- Method 0: switch-on-opcode with processing in place
- Method 1: switch-on-opcode with processing deferred to function
- Method 2: array of function pointers (the opcode being an index into the array)
- Method 3: array of pointers to functional objects (again opcode being an index)
The second method (method 1) is simply a minor variation of the first. It is an attempt to better organize the code by having each opcode implemented in its own function. One expects this to always be slower than the first method.
The switch-on-opcode methods where rejected immediately from consideration. Not on the grounds of performance, but for the sake of extensibility (and clarity). It was important for the project that the VM be easily extendable (opcodes altered and added by the application programmer). However, for testing purposes, the switch-on-opcode methods were included for the sake of completeness and to provide a baseline for comparisons.
Prior to measurement, the order of the methods in the list reflects a naïve estimate of speed (potentially fastest to slowest).
The Test Code
The code included with this article tests each of the four methods previously listed, and dumps the results to the screen and a file. The test is divided into a series of passes. Each pass creates a program (a list of random opcodes) larger than the previous, and runs the program multiple times for each method. In an attempt to be fair, the order of the machines tested each “run” is randomized, and the minimum and maximum times thrown out before the average execution time is calculated.
Timing measurements are made using the RDTSC Pentium instruction. Since the time in seconds is not necessary, all times are kept in terms of "ticks". Execution on a Pentium machine is the only requirement of the code.
For those who wish to play with the included code, first refer to the file: Parameters.h. This file defines a number of parameters used by the test. Of the various parameters,
COMPLEX_OPERATION are the most interesting. The first,
VALID_OPCODE_COUNT, defines the size of the instruction set used by the virtual machines.
COMPLEX_OPERATION, if defined, forces the use of a non-trivial operation per opcode. Otherwise, a simple arithmetic operation is used.
Each method is implemented as a "machine" and is located in its own file. For the remainder of this article, the terms method and machine are synonymous.
The test program was executed a number of times and results plotted. The two charts above represent typical results. Given opcodes with low computational overhead (i.e., simple operations), the method of opcode resolution had a definite impact on performance. Note, for simple opcodes, the method which employs virtual functions (method 3) is noticeably faster than the switch-on-opcode methods (0 and 1).
As the overhead of the operations increase (i.e., complex operations), the impact on execution time due to opcode resolution techniques becomes less significant.
The test code was compiled and run on a 2.4GHz Pentium machine using VC++ .NET and GCC 3.3.3. A count of 600000 is equal to 250uS.
The first conclusion is one which is stated often: It is more constructive to spend design time considering issues of extendibility and maintainability over issues of performance. Until actual measurements are made, true performance problems cannot be identified and corrected.
The second conclusion: Make sure you are testing what you need to. If one were to test VM implementations using an unrealistically small instruction set and empty “test” opcodes, the switch-on-opcode method might perform better than the others simply because the dependence upon opcode determination is artificially high.
The final conclusion: Make sure you are testing what you think you are. Empty “test” opcodes or assignments of unused “test” variables may be optimized away by your compiler. You might be proceeding with a design which is not as fast as you think.