Solution 1 missed one very important aspect: you need to exclude JIT compilation from the measurement. Please see:
http://en.wikipedia.org/wiki/Just-in-time_compilation.
This is easy to do: start measuring only after every method involved in the timed code has already been called at least once. For example, call everything you are going to time at least once, and only then start your measurements.
Failure to do so is a very common mistake.
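To illustrate the warm-up idea, here is a minimal sketch in Java, which I picked only because it is also a JIT-compiled platform. The names WarmupTiming, work and timeAfterWarmup are made up for the example, and work() is just a placeholder for whatever you actually want to time. The task is called a few times without timing, and only then one call is timed.

```java
import java.util.function.LongSupplier;

public class WarmupTiming {
    // Written to from the timed code so the result is not optimized away.
    static volatile long sink;

    // Hypothetical workload standing in for the code being timed.
    static long work() {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += ((long) i * i) % 7;
        }
        return sum;
    }

    // Calls the task warmupRuns times untimed, then times one more call.
    // Returns the elapsed time of that last call in nanoseconds.
    static long timeAfterWarmup(LongSupplier task, int warmupRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            sink = task.getAsLong(); // untimed: gives the JIT a chance to compile
        }
        long start = System.nanoTime();
        sink = task.getAsLong();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long nanos = timeAfterWarmup(WarmupTiming::work, 5);
        System.out.println("Elapsed after warm-up: " + nanos + " ns");
    }
}
```

How many warm-up calls are enough depends on the runtime. Some runtimes compile a method on its first call, while others (the HotSpot JVM, for example) compile only after a method has been called many times, so the value 5 above is an arbitrary assumption and not a recommendation.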
Another piece of advice: collect enough statistics. The observed statistical dispersion of timing results can be surprisingly high, as timing depends on a lot of random factors.
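Collecting statistics could look roughly like the sketch below, again in Java and again with made-up names (TimingStats, sample, mean, stdDev) and a placeholder workload. It takes many timed samples after a warm-up and reports the mean, the sample standard deviation and the minimum, so you can see the dispersion yourself.

```java
public class TimingStats {
    // Written to from the timed code so the result is not optimized away.
    static volatile long sink;

    // Hypothetical workload standing in for the code being timed.
    static void work() {
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += ((long) i * i) % 7;
        }
        sink = sum;
    }

    // Runs the task untimed warmupRuns times, then records `samples` timings (ns).
    static long[] sample(Runnable task, int warmupRuns, int samples) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        long[] times = new long[samples];
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            task.run();
            times[i] = System.nanoTime() - start;
        }
        return times;
    }

    static double mean(long[] xs) {
        double sum = 0;
        for (long x : xs) sum += x;
        return sum / xs.length;
    }

    // Sample standard deviation (divides by n - 1); needs at least two samples.
    static double stdDev(long[] xs) {
        double m = mean(xs);
        double sq = 0;
        for (long x : xs) sq += (x - m) * (x - m);
        return Math.sqrt(sq / (xs.length - 1));
    }

    public static void main(String[] args) {
        long[] times = sample(TimingStats::work, 10, 100);
        long min = Long.MAX_VALUE;
        for (long t : times) min = Math.min(min, t);
        System.out.printf("mean = %.0f ns, std dev = %.0f ns, min = %d ns%n",
                mean(times), stdDev(times), min);
    }
}
```

The counts 10 and 100 are arbitrary assumptions. If the standard deviation is comparable to the mean, that is a sign you need more samples or a longer workload per sample.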
—SA