Two things: firstly, if you want to generate a square wave at an exact, fixed frequency in code, don't do it in C. The compiler does not know what you are trying to achieve, and will insert code of its own that generally sabotages the timing. In addition, any future update to the compiler may produce different code, which could drastically change the output waveform. Use assembler instead.
Secondly, I've not used a C8051F005, so I don't know how its clock speed relates to execution time. But unless the processor executes every instruction in the same number of clock cycles, and that number is pretty small, you are unlikely to get a 2MHz output from a 16MHz processor. Even if it executes each instruction in one clock period, you still have only eight instructions per output period (four per half-cycle) to generate a good square wave. You might get away with a skewed wave, but even then it could be tight.
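To put numbers on that budget, here is a minimal sketch of the arithmetic. The function names are mine, purely for illustration, and it assumes the best case of one instruction per clock cycle, which I have not checked against the C8051F005 data sheet.

```c
#include <assert.h>
#include <stdint.h>

/* Clock cycles available in one full period of the output waveform.
 * Integer division: any remainder means the frequency cannot be hit exactly. */
uint32_t cycles_per_period(uint32_t f_clk_hz, uint32_t f_out_hz)
{
    assert(f_out_hz > 0u);
    return f_clk_hz / f_out_hz;
}

/* Clock cycles available for each half (high or low) of a 50% duty-cycle wave. */
uint32_t cycles_per_half_period(uint32_t f_clk_hz, uint32_t f_out_hz)
{
    return cycles_per_period(f_clk_hz, f_out_hz) / 2u;
}
```

For a 16MHz clock and a 2MHz output, that is 8 cycles per period and 4 per half-cycle. The pin write and the loop jump both have to fit inside those 4 cycles, and if instructions take more than one cycle each the budget shrinks accordingly.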
A lower output frequency from a faster processor clock is not what you would expect unless something is wrong. It could be a scope trace artefact (try changing the timebase and see if what you are measuring is actually a harmonic - digital scopes are prone to this), or it could be that you are overclocking the processor and it just can't cope. Without knowing a lot more about your hardware, I can't tell.
A quick look at the C8051F005 specs (http://www.silabs.com/Support%20Documents/TechnicalDocs/C8051F005-Short.pdf) shows that you have timer drives to your output ports - use one of them instead! They will generate a clean, square waveform with much better accuracy than you can in code, and require no processing overhead to achieve!
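As a rough illustration of the timer approach, here is a sketch of how you might work out a reload value. It assumes a generic 16-bit up-counting auto-reload timer that toggles its output pin on every overflow. That is an assumption on my part, not the C8051F005's documented timer behaviour, and the function name and `prescale` parameter are hypothetical. Check the data sheet for the real timer modes and registers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: reload value for a generic 16-bit up-counting,
 * auto-reload timer that toggles an output pin on each overflow.
 * The pin toggles twice per output period, so the timer must overflow
 * every f_clk / (prescale * 2 * f_out) ticks.
 * Returns 0 and writes *reload on success, or -1 if the requested
 * frequency is out of range for this clock and prescaler. */
int timer_reload_16(uint32_t f_clk_hz, uint32_t prescale,
                    uint32_t f_out_hz, uint16_t *reload)
{
    assert(reload != 0);
    if (prescale == 0u || f_out_hz == 0u) {
        return -1;
    }
    /* 64-bit intermediate so the divisor cannot overflow. */
    uint64_t divisor = (uint64_t)prescale * 2u * f_out_hz;
    uint64_t ticks = f_clk_hz / divisor;   /* ticks between toggles */
    if (ticks == 0u || ticks > 65536u) {
        return -1;                         /* too fast or too slow */
    }
    *reload = (uint16_t)(65536u - ticks);  /* counts up from here to overflow */
    return 0;
}
```

With a 16MHz timer clock, no prescaler and a 2MHz target, this gives 4 ticks between toggles, so a reload of 0xFFFC. With a divide-by-12 prescaler (an example value only) the same target is unreachable and the function returns -1. Note that the integer division silently rounds down, so if the clock is not an exact multiple of twice the output frequency the result will be slightly off.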