The everyday development I do targets desktop processors. Those processors are more than capable of meeting my computational demands and performance requirements, so it's rare that I have to think about how long a specific mathematical operation takes to execute. That's the general case, but it isn't always the case.
When I was finishing grad school, one of my projects dealt with machine vision and automatic classification of photographs. I understood the algorithms I needed to implement and how everything would fit together. While developing the individual components, I used small datasets that were sufficient to show that each component was working as designed. It wasn't until I had brought all of the components together that I gave the system full-sized images (around 2 megapixels) to process. I knew a full-sized image would take longer, but processing a single image took around three hours! Investigating the slowness, I found that some of the operations I was executing were not natively supported by the machine's processor and were being emulated; some of those math operations took 50 to 100 times as long as native operations. I had a hard time constraint, since I had to demonstrate the program during a class presentation. To stay within it, I replaced the more expensive mathematical operations with lookup tables and changed the program from merely multithreaded to taking full advantage of multiprocessing.
In the past few weeks, I was reminded of that experience by a couple of e-mails from developers trying to figure out why their programs were performing so poorly. Both were writing programs that performed graphics processing, and both were targeting Windows Mobile devices, which use ARM processors. ARM processors are available with a broad range of performance characteristics. On the low end, the processors support only integer operations, have no divide instruction, and typically run at around 200 MHz. On the high end, the processors may implement floating point operations in hardware (including a divide instruction), include built-in 3D graphics accelerators, and run at up to 1 GHz. Both developers were testing their programs on devices that had no native floating point support and no divide instruction, so those operations were being emulated. That information alone was enough to answer their questions, but I decided to take a few measurements anyway.
I dug up every ARM-based device running a Microsoft OS that I could find. I have several Windows Mobile devices on hand, ranging from Pocket PC 2002 devices to a newly released device that will soon run Windows Mobile 6.5. The Microsoft Zune is also ARM-based, so I included it in the test, and I've also got remote access to a few newly released devices. For the test, I had each device perform a million additions, subtractions, multiplications, and divisions on both integers and double-precision floating point numbers. For the Windows Mobile devices I did this with native (C-language) programs, so that the .NET Compact Framework wouldn't add its own overhead to the measurements. For the Zune I used .NET through the XNA Framework; the Zune version of the .NET Framework supports floating point operations. Because these devices run at a broad range of clock frequencies, I used each device's time to perform a million integer additions as its baseline. My findings were fairly consistent. In general, devices without floating point support took about 30 times longer to perform a division on a double-precision number than an integer addition, while devices with floating point support took only about twice as long.
The algorithms both developers were implementing made heavy use of floating point operations, and both had carried performance expectations over from their experience with desktops. Desktop processors implement a far more complete set of math operations in hardware than mobile processors do. Both developers also happened to have the same device, one with no floating point support, so it's no surprise the algorithms ran so slowly. What could they do about it? There's no universally satisfying substitute for capable hardware, so the exact solution depends on what one is trying to accomplish. For one developer, an acceptable solution was to switch to a different algorithm that produced acceptable results with a much lower computational demand. For the other, compromising on the results was not acceptable, so he revised his software's hardware requirements to identify the hardware it actually needs.
The experience from school and the experiences of these two developers underscore the importance of keeping a solution's implementation and its target hardware in sync with each other.