Fast Integer Square Root For 8051
Fast integer square root computation in 8051 assembly
Introduction
Some microcontroller (MCU) appications need to compute the integer square root (sqrt
) function, quickly (i.e. avoiding division), and using a small number of instructions. This tip shows the implementation of 'Fast Integer Square Root' algorithm, described by Ross M. Fossler in Microchip's application note TB040.
While such application note reports the source code for the Microchip P18C252
MCU, here, the implementation for the 8051
MCU is shown.
Note, just a sketch of the algorithm is presented here, since the exact specification is given by the (provided) C source code. Anyway, I warmly encourage you reading the original Fossler paper.
The Goal
The goal is finding the integer approximation of the square root of a 16-bit
unsigned number, expressed as an unsigned 8-bit
(byte
) number.
Namely, the 16-bit number will be the argument of the assembly routine, the approximation of the number square root will be the routine return value.
The Algorithm
The algorithm is very simple: Iteratively, starting from most significant position (bit 7
of the byte), an 'additional' bit is added to a guess
number, then the squared guess
is compared with the routine argument (say arg
):
- If the
squared guess
is less thanarg
, then we keep the added bit in theguess
, shift the additional bit right and goto next iteration. - On the other hand, if the
squared guess
is greater thanarg
, then we remove the added bit from theguess
, shift the additional bit right and goto the next iteration.
The process terminates either when the squared guess
is equal to the arg
or there is no more additional bit (it was shifted out of its byte).
The Code
The algorithm can be easily formulated using C
code.
uint8_t fast_sqrt_c(uint16_t n)
{
uint8_t g = 0x80; // initial guess
uint8_t b = 0x80; // 'additional' bit is in position 7
for (;;)
{
uint16_t g2 = g*g;
if ( g2 == n )
break; // an exact match is found
if ( g2 > n )
g ^= b; // here g*g > n, remove the added bit
b >>= 1; // shift right the 'addional' bit
if ( b == 0 )
break; // the 'additional' bit was shifted out, the iteration is complete
g |= b; // add the current 'additional bit'
}
return g;
}
The C
is useful because it
- represents a precise specification for the algorithm.
- can be compared with the
sqrt
function of the standardC
library, in order to fully establish the correctness of the algorithm. - can be compiled with a 'suitable' compiler in order to generate
8051
assemebly. Then, the produced assembly can be compared with the hand crafted algorithm implementation (for correctness, speed and MCU usage).
Step 2 was actually performed using GCC
on a Linux
box (the source fast_sqrt_c_test.c code is provided).
Step 3 was actually performed using the SDCC
compiler.
The resulting code (provided as fast_sqrt_c_sdcc.asm is a bit cluttered, so here is reported a 'rearranged' listing, somehow cleaned up:
; Implementation of Ross M. Fossler 'Fast Integer Square Root' algorithm
; (Microchip's application note TB040)
;------------------------------------------------------------
;Allocation info for local variables in function 'fast_sqrt_c'
;------------------------------------------------------------
;n Allocated to registers r6-r7
;g Allocated to registers r5
;b Allocated to registers r4
;g2 Allocated to registers r2-r3
;------------------------------------------------------------
; function fast_sqrt_c
; -----------------------------------------
_fast_sqrt_c:
mov r6,dpl ; function argument low byte
mov r7,dph ; function argument high byte
mov r5,#0x80 ; g = 0x80
mov r4,#0x80 ; b = 0x80
C_Loop:
mov a,r5
mov b,a
mul ab ; a-b = g*g
mov r2,a ; here start the comparison of g2 and n ( g2 == n)
mov r3,b ; g2 = g*g
cjne a,ar6,C_NotEqual
mov a,r3
cjne a,ar7,C_NotEqual
sjmp C_TheEnd ; break if g2 matches n
C_NotEqual:
clr c ; here start the check to establish if g2 is bigger or smaller than n
mov a,r6
subb a,r2
mov a,r7
subb a,r3
jnc C_NotGreater
mov a,r4
xrl ar5,a ; here (g2 > n), emove the 'added' bit
C_NotGreater:
mov a,r4
clr c
rrc a ; shift the additional bit right
mov r4,a
jz C_TheEnd ; no more 'additional' bit to add, exit
mov a,r4
orl ar5,a ; here the 'additional' bit is actually added
sjmp C_Loop
C_TheEnd:
mov dpl,r5
ret ; return g;
;------------------------------------------------------------
SDCC
implementation is, in my opinion, pretty good. Anyway, we can do better with hand-crafted assembly (source provided as fast_sqrt_hand_crafted.asm):
; Hand crafted Implementation of Ross M. Fossler 'Fast Integer Square Root' algorithm
; (Microchip's application note TB040)
; function fast_sqrt_hc
; r6-r7 is n, the function argument
; r5 is g, the guess
; r4 is b, the 'additional' bit
; r2-r3 is g2 = g*g
;
_fast_sqrt_hc:
mov r6, dpl
mov r7, dph
mov r4, #0x80
mov r5, #0x80 ; initialization: function argument n = r6-r7, g = 0x80, b = 0x80
mov a,r5
HC_Loop:
mov b,a
mul ab ; a-b is g2
clr c
subb a,r6
xch a,b
subb a,r7
jc HC_Less ; this means g2 < n
orl a,b
jz HC_TheEnd ; jump to end if (g2 == n)
; here g2 > n, remove the added bit
mov a, r5
xrl a, r4 ; the XOR operration removes the bit
mov r5, a
HC_Less:
clr c
mov a,r4
rrc a ; here the shift of the 'additional' bit is performed
jz HC_TheEnd ; if a is 0 then there is no more 'additional' bit in the byte
mov r4,a
orl a, r5 ; here the 'additional bit' is added
mov r5,a
sjmp HC_Loop
HC_TheEnd:
mov dpl, r5
ret
Some Numbers
The SDCC
produced and the hand crafted one were assembled and compared using the EdSim51DI
simulator.
The results are shown in the following table:
Routine | Code size | Average number of MCU cycles |
SDCC produced | 48 bytes | 170.9 |
Hand crafted | 39 bytes | 140.4 |
Useful 8051 Resources
History
- 27th December, 2020: First release