Introduction
Let us study the problem of computing R cos(a + k b) and R sin(a + k b) for an increasing integer k. It arises when doing successive rotations (with scaling), or
generating points around a circle. A little challenge is to find efficient ways to compute these numbers, by reducing the amount of calls to the
transcendent functions, which are known to be costly even with a floatingpoint accelerator.
Background
[1] William H., Saul A. Teukolsky, William T. Vetterling & Brian P. Flannery, Numerical Recipes in C: The Art of Scientific Computing, CAMBRIDGE UNIVERSITY PRESS, Second Edition,
pp.178179.
[2] Goertzel, G. (January 1958), An Algorithm for the Evaluation of Finite Trigonometric Series, American Mathematical Monthly 65 (1): 34–35.
The Solution
The straightforward approach requires 1 addition, 3 multiplies, 3 assignments, and 2 intrinsic function calls per angle, using a temporary, T. Coded in C, the body of the loop writes:
T= A + K * B;
X= R * cos(T);
Y= R * sin(T);
Computing the angle by successive additions of b
to a
is not recommended because of the accumulation of truncation errors.
The Rotation approach
It is well known from algebra that trigonometric functions enjoy many
properties, including addition formulas, which we plan to exploit to derive incremental procedures:
cos(a + b) = cos(a) cos(b) – sin(a) sin(b), and
sin(a + b) = sin(a) cos(b) + cos(a) sin(b)
For the sake of conciseness, we will use complex number representation with the notation cis(a)
that represents the number cos(a) + i sin(a), also known to be e^{ia} from Euler’s formula. The addition formulas just express the wellknown rule of exponent addition:
cis(a + b) = e^{i}^{(}^{a+b}^{)} = e^{ia} e^{ib} = cis(a) cis(b)
Simple, right? (You can check that by developing the complex product.)
Back to our problem, we know that we have to compute the complex numbers:
P_{k} = R cis(a + k b)
For which we can easily derive this recurrence:
P_{k+1} = R cis(a + (k+1) b) = R cis((a + k b) + b) = R cis(a + k b) cis(b) = P_{k} cis(b), with
P_{0} = R cis(a)
Programmatically, it translates to the code below, taking care not to overwrite the variables too early. This costs 2 adds, 4 muls, and 3 assigns, and we got rid of the function calls:
T= Const0 * X  Const1 * Y;
Y= Const0 * Y + Const1 * X;
X= T;
with the onetime initialization:
X= R * cos(A);
Y= R * sin(A);
Const0= cos(B);
Const1= sin(B);
The Chord approach
The above recurrence has a little flaw in that truncation errors will accumulate over time. The biggest cause of truncation is caused by
the fact that for small b, cos(b) is close to 1 (by McLaurin, cos(b)=1b^{2}/2+b^{4}/24…) and many significant digits of b get lost; sin(b)
also loses significant digits, but the corresponding absolute error is an order of magnitude smaller (sin(b)=b (1b^{2}/6+b^{4}/120…)).
We can fix that by working with the difference between successive P values instead of the P values themselves
^{[}^{1]}. Let us introduce the chords, given by:
Q_{k} = P_{k+1} – P_{k}
We have:
Q_{k} = P_{k} cis(b)  P_{k} = (cis(b)  1) P_{k}, and
P_{k+1} = P_{k} + Q_{k}
The magic in this formula lies in the factor (cis(b)1) = cos(b)1 + i sin(b), because cos(b)1 can be expressed as 2 sin^{2}(b/2) by the
doubleangle formula, now two orders of magnitude smaller. This increases the cost to 4 adds, 4 muls, 3 assigns:
T= X + (Const2 * X  Const1 * Y);
Y= Y + (Const2 * Y + Const1 * X);
X= T;
with
Const1= sin(B);
Const2=  2 * sin(0.5 * B) * sin(0.5 * B);
X= R * cos(A);
Y= R * sin(A);
By the way, history has given 1cos(b) a name: the versed sine.
The Goertzel approach
Goertzel^{[}^{2]} found a clever way to reduce the operation count involved, by developing a second order recurrence. Now, instead of relating
P_{k+1} to just P_{k}, we relate it to P_{k1} as well. From geometric considerations,
we can observe that:
(P_{k+1} + P_{k1}) / 2 = cos(b) P_{k}, or
P_{k+1 }= 2 cos(b) P_{k}  P_{k1}
What is gained here is that we no more multiply P_{k} by a complex number but by a real one. Hence, 2 adds, 2 muls, 6
assigns, using the two extra variables XX
and YY
to keep track of P_{k1}:
T= Const3 * X  XX; XX= X; X= T;
T= Const3 * Y  YY; YY= Y; Y= T;
with corresponding initializations:
Const1= sin(B);
Const2=  2 * sin(0.5 * B) * sin(0.5 * B);
Const3= 2 * cos(B);
X= R * cos(A);
Y= R * sin(A);
XX= X + (Const2 * X + Const1 * Y);
YY= Y + (Const2 * Y  Const1 * X);
The ChordGoertzel approach
As is turns out, unfortunately, Goertzel’s method suffers from numerical instability. And it does it in a stronger way
because of large value cancellation in the subtraction of its two terms. Luckily, we can fix that by using the chord trick once more. We derive:
Q_{k}_{ }= 2 cos(b) P_{k}  P_{k1} P_{k}_{ }= 2 (cos(b)1), and
P_{k}  P_{k1}+ P_{k} = 2 (cos(b)1)
P_{k} + Q_{k1}
Now we have recovered the accurate cos(b)1 factor and removed the cancellation. This results in 4 adds, 2 muls, and 4 assigns, introducing the components of Q_{k1},
DX and DY:
DX= Const4 * X + DX; X+= DX;
DY= Const4 * Y + DY; Y+= DY;
and the initializations:
Const1= sin(B);
Const2=  2 * sin(0.5 * B) * sin(0.5 * B);
Const4= 2 * Const2;
X= R * cos(A);
Y= R * sin(A);
DX=  (Const2 * X + Const1 * Y);
DY=  (Const2 * Y  Const1 * X);
Speed comparison
We ran the above five implementations for one million iterations and timed them. Except for the straight version, running times are
independent of the argument values. The fastest one, Goertzel, only takes 5 ns per iteration on a 2.2 GHz machine, which corresponds to 11 cycles only (!)
Accuracy comparison
We used single precision floatingpoint variables to increase the accuracy degradation. The values in the table correspond to 1000
iterations starting from a=2 with step b=0.001. This is a typical behavior.
Conclusions
Grouping our results we have:
 +
 x
 =
 cos/sin
 Cycles
 Error

Straight
 1
 3
 3
 2
 167
 0.000000

Rotation
 2
 4
 3
 
 12
 0.000024

Chord
 4
 4
 3
 
 15
 0.000000

Goertzel
 2
 2
 6
 
 11
 0.021021

ChordGoertzel
 4
 2
 4
 
 13
 0.000001

Points of Interest
 Do use an incremental algorithm, it is at least ten times faster than the trivial implementation.
 Among the incremental algorithms, the running speeds do not vary so much.
 Chord and ChordGoertzel are both stable and fast.
 On a PC, single and double precision versions run at the same speed and are blazingly fast!
History
This is the second released version after formatting adjustments and use of the cis notation.