## Introduction

Discret cosine’s transformation (DCT) is a well known transformation. It has very wide implementation and is used for compression, processing, e.g., due to real character of coefficients. DCT is used more often than Fourier transform. But for DCT, only one fast algorithm is well known - for 2D processing. For 1D, there is the Byeong Lee’s DCT algorithm. But its algorithm is recursive. Using it is useless for some type of equipments (e.g., GPGPU). For this target, I have created an iterative algorithm, without recursive calling and almost unlimited input data (only power of 2).

## Background

For base was chosen Byeong Lee’s DCT, that is shown in the figure. The algorithm is based on Butterfly distribution. This algorithm is used wide for hardware implementation.

## Using the Code

The developed code is based on two Paths: input with Input Butterfly and output with Output Butterfly.

Massive for keeping data between iterations is one, and allocated from heap by demand. Other massive is used for storing computed cosine’s coefficient. Every iteration divides the Front buffer into two parts: Even and Odd. After copying them into Back, the Front and Back buffers are exchanged.

double *Front, *Back, *temp;
double *Array = new double[N<<1];
double *Curent_Cos = new double[N>>1];
Front = Array;
step = 0;
Back = Front + N;
for(i = 0; i < N; i++)
{
Front[i] = block[i];
}
while(j > 1)
{
half_N = half_N >> 1;
current_PI_half_By_N = PI_half / (double)prev_half_N;
current_PI_By_N = 0;
step_Phase = current_PI_half_By_N * 2.0;
for(i = 0; i < half_N; i++)
{
Curent_Cos[i] = 0.5 / cos(current_PI_By_N + current_PI_half_By_N);
current_PI_By_N += step_Phase;
}
shif_Block = 0;
for(int x = 0; x < step; x++)
{
for(i= 0; i<half_N; i++)
{
shift = shif_Block + prev_half_N - i - 1;
Back[shif_Block + i] = Front[shif_Block + i] + Front[shift];
Back[shif_Block + half_N + i] =
(Front[shif_Block + i] - Front[shift]) * Curent_Cos[i];
}
shif_Block += prev_half_N;
}
temp = Front;
Front = Back;
Back = temp;
j = j >> 1;
step = step << 1;
prev_half_N = half_N;
}
while(j < N)
{
shif_Block = 0;
for(int x = 0; x < step; x++)
{
for(i = 0; i<half_N - 1; i++)
{
Back[shif_Block + (i << 1)] = Front[shif_Block + i];
Back[shif_Block + (i << 1) + 1] =
Front[shif_Block + half_N + i] + Front[shif_Block + half_N + i + 1];
}
Back[shif_Block + ((half_N - 1) << 1)] = Front[shif_Block + half_N - 1];
Back[shif_Block + (half_N << 1) - 1] = Front[shif_Block + (half_N << 1) - 1];
shif_Block += prev_half_N;
}
temp = Front;
Front = Back;
Back = temp;
j = j << 1;
step = step >> 1;
half_N = prev_half_N;
prev_half_N = prev_half_N << 1;
}
for(int i = 0; i < N; i++)
{
block[i] = Front[i];
}
block[0] = block[0] / dN;

## Points of Interest

A very important fact is that after the first iteration, we have only one group of odd coefficients, but after the second iteration, we have two groups of odd coefficients. For those two groups, you will need identical cosine's coefficients. Store for precomputing allows to compute that coefficient only for one group. For another group will be used data from store. It's very useful for computing 1024 and more points.

In the end, I can say that the developed algorithm has more precision than the usual implementation (due to less computing of cosine's coefficient).

## History

- 27
^{th} Jan, 2011: Initial post