Introduction
Universal Patch Finder (UPF) is a freeware application that indexes audio files from multiple directories. Every file can be tagged manually, or automatically using plugins. The purpose of this article is to demonstrate how the SampleTagger plugin, part of UPF 1.6, generates spectrograms.
How the Fast Fourier Transform works
The goal of the FFT is to transform a signal from the time domain to the frequency domain. Depending on which implementation you choose, you have to do some computations before and after the FFT is performed. The first thing to do is to generate a very simple signal in a WAV file and pass it to the FFT to see what happens. SampleTagger uses the FFT2 class implemented by Gerald T. Beauregard.
The input signal
Here is a 10 kHz sine wave generated with Adobe Audition, with a sampling rate of 44.1 kHz and a bit depth of 16 (standard CD quality). The first thing to do is to normalize the input values to the range [-1, 1], because the FFT works in this range. SampleReader is responsible for this normalization.
Figure 1: Test signal, 10 kHz sampled at 44.1 kHz in 16 bits
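The normalization step can be sketched as follows. This is an illustrative Python snippet, not SampleReader's actual C# code; the helper name and the divisor 32768 (2^15, the magnitude of the most negative 16-bit value) are assumptions for 16-bit signed PCM.

```python
# Convert raw 16-bit signed PCM samples to floats in [-1, 1].
# 32768.0 = 2^15, so -32768 maps to exactly -1.0 and 32767 to just under 1.0.
def normalize_16bit(samples):
    return [s / 32768.0 for s in samples]

pcm = [0, 16384, -32768, 32767]
print(normalize_16bit(pcm))  # every value now lies in [-1, 1]
```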
The FFT size
Now we pass the samples (the left channel of the stereo file) to the FFT. We must decide on the size of the FFT (SampleTagger uses 256 samples). It is very important to understand the following:
- Only half of the FFT output is relevant, because of the Nyquist frequency. This means that if the input is 256 samples, the FFT produces only 128 useful values. If the input sample rate is 44100 Hz, these 128 values represent frequencies from 0 to 22050 Hz.
- The FFT produces frequency samples (or spectral bins). A frequency sample is a complex number with a real and an imaginary part; together they encode the amplitude and the phase of the corresponding frequency. We have to compute the magnitude in dB from them to produce a nice spectrogram. The magnitude of a spectral bin is simply the amount of energy at the corresponding frequency.
- Depending on the implementation, the FFT output is often not normalized. We have to normalize the output values.
Here is the output using 256 samples. We only have 128 frequency samples. (Note: this is the real part of the complex numbers.)
Figure 2: FFT output with 256 samples
The sampling frequency is 44100 Hz, so each spectral bin is (44100/2)/128 = 172 Hz wide. We have a peak near spectral bin 60, which is 60*172 = 10320 Hz. That's good! This is our signal.
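The arithmetic above can be checked with a naive DFT, a slow but tiny stand-in for the FFT (illustrative Python, not the FFT2 class; the function name is made up for this sketch):

```python
import math

def dft_magnitudes(signal):
    """Naive O(n^2) DFT; returns one magnitude per bin up to Nyquist."""
    n = len(signal)
    mags = []
    for k in range(n // 2):
        re = sum(signal[i] * math.cos(2 * math.pi * k * i / n) for i in range(n))
        im = -sum(signal[i] * math.sin(2 * math.pi * k * i / n) for i in range(n))
        mags.append(math.sqrt(re * re + im * im))
    return mags

fs, freq, n = 44100, 10000, 256
sine = [math.sin(2 * math.pi * freq * i / fs) for i in range(n)]
mags = dft_magnitudes(sine)        # 128 bins, each (44100/2)/128 ~ 172 Hz wide
peak = mags.index(max(mags))
print(peak, peak * (fs / 2) / (n / 2))  # peak at bin 58, ~9991 Hz
```

The peak lands on the bin nearest to 10000/172.27, i.e. bin 58, consistent with the "near bin 60" reading of Figure 2.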
Here is the output using 512 samples. We only have 256 frequency samples.
Figure 3: FFT output with 512 samples
The sampling frequency is 44100 Hz, so each spectral bin is (44100/2)/256 = 86 Hz wide. We have a peak near spectral bin 119, which is 119*86 = 10234 Hz. That's good, but...
Did you see that the amplitude is not the same? (It is two times bigger, because the FFT size is now 512.) This is because the output is not normalized. In SampleTagger we simply divide the real part by the size of the FFT to do the trick. (Note: the imaginary part is not normalized here.)
fftState.re[i] = fftState.re[i] / fft_size;
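You can verify that dividing by the FFT size makes the peak amplitude independent of the FFT size. The sketch below (illustrative Python, not FFT2; the helper name is made up) uses a 11025 Hz tone because it falls exactly on a bin for both sizes, so scalloping does not blur the comparison, and it folds both parts into a single magnitude for simplicity:

```python
import math

def bin_magnitude(signal, k):
    """Magnitude of DFT bin k for a real signal (naive, O(n))."""
    n = len(signal)
    re = sum(signal[i] * math.cos(2 * math.pi * k * i / n) for i in range(n))
    im = sum(signal[i] * math.sin(2 * math.pi * k * i / n) for i in range(n))
    return math.sqrt(re * re + im * im)

fs, freq = 44100, 11025   # 11025 Hz sits exactly on a bin for both sizes
for n in (256, 512):
    sine = [math.sin(2 * math.pi * freq * i / fs) for i in range(n)]
    k = round(freq * n / fs)        # bin 64 for n=256, bin 128 for n=512
    raw = bin_magnitude(sine, k)    # raw peak grows with n: ~128 vs ~256
    print(n, raw, raw / n)          # normalized value is ~0.5 either way
```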
Compute magnitudes
Now we can compute the magnitude from the complex values. This is done with the good old Pythagorean theorem. Each complex number can be represented in a two-dimensional space: the real part is a, and the imaginary part is b.
Figure 4: Magnitude of a complex number a+ib
Magnitudes are stored in a two-dimensional array magnitudes[x, y], where x is the index of the nth FFT performed by SampleTagger and y is the index of the magnitude, in the range [0, fft_size/2).
double product = re[y] * re[y] + im[y] * im[y];
magnitudes[x, y] = Math.Sqrt(product);
The final step is to convert the magnitude to a logarithmic scale, because it is more readable. It is very important to understand the following:
- The unit for the logarithmic scale is the decibel (dB). The decibel expresses the ratio between two physical quantities.
- If you deal with power quantities, the formula is 10*log10(P1/P2).
- If you deal with amplitude (field) quantities, the formula is 20*log10(A1/A2).
- The output of our FFT is a magnitude, which is an amplitude, not a power. So the formula is 20*log10(magnitude).
So we convert the magnitude to dB with the following:
magnitudes[x, y] = ((double)20) * Math.Log10(magnitudes[x, y] + Double.Epsilon);
Note the Double.Epsilon, which handles the case where the magnitude is 0.
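The same conversion can be sanity-checked in a few lines. This is an illustrative Python sketch, not SampleTagger's code; math.ulp(0.0) is the closest Python analogue of C#'s Double.Epsilon (the smallest positive double):

```python
import math

def to_db(magnitude):
    # math.ulp(0.0) plays the role of Double.Epsilon: it keeps
    # log10(0) from blowing up when the magnitude is exactly zero.
    return 20.0 * math.log10(magnitude + math.ulp(0.0))

print(to_db(1.0))   # 0 dB: full-scale magnitude
print(to_db(0.5))   # about -6.02 dB: halving an amplitude costs ~6 dB
print(to_db(0.0))   # a very large negative number instead of a crash
```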
Here is the output in dB for 256 samples.
Figure 5: FFT output with 256 samples
Here is the output in dB for 512 samples.
Figure 6: FFT output with 512 samples
Dealing with reality
The theory behind the FFT assumes the signal is periodic, which is absolutely false in our case. This means that when you feed an FFT with 256 samples, the FFT makes the assumption that this signal repeats itself to infinity on both sides.
Figure 7: FFT periodicity
Because the signal is not periodic, artifacts will appear on both sides of the FFT. To prevent this, the input signal must pass through a window function that fades the signal to 0 on both sides.
Window functions can be looked at as attempts to retain as much information from the data in the window as possible whilst shaping the influence of the data at the edges. There are tons of windows available. SampleTagger uses the Hanning (Hann) window, not the Hamming window.
for (int i = 0; i < fft_size; i++)
{
    window[i] = (4.0 / fft_size) * 0.5 * (1 - Math.Cos(2 * Math.PI * i / fft_size));
}
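A quick check of the window's shape (illustrative Python mirroring the C# above, including its 4.0/fft_size amplitude-correction factor): the window must be 0 at the first sample and peak at the centre, which is exactly what fades the frame edges to silence.

```python
import math

fft_size = 256
window = [(4.0 / fft_size) * 0.5 * (1 - math.cos(2 * math.pi * i / fft_size))
          for i in range(fft_size)]

print(window[0])                  # 0.0: the signal is fully faded at the edge
print(window.index(max(window)))  # 128: the peak sits at the centre of the frame
print(max(window))                # 4/256 = 0.015625 with this gain factor
```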
Each input sample of the FFT is multiplied by the window:
for (int i = 0; i < fft_size; i++)
{
    fftState.re[i] = fftState.input[i] * fftState.window[i];
    fftState.im[i] = 0;
}
Here is an example of how the input signal is presented to the FFT. Note how the sides fade to 0.
Figure 8: Windowed input signal
This brings the following issue: the FFT accuracy is low at the window sides. So if you plan to read an entire WAV file using a buffer of 256 samples, the accuracy of your spectrogram will be poor.
To avoid this, you have to use a sliding window:
Figure 9: Sliding Window Analysis
Sliding window analysis is a technique used to increase temporal precision whilst maintaining accuracy in the lower frequencies. Remember, when using the FFT, there is a tradeoff between temporal precision (window size) and frequency resolution (especially in the lower frequencies).
SampleTagger slides the window by half of the FFT size. The following code slides the input signal. (Note: this is not the Hann-windowed signal! It's the original input signal.)
Array.Copy(fftState.input, fft_size / 2, fftState.input, 0, fft_size / 2);
current_sample_count = fft_size/2;
Generating the audio spectrum
The audio spectrum (also called a spectrogram) uses colors to plot magnitudes. The horizontal axis is time, and the vertical axis is frequency. As we saw, magnitudes are stored in a two-dimensional array magnitudes[x, y], so we have to choose a color for each magnitude to produce the spectrogram.
Figure 10: Typical Audio spectrum
The idea is to clip the magnitude values into a fixed range like [-108 dB, 0 dB]: if a magnitude is lower than -108 dB, we force it to -108 dB. Then we translate this range to [0, 100] as a percentage, and pick a color from a gradient of 100 pixels. The following gradient is used by SampleTagger:
Figure 11: Color gradient
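The clip-then-pick-a-color step can be sketched like this. This is an illustrative Python sketch, not SampleTagger's code; it assumes the clipping range is [-108 dB, 0 dB] (after normalization, magnitudes peak at 0 dB), and the function name is made up:

```python
def color_index(db, floor_db=-108.0, gradient_len=100):
    """Map a dB magnitude to an index into a 100-pixel gradient bitmap.

    Values outside [floor_db, 0] are clipped before mapping.
    """
    db = max(floor_db, min(0.0, db))               # clip into [-108, 0]
    percent = (db - floor_db) / -floor_db * 100.0  # translate to 0..100
    return min(int(percent), gradient_len - 1)     # index 0..99

print(color_index(0.0))     # 99: loudest magnitude -> last gradient pixel
print(color_index(-108.0))  # 0: the floor -> first gradient pixel
print(color_index(-54.0))   # 50: halfway up the gradient
```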
Because our FFT size is 256 samples, the spectrogram height is 128 pixels. The spectrogram width depends on the length of the input signal. We can calculate how many FFTs can be performed on a signal if we know how many samples will be read:
nbFFT = (nbSamples / (fft_size / 2)) - 1;
magnitudes = new double[nbFFT, fft_size / 2];
The final step stretches the spectrogram bitmap to the right size. Universal Patch Finder uses 200x16.
private Bitmap RescaleSpectrum(Bitmap spectrum, int width, int height)
{
    Bitmap rescaledSpectrum = new Bitmap(width, height, PixelFormat.Format24bppRgb);
    using (Graphics graph = Graphics.FromImage(rescaledSpectrum))
    {
        graph.InterpolationMode = InterpolationMode.High;
        graph.CompositingQuality = CompositingQuality.HighQuality;
        graph.SmoothingMode = SmoothingMode.AntiAlias;
        graph.DrawImage(spectrum, new Rectangle(0, 0, width, height));
        return rescaledSpectrum;
    }
}
Source code
The source code is provided in the Universal Patch Finder SDK.
Documentation is provided in CHM format.