Introduction
Universal Patch Finder (UPF) is a freeware designed to index various audio files from multiple directories. Every file can be tagged manually or automatically using plugins. The purpose of this article is to demonstrate how the plugin SampleTagger, part of UPF 1.6, generates spectrograms.
How Fast Fourier Transform Works
The goal of FFT is to transform a signal from the time domain to the frequency domain. Depending on which implementation you choose, you have to do some computations before and after the FFT is performed. The first thing to do is to generate a very simple signal in a WAV file and pass it to the FFT to see what's happen. SampleTagger use the class FFT2 implemented by Gerald T. Beauregard.
The Input Signal
Here, a 10khz sine wave is generated with Adobe Audition with sampling rate at 44.1Khz and bit depth at 16 (Standard CD quality). The first thing to do is to normalize the input values to the range [1,1] because the FFT works in this range. <span target="T:SampleTagger.Parsers.SampleReader">SampleReader</span>
is responsible to do this normalization.
Figure 1: Test signal 10khz sampled at 44.1Khz in 16 bits
The FFT Size
Now we pass the samples (the left channel of the stereo file) to the FFT. We must decide the size of the FFT (SampleTagger
uses 256 samples). It is very important to understand the following:
 Half of the FFT output is relevant according to the Nyquist frequency. This means if the input is 256 samples, the FFT produce only 128 values. If the input sample rate is 44100Hz, the 128 values represent frequencies from 0 to 22050Hz.
 The FFT produce frequency samples (or spectral bin). A frequency sample is a complex number with real and imaginary part. The imaginary part gives the phase and the real part gives the amplitude. We have to compute the magnitude in dB from this to produce a nice spectrogram. The magnitude of a spectral bin is simply the amount of energy for the corresponding frequency.
 Depending on the implementation, the FFT output is often not normalized. We have to normalize the output values.
Here, the output is using 256 samples. We only have 128 frequency samples. (Note: This is the real part of complex numbers.)
Figure 2: FFT output with 256 samples
The sampling frequency is 44100Hz, so each spectral bin is (44100/2)/128 = 172Hz. We have a peak near spectral bin 60 which is 60*172=10320 Hz. That's good ! This is our signal.
Here, the output is using 512 samples. We only have 256 frequency samples.
Figure 3: FFT output with 512 samples
The sampling frequency is 44100Hz, so each spectral bin is (44100/2)/256 = 86Hz. We have a peak near spectral bin 119 which is 119*86=10234 Hz. That's good but...
Did you see the amplitude is not the same ? (Two times bigger because the FFT size is now 512). This is because the output is not normalized. In SampleTagger
, we just divide the real part with the size of the FFT to do the trick. (Note: Imaginary part does not need to be normalized.)
fftState.re[i] = fftState.re[i] / fft_size;
Compute Magnitudes
Now we can compute the magnitude from complex values. This is done with the good old Pythagorean theorem. Each complex number can be represented in a twodimensional space. The real part is a
, and the imaginary part is b
.
Figure 4: Magnitude of a complex number a+ib
Magnitudes are stored in a two dimensional array magnitudes[x,y]
where x
is the n^{th} FFT performed by SampleTagger
and y
is the n^{th} magnitude in range [0,fft_size/2]
.
double product = re[y] * re[y] + im[y] * im[y];
magnitudes[x, y] = Math.Sqrt(product);
The final step is to convert magnitude in logarithmic scale because it is more readable. It is very important to understand the following:
 The unit for logarithmic scale is the decibel (dB).

The decibel expresses the ratio between two physical quantities that can be power or intensity.
If you deal with power, the formula is 10log(P1/P2)
.
If you deal with intensity, the formula is 20log(I1/I2)
.
 The output of our FFT is a magnitude, and it is an intensity, not a power. So the formula is
20log(magnitude)
.
So we convert the magnitude in dB with the following:
magnitudes[x, y] = ((double)20) * Math.Log10(magnitudes[x, y] + Double.Epsilon);
Note the Double.Epsilon
to handle the case where magnitude is 0
.
Here is the output in dB for 256 samples:
Figure 5: FFT output with 256 samples
Here is the output in dB for 512 samples.
Figure 6: FFT output with 512 samples
Dealing with Reality
The theory behind the FFT supposes the signal is periodic which is absolutely false in our case. This means when you feed a FFT with 256 samples, the FFT makes the assumption that this signal repeats itself to the infinite on both sides.
Figure 7: FFT periodicity
Because the signal is not periodic, artifacts will appear on both sides of the FFT. To prevent this, the input signal must pass through a window function to fade the signal to 0 on both sides.
The window functions can be looked at as attempts to retain as much information of the data in the window as possible whilst shaping the influence of the data at the edges. There are tons of windows available. SampleTagger
uses the Hanning window (not the Hamming window).
for (int i = 0; i < fft_size; i++)
{
window[i] = (4.0 / fft_size) * 0.5 *
(1  Math.Cos(2 * Math.PI * i / fft_size));
}
Each input sample of the FFT is multiplied by the window:
for (int i = 0; i < fft_size; i++)
{
fftState.re[i] = fftState.input[i] * fftState.window[i];
fftState.im[i] = 0;
}
Here is an example of how the input signal will be presented to the FFT. Note how the sides fade to 0
.
Figure 8: Windowed input signal
This brings the following issue: The FFT accuracy is low on window sides. So if you plan to read an entire WAV file using a buffer of 256 samples, the accuracy of your spectrogram will be bad.
To avoid this, you have to use a sliding window:
Figure 9: Sliding Window Analysis
Sliding window analysis is a technique used to increase temporal precision whilst maintaining accuracy in lower frequencies; remember, when using the FFT, there’s a tradeoff between temporal precision (window size) and frequency accuracy (especially in the lower frequencies).
SampleTagger
slides the window with half of the FFT size, the following code slides the input signal. (Note: This is not the Hanning windowed signal! It's the original input signal.)
Array.Copy(fftState.input, fft_size / 2, fftState.input, 0, fft_size / 2);
current_sample_count = fft_size/2;
Generating the Audio Spectrum
Audio spectrum (also named spectrogram) uses colors to plot magnitudes. Horizontal axis is the time, and vertical axis is the frequencies. As we saw, magnitudes are stored in a two dimensional array magnitudes[x,y]
. So we have to choose a color for each magnitude to produce the spectrogram.
Figure 10: Typical Audio spectrum
The idea is to clip magnitudes values into a fixed range like [0dB,108dB]. If a magnitude is higher than 108dB, we force it to 108dB. Then we translate this to the range [0,100] as percentage. Then we pick a color in a gradient of 100 pixels. The following gradient is used by SampleTagger
:
gradient.png
Figure 11: Color gradient
Because our FFT size is 256 samples, the spectrogram height is 128. The spectrogram width depends on the length of the input signal. We can calculate how many FFT can be performed in a signal if we know how many samples will be read:
nbFFT = (nbSamples / (fft_size / 2))  1;
magnitudes = new double[nbFFT, fft_size/2];
nbFFT = (nbSamples / (fft_size / 2))  1;
magnitudes = new double[nbFFT, fft_size/2];
The final step will stretch the spectrogram bitmap to the right size. Universal patch finder uses 200x16.
private Bitmap RescaleSpectrum(Bitmap spectrum, int width, int height)
{
Bitmap rescaledSpectrum = new Bitmap(width, height,PixelFormat.Format24bppRgb);
using (Graphics graph = Graphics.FromImage(rescaledSpectrum))
{
graph.InterpolationMode = InterpolationMode.High;
graph.CompositingQuality = CompositingQuality.HighQuality;
graph.SmoothingMode = SmoothingMode.AntiAlias;
graph.DrawImage(spectrum, new Rectangle(0, 0, width, height));
return rescaledSpectrum;
}
}
Source Code
The source code is provided in Universal Patch Finder SDK.
Documentation is provided in CHM format.