Visualizing Sound

HoshiKata

4.94/5 (23 votes)

Nov 7, 2012

CPOL

7 min read

80701

7696

Listen or playback sound and visualize the frequency spread.

Download SoundFiltering.zip - 113.3 KB

Introduction

I always wanted to do some basic sound processing and to be able to visualize the sound, see what speech looks like as a frequency spread, or how music looks.

In order to that, several simple problems need to be solved:

DirectX sound playback -> raw values
DirectX listen -> raw values
Values -> FFT
Graph of an FFT
Way to see how the FFT evolves over time.

Background

Demystifying Sound and Frequency Processing

When processing sound, the data is usually compressed but at some point it becomes just a set of samples in time, or more simply as speaker position over time.

y(t) = F(t)

A frequency is a vibration back and forth at a given rate, so:

When does a frequency start?
When does it end?

The answer is a frequency occurs over a period of time, therefore it has a period of time or "support".

To measure it, you compare it against a known frequency, such that the time over which the comparison is made is important. If the time span is too short, only a few digitized samples are recorded and only a few possible frequencies can be measured. If it is too long, the sense of when the frequency or musical note started and ended is lost.

To provide more background I'll add a bit more math: suppose we hear a chord (3 notes) in the sound, then we could write the generating function as three sine waves:

A "C" major chord on a piano is C4, E4 and G4 = 261.63, 329.63 and 392.0 respectively. Played together:

y(t) = c * sin(261.63 * t ) + e * sin(329.63 * t) + g * sin(392 * t)

Where c, e and g are the magnitudes of each of those notes, and t is in seconds.

Because y(t) is sampled at a rate, we convert t from seconds into the sample index by multiplying by the sample rate (say 22,000 times a second). Something that occurred 261.63 times a second beats against the sampling rate, and repeats every 84th sample (22,000/261.63).

y(i) = c * sin(i / 84) + e * sin(i / 66.74) + g * sin(i / 56.1)

The first thing to notice is every sample y(i) is a mix of the three notes, so in order to get the frequencies and visualize them to perform analysis, a balance of when the frequency occurred is made against the sample rate and the frequencies of interest.

Practical FFT basics

Suppose you have a continuous signal y(t), and you wanted to extract the frequencies, then you could for any frequency "f" of interest:

Y(f) = integral ( y(t) * sin(f * t) * dt.

Because Y(f) is continuous AND infinite in length:

Y(f) = integral ( c* sin(a*t) * sin(f * t) * dt) => c if f== a, otherwise 0.

The result is either "c" or 0. Either f is exactly the same frequency as "a" and the result is "c" OR they beat against each other, and over an infinite region of support they cancel and the result is zero.The Fast Fourier Transform (FFT) is a numerical way of converting an array of time domain samples into an array of frequency domain measurements.

The problem is, y(t) isn't continuous (we're sampling it), the tones start and end (edges induce frequency spread), and an FFT looks at a window of the data, not all of it. Thus a 1024 point FFT of data at 22kHz measures the frequencies observed in a region of 1/22nd of a second, and it can't detect things slower than 1/22nd of a second. It also can't detect things faster than half the sampling rate (Nyquist theory).

In reality, if something is slower than the sampling window it appears as a raising or lower of the whole data set: an increase of the "DC" or "direct current" portion of the signal. Frequencies faster than the sampling rate beat against the sampling rate. For example if you had a frequency 3x the sampling rate, the sine wave would wrap around between the samples, and it would appear to be a lower frequency in the data. This means you can't tell if a signal is faster or slower without measuring how it is passing the sampling rate (without measuring the phase).

Based on these effects, it is important to setup the FFT so it is spectrally wide enough to catch the frequencies of interest and narrow enough so that we can see when a tone starts and ends (in this case, roughly 1/22nd of a second), and most frequencies of interest in music / voice. It's also good to know that it isn't perfect, a "C" chord has very exact frequencies and they won't fall exactly into specific FFT array "bins".

FFT -> Graphics considerations

Because an FFT blurs when a note or frequency starts or ends, and it blurs within the FFT bins a bit, I decided on a stack of FFT's where the strength of the energy in an FFT bin is color coded as a pixel.

Buffering

There are numerous articles that describe how DirectX can playback or listen to a microphone. The code provides a decent example of how to do that. What is interesting is how threading is used in this example. Data is pulled from the file, then the buffer is added to a list of buffers to work on. A Thread then pulls off data from the arrays and performs as many FFT's as needed.

For example, the data arrived event produces 44,000 samples, then 22,000 samples. This would result in two buffers pushed off to the work queue (44k, then 22k). The FFT size is 1024. The worker thread can then asynchronously process bunches of data, then publish the results (an event) so it can be viewed.

Overlapped FFT

An FFT converts time domain data into frequency domain data, so if the rhythm starts or begins around the edge of one data set or another, the edge of it starting will cause some ringing. To avoid this and create a better sense of "locality": establish the "when" of a given frequency starting or stopping, the data set is broken into overlapped regions that can be FFT'd. If there was 22k data points, and the FFT size was 1024, overlapped into forths, the first FFT starts at 0, then at 1024/4, then 1024/2, 1024*3/4, 1024, 1024 + 1024/4, ...

FFT Visualization

The FFT visualization doesn't keep track of each FFT. Instead to draw N columns of FFT, it simply shifts the existing image over by N columns and appends.The problem is sound has a lot of dynamic range and to see the data, I scale the appending columns by tracking the minimum and maximum value.

Using the code

There are a few cool things in the code provided:

DirectX sound capture, and playback, an FFT, and some graphics widgets for future use.
Each is more or less standalone and usable in a future product,
Code is (or at least was meant to be) easy to read / understand.

FFT

There are a lot of standard things you do with an FFT depending on the application. Input/output data can be real, complex, magnitudes, interlaced or separate complex, integer, float or doubles or some mixture. Buffers are sometimes reused for output or are separate.

The API I provide for the FFT and inverse are generally:

Complex -> (I)FFT -> Complex, Real -> (I)FFT -> Complex, Real -> (I)FFT -> Magnitude

For example:

/// <summary>
/// Compute the FFT forward in place for real data.
/// </summary>
/// <param name="realIn">Reals</param>
public static void Forward(double[] reals)

Graphs

The graph class is pretty basic but it is fast, and does re-size / redraw and uses a timer to do some event coalescence. The API is relatively simple and rather self explanatory.

Color TextColor;  
List<Line> Lines 
bool ShowLegend  {get; set;}
bool ShowAxis  {get; set;}
bool ShowGrid{get; set;}
void TriggerRedraw();

The Line class describes an array of points, and a color to use for them.

string Name;  
bool Show; 
double [] X; 
double [] Y;
Color LineColor;
void Update(double [] array);  
float LineWidth;

To graph an array of data you can either specify the X axis or not. The following graphs an array:

if (graph1.Lines.Count < 1)
{
    graph1.Lines.Add(new Graph.Line());
}
graph1.Lines[0].Update(myArray);
graph1.TriggerRedraw();

Points of Interest

Nyquist theory: http://en.wikipedia.org/wiki/Nyquist_frequency

FFT: http://en.wikipedia.org/wiki/FFT