Silverlight Pronunciation Test





How to create a pronunciation test tool using Silverlight and Python
Table of Contents
- Introduction
- System Requirements
- User Interface
- Playing Sample Voice
- Recording The User's Voice
- Uploading User's Voice File To Server
- Extracting Pitch Contour
- Extracting Wave Form
- Displaying Pitch Contour
- Displaying Wave Form
- Calculating Score
- Displaying Score
- Final Considerations
- History
Introduction
The reason for this article and the accompanying application is an idea for an automated pronunciation test that I had been flirting with for a few months. Because of the difficulties I found, and some frustration, I thought about giving it up three or four times, but in the end the inner "never give up" voice had the upper hand and eventually won. Fortunately, I ended up with a solution which, I admit, is not perfect, and one that strayed far from my initial "development track". But that is how it often goes when you run into difficulties: you embrace whatever tools work for you.
The first problem was finding code or a component to generate the so-called "pitch contour" for the analysis. The pitch contour is the melody that follows the human voice; more technically, the fluctuation in frequency that accompanies it. I searched the internet hard for that magical "open source .NET code" that included a pitch contour calculation, but with no success. I found some open source solutions, but sadly they are not written in .NET (mostly C++ or Python). Sadly, too, I'm no expert in C++ or Python, and the code is too large to be ported. Nor am I an expert in the mathematical algorithms (such as the Fast Fourier Transform) that would be needed to create a new library from scratch. So I ended up with a "collaboration" between a server-side C# program and a console Python application. Not particularly pretty, since I initially planned an all-client-side, managed-code solution, but it works, and that's what matters. I hope some code hero in the .NET community comes up with a more elegant solution for that.
The second problem was comparing the user's voice against the predefined exercise voice and producing a score. How can I compare the two pitch contours? I had no tools for such a task, so I had to come up with a new one. It took many hours of work and it's still not perfect, but it's the only one I have so far. As with the previous problem, a code hero would save the day here.
System Requirements
Make sure you follow these 3 steps:
1. The following software is needed to run the Pronunciation Test provided with this article:
- Visual Studio 2010 or Visual C# Web Developer
2. Also, you must download the Python 2.2 software and make sure it is installed in the C:\Python22 folder. This is necessary because the source code only works with the application stored in the C:\Python22 folder. Since the app was built with Python 2.2 only, I can't tell whether it will work with other versions of Python.
3. Finally, you must download the snack2.2.zip file and copy its contents to the C:\Python22\tcl folder. Without this folder, the application will not work.
User Interface
The user interface is 100% Silverlight. It's clean and, I must admit, somewhat inspired by the Metro design language. The buttons perform very basic functions: moving to the previous and next exercises, playing the sample voice and the user's voice, and recording the user's voice.
As with many XAML projects, this one makes use of MVVM (Model-View-ViewModel) pattern.
In short, there is no code-behind for the buttons' click events, nor any "object.property = new value" instruction (actually, there are a couple of event handlers, but only where using MVVM appeared to be impossible). Instead, the buttons use MVVM-style command bindings.
Playing Sample Voice
For this application, I included 2 speeches taken from the Free Sound website, so there is no copyright problem with those voices. I included just 2 sample files, in order to enable the next/previous functionality while keeping the .zip source code as small as possible. Those files are sample01.wav and sample02.wav, located in the PitchContour.Web\Files folder.
You can change or add further sample audio files if you want, but be warned that there are some conditions that must be met:
- Files must have .wav extension.
- Files must be mono.
These requirements are imposed by the tools I've chosen for the app. If you are interested in adding files that do not meet these conditions, or even in recording your own voice, then you might want to install Audacity, an excellent free tool for recording and editing audio.
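If you would rather convert a stereo file to mono programmatically instead of using Audacity, something like the following Python sketch will do for 16-bit PCM files. Note that this helper (including its name) is my own illustration, not part of the article's source code:

```python
import array
import wave

def stereo_to_mono(src_path, dst_path):
    """Average the left/right channels of a 16-bit stereo WAV into a mono WAV."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(params.nframes)
    if params.nchannels == 1:
        samples = array.array("h", frames)  # already mono, copy as-is
    else:
        stereo = array.array("h", frames)   # interleaved L, R, L, R, ...
        samples = array.array(
            "h", ((stereo[i] + stereo[i + 1]) // 2 for i in range(0, len(stereo), 2))
        )
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(params.sampwidth)
        dst.setframerate(params.framerate)
        dst.writeframes(samples.tobytes())
```

The resulting file keeps the original sample rate and bit depth, so it still satisfies the .wav requirement above.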
But how do we actually play the Sample Voice in our application? First, we have a standard MediaElement declared in the XAML:
<MediaElement x:Name="sampleVoiceMediaElement" Width="450" Height="250" Stretch="Fill" AutoPlay="True"
              Position="{Binding SampleVoiceMediaPosition, Mode=TwoWay}" MediaOpened="sampleVoiceMediaElement_MediaOpened"/>
In the above snippet, notice that the MediaElement's Position property is bound (two-way) to the SampleVoiceMediaPosition property of the ViewModel, and that the MediaOpened event is handled in code-behind.
Let's take a look at that MediaOpened handler, which stores the media's natural duration in the ViewModel:
private void sampleVoiceMediaElement_MediaOpened(object sender, RoutedEventArgs e)
{
    viewModel.SampleVoiceDuration = this.sampleVoiceMediaElement.NaturalDuration;
}
Now that the duration is known, we are able to calculate the percentage of the media that has already been played. This happens in the setter of the SampleVoiceMediaPosition property:
public TimeSpan SampleVoiceMediaPosition
{
    get
    {
        return sampleVoiceMediaPosition;
    }
    set
    {
        sampleVoiceMediaPosition = value;
        NotifyPropertyChanged("SampleVoiceMediaPosition");
        if (sampleVoiceDuration.HasTimeSpan)
        {
            if (sampleVoiceDuration.TimeSpan.TotalMilliseconds > 0)
            {
                var x = (double)(value.TotalMilliseconds / sampleVoiceDuration.TimeSpan.TotalMilliseconds)
                    * CANVAS_WIDTH;
                SampleVoiceMediaBorderWidth = x;
            }
        }
    }
}
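The proportion computed in the setter is just "played fraction times canvas width". A plain Python sketch of that arithmetic (the function name and the 500-pixel default are mine, for illustration only):

```python
def cursor_width(position_ms, duration_ms, canvas_width=500.0):
    """Width of the progress cursor: the fraction already played, scaled to the canvas."""
    if duration_ms <= 0:
        return 0.0  # guard against division by zero, like the C# setter does
    return position_ms / duration_ms * canvas_width
```

For example, halfway through a 3-second clip the cursor covers half of a 500-pixel canvas.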
Now that the cursor width has been calculated, it is stored in the SampleVoiceMediaBorderWidth property, which notifies the view whenever it changes:
public double SampleVoiceMediaBorderWidth
{
    get
    {
        return sampleVoiceMediaBorderWidth;
    }
    set
    {
        sampleVoiceMediaBorderWidth = value;
        NotifyPropertyChanged("SampleVoiceMediaBorderWidth");
    }
}
<Border x:Name="brdSampleVoiceCursor" BorderBrush="DarkGreen"
        BorderThickness="1" Height="100" Width="{Binding SampleVoiceMediaBorderWidth, Mode=TwoWay}"
        HorizontalAlignment="Left" VerticalAlignment="Center">
    <Border.Background>
        <LinearGradientBrush StartPoint="0,0" EndPoint="0,1">
            <GradientStop Offset="0" Color="#fff"/>
            <GradientStop Offset="0.5" Color="#8f8"/>
            <GradientStop Offset="1" Color="#8f8"/>
        </LinearGradientBrush>
    </Border.Background>
</Border>
In short: as the sample voice plays, the Position binding keeps SampleVoiceMediaPosition up to date, which in turn resizes the green Border above, producing a progress-cursor effect over the wave form.
Recording The User's Voice
There are some solutions for audio recording with Silverlight on the web. I particularly liked the one proposed on Ondrej Svacina's blog. I must say that for the audio recording part I simply copied his code, but in the end there are some noticeable differences between our interfaces:
- Ondrej's code allows for downloading the audio file locally (my code just uploads it to the server).
- He included a pair of buttons to start and stop the recorder. Mine has a single on/off recording button.
- His interface shows an analog counter to track the recorder progress (mine shows none).
You just need to click the recorder button to start recording your voice, and then click it once again to stop recording:
Uploading User's Voice File To Server
Once the voice is recorded, the application starts uploading it to the server. For this functionality I initially had no code of my own, so I had to resort to someone who had already done it. That's why I chose Michael Washington's great Silverlight Simple Drag And Drop / Or Browse View Model / MVVM File Upload Control article. Although Michael's article had a very different goal from mine, it fortunately provided the Silverlight and web server plumbing needed to implement the voice upload functionality.
Extracting Pitch Contour
As I stated at the beginning of the article, I unfortunately didn't manage to find or write managed code for extracting the pitch contour from the .wav voice file. Nevertheless, I came up with a solution using Snack and a small Python program invoked via the command line. In their own words:
"The Snack Sound Toolkit is designed to be used with a scripting language such as Tcl/Tk or Python. Using Snack you can create powerful multi-platform audio applications with just a few lines of code. Snack has commands for basic sound handling, such as playback, recording, file and socket I/O. Snack also provides primitives for sound visualization, e.g. waveforms and spectrograms. It was developed mainly to handle digital recordings of speech, but is just as useful for general audio. Snack has also successfully been applied to other one-dimensional signals. The combination of Snack and a scripting language makes it possible to create sound tools and applications with a minimum of effort. This is due to the rapid development nature of scripting languages. As a bonus you get an application that is cross-platform from start. It is also easy to integrate Snack based applications with existing sound analysis software."
The Pitch Contour Extraction is done by a script written in Python:
from Tkinter import *
import tkSnack
import pickle

class Speech:
    def Analyze(self, inputFile, outputFile):
        # Snack needs a Tk root window to initialize
        root = Tk()
        tkSnack.initializeSnack(root)
        # Load the .wav file and compute its pitch values
        mySound = tkSnack.Sound(load=inputFile)
        data = mySound.pitch()
        # Serialize (pickle) the pitch list to the destination file
        f = open(outputFile, "w")
        pickle.dump(data, f)
        f.close()

speech = Speech()
speech.Analyze('{source}', "{destination-pitch}")
The above script is quite simple: first, it imports the required libraries (Tkinter, tkSnack, pickle). Then a new instance of the Snack Sound class is created from the input file, the pitch data is computed by the pitch() method, and the result is serialized (pickled) to the destination file.
This destination file will contain a list of values representing the pitch variation, that is, the variation in frequency. As expected, male voices will have lower average values than female voices. These values are later read by the application and displayed over the wave form. This is how the resulting .txt pitch file looks (each value is preceded by an 'F' character):
(F0.0
F0.0
F0.0
F0.0
F0.0
F0.0
F0.0
F0.0
F0.0
F216.0
F214.0
F212.0
F213.0
F212.0
F210.0
F204.0
F206.0
F202.0
F196.0
F190.0
F178.0
F160.0
F0.0
F0.0
F0.0
F0.0
F0.0
F0.0
F0.0
F222.0
.
.
.
F0.0
tp0
.
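The 'F' prefixes and the trailing 'tp0' marker are artifacts of Python's pickle protocol 0, which stores a list of floats as plain ASCII lines. If you ever need to read such a file back on the Python side, a plain pickle.load is enough (a sketch of mine, not part of the project's code):

```python
import pickle

def load_pitch(path):
    """Read back the pitch list written by the Snack extraction script."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

On the C# side, of course, the application parses the 'F'-prefixed text lines directly rather than going through pickle.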
But as mentioned before, the Python code is not called directly by the .NET application. Instead, we instantiate a Process on the server side that launches the Python interpreter with the generated script:
public void GeneratePitchFile()
{
    var pythonFolder = ConfigurationManager.AppSettings["PythonFolder"];
    var extractPitchProgram = ConfigurationManager.AppSettings["ExtractPitchProgram"];
    var pythonExe = System.IO.Path.Combine(pythonFolder, "python.exe");
    var extractPitchDestinationPath = System.IO.Path.Combine(pythonFolder,
        string.Format(@"lib\{0}", extractPitchProgram));
    var pitchResultPath = filePath.Replace(".wav", ".txt");
    var waveResultPath = filePath.Replace(".wav", "-wave.txt");
    using (var sr = new StreamReader(Path.Combine(appFolder, "ExtractPitch.py")))
    {
        using (var sw = new StreamWriter(extractPitchDestinationPath, false))
        {
            var fileString = sr.ReadToEnd()
                .Replace("{source}", filePath.Replace(@"\", @"\\"))
                .Replace("{destination-pitch}", pitchResultPath.Replace(@"\", @"\\"))
                .Replace("{destination-wave}", waveResultPath.Replace(@"\", @"\\"));
            sw.Write(fileString);
        }
    }
    Process process = Process.Start(pythonExe, extractPitchDestinationPath);
    process.EnableRaisingEvents = true;
    process.Exited += (sender, args) =>
    {
        process.Close();
    };
}
One might argue that the Python code could have been ported to IronPython to work directly with .NET code. In fact, I tried it, but it doesn't work, because the tkSnack module depends on Tkinter, which is not available under IronPython.
Extracting Wave Form
The wave form is extracted directly from the .wav file. I used the code provided by user pj4533 in the Show Wave Form article:
public ObservableCollection<int> GetPoints(double canvasWidth, double canvasHeight)
{
    Read();
    var points = new ObservableCollection<int>();
    short val = m_Data[0];
    int prevX = 0;
    canvasHeight = CANVASHEIGHT;
    int prevY = (int)(((val + 32768) * canvasHeight) / 65536);
    for (int i = 0; i < m_Data.NumSamples; i += 16)
    {
        val = m_Data[i];
        int scaledVal = (int)(((-val - 32768) * canvasHeight) / 65536);
        points.Add(scaledVal);
        prevX = i;
        prevY = scaledVal;
        if (m_Fmt.Channels == 2)
            i++;
    }
    return points;
}
It might be noticed that both the pitch contour and the wave form are extracted only after the audio file is uploaded.
Displaying Pitch Contour
We have a Path element in the XAML whose Data property is bound to a property on the ViewModel:
<Path x:Name="pthPitchCurve" Height="100" Width="500" Stroke="#f00" StrokeThickness="2" Data="{Binding SampleVoicePitchData}"
HorizontalAlignment="Left" Stretch="None"></Path>
The Path element's Data property receives a string of path drawing commands ("M" to move, "L" to draw a line), generated by the GeneratePitchData method:
private string GeneratePitchData(ArrayOfInt pitchValues, int offset, double xAdjustFactor)
{
    var sb = new StringBuilder();
    if (pitchValues.Count() > 0)
    {
        double minPoint = pitchValues.Min();
        double maxPoint = pitchValues.Max();
        double absMaxPoint = Math.Abs(minPoint) > maxPoint ?
            Math.Abs(minPoint) : maxPoint;
        double xScale = (CANVAS_WIDTH / pitchValues.Count()) * xAdjustFactor;
        double yScale = CANVAS_HEIGHT / (maxPoint - minPoint);
        yScale = PITCHDATAYSCALE;
        var lastYValue = 0;
        var x = 0;
        foreach (var pitch in pitchValues)
        {
            var yValue = pitch;
            var y = LINEBASE - yValue;
            if (yValue > 0)
            {
                if (lastYValue == 0)
                {
                    var pointM = string.Format("M{0},{1} ", (int)(offset + x * xScale),
                        (int)(y * yScale));
                    sb.Append(pointM);
                }
                var pointL = string.Format("{0},{1} ", (int)(offset + x * xScale),
                    (int)(y * yScale));
                sb.Append(pointL);
            }
            lastYValue = yValue;
            x++;
        }
    }
    else
    {
        DispatcherTimer pitchDataTimer = new DispatcherTimer();
        pitchDataTimer.Interval = TimeSpan.FromMilliseconds(1000);
        pitchDataTimer.Tick += (s, e) =>
        {
            pitchDataTimer.Stop();
            DoGetSampleVoicePitchData(false);
        };
        pitchDataTimer.Start();
    }
    return sb.ToString();
}
As a result, the pitch contour is displayed as a red curve over the wave form.
Displaying Wave Form
The wave form is displayed in quite a similar way: we have another Path element, this time bound to the SampleVoiceWavePath property:
<Path x:Name="pthWave" Height="100" Width="500" Stroke="#aaa" Data="{Binding SampleVoiceWavePath}"
HorizontalAlignment="Left" VerticalAlignment="Center" Stretch="None"></Path>
The SampleVoiceWavePath string, in turn, is generated by the GenerateWavePath method:
private string GenerateWavePath(ArrayOfInt points)
{
    double minPoint = points.Min();
    double maxPoint = points.Max();
    double middlePoint = (maxPoint - minPoint) / 2;
    double absMaxPoint = Math.Abs(minPoint) > maxPoint ? Math.Abs(minPoint) : maxPoint;
    double xScale = CANVAS_WIDTH / points.Count();
    double yScale = CANVAS_HEIGHT / ((maxPoint - minPoint));
    var sbUserVoiceWavePath = new StringBuilder();
    var yWave = points[0];
    sbUserVoiceWavePath.AppendFormat("M{0},{1} ", 0, (int)(CANVAS_HEIGHT / 2));
    for (var xWave = 1; xWave < points.Count(); xWave++)
    {
        yWave = (int)(points[xWave]);
        var x = string.Format("{0:0.00}", xWave * xScale).Replace(",", ".");
        var y = string.Format("{0:0.00}", (yWave - minPoint) * yScale).Replace(",", ".");
        sbUserVoiceWavePath.AppendFormat("L{0},{1} ", x, y);
    }
    return sbUserVoiceWavePath.ToString();
}
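The scaling logic above can be summarized in a few lines of Python. This is my own sketch of the same mapping, not code from the project; like the C# version, it anchors the first point at the vertical middle of the canvas:

```python
def wave_path(points, canvas_width=500.0, canvas_height=100.0):
    """Map raw sample values onto canvas coordinates as an "M.../L..." path string."""
    lo, hi = min(points), max(points)
    x_scale = canvas_width / len(points)        # horizontal pixels per sample
    y_scale = canvas_height / (hi - lo)         # normalize the value range to the canvas
    parts = ["M0,%d" % int(canvas_height / 2)]  # start at the vertical middle
    for x, y in enumerate(points[1:], start=1):
        parts.append("L%.2f,%.2f" % (x * x_scale, (y - lo) * y_scale))
    return " ".join(parts)
```

The returned string can be assigned directly to a Path's Data attribute, since Silverlight's path mini-language accepts the same M/L commands.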
And this is how the resulting wave form will look:
Calculating Score
Now that we have all the data (pitch contour and wave forms from both sample voice and user voice), it's up to us to calculate the score. Assuming that the score ranges from a minimum of 0 points to the maximum of 100 points (meaning perfect pronunciation), we must define how to measure this scale.
As stated before, I have no background in audio analysis, so I invented a way of taking the pitch contour's individual segments and calculating the slope of each one: a segment can go up or down. The entire pitch contour of the sample speech then yields a sequence of slopes, for example "down-up-down-down-up-down-up", while the user's speech yields another sequence, for example "down-down-up-down-down-up-up-down". We then compare these sequences against each other and produce a score from 0 to 100 points, where 0 means no matches and 100 means all segment slopes matched. You can see which slopes are going down or up through the red and blue arrows in the image below:
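The core idea, counting how many segment slopes point in the same direction, can be illustrated with a short Python sketch. This is a deliberate simplification of my own: it ignores the segment-count penalty and the NaN filtering that the full C# method also performs:

```python
def slope_score(sample_slopes, user_slopes):
    """Percentage of segments whose up/down direction matches between the two contours."""
    pairs = list(zip(sample_slopes, user_slopes))  # compare segment by segment
    if not pairs:
        return 0
    matches = sum(1 for a, b in pairs if (a > 0) == (b > 0))
    return int(matches / len(pairs) * 100)
```

So a user whose contour goes up and down in the same places as the sample scores close to 100, regardless of the absolute pitch values (which is what lets a male voice be compared against a female sample).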
And below is the main code for calculating the grade from the pitch contour comparison:
private void GenerateGrade()
{
    var segmentSlopeScore = 0.0;
    var samplePitchValuesLength = GetLastX(this.sampleVoicePitchValues) -
        GetFirstX(this.sampleVoicePitchValues);
    var userPitchValuesLength = GetLastX(this.userVoicePitchValues) -
        GetFirstX(this.userVoicePitchValues);
    var pitchValuesLengthError = (double)Math.Abs(userPitchValuesLength -
        samplePitchValuesLength) / samplePitchValuesLength;
    var sampleSegments = GetPitchSegmentLengthList(this.sampleVoicePitchValues).Where(v => v > 0).ToList();
    var userSegments = GetPitchSegmentLengthList(this.userVoicePitchValues).Where(v => v > 0).ToList();
    var segmentIndex = 0;
    var validSegmentCount = 0;
    RemoveNaNSegments(userSlicedSlopes, userSegments);
    if (sampleSegments.Count() > userSegments.Count())
    {
        RemoveInconsistentSegments(sampleSlicedSlopes, userSlicedSlopes, sampleSegments, userSegments);
    }
    else if (userSegments.Count() > sampleSegments.Count())
    {
        RemoveInconsistentSegments(userSlicedSlopes, sampleSlicedSlopes, userSegments, sampleSegments);
    }
    foreach (var sampleSegment in sampleSegments)
    {
        if (sampleSegment > 0)
        {
            if (userSlicedSlopes.Count() > segmentIndex)
            {
                var currentSampleSlope = sampleSlicedSlopes[segmentIndex];
                var currentUserSlope = userSlicedSlopes[segmentIndex];
                if (!double.IsNaN(currentSampleSlope) && !double.IsNaN(currentUserSlope))
                {
                    if (CheckSlopes(currentSampleSlope, currentUserSlope))
                        segmentSlopeScore++;
                }
                segmentIndex++;
                validSegmentCount++;
            }
        }
    }
    var sampleSegmentCount = GetPitchSegmentLengthList(this.sampleVoicePitchValues).Count();
    var userSegmentCount = GetPitchSegmentLengthList(this.userVoicePitchValues).Count();
    var segmentCountError = (double)Math.Abs(userSegments.Count() - sampleSegments.Count())
        / sampleSegmentCount;
    Grade = (int)((segmentSlopeScore / validSegmentCount) * 100.0 * (1.0 - segmentCountError));
}
Displaying Score
As we did before with the pitch contours and wave forms, the score is displayed by binding a visual element on the XAML side to a property on the ViewModel class:
<TextBlock x:Name="txtGrade" Text="{Binding Grade}" Foreground="Green" FontSize="45" TextAlignment="Center" VerticalAlignment="Center">
</TextBlock>
The Grade property is a standard notifying property, so the TextBlock updates automatically whenever a new score is calculated:
public int Grade
{
    get
    {
        return grade;
    }
    set
    {
        grade = value;
        NotifyPropertyChanged("Grade");
    }
}
Final Considerations
I hope you have enjoyed the article and that it proves useful to you. As you can see, there is a lot of room for improvement, so if you have something to say, please leave a comment below.
History
- 2012-04-29: Initial version.