Silverlight Pronunciation Test





How to create a pronunciation test tool using Silverlight and Python
Table of Contents
- Introduction
- System Requirements
- User Interface
- Playing Sample Voice
- Recording The User's Voice
- Uploading User's Voice File To Server
- Extracting Pitch Contour
- Extracting Wave Form
- Displaying Pitch Contour
- Displaying Wave Form
- Calculating Score
- Displaying Score
- Final Considerations
- History
Introduction
The reason for this article and the accompanying application is an idea for an automated pronunciation test that I had been flirting with for a few months. Because of the difficulties I found, and some frustration, I thought about giving it up three or four times, but in the end the inner "never give up" voice had the upper hand and eventually won. Fortunately, I ended up with a solution which, I admit, is not perfect, and one that strayed far from my initial "development track". But that is how it often goes when you run into difficulties: you embrace whatever tools work for you.
The first problem was finding code or a component to generate the so-called "pitch contour" for the analysis. The pitch contour is the melody that follows the human voice; more technically, the fluctuation in frequency that accompanies it. I searched the internet hard for that magical "open source .NET code" that included a pitch contour calculation, but with no success. I found some open source solutions, but sadly they are not written in .NET (mostly C++ or Python). Sadly, too, I'm no expert in C++ or Python, and the code is too large to be ported. Nor am I an expert in the mathematical algorithms (such as the Fast Fourier Transform) that would be needed to create a new library from scratch. So I ended up with a "collaboration" between a server-side C# program and a console Python application. Not particularly pretty, since I initially planned an all-client-side, managed-code solution, but it works, and that's what matters. I hope some code hero in the .NET community comes up with a more elegant solution for that.
The second problem was comparing the user's voice against the predefined exercise voice and producing a score. How can I compare the two pitch contours? I had no tools for such a task, so I had to come up with a new one. It took many hours of work and it's still not perfect, but it's the only one I have so far. As with the previous problem, a code hero would save the day here.
System Requirements
Make sure you follow these 3 steps:
1. The following software is needed to run the Pronunciation Test provided with this article:
- Visual Studio 2010 or Visual C# Web Developer
2. Also, you must download the Python 2.2 software and make sure it is installed in the C:\Python22 folder. This is necessary because the source code only works with the application stored in the C:\Python22 folder. Since the app was built with Python 2.2 only, I can't tell whether it will work with other versions of Python.
3. Finally, you must download the snack2.2.zip file and copy its contents to the C:\Python22\tcl folder. Without this folder, the application will not work.
User Interface
The user interface is 100% Silverlight. It's clean and, I must admit, somewhat inspired by the Metro design language. The buttons perform very basic functions: moving to the previous and next exercises, playing the sample voice and the user's voice, and recording the user's voice.
As with many XAML projects, this one makes use of MVVM (Model-View-ViewModel) pattern.
In short, there is no code-behind for the buttons' click events, nor any "object.property = new value" instruction (actually, there are a couple of event handlers, but only where using MVVM appeared to be impossible). Instead, the buttons use MVVM-style command bindings.
Playing Sample Voice
For this application, I included 2 speeches taken from the Free Sound website, so there is no copyright problem with those voices. I included just 2 sample files, in order to enable the next/previous functionality while keeping the .zip source code as small as possible. Those files are sample01.wav and sample02.wav, located in the PitchContour.Web\Files folder.
You can change or add further sample audio files if you want, but be warned that there are some conditions that must be met:
- Files must have .wav extension.
- Files must be mono.
These requirements are imposed by the tools I've chosen for the app. If you are interested in adding files that do not meet these conditions, or even in recording your own voice, then you might want to install Audacity, an excellent free tool for recording and editing audio.
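If you would rather convert a stereo file to mono programmatically instead of using Audacity, something like the following Python sketch will do for 16-bit PCM files. Note that this helper (including its name) is my own illustration, not part of the article's source code:

```python
import array
import wave

def stereo_to_mono(src_path, dst_path):
    """Average the left/right channels of a 16-bit stereo WAV into a mono WAV."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(params.nframes)
    if params.nchannels == 1:
        samples = array.array("h", frames)  # already mono, copy as-is
    else:
        stereo = array.array("h", frames)   # interleaved L, R, L, R, ...
        samples = array.array(
            "h", ((stereo[i] + stereo[i + 1]) // 2 for i in range(0, len(stereo), 2))
        )
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(params.sampwidth)
        dst.setframerate(params.framerate)
        dst.writeframes(samples.tobytes())
```

The resulting file keeps the original sample rate and bit depth, so it still satisfies the .wav requirement above.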
But how do we actually play the Sample Voice in our application? First, we have a standard MediaElement declared in the XAML:
<MediaElement x:Name="sampleVoiceMediaElement" Width="450" Height="250" Stretch="Fill" AutoPlay="True"
              Position="{Binding SampleVoiceMediaPosition, Mode=TwoWay}" MediaOpened="sampleVoiceMediaElement_MediaOpened"/>
In the above snippet, notice that the MediaElement's Position property is bound (two-way) to the SampleVoiceMediaPosition property of the ViewModel, and that the MediaOpened event is handled in code-behind.
Let's take a look at that MediaOpened handler, which stores the media's natural duration in the ViewModel:
private void sampleVoiceMediaElement_MediaOpened(object sender, RoutedEventArgs e)
{
    viewModel.SampleVoiceDuration = this.sampleVoiceMediaElement.NaturalDuration;
}
Now that the duration is known, we are able to calculate the percentage of the media that has already been played. This happens in the setter of the SampleVoiceMediaPosition property:
public TimeSpan SampleVoiceMediaPosition
{
    get
    {
        return sampleVoiceMediaPosition;
    }
    set
    {
        sampleVoiceMediaPosition = value;
        NotifyPropertyChanged("SampleVoiceMediaPosition");
        if (sampleVoiceDuration.HasTimeSpan)
        {
            if (sampleVoiceDuration.TimeSpan.TotalMilliseconds > 0)
            {
                var x = (double)(value.TotalMilliseconds / sampleVoiceDuration.TimeSpan.TotalMilliseconds)
                    * CANVAS_WIDTH;
                SampleVoiceMediaBorderWidth = x;
            }
        }
    }
}
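The proportion computed in the setter is just "played fraction times canvas width". A plain Python sketch of that arithmetic (the function name and the 500-pixel default are mine, for illustration only):

```python
def cursor_width(position_ms, duration_ms, canvas_width=500.0):
    """Width of the progress cursor: the fraction already played, scaled to the canvas."""
    if duration_ms <= 0:
        return 0.0  # guard against division by zero, like the C# setter does
    return position_ms / duration_ms * canvas_width
```

For example, halfway through a 3-second clip the cursor covers half of a 500-pixel canvas.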
Now that the cursor width has been calculated, it is stored in the SampleVoiceMediaBorderWidth property, which notifies the view whenever it changes:
public double SampleVoiceMediaBorderWidth
{
    get
    {
        return sampleVoiceMediaBorderWidth;
    }
    set
    {
        sampleVoiceMediaBorderWidth = value;
        NotifyPropertyChanged("SampleVoiceMediaBorderWidth");
    }
}
<Border x:Name="brdSampleVoiceCursor" BorderBrush="DarkGreen"
        BorderThickness="1" Height="100" Width="{Binding SampleVoiceMediaBorderWidth, Mode=TwoWay}"
        HorizontalAlignment="Left" VerticalAlignment="Center">
    <Border.Background>
        <LinearGradientBrush StartPoint="0,0" EndPoint="0,1">
            <GradientStop Offset="0" Color="#fff"/>
            <GradientStop Offset="0.5" Color="#8f8"/>
            <GradientStop Offset="1" Color="#8f8"/>
        </LinearGradientBrush>
    </Border.Background>
</Border>
In short: as the sample voice plays, the Position binding keeps SampleVoiceMediaPosition up to date, which in turn resizes the green Border above, producing a progress-cursor effect over the wave form.
Recording The User's Voice
There are some solutions for audio recording with Silverlight on the web. I particularly liked the one proposed on Ondrej Svacina's blog. I must say that for the audio recording part I simply copied his code, but in the end there are some noticeable differences between our interfaces:
- Ondrej's code allows for downloading the audio file locally (my code just uploads it to the server).
- He included a pair of buttons to start and stop the recorder. Mine has a single on/off recording button.
- His interface shows an analog counter to track the recorder progress (mine shows none).
You just need to click the recorder button to start recording your voice, and then click it once again to stop recording:
Uploading User's Voice File To Server
Once the voice is recorded, the application starts uploading it to the server. For this functionality I initially had no code of my own, so I had to resort to someone who had already done it. That's why I chose Michael Washington's great Silverlight Simple Drag And Drop / Or Browse View Model / MVVM File Upload Control article. Although Michael's article had a very different goal from mine, it fortunately provided the Silverlight and web server plumbing needed to implement the voice upload functionality.
Extracting Pitch Contour
As I stated at the beginning of the article, I unfortunately didn't manage to find or write managed code for extracting the pitch contour from the .wav voice file. Nevertheless, I came up with a solution using Snack and a small Python program invoked via the command line. In their own words:
"The Snack Sound Toolkit is designed to be used with a scripting language such as Tcl/Tk or Python. Using Snack you can create powerful multi-platform audio applications with just a few lines of code. Snack has commands for basic sound handling, such as playback, recording, file and socket I/O. Snack also provides primitives for sound visualization, e.g. waveforms and spectrograms. It was developed mainly to handle digital recordings of speech, but is just as useful for general audio. Snack has also successfully been applied to other one-dimensional signals. The combination of Snack and a scripting language makes it possible to create sound tools and applications with a minimum of effort. This is due to the rapid development nature of scripting languages. As a bonus you get an application that is cross-platform from start. It is also easy to integrate Snack based applications with existing sound analysis software."
The Pitch Contour Extraction is done by a script written in Python:
from Tkinter import *
import tkSnack
import pickle

class Speech:
    def Analyze(self, inputFile, outputFile):
        # Snack needs a Tk root window to initialize
        root = Tk()
        tkSnack.initializeSnack(root)
        # Load the .wav file and compute its pitch values
        mySound = tkSnack.Sound(load=inputFile)
        data = mySound.pitch()
        # Serialize (pickle) the pitch list to the destination file
        f = open(outputFile, "w")
        pickle.dump(data, f)
        f.close()

speech = Speech()
speech.Analyze('{source}', "{destination-pitch}")
The above script is quite simple: first, it imports the required libraries (Tkinter, tkSnack, pickle). Then a new instance of the Snack Sound class is created from the input file, the pitch data is computed by the pitch() method, and the result is serialized (pickled) to the destination file.
This destination file will contain a list of values representing the pitch variation, that is, the variation in frequency. As expected, male voices will have lower average values than female voices. These values are later read by the application and displayed over the wave form. This is how the resulting .txt pitch file looks (each value is preceded by an 'F' character):
(F0.0
F0.0
F0.0
F0.0
F0.0
F0.0
F0.0
F0.0
F0.0
F216.0
F214.0
F212.0
F213.0
F212.0
F210.0
F204.0
F206.0
F202.0
F196.0
F190.0
F178.0
F160.0
F0.0
F0.0
F0.0
F0.0
F0.0
F0.0
F0.0
F222.0
.
.
.
F0.0
tp0
.
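The 'F' prefixes and the trailing 'tp0' marker are artifacts of Python's pickle protocol 0, which stores a list of floats as plain ASCII lines. If you ever need to read such a file back on the Python side, a plain pickle.load is enough (a sketch of mine, not part of the project's code):

```python
import pickle

def load_pitch(path):
    """Read back the pitch list written by the Snack extraction script."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

On the C# side, of course, the application parses the 'F'-prefixed text lines directly rather than going through pickle.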
But as mentioned before, the Python code is not called directly by the .NET application. Instead, we instantiate a Process on the server side that launches the Python interpreter with the generated script:
public void GeneratePitchFile()
{
    var pythonFolder = ConfigurationManager.AppSettings["PythonFolder"];
    var extractPitchProgram = ConfigurationManager.AppSettings["ExtractPitchProgram"];
    var pythonExe = System.IO.Path.Combine(pythonFolder, "python.exe");
    var extractPitchDestinationPath = System.IO.Path.Combine(pythonFolder,
        string.Format(@"lib\{0}", extractPitchProgram));
    var pitchResultPath = filePath.Replace(".wav", ".txt");
    var waveResultPath = filePath.Replace(".wav", "-wave.txt");
    using (var sr = new StreamReader(Path.Combine(appFolder, "ExtractPitch.py")))
    {
        using (var sw = new StreamWriter(extractPitchDestinationPath, false))
        {
            var fileString = sr.ReadToEnd()
                .Replace("{source}", filePath.Replace(@"\", @"\\"))
                .Replace("{destination-pitch}", pitchResultPath.Replace(@"\", @"\\"))
                .Replace("{destination-wave}", waveResultPath.Replace(@"\", @"\\"));
            sw.Write(fileString);
        }
    }
    Process process = Process.Start(pythonExe, extractPitchDestinationPath);
    process.EnableRaisingEvents = true;
    process.Exited += (sender, args) =>
    {
        process.Close();
    };
}
One might argue that the Python code could have been ported to IronPython to work directly with .NET code. In fact, I tried it, but it doesn't work, because the tkSnack module depends on Tkinter, which is not available under IronPython.
Extracting Wave Form
The wave form is extracted directly from the .wav file. I used the code provided by user pj4533 in the Show Wave Form article:
public ObservableCollection<int> GetPoints(double canvasWidth, double canvasHeight)
{
    Read();
    var points = new ObservableCollection<int>();
    short val = m_Data[0];
    int prevX = 0;
    canvasHeight = CANVASHEIGHT;
    int prevY = (int)(((val + 32768) * canvasHeight) / 65536);
    for (int i = 0; i < m_Data.NumSamples; i += 16)
    {
        val = m_Data[i];
        int scaledVal = (int)(((-val - 32768) * canvasHeight) / 65536);
        points.Add(scaledVal);
        prevX = i;
        prevY = scaledVal;
        if (m_Fmt.Channels == 2)
            i++;
    }
    return points;
}
It might be noticed that both the pitch contour and the wave form are extracted only after the audio file is uploaded.
Displaying Pitch Contour
We have a Path element in the XAML whose Data property is bound to a property on the ViewModel:
<Path x:Name="pthPitchCurve" Height="100" Width="500" Stroke="#f00" StrokeThickness="2" Data="{Binding SampleVoicePitchData}"
HorizontalAlignment="Left" Stretch="None"></Path>
The Path element's Data property receives a string of path drawing commands ("M" to move, "L" to draw a line), generated by the GeneratePitchData method:
private string GeneratePitchData(ArrayOfInt pitchValues, int offset, double xAdjustFactor)
{
    var sb = new StringBuilder();
    if (pitchValues.Count() > 0)
    {
        double minPoint = pitchValues.Min();
        double maxPoint = pitchValues.Max();
        double absMaxPoint = Math.Abs(minPoint) > maxPoint ?
            Math.Abs(minPoint) : maxPoint;
        double xScale = (CANVAS_WIDTH / pitchValues.Count()) * xAdjustFactor;
        double yScale = CANVAS_HEIGHT / (maxPoint - minPoint);
        yScale = PITCHDATAYSCALE;
        var lastYValue = 0;
        var x = 0;
        foreach (var pitch in pitchValues)
        {
            var yValue = pitch;
            var y = LINEBASE - yValue;
            if (yValue > 0)
            {
                if (lastYValue == 0)
                {
                    var pointM = string.Format("M{0},{1} ", (int)(offset + x * xScale),
                        (int)(y * yScale));
                    sb.Append(pointM);
                }
                var pointL = string.Format("{0},{1} ", (int)(offset + x * xScale),
                    (int)(y * yScale));
                sb.Append(pointL);
            }
            lastYValue = yValue;
            x++;
        }
    }
    else
    {
        DispatcherTimer pitchDataTimer = new DispatcherTimer();
        pitchDataTimer.Interval = TimeSpan.FromMilliseconds(1000);
        pitchDataTimer.Tick += (s, e) =>
        {
            pitchDataTimer.Stop();
            DoGetSampleVoicePitchData(false);
        };
        pitchDataTimer.Start();
    }
    return sb.ToString();
}
As a result, the pitch contour is displayed as a red curve over the wave form.
Displaying Wave Form
The wave form is displayed in quite a similar way: we have another Path element, this time bound to the SampleVoiceWavePath property:
<Path x:Name="pthWave" Height="100" Width="500" Stroke="#aaa" Data="{Binding SampleVoiceWavePath}"
HorizontalAlignment="Left" VerticalAlignment="Center" Stretch="None"></Path>
The SampleVoiceWavePath string, in turn, is generated by the GenerateWavePath method:
private string GenerateWavePath(ArrayOfInt points)
{
    double minPoint = points.Min();
    double maxPoint = points.Max();
    double middlePoint = (maxPoint - minPoint) / 2;
    double absMaxPoint = Math.Abs(minPoint) > maxPoint ? Math.Abs(minPoint) : maxPoint;
    double xScale = CANVAS_WIDTH / points.Count();
    double yScale = CANVAS_HEIGHT / ((maxPoint - minPoint));
    var sbUserVoiceWavePath = new StringBuilder();
    var yWave = points[0];
    sbUserVoiceWavePath.AppendFormat("M{0},{1} ", 0, (int)(CANVAS_HEIGHT / 2));
    for (var xWave = 1; xWave < points.Count(); xWave++)
    {
        yWave = (int)(points[xWave]);
        var x = string.Format("{0:0.00}", xWave * xScale).Replace(",", ".");
        var y = string.Format("{0:0.00}", (yWave - minPoint) * yScale).Replace(",", ".");
        sbUserVoiceWavePath.AppendFormat("L{0},{1} ", x, y);
    }
    return sbUserVoiceWavePath.ToString();
}
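The scaling logic above can be summarized in a few lines of Python. This is my own sketch of the same mapping, not code from the project; like the C# version, it anchors the first point at the vertical middle of the canvas:

```python
def wave_path(points, canvas_width=500.0, canvas_height=100.0):
    """Map raw sample values onto canvas coordinates as an "M.../L..." path string."""
    lo, hi = min(points), max(points)
    x_scale = canvas_width / len(points)        # horizontal pixels per sample
    y_scale = canvas_height / (hi - lo)         # normalize the value range to the canvas
    parts = ["M0,%d" % int(canvas_height / 2)]  # start at the vertical middle
    for x, y in enumerate(points[1:], start=1):
        parts.append("L%.2f,%.2f" % (x * x_scale, (y - lo) * y_scale))
    return " ".join(parts)
```

The returned string can be assigned directly to a Path's Data attribute, since Silverlight's path mini-language accepts the same M/L commands.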
And this is how the resulting wave form will look:
Calculating Score
Now that we have all the data (pitch contour and wave forms from both sample voice and user voice), it's up to us to calculate the score. Assuming that the score ranges from a minimum of 0 points to the maximum of 100 points (meaning perfect pronunciation), we must define how to measure this scale.
As stated before, I have no background in audio analysis, so I invented a way of taking the pitch contour's individual segments and calculating the slope of each one: a segment can go up or down. The entire pitch contour of the sample speech then yields a sequence of slopes, for example "down-up-down-down-up-down-up", while the user's speech yields another sequence, for example "down-down-up-down-down-up-up-down". We then compare these sequences against each other and produce a score from 0 to 100 points, where 0 means no matches and 100 means all segment slopes matched. You can see which slopes are going down or up through the red and blue arrows in the image below:
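The core idea, counting how many segment slopes point in the same direction, can be illustrated with a short Python sketch. This is a deliberate simplification of my own: it ignores the segment-count penalty and the NaN filtering that the full C# method also performs:

```python
def slope_score(sample_slopes, user_slopes):
    """Percentage of segments whose up/down direction matches between the two contours."""
    pairs = list(zip(sample_slopes, user_slopes))  # compare segment by segment
    if not pairs:
        return 0
    matches = sum(1 for a, b in pairs if (a > 0) == (b > 0))
    return int(matches / len(pairs) * 100)
```

So a user whose contour goes up and down in the same places as the sample scores close to 100, regardless of the absolute pitch values (which is what lets a male voice be compared against a female sample).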
And below is the main code for calculating the grade from the pitch contour comparison:
private void GenerateGrade()
{
    var segmentSlopeScore = 0.0;
    var samplePitchValuesLength = GetLastX(this.sampleVoicePitchValues) -
        GetFirstX(this.sampleVoicePitchValues);
    var userPitchValuesLength = GetLastX(this.userVoicePitchValues) -
        GetFirstX(this.userVoicePitchValues);
    var pitchValuesLengthError = (double)Math.Abs(userPitchValuesLength -
        samplePitchValuesLength) / samplePitchValuesLength;
    var sampleSegments = GetPitchSegmentLengthList(this.sampleVoicePitchValues).Where(v => v > 0).ToList();
    var userSegments = GetPitchSegmentLengthList(this.userVoicePitchValues).Where(v => v > 0).ToList();
    var segmentIndex = 0;
    var validSegmentCount = 0;
    RemoveNaNSegments(userSlicedSlopes, userSegments);
    if (sampleSegments.Count() > userSegments.Count())
    {
        RemoveInconsistentSegments(sampleSlicedSlopes, userSlicedSlopes, sampleSegments, userSegments);
    }
    else if (userSegments.Count() > sampleSegments.Count())
    {
        RemoveInconsistentSegments(userSlicedSlopes, sampleSlicedSlopes, userSegments, sampleSegments);
    }
    foreach (var sampleSegment in sampleSegments)
    {
        if (sampleSegment > 0)
        {
            if (userSlicedSlopes.Count() > segmentIndex)
            {
                var currentSampleSlope = sampleSlicedSlopes[segmentIndex];
                var currentUserSlope = userSlicedSlopes[segmentIndex];
                if (!double.IsNaN(currentSampleSlope) && !double.IsNaN(currentUserSlope))
                {
                    if (CheckSlopes(currentSampleSlope, currentUserSlope))
                        segmentSlopeScore++;
                }
                segmentIndex++;
                validSegmentCount++;
            }
        }
    }
    var sampleSegmentCount = GetPitchSegmentLengthList(this.sampleVoicePitchValues).Count();
    var userSegmentCount = GetPitchSegmentLengthList(this.userVoicePitchValues).Count();
    var segmentCountError = (double)Math.Abs(userSegments.Count() - sampleSegments.Count())
        / sampleSegmentCount;
    Grade = (int)((segmentSlopeScore / validSegmentCount) * 100.0 * (1.0 - segmentCountError));
}
Displaying Score
As we did before with the pitch contours and wave forms, the score is displayed by binding a visual element on the XAML side to a property on the ViewModel class:
<TextBlock x:Name="txtGrade" Text="{Binding Grade}" Foreground="Green" FontSize="45" TextAlignment="Center" VerticalAlignment="Center">
</TextBlock>
The Grade property is a standard notifying property, so the TextBlock updates automatically whenever a new score is calculated:
public int Grade
{
    get
    {
        return grade;
    }
    set
    {
        grade = value;
        NotifyPropertyChanged("Grade");
    }
}
Final Considerations
I hope you have enjoyed the article and that it proves useful to you. As you can see, there is a lot of room for improvement, so if you have something to say, please leave a comment below.
History
- 2012-04-29: Initial version.