
Voice synthesis with Microsoft SAPI

25 Jun 2010
Learn the basics of voice synthesis using the Microsoft Speech API...

Introduction

Microsoft provides a great tool for both speech recognition and synthesis. It is called the Microsoft Speech API (SAPI). Here I'll introduce the various features you can use for speech synthesis with SAPI. A full sample application is available in the downloads.

Use of Code

For speech synthesis, we use the System.Speech.Synthesis namespace. The main class is SpeechSynthesizer, which runs in-process. You can set its output to the default audio device or to a WAV stream/file, and then simply call SpeakAsync() with the text to be spoken.

For customization, you can specify the voice to be used. On Vista, we only get one voice out of the box: 'Microsoft Anna'. I've seen some other demos with a voice called 'Microsoft Lili', which I believe spoke Chinese. What was really interesting about that voice is that it could also speak English, which made it sound like a native Chinese speaker speaking English ... very cool. Supposedly, you can get other voices by installing the MUI packs on Vista, but I have yet to track any of these down to try it out. XP ships with some other voices like 'Microsoft Mary' and 'Microsoft Sam'. You can also customize the synthesizer's volume and rate of speaking. For pitch, you can change a prompt's emphasis, or do this in SSML using the <prosody> tag.
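
As a quick sketch (assuming a reference to the System.Speech assembly and the System.Speech.Synthesis namespace), you can enumerate the voices installed on a machine and select one by name; the voice name below is only an example and depends on what your OS has installed:

C#
// List the voices installed on this machine, then pick one by name.
// 'Microsoft Anna' is only an example; the installed set varies by OS
// and language packs.
using (SpeechSynthesizer synth = new SpeechSynthesizer())
{
    foreach (InstalledVoice voice in synth.GetInstalledVoices())
    {
        VoiceInfo info = voice.VoiceInfo;
        Console.WriteLine("{0} ({1})", info.Name, info.Culture);
    }

    synth.SelectVoice("Microsoft Anna"); // throws if the voice isn't installed
    synth.Speak("Testing the selected voice.");
}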

Speaking, in general, is as simple as this:

C#
SpeechSynthesizer synth = new SpeechSynthesizer();
synth.SpeakAsync("Hello World.");

Volume, by default, is 50 (range 0 to 100), and rate, by default, is 0 (range -10 to +10). You can change them through the synth.Volume and synth.Rate properties, respectively. If you are working with WPF or Silverlight, you can simply bind a slider to each of them. There are two methods in the SpeechSynthesizer class to speak (both are shown in the sketch after this list):

  1. Speak(): Speaks synchronously; the call blocks the current thread until the prompt has finished.
  2. SpeakAsync(): Speaks on a background thread, so the call returns immediately. As a consequence, changes made to the volume or rate while a prompt is playing won't affect that prompt; they apply from the next prompt onwards.
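
Here is a small sketch of both methods, together with volume and rate changes; per the behavior described above, a rate change made mid-playback only affects the next queued prompt:

C#
synth.Volume = 80; // 0 to 100; default is 50
synth.Rate = 2;    // -10 to +10; default is 0

// Speak() blocks until the sentence has been spoken completely
synth.Speak("This call blocks the current thread.");

// SpeakAsync() returns immediately; the prompt is spoken in the background
synth.SpeakAsync("This call does not block.");

// A change made while a prompt is playing does not affect that prompt;
// it applies to the next prompt in the queue
synth.Rate = -2;
synth.SpeakAsync("This prompt is spoken at the slower rate.");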

One of the coolest features is that we can directly use Wave files (.wav) as an output medium for the sound. You can set the output to the default audio device, a Wave file, an audio stream, etc. Here is an example of Wave file output.

C#
synth.SetOutputToWaveFile("output.wav");
synth.Speak(textBox1.Text);
synth.SetOutputToDefaultAudioDevice();
MessageBox.Show("done");

Basically, what I've done here is: first set the output to a Wave file (it will be created if it doesn't exist), then speak the text into that file, and finally set the output back to the default device for further operations.
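
The same idea works for streams. Here is a minimal sketch, assuming System.IO is imported for MemoryStream and File, that renders the speech into memory instead of straight to disk:

C#
using (MemoryStream wavStream = new MemoryStream())
{
    synth.SetOutputToWaveStream(wavStream);
    synth.Speak(textBox1.Text);
    synth.SetOutputToDefaultAudioDevice(); // stop writing to the stream

    // wavStream now holds complete WAV data that you can save or transmit
    File.WriteAllBytes("output2.wav", wavStream.ToArray());
}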

Now, there is an XML-based language from the W3C called SSML (Speech Synthesis Markup Language) that standardizes speech delivery. With it, you can specify exactly how a piece of text should be spoken. Fortunately, SAPI supports this standard, so we can generate a prompt from SSML and deliver the speech directly.

Some useful tags of SSML:

  • audio: To take an input from some Wave file.
  • emphasis: Specifies that the enclosed text should be spoken with emphasis.
  • enumerate: An automatically generated description of the choices available to the user. It specifies a template that is applied to each choice, in the order they appear in the menu element or in a field element that contains option elements.

  • phoneme: Specifies a phonetic pronunciation for the contained text. The format of the representation is vendor-specific, and does not always use the IPA alphabet. See your vendor documentation for details.
  • prosody: Specifies prosodic information for the enclosed text such as pitch, duration, range, contour etc.
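
If you'd rather not write SSML by hand, PromptBuilder can express emphasis and prosody-like styling directly from C#. A small sketch:

C#
PromptBuilder pb = new PromptBuilder();
pb.AppendText("This part is spoken normally. ");

// Roughly the <emphasis> tag
pb.StartStyle(new PromptStyle(PromptEmphasis.Strong));
pb.AppendText("This part is emphasized. ");
pb.EndStyle();

// Roughly the rate attribute of the <prosody> tag
pb.StartStyle(new PromptStyle(PromptRate.Slow));
pb.AppendText("And this part is spoken slowly.");
pb.EndStyle();

synth.SpeakAsync(pb);

Note that PromptStyle covers emphasis, rate, and volume, but not pitch; for pitch you still need the <prosody> tag in SSML.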

You can speak out from an SSML file using PromptBuilder, like this:

C#
PromptBuilder pb = new PromptBuilder();
pb.AppendText("Hello..");
try
{
    // AppendSsml reads the given SSML file and merges it into the prompt
    pb.AppendSsml("SSML.xml");
}
catch (Exception exc)
{
    MessageBox.Show(exc.Message);
}
synth.SpeakAsync(pb);

Another interesting feature of this API is that we can get an XML output of whatever we speak. The reason I say this is important is that SSML can work as an intermediate language between programs written on any platform. For example, you speak something on an ASP.NET website, generate XML from it, and pass it to a Web Service; the service can then return the same file to a Java-based client. This way, you can achieve good interoperability between two programs. Here's how to work with it:

C#
PromptBuilder myPrompt = new PromptBuilder();
myPrompt.AppendText(textBox1.Text);
MessageBox.Show(myPrompt.ToXml());

In the End

Well, in the next one, I'll write about speech recognition.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

