Voice Recognition and Synthesis Using the Intel® Perceptual Computing SDK

18 Nov 2013

Intel® Developer Zone offers tools and how-to information for cross-platform app development, platform and technology information, code samples, and peer expertise to help developers innovate and succeed. Join our communities for the Internet of Things, Android, Intel® RealSense™ Technology and Windows to download tools, access dev kits, share ideas with like-minded developers, and participate in hackathons, contests, roadshows, and local events.

Related articles:
Developer's Guide for Intel® Processor Graphics for 4th Generation Intel® Core™ Processors
Touch and Sensors
How to Create a Usable Touch UI
How to Adjust Controls for Touch

Abstract: Ever since touch devices became popular, there has been a sense that we can improve on the methods that we use to interact with our technology. Voice recognition and voice synthesis are going to play a large part in the way we interact with systems in the future. The technological future we have been envisioning for the last few decades is now, finally, achievable. The Intel® Perceptual Computing SDK is a set of tools designed to help us achieve it.

Why are voice recognition and voice synthesis so important?

Imagine a world without sound, a world where you couldn't tell someone what you wanted; a world, if you like, where your boss couldn't give you feedback and tell you how great a job you are doing. Sounds pretty grim, doesn't it? Surprisingly though, we've been happy enough to interact with our computers in this very way. But what if there was a better way? A way where the computer provides instant vocal feedback. A way where your voice controls the computer. What if you were freed of the need to actually use a keyboard or screen, and yet you still have meaningful control of your applications?

Who can forget that scene in Star Trek where Scotty picked up the mouse and spoke into it? Oh, how we laughed at the idea of being able to really tell a computer what to do. Why, you'll always have to use a mouse and keyboard, won't you?

Recently, I had the opportunity to work with the Intel Perceptual Computing SDK to see what it can do. Part of the SDK covers speech recognition and speech synthesis, and I will cover what I discovered about this as we progress through this article. Along the way, we'll write some code and talk about things such as accents, context, free form text, and dictionaries. Oh, we'll have a wonderful time, and I hope that you'll want to incorporate speech into your applications.

A particular area of interest to me is how we can build more accessible applications. I'm not just talking about making applications compliant with various disability legislations, but how we can make applications work in environments where touching a screen or a mouse/keyboard isn’t possible or practical. For example, if you're baking in the kitchen and your hands are covered in flour, you don't want to be touching your screen. However, with perceptual computing, you could be trying out a new recipe, following a top chef as they demonstrate it, and advance to the next stage of the baking process simply by using voice commands.

Why what you say and how you say it matters

With the technology that's available now, voice recognition is becoming more and more straightforward. So, I can't see any reason for not using it right now, can you? Well, there may be one or two reasons, which I'll cover now. When I was writing my first application with speech recognition in it, the code was easy to write. As you'll soon see, it was very simple. The problem came when I actually ran the application and tried to test it. Living in the North East of England, my accent is quite broad, and when I spoke, the recognition modules had a tough time decoding what I was saying. This was because the API, at the time, was geared towards a nice, neutral American accent. Fortunately, the team at Intel is really on the ball, and new language packs are rolling out to ease the recognition of other accents.

You, on the other hand, have a neutral American accent and you're raring to go. "Is there anything else I need to know?" I hear you ask. Well, yes there is. Speech recognition isn't that hard to code now, but getting your app to understand context is. What do I mean by context? Real speech recognition takes you beyond just using a dictionary of a few words. It means that your application really needs to be able to "understand" what you meant when you said "Open the file menu," because you might also say "Click the file option" or "Select the first menu item." This is beyond the scope of this article, but if you are interested in integrating this level of ability into your applications, I would suggest that you spend time researching Natural Language Processing.

One final note before we get to the code: I find that a high-quality microphone is a great help when working with speech recognition.

A basic voice recognition sample

By this point, you may be wondering if I'm going to show you any code. Well, you need wonder no more—let’s take a look at a basic C# example. This is possibly the simplest code you've ever seen, but it is a great illustration of how much work Intel has put into the SDK, and how much it has done to help developers get started with the tools. To keep things simple, this is going to be a Windows* console application, and we are just going to write whatever the SDK detects to the console window.

Once we have created our console application (I've called mine SpeechRecognition.Sample1), we are going to create a class that inherits from UtilMPipeline. This class, provided as part of the SDK, removes the boilerplate code that we would otherwise need to write (more on this later).

Note: The first few samples are provided in both C# and C++. We will see that the two code bases are virtually identical, so we provide the other examples in C# only.

C# code

using System;

namespace SpeechRecognition.Sample1
{
    public class SpeechPipeline : UtilMPipeline
    {
        public SpeechPipeline()
        {
            // Switch on the voice recognition module, then start the SDK's detection loop.
            EnableVoiceRecognition();
            this.LoopFrames();
        }

        public override void OnRecognized(ref PXCMVoiceRecognition.Recognition data)
        {
            // In dictation mode, the recognized text arrives in data.dictation.
            Console.WriteLine(data.dictation);
            base.OnRecognized(ref data);
        }
    }
}

Most of this code is fairly self-explanatory. In the constructor, we enable voice recognition; cunningly enough, the method to do this is called EnableVoiceRecognition. Then we use the LoopFrames method to tell the SDK to loop through its detection cycle. This effectively puts the application into a big loop. We override the OnRecognized method so that we can write out the words that are recognized. Notice that we are writing out data.dictation. As we are going to be running the recognition in dictation mode, we access this to get at what was said. When we cover command mode, we'll see what else is available to us to work out what was said.

When we run the application, it's apparent that the SDK waits for natural language breaks before it writes anything out through OnRecognized.

Now, running this code couldn't be simpler. Simply instantiate the class and you can talk to your computer.
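
Putting that into a complete console application looks something like the following. This is just a minimal sketch of a possible Program entry point, assuming the SpeechPipeline class above and that the SDK's C# assembly is referenced by the project:

C# code

using System;

namespace SpeechRecognition.Sample1
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Start talking; recognized text will appear below.");

            // Constructing the pipeline enables recognition and starts LoopFrames,
            // so this call blocks until the application is closed.
            new SpeechPipeline();
        }
    }
}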

C++ code

#include <iostream>
#include "util_pipeline.h"

class CppMPipeline : public UtilPipeline
{
public:
  CppMPipeline() 
  {
    EnableVoiceRecognition();
    LoopFrames();
  }

  virtual void PXCAPI OnRecognized(PXCVoiceRecognition::Recognition *data)
  {
    // In dictation mode, the recognized text arrives in data->dictation.
    std::wcout << data->dictation << std::endl;
  }
};

As you can see, this is virtually identical to the C# implementation. Getting started with speech recognition really is that straightforward.

Making our own speech pipeline

One of the really surprising things is how little you actually have to do. There's a lot taken care of, behind the scenes, for us. It's worthwhile, at this point, to actually look at what's going on. I'm only going to cover this in C# code here, as the theory is exactly the same for the C++ version. We're going to cover this now because we're going to need this infrastructure later on. So, let's start off by creating our class. This time, it's not going to inherit from UtilMPipeline. (Note that this isn't going to be the exact class that Intel provides; we're going to make it more convenient for our purposes, but it will provide the same hook points.)

C# code

public class SpeechPipeline : IDisposable
{
  public void Dispose()
  {
  }
}

We make our class disposable because we have a few unmanaged resources that we need to clean up when we finish with it.
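
Because the class implements IDisposable, callers can wrap it in a using block to make sure those resources are released. Here's a sketch of how the finished class will eventually be consumed (the methods referenced are the ones we add over the following sections):

C# code

using (var pipeline = new SpeechPipeline())
{
    pipeline.EnableVoiceRecognition();  // added below
    pipeline.LoopFrames();              // added below
}   // Dispose runs here, releasing the session and capture resources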

Next, we're going to add a constructor to this class. In this constructor, we are going to create a session that we will maintain for the lifetime of this class instance.

C# code

PXCMSession session;
public SpeechPipeline()
{
  PXCMSession.CreateInstance(out session);
}

Here we see a very common pattern in the SDK. CreateInstance actually returns a pxcmStatus value that tells us whether or not the call worked, and the populated instance comes back through an out parameter. An important thing to note here is that the C# version always uses PXCMSession, but the C++ version can return PXCSession or PXCMSession depending on whether we are using it as single-threaded or multi-threaded. Please read the SDK documentation on this because understanding it is absolutely vital if you're writing in C++. The C# version, on the other hand, supports multi-threading out of the box.
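
For brevity, the samples in this article ignore that returned status, but in real code you would want to check it. Here's a minimal sketch; the error handling is my own, and it assumes the pxcmStatus values used later in this article:

C# code

pxcmStatus status = PXCMSession.CreateInstance(out session);
if (status < pxcmStatus.PXCM_STATUS_NO_ERROR)
{
    // PXCM_STATUS_NO_ERROR is the success value; anything below it indicates a failure.
    throw new InvalidOperationException("Unable to create a PXCMSession: " + status);
}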

The eagle-eyed reader may notice that the session is actually a private member. This is because we are going to use it elsewhere in the class, and as it's disposable itself, we clean it up in our Dispose method.

C# code

public void Dispose()
{
  if (session != null)
  {
    session.Dispose();
  }
}

As we saw, in our original classes, we have an EnableVoiceRecognition method. Now that we have a session, let's create our own EnableVoiceRecognition method.

C# code

UtilMCapture capture;
PXCMVoiceRecognition voiceRecognition;

public void EnableVoiceRecognition()
{
    PXCMVoiceRecognition.ProfileInfo pinfo;

    // Create the voice recognition module and read its default profile.
    session.CreateImpl<PXCMVoiceRecognition>(PXCMVoiceRecognition.CUID, out voiceRecognition);
    voiceRecognition.QueryProfile(0, out pinfo);

    // Locate the audio input stream(s) the profile needs.
    capture = new UtilMCapture(session);
    capture.LocateStreams(ref pinfo.inputs);

    // Apply the profile and hook our handler up to the recognition event.
    voiceRecognition.SetProfile(ref pinfo);
    voiceRecognition.SubscribeRecognition(0, OnRecognized);
}

Again, we are going to create a member variable (voiceRecognition) that will be used to actually control the voice recognition. To initialize it, we call session.CreateImpl, using the PXCMVoiceRecognition type as the generic type. Once we have this, we call QueryProfile to access the parameters that can be used to configure the voice recognition. The next couple of lines instantiate one of the more interesting parts of the system: UtilMCapture, which allows us to pull together multiple streams of input, such as the audio or video stream, in one easy and consistent manner. Finally, we set up the profile that we are going to use for the voice recognition and subscribe to the voice recognition event. Our OnRecognized method looks like this now:

C# code

public void OnRecognized(ref PXCMVoiceRecognition.Recognition data)
{
    Console.WriteLine(data.dictation);
}

One thing to be aware of is that the dictation property is actually a Unicode string. While this doesn't have much of a practical effect in our C# code, it is something that we have to be aware of when we are using it in C++.

Now, both our capture and voiceRecognition members are disposable, so we'll add them to our Dispose method like this:

C# code

if (voiceRecognition != null)
{
    voiceRecognition.ProcessAudioEOS();
    voiceRecognition.Dispose();
}
if (capture != null)
{
    capture.Dispose();
}

There's an unfamiliar-looking method in there: ProcessAudioEOS. It tells the SDK that the audio stream has come to a stop and that it should process any audio it has buffered internally but not yet cleared. This helps to ensure that we don't leave things in an unstable state.

We've come a long way here, and if we look back, we see that there's just one thing left for us to hook in, LoopFrames:

C# code

public void LoopFrames()
{
    while (true)
    {
        PXCMAudio sample = null;
        PXCMScheduler.SyncPoint[] syncPoint = new PXCMScheduler.SyncPoint[2];
        try
        {
            // Read the next chunk of audio and hand it to the recognizer; both calls are asynchronous.
            capture.ReadStreamAsync(out sample, out syncPoint[0]);
            voiceRecognition.ProcessAudioAsync(sample, out syncPoint[1]);

            // Wait for both asynchronous operations to complete.
            PXCMScheduler.SyncPoint.SynchronizeEx(syncPoint);
        }
        finally
        {
            if (sample != null)
                sample.Dispose();
            if (syncPoint != null)
                PXCMScheduler.SyncPoint.Dispose(syncPoint);
        }
    }
}

In the loop, we simply read from the asynchronous audio stream and process the voice recognition audio stream. We aren't going to get a recognizable word or sentence every run through this loop, so we are letting the SDK build up the audio stream for analysis here. The scheduler then effectively marshals things back together via a synchronization point. Again, we are going to be good citizens and dispose of resources when we don't need them.

For those who are keen to know what this all looks like in C++, here it is:

C++ code

#include "stdafx.h"
#include <iostream>
#include <string>
#include "util_pipeline.h"

class MyHandler : public PXCVoiceRecognition::Recognition::Handler 
{
public:
  virtual void PXCAPI OnRecognized(PXCVoiceRecognition::Recognition *data)
  {
    std::wcout << data->dictation << std::endl;
  }
};

class CppPipeline
{
public:
  CppPipeline() 
  {
    PXCSession_Create(&session);

    EnableVoiceRecognition();
    LoopFrames();
  }

  void EnableVoiceRecognition()
  {
    PXCVoiceRecognition::ProfileInfo pinfo;
    session->CreateImpl<PXCVoiceRecognition>(&voiceRecognition);
    voiceRecognition->QueryProfile(0, &pinfo);
    capture = new UtilCapture(session);
    capture->LocateStreams(&pinfo.inputs);
    voiceRecognition->SetProfile(&pinfo);
    voiceRecognition->SubscribeRecognition(0, new MyHandler);
  }

  void LoopFrames()
  {
    while (true)
    {
      PXCSmartSPArray syncPoint(2);
      PXCSmartPtr<PXCAudio> audio;
      capture->ReadStreamAsync(&audio, &syncPoint[0]);
      voiceRecognition->ProcessAudioAsync(audio, &syncPoint[1]);
      syncPoint.SynchronizeEx();
    }
  }

  ~CppPipeline()
  {
    if (voiceRecognition)
    {
      voiceRecognition->ProcessAudioEOS();
      voiceRecognition->Release();
    }
    if (capture)
    {
      capture->Release();
    }
    if (session)
    {
      session->Release();
    }
  }

private:
  PXCSession* session;
  PXCVoiceRecognition* voiceRecognition;
  UtilCapture* capture;
};

Obviously, there is more to this, so let's start by beefing up our two implementations so that they both support different languages (remember, the problem I had originally was getting it to understand my accent). Please note that this sample relies on you having installed the other language packs when you installed the SDK. If you didn't install the packs, please feel free to skip over this section.

The key to being able to manipulate the languages is all handled through the profile. Inside the profile, there is a field called language that specifies the current language, so this seems like a good place to start. We will use it both to print out our current language and to choose another one. All we need to do is add the following line after we call QueryProfile (assuming we want to use British English instead):

C# code

pinfo.language = PXCMVoiceRecognition.ProfileInfo.Language.LANGUAGE_GB_ENGLISH;

C++ code

pinfo.language = PXCVoiceRecognition::ProfileInfo::LANGUAGE_GB_ENGLISH;
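
In the context of the C# EnableVoiceRecognition method we built earlier, the change sits between the call to QueryProfile and the rest of the setup. As a sketch (the Console.WriteLine is just there to show the current language before we override it):

C# code

voiceRecognition.QueryProfile(0, out pinfo);
Console.WriteLine("Current language: " + pinfo.language);
pinfo.language = PXCMVoiceRecognition.ProfileInfo.Language.LANGUAGE_GB_ENGLISH;

capture = new UtilMCapture(session);
capture.LocateStreams(ref pinfo.inputs);
voiceRecognition.SetProfile(ref pinfo);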

Moving beyond dictation

So far, we've been concentrating on using the dictation facilities of the SDK. If that were all we had available, it would be pretty impressive, but we can go so much further and support command and control functionality by supplying a dictionary that will be used to control the commands we use.

To use a dictionary, we have to create one and add it to the voice recognition after we have enabled voice recognition, and before we start looping through the frames. To do this, we simply need to add a method that looks something like this:

C# code

string[] commands;

public void AddGrammar(string[] grammar)
{
    int gid;
    commands = grammar;

    // Create a grammar context, add each command phrase to it, then make it the active grammar.
    voiceRecognition.CreateGrammar(out gid);
    for (int i = 0; i < grammar.Length; i++)
    {
        voiceRecognition.AddGrammar(gid, i, grammar[i]);
    }
    voiceRecognition.SetGrammar(gid);
}

Here, we are providing the ability to use a pre-defined array of words/phrases in our application. This is known as a grammar. To use a grammar in the SDK, we call CreateGrammar to create a context that will contain our grammar words and phrases. Next, we add the individual grammar items using AddGrammar before we finally choose which grammar context to apply via SetGrammar.

We save the array of words to a member because the way we receive the command in the OnRecognized method changes. So, let's see what that looks like now.

C# code

public void OnRecognized(ref PXCMVoiceRecognition.Recognition data)
{
    if (data.label < 0)
        Console.WriteLine(data.dictation);
    else
        Console.WriteLine(commands[data.label]);
}

The OnRecognized method now looks a little different. The label property tells us the index of the item that has been identified from the grammar. If the label is negative, the phrase didn't come from the grammar, so we carry on using the dictation property as before. If the label is 0 or greater, we simply retrieve the grammar command using the label as the zero-based index into the array.
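
Pulling the pieces together, here's a sketch of how a caller might wire this up using the pipeline class we've been building (the command phrases are just examples):

C# code

using (var pipeline = new SpeechPipeline())
{
    pipeline.EnableVoiceRecognition();
    pipeline.AddGrammar(new[] { "next step", "previous step", "read the ingredients", "stop" });
    pipeline.LoopFrames();
}

Say one of those phrases and the OnRecognized handler prints it back by looking up data.label in the commands array.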

Speech synthesis

While it's great being able to recognize words and phrases, we can go one better than that and actually have our application talk to us. While text-to-speech has been available for a while, it's never really taken off other than in specialist form factors (such as a SatNav). The SDK makes creating voice synthesis very easy. Uncharacteristically, Intel hasn't provided a UtilPipeline equivalent piece of code for voice synthesis, so we will roll this functionality into the code we have been writing so far.

Unlike voice recognition, speech synthesis happens in one-off bursts. In other words, it's not waiting for data to come into it from the sensors, so we don't need to put our code into the LoopFrames method. Instead, we are going to create a one-off method to take care of the processing.

C# code

public void Say(string sentence)
{
    if (string.IsNullOrWhiteSpace(sentence)) return;

    PXCMVoiceSynthesis.ProfileInfo pinfo;
    PXCMVoiceSynthesis voiceSynthesis;

    // Create the voice synthesis module and apply its default profile.
    session.CreateImpl<PXCMVoiceSynthesis>(PXCMVoiceSynthesis.CUID, out voiceSynthesis);
    voiceSynthesis.QueryProfile(out pinfo);
    voiceSynthesis.SetProfile(ref pinfo);

    // Queue the sentence; the synthesizer hands the audio back to us in chunks identified by sid.
    int sid;
    voiceSynthesis.QueueSentence(sentence, out sid);

    // VoiceOut (from the SDK samples) is responsible for actually playing the audio.
    var audioWriter = new VoiceOut(pinfo.outputs.info);
    while (true)
    {
        PXCMAudio sample = null;
        PXCMScheduler.SyncPoint syncPoint = null;
        var status = voiceSynthesis.ProcessAudioAsync(sid, out sample, out syncPoint);
        if (status < pxcmStatus.PXCM_STATUS_NO_ERROR) break;

        status = syncPoint.Synchronize();
        audioWriter.WriteAudio(sample);
        syncPoint.Dispose();
        sample.Dispose();
        if (status < pxcmStatus.PXCM_STATUS_NO_ERROR) break;
    }
    audioWriter.Close();
}

The first parts of this method should be familiar by now: we get and set our profile. The interesting part is where we queue up the sentence for the voice synthesizer. There's a fair bit going on in this section, and the key thing to be aware of is that while the voice synthesizer generates the audio, it isn't actually responsible for playing it. Instead, we delegate that responsibility to another class that Intel provides in the SDK. This functionality is covered in the code samples you get when you install the SDK and so it is not covered here. After the SDK is installed, the code samples can be located under the base directory of the SDK install at %PCSDK%\framework\CSharp\voice_synthesis.cs\VoiceOut.cs.
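
With Say in place, giving the application a voice is a one-liner. A quick sketch using the pipeline class from earlier:

C# code

using (var pipeline = new SpeechPipeline())
{
    pipeline.Say("Hello. Say next step when you are ready to move on.");
}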

Summary

We have covered using the built-in SDK functionality to enable speech recognition. We then looked at how we could recreate this ourselves, using the same techniques that are used in the SDK. Once we had this in place, we saw how easy it was to start recognizing new languages with the SDK.

We also looked at command mode and the ability to work with predefined grammars, before finishing the discussion with an introduction to speech synthesis.

Useful Links

Perceptual Computing SDK http://software.intel.com/en-us/vcsource/tools/perceptual-computing-sdk
Perceptual Computing SDK Help http://software.intel.com/sites/landingpage/perceptual_computing/documentation/html/
Perceptual Computing SDK Showcase Applications http://software.intel.com/en-us/vcsource/tools/perceptual-computing-sdk/demos

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2013 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.

Intel sample sources are provided to users under the Intel Sample Source Code License Agreement

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Pete O'Hanlon
CEO
United Kingdom United Kingdom
A developer for over 30 years, I've been lucky enough to write articles and applications for Code Project as well as the Intel Ultimate Coder - Going Perceptual challenge. I live in the North East of England with 2 wonderful daughters and a wonderful wife.
 
I am not the Stig, but I do wish I had Lotus Tuned Suspension.