Introduction
The included Sophia project is intended to be both instructive and fun. At the most basic level, it is a chatterbox application with speech synthesis and speech recognition added on. I originally meant for it to be a showcase of what one can do with the [...] This article provides an overview of the various features of the [...]
Background
Chatterboxes were among the earliest applications adapted for the personal computer. A chatterbox is simply an artificial personality that tries to maintain a conversation with users by means of predefined scripts. One of the earliest examples is Joseph Weizenbaum's Eliza, written in the mid-1960s. It used a scripted psychiatrist's persona to rephrase anything the user typed into the terminal as a question, and then threw the question back.
Many of the games available in the early 1980s were text-based, and great attention was paid to making text conversations with the computer both involving and immersive. A large part of this involved techniques for fooling the user, to some extent, into believing that the game he or she was playing was actually intelligent. Ada accomplished this by including enough flexibility that responses to the user seemed spontaneous. Infocom accomplished this in its text-based adventures by using humor and even a certain amount of scripted self-awareness. For instance, the game narrator could at times get into moods that would affect what happened next. Emulating intelligence was always a high priority in these games.
The one thing missing from these emulations was the ability to actually talk to the computer using natural language. Even though the movies of the time presented this as something that could be easily accomplished (remember WarGames?), it never was. As speech recognition technology got better, however, the gaming industry also became more visually oriented and less interested in the experiments that were being done with artificial personalities.
In the interim, the text-based experience has been sustained mostly by hobbyists. They continue to write adventure games for the Z-machine specification created by Infocom, as well as new chatterbox scripts that have evolved over the years to converse over a wider variety of topics, and with a wider selection of responses, than the original Eliza.
The Sophia project is simply an attempt to bring speech recognition and synthesis to the text-gaming experience. With Microsoft's speech recognition technology and the API provided through the .NET 3.0 Framework's [...]
A lot of material about working with Vista and speech recognition can also be found in my introductory article Speech Recognition and Synthesis Managed APIs. If there are aspects of Vista speech recognition that you feel I have breezed through too quickly in this article, the reason is quite possibly that I have already covered them there.
Playing the Demo
I will begin by going over what the demo application can do. I will follow this up with an explanation of some of the underlying techniques and patterns.
The application consists of a text output screen, a text entry field, and a default enter button. The initial look and feel is an IBM XT theme (the first computer I ever played on). This can be changed using voice commands, which I will cover later. There are three menus initially available. The File menu allows the user to save a log of the conversation as a text file. The Select Voice menu allows the user to select from any of the synthetic voices installed on her machine. Vista initially comes with "Anna". Windows XP comes with "Sam". Other XP voices are available depending on which versions of Office have been installed over the lifetime of that particular instance of the OS. If the user is running Vista, then the Speech menu will allow him to toggle speech synthesis, dictation, and the context-free grammars.
By doing so, the user will have the ability to speak to the application, as well as have the application speak back to him. If the user is running XP, then only speech synthesis is available, since some of the features provided by .NET 3.0 and consumed by this application do not work on XP.
Speech recognition in Vista has two modes: dictation and context-free recognition. Dictation uses context, that is, an analysis of the words preceding and following a given target of speech recognition, in order to determine what word the speaker intended. Context-free speech recognition, by contrast, uses exact matches and some simple patterns to determine whether certain words or phrases have been uttered. This makes context-free recognition particularly suited to command and control scenarios, while dictation is particularly suited to situations where we are simply attempting to translate the user's utterances into text.
You should begin by trying to start up a conversation with Sophia using the textbox, just to see how it works, as well as her limitations as a conversationalist. Sophia uses certain tricks to appear more lifelike. She throws out random typos, for one thing. She is also a bit slower than a computer should really be. This is because one of the things that distinguish computers from people is the way they process information: computers do it quickly, and people do it at a more leisurely pace. By typing slowly, Sophia helps the user maintain his suspension of disbelief. Finally, if a text-to-speech engine is installed on your computer, Sophia reads along as she types out her responses. I'm not certain why this is effective, but it is how computer terminals are shown to communicate in the movies, and it seems to work well here, too. I will go over how this illusion is created below.
In Command\AIML\Game Lexicon mode, the application generates several grammar rules that help direct speech recognition toward certain expected results.
Be forewarned: initially loading the AIML grammars takes about two minutes, and occurs in the background. You can continue to touch type conversations with Sophia until the speech recognition engine has finished loading the grammars and speech recognition is available. Using the command grammar, the user can make the computer do the following things: LIST COLORS, LIST GAMES, LIST FONTS, CHANGE FONT TO..., CHANGE FONT COLOR TO..., CHANGE BACKGROUND COLOR TO.... Besides the IBM XT color scheme, a black papyrus font on a linen background also looks very nice.
You can also say the command "PLAY GAME" to get a list of game files that are available in the \Game Data\DATA subfolder. Either say the name of the game or the numeric position of the game in the list (e.g., "TWO") in order to play it. To see a complete list of keywords used by the text-adventure game you have chosen, say "LIST GAME KEYWORDS." When the game is initially selected, a new set of rules is created based on different two-word combinations of the keywords recognized by the game. This helps speech recognition by narrowing down the total number of phrases it must look for.
In dictation mode, the underlying speech engine simply converts your speech into words and has the core [...]
Using the code
XP vs Vista
.NET 2.0, .NET 3.0, SAPI 5.3 and the speech engine all come with Vista, so nothing extra needs to be installed in order to get [...]
Vista comes with the Microsoft Anna voice installed. An additional voice, Microsoft Lili, can be obtained by installing the Simplified Chinese language pack. To my knowledge, no other synthetic voices are currently available.
Dumbing down the application
Using humans as the measure, computers do some things poorly, some things well, and some things too well. One of the things they do too well is respond quickly. Speed is a tell that one is dealing with a machine and not a person, and with chatterboxes it ruins the illusion that you are actually talking with an intelligence.
To compensate for this, I slow the response rate down, so that Sophia's responses mimic a person typing. The code responsible for issuing events to the GUI initially pauses in order to emulate consideration. It then iterates through the characters that make up the response provided by the appropriate rules engine, and issues update events to the GUI one character at a time, with a short pause between characters.
// Requires: using System; using System.Threading;
public delegate void GenericEventHandler<T>(T val);
public event GenericEventHandler<string> Write;

public void TypeSlow(string outputText)
{
    // Nothing to do if no listener (the GUI) has subscribed.
    if (null == Write)
        return;
    // Pause briefly to emulate consideration before answering.
    Thread.Sleep(500);
    Write("Sophia: ");
    Thread.Sleep(1000);
    // Start speaking so that speech and typing overlap.
    SpeakText(outputText);
    // Emit the response one character at a time.
    for (int i = 0; i < outputText.Length; i++)
    {
        Write(outputText.Substring(i, 1));
        Thread.Sleep(50);
    }
    Write(Environment.NewLine + Environment.NewLine);
}
This in itself goes a long way toward propping up the illusion of an intelligent computer personality. Going by various movies and TV shows, however, it became clear that we also expect the computer personality to speak to us, though the voice must also be somewhat artificial. In Star Trek, for instance, the voice tends to be monotone. In 2001, HAL's voice is human, but artificially calm. Also, the computer personality's speech typically matches the rate at which she types, as if she is reading aloud as she types, or else as if we are reading her mind as she composes her response. All this is a bit peculiar, of course, since I am using cinematic idioms to judge what will appear natural to the end user. All the same, it seems to work, as if the sci-fi movies don't so much predict what the future will be like as shape our expectations regarding that future.
The speech synthesizer available through [...]
// Requires: using System.Speech.Synthesis;
protected SpeechSynthesizer _synthesizer = new SpeechSynthesizer();
protected bool _isSpeechOn = true;
protected string _selectedVoice = string.Empty;

protected void SpeakText(string output)
{
    if (_isSpeechOn)
    {
        // SelectVoice rejects an empty name, so only select a voice
        // once one has been chosen; otherwise keep the default voice.
        if (!string.IsNullOrEmpty(SelectedVoice))
            _synthesizer.SelectVoice(SelectedVoice);
        // SpeakAsync returns immediately; speech plays in the background.
        _synthesizer.SpeakAsync(output);
    }
}

public string SelectedVoice
{
    get { return _selectedVoice; }
    set { _selectedVoice = value; }
}
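The Select Voice menu needs the names of the installed voices, and `SelectVoice` throws if it is asked for a voice that is not available. The following is a minimal sketch, not the project's actual code, of how the list could be gathered and a fallback chosen. `VoicePicker`, `ChooseVoice` and `GetEnabledVoiceNames` are hypothetical names introduced here for illustration; `GetInstalledVoices`, `InstalledVoice.Enabled` and `VoiceInfo.Name` come from `System.Speech.Synthesis`.

```csharp
using System;
using System.Collections.Generic;
using System.Speech.Synthesis; // needs a reference to System.Speech.dll (Windows only)

public static class VoicePicker
{
    // Hypothetical helper: returns the preferred voice if it is installed
    // (case-insensitive match), otherwise the first installed voice,
    // or null when no voices are available.
    public static string ChooseVoice(IList<string> installed, string preferred)
    {
        foreach (string name in installed)
        {
            if (string.Equals(name, preferred, StringComparison.OrdinalIgnoreCase))
                return name;
        }
        return installed.Count > 0 ? installed[0] : null;
    }

    // Collects the names of all enabled voices known to the synthesizer.
    public static List<string> GetEnabledVoiceNames(SpeechSynthesizer synthesizer)
    {
        List<string> names = new List<string>();
        foreach (InstalledVoice voice in synthesizer.GetInstalledVoices())
        {
            if (voice.Enabled)
                names.Add(voice.VoiceInfo.Name);
        }
        return names;
    }
}
```

A caller could pass the result of `GetEnabledVoiceNames` to `ChooseVoice` and call `SelectVoice` only when the result is not null, which avoids asking the synthesizer for a voice it does not have.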
Advanced Grammar
Next, I wanted to add speech recognition to my application, in order to hold two-way conversations with Sophia. There are several ways to do this, using [...]
I wanted more control than the cross-process SR engine provides, however, and I also did not want what I did with the engine to affect any other applications, so I decided to use the in-process [...]
For the speech recognition engine to be effective, you must load it up with [...]
Creating a dictation grammar is fairly straightforward. Just instantiate a default instance of the dictation grammar, unload all other grammars from the recognition engine, and then add dictation.
// Requires: using System; using System.Speech.Recognition;
protected object grammarLock = new object();

protected void LoadDictation()
{
    DictationGrammar dictationGrammar = new DictationGrammar();
    dictationGrammar.SpeechRecognized +=
        new EventHandler<SpeechRecognizedEventArgs>
            (recognizer_DictationRecognized);
    // Swap grammars under a lock so that other code using
    // grammarLock cannot change the grammar set mid-swap.
    lock (grammarLock)
    {
        _recognizer.UnloadAllGrammars();
        _recognizer.LoadGrammar(dictationGrammar);
    }
}
There is actually more than one [...]
Alternatively, you can handle all speech recognition events from all grammars in one place by creating a delegate to intercept the [...]
In addition to the [...]
Sophia captures these events and displays them in the GUI, so users can watch the speech recognition process as it occurs. Recognition successes are displayed in white, rejections in red, and hypotheses in orange.
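As an illustration of handling recognition events in one place, here is a minimal sketch that subscribes to three events on the in-process engine and maps each to the display color mentioned above. It assumes the events in question are `SpeechRecognized`, `SpeechRecognitionRejected` and `SpeechHypothesized` on `SpeechRecognitionEngine`, which do exist in `System.Speech.Recognition`. `RecognitionMonitor`, `Outcome`, `ColorFor` and `Display` are hypothetical names, and the `Display` callback merely stands in for whatever the GUI actually does.

```csharp
using System;
using System.Drawing;
using System.Speech.Recognition; // needs a reference to System.Speech.dll (Windows only)

public class RecognitionMonitor
{
    public enum Outcome { Recognized, Rejected, Hypothesized }

    // Color scheme from the article: white for successes,
    // red for rejections, orange for hypotheses.
    public static Color ColorFor(Outcome outcome)
    {
        switch (outcome)
        {
            case Outcome.Recognized: return Color.White;
            case Outcome.Rejected: return Color.Red;
            default: return Color.Orange;
        }
    }

    // Hypothetical callback standing in for the GUI update; does nothing by default.
    public Action<string, Color> Display = delegate { };

    // Subscribes to the engine-level events, which fire for every loaded grammar.
    public void Attach(SpeechRecognitionEngine recognizer)
    {
        recognizer.SpeechRecognized +=
            delegate(object sender, SpeechRecognizedEventArgs e)
            { Display(e.Result.Text, ColorFor(Outcome.Recognized)); };
        recognizer.SpeechRecognitionRejected +=
            delegate(object sender, SpeechRecognitionRejectedEventArgs e)
            { Display(e.Result.Text, ColorFor(Outcome.Rejected)); };
        recognizer.SpeechHypothesized +=
            delegate(object sender, SpeechHypothesizedEventArgs e)
            { Display(e.Result.Text, ColorFor(Outcome.Hypothesized)); };
    }
}
```

These handlers are raised by the recognizer rather than by the GUI, so a real `Display` implementation would probably need to marshal the call onto the UI thread before touching any controls.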
Creating custom grammars is much more fun than dictation, however, and also provides a greater degree of control. Custom grammars work best in command and control scenarios, where you only need to match a few select phrases to implement basic commands. In this demo project, I wanted to see how much further I could push that paradigm, and so I implemented grammars that recognize some 30,000 phrases in order to play old Frotz games using speech recognition, and upwards of 70,000 phrases for the underlying AIML-based artificial personality.
The Command and Control grammar is the simplest, so I will start there. In dealing with grammars, it is important to remember that a [...]
A simple example of building a [...]
protected virtual Grammar GetSpeechCommandGrammar()
{
    GrammarBuilder gb = new GrammarBuilder();
    // Each choice is one complete phrase the recognizer may match.
    Choices choices = new Choices();
    choices.Add("List Colors");
    choices.Add("List Game Keywords");
    choices.Add("List Fonts");
    gb.Append(choices);
    Grammar g = new Grammar(gb);
    return g;
}
Another section of the code can set a priority for this grammar, in order to resolve any possible recognition conflicts with other grammars. (Remember that the higher priority number takes precedence, and that a dictation grammar's priority cannot be set.) It can also give the grammar a name, and it can add an event handler for the [...]
public override Grammar[] GetGrammars()
{
    Grammar g = GetSpeechCommandGrammar();
    g.Priority = this._priority;
    g.Name = this._name;
    g.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>
        (SpeechCommands_SpeechRecognized);
    return new Grammar[] { g };
}

public void SpeechCommands_SpeechRecognized
    (object sender, SpeechRecognizedEventArgs e)
{
    string recognizedText = e.Result.Text;
    if (recognizedText.IndexOf
        ("list colors", StringComparison.CurrentCultureIgnoreCase) > -1)
    {
        // Build a comma-separated list of every KnownColor name.
        StringBuilder sb = new StringBuilder();
        foreach (string knownColor in Enum.GetNames(typeof(KnownColor)))
        {
            sb.Append(", " + knownColor);
        }
        // Substring(2) drops the leading ", ".
        Write(sb.ToString().Substring(2));
    }
    else if (recognizedText.IndexOf
        ("list fonts", StringComparison.CurrentCultureIgnoreCase) > -1)
    {
        // Build a comma-separated list of the installed font families.
        StringBuilder sb = new StringBuilder();
        foreach (FontFamily font in
            (new System.Drawing.Text.InstalledFontCollection()).Families)
        {
            sb.Append(", " + font.Name);
        }
        Write(sb.ToString().Substring(2));
    }
    else if (recognizedText.IndexOf
        ("list game keywords", StringComparison.CurrentCultureIgnoreCase) > -1)
    {
        if (_gameEngineBot != null)
        {
            Write(_gameEngineBot.ListGameKeywords());
        }
        else
        {
            Write("No game has been loaded.");
        }
    }
}
Finally, the grammar can be added to the in-process speech recognition engine.
This was a fairly simple scenario, however, and I want to cover some more complex grammars next. It may be the case that you want to recognize a certain set of keywords, but do not care what comes before or after. For instance, if you want the phrase "Play Game" to be recognized, as well as "Let's Play Game" or even "Whoozit Play Game", you can create a grammar that catches each of these phrases by using the [...]
The following example does just this, using grammar builders to create phrases that include wildcards. The grammar builders are then added to a choices object. The choices object is added to another grammar builder object, and finally a grammar is created from that grammar builder. (It should be pointed out that speech recognition is, naturally, not case sensitive. I use ALL CAPS to build grammars so that when a phrase is matched and returned to the GUI from the [...])
protected virtual Grammar GetPlayGameGrammar()
{
    Choices choices = new Choices();
    GrammarBuilder playGameCommand = null;

    // match "* Play Game"
    playGameCommand = new GrammarBuilder();
    playGameCommand.AppendWildcard();
    playGameCommand.Append("PLAY GAME");
    choices.Add(playGameCommand);

    // match "Play Game *"
    playGameCommand = new GrammarBuilder();
    playGameCommand.Append("PLAY GAME");
    playGameCommand.AppendWildcard();
    choices.Add(playGameCommand);

    // exact match for "Play Game"
    choices.Add("PLAY GAME");

    return new Grammar(new GrammarBuilder(choices));
}
There is one problem with the [...]
If you need to know the missing word, then you should use the [...]
So far, you've seen that you can use the grammar builder object to add phrases, add wildcards, and add dictation placeholders. In a very powerful variation, you can also append a [...]
GrammarBuilder gb = new GrammarBuilder();
Choices choices = new Choices();
GrammarBuilder changeColorCommand = new GrammarBuilder();
// One choice per KnownColor name, upper-cased to match the other grammars.
Choices colorChoices = new Choices();
foreach (string colorName in System.Enum.GetNames(typeof(KnownColor)))
{
    colorChoices.Add(colorName.ToUpper());
}
// The fixed phrase followed by any one of the color names.
changeColorCommand.Append("CHANGE COLOR TO");
changeColorCommand.Append(colorChoices);
choices.Add(changeColorCommand);
gb.Append(choices);
Grammar g = new Grammar(gb);
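Once a phrase from a grammar like this is recognized, the handler still has to turn the recognized text back into a color. The sketch below shows one way this might be done, assuming the recognized text comes back as the fixed phrase followed by a `KnownColor` name, as the grammar above builds it. `ColorCommandParser` and its `TryParse` method are hypothetical and are not part of the Sophia source.

```csharp
using System;
using System.Drawing;

public static class ColorCommandParser
{
    // The fixed phrase used when the grammar was built, plus a trailing space.
    private const string Prefix = "CHANGE COLOR TO ";

    // Returns true and sets 'color' when 'recognizedText' has the form
    // "CHANGE COLOR TO <KnownColor name>"; otherwise returns false.
    public static bool TryParse(string recognizedText, out Color color)
    {
        color = Color.Empty;
        if (recognizedText == null ||
            !recognizedText.StartsWith(Prefix, StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }

        string name = recognizedText.Substring(Prefix.Length).Trim();
        try
        {
            // ignoreCase is true because the grammar upper-cases the names.
            KnownColor known = (KnownColor)Enum.Parse(typeof(KnownColor), name, true);
            color = Color.FromKnownColor(known);
            return true;
        }
        catch (ArgumentException)
        {
            // Enum.Parse throws when the name is empty or not a KnownColor.
            return false;
        }
    }
}
```

Inside a `SpeechRecognized` handler, the string passed in would be `e.Result.Text`; a `false` result simply means the phrase came from some other grammar.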
This technique was particularly useful in building the Frotz game grammars. If you recall ever playing these text adventure games (from my youth but perhaps not yours), each game has a vocabulary of 200 or so words. At first blush, this would seem like a lot of keywords to build a grammar out of, given the number of permutations you can create from 200 words. In practice, though, all useful Frotz commands are either single words or two-word combinations. By creating grammars that included as choices all the two-word combinations that can be built from the available keywords, I ended up with a pretty effective speech recognition tool, even though the final grammar includes tens of thousands of choices. For good measure, I also added each keyword as a single-word choice, as well as keyword + dictation combinations.
protected virtual Grammar GetGameGrammar()
{
    Choices choices = new Choices();
    Choices secondChoices = new Choices();
    GrammarBuilder before;
    GrammarBuilder after;
    GrammarBuilder twoWordGrammar;

    foreach (string keyword in GameLexicon.GetAllItems())
    {
        // can't use this character in a grammar
        if (keyword.IndexOf("\"") > -1)
            continue;
        string KEYWORD = keyword.ToUpper();

        // dictation before keyword
        before = new GrammarBuilder();
        before.AppendDictation();
        before.Append(KEYWORD);

        // dictation after keyword
        after = new GrammarBuilder();
        after.Append(KEYWORD);
        after.AppendDictation();

        choices.Add(before);
        choices.Add(after);
        // the keyword on its own
        choices.Add(KEYWORD);
        // collect every keyword as a possible second word
        secondChoices.Add(KEYWORD);
    }

    foreach (string firstKeyword in GameLexicon.GetAllItems())
    {
        // can't use this character in a grammar
        if (firstKeyword.IndexOf("\"") > -1)
            continue;
        string FIRSTKEYWORD = firstKeyword.ToUpper();

        // first keyword followed by any keyword
        twoWordGrammar = new GrammarBuilder();
        twoWordGrammar.Append(FIRSTKEYWORD);
        twoWordGrammar.Append(secondChoices);
        choices.Add(twoWordGrammar);
    }

    Grammar g = new Grammar(new GrammarBuilder(choices));
    return g;
}
Historical note: while you are playing a Frotz game (also known as a Z-Machine game) in Sophia, you will notice that the keywords are sometimes truncated. For instance, there is no keyword for the ubiquitous "lantern", but there is one for "lanter". This was a technique employed in the original games to handle wildcard variations and misspellings.
Bot Command Pattern
In building [...]
This design succeeds in handling at least two scenarios: one in which typed text is entered through the main interface, and another in which a spoken phrase is recognized by a particular grammar associated with a particular bot. When text is entered using the keyboard alone, it is impossible to know which bot contains the correct handler. In this case, it is important that each bot is linked to another bot in serial fashion. The [...]
The [...]
_aimlEngine = new AIMLBotAdapter(aIMLFolderPath);
_aimlEngine.OnUserInput += new GenericEventHandler<string>(DisplayUserInput);
_aimlEngine.OnStart += new EventHandler(EnableSelectedGrammar);
_aimlEngine.OnBotInfoResponse += new GenericEventHandler
Gotchas!
For this application, I wanted to use the async methods of the speech synthesizer as well as the async methods of the speech recognizer, so that screen updates and text entry could all occur at the same time as these other activities. One of the problems in doing this is that the synthesizer and the recognizer cannot process information at exactly the same time, and will throw errors if this is attempted. I therefore had to add many synchronization locks to make sure that the recognizer was disabled whenever the synthesizer was active, and then turned on again when the synthesizer was done. This would all have been a lot simpler had I simply used the synchronous [...]
If you encounter any bugs in the code, come up with a better design for the [...]
Further Reading
Code Project articles
Article History