Voice-Activated Web Browsing

Geoff Bailey


Jan 22, 2004


This article shows how to voice-activate your website using SAPI 5.1 and ActiveX.

Introduction

This article describes an ActiveX control that can be embedded in an HTML web page to provide a voice-activated menu tree.

To compile the code you will need VC6, Microsoft's Speech SDK 5.1 and the Internet Explorer headers. (If you have Windows XP, you may already have the required files on board.)

The Demo Program

The demo for this package is a simple web page with two <iframe> elements: the first <iframe> embeds the ActiveX control while the second displays the page contents.

After compiling and registering WebVoiceCtl.dll, look for a folder called demo and double-click the file WebVoice.html inside it. You should see the tree control in the left frame, shown above. Press the Voice button and be patient while the large speech engines are loaded.

Once loaded, you can speak "go to class one" to start the navigation. The control should respond with "Please confirm class one" to which you may reply "positive". The requested item should then be displayed in the right frame.

Speak "help" at any time to get a list of the active commands. If you've just navigated to a page, the help response will be "[scroll] up, down, top bottom; go back or navigate". Speak your scroll commands, then say "navigate" to return to navigation mode.

Hint: turn the volume on your speakers down to avoid feedback into the microphone.

Background

The code attached to this article demonstrates the following technology:

ATL, ActiveX (and wide character string manipulation)
Tree view searching, expanding and collapsing
Owner drawn buttons, edit controls and static controls
Image lists, overlays (and painting inside an ATL composite control)
Using Microsoft's MSXML parser to load and manipulate an XML file
Using C++ to interface with the web browser and the html page
SAPI 5.1, speech recognition and text to speech engines and Visemes

Of course you don't have to understand all of the items above to use this control in your projects but you may find some of the solutions (a couple of which credit other Code Project articles) interesting.

Creating Your Own Menu Tree

Your menu items are read from the file "data/WebVoice.xml" (the name is currently hard-coded), which contains information for both the menu tree and the SAPI grammar. Its contents are stored in an array of KEY structures for later retrieval. A short XML sample file and the KEY structure are shown below:

  <!-- WebVoice.xml -->
  <menu>
    <item>
      <mid>1</mid>                    <!-- menu item id -->
      <pid>0</pid>                    <!-- parent id -->
      <txt>Class One</txt>            <!-- menu text and grammar phrase -->
      <ref>../html/class1.html</ref>  <!-- hyperlink reference -->
    </item>
    <item>
      <mid>2</mid>
      <pid>1</pid>
      <txt>Source One</txt>
      <ref>../html/src1.html</ref>
    </item>
    <!-- more items here -->
  </menu>

typedef struct tag_key
{
  int mid;
  int pid;
  int chd;
  HTREEITEM hItem;
  HTREEITEM hParent;
  char txt[32];
  char ref[128];
}KEY;

KEY aKeys[NUMBER_OF_KEYS];

You must be careful to ensure that the menu item IDs are numbered sequentially and that each parent ID refers to an item that is above the current item in the tree. No error checking is currently performed while loading, so an invalid XML file will cause the control to crash.
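The two rules above are easy to check before the tree is built. The following is only a sketch, not part of the WebVoice source: MenuKey is a cut-down stand-in for KEY, FindInvalidKey is a hypothetical helper name, and it assumes (as in the sample file) that IDs start at 1 and that top-level items use a parent ID of 0.

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-in for the article's KEY structure: only the two
// fields needed for validation (tree handles and strings are omitted).
struct MenuKey {
    int mid;  // menu item id, expected to run 1, 2, 3, ...
    int pid;  // parent id; 0 is assumed to mean "top-level item"
};

// Returns the index of the first invalid entry, or -1 if every entry passes.
// An entry passes when its id equals its position + 1 and its parent is
// either 0 or an id that appears earlier in the array.
int FindInvalidKey(const MenuKey* keys, std::size_t count)
{
    assert(keys != NULL || count == 0);
    for (std::size_t i = 0; i < count; ++i) {
        const int expectedId = static_cast<int>(i) + 1;
        if (keys[i].mid != expectedId)
            return static_cast<int>(i);      // ids are not sequential
        if (keys[i].pid < 0 || keys[i].pid >= expectedId)
            return static_cast<int>(i);      // parent is not above this item
    }
    return -1;
}
```

A loader could call a check like this right after parsing and refuse to build the tree when the result is not -1, rather than crashing later.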

SAPI Initialization

The WebVoice control handles SAPI initialization in the function InitSapi() as follows:

Creates the speech engine.
Creates a recognition context.
Sets a notification mechanism (Windows message) for callback from the recognition engine.
Sets recognition event interests.
Loads specific grammar files.
Creates the text to speech engine (TTS).
Sets TTS event interests.
Sets a notification mechanism (Windows message) for callback from the TTS engine.
Sets the active rule.

The Speech SDK documentation and examples clearly show the required SAPI initialization calls so I won't cover that here. However, the static grammar file and the dynamic grammar require some explanation.

SAPI Grammar

SAPI grammars may be loaded statically from an XML file or dynamically at runtime. The WebVoice control uses both methods. The static part is loaded from Grammar.xml, which has the following format:

<GRAMMAR LANGID="409">
  <DEFINE>
    <ID NAME="RID_Tree"     VAL="1001"/>
    <ID NAME="RID_MenuItem" VAL="1004"/>
  </DEFINE>
  <RULE ID="RID_Tree" TOPLEVEL="ACTIVE">
    <L>
      <P>open</P>
      <P>go to</P>
    </L>
    <RULEREF REFID="RID_MenuItem" />
  </RULE>
  <RULE ID="RID_MenuItem"  DYNAMIC="TRUE">
    <L PROPID="RID_MenuItem">
      <P VAL="1">Dummy Item</P>
    </L>
  </RULE>
  <!-- more rules -->
</GRAMMAR>

As you can see, this file snippet creates two rules. The first rule, RID_Tree, defines the starting navigation phrases and then references the second rule, RID_MenuItem. The second rule holds a dummy phrase that will be replaced at runtime with the names of your menu items. This file is compiled into Grammar.cfg by SAPI's gc.exe and then loaded into a resource inside the DLL. The dynamic rules are added as follows:

HRESULT CWebVoice::LoadGrammar()
{
  USES_CONVERSION;  
  HRESULT hr;

  SPPROPERTYINFO pi; 
  ZeroMemory(&pi,sizeof(SPPROPERTYINFO));
  pi.ulId      = RID_MenuItem;  // property ID
  pi.vValue.vt = VT_UI4;

  // add menu items to the dynamic grammar rule
  // (hRule is the state handle of that rule; its declaration is not part of this excerpt)
  for(int i=0; i < m_nNumKeys; i++) {
    pi.vValue.ulVal = i+1;     // Property_Value == data_index + 1
    hr=m_cpGrammar->AddWordTransition(hRule,NULL,
         T2W(aKeys[i].txt),L" ",SPWT_LEXICAL,1,&pi);
    if(FAILED(hr)) return hr;
  }

  // add a wildcard phrase
  pi.vValue.ulVal = 0;
  hr=m_cpGrammar->AddWordTransition(hRule, 
     NULL, L"*", L" ", SPWT_LEXICAL, 1, &pi);
  if(FAILED(hr)) return hr;

  hr=m_cpGrammar->Commit(NULL);                  if(FAILED(hr)) return hr;
  hr=m_cpGrammar->SetGrammarState(SPGS_ENABLED); if(FAILED(hr)) return hr;
  return hr;
}

Note that each new phrase (taken from aKeys[i].txt) is assigned a property ID of RID_MenuItem and a unique property value (between 1 and m_nNumKeys), then added to the grammar with the AddWordTransition() function. Note also that a wildcard phrase ("*") with a property value of 0 is added at the end to catch spoken phrases not covered in the grammar.
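Because 0 is reserved for the wildcard, a value read back from a recognition result has to be checked before it is used as an array index. The helpers below are a minimal sketch of that mapping; the names kWildcardValue, PropertyValueForIndex and IndexForPropertyValue are hypothetical and do not appear in the WebVoice source.

```cpp
#include <cassert>

// Property value reserved for the wildcard phrase (matches the listing above).
const unsigned long kWildcardValue = 0;

// Value attached to the menu item stored at array position `index`.
unsigned long PropertyValueForIndex(int index)
{
    assert(index >= 0);
    return static_cast<unsigned long>(index) + 1;
}

// Maps a recognized property value back to an array index.
// Returns -1 for the wildcard or for any value outside 1..numKeys.
int IndexForPropertyValue(unsigned long value, int numKeys)
{
    assert(numKeys >= 0);
    if (value == kWildcardValue || value > static_cast<unsigned long>(numKeys))
        return -1;
    return static_cast<int>(value) - 1;
}
```

A handler that gets -1 back can reply with a "not understood" prompt instead of indexing the data array.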

Recognition

The recognition engine compares your spoken words to the active grammar rule. When either a recognition or a false recognition is made by the engine, your callback routine is called to handle the request. The following shows a section of the recognition handler:

void CWebVoice::ExecuteCommand(ISpRecoResult *pPhrase, HWND hWnd)
{
  USES_CONVERSION;
  SPPHRASE *pElements;
  static int ind;                        // index of the item awaiting confirmation
  int pos;

  if (SUCCEEDED(pPhrase->GetPhrase(&pElements))) {
    m_cpRecoCtxt->Pause(NULL);           // pause recognition while loading

    switch (pElements->Rule.ulId) {
    case RID_Tree:
      pos=pElements->pProperties->vValue.ulVal;
      ind=pos-1;                         // store the index into the data array
      SetActiveRule(RID_Confirm);        // change the active rule
      // wcs is a wide-character reply buffer declared outside this excerpt
      wcscpy(wcs,L"Please confirm: \r\n");
      wcscat(wcs,T2W(aKeys[ind].txt));
      HandleReply(0,wcs);
      break;
    case RID_Confirm:
      pos=pElements->pProperties->vValue.ulVal;
      switch(pos) {
      case 1:
        HandleConfirm(ind);              // expand the tree and navigate to item
        SetActiveRule(RID_View);         // change the active rule
        break;
      case 2:
      default:                           // any other reply: back to navigation
        SetActiveRule(RID_Tree);
        HandleReply(MID_Tree,NULL);
        break;
      }
      break;                             // leave the outer switch, no fall-through
    // more cases for other rules
    default:
      SetActiveRule(RID_Tree);
      HandleReply(RID_Tree,NULL);
      break;
    }
    ::CoTaskMemFree(pElements);
    m_cpRecoCtxt->Resume(NULL);
  }
}

When a navigation rule is matched, its property value is stored in the static variable ind and, after confirmation, is passed to the HandleConfirm(ind) function, which uses it to index the data array (aKeys[ind]) and retrieve the correct data item. If successful, the tree view is opened to show the selection and the hyperlink is navigated.
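The switching between active rules amounts to a small state machine: navigate, confirm, view, and back to navigate. The sketch below models only that flow, with no SAPI calls, so it can be read and tested on its own. The Rule enum, the VoiceState structure and the function names are all hypothetical; they mirror the behaviour described in this article rather than the actual WebVoice code.

```cpp
#include <cassert>

// Hypothetical stand-ins for the RID_Tree, RID_Confirm and RID_View rules.
enum Rule { RULE_TREE, RULE_CONFIRM, RULE_VIEW };

struct VoiceState {
    Rule active;   // rule the recognizer is currently listening for
    int  pending;  // index awaiting confirmation, -1 if none
    int  current;  // index of the item last navigated to, -1 if none
};

VoiceState MakeInitialState()
{
    VoiceState s = { RULE_TREE, -1, -1 };
    return s;
}

// "go to <item>": remember the index and wait for confirmation.
void OnItemRequested(VoiceState& s, int index)
{
    assert(index >= 0);
    s.pending = index;
    s.active  = RULE_CONFIRM;
}

// Reply to "Please confirm": a positive reply navigates and enters view mode,
// any other reply drops the request and returns to navigation mode.
void OnConfirmReply(VoiceState& s, bool positive)
{
    if (s.active != RULE_CONFIRM) return;   // ignore stray replies
    if (positive) {
        s.current = s.pending;
        s.active  = RULE_VIEW;
    } else {
        s.active  = RULE_TREE;
    }
    s.pending = -1;
}

// "navigate": leave view (scroll) mode and return to navigation mode.
void OnNavigate(VoiceState& s)
{
    if (s.active == RULE_VIEW) s.active = RULE_TREE;
}
```

Keeping the pending index in an explicit structure, rather than in a function-level static, is a design choice of this sketch; it makes the flow easier to test in isolation.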

Points of Interest

Every time I write an ActiveX control or a web browser plug-in in ATL, I have to relearn how to use wide-character strings, and SAPI uses wide-character strings exclusively. If your code does not have to run on Win98, you can just define UNICODE and, as long as your strings are declared as TCHAR*, the usual API calls will work fine. But if discounting Win98 users is not an option, you are forced to convert from multibyte to wide strings whenever you pass a string to a wide-only interface such as SAPI. Fortunately, ATL has a wonderful set of conversion macros defined in <atlbase.h>. You just place the macro USES_CONVERSION at the beginning of each function that needs to convert strings, then use the W2T() or T2W() macros to perform the conversion. These macros are not free: each call allocates a temporary buffer and copies the string into it. In ATL 3.0 (the version that ships with VC6) that buffer comes from the stack and is only released when the function returns, so be wary of calling them in a long loop. However, these macros are so convenient and tidy that I've even started including <atlbase.h> in my MFC programs.
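For readers without ATL to hand, the conversion itself can be illustrated with the standard C library. This is a portable sketch of what a narrow-to-wide conversion involves, not ATL's implementation: Widen is a hypothetical helper, it uses std::mbstowcs with the current C locale (ATL uses the Win32 conversion routines), and it returns a heap-backed std::wstring instead of a stack buffer.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Converts a narrow (multibyte) string to a wide string using the current
// C locale. Returns an empty string if the input contains an invalid sequence.
std::wstring Widen(const char* narrow)
{
    assert(narrow != NULL);
    // A multibyte string never yields more wide characters than it has bytes,
    // so strlen(narrow) + 1 elements are always enough (the +1 is the terminator).
    const std::size_t maxChars = std::strlen(narrow);
    std::vector<wchar_t> buffer(maxChars + 1);
    const std::size_t written = std::mbstowcs(&buffer[0], narrow, maxChars + 1);
    if (written == static_cast<std::size_t>(-1))
        return std::wstring();               // invalid multibyte sequence
    return std::wstring(&buffer[0], written);
}
```

Because the result owns its memory, a helper of this shape is safe to call inside a loop such as the one that adds menu phrases to the grammar, at the cost of a heap allocation per call.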

Another problem I encountered was the need to use owner-drawn buttons, because the standard dialog-box grey does not cut it on a web page. In MFC I would handle the WM_CTLCOLOR message and change the background colour there. In ATL I found that I had to make the buttons owner-drawn and then handle the WM_DRAWITEM message. All well and good, but then I discovered that I needed both a toggle button and a momentary button, and that I now had to code the required responses myself. It was all great fun, but it took some time before I was able to get to the SAPI part of the code.

The Microsoft Speech SDK 5.1 is a 68 MB download and if you need to package the SAPI runtime modules with your code, you must download the full redistribution package which is 131.58 MB.

Unfortunately Microsoft does not package the runtime modules by themselves. Either your clients must download the SDK (including the extra 30 MB of developer code and documentation) or you must prepare a runtime module package yourself as a separate download from your application.

Revisions

29 January 2004: Subclassed the picture control to avoid Win98 problems and fixed a small script error in the demo.