audio_ostream - A Text-to-Speech ostream

Adi Shavit

Mar 6, 2007

CPOL

An article explaining how to add Text-To-Speech to an application, using an ostream interface

Screenshot - audio_ostream.png

Introduction

In this article I'll show you how to add Text-to-Speech (TTS) capabilities to your program.

You'll be able to do it with, essentially, 1 line of code, using the familiar standard ostream syntax.

Additionally, I'll show how using open source C++ tools can make your code short (my whole code is less than 50 lines), reliable, more robust and more general than the original APIs.

What I'll show:

How to add simple TTS to your program.
A simple use of COMSTL and various other STLSoft components.
A simple example of how to use boost::iostreams

Background

I recently had to add audio outputs to a program (running on Windows).

Microsoft's SAPI SDK provides a COM interface through which wide character strings can be spoken via SAPI's TTS engine. The Code Project has many articles explaining how to use SAPI to varying degrees of complexity. So why another?

Well, there were some additional features that I wanted that did not exist in those articles.

Little or no COM hassle. Ideally, it should work within the simplest Console application.
Full (transparent) support for types other than wide-char. e.g. char*, std::strings and even ints, floats, etc.
Intuitive (or at least familiar) syntax

To achieve these goals I developed audio_ostream.

audio_ostream is a full-fledged std::ostream which supports any type that has an operator<<().

You can have as many audio_ostreams as you like all working in parallel.

To handle COM issues, I used the wonderful COMSTL library which takes care of all the delicate and brittle COMplications, such as (un-)initialization, resource (de-)allocation, reference counting etc.

boost::iostreams is used to provide the full std::ostream support with very little effort writing boilerplate code.

Since both boost::iostreams and COMSTL are header only libraries I decided to make my class header only too. The minor price of this decision is that the SAPI headers will be included into any file that uses audio_ostream.

Using the code

Using the code could not be easier:

#include "audiostream.hpp"
using namespace std;
using namespace audiostream;
int main()
{
   audio_ostream aout;
   aout << "Hello World!"  << endl;
   // some more code...
   return 0;
}

This little program will, obviously, say "Hello World!".

The audio stream is asynchronous, so the program keeps running while the text is being spoken (that is why the line // some more code... is there: it gives the speech time to finish). This is conceptually similar to how a std::ostream buffers its output and displays the text only once the internal buffer is full or flushed.

To use the class:

#include the audiostream.hpp header file.
Create an instance of audio_ostream (or waudio_ostream)
Use the stream as you would any std::ostream.

That's really all you need to do to start using the class.

Pre-Requisites

For the code to compile and run you will need 3 libraries:

For the TTS engine, you will need to install the Microsoft Speech SDK (I used ver. 5.1).
For COMSTL you will need the STLSoft libraries (you'll need STLSoft version 1.9.1 beta 44, or later).
The Boost Iostreams library. You can download Boost here.

Set your compiler and linker paths accordingly (Boost and STLSoft are header only).

Advanced Usage

It's possible to change the voice gender, speed, language and many more parameters of the voice using the SAPI text-to-speech (TTS) XML tags.

Just insert the relevant XML tags into the stream to effect the change. The complete list of possible XML tags can be found in the SAPI documentation.

For example:

audio_ostream aout;
// Select a male voice.
aout << "<voice required='Gender=Male'>Hello World!" << endl; 
aout << "Five hundred milliseconds of silence" << flush << 
    "<silence msec='500'/> just occurred." << endl;

For some reason, the XML tags must be the first items in the SAPI spoken string, without any preceding text. Flushing the stream before the tag, as in the example, ensures that the tag starts a new spoken string.
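To see why the flush matters, here is a minimal standard-library sketch that needs neither SAPI nor Boost. The chunk_recorder class and record_example() function are hypothetical names invented for this illustration: the buffer simply records each flushed piece of text as one chunk, standing in for one spoken string.

```cpp
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the speech sink (not part of audio_ostream):
// every flush hands over the buffered text as one chunk, much like one
// spoken string.
class chunk_recorder : public std::stringbuf
{
public:
    std::vector<std::string> chunks;

protected:
    int sync() override
    {
        if (!str().empty()) {
            chunks.push_back(str());   // record what was buffered so far
            str("");                   // start a fresh chunk
        }
        return 0;
    }
};

// Replays the example above and returns the chunks that were produced.
std::vector<std::string> record_example()
{
    chunk_recorder buf;
    std::ostream out(&buf);
    out << "Five hundred milliseconds of silence" << std::flush
        << "<silence msec='500'/> just occurred." << std::endl;
    return buf.chunks;
}
```

In this sketch the call yields two chunks, and the second one begins with the <silence> tag. Without the std::flush, everything would arrive as a single chunk with the tag in the middle.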

You can also call setRate() on the stream with values in the range [-10,10] to control the speed of the speech.
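If the rate comes from user input, you may prefer to clamp it before passing it on. The helper below is a hypothetical convenience, not part of audio_ostream or SAPI, and it assumes you would rather clamp than pass an out-of-range value through.

```cpp
#include <algorithm>

// Hypothetical helper (not part of audio_ostream): clamps a requested
// speech rate to the [-10, 10] range quoted above.
inline long clamp_rate(long requested)
{
    const long lowest  = -10;
    const long highest = 10;
    return std::max(lowest, std::min(requested, highest));
}
```

The result could then be passed to the rate setter, e.g. aout.setRate(clamp_rate(userValue)), where userValue is whatever your program received.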

The Magic

The Core Class

The heart of the code is the audio_sink class:

template < class SinkType >
class audio_sink: public SinkType
{
public:
   audio_sink()
   {      
      // Initialize the COM libraries
      static comstl::com_initializer coinit;                         
      // Get SAPI Speech COM object
      HRESULT hr;
      if(FAILED(hr = comstl::co_create_instance(CLSID_SpVoice, _pVoice))) 
          throw comstl::com_exception(
              "Failed to create SpVoice COM instance",hr); 
   } 
   
   // speak a character string
   std::streamsize write(const char* s, std::streamsize n)
   {
      // make a null terminated string.
      std::string str(s,n);                        
      // convert to wide character and call the actual speak method.
      return write(winstl::a2w(str), str.size());  
   }
   
   // speak a wide character string
   std::streamsize write(const wchar_t* s, std::streamsize n)
   {
      // make a null terminated wstring.
      std::wstring str(s,n);                       
      // The actual COM call to Speak.
      _pVoice->Speak(str.c_str(), SPF_ASYNC, 0);   
      return n;
   }
   
   // Set the speech speed.
   void setRate(long n) { _pVoice->SetRate(n); }   

private:      
   // COM object smart pointer.
   stlsoft::ref_ptr< ISpVoice > _pVoice;             
};

There's a lot going on in this little class. Let's tease apart the pieces one-by-one.

COMSTL, stlsoft::ref_ptr<> and ISpVoice

The only member of the class is stlsoft::ref_ptr< ISpVoice > _pVoice.

This is the smart pointer that will handle all the COM stuff for us. The STLSoft class stlsoft::ref_ptr<> provides RAII-safe handling of reference-counted interfaces (RCIs). Specifically, it is ideal for handling COM objects.

We are using it with the ISpVoice interface. From Microsoft's site:

The ISpVoice interface enables an application to perform text synthesis operations. Applications can speak text strings and text files, or play audio files through this interface. All of these can be done synchronously or asynchronously.

In the constructor, we first initialize COM usage via the comstl::com_initializer. This only happens once (since it is a static object), and should not trouble us anymore. To initialize _pVoice we call comstl::co_create_instance() with the CLSID_SpVoice ID. If all goes well, we are now holding an ISpVoice object handle. All reference counting issues will be handled by stlsoft::ref_ptr<>. If the call fails, a comstl::com_exception is thrown and the class instance will not be created.
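The idea behind an RAII reference-counting pointer can be shown with a small toy. This is a sketch only and not how stlsoft::ref_ptr<> is implemented. Both counted_object and counted_ptr are hypothetical names, and the fake object only counts references rather than deleting itself at zero as a real COM object would.

```cpp
#include <utility>

// Hypothetical stand-in for a COM-style interface. A real COM object would
// destroy itself when the count reaches zero; this one only counts, so the
// count can be inspected afterwards.
struct counted_object
{
    long refs = 0;
    long AddRef()  { return ++refs; }
    long Release() { return --refs; }
};

// Toy RAII wrapper: one AddRef per live pointer, one Release per destructor.
template <class T>
class counted_ptr
{
public:
    explicit counted_ptr(T* p = nullptr) : p_(p) { if (p_) p_->AddRef(); }
    counted_ptr(const counted_ptr& rhs) : p_(rhs.p_) { if (p_) p_->AddRef(); }
    counted_ptr& operator=(counted_ptr rhs) { std::swap(p_, rhs.p_); return *this; }
    ~counted_ptr() { if (p_) p_->Release(); }

    T* operator->() const { return p_; }

private:
    T* p_;
};
```

Every copy adds a reference and every destructor releases one, so the count returns to zero when the last pointer goes out of scope, even if an exception unwinds the stack.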

To speak some text we just need to call _pVoice->Speak() with a wide character string.

However, we would like to support other character types like char*, std::string and more. In fact, we want to support any type that can be converted to a string or wide-string via an operator<<().

Boost Iostreams

boost::iostreams makes it easy to create standard C++ streams and stream buffers for accessing new Sources and Sinks. To rephrase from the site:

A Sink provides write-access to a sequence of characters of a given type. A Sink may expose this sequence by defining a member function write, invoked indirectly by the Iostreams library through the function boost::iostreams::write.

There are 2 pre-defined sinks, boost::iostreams::sink and boost::iostreams::wsink for writing narrow and wide string respectively.

To make our class a Sink and get all its functionality, all we have to do is derive it from one of these classes (depending on whether we want narrow or wide character output). Thus, audio_sink is a template class that derives from its template parameter.

To use our sink and create a concrete ostream, we need to use the boost::iostreams::stream class.

The supporting class is audio_ostream_t:

template < class SinkType >
class audio_ostream_t: public boost::iostreams::stream< SinkType >, 
public SinkType
{
public:
   audio_ostream_t()
   {
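      // Connect to Sink. The this-> qualifier is needed on conforming
      // compilers because open() is a member of a dependent base class.
      this->open(*this);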
   }
};
typedef audio_ostream_t< audio_sink< boost::iostreams::sink  > >  
    audio_ostream ;
typedef audio_ostream_t< audio_sink< boost::iostreams::wsink > > 
    waudio_ostream;

This class allows us to combine both the sink and stream objects into a single entity.

Deriving from boost::iostreams::stream gives us all the ostream functionality. This stream object needs to be initialized with a sink object instance. Thus, we also derive from SinkType (the template parameter) and initialize the boost::iostreams::stream with *this. Another advantage of deriving from SinkType is that it gives us direct access to the sink object. Direct access allows us, for example, to call the setRate() method directly to change the speech speed.
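To get a feel for what boost::iostreams automates, here is a rough hand-written equivalent using only the standard library. It is a simplified, unbuffered sketch and not Boost's implementation. The names string_sink, sink_buf and format_through_sink are hypothetical and exist only for this illustration.

```cpp
#include <ios>
#include <ostream>
#include <streambuf>
#include <string>

// Hypothetical Sink-like class: all it offers is write(pointer, count).
struct string_sink
{
    std::string received;

    std::streamsize write(const char* s, std::streamsize n)
    {
        received.append(s, static_cast<std::string::size_type>(n));
        return n;
    }
};

// Hand-written, unbuffered adapter that forwards stream output to a Sink.
template <class Sink>
class sink_buf : public std::streambuf
{
public:
    explicit sink_buf(Sink& sink) : sink_(sink) {}

protected:
    // Called for multi-character writes.
    std::streamsize xsputn(const char* s, std::streamsize n) override
    {
        return sink_.write(s, n);
    }

    // Called for single characters, since there is no put area.
    int_type overflow(int_type ch) override
    {
        if (traits_type::eq_int_type(ch, traits_type::eof()))
            return traits_type::not_eof(ch);
        const char c = traits_type::to_char_type(ch);
        return sink_.write(&c, 1) == 1 ? ch : traits_type::eof();
    }

private:
    Sink& sink_;
};

// Any type with an operator<<() ends up as characters in the sink.
std::string format_through_sink()
{
    string_sink sink;
    sink_buf<string_sink> buf(sink);
    std::ostream out(&buf);
    out << "Value " << 42 << ' ' << 1.5;
    return sink.received;
}
```

The ostream does all the formatting, so the sink only ever sees characters and a count. That is why a class with nothing more than a write() member is enough.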

Speaking the Text

The boost::iostreams machinery will take care of all the type conversions and ostream syntax. Eventually, audio_sink::write will be called. Although we provided both narrow and wide character string ostreams, SAPI supports only wide character strings. Also, the Sink's write() methods accept non-null-terminated strings and the number of characters to use from the stream.

To address these two issues, we'll convert the continuous stream + size to a null-terminated (w)string using the appropriate std::(w)string constructor.

To speak the narrow character string, we call the wide write version with STLSoft's winstl::a2w() to easily convert from narrow to wide. winstl::a2w() will take care of any required allocations and deallocation of temporary buffers, and of the conversion itself.
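The two steps (bounding the buffer, then widening it) can be sketched without STLSoft as follows. The to_wide function is a hypothetical name, and the character-by-character widening is an assumption that only holds for plain ASCII text. winstl::a2w() performs a proper conversion, which this sketch does not attempt.

```cpp
#include <ios>
#include <string>

// Hypothetical helper: turns (pointer, count) into a null-terminated wide
// string. ASCII-only assumption: each char is widened as-is, with no code
// page or multi-byte handling.
std::wstring to_wide(const char* s, std::streamsize n)
{
    // Step 1: copy exactly n characters; std::string owns the terminator.
    const std::string narrow(s, static_cast<std::string::size_type>(n));
    // Step 2: widen each character.
    return std::wstring(narrow.begin(), narrow.end());
}
```

Calling c_str() on the result gives a null-terminated wide string containing only the first n characters of the buffer.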

Possible Extensions

Having achieved my design goals, some possible extensions come to mind.

It might be interesting to extend the ostream support even further by using locales for language selection. Wrapping some of the XML tags as ostream manipulators would give a more natural (or, at least, familiar) syntax. Of course, similar extensions could turn the SAPI Speech Recognition interfaces into an istream, but that's a completely different ball game.
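As a rough idea of what such a manipulator could look like, here is a sketch for the silence tag. The silence type and silence_demo function are hypothetical names, not part of audio_ostream. The sketch writes to a std::ostringstream so that it runs without SAPI, and it flushes first because the tag has to start a spoken string.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical manipulator type: holds the pause length in milliseconds.
struct silence
{
    int msec;
};

// Flushes pending text so the tag starts a new chunk, then emits the tag.
inline std::ostream& operator<<(std::ostream& os, const silence& s)
{
    os.flush();
    return os << "<silence msec='" << s.msec << "'/>";
}

// Shows the text an audio stream would receive.
std::string silence_demo()
{
    std::ostringstream out;
    out << "Wait" << silence{500} << " done.";
    return out.str();
}
```

With an audio_ostream in place of the std::ostringstream, the same expression would presumably read aout << "Wait" << silence{500} << " done." << endl;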

It might also be desirable to support synchronous (blocking) speech.

Revision History

March 30, 2007 Fixed code to compile and run on MSVS 2005, by using wchar_t instead of unsigned short.
Thanks to Jochen Berteld for pointing out the problem and to Matthew Wilson for pointing out the solution.