SDI v1.0 Document
Copyright 2004 Dong Lin
Contents
About SDI
Using SDI
SDI Reference
SDI Environment
SDI Environment Functions
SDI Environment Structures
Speech
Speech Functions
Speech Structures
Earcon
Earcon Functions
Earcon Structures
Earcon Macros
_____________________________________________________
1. About SDI
SDI (Sound Device Interface) is a GDI-like API for auditory display. The current version is implemented in C,
not C++, and it has only two primary functions: positioning speech in a virtual 3D world, and MIDI
composition from a single string, such as "@I00 C42 D42 E42 D42 C42", representing the five piano notes "12321".
In terms of software architecture, SDI consists of two parts: the SDI Environment and Sound Objects. As stated
previously, the two sound objects currently available are Speech and Earcon. The SDI Environment is set up and
cleaned up with InitializeSDI and ReleaseSDI, respectively. Sound objects are created with the corresponding
CreateXXX function but are all deleted by the same DeleteSDIObject. They are identified by handles derived
(typedefed) from HSDIOBJECT.
For each type of object there is a set of generic operations as well as some exclusive operations.
The generic operations are Play, Pause, Stop, Set/GetPosition and Set/GetVolume. The exclusive operations are,
for Speech: CreateSpeech, PlayText (positioned TTS output), AddText (append text to the end of the queue)
and GetPresetVoice (get preset voice parameters); for Earcon: CreateEarcon, SetChannelInstrument (set instrument),
PlaySegment (play music data), PlayNotation (convert a string to MIDI output) and ParseNotation (convert a string to
MIDI file data). The positioning of Earcon is dreadful, and it needs further improvement.
2. Using SDI
* If you want to learn how to use SDI in a minute, please refer to the sample program.
* If you are Chinese or can speak Chinese, you can turn to NewsEverywhere, a complete online news reader,
including its source code. You can experience two "persons" presenting different news to you at the same time.
It may sound unnatural, but it really works.
Because sound can't be rendered in a flash, lots of threads and synchronization objects are used in SDI,
even the simplistic "Sleep(XXXX)" delay. The system therefore suffers from this intensive use of threading and
can become very unstable. Debugging it is also a nightmare.
Here are some guidelines to reduce the number of frustrating problems when you use SDI:
1. Keep in mind that SDI is not fully thread-safe.
2. ECI (the API provided by ViaVoice TTS) may have a function limit (possibly due to the evaluation version), so
one extra thread is used for each Speech object. In addition, ECI is not reentrant across threads. So please
make sure there are intervals between operations. (It's not a MediaPlayer anyway ^_^ )
3. Get/SetPosition and Get/SetVolume are always safe, because they are handed over to A3D or DirectX internally.
You can achieve "sound animation" by using them.
4. Earcon is relatively more robust than Speech, except for its restrictions on data. Refer to PlayNotation,
ParseNotation and PlaySegment to see how to use Earcon correctly.
5. Some knowledge of MIDI and the MIDI file format is highly recommended. Please refer to the "MIDI Specification"
and "MIDI File Format" for detailed information.
_________________________________________________________________________
3. SDI Reference
File List
sdi.h Functions, Structures, Constants for SDI
sdi32.lib sdi32.dll
Third-Party Component
A3D ia3dapi.h a3dapi.dll
ViaVoice TTS eci.h ibmeci.dll
3.1 SDI Environment
InitializeSDI ReleaseSDI DeleteSDIObject
[Generic Operation]Play Pause Stop SetPosition GetPosition SetVolume GetVolume
SDIVECTOR
3.1.1 SDI Environment Functions
InitializeSDI
Description
Initialize the SDI Environment (mainly A3D and DirectX objects).
Prototype
BOOL InitializeSDI(LPGUID audioDevice, DWORD a3dStyle, HINSTANCE hInstance, DWORD dwFlag);
Parameters
audioDevice Pointer to the audio device GUID, which is obtained by DirectSoundEnumerate.
Refer to the DirectX SDK for how to use DirectSoundEnumerate.
If it is NULL, the default audio device is used.
a3dStyle Specifies the rendering features of A3D. It is a bitwise OR of the following:
SDI_A3D_1ST_REFLECTION
SDI_A3D_DISABLE_FOCUS_MUTE
SDI_A3D_DISABLE_SPLASHSCREEN
SDI_A3D_GEOMETRIC_REVERB
SDI_A3D_OCCLUSIONS
SDI_A3D_REVERB
SDI_A3D_CL_EXCLUSIVE
SDI_A3D_LEFT_HAND_COORD
SDI_A3D_OUTPUT_HEADPHONES
SDI_A3D_OUTPUT_SPEAKERS_WIDE
SDI_A3D_OUTPUT_SPEAKERS_NARROW
SDI_A3D_OUTPUT_MODE_QUAD
SDI_A3D_STREAMING_PRIORITY_NORMAL
SDI_A3D_STREAMING_PRIORITY_HIGH
SDI_A3D_STREAMING_PRIORITY_HIGHEST
Most of the time, these extra features should not be used; they are listed here only for completeness with respect to A3D.
hInstance Instance Handle
dwFlag Reserved. Must be 0.
Return Values
TRUE if successful. Note, however, that failure does not necessarily return FALSE; most errors trigger an
assertion instead, which is intended for debugging.
Remarks
InitializeSDI must be called before any other SDI function.
See Also
ReleaseSDI
*******************************************
ReleaseSDI
Description
Clean up SDI Environment.
Prototype
BOOL ReleaseSDI();
Parameters
None.
Return Values
TRUE if successful. Note, however, that failure does not necessarily return FALSE; most errors trigger an
assertion instead, which is intended for debugging.
Remarks
ReleaseSDI traverses all remaining sound objects and deletes them. But it is highly recommended that you delete
sound objects with DeleteSDIObject as soon as you have finished using them, because they consume a lot of resources:
on average, each sound object uses 2~4 MB of virtual memory. You don't expect your program to be something like IE, right?
Besides, you had better wait approximately 1 s after ReleaseSDI, to make sure all the threads exit successfully.
See Also
InitializeSDI, DeleteSDIObject
********************************************
DeleteSDIObject
Description
Delete sound object
Prototype
BOOL DeleteSDIObject(HSDIOBJECT hObject);
Parameters
hObject A handle derived (typedefed) from HSDIOBJECT. Currently available types are HSPEECH and HEARCON.
Return Values
TRUE if successful. Note, however, that failure does not necessarily return FALSE; most errors trigger an
assertion instead, which is intended for debugging.
Remarks
Though ReleaseSDI traverses all the sound objects and deletes them, it is highly recommended that you delete
sound objects with this function as soon as you have finished using them, because they consume a lot of resources:
on average, each sound object uses 2~4 MB of virtual memory. You don't expect your program to be something like
IE, right?
See Also
ReleaseSDI, CreateSpeech, CreateEarcon
**********************************************
Play
Description
Display (play) the sound object.
Prototype
BOOL Play(HSDIOBJECT hObject);
Parameters
hObject A handle derived (typedefed) from HSDIOBJECT. Currently available types are HSPEECH and HEARCON.
Return Values
TRUE if successful. Note, however, that failure does not necessarily return FALSE; most errors trigger an
assertion instead, which is intended for debugging.
Remarks
Speech Object: Retrieve a text from the text queue and start synthesizing and displaying it.
Earcon Object: Play from the beginning, or from the position where the previous Pause was called.
* If the sound object is currently playing, this operation simply returns.
See Also
Pause, Stop
***********************************************
Pause
Description
Pause the displaying.
Prototype
BOOL Pause(HSDIOBJECT hObject);
Parameters
hObject A handle derived (typedefed) from HSDIOBJECT. Currently available types are HSPEECH and HEARCON.
Return Values
TRUE if successful. Note, however, that failure does not necessarily return FALSE; most errors trigger an
assertion instead, which is intended for debugging.
Remarks
Speech Object: Stop the current synthesis job and get ready for the next one. That is, you can't actually pause a
Speech object. I'm really sorry about this. If you are interested in why this ordinary function is difficult to
achieve, you can turn to the source code. And I'll be grateful if you are willing to share your improvements with me.
Earcon Object: For Earcon, Pause really does pause the playing.
See Also
Play, Stop
************************************************
Stop
Description
Stop (reset) the sound object.
Prototype
BOOL Stop(HSDIOBJECT hObject);
Parameters
hObject A handle derived (typedefed) from HSDIOBJECT. Currently available types are HSPEECH and HEARCON.
Return Values
TRUE if successful. Note, however, that failure does not necessarily return FALSE; most errors trigger an
assertion instead, which is intended for debugging.
Remarks
Speech Object: Stop displaying and empty the text queue.
Earcon Object: Stop playing and move the play cursor to the head of the music.
See Also
Play, Pause
*************************************************
Set/GetPosition
Description
Set/Get the position in virtual 3D world.
Prototype
BOOL SetPosition(HSDIOBJECT hObject, SDIVECTOR* pos);
BOOL GetPosition(HSDIOBJECT hObject, SDIVECTOR* pos);
Parameters
hObject A handle derived (typedefed) from HSDIOBJECT. Currently available types are HSPEECH and HEARCON.
pos A pointer to an SDIVECTOR, which holds or will hold the position parameters.
Return Values
TRUE if successful. Note, however, that failure does not necessarily return FALSE; most errors trigger an
assertion instead, which is intended for debugging.
Remarks
Refer to SDIVECTOR for info on coordinate system.
SetPosition simply passes the request over to A3D(for Speech) or Directx(for Earcon). So it is safe, and will take
effect at once. You can "animate" sound object by using SetPosition.
Though HRTF(Head-Related Transform Function) works quite well in x-z plane, it fails to position sound along y-axis.
So you can't expect much by changing the y value.
See Also
Set/GetVolume
**************************************************
Set/GetVolume
Description
Set/Get sound object volume.
Prototype
BOOL SetVolume(HSDIOBJECT hObject, FLOAT fGain);
FLOAT GetVolume(HSDIOBJECT hObject);
Parameters
hObject A handle derived (typedefed) from HSDIOBJECT. Currently available types are HSPEECH and HEARCON.
fGain Gain. Range: 0~1.0f [0: mute, 1.0f: original]
Return Values
SetVolume: TRUE if successful. Note, however, that failure does not necessarily return FALSE; most errors trigger
an assertion instead, which is intended for debugging.
GetVolume: the gain of the sound object.
Remarks
SetVolume simply passes the request on to A3D (for Speech) or DirectX (for Earcon), so it is safe and takes
effect at once. You can "animate" a sound object by using SetVolume.
See Also
Set/GetPosition
***************************************************
3.1.2 SDI Environment Structures
SDIVECTOR
Description
Define the position of sound object in virtual world.
Definition
struct SDIVECTOR
{
float x;
float y;
float z;
};
Remarks
The coordinate system in SDI is consistent with DirectX: a left-handed Cartesian coordinate system, with the
positive x-axis pointing to the right, the positive y-axis up and the positive z-axis pointing away from you.
(It is different from the math I was taught here.)
[One can really get confused when choosing a coordinate system. A3D and OpenGL adopt a right-handed Cartesian
coordinate system. The reason why I chose the opposite one, well, MS...]
****************************************************
3.2 Speech
CreateSpeech PlayText AddText GetPresetVoice
VOICEPARAM
3.2.1 Speech Functions
CreateSpeech
Description
Create Speech sound object
Prototype
HSPEECH CreateSpeech(DWORD dwECIStyle, VOICEPARAM* pVoice, DWORD dwFlag);
Parameters
dwECIStyle Set ECI(Eloquence Command Interface) attributes.
(ECI is provided by ViaVoice TTS)
It is bitwise OR of the following values.
Please refer to IBM ViaVoice TTS SDK.
SDI_LANGUAGE_GENERAL_AMERICAN_ENGLISH
SDI_LANGUAGE_BRITISH_ENGLISH
SDI_LANGUAGE_MANDARIN_CHINESE
SDI_DONTUSE_ABBR_DICTIONARY
SDI_ANNOTATED_TEXT
SDI_4DIGIT_AS_YEAR
SDI_SAMPLERATE_8000/11024/22048
Most of the time all you need is a language tag.
pVoice Sets the voice characteristics. You can use a preset voice, or customize a unique voice.
* Choose one of the following preset voices:
SDI_VOICE_ADULTMALE1
SDI_VOICE_ADULTFEMALE1
SDI_VOICE_CHILD
SDI_VOICE_ADULTMALE2
SDI_VOICE_ADULTMALE3
SDI_VOICE_ADULTFEMALE2
SDI_VOICE_ELDERLYFEMALE
SDI_VOICE_ELDERLYMALE
* Use customized parameters. You can fill in the VOICEPARAM structure completely from scratch, or first
get preset parameters using GetPresetVoice and then modify them. In either case, you should pass the
pointer to your VOICEPARAM to CreateSpeech.
dwFlag Reserved. Must be 0.
Return Values
If successful, a handle of type HSPEECH is returned for future operations.
If it fails, NULL is returned.
Remarks
Currently, SDI is not fully thread-safe, so you'd better not pass the handle to other threads.
See Also
PlayText, AddText
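As a concrete illustration of the custom-voice path, the sketch below fills a VOICEPARAM from scratch. The structure is reproduced from section 3.2.2 (with BYTE spelled out as unsigned char so the sketch is self-contained); the field values are arbitrary examples, and the CreateSpeech call is only indicated in a comment because it needs sdi32.lib.

```c
/* VOICEPARAM as defined in this reference (see 3.2.2). */
typedef struct tagVOICEPARAM {
    unsigned char breathiness;      /* 0-100, 100 = whisper          */
    unsigned char gender;           /* 0 = male, 1 = female          */
    unsigned char headSize;         /* 0-100                         */
    unsigned char pitchBaseline;    /* 0-100, maps to 40-442 Hz      */
    unsigned char pitchFluctuation; /* 0-100, 0 = monotonous voice   */
    unsigned char roughness;        /* 0-100                         */
    unsigned char speed;            /* 0-250                         */
    unsigned char volume;           /* 0-100                         */
} VOICEPARAM;

/* Build an example female voice; every value here is just a sample. */
static VOICEPARAM make_sample_voice(void)
{
    VOICEPARAM v;
    v.breathiness      = 10;
    v.gender           = 1;   /* female                              */
    v.headSize         = 50;
    v.pitchBaseline    = 60;  /* 40 Hz + 60 * 4.02 Hz = about 281 Hz */
    v.pitchFluctuation = 40;
    v.roughness        = 0;
    v.speed            = 50;
    v.volume           = 90;
    return v;
}

/* Real usage (not compiled here):
 *   VOICEPARAM v = make_sample_voice();
 *   HSPEECH h = CreateSpeech(SDI_LANGUAGE_GENERAL_AMERICAN_ENGLISH, &v, 0);
 */
```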
******************************************************
PlayText
Description
Empty the text queue and start displaying a new text.
Prototype
BOOL PlayText(HSPEECH hSpeech, PCTSTR psText);
Parameters
hSpeech Speech object handle of type HSPEECH, returned by CreateSpeech
psText Null-terminated text string.
Return Values
TRUE if successful. Note, however, that failure does not necessarily return FALSE; most errors trigger an
assertion instead, which is intended for debugging.
Remarks
This is the most important operation for Speech. It feeds ECI with the text provided, and starts a new worker
thread for synthesis.
Make sure there is an interval between two subsequent operations.
See Also
AddText
*******************************************************
AddText
Description
Add text to text queue.
Prototype
BOOL AddText(HSPEECH hSpeech, PCTSTR psText);
Parameters
hSpeech Speech object handle of type HSPEECH, returned by CreateSpeech
psText Null-terminated text string.
Return Values
TRUE if successful. Note, however, that failure does not necessarily return FALSE; most errors trigger an
assertion instead, which is intended for debugging.
Remarks
AddText merely appends a new text to the text queue without affecting the current displaying job.
Note that there is a 1~2 s delay between two texts (for the termination of the old thread and the start-up of a
new one; it is another side effect of the ECI function limit). If this gap is not what you desire, please feed
all the text through a single PlayText.
See Also
PlayText
********************************************************
GetPresetVoice
Description
Get the parameters of preset voice.
Prototype
BOOL GetPresetVoice(int nIndex, VOICEPARAM* pVoice);
Parameters
nIndex Index of the preset voice. Cast one of the following values to "int" and pass it to GetPresetVoice.
SDI_VOICE_ADULTMALE1
SDI_VOICE_ADULTFEMALE1
SDI_VOICE_CHILD
SDI_VOICE_ADULTMALE2
SDI_VOICE_ADULTMALE3
SDI_VOICE_ADULTFEMALE2
SDI_VOICE_ELDERLYFEMALE
SDI_VOICE_ELDERLYMALE
pVoice A pointer to VOICEPARAM structure to receive parameters
Return Values
TRUE if successful. Note, however, that failure does not necessarily return FALSE; most errors trigger an
assertion instead, which is intended for debugging.
Remarks
Used with CreateSpeech. Refer to CreateSpeech for how to use it.
See Also
CreateSpeech
*********************************************************
3.2.2 Speech Structures
VOICEPARAM
Description
Define voice parameters
Definition
typedef struct tagVOICEPARAM
{
BYTE breathiness; // 0-100 (100 for whisper)
BYTE gender; // 0: Male 1: Female
BYTE headSize; // 0-100
BYTE pitchBaseline; // 0-100 (Corresponding to 40-442Hz in real world)
// 1 unit for 4.02Hz
BYTE pitchFluctuation; // 0-100 (0 for monotonous voice)
BYTE roughness; // 0-100
BYTE speed; // 0-250 (Corresponding to 70-1297 words per minute)
// 1 unit for 4.908 words per minute
BYTE volume; // 0-100
} VOICEPARAM, *LPVOICEPARAM;
Remarks
Used with GetPresetVoice and CreateSpeech
See Also
CreateSpeech, GetPresetVoice
***********************************************************
3.3 Earcon
CreateEarcon SetChannelInstrument PlaySegment PlayNotation ParseNotation
NOTERANGE
MAKE_TIMESIGNATURE
3.3.1 Earcon Functions
CreateEarcon
Description
Create Earcon sound object.
Prototype
HEARCON CreateEarcon(DWORD dwTempo, WORD nTicksPerQN, DWORD timeSignature, WORD keySignature);
Parameters
dwTempo Number of microseconds per MIDI quarter-note.
nTicksPerQN Number of delta time ticks per quarter-note.
timeSignature Time signature. Use MAKE_TIMESIGNATURE to compose this parameter.
keySignature Key signature. Use MAKE_KEYSIGNATURE to compose this parameter.
Return Values
If successful, a handle of type HEARCON is returned for future operations.
If it fails, NULL is returned.
Remarks
These parameters are somewhat confusing. In effect, dwTempo and nTicksPerQN, together with the dd field of
timeSignature, determine the actual music tempo. See "MIDI File Format" for all these parameters.
If you are playing data from a MIDI file, please get the above parameters from the file. If you are composing
MIDI from a string, you can choose a fixed set (e.g. 666666L, 1024, 4/4).
Generally, the music will slow down if you increase dwTempo, decrease nTicksPerQN, or decrease the note value per
beat in the time signature (quarter-note/beat < eighth-note/beat).
In addition, a sound card that supports DirectSound should be installed, but that is not a big problem.
See Also
PlaySegment, PlayNotation, ParseNotation
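To make the tempo parameters a little less confusing, the sketch below composes the fixed set suggested above (666666L microseconds per quarter-note, 1024 ticks per quarter-note, 4/4 time) and derives the resulting quarter-notes per minute. MAKE_TIMESIGNATURE is reproduced from section 3.3.3; the actual CreateEarcon call is only indicated in a comment since it needs sdi32.lib.

```c
typedef unsigned long DWORD;

/* MAKE_TIMESIGNATURE as defined in this reference (see 3.3.3). */
#define MAKE_TIMESIGNATURE(nn, dd, cc, bb) \
    ((DWORD)((nn)<<24 | (dd)<<16 | (cc)<<8 | (bb)))

/* dwTempo is microseconds per MIDI quarter-note, so the number of
   quarter-notes per minute is 60,000,000 / dwTempo. */
static double quarter_notes_per_minute(DWORD dwTempo)
{
    return 60000000.0 / (double)dwTempo;
}

/* Real usage (not compiled here):
 *   DWORD ts = MAKE_TIMESIGNATURE(4, 2, 24, 8);   // 4/4 time
 *   HEARCON h = CreateEarcon(666666L, 1024, ts, MAKE_KEYSIGNATURE(0, 0));
 * 666666L microseconds per quarter-note is about 90 quarter-notes/minute.
 */
```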
**************************************************************
SetChannelInstrument
Description
Set instrument for one of 16 channels.
Prototype
BOOL SetChannelInstrument(HEARCON hEarcon, WORD nChannel, int nPatch, NOTERANGE* noteRange);
Parameters
hEarcon Earcon object handle of type HEARCON, returned by CreateEarcon
nChannel Channel number, 0~15
nPatch Instrument number (patch) defined in GM (General MIDI), ranging from 0~127. Please refer to "General MIDI"
for a complete list of musical instruments that you can use.
Attention: the patch numbers in GM are between 1 and 128, so you have to subtract 1 from them.
[I've thought about removing this overhead for you, but when I add "nPatch--;" to SetChannelInstrument,
the Earcon suddenly goes deaf! If you can solve the problem, please let me know. Thanks.]
noteRange A pointer to a NOTERANGE structure that tells the synthesizer which notes you are interested in.
If it is NULL, the default 24~119 is used, i.e. music notes C1~B8.
Return Values
TRUE if successful. Note, however, that failure does not necessarily return FALSE; most errors trigger an
assertion instead, which is intended for debugging.
Remarks
You should set the instrument for each channel you are using if you don't assign it in the MIDI data or the music
notation. Otherwise, no sound or an incorrect sound will be heard.
See Also
PlayNotation
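The off-by-one between General MIDI program numbers (1~128) and the nPatch parameter (0~127) is easy to get wrong, so a small helper can do the conversion at the call site. The helper below is just an illustration; the SetChannelInstrument call it mentions is quoted from this reference and is shown in a comment because it needs sdi32.lib.

```c
/* Convert a 1-based General MIDI program number (1-128) to the 0-based
   nPatch value (0-127) expected by SetChannelInstrument.
   Returns -1 for out-of-range input. */
static int gm_to_patch(int gmProgram)
{
    if (gmProgram < 1 || gmProgram > 128)
        return -1;
    return gmProgram - 1;
}

/* Real usage (not compiled here): GM program 1 is Acoustic Grand Piano.
 *   SetChannelInstrument(hEarcon, 0, gm_to_patch(1), NULL);
 */
```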
***************************************************************
PlayNotation
Description
Translate notation string to MIDI output.
Prototype
BOOL PlayNotation(HEARCON hEarcon, PTSTR psMusic);
Parameters
hEarcon Earcon object handle of type HEARCON, returned by CreateEarcon.
psMusic Music notation string. See Remarks.
Return Values
TRUE if successful. Note, however, that failure does not necessarily return FALSE; most errors trigger an
assertion instead, which is intended for debugging.
Remarks
This is the most important operation for Earcon. The notation string format is described here:
<notation> = <track>[;<track> ; ...]
<track> = <unit> ...
<unit> = <command>|<note>
<command> = @ + command_character + parameter(2 chars)
<note> = [b|#] + note_name + duration + [.|..]
First, note that the music notation is case-sensitive. Each notation consists of multiple tracks, which are
separated by ";". A track is made of many music units, each either a command or a note. You may add one or more
spaces after a music unit for readability. A command has a leading "@", a command character and a 2-character
parameter. At present only the 'I' (Instrument) command character can be used. It sets the instrument, and the
parameter that follows is the patch number (in GM). A note is relatively complex. An optional "#" or "b" can be
used (implemented with the pitch wheel). The note name ranges from C1 to B8, with Cx, Dx, Ex, Fx, Gx, Ax, Bx in
each octave; C4 represents Middle C, MIDI note 60. Besides, for run-time (programmatic) composition, you can
specify an "N" followed by a byte (NOT ASCII!) to make up a note. The number should be in the range 24~119, and
it increases by semitones, so every 12 semitones make up an octave. After the note name there is a number
specifying the duration as a power of two, e.g. 2 for a quarter-note, 3 for an eighth-note, etc. Finally, one or
more optional dots.
The parser is not perfect, though, so please make sure your notation string has no syntax errors.
See Also
ParseNotation, PlaySegment
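The grammar above can be exercised with a tiny scanner. The sketch below counts commands and notes in one notation track, using only the separation rules just described (units split by spaces, a leading "@" marks a command); it accepts the example string from section 1. This is a rough illustration of the format, far less strict than SDI's own parser.

```c
/* Count the music units in one notation track: units are separated by
   spaces, a leading '@' marks a command, anything else is a note.
   A rough sketch of the grammar above, not SDI's actual parser. */
static void count_units(const char* track, int* commands, int* notes)
{
    *commands = 0;
    *notes = 0;
    while (*track) {
        while (*track == ' ')                 /* skip separators       */
            track++;
        if (!*track)
            break;
        if (*track == '@')                    /* e.g. "@I00"           */
            (*commands)++;
        else                                  /* e.g. "C42", "#D42."   */
            (*notes)++;
        while (*track && *track != ' ')       /* skip the rest of unit */
            track++;
    }
}
```

For the five-note example "@I00 C42 D42 E42 D42 C42", this reports one command and five notes.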
*****************************************************************
PlaySegment
Description
Play MIDI data.
Prototype
BOOL PlaySegment(HEARCON hEarcon, BYTE** ppTracks, BYTE nTracks);
Parameters
hEarcon Earcon object handle of type HEARCON, returned by CreateEarcon.
ppTracks List of track pointers.
nTracks Number of tracks in ppTracks.
Return Values
TRUE if successful. Note, however, that failure does not necessarily return FALSE; most errors trigger an
assertion instead, which is intended for debugging.
Remarks
For each track, the leading 4-byte data-size field and the trailing "FF 2F 00" end-of-track signature must be
present. PlayNotation invokes PlaySegment internally, so you can be confident about organizing the track data
yourself and handing the track list to PlaySegment. But I don't think playing MIDI files with PlaySegment is a
good idea, at least not before the 3D positioning is refined.
See Also
PlayNotation
******************************************************************
ParseNotation
Description
Translate notation string to MIDI file data.
Prototype
int ParseNotation(HEARCON hEarcon, PTSTR psMusic, BYTE* pbData);
Parameters
hEarcon Earcon object handle of type HEARCON, returned by CreateEarcon.
psMusic Music notation string. See PlayNotation.
pbData Pointer to a buffer to receive the MIDI data.
If it is NULL, only the required number of bytes is returned.
Return Values
The required data size (in bytes).
Remarks
The sole purpose of this function is entertainment. You can write the data received directly to a .mid file and
play it using Winamp or another media player. It is especially interesting to watch a string of several bytes
suddenly become a MIDI file of hundreds of bytes, and then to think about the other way round.
See Also
PlayNotation
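The NULL behaviour of pbData implies the usual two-call pattern: query the size first, allocate, then fill. The sketch below shows that pattern. ParseNotation is stubbed with an arbitrary fixed 64-byte answer so the sketch compiles without sdi32.lib; with the real library you would drop the stub and then dump the buffer to a .mid file as shown in the trailing comment.

```c
#include <stdlib.h>

typedef void* HEARCON;

/* Stub standing in for the real ParseNotation so this sketch is
   self-contained: it reports/fills a fixed 64-byte answer. The real
   function converts psMusic into MIDI file data. */
static int ParseNotation(HEARCON h, char* psMusic, unsigned char* pbData)
{
    (void)h; (void)psMusic;
    if (pbData)
        for (int i = 0; i < 64; i++)
            pbData[i] = 0;
    return 64;                               /* arbitrary stub size */
}

/* The two-call pattern: ask for the size with pbData == NULL, allocate,
   then call again to fill the buffer. Returns the data size, or -1. */
static int notation_to_midi(HEARCON h, char* music, unsigned char** out)
{
    int size = ParseNotation(h, music, NULL);   /* 1st call: size only */
    if (size <= 0)
        return -1;
    *out = (unsigned char*)malloc((size_t)size);
    if (!*out)
        return -1;
    ParseNotation(h, music, *out);              /* 2nd call: fill data */
    return size;
}

/* With the real library you could then write a playable file:
 *   FILE* f = fopen("tune.mid", "wb");
 *   fwrite(data, 1, (size_t)size, f);
 *   fclose(f);
 */
```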
*******************************************************************
3.3.2 Earcon Structures
NOTERANGE
Definition
struct NOTERANGE{
DWORD dwLowNote;
DWORD dwHighNote;
};
3.3.3 Earcon Macros
MAKE_TIMESIGNATURE
Description
Make up a DWORD representing the time signature.
Prototype
#define MAKE_TIMESIGNATURE(nn, dd, cc, bb) ((DWORD)((nn)<<24 | (dd)<<16 | (cc)<<8 | (bb)))
Parameters
nn, dd Represent a time signature of nn/pow(2,dd) [e.g. nn=1, dd=2 represent 1/4]
cc, bb Refer to "MIDI File Format". Usually cc=24, bb=8
Remarks
Most of the time, cc and bb should keep the default values. Actually, I don't know what they are used for.
This macro is used with CreateEarcon.
See Also
CreateEarcon
MAKE_KEYSIGNATURE
Description
Make up a WORD representing the key signature.
Prototype
#define MAKE_KEYSIGNATURE(sf, mi) ((WORD)((sf)<<8 | (mi)))
Parameters
sf -7 for 7 flats, -1 for 1 flat, etc, 0 for key of c, 1 for 1 sharp, etc. [MIDI File Format]
mi 0: Major 1: Minor
Remarks
At present, the key signature is not used.
This macro is used with CreateEarcon.
See Also
CreateEarcon
************************************************************************
Copyright 2004 Dong Lin
Zhejiang University, China
Last Updated : 2004.4.15
E-mail : jonathan1983@126.com