Adjusting Microsoft Translator WAVE Volume

Joel Ivory Johnson

Jan 24, 2012

CPOL

How to adjust Microsoft Translator WAVE volume

The code in this article was inspired by some questions on Windows Phone 7, but it's generic enough to be used on other .NET based platforms. In the Windows Phone AppHub forums, there was a question about altering the volume of the WAVE file that the Microsoft Translator service returns. In the StackOverflow forums, there was a question about mixing two WAVE files together. I started off working on a solution for the volume question, and when I stepped back to examine it, I realized I wasn't far away from a solution for the other question. So I have both solutions implemented in the same code. In this first post, I'm showing what I needed to do to alter the volume of the WAVE stream that comes from the Microsoft Translator service.

I've kept the code generic enough so that if you want to apply other algorithms to the code, you can do so. I've got some ideas on how the memory buffer for the sound data can be better handled that would allow large recordings to be manipulated without keeping the entire recording in memory and allowing the length of the recording to be more easily altered. But the code as presented demonstrates three things:

Loading a WAVE file from a stream
Altering the WAVE file contents in memory
Saving WAVE files back to a stream

The code for saving a WAVE file is a modified version of the code that I demonstrated some time ago for writing a proper WAVE file for the content that comes from the Microphone buffer.

Prerequisites

I'm making the assumption that you know what a WAVE file and a sample are. I am also assuming that you know how to use the Microsoft Translator web service.

Loading a Wave File

The format for WAVE files is pretty well documented. There's more than one encoding that can be used in WAVE files, but I'm concentrating on PCM encoded WAVE files and will for now ignore all of the other possible encodings. The document that I used can be found here. I found a few deviations from that document when dealing with real WAVE files, and I'll comment on those in a moment. In general, most of what you'll find in the header are 8, 16, and 32-bit integers and strings. I read the entire header into a byte array and extract the information from that byte array into an appropriate type. To extract a string from the byte array, you need to know the starting index for the string and the number of characters it contains. You can then use Encoding.UTF8.GetString to extract the string. If you understand how numbers are encoded (little endian, meaning the least significant byte comes first), decoding them is fairly easy. If you want to get a better understanding, see the Wikipedia article on the encoding.

Integer Size	Extraction Code
8-bit	`data[i]`
16-bit	`(data[i])|(data[i+1]<<0x08)`
32-bit	`(data[i])|(data[i+1]<<0x08)|(data[i+2]<<0x10)|(data[i+3]<<0x18)`
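
The expressions in the table can be wrapped in small helpers. The following is a minimal sketch; the class name `LittleEndian` and its method names are made up for illustration and are not part of the article's download. Note that a 16-bit sample is signed, so it needs a cast to `short` to recover negative values, while the 16-bit header fields can be read as plain non-negative integers.

```csharp
using System;

// Hypothetical helper class for illustration only.
public static class LittleEndian
{
    // Reads an unsigned 16-bit value stored low byte first.
    public static int ReadUInt16(byte[] data, int i)
    {
        return data[i] | (data[i + 1] << 8);
    }

    // Reads a signed 16-bit value; the cast to short restores the sign.
    public static short ReadInt16(byte[] data, int i)
    {
        return unchecked((short)(data[i] | (data[i + 1] << 8)));
    }

    // Reads a signed 32-bit value stored low byte first.
    public static int ReadInt32(byte[] data, int i)
    {
        return data[i]
             | (data[i + 1] << 8)
             | (data[i + 2] << 16)
             | (data[i + 3] << 24);
    }
}
```

For example, the four bytes 0x44, 0xAC, 0x00, 0x00 decode to 44100 with `ReadInt32`, a common sample rate.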

Offset	Title	Size	Type	Description
0	ChunkID	4	`string(4)`	literal string "`RIFF`"
4	ChunkSize	4	`int32`	Size of the entire file minus eight bytes
8	Format	4	`string(4)`	literal string "WAVE"
12	SubChunkID	4	`string(4)`	literal string "fmt "
16	SubChunk1Size	4	`int32`	size of the rest of the subchunk
20	AudioFormat	2	`int16`	Should be 1 for PCM encoding.
22	Channel Count	2	`int16`	1 for mono, 2 for stereo,...
24	SampleRate	4	`int32`
28	ByteRate	4	`int32`	(SampleRate)(Channel Count)(Bits Per Sample)/8
32	Block Align	2	`int16`	(Channel Count)*(Bits Per Sample)/8
34	BitsPerSample	2	`int16`
	ExtraParamSize	2	`int16`	possibly not there
	ExtraParams	?	?	possibly not there
36+x	SubChunk2ID	4	`string(4)`	literal string "`data`"
40+x	SubChunk2Size	4	`int32`
44+x	data	`SubChunk2Size`	`byte[SubChunk2Size]`

The header will always be at least 44 bytes long, so I start off reading the first 44 bytes of the stream. The SubChunk1Size will normally contain the value 16. If it's greater than 16, then the header is longer than 44 bytes and I read the rest. I've allowed for a header size of up to 60 bytes (which is much larger than I have encountered). A header size of larger than 44 bytes will generally mean that there is an extra parameter at the end of SubChunk1. For what I'm doing, the contents of the extra parameters don't matter. But I still need to account for the space that they consume to properly read the header.
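
To make the "44+x" arithmetic concrete, here is a small sketch of how the offsets of the second subchunk follow from SubChunk1Size. The class `WaveLayout` and its members are hypothetical names used only for this illustration. The constant 20 is the 12-byte RIFF header plus the 8 bytes of SubChunkID and SubChunk1Size.

```csharp
// Hypothetical helper for illustration only.
public static class WaveLayout
{
    // Offset of the "data" tag (SubChunk2ID).
    public static int DataChunkIdOffset(int subChunk1Size)
    {
        return 20 + subChunk1Size;
    }

    // Offset of SubChunk2Size, the byte count of the sample data.
    public static int DataSizeOffset(int subChunk1Size)
    {
        return 24 + subChunk1Size;
    }

    // Offset of the first sample byte, which is also the total header length.
    public static int SampleDataOffset(int subChunk1Size)
    {
        return 28 + subChunk1Size;
    }
}
```

With the usual SubChunk1Size of 16 these come out to 36, 40, and 44. With a two-byte ExtraParamSize field present (SubChunk1Size of 18) they shift to 38, 42, and 46.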

To my surprise, the contents of the fields in the header are not always populated. Some audio editors leave some of the fields zeroed out. My first attempt to read a WAVE file was with a file that came from the open source audio editor Audacity. Among other fields, the BitsPerSample field was zeroed. I'm not sure if this is allowed by the format or not. It certainly is not in any of the spec sheets that I've found. But when I encounter this, I assume a value of 16.

Regardless of whether a WAVE file contains 8-bit, 16-bit, or 32-bit samples when read in, I store the values in an array of doubles. I chose to do this because double works out better for some of the math operations I have in mind.

public void ReadWaveData(Stream sourceStream, bool normalizeAmplitude = false)
{
    //In general I should only need 44 bytes. 
    //I'm allocating extra memory because of a variance I've seen in some WAV files. 
    byte[] header = new byte[60];
    int bytesRead = sourceStream.Read(header, 0, 44);
    if(bytesRead!=44)
        throw new InvalidDataException(String.Format
    ("This can't be a wave file. It is only {0} bytes long!",bytesRead));

    int audioFormat = (header[20]) | (header[21] << 8);
    if (audioFormat != 1)
        throw new Exception("Only PCM Waves are supported (AudioFormat=1)");

    #region mostly useless code
    string chunkID = Encoding.UTF8.GetString(header, 0, 4);
    if (!chunkID.Equals("RIFF"))
    {
        throw new InvalidDataException(String.Format
    ("Expected a ChunkID of 'RIFF'. Received a chunk ID of {0} instead.", chunkID));
    }
    int chunkSize = (header[4]) | (header[5] << 8) | 
        (header[6] << 16) | (header[7] << 24);
    string format = Encoding.UTF8.GetString(header, 8, 4);
    if (!format.Equals("WAVE"))
    {
        throw new InvalidDataException(String.Format
    ("Expected a format of 'WAVE'. Received a chunk ID of {0} instead.", format));
    }
    string subChunkID = Encoding.UTF8.GetString(header, 12, 4);
    if (!subChunkID.Equals("fmt "))
    {
        throw new InvalidDataException(String.Format
    ("Expected a subchunkID of 'fmt '. Received a chunk ID of {0} instead.", subChunkID));
    }
    int subChunkSize = (header[16]) | (header[17] << 8) | 
        (header[18] << 16) | (header[19] << 24);
    #endregion

    if (subChunkSize > 16)
    {
        var bytesNeeded = subChunkSize - 16;
        if(bytesNeeded+44 > header.Length)
            throw new InvalidDataException("The WAV header is larger than expected. ");
        sourceStream.Read(header, 44, subChunkSize - 16);
    }

    ChannelCount = (header[22]) | (header[23] << 8);
    SampleRate = (header[24]) | (header[25] << 8) | 
    (header[26] << 16) | (header[27] << 24);
    #region Useless Code
    int byteRate = (header[28]) | (header[29] << 8) | 
    (header[30] << 16) | (header[31] << 24);
    int blockAlign = (header[32]) | (header[33] << 8);
    #endregion
    BitsPerSample = (header[34]) | (header[35] << 8);

    #region Useless Code
    string subchunk2ID = Encoding.UTF8.GetString(header, 20 + subChunkSize, 4);
    #endregion

    var offset = 24 + subChunkSize;
    int dataLength = (header[offset+0]) | (header[offset+1] << 8) | 
    (header[offset+2] << 16) | (header[offset+3] << 24);

    //I can't find any documentation stating that I should make the following inference, 
    //but I've seen wave files that have 
    //0 in the bits per sample field. These wave files were 16-bit, so 
    //if bits per sample isn't specified I will assume 16 bits. 
    if (BitsPerSample == 0)
    {
        BitsPerSample = 16;
    }

    byte[] dataBuffer = new byte[dataLength];

    bytesRead = sourceStream.Read(dataBuffer, 0, dataBuffer.Length);

    Debug.Assert(bytesRead == dataLength);

    if (BitsPerSample == 8)
    {
        byte[] unadjustedSoundData = new byte[dataBuffer.Length / (BitsPerSample / 8)];
        Buffer.BlockCopy(dataBuffer, 0, unadjustedSoundData, 0, dataBuffer.Length);

        SoundData = new double[unadjustedSoundData.Length];
        for (var i = 0; i < (unadjustedSoundData.Length); ++i)
        {
            SoundData[i] = 128d*(double)unadjustedSoundData[i];
        }

    }
    else if (BitsPerSample == 16)
    {
        short[] unadjustedSoundData = new short[dataBuffer.Length / (BitsPerSample / 8)];
        Buffer.BlockCopy(dataBuffer, 0, unadjustedSoundData, 0, dataBuffer.Length);


        SoundData = new double[unadjustedSoundData.Length];
        for (var i = 0; i < (unadjustedSoundData.Length); ++i)
        {
            SoundData[i] = (double) unadjustedSoundData[i];
        }
    }
    else if(BitsPerSample==32)
    {
        int[] unadjustedSoundData = new int[dataBuffer.Length / (BitsPerSample / 8)];
        Buffer.BlockCopy(dataBuffer, 0, unadjustedSoundData, 0, dataBuffer.Length);

        SoundData = new double[unadjustedSoundData.Length];
        for (var i = 0; i < (unadjustedSoundData.Length); ++i)
        {
            SoundData[i] = (double)unadjustedSoundData[i];
        }
    }

    Channels = new PcmChannel[ChannelCount];
    for (int i = 0; i < ChannelCount;++i )
    {
        Channels[i]=new PcmChannel(this,i);
    }
    if (normalizeAmplitude)
        NormalizeAmplitude();
}

Mono vs Stereo

In a mono (single channel) file, the samples are ordered one after another, no mystery there. For stereo files, the data stream will contain the first sample for channel 0, then the first sample for channel 1, then the second sample for channel 0, second sample for channel 1, and so on. Samples alternate between the left channel and the right channel. The sample data is stored in memory in the same interleaved order, in an array called SoundData. To work exclusively with one channel or the other, there is also a property named Channels (an array of PcmChannel) that can be used to access a single channel.

public class PcmChannel
{
    internal PcmChannel(PcmData parent, int channel)
    {
        Channel = channel;
        Parent = parent;
    }
    protected PcmData Parent { get; set;  }
    public int Channel { get; protected set; }
    public int Length
    {
        get { return (int)(Parent.SoundData.Length/Parent.ChannelCount);  }
    }
    public double this[int index]
    {
        get { return Parent.SoundData[index*Parent.ChannelCount + Channel]; }
        set { Parent.SoundData[index*Parent.ChannelCount + Channel] = value; }
    }
}

//The following is a simplified interface definition for how the PcmChannel
//data type is relevant to our PCM data. The actual PcmData class has
//more members than what follows.
public class PcmData
{
   public double[] SoundData { get; set; }
   public int ChannelCount { get; set; }
   public PcmChannel[] Channels { get; set; }
}
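
The index arithmetic in the PcmChannel indexer can also be applied directly to an interleaved array. The sketch below is a self-contained illustration, not part of the PcmData class; the class `Interleaving` and its methods are hypothetical names.

```csharp
// Hypothetical helper for illustration only.
public static class Interleaving
{
    // Position of sample number `index` of `channel` in an interleaved buffer.
    // This is the same formula the PcmChannel indexer uses.
    public static int IndexOf(int index, int channel, int channelCount)
    {
        return index * channelCount + channel;
    }

    // Multiplies every sample of one channel by `gain`, leaving the others untouched.
    public static void ScaleChannel(double[] soundData, int channelCount, int channel, double gain)
    {
        for (int i = channel; i < soundData.Length; i += channelCount)
        {
            soundData[i] *= gain;
        }
    }
}
```

Given the stereo buffer { 1, 10, 2, 20, 3, 30 }, calling `ScaleChannel(buffer, 2, 1, 0.5)` halves only channel 1 and leaves { 1, 5, 2, 10, 3, 15 }.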

Where's 24-bit Support?

Yes, there do exist 24-bit WAVE files. I'm not supporting them (yet) because there's more code required to handle them and most of the scenarios I have in mind are going to use 8 and 16-bit files. Adding support for 32-bit files was only 5 more lines of code. I'll be handling 24-bit files in forthcoming code.
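
The extra work comes from .NET having no 24-bit integer type, so Buffer.BlockCopy can't be used to reinterpret the bytes. One possible approach, shown here only as a sketch and not as the forthcoming implementation, is to assemble each sample by hand. It assumes little-endian, signed 24-bit PCM; the class name `Pcm24` is hypothetical.

```csharp
// Hypothetical helper for illustration only.
public static class Pcm24
{
    // Converts packed 3-byte little-endian signed samples to doubles.
    // Any trailing bytes that do not form a full sample are ignored.
    public static double[] Decode(byte[] dataBuffer)
    {
        int count = dataBuffer.Length / 3;
        double[] result = new double[count];
        for (int i = 0; i < count; ++i)
        {
            int b = i * 3;
            // Put the three bytes in the top 24 bits of an int...
            int value = (dataBuffer[b] << 8)
                      | (dataBuffer[b + 1] << 16)
                      | (dataBuffer[b + 2] << 24);
            // ...then shift right; for int this is an arithmetic shift,
            // so the sign bit is extended.
            result[i] = value >> 8;
        }
        return result;
    }
}
```

The resulting values range from -8,388,608 to 8,388,607, so they would need rescaling before being written to a 16-bit file.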

Altering the Sound Data

Changes made to the values in the SoundData[] array will alter the sound data. There are some constraints on how the data can be modified. Since I'm writing this to a 16-bit WAVE file, the maximum and minimum values that can be written out are 32,767 and -32,768. The double data type has a range significantly larger than this. The properties AdjustmentFactor and AdjustmentOffset are used to alter the sound data when it is being prepared to be written back to a file. They are used to apply a linear transformation to the sound data (remember y=mx+b?). Finding the right values for these is done for you through the NormalizeAmplitude method. Calling this method after you've altered your sound data will result in appropriate values being chosen. By default, this method will try to normalize the sound data to 99% of maximum amplitude. You can pass an argument between 0 and 1 to this method for some other amplitude.

public void NormalizeAmplitude( double percentMax = 0.99d)
{
    var max = SoundData.Max();
    var min = SoundData.Min();

    double rangeSize = max - min+1 ;
    AdjustmentFactor = ((percentMax * (double)short.MaxValue) - 
    percentMax * (double)short.MinValue) / (double)rangeSize;
    AdjustmentOffset = (percentMax * (double)short.MinValue) - (min * AdjustmentFactor);

    int maxExpected = (int)(max * AdjustmentFactor + AdjustmentOffset);
    int minExpected = (int)(min * AdjustmentFactor + AdjustmentOffset);
}
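
To see the y=mx+b mapping with concrete numbers, the following sketch repeats the same formulas outside of PcmData. The class `NormalizeSketch` and its method names are hypothetical; the arithmetic mirrors NormalizeAmplitude and the clipping done when the data is written.

```csharp
using System;
using System.Linq;

// Hypothetical stand-alone version of the normalization math, for illustration only.
public static class NormalizeSketch
{
    // Computes the slope (factor) and intercept (offset) of the linear transform.
    public static void Compute(double[] soundData, double percentMax,
                               out double factor, out double offset)
    {
        double max = soundData.Max();
        double min = soundData.Min();
        double rangeSize = max - min + 1;
        factor = (percentMax * short.MaxValue - percentMax * short.MinValue) / rangeSize;
        offset = percentMax * short.MinValue - min * factor;
    }

    // Applies y = m*x + b and clips the result to the 16-bit range.
    public static short Apply(double sample, double factor, double offset)
    {
        double y = sample * factor + offset;
        y = Math.Min(y, short.MaxValue);
        y = Math.Max(y, short.MinValue);
        return (short)y;
    }
}
```

For a quiet recording whose samples span -1000 to 1000, the factor works out to roughly 32.4, so the quietest-to-loudest range is stretched to nearly the full 16-bit range, with the minimum sample landing at about 99% of short.MinValue.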

Saving WAVE Data

To save the WAVE data, I'm using a variant of something I used to save the stream that comes from the Microphone. The original form of the code had a bug that makes a difference when working with a stream with multiple channels. The microphone produces a single channel stream and wasn't impacted by this bug (but it's fixed here). The code for writing the wave produces a header from the parameters it is given, then it writes out the WAVE data. The WAVE data must be converted from the double[] array to a byte[] array containing 16-bit integers in little endian format.

public class PcmData
{
    public void Write(Stream destinationStream)
    {
        byte[] writeData = new byte[SoundData.Length*2];
        short[] conversionData = new short[SoundData.Length];

        //convert the double[] data back to int16[] data
        for(int i=0;i<SoundData.Length;++i)
        {
            double sample = ((SoundData[i]*AdjustmentFactor)+AdjustmentOffset);
            //if the value goes outside of range then clip it
            sample = Math.Min(sample, (double) short.MaxValue);
            sample = Math.Max(sample, short.MinValue);
            conversionData[i] = (short) sample;
        }
        int max = conversionData.Max();
        int min = conversionData.Min();
        //put the int16[] data into a byte[] array
        Buffer.BlockCopy(conversionData, 0, writeData, 0, writeData.Length);

        WaveHeaderWriter.WriteHeader(destinationStream,writeData.Length,
        ChannelCount,SampleRate);
        destinationStream.Write(writeData,0,writeData.Length);
    }
}

public class WaveHeaderWriter
{
    static byte[] RIFF_HEADER = new byte[] { 0x52, 0x49, 0x46, 0x46 };
    static byte[] FORMAT_WAVE = new byte[] { 0x57, 0x41, 0x56, 0x45 };
    static byte[] FORMAT_TAG = new byte[] { 0x66, 0x6d, 0x74, 0x20 };
    static byte[] AUDIO_FORMAT = new byte[] { 0x01, 0x00 };
    static byte[] SUBCHUNK_ID = new byte[] { 0x64, 0x61, 0x74, 0x61 };
    private const int BYTES_PER_SAMPLE = 2;

    public static void WriteHeader(
            System.IO.Stream targetStream,
            int byteStreamSize,
            int channelCount,
            int sampleRate)
    {

        int byteRate = sampleRate * channelCount * BYTES_PER_SAMPLE;
        //Block align covers one sample for every channel
        int blockAlign = channelCount * BYTES_PER_SAMPLE;

        targetStream.Write(RIFF_HEADER, 0, RIFF_HEADER.Length);
        targetStream.Write(PackageInt(byteStreamSize + 36, 4), 0, 4);

        targetStream.Write(FORMAT_WAVE, 0, FORMAT_WAVE.Length);
        targetStream.Write(FORMAT_TAG, 0, FORMAT_TAG.Length);
        targetStream.Write(PackageInt(16, 4), 0, 4);//Subchunk1Size    

        targetStream.Write(AUDIO_FORMAT, 0, AUDIO_FORMAT.Length);//AudioFormat   
        targetStream.Write(PackageInt(channelCount, 2), 0, 2);
        targetStream.Write(PackageInt(sampleRate, 4), 0, 4);
        targetStream.Write(PackageInt(byteRate, 4), 0, 4);
        targetStream.Write(PackageInt(blockAlign, 2), 0, 2);
        targetStream.Write(PackageInt(BYTES_PER_SAMPLE * 8), 0, 2);
        //targetStream.Write(PackageInt(0,2), 0, 2);//Extra param size
        targetStream.Write(SUBCHUNK_ID, 0, SUBCHUNK_ID.Length);
        targetStream.Write(PackageInt(byteStreamSize, 4), 0, 4);
    }

    static byte[] PackageInt(int source, int length = 2)
    {
        if ((length != 2) && (length != 4))
            throw new ArgumentException("length must be either 2 or 4", "length");
        var retVal = new byte[length];
        retVal[0] = (byte)(source & 0xFF);
        retVal[1] = (byte)((source >> 8) & 0xFF);
        if (length == 4)
        {
            retVal[2] = (byte)((source >> 0x10) & 0xFF);
            retVal[3] = (byte)((source >> 0x18) & 0xFF);
        }
        return retVal;
    }
}

Using the Code

Once you've gotten the wave stream, only a few lines of code are needed to do the work. For the example program, I am downloading a spoken phrase from the Microsoft Translator service, amplifying it, and then writing both the original and amplified versions to a file.

static void Main(string[] args)
{
    PcmData pcm;

    //Download the WAVE stream
    MicrosoftTranslatorService.LanguageServiceClient client = new LanguageServiceClient();
    string waveUrl = client.Speak(APP_ID, "this is a volume test", "en", "audio/wav","");
    WebClient wc = new WebClient();
    var soundData = wc.DownloadData(waveUrl);
          
    //Load the WAVE stream and let its amplitude be adjusted to 99% maximum
    using (var ms = new MemoryStream(soundData))
    {
        pcm = new PcmData(ms, true);               
    }

    //Write the amplified stream to a file
    using (Stream s = new FileStream("amplified.wav", FileMode.Create, FileAccess.Write))
    {
        pcm.Write(s);
    }

    //write the original unaltered stream to a file
    using (Stream s = new FileStream("original.wav", FileMode.Create, FileAccess.Write))
    {
        s.Write(soundData,0,soundData.Length);
    }
}

The End Result

The code works as designed, but I found a few scenarios that can make it ineffective. One scenario is that not all phones have the same frequency response for their speakers. Frequencies that come through loud and clear on one phone may come through sounding quieter on another. The other scenario is that the source file may have a sample that goes to the maximum or minimum reading even though the majority of the other samples come nowhere near that amplitude. When this occurs, the spurious sample will limit the amount of amplification that is applied to the file. I opened an original and an amplified WAVE file in Audacity to see my results, and I was pleased to see that the amplified WAVE does actually look louder when I view its graph in Audacity.
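
One possible way around the spurious-sample problem, which the code in this article does not do, is to base the amplification on a percentile of the sample magnitudes instead of the absolute maximum and let the few outliers clip. The sketch below only shows how such a peak could be found; the class `RobustPeak` and the choice of fraction are assumptions for illustration.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration only; not part of PcmData.
public static class RobustPeak
{
    // Returns the magnitude that `fraction` (0..1) of the samples do not exceed.
    // A fraction of 1.0 gives the true peak.
    public static double PercentilePeak(double[] soundData, double fraction)
    {
        if (soundData.Length == 0)
            throw new ArgumentException("soundData must not be empty", "soundData");

        double[] magnitudes = soundData.Select(s => Math.Abs(s)).ToArray();
        Array.Sort(magnitudes);

        int index = (int)Math.Ceiling(fraction * magnitudes.Length) - 1;
        index = Math.Max(0, Math.Min(index, magnitudes.Length - 1));
        return magnitudes[index];
    }
}
```

A gain derived from this peak would push any louder samples out of range, and the clipping already performed in the Write method would cut them off. Whether that trade-off sounds acceptable depends on the recording.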

Part 2 - Overlaying Wave Files

The other problem that this code can solve is combining wave files together in various ways. I'll be putting that up in the next post. Between now and then, I've got a presentation at the Windows Phone Developers Atlanta meeting this week (if you are in the Atlanta area, come on out!) and will get back to this code after the presentation.
