Data Scraping from Speech to Text

Eric M. H. Goh

Mar 28, 2018

Apache

Speech to Text Recognition for Data Scraping and Collection in Data Mining

Introduction

Data Science is a growing field. According to the CRISP-DM model and other data mining process models, data must be collected before knowledge can be mined from it and predictive analysis conducted. Data collection can involve data scraping, which includes web scraping (HTML to text), image-to-text conversion and video-to-text conversion. Once the data is in text format, we usually apply text mining techniques to extract knowledge from it.

In this article, I am going to introduce you to speech-to-text recognition. I developed Just Another Voice Transformer (JAVT) to convert videos into text files and consolidate them into a set of text data for text mining and natural language processing.

JAVT converts a video into an audio file using ffmpeg, and then converts the audio into a text file using either Microsoft SAPI or CMU Sphinx. I have included the source code for both the video-to-audio and the audio-to-text conversion. In this article, I explain only speech recognition and speech synthesis using Microsoft SAPI, and how to interface with ffmpeg.

Speech Recognition in C# using Microsoft SAPI

To use speech recognition in C#, add a reference to the System.Speech assembly and put the following using directives at the top of the file:

using System.Speech.Recognition;
using System.Speech.AudioFormat;

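Then declare the dictation grammar and the speech recognition engine as fields of the form class, and create them (for example, in the constructor or a button handler):

// Fields of the form class
private DictationGrammar dictation;
private SpeechRecognitionEngine sr;

// Inside the constructor or an event handler
dictation = new DictationGrammar();
sr = new SpeechRecognitionEngine();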

We will then need to load the dictation grammar into speech recognition engine:

sr.LoadGrammar(dictation);

If you are using a .wav file as input (here, textBox3 holds the file path), set the speech recognition engine's input to:

sr.SetInputToWaveFile(textBox3.Text);

If you are using an audio device such as a microphone as input, set the speech recognition engine's input to:

sr.SetInputToDefaultAudioDevice();

To perform asynchronous speech recognition:

sr.RecognizeAsync(RecognizeMode.Multiple);

Attach the event handlers. They should be in place before RecognizeAsync() is called; each one is removed first so that it is never subscribed twice:

// Remove any earlier subscription so each handler is attached only once.
sr.SpeechRecognized -= new EventHandler<SpeechRecognizedEventArgs>(SpeechRecognized);
sr.EmulateRecognizeCompleted -=
    new EventHandler<EmulateRecognizeCompletedEventArgs>(EmulateRecognizeCompletedHandler);

sr.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(SpeechRecognized);
sr.EmulateRecognizeCompleted +=
    new EventHandler<EmulateRecognizeCompletedEventArgs>(EmulateRecognizeCompletedHandler);

Each time the engine recognizes a phrase, it raises the SpeechRecognized event and the SpeechRecognized() method is called. The recognized text is available in e.Result.Text. The following is the SpeechRecognized() method used in JAVT:

string finalResult;

private void SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
{
    try
    {
        // Append the newly recognized phrase to the transcript box.
        finalResult = e.Result.Text;
        richTextBox3.Text += " " + finalResult;
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}

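When an emulated recognition operation completes, the EmulateRecognizeCompletedHandler() method is called. Note that the engine raises EmulateRecognizeCompleted only for operations started with EmulateRecognizeAsync(); for audio processed with RecognizeAsync(), the completion event is RecognizeCompleted, whose handler takes a RecognizeCompletedEventArgs argument. The following is the EmulateRecognizeCompletedHandler() method in the program: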

bool isCompleted = false;

private void EmulateRecognizeCompletedHandler(object sender, EmulateRecognizeCompletedEventArgs e)
{
    try
    {
        isCompleted = true;

        sr.UnloadGrammar(dictation);
        sr.RecognizeAsyncStop();

        richTextBox3.Text += "\n\nCompleted. \n";
        MessageBox.Show("Completed. ");
    }
    catch (Exception ex)
    {
        MessageBox.Show(ex.Message);
    }
}
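
The pieces in this section can be combined into one small helper. The following is a minimal sketch, not JAVT's actual code. The WavTranscriber class and its Transcribe method are hypothetical names. It assumes a reference to the System.Speech assembly and a console or background thread, because blocking the UI thread of a Windows Forms application in this way could prevent the events from being delivered. It listens for RecognizeCompleted, the completion event that the engine raises for RecognizeAsync().

```csharp
using System;
using System.Collections.Concurrent;
using System.Speech.Recognition;
using System.Threading;

// Hypothetical helper (not part of JAVT): transcribes one .wav file.
static class WavTranscriber
{
    public static string Transcribe(string wavPath)
    {
        if (string.IsNullOrWhiteSpace(wavPath))
            throw new ArgumentException("A .wav file path is required.", "wavPath");

        var phrases = new ConcurrentQueue<string>();
        Exception error = null;

        using (var done = new ManualResetEventSlim(false))
        using (var engine = new SpeechRecognitionEngine())
        {
            engine.LoadGrammar(new DictationGrammar());
            engine.SetInputToWaveFile(wavPath);

            // Collect every recognized phrase.
            engine.SpeechRecognized += (s, e) => phrases.Enqueue(e.Result.Text);

            // Raised when the asynchronous operation ends, e.g. at end of file.
            engine.RecognizeCompleted += (s, e) =>
            {
                error = e.Error;
                done.Set();
            };

            // Multiple: keep recognizing phrases until the input ends.
            engine.RecognizeAsync(RecognizeMode.Multiple);
            done.Wait();
        }

        if (error != null)
            throw new InvalidOperationException("Recognition failed.", error);

        return string.Join(" ", phrases);
    }
}
```

A call such as string text = WavTranscriber.Transcribe(path); would then return the whole transcript once the end of the file has been reached.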

Text to Speech

Having covered speech recognition, we now turn to the reverse direction: text to speech, also known as speech synthesis.

First, we need to import the System.Speech.Synthesis namespace and create a speech synthesizer:

using System.Speech.Synthesis;

SpeechSynthesizer speaker;
speaker = new SpeechSynthesizer();

Then we set the Rate (an integer from -10 to 10) and the Volume (an integer from 0 to 100), here read from two text boxes:

speaker.Rate = int.Parse(rateTextBox.Text);
speaker.Volume = int.Parse(volTextBox.Text);
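
Because int.Parse throws on non-numeric input, and the synthesizer rejects values outside the ranges it accepts, it can help to validate the text first. The following is a minimal sketch of one way to do that. SpeechSettings and ParseClamped are hypothetical names and are not part of JAVT or System.Speech.

```csharp
using System;

// Hypothetical helper (not part of JAVT): safe parsing of numeric settings.
static class SpeechSettings
{
    // Parses text as an integer and clamps it into [min, max].
    // Returns fallback when the text is not a valid integer.
    public static int ParseClamped(string text, int min, int max, int fallback)
    {
        int value;
        if (!int.TryParse(text, out value))
            return fallback;

        return Math.Max(min, Math.Min(max, value));
    }
}
```

With this helper, the two assignments could be written as speaker.Rate = SpeechSettings.ParseClamped(rateTextBox.Text, -10, 10, 0); and speaker.Volume = SpeechSettings.ParseClamped(volTextBox.Text, 0, 100, 100);, where the fallback values 0 and 100 are arbitrary choices for this example.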

To request a female voice (one must be installed on the system):

speaker.SelectVoiceByHints(VoiceGender.Female);

Then run the Speech Synthesizer:

speaker.SpeakAsync(richTextBox2.Text);

Video to Audio Conversion

I use ffmpeg to convert video into audio. To launch ffmpeg from C#, first import the System.Diagnostics namespace (for Process) and System.IO (for Directory, used below):

using System.Diagnostics;
using System.IO;

Then create a new process:

Process process = new Process();

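Build the ffmpeg command-line arguments, where f is the path of the input video. Here -i names the input file, -ab sets the audio bitrate, -ac 2 selects two channels, -ar 44100 sets the sample rate in Hz, and -vn drops the video stream. Both paths are quoted so that file names containing spaces are passed as single arguments:

string arg = "-i \"" + f + "\" -ab 160k -ac 2 -ar 44100 -vn \"" + f + ".wav\"";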
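
Since the argument string is assembled by hand, a small helper can keep the quoting in one place. The following is a minimal sketch under the same assumptions as the line above (the output file is the input path plus ".wav"). FfmpegArgs, Build and Quote are hypothetical names and are not part of JAVT.

```csharp
using System;

// Hypothetical helper (not part of JAVT): builds the ffmpeg argument string.
static class FfmpegArgs
{
    // Wraps a path in double quotes so that spaces do not split it.
    private static string Quote(string path)
    {
        if (path.IndexOf('"') >= 0)
            throw new ArgumentException("Path must not contain a double quote.", "path");

        return "\"" + path + "\"";
    }

    // Returns the arguments that extract the audio track of inputPath
    // into inputPath + ".wav" (160 kbit/s, stereo, 44.1 kHz, no video).
    public static string Build(string inputPath)
    {
        if (string.IsNullOrWhiteSpace(inputPath))
            throw new ArgumentException("Input path is required.", "inputPath");

        return "-i " + Quote(inputPath)
             + " -ab 160k -ac 2 -ar 44100 -vn "
             + Quote(inputPath + ".wav");
    }
}
```

For example, FfmpegArgs.Build("my video.mp4") keeps the file name together as one argument even though it contains a space.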

Set the process settings:

process.StartInfo.FileName = Directory.GetCurrentDirectory() + "\\ffmpeg\\bin\\ffmpeg.exe";
process.StartInfo.Arguments = arg;
process.StartInfo.ErrorDialog = true;
process.StartInfo.WindowStyle = ProcessWindowStyle.Normal;

Start the process and block until ffmpeg exits:

process.Start();
process.WaitForExit();
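
To find out whether the conversion succeeded, the exit code and the tool's error output can be inspected after the process ends. The following is a minimal sketch of that idea, not JAVT's actual code. ToolRunner, Run and Result are hypothetical names, and the sketch assumes that hiding the console window and capturing stderr is acceptable, which differs from the settings shown above.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper (not part of JAVT): runs a command-line tool
// and reports how it ended.
static class ToolRunner
{
    // Outcome of one run: the exit code plus whatever was written to stderr.
    public sealed class Result
    {
        public int ExitCode;
        public string ErrorOutput;
    }

    public static Result Run(string fileName, string arguments)
    {
        var info = new ProcessStartInfo
        {
            FileName = fileName,
            Arguments = arguments,
            UseShellExecute = false,       // required for stream redirection
            RedirectStandardError = true,  // ffmpeg writes its log to stderr
            CreateNoWindow = true
        };

        using (var process = Process.Start(info))
        {
            // Read before waiting so that a full pipe cannot block the child.
            string errors = process.StandardError.ReadToEnd();
            process.WaitForExit();

            return new Result { ExitCode = process.ExitCode, ErrorOutput = errors };
        }
    }
}
```

A caller could treat a non-zero ExitCode as a failed conversion and show ErrorOutput to the user before attempting speech recognition on the .wav file.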