
Speech Recognition for the Web

Robin Rizvi, 6 Sep 2011, GPL3
This article discusses the HTML5 Speech Input API

Introduction

This post discusses bringing speech recognition to the web. Speech recognition is, in short, a technology that converts spoken words to text. Voice or speech recognition has long been popular in desktop software. Well-known examples include the speech recognition system in Windows XP, Vista and 7 for giving voice commands and controlling the system, and the speech recognition feature in Microsoft Office, which lets users write text just by dictating it to the computer.

With the new draft specification of HTML5 Speech Input, this facility will become available on the web, so that speech recognition can be carried out in web pages with ease.

Note: For full details, refer to the HTML5 Speech Input API draft specification.

Applications

Refer to this Wikipedia page

Technology

The API itself is agnostic of the underlying speech recognition implementation and can support both server-based and embedded recognizers. With an embedded recognizer, the browser itself is capable of speech recognition, much like existing desktop speech recognition software. The browser records the voice from the microphone, runs the recognition on the audio locally, and produces the resulting text. This is fast and can also work offline.

With a server-based recognizer, the browser records the voice from the microphone and streams the audio to a server, which performs the recognition and sends the resulting text back to the browser.

The advantage of the server-based approach is that recognition is likely to be more accurate than with the local approach, because the large amount of training data collected at central servers helps improve accuracy. The API is designed to support both one-off and continuous speech input requests. Speech recognition results are provided to the web page as a list of hypotheses, along with other relevant information for each hypothesis.
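
As a rough illustration of how a page might consume such a list of hypotheses, the sketch below picks the most confident one. The `utterance` and `confidence` field names, the 0 to 1 confidence range, and the `bestHypothesis` helper are assumptions made for this example; the demo later in this article does not rely on them.

```javascript
// Pick the most confident hypothesis from a recognition result list.
// Assumed shape: [{ utterance: string, confidence: number in [0, 1] }, ...]
function bestHypothesis(hypotheses, minConfidence) {
  var best = null;
  for (var i = 0; i < hypotheses.length; i++) {
    var h = hypotheses[i];
    if (h.confidence < minConfidence) continue; // too uncertain to act on
    if (best === null || h.confidence > best.confidence) best = h;
  }
  return best; // null when nothing clears the threshold
}

// Made-up sample data, for illustration only.
var sample = [
  { utterance: "chad", confidence: 0.31 },
  { utterance: "chat", confidence: 0.87 }
];
var picked = bestHypothesis(sample, 0.5);
console.log(picked ? picked.utterance : "(no confident match)");
```

Returning null instead of the top entry when every hypothesis is weak lets the page ask the user to repeat the command rather than act on a guess.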

In my demonstration, Chrome is the browser that captures the audio and streams it to Google’s servers for speech recognition; the servers return the resulting text to Chrome. The software component responsible for capturing audio and streaming it to the servers is built directly into the Chrome web browser.

  • For extra research, you can look at the Chrome web browser source code related to speech. Audio is collected from the microphone, and then sent to a Google server (a Gwebservice) using HTTPS POST, which returns a JSON object with results. Check out the source code or visit Accessing Google Speech API / Chrome 11 for a little more information.
  • Clearly, unless you have your own browser product, as Google has Chrome, you will have to build a browser extension that handles the audio capture and streaming. You also need servers that do the speech recognition for you. Alternatively, you can opt for the first approach and embed a recognizer in the extension itself.
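
To make the server-based flow a little more concrete, here is a minimal sketch of the last step only: turning the JSON text returned by a recognition server into a list of hypotheses. The response layout (`status`, `hypotheses`, `utterance`, `confidence`) and the `parseRecognitionResponse` name are hypothetical; a real service defines its own format.

```javascript
// Parse a recognition server's JSON reply into an array of hypotheses.
// The response layout used here is an assumption for illustration.
function parseRecognitionResponse(jsonText) {
  var data;
  try {
    data = JSON.parse(jsonText);
  } catch (e) {
    return []; // malformed body: treat as "nothing recognized"
  }
  if (!data || data.status !== 0 || !Array.isArray(data.hypotheses)) return [];
  // Keep only entries that actually carry recognized text.
  return data.hypotheses.filter(function (h) {
    return h && typeof h.utterance === "string";
  });
}

// Made-up response body, for illustration only.
var body = '{"status":0,"hypotheses":[{"utterance":"video","confidence":0.92}]}';
console.log(parseRecognitionResponse(body));
```

Treating every failure as an empty list keeps the caller simple: it only has to distinguish "got something" from "got nothing".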

Other Approaches

There are other approaches as well that do not relate to HTML5 or the Speech Input API. They implement speech recognition for the web differently but rely on the same server-based technique discussed above: a Flash-based component on the web page captures the audio, streams it to the provider's servers, and gets the result back from the server.

Note: There can, and will, be many more ways to implement speech recognition on the web. These are the ones I came across.

Prerequisites

HTML5

HTML5 is a language for structuring and presenting content for the World Wide Web, a core technology of the Internet. It is the fifth revision of the HTML standard. HTML5 adds many new syntactical features and introduces a number of new elements and attributes that reflect typical usage on modern websites. In addition to specifying markup, HTML5 specifies scripting application programming interfaces (APIs). Many HTML5 features and specifications are not achievable without CSS and JS, so, bluntly put, HTML5 = HTML + CSS + JS. Knowing a little about HTML will help us better understand the details of Speech Input.

The <input> HTML element (<input type="text" name="text 1">) is extended in the HTML5 Speech Input specification to allow speech recognition and input. The input element is extended because the aim of the API is to allow data to be entered by voice or speech, which also explains the name "Speech Input API".

  • A basic knowledge of the input element is needed. Full practical details will be discussed at a later stage.

JavaScript

Web page authoring is cleanly separated into three layers, which gives the World Wide Web the flexibility and extensibility it enjoys today. This three-layer pattern grew out of past experience and mistakes during the evolution of the web and web authoring. The layers are content, presentation and behavior: content and structure are controlled by HTML, presentation and styling by CSS, and the behavior and responsiveness of elements by JavaScript. In brief, all elements structured by HTML are represented in the DOM (Document Object Model) as objects, and JS is the language that interacts with those DOM objects. JS can access the objects and their properties, subscribe to the events associated with them, and respond to those events when they fire.

To understand the events raised by Speech Input and to respond to them, a basic knowledge of JavaScript is required.

Implementation

For a working demonstration of the Speech API, visit http://www.robinrizvi.info/speechapidemo/.

In this demonstration, I present an example of navigating a website with voice commands: the user speaks the name of a link into the microphone to navigate to it.

HTML

<div id="speechinput"> 
<input id="speech" type="text" 
speech="speech" x-webkit-speech="x-webkit-speech" 
onspeechchange="processspeech();" onwebkitspeechchange="processspeech();" /> 
<img src="image/mic_disabled.png"/> 
</div>  

The main line in the above HTML code that does the magic is:

<input id="speech" type="text" speech="speech" x-webkit-speech="x-webkit-speech" 
	onspeechchange="processspeech();" onwebkitspeechchange="processspeech();" />

Here the <input> HTML element is extended with additional attributes and events to provide the speech functionality. The new attributes and event handlers are:

speech="speech"
x-webkit-speech="x-webkit-speech"
onspeechchange="processspeech();"
onwebkitspeechchange="processspeech();"

speech="speech": tells the browser that this is not a normal <input> element but one that can take input by speech or voice. It adds a small microphone icon to the right of the <input> element; clicking it lets the browser capture voice from the microphone.

x-webkit-speech="x-webkit-speech": a redundant, vendor-prefixed form of the same attribute that will possibly be removed. It is not in the draft specification, but it is necessary for the demonstration to work because Google Chrome recognizes x-webkit-speech instead of speech. It is simply speech prefixed with x-webkit; the only difference is the name used by the browser’s engine.

As background, WebKit is the browser engine (also called layout engine or rendering engine) of the Google Chrome web browser. Each browser has an underlying engine that interprets HTML, CSS and JS and lays out the elements on the screen. For instance, Gecko is the layout engine of Firefox, Trident is the layout engine of Internet Explorer and Presto is the engine of Opera. These layout engines are the core of any web browser, and several of them, including Gecko and WebKit, are open source.

onspeechchange="processspeech();"

This subscribes the processspeech() event handler to the speech change event, which occurs when speech or voice input changes the value of the <input> element. processspeech() is just a function name and could have been anything else.

onwebkitspeechchange="processspeech();"

Like the redundant attribute discussed above, this is a redundant, vendor-prefixed event. It is necessary for the demonstration to work because Google Chrome recognizes the onwebkitspeechchange event instead of the onspeechchange event.
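
The same subscription can also be made from script rather than with inline attributes. The sketch below is one hedged way to do it: `bindSpeechChange` is a hypothetical helper, and the event names `speechchange` and `webkitspeechchange` are inferred from the attribute names above by dropping the `on` prefix. Registering a listener for an event a browser never fires is harmless, so both names are registered.

```javascript
// Subscribe one handler to both the standard and the prefixed event name.
// Event names are assumed to be the handler attribute names minus "on".
function bindSpeechChange(input, handler) {
  var names = ["speechchange", "webkitspeechchange"];
  for (var i = 0; i < names.length; i++) {
    input.addEventListener(names[i], handler, false);
  }
  return names;
}

// Browser usage (skipped when no DOM is available, for example under Node).
if (typeof document !== "undefined") {
  var speechInput = document.getElementById("speech");
  if (speechInput) {
    bindSpeechChange(speechInput, function () {
      console.log("Recognized: " + speechInput.value);
    });
  }
}
```

If a browser ever fired both events for one recognition, the handler would run twice, so a handler bound this way should be safe to call repeatedly.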

This pattern of redundant attributes and events may seem familiar if you have worked with CSS properties that are prefixed with -moz and work only on Mozilla/Gecko browsers, such as -moz-transform.

JavaScript

$(document).ready(function() {
	// Check whether HTML5 speech input is implemented in the browser.
	// A browser that knows the event exposes the inline handler from the
	// markup as a property; otherwise the property is undefined.
	var d = document.getElementById("speech");
	if (!d.onwebkitspeechchange && !d.onspeechchange)
	{
		var unsupported = "Voice input functionality is currently not supported in your browser.<br/>" +
			"Please install the latest version (11+) of Google Chrome to access this functionality";
		$("#speechinput").css("border-color", "#900");
		$("#speechinput input").css("display", "none");
		$("#speechinput img").css("display", "block");
		notify(unsupported, 3000);
		// Repeat the notice whenever the disabled control is clicked.
		$("#speechinput").click(function() {
			notify(unsupported, 3000);
		});
	}
	else
	{
		notify("Voice input functionality is supported in your browser.<br/>" +
			"*Valid voice commands are: CHAT, VIDEO, PICTURE, LIVE, CONTACT", 3000);
	}
});

// Flash the border of the speech box three times in the given color.
// jQuery core does not animate colors on its own; a color-animation
// add-on such as jQuery UI is assumed to be loaded.
function flashborder(color)
{
	for (var i = 1; i <= 3; i++)
		$("#speechinput").animate({"border-color": color}, 500)
			.animate({"border-color": "#fff"}, 500);
}

// Handler for the speech change event: maps the recognized text to a link.
function processspeech()
{
	var speechtext = $("#speech").val();
	var valid = true;
	switch (speechtext)
	{
		case "chat":
			$("#chat").click();
			break;
		case "video":
			$("#video").click();
			break;
		case "picture":
			$("#picture").click();
			break;
		case "live":
			$("#live").click();
			break;
		case "contact":
			$("#contact").click();
			break;
		default:
			valid = false;
	}
	if (valid) flashborder("#060");
	else
	{
		flashborder("#900");
		var notification = "\"<span>" + speechtext + "</span>\" is an invalid voice command.<br/>" +
			"*Valid voice commands are: CHAT, VIDEO, PICTURE, LIVE, CONTACT";
		notify(notification);
	}
}

// Slide the notification box in, keep it visible for `time` ms, slide it out.
function notify(notification, time)
{
	if (typeof time == 'undefined') time = 2000;
	var box = $("#speechnotification");
	box.html(notification);
	// Use the box itself (not `this`) to compute how far to slide it out.
	box.animate({"left": 0}, 1500).delay(time)
		.animate({"left": -(box.width() + 5)}, 1500);
}

The first section of code executes when the document is ready. It checks whether either of the two event handler properties is set on the speech input element. Because DOM elements are ordinary JavaScript objects, testing a property this way is an intuitive way to find out whether an attribute or event is available. For example, d.onwebkitspeechchange is undefined on Firefox (a falsy value, though not the same thing as null), whereas on Chrome it holds the inline handler. After the check, the code simply notifies the user of the result.
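
An alternative check that does not depend on the inline handlers is to ask whether the element exposes the speech-related properties at all, using the `in` operator. This is only a sketch: the property names `speech` and `webkitSpeech` are assumed to mirror the `speech` and `x-webkit-speech` attributes, and `supportsSpeechInput` is a hypothetical helper.

```javascript
// Report whether an element exposes any speech-input property.
// `in` tests for presence, so it also succeeds when the value is null.
function supportsSpeechInput(input) {
  return "speech" in input ||
         "webkitSpeech" in input ||
         "onspeechchange" in input ||
         "onwebkitspeechchange" in input;
}

// Browser usage (skipped when no DOM is available, for example under Node).
if (typeof document !== "undefined") {
  var probe = document.createElement("input");
  console.log("Speech input supported: " + supportsSpeechInput(probe));
}
```

Unlike the truthiness test on the handler properties, this check works on a freshly created element that has no handlers attached yet.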

The second section of the code is processspeech(), the event handler for the speech change event. It runs after the speech has been converted to text and stored in the input text box. The rest of the code is straightforward, so I will not discuss it.
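
The `switch` in processspeech() can also be expressed as a lookup table, which makes the list of valid commands easier to extend. The following is a sketch rather than the demo's actual code: `COMMANDS` and `resolveCommand` are hypothetical names, and the trimming and lower-casing are an added assumption that a recognizer may return text with different capitalization or stray whitespace.

```javascript
// Map each voice command to the selector of the link it should activate.
var COMMANDS = {
  chat: "#chat",
  video: "#video",
  picture: "#picture",
  live: "#live",
  contact: "#contact"
};

// Return the selector for a recognized command, or null if it is invalid.
function resolveCommand(speechText) {
  var key = String(speechText).replace(/^\s+|\s+$/g, "").toLowerCase();
  // hasOwnProperty avoids matching inherited names such as "toString".
  return Object.prototype.hasOwnProperty.call(COMMANDS, key) ? COMMANDS[key] : null;
}

console.log(resolveCommand(" Video "));
console.log(resolveCommand("weather"));
```

In the demo, a non-null result could be passed to `$(selector).click()`, and a null result would trigger the "invalid voice command" notification.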

  • To keep the content concise, I will not discuss the various interface animations that I built.

CSS

CSS does not play any significant role in the Speech Input API. Speech Input is all about the HTML <input> element and the handling, in JS, of the events triggered by that element. I have used CSS here just to hide the textbox associated with the <input> element and show only the microphone icon to its right. The CSS also scales the microphone so that it looks bigger, and replaces the text cursor that appears when hovering over the microphone with a hand cursor.

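#speechinput input {
	cursor: pointer;                /* hand cursor instead of the text cursor */
	margin: 15px;
	color: transparent;             /* hide the recognized text */
	background-color: transparent;  /* hide the textbox background */
	border: 5px;
	width: 15px;                    /* shrink the box down to the microphone icon */
	/* enlarge the microphone; prefixed forms for older engines */
	-webkit-transform: scale(3.0, 3.0);
	-moz-transform: scale(3.0, 3.0);
	-ms-transform: scale(3.0, 3.0);
	transform: scale(3.0, 3.0);
}

  • I have not discussed all the CSS that was used to style and position the speech input element. To see the rest, take a look at the page source while viewing the demonstration.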

License

This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)


About the Author

Robin Rizvi
Software Developer Databorough India
India
Currently working as software developer for Databorough India - Division of Fresche Legacy.
 
Developing for the open-source community and writing articles is my way of thanking the community. I have developed commercial as well as non-commercial/open-source projects for the web and windows as my work and hobby. Just trying very hard so that someday I could contribute a little for this world. I would like to send out my regards to all for your rating and comments because these comments keep me going. Thank you all.
 
Certifications:
Microsoft Certified Professional (Programming in C#)
Microsoft Certified Professional (Programming in HTML5 with JavaScript and CSS3)
 
GET IN TOUCH:
http://robinrizvi.info
http://blog.robinrizvi.info
 
If you wish to express your appreciation
Donate @ http://blog.robinrizvi.info

Last Updated 6 Sep 2011
Article Copyright 2011 by Robin Rizvi