Click here to Skip to main content
12,244,103 members (55,339 online)
Click here to Skip to main content
Add your own
alternative version


83 bookmarked

Speech Recognition for the Web

, 6 Sep 2011 GPL3
Rate this:
Please Sign up or sign in to vote.
This article discusses the HTML5 Speech Input API


This post discusses about bringing speech recognition to the web. Speech recognition is in short a technology that converts spoken words to text. Voice or speech recognition has been popular in the desktop software world. Popular examples of this include the speech recognition system used in Windows XP, Vista and Seven for giving voice commands and controlling the system. Another popular example would be the speech recognition feature in Microsoft Office that helps in dictating text so that the users can write text just by dictating it to the computer.

With the new draft specification of HTML5 Speech Input, this facility will be made available for the web so that the speech recognition can be carried out in the web world with ease.

Note: To read the full specifications of HTML5 Speech Input API, visit here.


Refer to this Wikipedia page


The API itself is agnostic of the underlying speech recognition implementation and can support both server based as well as embedded recognizers. In case of embedded recognizers, the browser itself would have the capability of speech recognition and this would be quite similar to the current software that does speech recognition. In this approach, the browser would record the voice from the microphone and perform the speech recognition process on the input voice locally and generate the resultant text.

This would be a fast process and could be done offline as well. Whereas in the second approach, the browser would record the voice from the microphone and stream the audio data to its server which is responsible for the speech recognition and after the speech recognition process at the server, it would send the result text to the browser.

The advantage of using a server based approach is that speech recognition would be more precise and accurate than the local approach because large amount of training data collected at the central servers help improve accuracy of the speech recognition. The API is designed to enable both one-off speech input and continuous speech input requests. Speech recognition results are provided to the web page as a list of hypotheses along with other relevant information for each hypothesis.

In my demonstration, Chrome is the browser which captures audio and streams to Google’s servers for speech recognition and the text is resulted from the servers and sent to Chrome browser. In this demonstration, the software part that has the responsibility to capture audio and stream to servers is embedded directly in the Chrome web browser.

  • For extra research, you can look at the Chrome web browser source code related to speech. Audio is collected from the microphone, and then sent to a Google server (a Gwebservice) using HTTPS POST, which returns a JSON object with results. Check out the source code or visit Accessing Google Speech API / Chrome 11 for a little more information.
  • As it is clear that unless you have your own browser product like Google has Chrome, you will have to build an extension that will be attached to the browser and will handle the audio capture and streaming responsibilities. And you also need servers that will do the speech recognition for you. Or you can also opt for the first approach and embed your recognizer in your extension that you built.

Other Approaches

There are other approaches as well that do not relate to HTML5 or SPEECH INPUT API. They implement speech recognition for web using different implementations but using the same technology as I discussed above. The strategy followed by them is that a flash based component resides on the web page which captures the audio and streams the audio to their servers and gets the result back from the server.

Note: There can be and would be many more implementations to use speech recognition on the web. These are the ones I came across.



HTML5 is a language for structuring and presenting content for the World Wide Web, a core technology of the Internet. It is the fifth revision of the HTML standard. In particular, HTML5 adds many new syntactical features. HTML5 introduces a number of new elements and attributes that reflect typical usage on modern websites. In addition to specifying markup, HTML5 specifies scripting application programming interfaces (APIs).[HTML5 new features and specifications are not achievable without CSS and JS. So bluntly HTML5 =HTML + CSS +JS. Knowing briefly about HTML will help us to better understand the details of SPEECH INPUT.

The <input> html element (<input type="text" name="text 1">) is extended in the HTML5 Speech Input specification to allow speech recognition and input facilities. The input element is extended because the intended aim of the API is to allow input of data by voice or speech. This makes it clear for the name "Speech Input API".

  • A basic knowledge of the input element is needed. Full practical details will be discussed at a later stage.


Webpage authoring is beautifully separated into three layers that provide world wide web the flexibility and extensibility it enjoys today. This three layer pattern has come from our past experiences and mistakes which helped in the evolution of the world wide web and web authoring. Web authoring is separated into the layers of content, presentation and behavior where content and structure is controlled by HTML, presentation and styling is controlled by CSS and the behavior and responsiveness of elements is controlled by the JavaScript. So in brief all elements structured by HTML are represented in the DOM (Document Object Model) as objects and JS is the language that interacts with those DOM objects. JS can be used to access the object, their properties, subscribe to events associated with them and respond to those events when the event triggers.

To understand the events caused by Speech Input and to respond to them, basic knowledge of JavaScript is required.


For a working demonstration of the Speech API, visit

In this demonstration, I am presenting an example of navigating the website by issuing voice commands, i.e., the user can speak the links through his microphone to navigate to the link.


<div id="speechinput"> 
<input id="speech" type="text" 
speech="speech" x-webkit-speech="x-webkit-speech" 
onspeechchange="processspeech();" onwebkitspeechchange="processspeech();" /> 
<img src="image/mic_disabled.png"/> 

The main line in the above HTML code that does the magic is:

<input id="speech" type="text" speech="speech" x-webkit-speech="x-webkit-speech" 
	onspeechchange="processspeech();" onwebkitspeechchange="processspeech();" />

Here the <input> HTML element is extended to include more properties/attributes and events so that the speech functionality could be achieved. The new attributes and methods added are:


speech="speech": tells the browser that it is not a normal <input> element, rather it is an <input> element that can take input by speech or voice. This adds a small mic to the right of the <input> element which can be clicked so that the browser can capture voice from the microphone. x-webkit-speech="x-webkit-speech", this attribute is just a redundant attribute which will possibly be removed. This attribute is not in the draft specification. But this attribute is necessary for the demonstration to work because Google Chrome recognizes the x-webkit-speech attribute instead of the speech attribute. speech is just prefixed with x-webkit. It's just a difference of name as specified in the browser’s engine, nothing very special about it.

For extra knowledge, webkit is the web browser engine (called layout engine or rendering engine) of Google Chrome web browser. Each browser has an underlying engine that does the work of interpreting HTML, CSS and JS and laying out the elements on the browser screen. For instance, Gecko is the layout engine of Firefox, Trident is the layout engine of Internet Explorer and Presto is the engine for Opera. These layout engines are the core or kernel of any web browser and most of them are open source including gecko, webkit and others.


This subscribes the processspeech() event handler to the speech change event which occurs when the speech or voice input changes the value of the <input> element. processspeech() is just a function name and could have been anything else.


This event is just a redundant event as the redundant attribute discussed above. But this event is necessary for the demonstration to work because Google Chrome recognizes the onwebkitspeechchange event instead of the onspeechchange event.

This phenomenon of redundant attribute and event may seem familiar if you are acquainted and worked with some of the CSS properties that are prefixed with -moz and work only on mozilla/gecko browsers like -moz-transform and others.


$(document).ready(function() {			
				//checking if html5 speech input is implemented 
				//in the browser of not.
				var d=document.getElementById("speech");
				    $("#speechinput input").css("display","none");
				    $("#speechinput img").css("display","block");
				    var notification= "Voice input functionality 
					is currently not supported in your browser.
 Please install the latest version (11+) of Google Chrome to access this functionality";
					var notification= "Voice input functionality
					 is currently not supported in your browser.
 Please install the latest version (11+) of Google Chrome to access this functionality";
					var notification= "Voice input 
					functionality is supported in your browser.
 *Valid voice commands are: CHAT, VIDEO, PICTURE, LIVE, CONTACT";
function processspeech()
	var speechtext=$("#speech").val();
	var flag=1;
	switch (speechtext)
		case "chat":
		case "video":
		case "picture":
		case "live":
		case "contact":
			for (i=1;i<=3;i++) $("#speechinput").animate
	if (flag==1) for (i=1;i<=3;i++) $("#speechinput").animate
		var notification="\"<span>"+ speechtext + "</span>\" is an invalid voice command.
*Valid voice commands are: CHAT, VIDEO, PICTURE, LIVE, CONTACT";
function notify(notification,time)
	if (typeof time == 'undefined' ) time = 2000;

The first section of code executes when the document gets ready. It simply checks whether the two events are available in the speech input element. As it is known that JavaScript is an object-oriented language, so the above used notion for checking whether an attribute or event is present or not is an intuitive one. Here, for example, d.onwebkitspeechchange returns undefined(=NULL) on Firefox but on Chrome it does not return undefined. After checking, it just notifies the user about it.

The second section of the code is processspeech() event handler for the speechchange event. After the speech is converted to text and saved in the input text box, the event handler gets executed. The rest of the code here is quite easy to understand so I will not be discussing it.

  • The various animations of the interface that I built, I will not be discussing those to keep the content concise.


CSS does not play any significant role in the SPEECH INPUT API. Speech Input is all about the HTML <input> element and the handling of events by js which are triggered by that <input> element. I have used CSS here just to hide the textbox associated with the HTML <input> element and to show only the microphone icon that is to the right of the textbox. We also scale the microphone so that it looks bigger and replace the text cursor that comes when we hover on the microphone with a hand cursor.

#speechinput input {
	-webkit-transform: scale(3.0, 3.0);
	-moz-transform: scale(3.0, 3.0);
	-ms-transform: scale(3.0, 3.0);
	transform: scale(3.0, 3.0);
  • I have not discussed all the CSS that was used to style and position the speech input element. Just take a look at the source while you are viewing the demonstration.


This article, along with any associated source code and files, is licensed under The GNU General Public License (GPLv3)


About the Author

Robin Rizvi
Software Developer Databorough India
India India
Currently working as software developer for Databorough India - Division of Fresche Legacy.

Developing for the open-source community and writing articles is my way of thanking the community. I have developed commercial as well as non-commercial/open-source projects for the web and windows as my work and hobby. Just trying very hard so that someday I could contribute a little for this world. I would like to send out my regards to all for your rating and comments because these comments keep me going. Thank you all.

Microsoft Certified Professional (Programming in C#)
Microsoft Certified Professional (Programming in HTML5 with JavaScript and CSS3)

You may also be interested in...

Comments and Discussions

QuestionA great implementation web app for speech recognition using that technology Pin
Speechlogger Admin5-Jun-15 5:46
memberSpeechlogger Admin5-Jun-15 5:46 
GeneralAbout Article Pin
Member 1157141431-Mar-15 18:34
memberMember 1157141431-Mar-15 18:34 
QuestionIs it possible to use this in windows application? Pin
amol_kk8-Apr-14 3:36
memberamol_kk8-Apr-14 3:36 
AnswerRe: Is it possible to use this in windows application? Pin
Robin Rizvi25-Apr-14 10:57
memberRobin Rizvi25-Apr-14 10:57 
Generalspeech is not the input's attribute Pin
akira3213-Oct-13 16:31
memberakira3213-Oct-13 16:31 
QuestionHow to do offline configuration? Pin
Basma Adeel12-Jul-13 6:31
memberBasma Adeel12-Jul-13 6:31 
AnswerRe: How to do offline configuration? Pin
Robin Rizvi12-Jul-13 8:53
memberRobin Rizvi12-Jul-13 8:53 
Questionquesion Pin
faynburd5-Jun-13 12:59
memberfaynburd5-Jun-13 12:59 
AnswerRe: quesion Pin
Robin Rizvi12-Jul-13 8:46
memberRobin Rizvi12-Jul-13 8:46 
Questionspeech recognition in with c# Pin
praphe23-Jan-13 1:08
memberpraphe23-Jan-13 1:08 
AnswerRe: speech recognition in with c# Pin
Robin Rizvi12-Jul-13 8:55
memberRobin Rizvi12-Jul-13 8:55 
GeneralMy vote of 5 Pin
ridoy17-Jan-13 9:29
memberridoy17-Jan-13 9:29 
QuestionCommercial Use Pin
Member 964588430-Nov-12 9:07
memberMember 964588430-Nov-12 9:07 
AnswerRe: Commercial Use Pin
Robin Rizvi30-Nov-12 23:21
memberRobin Rizvi30-Nov-12 23:21 
QuestionHow to Implement without Colorbox Pin
Member 878379022-Nov-12 12:30
memberMember 878379022-Nov-12 12:30 
AnswerRe: How to Implement without Colorbox Pin
Robin Rizvi23-Nov-12 21:39
memberRobin Rizvi23-Nov-12 21:39 
QuestionSpeech recognition and JavaScript Pin
Samar_0118-Sep-12 3:18
memberSamar_0118-Sep-12 3:18 
AnswerRe: Speech recognition and JavaScript Pin
Robin Rizvi18-Sep-12 5:15
memberRobin Rizvi18-Sep-12 5:15 
GeneralRe: Speech recognition and JavaScript Pin
Samar_0118-Sep-12 5:58
memberSamar_0118-Sep-12 5:58 
GeneralRe: Speech recognition and JavaScript Pin
Robin Rizvi19-Sep-12 5:29
memberRobin Rizvi19-Sep-12 5:29 
QuestionInquiry‏ Pin
Samar_0111-Aug-12 11:50
memberSamar_0111-Aug-12 11:50 
AnswerRe: Inquiry‏ Pin
Robin Rizvi11-Aug-12 12:16
memberRobin Rizvi11-Aug-12 12:16 
GeneralRe: Inquiry‏ Pin
Samar_0111-Aug-12 12:34
memberSamar_0111-Aug-12 12:34 
AnswerRe: Inquiry‏ Pin
Robin Rizvi13-Aug-12 5:09
memberRobin Rizvi13-Aug-12 5:09 
GeneralRe: Inquiry‏ Pin
Samar_0114-Sep-12 23:07
memberSamar_0114-Sep-12 23:07 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Terms of Use | Mobile
Web01 | 2.8.160426.1 | Last Updated 6 Sep 2011
Article Copyright 2011 by Robin Rizvi
Everything else Copyright © CodeProject, 1999-2016
Layout: fixed | fluid