Hands-free WinRT: Part 2 – The Listening App


The first part of this series demonstrated how to use Speech Synthesis and “voice fonts” to give your WinRT app a voice. Now we can see what it takes for your app to work in the opposite direction: listening to your voice and producing text as a result.

Platform Divergence

While the Speech Synthesis API in WinRT is available to both Windows and Windows Phone 8.1, the Speech Recognition API, unfortunately, is not.

With Windows Phone, you get a powerful API built right into WinRT that enables speech recognition in your app. This can tap into the power of Bing and Microsoft’s cloud services to provide accurate voice recognition and grammar detection that an otherwise underpowered device, like the phone, may not be able to accomplish alone.

Windows 8.1 (desktop/tablet) does not provide speech recognition in its version of WinRT. Instead, you can add the Bing Speech Recognition Control to use the same cloud-based speech recognition that the phone uses. This situation may be short-lived, though, considering that Windows 10 is getting Cortana (so, reading the tea leaves, I predict that the Phone's speech recognition API will become available across the entire Windows 10 landscape – stay tuned for announcements at the April 2015 Build conference to see if this prediction comes true).

This blog post will concentrate on the Windows Phone 8.1 WinRT implementation.


In speech recognition systems, a constraint is the list of words and phrases that the app recognizes. For example, your app may define its own grammar constraint to accept just a “Yes” or “No” answer to a question. However, for “Yes”, the app may need to accept familiar colloquialisms, such as “Okay”, “Yep”, “Sure”, or “Go for it”. The constraint is what allows these terms to be recognized as valid input for the current situation, and in the case of a custom grammar, provides the mapping from “Sure” to “Yes” for your application. Constraints also help to solve the homophone problem – words that sound alike (e.g., “to”, “two”, and “too”) but have different meanings, requiring the surrounding context to determine which word is actually being spoken.

The WinRT speech recognition system permits a number of different types of constraints to be defined, allowing for varying complexity:

  • Predefined Grammars: Free-text dictation and web-search grammars. These are used to recognize words and phrases that a user may say in a particular language, and as a result, are very large in size. Because of this, the recognition task is performed online via a remote service in near-real time.
  • Programmatic list constraints: You provide a list of words or phrases, and recognition is successful when the speech recognizer hears one of the strings in the array.
  • SRGS grammars: You define a set of grammar rules in an XML document that the speech recognizer uses to identify phrases. Specification: http://www.w3.org/TR/speech-grammar/
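As a taste of the second option, here is a minimal sketch of a programmatic list constraint for the “Yes”/“No” scenario described earlier. The phrase lists and tag names are illustrative; the types come from the Windows.Media.SpeechRecognition namespace:

```csharp
using Windows.Media.SpeechRecognition;

var recognizer = new SpeechRecognizer();

// Each list constraint carries a tag, which is how the app maps "Sure" back to "Yes"
recognizer.Constraints.Add(new SpeechRecognitionListConstraint(
    new[] { "Yes", "Okay", "Yep", "Sure", "Go for it" }, "YES"));
recognizer.Constraints.Add(new SpeechRecognitionListConstraint(
    new[] { "No", "Nope", "Nah" }, "NO"));

await recognizer.CompileConstraintsAsync();

// Recognition succeeds only when one of the listed phrases is heard
SpeechRecognitionResult result = await recognizer.RecognizeAsync();
bool saidYes = result.Constraint != null && result.Constraint.Tag == "YES";
```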

For the sake of getting started, we’ll just concentrate on the default constraint in this blog post: Free-text dictation.

Creating Strings from Speech

The goal of speech recognition is to generate a string containing the words that were spoken, which can then be parsed by your program using more familiar mechanisms. To do this with WinRT, first instantiate a SpeechRecognizer object, add any constraints, and then call RecognizeWithUIAsync():
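A sketch of those steps is shown below. The event handler name, prompt strings, and timeout values are placeholders of my own choosing; the SpeechRecognizer members are from the Windows.Media.SpeechRecognition namespace:

```csharp
using System;
using Windows.Media.SpeechRecognition;

private async void StartListening_Click(object sender, RoutedEventArgs e)
{
    using (var recognizer = new SpeechRecognizer())
    {
        // No constraints are added, so the default free-text dictation grammar is used
        await recognizer.CompileConstraintsAsync();

        // Configure the system-provided capture UI
        recognizer.UIOptions.AudiblePrompt = "What would you like to do?";
        recognizer.UIOptions.ExampleText = "For example: 'Take a note'";
        recognizer.UIOptions.ShowConfirmation = true;
        recognizer.UIOptions.IsReadBackEnabled = true;

        // Timeout values (discussed below)
        recognizer.Timeouts.InitialSilenceTimeout = TimeSpan.FromSeconds(5);
        recognizer.Timeouts.EndSilenceTimeout = TimeSpan.FromSeconds(1);
        recognizer.Timeouts.BabbleTimeout = TimeSpan.FromSeconds(3);

        SpeechRecognitionResult result = await recognizer.RecognizeWithUIAsync();

        if (result.Status == SpeechRecognitionResultStatus.Success)
        {
            // result.Text contains the string that the recognizer heard
            ResultTextBlock.Text = result.Text;
        }
    }
}
```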

If you do not add any constraints, CompileConstraintsAsync() will compile the default free-text dictation grammar. As stated above, this requires the device to be connected to the internet, since the speech recognition is performed in the cloud.

Next, since RecognizeWithUIAsync() will be used to start the speech recognizer, we can set up some of the options for that UI. The AudiblePrompt and ExampleText will be displayed on the capture screen as a prompt to the user for what to do. If ShowConfirmation is enabled, then the results will be displayed to the user after the recognition is complete. Optionally, this can be spoken aloud using Text to Speech if IsReadBackEnabled is true.

The capture screen:


The confirmation screen:


You’ll also notice in the code above that several timeout values are provided. InitialSilenceTimeout is the amount of time that the recognizer will wait for speech to begin before giving up. Similarly, EndSilenceTimeout is the amount of time after the last spoken word that the recognizer will wait to see if there is any additional input.

Sometimes the environment may be noisy due to wind, unintelligible background speech in a restaurant or at a party, or just the sound of moving the phone around. This is known as babble to the speech recognizer: sound that is being processed but not recognized as valid words. BabbleTimeout, therefore, is the amount of time that the recognizer will continue trying to parse sound that contains only babble.

Finally, after the recognition process is complete, a SpeechRecognitionResult object will be returned. Among other things, this provides the Text that the recognizer determined that the user said. This is the string value that your app will need to process and take action on.
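Before acting on that text, it is worth checking the result’s Status and Confidence properties as well. A hedged sketch (ProcessCommand is a hypothetical app-specific method; the enums are from the Windows.Media.SpeechRecognition namespace):

```csharp
SpeechRecognitionResult result = await recognizer.RecognizeWithUIAsync();

switch (result.Status)
{
    case SpeechRecognitionResultStatus.Success:
        // Confidence indicates how sure the recognizer is about the text
        if (result.Confidence != SpeechRecognitionConfidence.Rejected)
        {
            ProcessCommand(result.Text);  // your app's own parsing logic
        }
        break;

    case SpeechRecognitionResultStatus.UserCanceled:
        // The user dismissed the capture UI; nothing to process
        break;

    default:
        // Network failure, timeout, or other error
        break;
}
```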

Jason Follas is a Sr. Architect for Falafel Software and lives near Toledo, OH. He has been a Microsoft MVP for SQL, Visual C#, and most recently Windows Platform. When not working or speaking, you can find him writing a number of casual games currently in the Windows Store or fishing in the local river.