Hands-free WinRT: Part 3 – Cortana


Cortana is many things. She (yes, after using it for a while, you start to refer to Cortana as “she”) is a digital assistant, keeping track of details about you in order to be proactive in delivering information to you before you need it. She is a voice interface to perform tasks that would otherwise require starting one of the built-in apps, such as setting reminders and alarms, or sending a text message while driving. And when she cannot determine an action based on what you say (or type) to her, then she will just perform a Bing search using your words.

Cortana is also an app launcher, and can launch your app in response to a voice command. This feature is known as Cortana Integration from your app’s perspective.

Integrating with Cortana involves three steps:

  1. Define voice commands so that Cortana can identify the user’s intention to use your app for the given command.
  2. “Install” the voice command definitions (an XML file) when your app first starts.
  3. Handle voice activation by navigating and/or performing an action when your app is launched via a voice command.

Note: At this year’s //BUILD conference, it was revealed that Cortana Integration will become much richer in the Windows 10 timeframe. However, at the time of this writing, that is still months away, so this blog post will focus on the latest APIs available for Windows Phone 8.1 and WinRT today.

Define your Voice Command Definitions

The Voice Command Definition file contains the rules that Cortana will listen for in order to determine the user’s intent for launching your app. You can find details about the contents on MSDN: https://msdn.microsoft.com/en-us/library/dn706592.aspx

Suppose that I have an app that is able to automate my vehicle in order to do things like locate the vehicle, lock/unlock the doors, or remote start my engine. I may want to perform these tasks via Cortana, so that I don’t need to open my app and click a button.  So, to remote start my car, I might want to say “Car, start engine.”  The VCD for this would resemble:
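A minimal sketch (the `Example` phrases and `Feedback` wording here are just one possibility):

```xml
<?xml version="1.0" encoding="utf-8"?>
<VoiceCommands xmlns="http://schemas.microsoft.com/voicecommands/1.1">
  <CommandSet xml:lang="en-us" Name="englishCommands">
    <!-- The word the user says first to direct Cortana to this app -->
    <CommandPrefix>Car</CommandPrefix>
    <Example>start engine</Example>

    <Command Name="start">
      <Example>start engine</Example>
      <!-- Words in square brackets are optional -->
      <ListenFor>[remote] start [engine]</ListenFor>
      <Feedback>Starting the engine...</Feedback>
      <Navigate />
    </Command>
  </CommandSet>
</VoiceCommands>
```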

The <CommandSet> element defines a number of rules for a particular language. If your app is multilingual, then your VCD would likely have multiple CommandSet elements.

Notice the <CommandPrefix> element. This is the prefix that Cortana will use to identify when the user is using a voice command for your app. Typically, it’s the name of your app, but it doesn’t need to be. Even though I used “Car” in this example, that may lead to confusion for Cortana if I try to do a web search for “Car parts store near me,” so the guidance is to actually avoid common terms as the command prefix.

Also, there’s no guarantee that another app won’t try to claim that same command prefix. Since VCDs are registered with the system upon app startup (see the next section for details), the last app launched using that prefix is the one whose commands are active. In this way, you could have multiple Twitter apps installed on your phone, each claiming “Twitter” as the command prefix. Only the most recently used Twitter app is the one Cortana would launch in response to a “Twitter” voice command.

The <Command> element contains all of the information for a single command. It’s best to use simple commands for Cortana to launch your app, and then carry on the conversation inside of your app for anything that is more sophisticated. In this example, I have defined a single <ListenFor> element, which serves as the voice command’s pattern that Cortana will try to match. Your VCD can have multiple <ListenFor> elements for a single command, but multiple lines are not always required. In this example, the words surrounded by square brackets will be treated as optional, so this command will be used whether the user says “remote start engine”, “remote start”, “start engine”, or just plain “start” – all from a single <ListenFor> pattern.

Note: While not represented in this post, there is also a concept of a PhraseList and PhraseTopic placeholder to simplify the <ListenFor> patterns when more dynamic input is required.

The <Feedback> text is what Cortana will say to the user as she is launching your app using that command.

Install the VCD

Cortana doesn’t automatically know about your app just because it’s installed on the phone. Instead, your app must make an API call to explicitly register your VCD file. This is done on every application startup to ensure that Cortana has the latest voice command definitions, and to also ensure that your app is the one that will use the CommandPrefix that is defined (i.e., you may have a Twitter client, and could claim the “Twitter” CommandPrefix until the user runs another app that also claims the “Twitter” CommandPrefix).

The recommendation is to register the VCD within the OnNavigatedTo method of your app’s startup page:
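Something along these lines (a sketch; error handling omitted):

```csharp
using System;
using Windows.Media.SpeechRecognition;
using Windows.Storage;
using Windows.UI.Xaml.Navigation;

protected async override void OnNavigatedTo(NavigationEventArgs e)
{
    base.OnNavigatedTo(e);

    // Load the VCD file that is packaged in the Appx...
    StorageFile vcdFile = await StorageFile.GetFileFromApplicationUriAsync(
        new Uri("ms-appx:///handsfree.xml"));

    // ...and register its command sets with Cortana.
    await VoiceCommandManager.InstallCommandSetsFromStorageFileAsync(vcdFile);
}
```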

Note: The VCD file (handsfree.xml) is included in the root of my project as a “Content” build action, which will include it in the Appx.

Handle the Voice Activation of your App

Cortana does the work of speech recognition (i.e., speech-to-text) and activating your app automatically. However, it is up to your app to handle the activation and respond to the action that the user is taking. This may include the need to parse the string of text that the user spoke in order to extract data or determine the specific intent.

To handle the activation, you must override the OnActivated method of the Application class. I prefer to do this within a partial class located in my Phone project (when part of a Universal Application solution). Note that when the app is activated, OnLaunched is not called, so any objects (such as Frames) that are created within OnLaunched as part of app startup should also be created in your OnActivated method.

A VoiceCommandActivatedEventArgs object will be provided to the OnActivated method. One useful property of this object is the Result, which is a SpeechRecognitionResult object (see part 2 of this series). From here, you can get access to the matching rule from your VCD, the text spoken, etc.

Like many things that originate from the underlying operating system, the SpeechRecognitionResult object, or at least some of its properties, is a COM object. This can make it tricky to debug your code in Visual Studio, since values are not always represented in the Watch window, etc. This is true for the SemanticInterpretation property, which contains information like the NavigationTarget (if defined in your VCD rule) and the CommandMode (whether the user spoke or typed the text to Cortana).  The sample below shows two ways of interacting with SemanticInterpretation: directly via the Properties collection (see “navigationTarget”), or through a helper method that includes guard code and a default value if the desired property does not exist in the COM object (see “commandMode”).
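A sketch of what that looks like (the MainPage target and the pipe-delimited parameter format are illustrative choices, and GetSemanticInterpretation is a helper defined below):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Windows.ApplicationModel.Activation;
using Windows.Media.SpeechRecognition;
using Windows.UI.Xaml;
using Windows.UI.Xaml.Controls;

protected override void OnActivated(IActivatedEventArgs args)
{
    base.OnActivated(args);

    if (args.Kind != ActivationKind.VoiceCommand)
        return;

    var voiceArgs = (VoiceCommandActivatedEventArgs)args;
    SpeechRecognitionResult result = voiceArgs.Result;

    // The name of the <Command> that Cortana matched (e.g., "start").
    string voiceCommandName = result.RulePath.First();

    // Direct access to the Properties collection (throws if the key is missing).
    string navigationTarget =
        result.SemanticInterpretation.Properties["navigationTarget"].FirstOrDefault();

    // Guarded access with a default value, safer for COM-backed properties.
    string commandMode = GetSemanticInterpretation(result, "commandMode", "voice");

    // OnLaunched is not called for voice activation, so create the Frame here.
    var rootFrame = Window.Current.Content as Frame;
    if (rootFrame == null)
    {
        rootFrame = new Frame();
        Window.Current.Content = rootFrame;
    }

    rootFrame.Navigate(typeof(MainPage), voiceCommandName + "|" + commandMode);
    Window.Current.Activate();
}

// Helper: returns a default value when the property is absent from the COM object.
private static string GetSemanticInterpretation(
    SpeechRecognitionResult result, string key, string defaultValue)
{
    IReadOnlyList<string> values;
    return result.SemanticInterpretation.Properties.TryGetValue(key, out values)
        ? (values.FirstOrDefault() ?? defaultValue)
        : defaultValue;
}
```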

SpeechRecognitionResult.RulePath.First() will give me the name of the command that Cortana matched. From my VCD above, if the user spoke “Car, Start Engine”, then the voiceCommandName will be set to “start”.

The end result of this OnActivated method should be to navigate to a page in your app, passing in a string parameter with data that the page can use to perform an action. Depending on your app, all actions may be performed on a single page (maybe there’s a big switch statement in that page’s OnNavigated code to take action based on the parameters), or you may have different actions routing to different pages.
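For the single-page design, that dispatch might be sketched like this (the helper methods are hypothetical, and I’m assuming the navigation parameter leads with the voice command name):

```csharp
protected override void OnNavigatedTo(NavigationEventArgs e)
{
    base.OnNavigatedTo(e);

    // Assumed parameter format: the voice command name, optionally
    // followed by pipe-delimited data.
    var parts = ((e.Parameter as string) ?? string.Empty).Split('|');

    switch (parts[0])
    {
        case "start":
            StartEngine();   // hypothetical helper
            break;
        case "unlock":
            UnlockDoors();   // hypothetical helper
            break;
        default:
            // Normal (non-voice) navigation: take no automatic action.
            break;
    }
}
```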

As a rule of thumb, only speak when spoken to. So, if the user activated your app via typing to Cortana, then they may be in a meeting, and it would be most inappropriate to have your app start talking aloud. This is why it’s important to determine the CommandMode (voice or text) and use that information when deciding what action the user interface should take in response.

The next part of this series will explore how to continue the conversation within your app – something essential to a truly hands-free experience.

Jason Follas is a Sr. Architect for Falafel Software and lives near Toledo, OH. He has been a Microsoft MVP for SQL, Visual C#, and most recently Windows Platform. When not working or speaking, you can find him writing a number of casual games currently in the Windows Store or fishing in the local river.