Voice input: Why Amazon’s Search Query update is a game-changer for developers

Our resident voice expert and Software Engineer Patrick Cavanagh explains the importance of this month’s Search Query update on Amazon’s Alexa, and why it’s about to make life much easier for voice developers.

 

Fundamentally, the process of interacting with a voice assistant is capturing some user input, processing it and producing a response. Sounds simple!

Unfortunately, it’s more complicated than that. The main issue is the complexity of language: how can we interpret a user’s intent in a meaningful way?

The key abstraction Amazon introduced for building Alexa skills is to boil a user’s input down to an intent, which captures the intended action, plus any associated input data that we care about.

For example, let’s order a coffee. There are many different ways to make that request and many different types of coffee that could be ordered:

‘Get me a Cappuccino’

‘Could I get a Frappuccino?’

‘I’d like an Americano please’

‘Order a Flat White’

This is far from an exhaustive list, but all these phrases can be condensed into an action: ‘Order Coffee’ and the type of drink the user wants.

It’s through this process of considering the action a user wishes to perform, and what information we need to capture, that we can construct our model for how the user interacts with our skill.

In the language of Alexa development, the phrases representing the different ways a user could express a request are called Sample Utterances, while the pieces of data we want to capture are called Slot Values.
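As a rough sketch, the coffee example could be described with an interaction model along the following lines, written here as a Python dictionary that mirrors the JSON shape used for Alexa interaction models. The invocation name, the intent name `OrderCoffeeIntent`, the slot name `Drink` and the type name `DRINK_TYPE` are all made up for illustration.

```python
import json

# Hypothetical interaction model for the coffee example.
# All intent, slot and type names below are illustrative assumptions.
interaction_model = {
    "interactionModel": {
        "languageModel": {
            "invocationName": "coffee shop",
            "intents": [
                {
                    "name": "OrderCoffeeIntent",
                    # The piece of data we want to capture.
                    "slots": [{"name": "Drink", "type": "DRINK_TYPE"}],
                    # Sample Utterances: different ways of saying the same thing.
                    "samples": [
                        "get me a {Drink}",
                        "could i get a {Drink}",
                        "i'd like an {Drink} please",
                        "order a {Drink}",
                    ],
                }
            ],
            "types": [
                {
                    "name": "DRINK_TYPE",
                    "values": [
                        {"name": {"value": "cappuccino"}},
                        {"name": {"value": "frappuccino"}},
                        {"name": {"value": "americano"}},
                        {"name": {"value": "flat white"}},
                    ],
                }
            ],
        }
    }
}

# Serialise to JSON, the form in which a model like this would be uploaded.
model_json = json.dumps(interaction_model, indent=2)
```

All four phrases from the list above collapse into one intent with one slot, which is the whole point of the abstraction.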

There are many different types of data that we can capture in Slots, and some exciting new improvements are making the lives of Alexa developers easier.

Primitive Slots

Amazon provide a series of built-in types that will cover most everyday data that a developer might want to capture.

Any developer who has ever worked with dates will be thankful that this includes Duration, Date and Time types, which conveniently capture natural voice input such as ‘tomorrow’ or ‘8 in the morning’ and provide a date or time in a standard format.

Also covered are numerical inputs, where spoken numbers are converted into easy-to-work-with digits.
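To give a feel for what arrives at the skill, here is a minimal sketch of parsing these slot values in Python. It assumes the values arrive as ISO-8601 style strings (for example ‘2018-02-14’ for a date and ‘08:00’ for a time) and as digit strings for numbers. The helper names are hypothetical, and anything that doesn’t match the simple case (a vaguer answer such as ‘this week’, or an empty slot) is returned as None for the skill to handle.

```python
from datetime import date, datetime, time
from typing import Optional


def parse_date_slot(value: Optional[str]) -> Optional[date]:
    """Parse a full calendar date such as '2018-02-14'; None for anything else."""
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        # Covers empty slots and less specific values (e.g. a week or a month).
        return None


def parse_time_slot(value: Optional[str]) -> Optional[time]:
    """Parse a clock time such as '08:00'; None for anything else."""
    try:
        return datetime.strptime(value, "%H:%M").time()
    except (TypeError, ValueError):
        return None


def parse_number_slot(value: Optional[str]) -> Optional[int]:
    """Parse a whole number delivered as digits, such as '42'."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None
```

Returning None rather than raising keeps the decision about re-prompting the user in the skill’s own logic.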

List Type Slots

Also provided and maintained by Amazon are list types; these capture user input corresponding to a value in a pre-defined list, such as US States.

There is an ever-increasing number of these lists available, but unfortunately not all of them are available in every locale, which shrinks the toolkit, especially for developers outside the US.

We can also extend these lists, say to add our favourite band to ‘AMAZON.MusicGroup’ or our local town to ‘AMAZON.GB_CITY’. Eventually though, the built-in lists aren’t quite going to cut it, which is where the Custom Slot type comes in.

Custom Slots

Here you have the power to create a completely custom list of expected values.

Let’s say you’re making a skill to help users decide where to get a takeaway; you could have a custom list of takeaway types (‘Italian’, ‘Chinese’, ‘Fish and Chips’, ‘Indian’…) which you can use to decide which takeaways you suggest to the user:

 

User: Alexa, open Takeaway helper

Alexa: Takeaway helper here! What type of food are you after?

User: I’m after an Indian

Alexa: Here are the best rated Indian takeaways in your area…

 

The values you supply as examples in your custom slot type should be as exhaustive as possible to try and cover all bases but shouldn’t be treated as an enumeration.

Here the user could ask for a pizza, and whilst the value chosen for a slot type is weighted towards the values provided, it’s not limited to those in the list.

This caveat on custom slot types makes the life of the developer slightly harder, as you always need to account for unexpected user input.
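In practice, accounting for unexpected input might look something like the sketch below. It reads the slot from the request JSON that Alexa sends to the skill (where slot values sit under request, intent, slots) and falls back gracefully when the value is missing or isn’t one of our examples. The slot name `FoodType`, the function names and the response wording are assumptions made for this example.

```python
from typing import Optional

# The example values we listed in our (hypothetical) custom slot type.
KNOWN_TAKEAWAYS = {"italian", "chinese", "fish and chips", "indian"}


def get_slot_value(request_envelope: dict, slot_name: str) -> Optional[str]:
    """Return the spoken value for a slot, or None if it wasn't filled."""
    intent = request_envelope.get("request", {}).get("intent", {})
    slots = intent.get("slots") or {}
    return (slots.get(slot_name) or {}).get("value")


def respond_to_takeaway_request(request_envelope: dict) -> str:
    """Build the reply text, allowing for values outside our example list."""
    food = get_slot_value(request_envelope, "FoodType")
    if not food:
        # Nothing captured at all: ask again.
        return "Sorry, I didn't catch that. What type of food are you after?"
    if food.lower() in KNOWN_TAKEAWAYS:
        return f"Here are the best rated {food} takeaways in your area."
    # A value we never listed (e.g. 'pizza') can still arrive here.
    return f"I don't have {food} on my list yet, but I'll see what I can find."
```

The final branch is the important one: a custom slot is a hint to the recogniser, so the handler can’t assume the value is on the list.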

What if I don’t know what the user is going to say?

The slot types outlined above are perfect when you have a rough idea of the expected user input, but there are plenty of scenarios where it’s just not possible to come up with an exhaustive list.

For example, take a skill that will define words from a dictionary: “Alexa, ask my dictionary to define X”

Amazon generously allow up to 50,000 values in a custom slot type, but even that isn’t quite going to cut the mustard in this case. How can we capture any arbitrary user input, no matter what word the user asks to define?

The answer has already been hinted at in the fact that custom slot types aren’t an enumeration. We can simply provide a non-exhaustive list of example values and rely upon the Alexa service to provide us with what the user actually said – which may or may not actually be on our original list!

This leads us to the understanding that the values provided for a custom slot are actually training data for Alexa’s natural language processing and are simply there as a guide – not a hard and fast rule.

This approach of providing a non-exhaustive example list does work, but coming up with the examples is a laborious process, and it has always felt like an inadequate solution to a very common problem.

Enter AMAZON.SearchQuery

In February 2018, Amazon released AMAZON.SearchQuery, which gives another option for capturing less predictable user input. It’s the first of a new category of slot types called Phrases, hinting at new developments to come!

It’s optimised to capture user input when searching for information, allowing for whole questions or chunks of text to be captured.

In essence, the technique previously employed to reliably capture arbitrary user input is no longer needed – definitely good news.
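For the dictionary skill, a sketch of what this could look like follows: an intent whose slot uses the `AMAZON.SearchQuery` type, with no list of example values to maintain, and a small helper that pulls the captured phrase out of the request. The intent name `DefineWordIntent`, the slot name `Query` and the sample utterances are hypothetical. Each sample here keeps some words around the slot, on the assumption that the slot shouldn’t be the entire utterance.

```python
from typing import Optional

# Hypothetical intent definition for the dictionary skill.
define_word_intent = {
    "name": "DefineWordIntent",
    # No custom type and no example values: the phrase slot captures free text.
    "slots": [{"name": "Query", "type": "AMAZON.SearchQuery"}],
    "samples": [
        "define {Query}",
        "look up {Query}",
        "what is the meaning of {Query}",
    ],
}


def extract_query(request_envelope: dict) -> Optional[str]:
    """Return whatever free text was captured in the Query slot, if any."""
    intent = request_envelope.get("request", {}).get("intent", {})
    slots = intent.get("slots") or {}
    return (slots.get("Query") or {}).get("value")
```

Compared with the custom slot approach, the model is shorter and there is nothing to keep topping up as new words come along.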

With great power comes great responsibility

The ability to capture larger amounts of user input is a powerful tool in the Alexa developer’s arsenal, but one that shouldn’t be used lightly.

Processing language is an inherently difficult problem, and the framework of intents and slots is there to simplify the process of comprehending what a user wants to do.

Search Query strips away some of this framework and puts the onus back onto the developer to decide what the user actually wants.
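To illustrate the kind of work that shifts onto the developer, here is a deliberately naive sketch for the dictionary skill. Given the raw text captured by a search query slot, it strips out filler words to guess which word the user wants defined. The filler list and the function name are assumptions for this example, and a real skill would probably need something more robust.

```python
import re
from typing import Optional

# Words we assume carry no meaning for a 'define X' request (illustrative only).
FILLER_WORDS = {"the", "a", "an", "word", "term", "meaning", "of", "definition", "please"}


def extract_headword(query: str) -> Optional[str]:
    """Guess the word or phrase to define from free-form captured text."""
    # Lower-case the text and split it into simple word tokens.
    tokens = re.findall(r"[a-z']+", query.lower())
    # Drop the filler and keep whatever is left, in its original order.
    remaining = [token for token in tokens if token not in FILLER_WORDS]
    return " ".join(remaining) or None
```

For instance, ‘the meaning of the word serendipity’ reduces to ‘serendipity’, while input made up entirely of filler returns None, at which point the skill would need to ask the user to try again.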