Making an action for Google Assistant

Google Assistant is one of the most powerful platforms out there. Google's enormous scale and its prowess in machine learning give the Assistant a fair shot at becoming the next iTunes. As designers and developers we have to think about its impact on user behavior. Several implications come to my mind:

  1. Assistant is yet another way to reach our customers without them having to download our apps. Since it is on every smartphone out there, it could at the very least be a great entry point to lure customers onto our platforms, for example by giving a preview of our inventory.
  2. Google Assistant can be a part of, or talk to, any number of IoT devices: TVs, fridges, washing machines, speakers and what not. So the responsive design universe has grown to include everything from giant screens to no screen at all.
  3. Not just devices; our products should also be responsive to different modalities of input. Speech and visuals have joined touch as mainstream ways of interacting with technology, and we have to understand how to leverage the strengths of each mode.
  4. The idea of arranging content on your site so that users can find their way around is old. Users want the thing they are looking for to come to them instead of having to poke around. This is the kind of experience the Assistant provides, and it poses several challenges which we'll discuss in part two.

With these in mind I started my project to build a PayPal app on Google Assistant. Apps are called "actions" on Assistant, so we'll use that term to avoid any confusion. My objective was to build a simple PayPal action that would help people send money to someone, request money from someone and check their PayPal balance. Doing this was extremely fun and I learnt a lot about how it is done technically and, more importantly, what it takes to create a good experience in a conversational UI. This is a documentation of that project. I have divided it into two parts: the first deals with the technical stuff and part two is all about the user experience.

Part 1: The technical stuff

So how do you make an action on Google? In terms of programming language I used JavaScript and Node.js. Google has a nice Node package that sorts many things out for you; we'll come to that shortly. But before any of the coding we have to understand what actually happens when we tell the Assistant to do something. This diagram explains just that:

So basically you have to do two things:

  1. Make an api.ai agent. This is a super simple process that requires no coding; everything can be done from the api.ai console. The agent acts as a mediator between your code and Google. It takes user input from Google in textual form, parses the input to understand the "intent" and "parameters" (explained later), passes this information to our Node code, takes the response from our Node code and passes it along to Google.
  2. Make a Node webhook. Basic JavaScript and Node knowledge is required for this. Once the agent communicates what the user wants us to do, the webhook is where we handle that request and configure appropriate responses.

It will be easier if we work with an example. Since I can't put the actual PayPal code here, let's build some other action together. Movies are fun, and building an action that lets users ask questions about movies will also give us a chance to explore some of the limitations of NLP in api.ai. There are countless possible use cases around a movie database, so let's build a simple action that lets users do the following:

  1. Ask for a movie recommendation from a specific year. For example, "Suggest a movie from 1980".
  2. Limit the genre of the movie. For example, "Suggest a horror movie from 1999".
  3. Ask for the plot of the recommended movie. For example, "What was this movie about?"
  4. Ask for more recommendations. For example, "Next", "Previous".

Let's get started.

Step 1: Create a project on the Google and api.ai consoles
  1. Create a new project on https://console.developers.google.com
  2. Add the Google Actions API to the project
  3. Import the project on https://console.actions.google.com
  4. Choose API.AI from step 1 in Overview
  5. Give the project a name and a description, and save
Step 2: Create the agent on api.ai

This is how your agent should look right now:

As discussed earlier, the api.ai agent acts as an interface between Google Assistant and our JS code. There are a few api.ai concepts that we should know in order to create these agents. These are:

  1. Intents and Actions
  2. Entities and Parameters
  3. Contexts
  4. Session

All these are nicely explained here: https://api.ai/docs. I’ll try to give a brief explanation below:

Intents and Actions

Like the name suggests, an intent captures the intent behind the user's input. The way api.ai understands intent is that you, as a developer, decide everything the user could possibly want from your action, in other words the use cases. For every use case, create an intent and feed in all the possible ways in which a user might express it. The more examples you feed in, the better the agent becomes at understanding sentences you didn't feed it. In your code you will address the intent not by its actual name but by the name you gave it in the "action" field. I generally use the same name for both to avoid confusion.

Entities and Parameters

When the user says "Show me a movie from 1980", we want the agent to understand (a) that the user is asking for a movie recommendation from a specific year and (b) that the specific year is 1980. The first part is taken care of by the intent definition. For the second part, while defining intents in api.ai we can ask the agent to look for certain entities in the user input, such as a year or a date. The values of these entities are then passed to our JS as parameters. In the example above, 1980 is the value of the entity @sys.date-period, which can be passed on to our JS under a key or variable name of our choosing; we define that as the parameter name. A quick side note: @sys.date-period is a system-defined entity, and many such entities come pre-defined with api.ai. We can define our own entities as well. One common example is when we want users to make a yes or no decision: we can ask the agent to listen for words like "Yes", "Yep", "No", "Nope" etc. and pass the result as a parameter called "user_confirmation".
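To make this concrete, here is roughly (and in simplified form) what the agent ends up passing to our webhook for such a request. The action name "recommend_movie" is just a placeholder for whatever we type in the intent's action field:

```json
{
  "result": {
    "resolvedQuery": "Show me a horror movie from 1980",
    "action": "recommend_movie",
    "parameters": {
      "date-period": "1980-01-01/1980-12-31",
      "genre": "Horror"
    },
    "contexts": []
  },
  "sessionId": "1234567890"
}
```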

Contexts

Contexts are what make a conversation possible. You can do a back and forth with the user if you remember which parameters were set by their previous inputs, so they don't have to repeat themselves. The way it works is that you can start a context at any point in the conversation with a specific lifespan, and the agent will store all the parameters set by user inputs from that point until the end of the lifespan under that context. The lifespan is simply the number of user inputs for which the context stays active, and multiple contexts can be active at the same time. We'll see how to access parameters in a specific context when we code. Another use of contexts is limiting which intents can be triggered: in api.ai you can set input contexts, so that an intent is triggered only when that context is active, and output contexts, i.e. contexts that get set when a certain intent is triggered.
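As a quick preview of what this looks like in code (a minimal sketch using the actions-on-google client library that we set up in Step 3; the context and parameter names are just examples):

```javascript
// Inside one intent handler: start a context that stays alive for the
// next 5 user inputs and stash the chosen genre in it.
app.setContext('movie_search', 5, { genre: 'Horror' });

// Inside a later handler: read a parameter back from that context.
const genreArg = app.getContextArgument('movie_search', 'genre');
const genre = genreArg ? genreArg.value : null;
```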

Session

When a user initiates a conversation with the Assistant, a new session is created. While contexts help you keep track of parameters across inputs, sessions come in handy when you have to keep track of parameters across contexts. The good thing is that as a developer you know exactly when a session starts, and you can decide when to end it. This is a great asset: you can set and reset important global variables and use them across functions. We'll look at this later when we code.
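With the actions-on-google library, the usual home for this kind of session-scoped state is the app.data object, which survives across turns until the session ends (again just a sketch; we'll put it to real use in Step 3):

```javascript
// Store something for the rest of the session...
app.data.lastMovieTitle = 'The Shining';

// ...and read it back from any later handler in the same session.
const lastTitle = app.data.lastMovieTitle;
```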

Back to the movie assistant

With these concepts in mind, let's start creating the agent for our movie action. There are three use cases for our action:

  1. User wants a movie recommendation for a particular year and (maybe) a specific genre.
  2. User wants to listen to the plot of the recommended movie.
  3. User wants more recommendations for the same criteria.

We know that the parameters we'll need are year and genre. There is a system-defined entity for the year, but we have to make our own entity for genre. The API we'll use for the movie recommendation part is The Movie Database. It supports the following genres: Action, Adventure, Animation, Comedy, Documentary, Drama, Family, Fantasy, History, Horror, Music, Mystery, Romance, Science Fiction, TV Movie, Thriller, War and Western. So we can make an entity called "genre" like so:
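If you export the finished entity from api.ai as JSON, it looks roughly like this (simplified to a couple of entries; the synonym lists are just examples):

```json
{
  "name": "genre",
  "entries": [
    { "value": "Horror", "synonyms": ["Horror", "horror movie", "scary movie"] },
    { "value": "Science Fiction", "synonyms": ["Science Fiction", "sci-fi", "scifi"] }
  ]
}
```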

See how we can set synonyms that map to the same value for the "genre" entity.

Next, let's define our first intent: the user wants a movie recommendation for a particular year and (maybe) a specific genre. Here is how to do it:

Give it as many example statements as users might say.

Give the action a name. We need two parameters, date-period and genre. Date-period is required, so we can use the prompt option to ask a follow-up question if the user did not mention a date in their original request.

Similarly, let's create the remaining two intents for getting the plot summary and the next recommendation. Let's also edit the welcome intent to say an appropriate welcome message.

No parameters are needed for this.

No parameters are required for this either.

Not all requests need to be handled by our webhook; the response to the welcome intent can be given through the agent itself.

All right, the agent is ready to hear and translate user requests the way we need it to. Let's move to step three and start coding to address these requests.

Step 3: Creating the webhook

The webhook has to be hosted on an HTTPS server; Heroku is a free and easy way to do that. Create a new Node project and push it to Heroku. Don't forget to add the Procfile. You can see my code in this Git repository.
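The Procfile is just a one-line file that tells Heroku how to start the server; assuming your entry file is called app.js, it would look like this:

```
web: node app.js
```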

But this is essentially the starter code:
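Roughly, it looks like this (a minimal sketch rather than the exact code, using the v1 actions-on-google package's ApiAiApp class along with Express, body-parser and request):

```javascript
'use strict';

const express = require('express');
const bodyParser = require('body-parser');
const request = require('request');
const ApiAiApp = require('actions-on-google').ApiAiApp;

const server = express();
server.use(bodyParser.json());

// api.ai will POST every matched intent to this endpoint.
server.post('/', (req, res) => {
  const app = new ApiAiApp({ request: req, response: res });
  // The action map and intent handlers (shown below) go here.
});

server.listen(process.env.PORT || 8080, () => {
  console.log('Webhook is listening');
});
```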

As you can see, we'll use the "actions-on-google" Node package to make the action. Apart from that we'll create an Express web server, and use body-parser and request to handle our API calls to The Movie Database.

Now we need to enable webhooks in the api.ai console, like so:

Also, we have to check "Use webhook" in the intents that we created, like so:

All right, with this you should be able to see the requests coming into the Heroku logs when you use the "Try it now" feature in the api.ai console. Now let's use the Node package to send a proper response back. Here is the most important link for understanding the ApiAiApp class of the package: https://developers.google.com/actions/reference/nodejs/ApiAiApp

Here is the basic setup to handle all our intents:
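Again as a sketch (the action names are placeholders; use whatever you typed in each intent's action field), the following goes inside the POST handler from the starter code, where app is the ApiAiApp instance:

```javascript
// Map every api.ai action name to a handler function, then hand the
// request off to the library.
const RECOMMEND_MOVIE = 'recommend_movie';
const MOVIE_PLOT = 'movie_plot';
const NEXT_MOVIE = 'next_movie';

function recommendMovie(app) {
  // Look up a movie (see the next section), then keep the session open.
  app.ask('How about The Shining? Say "next" for another suggestion, or ask what it is about.');
}

function moviePlot(app) {
  app.ask('Here is the plot summary...');
}

function nextMovie(app) {
  app.ask('Here is another recommendation...');
}

const actionMap = new Map();
actionMap.set(RECOMMEND_MOVIE, recommendMovie);
actionMap.set(MOVIE_PLOT, moviePlot);
actionMap.set(NEXT_MOVIE, nextMovie);

app.handleRequest(actionMap);
```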

This is what we have to do for every intent: map the intent, through its action name, to a function in our JS, and then use app.ask or app.tell to respond back. Note the important difference between the two: app.tell ends the session, while app.ask keeps it alive.

That's it! Now it's all about taking each intent and handling it in our JS, which involves talking to The Movie Database APIs.
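For example, the recommendation handler could be fleshed out like this (a sketch, not the exact code; it assumes a TMDB_API_KEY environment variable and a hypothetical genreIds lookup that maps our "genre" entity values to The Movie Database's numeric genre ids):

```javascript
// A fuller version of recommendMovie from the action map above.
function recommendMovie(app) {
  // api.ai passes the matched entities to us as arguments.
  // @sys.date-period arrives as a range like "1980-01-01/1980-12-31".
  const datePeriod = app.getArgument('date-period');
  const genre = app.getArgument('genre');
  const year = datePeriod ? datePeriod.split('-')[0] : '';

  let url = 'https://api.themoviedb.org/3/discover/movie' +
            '?api_key=' + process.env.TMDB_API_KEY +
            '&sort_by=popularity.desc' +
            '&primary_release_year=' + year;
  if (genre) {
    url += '&with_genres=' + genreIds[genre]; // hypothetical name-to-id map
  }

  request(url, (error, response, body) => {
    if (error) {
      app.tell('Sorry, I could not reach the movie database right now.');
      return;
    }
    const results = JSON.parse(body).results || [];
    if (results.length === 0) {
      app.ask('I could not find anything for that. Want to try another year?');
      return;
    }
    // Remember the list and our position in it for the "plot" and "next"
    // intents, using the session-scoped app.data object.
    app.data.movies = results;
    app.data.index = 0;
    app.ask('How about ' + results[0].title + '? Say "next" for another suggestion, or ask what it is about.');
  });
}
```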

Step 4: Testing it on a device

A great thing about the api.ai agent is that once you build it, it can be integrated with multiple devices and bot platforms. Go to the Integrations tab in your api.ai console. You'll see something like this:

Once you are ready to test your action on a Google Assistant enabled device, just go to the Simulator in the Actions console. It looks like this:

Here’s what mine looks like:

Note how:

  1. It sometimes has trouble understanding my accent.
  2. I cannot cut it off midway and give a different command.
  3. No one makes movie decisions like this. The idea of the exercise was to get the technical stuff in place; I have written about the UX of voice UI and Google Assistant in part 2.
  4. You can find my complete code here: https://github.com/kshivanku/movieAssistant

I forgot to mention: to test your action on any Google Assistant enabled platform, all you have to do is log in with the same email ID that you used to make the action.

Part 2: The UX of Voice UI

Understanding the technology behind building an assistant app is a lot of fun, but it is only half the battle. To make something that the world would want to use, we have to understand how to design for voice. There are several articles out there about best practices, "10 things you should keep in mind while designing for voice UI" and so on. They are all great and I would encourage everyone to go through them. My aim is not to write something as comprehensive as that; I just had a few interesting observations while building the PayPal action and wanted to share them. So here they are:

  1. Onboarding is tricky. What is the first thing you do when you download a new app or visit a new website? You poke around to see what's in it. That option is not available in a voice UI. People treat voice UI like an alien object, unsure of what it can do or how to start communicating with it. It is like the early days of the internet, and designers have to find creative ways to solve this problem until someone solves it for everyone.
  2. Voice UI is an open slate and users will always bump into the edges of your action's limitations. You cannot prevent users from making a demand that your action cannot fulfill, so it becomes very important to be mindful not only of the use cases you want to address but also of the ones you don't. For example, in the PayPal action it was natural for people to say something like "send $5 to Ben and $10 to Jerry". While in the app the same user would naturally go through the flow twice to make these two transactions, voice UI somehow sets up an expectation that they can do two transactions with one command. As designers we have to be mindful of this behavior and design our responses accordingly. But at the end of the day, technological limitations are huge and very real. You cannot design for every edge case, and there will be times when your action has no idea what the user is saying and ends up giving a generic error message. The lack of a proper error message leaves the user confused about what the problem is, and the whole experience gets frustrating.
  3. You are making a personal assistant and people want it to behave like one. There is something about the affordance of voice that makes users expect a very personal experience. For example, while user testing the PayPal action, one user gave the command "Send money to my daughter". For someone talking to their personal assistant, this is a legitimate demand. The same goes for dealing with contacts: users don't want their assistant to get confused every time there are multiple contacts with the same name. If the user says "Message Ajay I'll be 10 minutes late for dinner", a personal assistant should be able to deduce which "Ajay" the user is talking about in most cases, no matter how many people named Ajay are in the contact list. While this capability is a nice-to-have in a screen-based UI, it becomes critical in a voice UI, because there is no good way for the user to choose from a list of 10 items by voice.
  4. This brings me to the next point: the importance of a personal assistant evolving and becoming smarter with use. This is not specific to voice UI, but an AI-first approach to personal assistants is a must. Your users don't want to go through the "50 First Dates" experience with your action. Interactions with your personal assistant have to grow familiar over time.
  5. All the basics remain the same. It is easy to forget that we are not changing human nature just by changing the UI to voice. So when you are taking your app from a screen-based UI to a voice UI, remember that the user is still looking for the same confirmations and micro-interactions. A very good example of this is "recognition rather than recall". We all know this principle; that's why we give users the option to upload a display picture, so that when other users are looking for them they can recognize the picture rather than recall the full name. This principle obviously holds for voice UI as well. For example, while user testing the PayPal action, I asked every user one question: would you trust this action to carry out large transactions? The answer was no, every time. That was because the action failed the "recognition rather than recall" principle. In the absence of a visual confirmation, users had no way to be absolutely sure that they had selected the right person to send their money to.

These are challenges that I am still working through. There is no rule book for solving any of these problems yet, which makes it all very exciting.

All right, that's it. Hope you make great Google Assistant actions!
