Each of us has come across voice interfaces: a robot announcing that a red KIA is pulling up, an elevator calling out a floor number, a navigator telling you to turn right now – somebody has to craft all those words, right? Therein lies a new direction for interface designers – the design of voice interfaces.
This article focuses on more intelligent systems, such as voice assistants or smart homes. Here, Yuriy Uchanov, a UI/UX designer at Moqod, together with his colleague Slava Todavchich, explains how they work.
What Is VUI
A voice interface (VI, or VUI – voice user interface) is an evolution of interaction that frees the hands and eyes and simplifies inputting and receiving information – for example, when we’re driving a car or performing surgery and at the same moment want to know how old Demi Moore is.
In the past few years, voice interaction has been developing by leaps and bounds. Today, 20% of all Google search queries on mobile devices are made by voice. According to Gartner, 30% of web browsing sessions will take place without a screen by 2020. Even now you can ask for the weather forecast, turn on the lights in the living room, or order pizza. In the future, the possibilities seem almost limitless.
Components of Voice Interface
What characterizes the voice interface, and how does it differ from a common visual one? Specialists from the Nielsen Norman Group have identified five basic voice user interface technologies:
- Voice input: requests are pronounced by voice instead of being entered via a keyboard or graphic elements of the screen interface.
- Natural language: users should not be limited to a specific vocabulary or computer-optimized dictionary, but should be able to phrase the input any way they like, as if it were a conversation with a human.
- Voice output: information is pronounced by voice instead of being displayed on the screen.
- Intellectual interpretation: for a correct understanding of user requests, a VI should use additional information, such as a context of use or actions that the user performed before.
- Facilitation: to complete the user’s task, the VI performs necessary actions which weren’t requested by the user.
Not all voice interfaces use all five items simultaneously. For example, virtual keyboards on mobile devices offer only voice input, while voice assistants sometimes display information on the screen instead of speaking it.
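The five technologies above can be thought of as stages of a pipeline. Below is a toy Python sketch of how they might fit together; all function names, the canned weather answer, and the context dictionary are illustrative, and real ASR/TTS engines are stubbed out with plain strings:

```python
# Toy sketch of the five VUI stages; ASR and TTS are stubbed with plain text.
CONTEXT = {"last_city": "Amsterdam"}  # prior state for intelligent interpretation

def recognize_speech(audio: str) -> str:
    """Voice input: a real system would run speech recognition here."""
    return audio

def parse_intent(utterance: str) -> dict:
    """Natural language: map a free-form phrase to an intent, very naively."""
    if "weather" in utterance.lower():
        words = utterance.split()
        # Intelligent interpretation: fall back to context if no city is named.
        city = words[-1] if words[-1][0].isupper() else CONTEXT["last_city"]
        return {"intent": "weather", "city": city.rstrip("?")}
    return {"intent": "unknown"}

def fulfill(intent: dict) -> str:
    """Facilitation: perform the implied action the user did not spell out."""
    if intent["intent"] == "weather":
        CONTEXT["last_city"] = intent["city"]  # update context for the next turn
        return f"It is 18 degrees in {intent['city']}."  # canned stub answer
    return "Sorry, I did not understand that."

def speak(text: str) -> str:
    """Voice output: a real system would call a TTS engine here."""
    return text

def handle(audio: str) -> str:
    return speak(fulfill(parse_intent(recognize_speech(audio))))
```

Note how the second turn in a dialogue ("and the weather tomorrow?") can omit the city entirely: the context dictionary stands in for the "actions the user performed before" that real intelligent interpretation relies on.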
When all five features are integrated, we get an interaction with two significant advantages:
- The ability to state a goal in natural language. It is no longer necessary to study the interface and press buttons.
- The ability to predict the user’s goals and to suggest them based on contextual information or previous actions.
Illustration by Yuriy Uchanov
The combination and integration of all five basic technologies is a prerequisite for creating an interface that requires no input at all. Although we are still very far from designing an interface that reads people’s thoughts, voice assistants – primarily Alexa, Google Assistant, and Siri – are the first step in that direction.
Almost all of us have used voice assistants at least once – at minimum, the ones built into our smartphones. We have some idea of what they are and what they may be useful for. A study from the Nielsen Norman Group revealed the current state of the assistant market and the advantages and disadvantages of VI in its modern implementation. Here are some of its results.
The study showed that voice assistants poorly meet all five criteria of voice interfaces and their integration. Their usability is close to useless for even slightly complex interactions. Contrary to the assumptions of human-oriented design, users have to think about when the voice assistant will be useful, when it is better not to use it, and how to word their queries. And that despite the fact that the original premise was that the computer should adapt to the person, not vice versa.
Below is a look at how assistants coped with the criteria of the voice interface and what may be improved in the future.
The majority of users who participated in the study of voice assistants mentioned that they use them mainly in two cases:
- When their hands are occupied, for example, while driving or cooking;
- When it seems to them that asking a question by voice will be faster than typing it on the keyboard and reading the answer.
Almost everyone has a clear picture of the assistants’ capabilities and rarely uses them for complicated queries, preferring web browsers instead. Users feel that queries with one clear answer will return correct results. Some believe assistants can accomplish a sophisticated task, but that to do so they need to simplify their queries and think carefully about the wording. Most conclude that figuring out how to ask a question properly is not worth the effort.
A relevant area where voice assistants substantively help facilitate interaction is text dictation: long messages or search queries, especially on mobile devices. Dictation seems to be a faster and more convenient alternative to on-screen keyboards. But even here there are problems with the recognition of specific terms and names and with the insertion of correct punctuation.
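Part of the punctuation problem is a post-processing step: raw transcripts arrive as a flat stream of words, and the engine must decide what the user meant as text versus as a command. A toy sketch of one such clean-up pass (the word-to-symbol table and the naive capitalization rule are illustrative, not any real engine’s logic):

```python
# Toy post-processor for dictated text: turns spoken punctuation words into
# symbols, one of the clean-up passes a real dictation engine performs.
SPOKEN_PUNCTUATION = {
    " comma": ",",
    " period": ".",
    " question mark": "?",
    " exclamation mark": "!",
}

def punctuate(dictated: str) -> str:
    text = dictated
    for spoken, symbol in SPOKEN_PUNCTUATION.items():
        text = text.replace(spoken, symbol)
    # Capitalize the first letter of each sentence, very naively.
    sentences = [s.strip() for s in text.split(". ")]
    return ". ".join(s[:1].upper() + s[1:] for s in sentences if s)
```

Even this toy shows why dictation is hard: the same word ("period") can be content or command, and proper names ("anna") stay uncapitalized because nothing in the transcript marks them as names.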
Design of Voice Interfaces
To solve the problems of VI in its current implementation, it is important to find the right design approach. Voice control is a verbal process, a communication with a machine. For a good voice interface, this communication should feel as natural as talking to a real person. Developing such systems is more about psychology and understanding the specifics of human reasoning.
Konstantin Samoilov from the Google Voice Interface Research team described the specifics of VI development in his talk. Here is what to consider when developing them and which principles to adhere to.
Trust is not a technical issue, but if it is not earned, the rest of the work will be in vain. Without trust, the user simply won’t use the VI for even remotely significant tasks. First we learn how well the system copes, and only then do we begin delegating tasks to it.
It is not easy to make an interface that the user would trust even with a task as simple as setting an alarm clock. It is one thing to oversleep Saturday’s breakfast, and quite another to oversleep a flight. If a person does not understand how badly the system can get it wrong, he or she simply won’t use it.
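One practical way to build trust is to make the cost of an error part of the dialog design: echo the interpreted command back and ask for explicit confirmation only when the stakes are high, so the user can catch a misrecognition before it matters. A minimal sketch, assuming a hypothetical set of task names:

```python
# Sketch: confirm explicitly only when a recognition error would be costly.
# Task names here are illustrative, not any assistant's real command set.
HIGH_STAKES = {"set_flight_alarm", "transfer_money", "delete_all_photos"}

def respond(task: str, parameters: dict) -> str:
    if task in HIGH_STAKES:
        # Echo the interpreted command back so the user can catch errors.
        details = ", ".join(f"{k}={v}" for k, v in parameters.items())
        return f"Just to confirm: {task} with {details}. Shall I proceed?"
    return f"Done: {task}."
```

The design trade-off is exactly the one in the alarm-clock example: confirming everything is annoying, confirming nothing is dangerous, so the threshold lives in the dialog design rather than the recognizer.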
Invisibility is the fundamental difference of the voice interface. We cannot see the interface elements, which part of the interface we are in, or which step we are at in a given moment.
Each user has his or her own mental model of what the system is capable of. Essentially, it replaces the visual components of the interface. Each system response to the user’s actions changes this mental model, and for the VI to work, it is essential to help the user adjust the model as needed.
Mental model adjustment
When the system asks questions that invite only simple answers, for example yes/no, the user concludes that it is rather primitive and formulates all subsequent commands and responses accordingly.
If the system asks questions whose answers can be phrased any way the user wants and still be understood, the user will build all subsequent interactions with the system at that same level.
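The two levels above can be contrasted in code. A toy confirmation parser that accepts natural phrasings, rather than demanding a literal "yes" or "no", nudges the user’s mental model toward the flexible end; the phrase lists are illustrative:

```python
# Sketch: a confirmation parser that accepts natural phrasings instead of a
# strict yes/no. The phrase lists are toy examples, not a real NLU lexicon.
AFFIRMATIVE = {"yes", "yeah", "yep", "sure", "of course", "go ahead", "ok"}
NEGATIVE = {"nope", "nah", "cancel", "forget it", "don't", "no way", "no"}

def parse_confirmation(answer: str):
    """Return True/False when the intent is clear, None when it is not."""
    text = answer.lower().strip(" .!?")
    if any(phrase in text for phrase in NEGATIVE):
        return False
    if any(phrase in text for phrase in AFFIRMATIVE):
        return True
    return None  # ambiguous: the dialog should re-prompt, not guess
```

Returning `None` for an ambiguous answer matters as much as the flexible matching: re-prompting tells the user the system listened but did not understand, which adjusts the mental model without punishing the free-form phrasing.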
To make interaction with a VI natural, we would need to understand why communication with other people feels natural. But the problem is that we don’t know. Why does conversation with some people feel more natural than with others? What features cause that? Without knowing, it is impossible to build it into a system.
A possible way out is a system that, by receiving feedback, identifies what was done correctly and what could have been done differently, and gradually figures out which characteristics are essential for natural interaction.
Modern implementations of VI can imitate personality traits – friendliness, a sense of humor, intelligence, and others. These are quite diverse characteristics, and different companies approach their implementation differently.
Siri is a project of a company whose ideology is: everything should just work. And everything really does work, provided the user guesses right with grammar and vocabulary. If he doesn’t, the system simply stops working, with no indication of what went wrong or how to change its behavior.
At the same time, great emphasis is placed on personality. The voice quality, jokes, and funny comments when performing common tasks are sometimes really impressive. It creates a feeling that we are dealing with a person. The user relaxes and tries to interact with Siri as with a person. But when the system starts reacting differently than expected, the impression drops dramatically. The user feels that his actions are being disapproved of, or that he is simply being laughed at. And that is much worse than if he had perceived it as a machine from the start.
At Google, they considered it safer not to imitate a personality, but to show that the user is simply dealing with a high-tech software product that does not even have a name (“OK, Google”).
Voice Interfaces for Business
Nowadays, voice interfaces help not only ordinary users but also businesses complete their tasks.
As for sales through VI, according to Voicebot.ai, 26% of “smart” speaker owners have made a purchase with their help at least once, and about 16% do so monthly. Yet in the majority of cases these are basic consumer goods or services that do not require studying reviews or photos, or comparing prices across suppliers – for example, ordering food or buying subscriptions to audio/video services.
Companies typically create their own “skills” – commands that let users interact with the company’s services through voice assistants. For example, “Alice” from Yandex can already be used to search for tickets, order flower or grocery delivery, look for job vacancies, play simple games, and much more. With the same “skills,” companies use assistants as consultants; as a result, clients receive help instantly, without wading through search results.
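Platforms differ in detail, but a “skill” typically boils down to a webhook: the assistant does the speech recognition and intent parsing, then hands the company’s code a structured request. A minimal sketch of such a handler – the intent names, slot keys, and request shape here are illustrative, not any platform’s real API:

```python
# Sketch of a skill webhook: the assistant sends a parsed intent plus "slots"
# (extracted parameters); the company's code only dispatches and replies.
def order_flowers(slots: dict) -> str:
    return f"Ordering {slots.get('bouquet', 'a bouquet')} to {slots.get('address', 'your address')}."

def find_tickets(slots: dict) -> str:
    return f"Searching tickets to {slots.get('city', 'anywhere')}."

INTENT_HANDLERS = {
    "OrderFlowers": order_flowers,
    "FindTickets": find_tickets,
}

def handle_request(request: dict) -> str:
    """Dispatch the assistant's parsed intent to the company's own logic."""
    handler = INTENT_HANDLERS.get(request.get("intent"))
    if handler is None:
        return "Sorry, this skill can't do that yet."
    return handler(request.get("slots", {}))
```

The division of labor is the point: the hard parts (recognition, natural language, context) stay on the assistant’s side, while the business only maps a handful of named intents to its existing services.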
One of the crucial questions concerns advertising: will voice assistants start being monetized? This is, in fact, a new promotion channel, and it is still unclear how to use it. We have grown accustomed to mentally “filtering” visual advertising – the so-called “banner blindness,” when we simply do not notice anything that looks like a banner or contextual ad, and this requires no effort. But what will the reaction be if a voice dialogue with the computer is interrupted by advertising?
In addition to skills, some companies choose another way to use VI in their business – developing their own software. That usually happens when voice assistants cannot be used: for example, a taxi dispatch service that users call from a regular phone. When a very high level of confidentiality is required, it is also better to avoid voice assistants, since the data goes to third-party companies’ servers.
Future of Voice Interfaces
Soon, voice interaction will become common in almost all areas of activity. Devices that can recognize and generate speech are rapidly getting cheaper thanks to the development of voice assistants and the global spread of the Internet. However, these will mostly be highly specialized use cases – where the user understands, for example, that there is no point asking an automated ice-cream kiosk for the weather forecast.
There will be no shortage of attempts to imitate the ability of voice assistants to answer any question or perform any action that we can already perform through a visual interface. However, it is unlikely to work exactly as we imagine. Even in conversations with other people we often face misunderstanding, let alone when talking to a machine. The problem of creating “real” artificial intelligence, which would completely solve all the issues of voice interaction, is connected with this: we simply do not fully understand how the brain and the human mind work.