Technical Thoughts: Sentieo’s Alexa Skill and the Three Fundamental Laws of Voice User Experience $AMZN $GOOGL $AAPL

Sentieo’s Alexa Skill is live! We present some thoughts from our technical team recapping our experiences for the benefit of those who are keen on considering the future of computer interfaces.

For Voice User Interfaces (VUIs) to have any chance of success, the future direction of Voice User Experience (VUX) will be strongly tied to physical, not software, constraints.

These constraints translate into three requirements:

1) At least 100 words per minute (wpm) of input

2) Close to 200wpm of output

3) Under 250ms response time

We are nowhere close.

Voice User Experience

We have just updated the Sentieo Skill on the Alexa Skill Store where it now ranks among the best Finance skills (higher if you strip out all the bitcoin noise). We thought we might share a few thoughts on our experiences redesigning the Sentieo experience, which was first created for Desktop and then Mobile, for a radically different interface.

The linguistic abstraction infrastructure is finally in place to separate voice software engineering (executing specific intents with data and integrations) from language processing (a natural monopoly parsing general human speech to specific intent and vice versa), and supportive hardware has arrived as a plus. There will undoubtedly be a wealth of development, with the ecosystem benefits deservedly accruing to Amazon. Ours was roughly the 2,000th skill to hit the Alexa store, 3 months after the store crossed the 1,000 mark.

This, together with the recent attention on chatbots, has predictably prompted all sorts of manic speculation, including “the Death of the GUI”, but that discussion is premature until key issues in the development of the Voice User Interface are addressed. Simply put, apart from simple hands-free convenience, we haven’t figured out where the VUI absolutely dominates. You see this when your bank gives you the option to “Speak to a human representative”.

In this very real sense, the VUI is a solution in search of a problem. We aren’t even very good at the solution yet: we are terrible at transcribing accents and abbreviations; context management and intent disambiguation are a mess; input mappings are naturally many-to-one while output tends toward one-to-one; and we haven’t even tried our hand at “nontextual” verbal data like voice recognition (multiple speakers), sarcasm, humor, and tone. And let’s not even talk about privacy issues in practical implementations.

There is a common implicit assumption that those problems go away as research and infrastructure in language processing improve, but in fact the meta problems endemic to voice software engineering are perhaps even harder to solve because they run into physical “laws”. Even if we do the voice equivalent of assuming a spherical frictionless cow, and assume every utterance is perfectly translated to intent, there are still terminally intractable problems with the field of voice that, for want of a better term, we will call UI efficiency (although there are formal definitions of this).

Why We Want UI Efficiency

A quick, oversimplified review of user interface history through the lens of efficiency:

  • We moved from punch cards to command lines because virtual punch cards were more quickly iterative and thus more efficient for input than physical punch cards. (As a bonus, they were less prone to corruption…)
  • We moved from command lines to graphical interfaces because inputting information in two dimensions is more efficient than inputting information in one. (As a bonus, they changed commands from memorized text to thoughtfully placed buttons, spawning an entire field of design.)
  • We moved from graphical interfaces to touch interfaces because they removed an unnatural translation (moving my hand on the x-y plane moves the cursor on the x-z plane) and are cognitively more efficient. (As a bonus, the lack of stylus or mouse helped get us mobile.)

You see where this is going, and what question we will have to answer in a post-touch world. Every iteration is more efficient and accompanied by an order of magnitude change in input friendliness. There is de minimis tradeoff and the new UI dominates in basically every metric each time.

Implications of the UI Efficiency framework

In this light, we understand that chatbots are really irrelevant as they represent two steps backward to command lines AND have language processing issues, creating a very high bar for truly structural UI efficiency. But they may be a fantastic test case for improving language processing rapidly and at almost no cost, and that is not worth nothing.

We have chosen here to stress inputs because visualization was pretty much always two dimensional from the outset. However, output efficiency is also likely to increase in relevance going forward as technologies improve in the audio and visual spheres.

Understanding that it is input efficiency that drives mass adoption and “killer apps” means that for the VUI to get anywhere we have to figure out 1) what exactly the efficiency improvement is and 2) what the step change benefit will be. Our view on 1) is obscured by pesky natural language issues and for 2) our best answer is hands-free and eyes-free interaction.

Ironically, it should be blindingly evident that one of the biggest drivers of benefit for 2) is about to go away. The biggest use case for voice interaction is while manually driving. This use case will diminish in direct proportion to the adoption rate of autonomous vehicles.

While we wait for a better 2), we are left with a thought experiment on 1): what is the upper limit on input and output efficiency in a VUI and what conditions must exist to get there? In other words, what is the ideal Voice User Experience if UI Efficiency alone dictated success or failure?

We are fully aware of the futility of pinning numbers on future unknowns but we’re going to just try so that we can have an idea of the magnitude of improvement.

VUX Law #1: Maximize Natural Input - VUI input speed must be at least 50% HIGHER than existing UI

Here are some facts to know (all rough estimates for average anglophones, easily searchable so left uncited):

  • The average writing speed is 25wpm
  • The average typing speed is 40wpm
  • The average talking speed is 100wpm

The important thing to note about these speeds is that unlike physical laws, it is as painful to go DOWN in speed as it is to go up. We absolutely CAN slow down our speech 60% for better machine comprehension. Is it great UX? Hell, no.

So not only is voice input potentially much faster than typing, it HAS to be much faster than typing. Incidentally, this means that voice software engineering will tend naturally toward machine learning since we can use the wealth of data to arrive at better outcomes than deterministic logic trees. But that’s nothing new.
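To get a feel for the magnitudes, here is a minimal Python sketch of the arithmetic behind these figures. The 20-word query length and the function names are assumptions made up for illustration; the wpm values are the rough averages listed above.

```python
# Rough average speeds quoted above, in words per minute (wpm).
SPEEDS_WPM = {"writing": 25, "typing": 40, "talking": 100}

# Hypothetical query length, chosen only to make the comparison concrete.
QUERY_WORDS = 20


def seconds_to_enter(words: int, wpm: float) -> float:
    """Time in seconds needed to input `words` words at `wpm`."""
    return words * 60.0 / wpm


def percent_change(new_wpm: float, old_wpm: float) -> float:
    """Relative change from old_wpm to new_wpm, in percent."""
    return (new_wpm - old_wpm) * 100.0 / old_wpm


if __name__ == "__main__":
    for mode, wpm in SPEEDS_WPM.items():
        print(f"{mode:8s} {seconds_to_enter(QUERY_WORDS, wpm):5.1f} s")
    # Slowing speech down to typing speed for the machine's benefit:
    slowdown = percent_change(SPEEDS_WPM["typing"], SPEEDS_WPM["talking"])
    print(f"talking at typing speed: {slowdown:.0f}%")
```

Under these assumptions, forcing a speaker down from 100wpm to 40wpm is the 60% slowdown mentioned above, while natural speech is 150% faster than typing.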

VUX Law #2: Minimize Output Tradeoff - VUI output speed must be able to go up to around 200wpm, or 33% more efficient than regular listening

  • The average listening comprehension speed is 150wpm-200wpm.
  • The average reading speed is 250wpm

However, there is ample evidence to suggest that our current average listening speeds are simply being dragged down by our average talking speed. Just take any podcast and ramp it up to 2x playback: you can still listen comfortably, and it is only in the 3x-4x region that experience really starts to matter. At average talking speeds of 100wpm that works out to a 200wpm natural listening speed.

This is important because we want to reduce as much as possible the tradeoff in listening vs reading, a key feature of every generational shift in UI (as discussed above).
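The playback arithmetic can be sketched in a few lines of Python. This is an illustrative calculation, not a measurement: the helper names are invented for this example, and the speeds are the rough averages quoted above (using the low end of the listening range).

```python
TALKING_WPM = 100    # average talking speed
LISTENING_WPM = 150  # low end of the 150wpm-200wpm listening range
READING_WPM = 250    # average reading speed


def effective_wpm(base_wpm: float, playback_rate: float) -> float:
    """Words per minute heard when speech is played at `playback_rate`x."""
    return base_wpm * playback_rate


def percent_gain(new_wpm: float, old_wpm: float) -> float:
    """How much faster new_wpm is than old_wpm, in percent."""
    return (new_wpm - old_wpm) * 100.0 / old_wpm


if __name__ == "__main__":
    target = effective_wpm(TALKING_WPM, 2.0)  # 2x podcast playback
    print(f"target output speed: {target:.0f} wpm")
    print(f"gain over regular listening: {percent_gain(target, LISTENING_WPM):.0f}%")
    print(f"share of reading speed: {target / READING_WPM:.0%}")
```

With these inputs, 2x playback lands at 200wpm, about a third faster than regular listening and 80% of reading speed, which is the remaining tradeoff the law tries to minimize.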

VUX Law #3: Constrain by Conversational Constant - VUI I/O responsiveness must be under 250ms

More facts:

  • The “feeling of being instantaneous” barrier is 100ms
  • The global average time between two participants in a conversation is 200ms
  • The “flow of thought” barrier is 1,000ms
  • Alexa’s default timeout is 3,000ms
  • My Alexa Skill’s average time to execute is 4,000ms, relying on 2 slow APIs for data (could be optimized…)
  • The “attention” barrier is 10,000ms
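One way to read these thresholds together is as a ladder that a response time climbs. The Python sketch below is illustrative only: the labels and function names are assumptions of this example, and the 250ms budget is simply the figure proposed in Law #3.

```python
# Barriers from the list above, in ascending order (milliseconds).
BARRIERS_MS = [
    (100, "feels instantaneous"),
    (200, "within the typical conversational gap"),
    (1_000, "flow of thought preserved"),
    (10_000, "attention retained"),
]

LAW_3_BUDGET_MS = 250  # proposed ceiling for VUI responsiveness


def classify_latency(ms: float) -> str:
    """Return the tightest barrier that a response time still satisfies."""
    for limit, label in BARRIERS_MS:
        if ms <= limit:
            return label
    return "attention lost"


def meets_law_3(ms: float) -> bool:
    """True if the response time is under the 250ms budget."""
    return ms < LAW_3_BUDGET_MS


if __name__ == "__main__":
    for ms in (100, 200, 3_000, 4_000):
        print(f"{ms:>6} ms: {classify_latency(ms)}, law 3 ok: {meets_law_3(ms)}")
```

A 4,000ms skill still holds the user's attention on this ladder, but it sits well past both the flow-of-thought barrier and the 250ms budget.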

The fascinating study on “the conversational constant” is worth reading in full:

Conversation analysts first started noticing the rapid-fire nature of spoken turns in the 1970s, but had neither interest in quantifying those gaps nor the tools to do so. Levinson had both. A few years ago, his team began recording videos of people casually talking in informal settings. “I went to people who were sitting outside on the patio and asked if it was okay to set up a video camera for a study,” says Tanya Stivers.

While she recorded Americans, her colleagues did the same around the world, for speakers of Italian, Dutch, Danish, Japanese, Korean, Lao, Akhoe Haiom (from Namibia), Yélî-Dnye (from Papua New Guinea), and Tzeltal (a Mayan language from Mexico). Despite the vastly different grammars of these ten tongues, and the equally vast cultural variations between their speakers, the researchers found more similarities than differences.

The typical gap was 200 milliseconds long, rising to 470 for the Danish speakers and falling to just 7 for the Japanese. So, yes, there’s some variation, but it’s pretty minuscule, especially when compared to cultural stereotypes. There are plenty of anecdotal reports of minute-long pauses in Scandinavian chat, and virtually simultaneous speech among New York Jews and Antiguan villagers. But Stivers and her colleagues saw none of that.


It is striking how these conclusions, arrived at from a high-level human-interaction point of view, run exactly counter to the design choices the Alexa/Echo team has made so far:

  • Fixing Alexa output voice at a plodding speed (easily fixable)
  • Advising everyone that the AMAZON.LITERAL type only be used with short phrases
  • Not allowing compound intents and actions
  • Not natively supporting intent disambiguation
  • Allowing timeouts to go as high as 10,000 ms
  • I haven’t done the math but I wonder if running everything in the cloud makes things that much slower. At modern broadband transmission speeds this is probably not a material concern.

No shade thrown on the Alexa team at all, but we think it likely that all these choices will have to be addressed and reversed over time due to the above VUX laws having to be satisfied to make it work. We are pretty interested in the idea of predicting and narrowing down voice queries to reduce response times as this matches what we do in real life.
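As a rough illustration of that last idea, here is a minimal Python sketch of answering from a cache that was warmed with predicted queries. Everything in it is hypothetical: `PrefetchCache`, `warm`, and `answer` are names invented for this example and are not part of the Alexa Skills Kit, and how the likely queries get predicted is left out entirely.

```python
from typing import Callable, Dict, Iterable, Tuple


class PrefetchCache:
    """Hypothetical helper: fetch likely answers before the user asks."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch            # slow data lookup (e.g. a remote API)
        self._cache: Dict[str, str] = {}

    def warm(self, likely_queries: Iterable[str]) -> None:
        """Pay the slow lookup cost up front for predicted queries."""
        for query in likely_queries:
            if query not in self._cache:
                self._cache[query] = self._fetch(query)

    def answer(self, query: str) -> Tuple[str, bool]:
        """Return (result, was_prefetched); fall back to a live fetch."""
        if query in self._cache:
            return self._cache[query], True
        result = self._fetch(query)
        self._cache[query] = result
        return result, False


if __name__ == "__main__":
    calls = []

    def fake_slow_fetch(query: str) -> str:
        # Stand-in for a slow API; records each call instead of sleeping.
        calls.append(query)
        return f"data for {query}"

    cache = PrefetchCache(fake_slow_fetch)
    cache.warm(["latest price", "latest news"])
    print(cache.answer("latest price"))  # served from the warmed cache
    print(cache.answer("latest filings"))  # falls back to a live fetch
    print(len(calls), "slow calls in total")
```

The design choice here is to trade extra background fetches for a faster spoken reply; whether that trade pays off depends on how well the predictions match what users actually ask.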

Parting thought

Remember that use case we casually wrote off as if it were a done deal?

The biggest use case for voice interaction is while manually driving. This use case will diminish in direct proportion to the adoption rate of autonomous vehicles.

That’s probably going to take longer than anyone reading this would like to become a reality. Meanwhile, there’s another interesting use case on the rise — one where your eyes and hands are fully occupied with a need to still interact with the computer…