“Hey, robot, what’s the weather going to be today?”
“It’s 24 and sunny.”
The first time you try it, it feels a bit like magic. After years of using computers with your hands, it can be freeing to speak a command into the air.
People have been building voice interfaces for almost as long as computers have existed. A famous software demo called Put That There, shown by Chris Schmandt at MIT in 1979, features a user sitting in a chair in front of a wall-sized display. He points with his hand and asks for different shapes to be drawn. The system creates them. He can move the shapes around, asking for “that” to be put “there”. It seems so natural.
“Hey, robot, play the new Taylor Swift album.”
“Sure, here’s Rick Astley.”
Of course, sometimes our voice commands don’t work as we expected. We try again, speaking… more… slowly. The voice interface still fails. It’s frustrating to get Rick when we want Taylor, but it’s not a big deal. Never Gonna Give You Up is still a banger.
The question is: why are voice interfaces sometimes so bad? If we had Put That There in 1979 and Samuel L. Jackson asking Siri to “remind me to put the gazpacho on ice at 7pm” in 2012, why haven’t we got further in the last 12 years?
It’s not because the technology needs work. It’s because speaking doesn’t work the way we think it does. By objective measures, speech recognition is often more than 99% accurate. But actual speech is errorful and non-deterministic. We are so good at speaking and listening that we don’t notice how variable speech really is. Speech is a social and collaborative medium. It’s not an “I” thing. It’s a “we” thing.
“Hey, robot, read me the most recent text message from my brother.”
“Can you fix my Wi-Fi tomorrow?”
As we talk with someone, we develop a shared context that becomes the container for our conversation. In simple situations like asking for the weather, or for a playlist, our request sets the context. But as situations get more complex, we need to collaboratively create the context over time.
In more complex settings, computers don’t have enough context to make sense of what we’re saying. As certain kinds of AI get better, there will be an expectation that more “natural” interactions become the norm. The problem is that those “natural” interactions take advantage of social processes we take for granted.
When we talk, we’re always adjusting what we say and how we say it based on what we understand about the other person. If they seem confused, we use simpler language. If they glance at the time, we get to the point faster. Those two examples are important context, but they’re not part of what’s being said. And that’s why voice interfaces are so hard. Talking, in the way that we do it most of the time, assumes far more collaborative context than we get when we talk with computers.
If we want to create voice interfaces that are as “easy” as we want them to be, we need two things. First, we need really great information about the context. Second, we need specific information about the capabilities of the system. That is, we need to know what people are doing and we need to know what the technology can do. Because if we ask about warehouse operations and the system only knows about music, who can tell what will happen?
“Hey, robot, open the pod bay doors.”
“I’m afraid I can’t do that, Dave.”