Voice Technology: The Good, The Bad, The Inevitable

April 26, 2024

4-minute read

Written by Sai Rathnam, CTO of Relay

The future is voice-activated: You’ve probably read the predictions that half of all online searches would be voice-based by 2020, but the idea isn’t quite as futuristic as it sounds.

If you start your day by checking in with Alexa on your Amazon Echo device, or spend time chewing over the news with Google Home or Siri, you’re already using cutting-edge technology gone mainstream. And you’re not alone. There are currently more than 66.4 million smart speaker owners in the U.S., and that number keeps growing. But it’s been something of a bumpy path getting here.

Recognizing human speech with computers has always been a challenge. But what once looked like an intractable problem has become much more tractable thanks to rapid advances in deep learning. We’ve come a very long way in recognizing accents (even mine!) and different languages. The time feels right for voice-everything interfaces. The road will be bumpy, but the destination is reachable.

Deep learning is powered by collecting vast numbers of speech samples. The data-intensive training process then requires those samples to be labeled. The machine learning models gradually start to build correlations between the sounds of human speech and phonemes, letters, and eventually the words used in a language. However, the process of collecting speech samples has not gone smoothly, with customers often caught unaware that their speech was being used to train a machine learning model. A smart speaker improves through the process, but the privacy concerns are very real. Alexa is neither deeply intuitive nor a dolt; she’s simply learning from you. Other approaches to collecting data rely on folks volunteering their speech samples. Mozilla’s Common Voice project, which gathers data for speech engines such as its DeepSpeech, takes this approach.

I have worked in software development for more than two decades, on projects big and small. As programmers, we are all guilty of unleashing complex systems on unsuspecting users. The systems we build are incredibly useful but, in many cases, also difficult to use. Building these products is fun for the programmers and engineers, but when I watch someone use a product for the first time, I realize how much more could have been done to improve its usability. A very loose definition of usable: users find it easy to locate the action they want the computer to take, and the results of invoking that action don’t surprise them. Screen-based computer interfaces force users to adapt to the programmer’s world view. Human beings are very good at learning and adapting, so it’s usually not an issue after some hours or days of use.

Is there an alternative? If you look at the way we communicate, speech is the most natural medium. And while we’re used to seeing our friends, co-workers, and family hunched over a screen, the thing that comes most naturally to all of us is simply speaking. That’s how we came to build Relay, which allows you to communicate without a screen. I feel like it’s an organic evolution of technology and that we’re moving in that direction for a reason.

“And while voice will be the wave of the future, it will augment instead of replace the way we already interact with each other and our devices.”

A natural evolution: If you study the evolution of computers, you see that the original interfaces were composed almost entirely of text. Take a computer from 50 years ago and you’ll see a screen full of text, with no graphical interface and nothing pretty to distract us. Then someone decided to slap a graphical user interface on the process, and suddenly it all became more usable and so much more attractive than gobs of text. Interestingly enough, while that’s what we know now, it’s not a very natural way to use a computer. It’s just what we have come to accept as the norm. Think of your smartphone for a minute: all that data and potential connectivity, greater than the greatest computer of even a few decades ago, all in the palm of your hand. But there are times you can’t or shouldn’t be looking at your phone. Maybe it’s when you’re driving or with a small child. Moving on to voice technology makes sense; we just have to get used to it.

A gradual voice revolution: We’re already at the next generation of tech, but don’t panic about all systems changing at once. It’ll happen naturally. And while voice will be the wave of the future, it will augment instead of replace the way we already interact with each other and our devices. Shows like Star Trek and the Iron Man movies taught us that you can talk to your computer and make things happen. Your computer and connected devices already do a lot to improve efficiency and productivity; what voice does is enable folks to access information without being distracted by a screen. At the end of the day, despite the appearance of an entity on the other side that seems to understand, it’s just a computer. Besides, your kids already bossily order Alexa to play Baby Shark, and your grandparents have figured out how to order dinner in just by asking their device.

Isn’t it time you caught up with the trend?