“Alexa, what’s the weather going to be like today?”
It has taken decades for scientists to understand natural human speech well enough for voice-activated interfaces such as Alexa, Amazon’s natural language processing system, to win acceptance from consumers. Alexa is the voice that talks to users of Amazon’s Echo products, including the Echo, Dot and Tap, as well as Amazon Fire TV and some third-party products. Since 2012, when the patent was filed for what would ultimately become Amazon’s artificial intelligence system Alexa, there has been tremendous growth in capabilities, and the credit for that growth goes to machine learning.
For something that we do every day without giving it any thought, conversation between machines and humans is complex. So, how did Amazon and others in the space such as Google, Apple and Microsoft crack the code?
ABCs of Alexa
Over 30 million smart speakers were sold globally last year, and this number is expected to grow to nearly 60 million this year. While Amazon, which sold about 20 million devices last year, remains the industry leader in smart speakers, others (especially Google) are growing and starting to catch up. Each has its own nuances, but let’s look “under the hood” of an Echo to see how Alexa works.
The Echo cylinder itself contains some capability: speakers, a microphone and a small computer that can wake the system and blink its lights to let you know it’s activated. Its real capabilities, though, come into play once it sends whatever you have told Alexa to the cloud to be interpreted by Alexa Voice Services (AVS).
So, when you ask Alexa, “What’s the weather going to be like today?” the device records your voice. That recording is then sent over the Internet to Amazon’s Alexa Voice Services, which parses the recording into commands it understands. The system then sends the relevant output back to your device. When you ask about the weather, an audio file is sent back and Alexa tells you the forecast, all without you having any idea there was any back and forth between systems. That, of course, means that if you lose your Internet connection, Alexa stops working.
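To make that round trip concrete, here is a minimal sketch in Python. It is not Amazon’s code: the names `CloudResponse`, `fake_cloud_service` and `handle_utterance` are made up for illustration, plain text stands in for the voice recording, and the forecast is a placeholder rather than real data.

```python
from dataclasses import dataclass


@dataclass
class CloudResponse:
    """What the (hypothetical) cloud service sends back to the device."""
    intent: str
    speech_text: str


def fake_cloud_service(transcript: str) -> CloudResponse:
    """Stand-in for the cloud step: turn a request into a command and a reply.

    A real service works on audio and uses trained models; this sketch
    just looks for a keyword in the text.
    """
    text = transcript.lower()
    if "weather" in text:
        # Placeholder forecast, not real weather data.
        return CloudResponse("GetWeather", "Today will be sunny with a high of 20 degrees.")
    return CloudResponse("Unknown", "Sorry, I don't know that one.")


def handle_utterance(transcript: str, online: bool = True) -> str:
    """Device side: send the request to the cloud and return what to say."""
    if not online:
        # Without a connection the device cannot reach the cloud at all.
        return "Sorry, I'm having trouble connecting to the Internet."
    response = fake_cloud_service(transcript)
    return response.speech_text


print(handle_utterance("Alexa, what's the weather going to be like today?"))
print(handle_utterance("Alexa, what's the weather going to be like today?", online=False))
```

The second call shows the dependency described above: with no connection, the device has nothing to interpret the request with.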
The skills Echo has out of the box are impressive to most of us, but Amazon also gives approved developers free access to Alexa Voice Services and encourages them to create new Alexa skills that augment the system’s skill set, just as Apple did with the App Store. As a result of this openness, the list of skills that Alexa can help with (currently over 30,000) continues to grow rapidly. Users can, of course, purchase products from Amazon, but they can also order pizza from Domino’s, hail a ride from Uber or Lyft, control their light fixtures, make a payment through the Capital One skill, get wine pairings for dinner and so much more.
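One way to picture how third-party skills extend a voice assistant is a simple registry that maps requests to handlers. The sketch below is an illustration only and does not use Amazon’s developer tools; `SkillRegistry`, its methods and the two example skills are hypothetical.

```python
from typing import Callable, Dict

# A skill handler takes the user's request and returns a spoken reply.
SkillHandler = Callable[[str], str]


class SkillRegistry:
    """Hypothetical registry: developers add skills, the assistant dispatches to them."""

    def __init__(self) -> None:
        self._skills: Dict[str, SkillHandler] = {}

    def register(self, trigger: str, handler: SkillHandler) -> None:
        """Add a new skill that responds when `trigger` appears in a request."""
        self._skills[trigger.lower()] = handler

    def dispatch(self, request: str) -> str:
        """Hand the request to the first skill whose trigger word it contains."""
        text = request.lower()
        for trigger, handler in self._skills.items():
            if trigger in text:
                return handler(request)
        return "Sorry, I don't have a skill for that yet."


registry = SkillRegistry()
registry.register("pizza", lambda request: "Okay, placing your pizza order.")
registry.register("ride", lambda request: "Okay, requesting a ride.")

print(registry.dispatch("Order a pizza"))
print(registry.dispatch("Play chess"))
```

Each new skill is just one more `register` call, which is why an open platform can grow so quickly.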
Constantly learning from human data
Data and machine learning are the foundation of Alexa’s power, and that power is only getting stronger as Alexa’s popularity and the amount of data it gathers increase. Every time Alexa makes a mistake in interpreting your request, that data is used to make the system smarter the next time around. Machine learning is the reason for the rapid improvement in the capabilities of voice-activated user interfaces. For example, Google’s speech recognition was able to improve its error rate tremendously in a year; now it recognises 19 out of 20 words it hears. Understanding natural human speech is a gargantuan problem, and we now have the computing power at our disposal to make these systems better the more we use them.
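Recognising 19 out of 20 words means getting roughly one word in 20 wrong, or about 5%. Speech systems are commonly scored with a measure called word error rate: the number of word substitutions, deletions and insertions needed to turn the system’s transcript into the correct one, divided by the number of words actually spoken. The article does not say exactly how the Google figure was measured, so the Python sketch below is only an illustration of the idea, and the function name `word_error_rate` is my own.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between two transcripts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # reference word was dropped
                curr[j - 1] + 1,                  # extra word was inserted
                prev[j - 1] + substitution_cost,  # words match or were swapped
            ))
        prev = curr
    return prev[-1] / len(ref)


# A made-up 20-word request with one word misheard ("weather" heard as "whether").
spoken = ("alexa what is the weather going to be like today and will "
          "i need an umbrella when i go out")
heard = spoken.replace("weather", "whether")

print(f"{word_error_rate(spoken, heard):.0%}")
```

One wrong word out of 20 gives a word error rate of 5%, which is the same as recognising 19 out of 20 words.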
The challenges of natural language generation and processing
As a subset of artificial intelligence, natural language generation (NLG) is the ability to produce natural-sounding written and spoken responses from data that is fed into a computer system. Human language is quite complex, but today’s natural language generation capabilities are becoming very sophisticated. Think of NLG as a writer that turns data into language that can be communicated.
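The “writer” idea can be shown with a deliberately simple sketch. Modern NLG systems rely on learned models; the Python below uses a fixed sentence template instead, and the function name, field names and forecast values are all invented for illustration.

```python
def describe_forecast(forecast: dict) -> str:
    """Turn structured weather data into a sentence (template-based NLG)."""
    sentence = (
        f"Today in {forecast['city']} it will be {forecast['condition']}, "
        f"with a high of {forecast['high_c']} degrees and a low of {forecast['low_c']}."
    )
    # Add advice only when the data calls for it.
    if forecast.get("rain_chance", 0) >= 50:
        sentence += " You may want to take an umbrella."
    return sentence


# Made-up data, standing in for what a weather service might return.
forecast = {"city": "Seattle", "condition": "cloudy", "high_c": 14, "low_c": 8, "rain_chance": 70}
print(describe_forecast(forecast))
```

The input is a handful of numbers and labels; the output is a sentence a person could say aloud, which is the essence of turning data into language.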
Natural language processing (NLP) is the reader: it takes language, whether spoken by a person or created by NLG, and turns it into something a machine can act on. Advances in this technology have allowed dramatic growth in intelligent personal assistants such as Alexa.
Voice-based AI is so appealing because it holds the promise of supporting us in a way that is natural to us humans; no swiping or typing necessary. That’s also why it’s a technical challenge to build. Just think about how nonlinear your typical conversation is.
When people talk they interrupt themselves, change topics or repeat themselves, use body language to add meaning and use a wide variety of words that have multiple meanings depending on the context. It’s like a parent trying to understand the vernacular of teens, but much, much more complicated.
Amazon continues to have an army of specialists in addition to a cadre of machines on the task of making Alexa and Alexa Voice Services even better. Their goal is to make spoken language a user interface that is as natural as talking to another human being. I can’t wait to see what’s in store next.
Bernard Marr is a bestselling author, keynote speaker, and advisor to companies and governments. He has worked with and advised many of the world's best-known organisations. LinkedIn has recently ranked Bernard as one of the top 10 Business Influencers in the world (in fact, No 5 - just behind Bill Gates and Richard Branson). He writes on the topics of intelligent business performance for various publications including Forbes, HuffPost, and LinkedIn Pulse. His blogs and SlideShare presentations have millions of readers.