Speech recognition technology is something that has been dreamt about and worked on for decades. From R2-D2’s beeping and booping in Star Wars to Samantha’s disembodied but soulful voice in Her, sci-fi writers have had a huge role to play in building expectations and predictions for what speech recognition could look like in our world.
However, for all of modern technology’s advancements, voice control has been a rather unsophisticated affair. Technology that is supposed to simplify our lives has historically been frustratingly clunky and little more than a novelty. That is, until big data, deep learning, machine learning and AI began to move to the forefront of technology.
History of Speech Recognition Technology
As with any technology, what we know today has to have come from somewhere, sometime and someone. In fact, the first recorded attempt at speech recognition technology dates back to 1000 A.D., with the development of an instrument that could supposedly answer “yes” or “no” to direct questions. Though this experiment didn’t technically involve voice processing in any form, the idea behind it remains part of the foundation of speech recognition technology: using natural language as input to trigger an action.
Centuries later, Bell Laboratories developed “Audrey”, a system that could recognize the numbers 1-9 spoken by a single voice, and IBM developed a device that could recognize and differentiate between 16 spoken words. These successes led more technology companies to focus on speech-related technologies. Indeed, even the Department of Defense wanted to get in on the action. Slowly but surely, developers moved towards the goal of enabling machines to understand and respond to more and more of our verbalized commands.
The history of speech recognition technology has been a long and winding one. Today’s speech systems, such as Google Assistant, Amazon Alexa, Microsoft Cortana and Apple’s Siri, would not be where they are without the early pioneers who paved the way. Thanks to the integration of new technologies such as cloud-based processing as well as ongoing data collection projects, these speech systems have continuously improved their ability to ‘hear’ and understand a wider variety of words, languages, and accents. At this rate, it seems sci-fi writers’ predictions of the future aren’t as far off as we might think.
How Does Voice Recognition Work?
Surrounded by smartphones, smart cars, smart home appliances, voice assistants and more, it’s easy to take for granted how speech recognition technology actually works. Why? Because the simplicity of being able to speak to digital assistants is misleading. Speech recognition is actually incredibly complicated, even now.
Think about how a child learns a language. From day one, they hear words being used all around them. Parents speak to their child, and although the child doesn’t respond, they absorb all kinds of verbal cues: intonation, inflection, and pronunciation. Their brain forms patterns and connections based on how their parents use language.
Though it may seem as though humans are hardwired to listen and understand, we have actually been training our entire lives to develop this so-called natural ability. Speech recognition technology works in essentially the same way. Whereas humans have refined our process, we are still figuring out the best practices for computers. We have to train them in the same way our parents and teachers trained us. And that training involves a lot of innovative thinking, manpower, and research.
Perfecting these speech recognition systems will take a lot more time and a lot more field data; there are thousands of languages, accents and dialects to take into account, after all. That’s not to say we aren’t making progress: as of May 2017, Google’s machine learning algorithms had achieved a 95% word accuracy rate for the English language. That rate also happens to be the threshold for human accuracy, mind you.
Which Voice Assistant is the Best?
By now, we’ve all heard of and/or used a speech recognition system; they have penetrated the tech ecosystem en route to becoming the de facto means of communication between humans and technology. Voice input is simply the more efficient form of computing, says Mary Meeker in her annual Internet Trends report: humans can speak 150 words per minute on average, but can only type 40. Goodbye texting and pushing buttons – we’re simply too busy for all that now.
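To put those two figures side by side, here is a small sketch in Python. The 100-word message length and the function name are assumptions made purely for illustration; only the 150 and 40 words-per-minute averages come from the report cited above.

```python
# Average rates quoted from the Internet Trends report.
SPEAKING_WPM = 150
TYPING_WPM = 40


def seconds_to_enter(word_count: int, words_per_minute: int) -> float:
    """Time in seconds needed to enter `word_count` words at a given rate."""
    return word_count * 60 / words_per_minute


# Hypothetical message length, chosen only to make the comparison concrete.
message_words = 100

spoken_seconds = seconds_to_enter(message_words, SPEAKING_WPM)  # 40.0
typed_seconds = seconds_to_enter(message_words, TYPING_WPM)     # 150.0
speedup = SPEAKING_WPM / TYPING_WPM                             # 3.75

print(f"Spoken: {spoken_seconds:.0f}s, typed: {typed_seconds:.0f}s, "
      f"voice is {speedup:.2f}x faster")
```

On these averages, dictating is a little under four times faster than typing, before accounting for any time spent correcting recognition errors.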
What’s kept speech recognition from becoming the dominant form of computing as of yet is its unreliability. Regional accents and speech impediments can throw off word recognition platforms, and background noise can be difficult to penetrate. Not to mention multiple-voice input.
In other words, simply recognizing sounds isn’t quite enough – to have any level of effectiveness, these speech recognition systems have to be able to distinguish between homophones (words with the same pronunciation but different meanings), to learn the difference between proper names and separate words (“Tim Cook” is a person, not a request to find a cook named Tim), and more. After all, speech recognition accuracy is what determines whether these voice assistants become a can’t-live-without feature. This of course raises the question of which voice assistant currently on the market is the best in terms of speech accuracy, innovation, usability and cohesiveness with other smart systems.
Apple’s Siri was the first voice assistant from a mainstream tech company, debuting back in 2011. Since then, it has been integrated into all iPhones, iPads, the Apple Watch, Mac computers and Apple TV. Via your phone, Siri is even used as the key user interface in Apple’s CarPlay infotainment system for automobiles as well as the wireless AirPods. With the release of SiriKit, a development tool that lets third-party companies integrate with Siri, and HomePod, Apple’s own attempt at an intelligent speaker (following the success of Amazon Echo and Google Home), the voice assistant’s abilities have become even more robust. Siri is with you everywhere you go: on the road, in your home, and for some, literally on your body. This gives Apple a huge advantage in terms of adoption.
Although Apple had a big head start with Siri, many users expressed frustration at its seeming inability to properly understand and interpret voice commands. Naturally, being the earliest often means receiving most of the flak for functionality that doesn’t work as expected. But even today, Siri remains notorious for misunderstanding voice commands, even going so far as to respond to a request for help with alcohol poisoning by providing a list of nearby liquor stores.
If you ask Siri to send a text message or make a call on your behalf, it can easily do so. However, when it comes to interacting with third-party apps, Siri is a little less robust than its competitors, working with only six types of apps: ride-hailing and sharing; messaging and calling; photo search; payments; fitness; and auto infotainment systems. Why? Because Apple is betting that “customers will not use voice commands without an experience similar to speaking with a human, and so it is limiting what Siri can do in order to make sure it works well”, reports Reuters. Now, an iPhone user can say, “Hey Siri, I’d like a ride to the airport” or “Hey Siri, order me a car,” and Siri will open whatever ride service app you have on your phone and book the trip.
Apple’s focus on handling follow-up questions, translating between languages, and revamping Siri’s voice to sound more human is helping to iron out the voice assistant’s user experience. In addition, Apple leads its competitors in terms of availability by country, and thus in Siri’s understanding of foreign accents. Siri is available in more than 30 countries and 20 languages – and, in some cases, several different dialects. Google Home, by comparison, is available in only seven countries and can only speak four languages ‘fluently’ (English, German, French and Japanese), though it does support multiple versions of some of those languages. Alexa, on the other hand, can only manage English (U.S. and U.K.) and German.
Housed inside Amazon’s smash-hit Amazon Echo smart speaker as well as the newly released Echo Show (a voice-controlled tablet) and Echo Spot (a voice-controlled alarm clock), Alexa is one of the most popular voice assistants out there today. Whereas Apple focuses on perfecting Siri’s ability to do a small handful of things rather than expanding its areas of expertise, Amazon puts no such restrictions on Alexa. Instead, it is wagering that the voice assistant with the most “skills” (its term for apps on its Echo assistant devices) “will gain a loyal following, even if it sometimes makes mistakes and takes more effort to use”. Although some users have pegged Alexa’s word recognition rate as being a shade behind other voice platforms, the good news is that Alexa adapts to your voice over time, offsetting any issues it may have with your particular accent or dialect.
Speaking of skills, Amazon’s Alexa Skills Kit (ASK) is perhaps what has propelled Alexa forward as a bona fide platform. ASK allows third-party developers to create apps and tap into the power of Alexa without ever needing native support. With over 30,000 skills and growing, Alexa certainly outperforms Siri, Google Assistant and Cortana combined in terms of third-party integration. With the incentive to “Add Voice to Your Big Idea and Reach More Customers” (not to mention the ability to build for free in the cloud, “no coding knowledge required”), it’s no wonder that developers are rushing to put content on the Skills platform. Some can’t help but draw parallels with Apple’s App Store, which also attracted developers rushing to put content – any content – on the platform, regardless of whether or not that content was actually valuable.
Another huge selling point for Alexa is its integration with smart home devices such as cameras, door locks, entertainment systems, lighting and thermostats. Ultimately, this gives users control of their home whether they’re cozying up on their couch or on the go. With Amazon’s Smart Home Skill API (another third-party developer tool similar to ASK), you can enable customers to control their connected devices from tens of millions of Alexa-enabled endpoints.
When you ask Siri to add something to your shopping list, she adds it to your shopping list – without actually buying it for you. Alexa, however, goes a step further. If you ask Alexa to re-order your rubbish bags, she’ll just go through Amazon and order them. In fact, you can order millions of products from Amazon without ever lifting a finger: a natural and unique ability that Alexa has over its competitors.
Based on a 26th-century artificially intelligent character in the Halo video game series, Cortana debuted in 2014 as part of Windows Phone 8.1, the next big update at the time for Microsoft’s mobile operating system. Microsoft has since announced, in late 2017, that its conversational speech recognition system reached a 5.1% error rate, its lowest so far. This surpasses the 5.9% error rate reached in October 2016 by a group of researchers from Microsoft Artificial Intelligence and Research and puts its accuracy on par with professional human transcribers, who have advantages like the ability to listen to text several times. In this race, every inch counts; when Microsoft announced its 5.9% error rate in late 2016, it was ahead of Google. However, fast-forwarding a year puts Google ahead – but only by 0.2 percentage points.
While percentages and accuracy rates are important, Cortana differentiates itself from other voice assistants by actually being based upon real, human personal assistants. Rival services dig deep into data from your devices, your search history, and the cookie trails you may have left behind across the internet. While that’s often useful, it can also be irritating in the form of non-stop notifications, or just plain frightening that a smart system can know so much about you. We’ve all watched 2001: A Space Odyssey where the mother of all sentient computers, HAL 9000, goes on a killing rampage with its unblinking red eye and smooth-as-butter robotic voice.
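The gaps between these error rates are easy to misread, so here is a short Python sketch that separates the absolute gap (in percentage points) from the relative improvement. The three rates are the ones quoted above; the function names are hypothetical helpers, not part of any vendor’s tooling.

```python
# Word error rates, in percent, as quoted in the text above.
MICROSOFT_2016 = 5.9
MICROSOFT_2017 = 5.1
GOOGLE_2017 = 4.9


def point_gap(rate_a: float, rate_b: float) -> float:
    """Absolute difference between two rates, in percentage points."""
    return round(abs(rate_a - rate_b), 1)


def relative_reduction(old_rate: float, new_rate: float) -> float:
    """How much the error rate shrank, as a percentage of the old rate."""
    return round((old_rate - new_rate) / old_rate * 100, 1)


print(point_gap(MICROSOFT_2017, GOOGLE_2017))              # 0.2
print(point_gap(MICROSOFT_2016, MICROSOFT_2017))           # 0.8
print(relative_reduction(MICROSOFT_2016, MICROSOFT_2017))  # 13.6
```

In other words, Microsoft’s year-on-year gain of 0.8 points means roughly one in seven of its earlier errors disappeared, while the gap to Google is a much smaller 0.2 points.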
To avoid this, Microsoft spoke to a number of high-level personal assistants, finding that they all kept notebooks handy with key information of the person they were looking after. It was that simple idea which inspired Microsoft to create a virtual “Notebook” for Cortana, which stores personal information and anything that’s approved for Cortana to see and use. It’s not a privacy control panel, per se, but it definitely gives you a little more control over what Cortana does and doesn’t have access to. For instance, if you aren’t comfortable with Cortana having access to your email, your Notebook is where you can add or remove access. Another stand-out feature? Cortana will always ask you before she stores any information she finds in her Notebook.
Microsoft has also worked closely with Halo developers on the eyelike visual elements as well as with voice actress Jen Taylor for Cortana’s voice. These elements truly bring Cortana to life and lend the system a personality and emotion that it may not have had without that association. Of course, Cortana’s personality shines through in daily use as well – with witty responses just about oozing from her circuit boards.
Like Google Assistant with Google search, Cortana has support from Microsoft’s Bing search engine, allowing the voice assistant to chew through whatever data it needs in order to answer your burning questions. And, like Amazon, Microsoft has a home smart speaker of its own in the Cortana-powered Harman Kardon Invoke, which performs many of the same functions as its rivals. Microsoft has another huge advantage when it comes to market reach – with Cortana being available on all Windows computers and mobiles running on Windows 10.
One of the most common responses to voicing a question out loud these days is, “LMGTFY”. In other words, “let me Google that for you”. It only makes sense, then, that Google Assistant prevails when it comes to answering (and understanding) any and all questions its users may have. From asking for a phrase to be translated into another language, to working out the number of sticks of butter in one cup, Google Assistant not only answers correctly, but also gives some additional context and cites a source website for the information. Given that it’s backed by Google’s powerful search technology, perhaps that’s unsurprising.
Though Amazon’s Alexa was released (through the introduction of Echo) two years earlier than Google Home, Google has made great strides in catching up with Alexa in a very short time. Google Home was released in late 2016 and, within a year, had already established itself as the most meaningful opponent to Alexa. As of late 2017, Google boasted a word accuracy rate of roughly 95% for U.S. English, the highest of all the voice assistants currently out there. This corresponds to a 4.9% word error rate, making Google the first of the group to fall below the 5% threshold.
In what some call an attempt to strike back at Amazon, Google has launched many products eerily similar to Amazon’s. For instance, Google Home is reminiscent of Amazon’s Echo, and Google Home Mini of Amazon’s Echo Dot. More recently, Google also announced new key partnerships with companies including Lenovo, LG and Sony to launch a line of Google Assistant-powered “smart displays,” which once again seem to ‘echo’ Amazon’s Echo Show.
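For readers curious how a word error rate is typically worked out, the sketch below shows the standard textbook calculation in Python: the minimum number of word substitutions, deletions and insertions needed to turn the recognized text into the reference transcript, divided by the number of words in the reference. This is not Google’s own evaluation code, and the sample sentences and function name are made up for illustration. Treating word accuracy as one minus the word error rate is a common simplification and is assumed here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1       # reference word was missed
            insertion = current_row[j - 1] + 1   # extra word was recognized
            current_row[j] = min(substitution, deletion, insertion)
        previous_row = current_row

    return previous_row[-1] / len(ref_words)


# Made-up example: one word misheard and one word dropped, out of six.
reference = "turn on the kitchen lights please"
hypothesis = "turn on the chicken lights"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.1%}, word accuracy: {1 - wer:.1%}")
```

Because insertions count as errors too, a word error rate can in principle exceed 100%, which is one reason the accuracy figure is only an approximation of it.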
Nuance’s Dragon Assistant and Dragon NaturallySpeaking
Though Nuance hasn’t come out with a smart home speaker, its Dragon Assistant and Dragon NaturallySpeaking systems have been used as the speech recognition backbone for other tech companies. “I should just be able to talk to [my phone] without touching it,” says Vlad Sejnoha, chief technology officer at Nuance Communications. “It will constantly be listening for trigger words, and will just do it – pop up a calendar, or ready a text message, or a browser that’s navigated to where you want to go”.
Much of Nuance’s voice recognition technology is centered on in-car speech systems, bringing embedded dictation capabilities and conversational infotainment to the car. “A further development is the addition of a deeper level of understanding,” says John West, principal solutions architect for Nuance. “Here, the aim is to not only recognize speech, but also to extract the meaning and intent of what has been said, enabling voice driven systems as a whole to react in an intelligent way, appropriate to the user’s needs.”
So… Which Voice-Assistant is Best?
First and foremost are the numbers. Google Assistant is now installed on 400 million devices, including the Google Home speakers and certain Android phones. Similarly, Microsoft officially claims there are 400 million active users of Windows 10 itself, not counting mobiles running the same system. With Amazon’s Alexa only being available on its Echo speakers, these numbers certainly dwarf the number of devices Alexa can compete on. Siri, on the other hand, still has the upper hand in this space, with more than 700 million iPhones in use worldwide as of mid-2017 – and that doesn’t even count the people who own an Apple Watch, a MacBook or an iPad.
With millions of pre-existing users behind each of these tech giants, a simple software update is all it takes to roll out their voice assistants worldwide. Those with Google’s Pixel phones, for instance, become a part of the Google ecosystem. They are more likely to invest in a Google Home speaker, and thus be funneled into engaging with YouTube, Google search, Google Maps, and so on. The same goes for Apple, Amazon and Microsoft users, with slight variations in which ecosystem and which products they are funneled into.
Ultimately, there is no one-size-fits-all winner when it comes to voice assistants. If you’re an avid Apple-user, then Siri and its widespread distribution across all Apple products might be the assistant for you. If you want to make your home into a smart home, Alexa already has thousands of software and hardware integrations ready to go. If you’re looking for an assistant who can tell you the answers to all your strange and wonderful questions, Google Assistant’s search engine beats all the rest. If you’re looking for a little more control as to what information your digital assistant has access to, Microsoft’s Cortana has that functionality.
Perhaps the real game-changer here is the partnership between Microsoft and Amazon announced on August 30, 2017. That’s right, Alexa and Cortana are officially working together. Users will be able to say, “Alexa, open Cortana,” to their Echo devices, and “Cortana, open Alexa,” to their Windows 10 devices. Because both companies lack popular smartphones (unlike Google and Apple), they’ve tailored their assistants to play to their strengths. Alexa customers will be able to access Cortana’s unique features such as booking meetings, accessing work calendars, setting a reminder to pick up flowers on the way home, or reading work email. Similarly, Cortana customers can ask Alexa to control their smart home devices, shop on Amazon.com, and interact with many of the more than 30,000 skills built by third-party developers.
Thus, in terms of pioneers in this new industry of voice-activation and digital assistants, Amazon certainly takes the cake. Not only does the company support the creation of other voice-activated technologies through their ASK and Smart Home API, but they were the original innovators to create a smart home speaker, a smart home speaker with a screen, and more. In other words, they simply moved (and are continuing to move) faster than their rivals, all the while innovating through the continued creation of partnerships.
In-Car Speech Recognition
Voice-activated devices and digital voice assistants aren’t just about making things easier. They’re also about safety – at least when it comes to in-car speech recognition. Companies like Apple, Google and Nuance are reshaping the driver’s experience, aiming to remove the distraction of looking down at a mobile phone so that drivers can keep their eyes on the road.
Instead of texting while driving, you can now tell your car who to call or what restaurant to navigate to. Instead of scrolling through Apple Music to find your favorite playlist, you can just ask Siri to find and play it for you. If the fuel in your car is running low, your in-car speech system can not only inform you that you need to refuel, but also point out the nearest fuel station and ask whether you have a preference for a particular brand. Or perhaps it can warn you that the petrol station you prefer is too far to reach with the fuel remaining.
As beneficial as it may seem in an ideal scenario, in-car speech technology can be dangerous when implemented before its accuracy is high enough. Studies have found that voice-activated technology in cars can actually cause higher levels of cognitive distraction. This is because it is relatively new as a technology; engineers are still working out the software kinks.
But, at the rate speech recognition technology and artificial intelligence are improving, perhaps we won’t even be behind the wheel at all a few years down the line.
Voice-Activated Video Games
Outside of these use-cases in which speech recognition technology is implemented with the intent to simplify our lives, it’s also making strides in other areas. Namely, in the gaming industry.
Creating a video game is already extraordinarily difficult. It takes years to properly flesh out the plot, the gameplay, character development, customizable gear, lottery systems, worlds, and so on. Not only that, but the game has to be able to change and adapt based on each player’s actions.
Now, just imagine adding another level to gaming through speech recognition technology. Many of the companies championing this idea do so with the intention of making gaming more accessible for visually and/or physically impaired people, as well as allowing players to immerse themselves further in gameplay by enabling yet another layer of integration. Voice control could also potentially lower the learning curve for beginners, seeing as less importance will be placed on figuring out controls; players can “just” begin talking right away.
That said, it will be extremely challenging for game developers, who will now have to account for hundreds (if not thousands) of hours of voice data collection, speech technology integration, testing and coding in order to retain their international audience.
However, despite all the goals tech companies are shooting for and the challenges they have to overcome along the way, there are already a handful of game developers who believe the benefits outweigh the obstacles. In fact, voice-activated video games have even begun to extend from the classic console and PC format to voice-activated mobile games and apps. From Seaman, starring a sarcastic man-fish brought to life by Leonard Nimoy’s voice in the late 1990s, to Mass Effect 3, released in 2012 with voice commands through Microsoft’s Kinect, the rise of speech technology in video games has only just begun.
Speech Technology Apps and Devices
While voice-assistants have been making a big splash in our personal lives, a recent study by VoiceLabs revealed that 30% of respondents noted smart home devices as their primary reason(s) for investing in an Amazon Echo or Google Home. This next-generation ‘conversation’ technology offers consumers a way out of having to use the clunky remote control interface. As such, allowing users to speak and interact with their electronics as they would another human being adds to the seamlessness of usability and decreases the barrier to entry for tech products.
Engineers are hard at work creating a plethora of voice-controlled devices which can be integrated with the leading digital assistants’ voice technology, from household appliances and security devices to thermostats and alarm systems. Nest, for example, is a company that is capitalizing on the new voice technology frontier. “Your smart home shouldn’t be dumb,” the company claims. With a Nest Thermostat, you can use Amazon Echo to control the temperature in your home with a simple voice command. Or, pre-order a Nest Hello video doorbell and get a Google Home Mini at no cost when it ships. From alarm systems to smoke and carbon monoxide alarms, Nest Protect thinks, speaks, and alerts your devices.
Going beyond the home, future applications of speech recognition include bringing these voice assistants to the workplace. In late 2017, Amazon announced new voice-activated tools for the workplace, hoping that verbal commands such as, “Alexa, print my spreadsheet,” will expand to common office tasks. Microsoft’s Cortana has similarly begun to manage some of the more onerous office tasks, such as scheduling meetings, recording meeting minutes, and making travel arrangements.
Today, only a few people in high-up positions have their own personal assistant. With the introduction of AI digital-assistants in the office, everyone can have one. From asking Cortana to please access company financial data from last week to last year, to asking your Google Assistant to please create a graph showcasing the year’s growth in click-through-rates – the use-cases for implementing digital-assistants in the workplace are far reaching.
Just think – voice could very well replace manually going through files on your computer just like electronic documents so easily replaced paper records just a short time ago.
Where Did We Come From, Where Will We Go?
Speech recognition has indeed come a long way, from the 1000 A.D. magic eight ball to today’s land grab in the voice technology market. The intense competition we’re seeing between these tech giants, and the growing number of companies jumping in to create content in the space, suggest that we still have a long road ahead of us.