A Complete Guide to Speech Recognition Technology

Last Updated June 11, 2021


Here’s everything you need to know about speech recognition technology. History, how it works, how it’s used today, what the future holds, and what it all means for you.

Back in 2008, many of us were captivated by Tony Stark’s virtual butler, J.A.R.V.I.S., in Marvel’s Iron Man movie.

J.A.R.V.I.S. started as a computer interface. It was eventually upgraded to an artificial intelligence system that ran the business and provided global security.


J.A.R.V.I.S. opened our eyes – and ears – to the possibilities inherent in speech recognition technology. While we’re maybe not all the way there just yet, advancements are being used in many ways on a wide variety of devices.

Speech recognition technology allows for hands-free control of smartphones, speakers, and even vehicles in a wide variety of languages.

It’s an advancement that’s been dreamt of and worked on for decades. The goal is, quite simply, to make life simpler and safer.

In this guide, we’ll take a brief look at the history of speech recognition technology, explain how it works, highlight some of the devices that make use of it, and then examine what might be just around the corner.

History of Speech Recognition Technology

Speech recognition is valuable because it saves consumers and companies time and money.

The average typing speed on a desktop computer is around 40 words per minute. That rate diminishes a bit when it comes to typing on smartphones and mobile devices.

When it comes to speech, though, we can rack up between 125 and 150 words per minute. That’s a drastic increase.

Therefore, speech recognition helps us do everything faster—whether it’s creating a document or talking to an automated customer service agent.

At its core, speech recognition technology is the use of natural language to trigger an action. Modern speech technology began in the 1950s and has advanced steadily in the decades since.

Speech Recognition Through the Years

  • 1950s: Bell Laboratories developed “Audrey”, a system able to recognize the numbers 1-9 spoken by a single voice.
  • 1960s: IBM came up with a device called “Shoebox” that could recognize and differentiate between 16 spoken English words.
  • 1970s: DARPA-funded research led to the ‘Harpy’ system at Carnegie Mellon, which could understand over 1,000 words.
  • 1990s: The advent of personal computing brought quicker processors and opened the door for dictation technology. Bell was at it again with dial-in interactive voice recognition systems.
  • 2000s: Speech recognition achieved close to an 80% accuracy rate. Then Google Voice Search came on the scene, making the technology available to millions of users and allowing Google to collect valuable data.
  • 2010s: Apple launched Siri and Amazon came out with Alexa in a bid to compete with Google. These big three continue to lead the charge.

Slowly but surely, developers have moved towards the goal of enabling machines to understand and respond to more and more of our verbalized commands.

Today’s leading speech recognition systems—Google Assistant, Amazon Alexa, and Apple’s Siri—would not be where they are today without the early pioneers who paved the way.

Thanks to the integration of new technologies such as cloud-based processing, along with ongoing speech data collection, these systems have steadily improved their ability to ‘hear’ and understand a wider variety of words, languages, and accents.

How Does Voice Recognition Work?

Now that we’re surrounded by smart cars, smart home appliances, and voice assistants, it’s easy to take for granted how speech recognition technology works.

Why?

Because the simplicity of being able to speak to digital assistants is misleading. Voice recognition is incredibly complicated—even now.

Think about how a child learns a language.

From day one, they hear words being used all around them. Parents speak and their child listens. The child absorbs all kinds of verbal cues: intonation, inflection, syntax, and pronunciation. Their brain is tasked with identifying complex patterns and connections based on how their parents use language.

But whereas human brains are hard-wired to acquire speech, speech recognition developers have to build the hard wiring themselves.

The challenge is building the language-learning mechanism. There are thousands of languages, accents, and dialects to consider, after all.

That’s not to say we aren’t making progress. In early 2020, researchers at Google were finally able to beat human performance on a broad range of language understanding tasks.

Google’s updated model now performs better than humans in labelling sentences and finding the right answers to a question.

Basic Steps

  1. A microphone transmits the vibrations of a person’s voice into a wavelike electrical signal.
  2. This signal in turn is converted by the system’s hardware—a computer’s sound card, for example—into a digital signal.
  3. The speech recognition software analyzes the digital signal to register phonemes, units of sound that distinguish one word from another in a particular language.
  4. The phonemes are reconstructed into words.

To pick the correct word, the program must rely on context cues, accomplished through trigram analysis.

This method relies on a database of frequent three-word clusters in which probabilities are assigned that any two words will be followed by a given third word.

Think about the predictive text on your phone’s keyboard. A simple example would be typing “how are” and your phone would suggest “you?” The more you use it, though, the more it gets to know your tendencies and will suggest frequently used phrases.
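To make the trigram idea concrete, here’s a minimal Python sketch of counting three-word clusters and suggesting the most likely next word. The tiny corpus is purely illustrative, not real speech data.

```python
from collections import defaultdict, Counter

# A toy corpus standing in for real transcribed speech.
corpus = "how are you how are they how are you doing".split()

# Count every (word1, word2) -> word3 occurrence.
trigram_counts = defaultdict(Counter)
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    trigram_counts[(w1, w2)][w3] += 1

def predict_next(w1, w2):
    """Return the most probable third word and its probability."""
    followers = trigram_counts[(w1, w2)]
    if not followers:
        return None
    word, count = followers.most_common(1)[0]
    return word, count / sum(followers.values())

print(predict_next("how", "are"))  # ('you', 0.666...) on this toy corpus
```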

Speech recognition software works by breaking down the audio of a speech recording into individual sounds, analyzing each sound, using algorithms to find the most probable word fit in that language, and transcribing those sounds into text.

How do companies build speech recognition technology?

A lot of this depends on what you’re trying to achieve and how much you’re willing to invest.

As it stands, there’s no need to start from scratch in terms of coding and acquiring speech data because much of that groundwork has been laid and is available to be built upon.

For instance, you can tap into commercial application programming interfaces (APIs) and access their speech recognition algorithms. The problem, though, is they’re not customizable.
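To give a flavor of what calling such an API looks like, here’s a minimal sketch using the open-source SpeechRecognition Python package, which wraps several commercial engines. The file name is a placeholder, and Google’s free web recognizer is used purely as an example.

```python
import speech_recognition as sr

recognizer = sr.Recognizer()

# "sample.wav" is a placeholder file name for illustration.
with sr.AudioFile("sample.wav") as source:
    audio = recognizer.record(source)  # read the whole file into memory

try:
    # Send the audio to Google's free web recognizer (no customization).
    text = recognizer.recognize_google(audio, language="en-US")
    print("Transcript:", text)
except sr.UnknownValueError:
    print("The engine could not understand the audio.")
except sr.RequestError as err:
    print("Could not reach the recognition service:", err)
```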

You might instead need to seek out speech data collections that can be accessed quickly and efficiently through an easy-to-use API.

From there, you design and develop software to suit your requirements. For example, you might code algorithms and modules using Python.

Regional accents and speech impediments can throw off word recognition platforms, and background noise can be difficult to penetrate, not to mention multiple-voice input. In other words, understanding speech is a much bigger challenge than simply recognizing sounds.

Different Models

  • Acoustic: Take the waveform of speech and break it up into small fragments to predict the most likely phonemes in the speech.
  • Pronunciation: Take the sounds and tie them together to make words, i.e. associate words with their phonetic representations.
  • Language: Take the words and tie them together to make sentences, i.e. predict the most likely sequence of words (or text strings) among a set of text strings.

Algorithms can also combine the predictions of acoustic and language models to output the most likely text string for a given speech file input.
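As a rough illustration of that combination (with made-up numbers), an algorithm might add acoustic and language model log-probabilities and pick the highest-scoring candidate:

```python
import math

# Hypothetical scores: how well each candidate matches the audio (acoustic)
# and how plausible it is as English (language). The numbers are invented.
candidates = {
    "recognize speech":   {"acoustic": math.log(0.30), "language": math.log(0.40)},
    "wreck a nice beach": {"acoustic": math.log(0.32), "language": math.log(0.02)},
}

LM_WEIGHT = 1.0  # how heavily to weight the language model

def combined_score(scores):
    return scores["acoustic"] + LM_WEIGHT * scores["language"]

best = max(candidates, key=lambda c: combined_score(candidates[c]))
print(best)  # "recognize speech" wins once word-sequence plausibility is considered
```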

To further highlight the challenge, speech recognition systems have to be able to distinguish between homophones (words with the same pronunciation but different meanings), to learn the difference between proper names and separate words (“Tim Cook” is a person, not a request for Tim to cook), and more.

After all, speech recognition accuracy is what determines whether voice assistants become a can’t-live-without accessory.

How Voice Assistants Bring Speech Recognition into Everyday Life

Speech recognition technology has grown by leaps and bounds in the early 21st century and has quite literally made itself at home.

Look around you. There could be a handful of devices at your disposal at this very moment.

Let’s look at a few of the leading options.

Apple’s Siri

Apple’s Siri emerged as the first popular voice assistant after its debut in 2011. Since then, it has been integrated into all iPhones, iPads, the Apple Watch, the HomePod, Mac computers, and Apple TV.

Siri is even used as the key user interface in Apple’s CarPlay infotainment system, as well as the wireless AirPod earbuds, and the HomePod Mini.

Siri is with you everywhere you go: on the road, in your home, and, for some, literally on your body. This gave Apple a huge advantage in terms of early adoption.

Naturally, being the earliest quite often means receiving most of the flak for functionality that might not work as expected.

Although Apple had a big head start with Siri, many users expressed frustration at its seeming inability to properly understand and interpret voice commands.

If you asked Siri to send a text message or make a call on your behalf, it could easily do so. However, when it came to interacting with third-party apps, Siri was a little less robust compared to its competitors.

But today, an iPhone user can say, “Hey Siri, I’d like a ride to the airport” or “Hey Siri, order me a car,” and Siri will open whatever ride service app you have on your phone and book the trip.

Improvements to the system’s ability to handle follow-up questions and language translation, along with a revamp of Siri’s voice to sound more human, are helping to iron out the voice assistant’s user experience.

As of 2021, Apple leads its competitors in terms of availability by country and thus in Siri’s understanding of foreign accents. Siri is available in more than 30 countries and 21 languages – and, in some cases, several different dialects.

Amazon Alexa

Amazon announced Alexa and the Echo to the world in 2014, kicking off the age of the smart speaker.

Alexa is now housed inside the following:

  • Echo smart speakers
  • Echo Show (a voice-controlled tablet)
  • Echo Spot (a voice-controlled alarm clock)
  • Echo Buds headphones (Amazon’s version of Apple’s AirPods)

In contrast to Apple, Amazon has always believed the voice assistant with the most “skills” (its term for voice apps on its Echo assistant devices) “will gain a loyal following, even if it sometimes makes mistakes and takes more effort to use”.

Although some users pegged Alexa’s word recognition rate as being a shade behind other voice platforms, the good news is that Alexa adapts to your voice over time, offsetting any issues it may have with your particular accent or dialect.

Speaking of skills, Amazon’s Alexa Skills Kit (ASK) is perhaps what has propelled Alexa forward as a bona fide platform. ASK allows third-party developers to create apps and tap into the power of Alexa without ever needing native support.
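To give a sense of what a custom skill involves, here’s a bare-bones sketch of an AWS Lambda handler that returns an Alexa-style response envelope. The intent name is hypothetical, and a real skill would also define its interaction model in the Alexa developer console.

```python
# Hypothetical handler for a custom Alexa skill running on AWS Lambda.
# "OrderPizzaIntent" is an invented intent name for illustration only.
def lambda_handler(event, context):
    request = event["request"]

    if request["type"] == "LaunchRequest":
        speech = "Welcome! What would you like to do?"
    elif request["type"] == "IntentRequest" and request["intent"]["name"] == "OrderPizzaIntent":
        speech = "Okay, ordering your usual pizza."
    else:
        speech = "Sorry, I didn't catch that."

    # Alexa expects a response envelope roughly in this shape.
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech},
            "shouldEndSession": True,
        },
    }
```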

Alexa was ahead of the curve with its integration with smart home devices: cameras, door locks, entertainment systems, lighting, and thermostats.

Ultimately, this gives users control of their home whether they’re cozying up on the couch or on the go. With Amazon’s Smart Home Skill API, developers can enable customers to control their connected devices from tens of millions of Alexa-enabled endpoints.

When you ask Siri to add something to your shopping list, she adds it without buying it for you. Alexa, however, goes a step further.

If you ask Alexa to re-order garbage bags, she’ll search Amazon and order some. In fact, you can order millions of products off Amazon without ever lifting a finger: a natural and unique ability that Alexa has over its competitors.

Google Assistant

How many of us have said or heard “let me Google that for you”? Almost everyone, it seems. It only makes sense then, that Google Assistant prevails when it comes to answering (and understanding) all questions its users may have.

From asking for a phrase to be translated into another language, to converting the number of sticks of butter in one cup, Google Assistant not only answers correctly, but also gives some additional context and cites a source website for the information.

Given that it’s backed by Google’s powerful search technology, perhaps that’s unsurprising.

Though Amazon’s Alexa was released (through the introduction of Echo) two years earlier than Google Home, Google has made great strides in catching up with Alexa in a very short time. Google Home was released in late 2016, and within a year, had already established itself as the most meaningful opponent to Alexa.

In 2017, Google boasted a 95% word accuracy rate for U.S. English, the highest of all the voice assistants out there. This translates to a 4.9% word error rate, making Google the first of the group to fall below the 5% threshold.

Word-error rate has its limitations, though. Factors that affect the data include:

  • Background noise
  • Crosstalk
  • Accents
  • Rare words
  • Context

Still, they’re getting close to 0% and that’s significant.
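For reference, word-error rate is usually computed as the number of substitutions, deletions, and insertions needed to turn the reference transcript into the system’s output, divided by the number of reference words. Here’s a small Python sketch with made-up sentences:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance between word sequences, divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit-distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("turn on the kitchen lights",
                      "turn on a kitchen light"))  # 0.4 (2 errors / 5 words)
```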

To get a better sense of the languages supported by these voice assistants, be sure to check out our comparison article.

Where else is speech recognition technology prevalent?

Voice assistants are far from the only mechanisms through which advancements in speech recognition are becoming even more mainstream.

In-Car Speech Recognition

Voice-activated devices and digital voice assistants aren’t just about making things easier. They’re also about safety – at least when it comes to in-car speech recognition.

Companies like Apple, Google, and Nuance have completely reshaped the driver’s experience in the vehicle, aiming to remove the distraction of looking down at a mobile phone and allowing drivers to keep their eyes on the road.

  • Instead of texting while driving, you can now tell your car who to call or what restaurant to navigate to.
  • Instead of scrolling through Apple Music to find your favorite playlist, you can just ask Siri to find and play it for you.
  • If the fuel in your car is running low, your in-car speech system can not only inform you that you need to refuel, but also point out the nearest fuel station and ask whether you have a preference for a particular brand. Or perhaps it can warn you that the petrol station you prefer is too far to reach with the fuel remaining.

When it comes to safety, there’s an important caveat to be aware of. A report published by the UK’s Transport Research Laboratory (TRL) showed that driver distraction levels are much lower when using voice-activated systems than when using touch-screen systems.

However, it notes that further research is needed before spoken instructions can be recommended as the safest method for future in-car control, since the most effective safety precaution would be eliminating distractions altogether.

That’s where field data collection comes in.

How to Train a Car

Companies need precise and comprehensive data with respect to terms and phrases that would be used to communicate in a vehicle.

Field data collection is conducted in a specifically chosen physical location or environment as opposed to remotely. This data is collected via loosely structured scenarios that include elements like culture, education, dialect, and social environment that can have an impact on how a user will articulate a request.

This is best suited for projects with specific environmental requirements, such as specific acoustics for sound recordings.

Think about in-car speech recognition, for example. Driving around presents unique circumstances in terms of speech data.

You must be able to record speech data from the cabin of a car to simulate the acoustic environment, background noise, and voice commands used in real scenarios.

That’s how you reach new levels of innovation in human and machine interaction.
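As one illustration of what that involves, here’s a rough sketch of mixing clean speech with recorded cabin noise at a chosen signal-to-noise ratio, assuming mono WAV files, placeholder file names, and the numpy and soundfile packages:

```python
import numpy as np
import soundfile as sf

# Placeholder file names; assume mono recordings at the same sample rate.
speech, sr = sf.read("speech.wav")
noise, _ = sf.read("cabin_noise.wav")
noise = np.resize(noise, speech.shape)  # loop or trim the noise to match length

target_snr_db = 10  # lower values simulate a noisier cabin

# Scale the noise so the speech-to-noise power ratio hits the target SNR.
speech_power = np.mean(speech ** 2)
noise_power = np.mean(noise ** 2)
scale = np.sqrt(speech_power / (noise_power * 10 ** (target_snr_db / 10)))

sf.write("speech_in_car.wav", speech + scale * noise, sr)
```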

Voice-Activated Video Games

Speech recognition technology is also making strides in the gaming industry.

Voice-activated video games have begun to extend from the classic console and PC format to voice-activated mobile games and apps.

Creating a video game is already extraordinarily difficult. It takes years to properly flesh out the plot, the gameplay, character development, customizable gear, worlds, and so on. The game also has to be able to change and adapt based on each player’s actions.

Now, just imagine adding another layer to gaming through speech recognition technology.

Many of the companies championing this idea do so with the intention of making gaming more accessible for visually and/or physically impaired players, as well as allowing players to immerse themselves further into gameplay through enabling yet another layer of integration.

Voice control could also potentially lower the learning curve for beginners, seeing as less importance will be placed on figuring out controls. Players can just begin talking right away.

Moving forward, text-to-speech (TTS), synthetic voices, and generative neural networks will help developers create spoken and dynamic dialogue.

You will be able to have a conversation with characters within the game itself.
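As a small taste of the text-to-speech side, here’s a minimal sketch using the offline pyttsx3 package. The line of dialogue is invented for illustration:

```python
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 160)  # speaking speed in words per minute
engine.say("Welcome back, traveler. The northern gate is open.")
engine.runAndWait()
```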

The rise of speech technology in video games has only just begun.

Speech Recognition Technology: The Focus Moving Forward

What does the future of speech recognition hold?

Here are a few key areas of focus you can expect moving forward.

1. Mobile app voice integration

Integrating voice-tech into mobile apps has become a hot trend, and will remain so because speech is a natural user interface (NUI).

Voice-powered apps increase functionality and save users from complicated navigation.

It’s easier for the user to navigate an app — even if they don’t know the exact name of the item they’re looking for or where to find it in the app’s menu.

Voice integration will soon become a standard that users will expect.

2. Individualized experiences

Voice assistants will also continue to offer more individualized experiences as they get better at differentiating between voices.

Google Home, for example, can not only support up to six user accounts but also detect unique voices, which allows you to customize many features.

You can ask “What’s on my calendar today?” or say “Tell me about my day,” and the assistant will dictate commute times, weather, and news information tailored specifically to you.

It also includes features such as nicknames, work locations, payment information, and linked accounts such as Google Play, Spotify, and Netflix.

Similarly, for those using Alexa, saying “learn my voice” will allow you to create separate voice profiles so it can detect who is speaking.

3. Smart displays

The smart speaker is great and all, but what people are really after now is the smart display, essentially a smart speaker with a touch screen attached to it.

In 2020, sales of smart displays rose by 21% to 9.5 million units, while sales of basic smart speakers fell by 3%, and that trend is only likely to continue.

Smart displays like the Russian Sber portal or the Chinese smart screen Xiaodu, for example, are already equipped with several AI-powered functions, including far-field voice interaction, facial recognition, hand gesture control, and eye gesture detection.

Collect Better Data

We help you create outstanding human experiences with high-quality speech, image, video, or text data for AI.

Summa Linguae Technologies collects and annotates the training and testing data you need to build your AI-powered solutions, including voice assistants, wearables, autonomous vehicles, and much more.

We offer both in-field and remote data collection options. They’re backed by a network of technical engineers, project managers, quality assurance professionals, and annotators.


Want even more? Contact us today for a full speech data solutions consultation.

