
If you’re developing a speech product like a voice assistant or speech recognition software, at some point, you’ll find yourself in need of speech data to train your machine learning algorithms.

But speech comes in many forms—especially when it comes to interacting with an AI. Depending on the type of interaction you’re looking to build, and how robust you want that interaction to be, you will require different types of voice data.

At Globalme, we’ve collected thousands of hours of speech data for our clients. This experience has allowed us to identify the most commonly requested forms of speech recognition data and the main pros and cons of each type.

In this blog, we’ll cover the three most popular types of speech recognition data.

What is speech recognition data?

By speech recognition data, we mean audio recordings of human speech that are used to train a voice recognition system. This audio data is typically paired with a text transcription of the speech.

The audio and transcription are fed to a machine learning algorithm as training data so the system can learn how to identify the acoustics of certain speech sounds as well as the meaning behind the words.
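To make that concrete, here's a minimal sketch of what one paired training example might look like once it's been organized for training: an audio file referenced alongside its transcription and a little speaker metadata. The field names and JSON-lines layout are illustrative assumptions rather than any fixed standard.

```python
import json

# One training example: an audio recording paired with its transcription.
# Field names here are illustrative; real datasets use many different schemas.
example = {
    "audio_path": "recordings/speaker_0042/utt_0001.wav",
    "transcript": "Alexa, turn off the TV.",
    "language": "en-US",
    "speaker_id": "speaker_0042",
    "duration_seconds": 2.3,
}

# Datasets are commonly stored as one JSON object per line (a "manifest").
with open("train_manifest.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")
```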

There are many readily available sources of speech data, including public speech corpora or pre-packaged datasets, but in most cases, you will need to work with a data services provider to collect your own speech data through remote collection or in-person collection.

Collecting your own data allows you to customize your speech dataset by variables like language, speaker demographics, audio requirements, or collection size.

But what does speech recognition data actually look like—or should we say: sound like?

The Speech Data Spectrum

You can think of speech data as existing on a spectrum from unnatural to natural speech.

On one end of the spectrum, you have the “unnatural” case of someone reading directly from a script. The speaker is restricted in what they can say so that we can capture variance in how a particular phrase is read.

On the other end of the spectrum, you have the completely natural case of someone speaking spontaneously and freely as part of a back-and-forth conversation with another person. In this case, we lose the ability to closely measure the variance in one variable, but we get a more realistic picture of natural speech.

In the middle of the spectrum, you have cases where speakers are prompted to imagine themselves speaking naturally in a particular scenario: their speech is not scripted, but it is still controlled in some way.

This spectrum allows us to bin speech recognition data into three broad categories:

  1. Controlled: Scripted speech data
  2. Semi-controlled: Scenario-based speech data
  3. Natural: Unscripted or conversational speech data

Here’s a closer look at each type of data and when you would use them:

1. Scripted Speech Data

Scripted speech data is the most controlled form of speech data. In this format, speakers are asked to record specific utterances from a script.

For speech recognition purposes, scripted speech data typically includes voice commands, wake words, or a combination of the two.

For example, a participant could be asked to read a list of scripted wake word + command sentences, written to capture a variety of wording options:

Sentence 1: “Alexa, turn off the TV.”
Sentence 2: “Alexa, please turn off the TV.”
Sentence 3: “Alexa, turn off the television.”
Sentence 4: “Alexa, please turn off the television.”
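If it helps to picture how a script like this comes together, here's a toy sketch that builds the four variants above by combining a wake word with a couple of phrasing options. It's purely illustrative; in a real collection the wording list comes from the commands the developer has already chosen.

```python
from itertools import product

# Illustrative script generator: combine a wake word, an optional politeness
# marker, and two ways of naming the device to get the four variants above.
wake_word = "Alexa"
politeness = ["", "please "]
objects = ["the TV", "the television"]

script = [
    f"{wake_word}, {polite}turn off {obj}."
    for polite, obj in product(politeness, objects)
]

for i, sentence in enumerate(script, start=1):
    print(f'Sentence {i}: "{sentence}"')
```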

What it’s used for: Scripted speech data is used when developers need speech samples that vary not in what is said, but in how it’s said.

In this case, the developer has already chosen or researched the most common speech commands for their technology (oftentimes from a preceding natural language collection) and wants to ensure their speech recognition will work for a wide variety of pronunciations.

Scripted speech data helps the developer achieve variety both across speaker groups and within speaker groups.

For example, if a developer were creating a voice assistant, they would need to capture voice commands across a variety of speaker accents (e.g. Spanish-accented English, British-accented English, and American-accented English).

But they would also need to collect from a large number of speakers within each of those accent groups to ensure enough speaker-to-speaker variety.
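In practice, that usually means tracking how many distinct speakers have been collected in each accent group against a target. The sketch below is a toy version of that bookkeeping; the accent labels and the quota of 500 speakers are assumptions, not recommendations.

```python
from collections import defaultdict

# Illustrative check: how many distinct speakers do we have per accent group?
target_per_group = 500  # assumed quota, purely for illustration
recordings = [
    {"speaker_id": "spk_001", "accent": "American-accented English"},
    {"speaker_id": "spk_001", "accent": "American-accented English"},
    {"speaker_id": "spk_002", "accent": "British-accented English"},
    {"speaker_id": "spk_003", "accent": "Spanish-accented English"},
    # ... thousands more records in a real collection
]

speakers_by_accent = defaultdict(set)
for r in recordings:
    speakers_by_accent[r["accent"]].add(r["speaker_id"])

for accent, speakers in speakers_by_accent.items():
    shortfall = max(0, target_per_group - len(speakers))
    print(f"{accent}: {len(speakers)} speakers, {shortfall} more needed")
```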

Advantage: Scripted speech has the advantage of controlling the exact words that are used, so the only variance is in how the words are pronounced.

Disadvantage: By restricting what the speaker can say, scripted speech misses out on the natural variety of language. In the case of voice commands, this could mean ignoring other phrasing structures, or failing to capture artifacts like the hemming and hawing found in natural speech.

Hear it in action: For an example of scripted speech data in four different languages, download Globalme’s Alexa wake word sample dataset.

2. Scenario-Based Speech Data

Scenario-based speech data is a form of natural language collection where speakers are asked to come up with their own voice commands based on a provided scenario.

For example, a participant could be asked to come up with a variety of ways of asking Alexa for directions, or to spontaneously come up with a list of commands they may give to a banking app.

Prompt: You would like to be taken to your favorite restaurant. How would you ask your device to navigate to the restaurant?

Speaker 1: Take me to Pizza Hut.
Speaker 2: Give me directions to Pizza Hut.
Speaker 3: Can I get directions to Pizza Hut?

What it’s used for: Scenario-based speech data is collected when developers need a natural sampling of different ways of asking for the same thing, or when they need a wider variety of command intentions (i.e. asking for different things).

Therefore, scenario-based speech data provides variety both in what is said and how it’s said.
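As a rough illustration of that distinction, collected scenario responses are often stored under the scenario or intent they were elicited for, so the different phrasings of the same request sit side by side. The intent label and field names below are assumptions made for the example.

```python
from collections import defaultdict

# Illustrative scenario-based responses: same intent, different phrasings.
responses = [
    {"intent": "navigate_to_place", "transcript": "Take me to Pizza Hut."},
    {"intent": "navigate_to_place", "transcript": "Give me directions to Pizza Hut."},
    {"intent": "navigate_to_place", "transcript": "Can I get directions to Pizza Hut?"},
]

# Group phrasings by intent to see the "what is said" vs "how it's said" spread.
by_intent = defaultdict(list)
for r in responses:
    by_intent[r["intent"]].append(r["transcript"])

for intent, phrasings in by_intent.items():
    print(f"{intent}: {len(phrasings)} distinct phrasings")
```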

Advantages: If a device is designed to understand everyday speech and all the nuances that come with it, then a scenario-based collection is critical.

Unlike scripted speech, which can only train for a subset of commands (e.g. “Turn on”, “Turn off”), scenario-based data can account for all the different ways your customer may phrase their request.

Disadvantages: Because there is increased variety in the actual words used, scenario-based data is less useful than scripted speech for training on acoustic variance between speakers. And because there are so many possible ways of phrasing the same request, scenario-based collections require far more data.

3. Unscripted or Conversational Speech Data

Unscripted or conversational speech data is a recording of a conversation between two or more speakers—the most “natural” form of speech.

Natural speech comes in many forms in the real world, and so does unscripted speech data. For example, this data could take the form of phone conversation recordings or recordings of people speaking to each other in a crowded room.

If a developer is looking for conversational data around a particular topic (e.g. music), two speakers may be prompted to have a conversation about that specific subject.

Speaker 1: Hi Jen, it’s good to hear from you again!
Speaker 2: It’s been too long! I’m actually calling to see if you want to grab some dinner this week.
Speaker 1: I’d love to. I’m busy tomorrow night, but Wednesday is wide open.
Speaker 2: Great! I can be off work around six and come swing by to grab you if that works?
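Transcripts of conversational recordings like this are typically broken into speaker turns, often with timestamps, so a model can learn who said what and when. The structure below is a simplified sketch of that idea, not any particular annotation format.

```python
# Illustrative turn-level transcript of the conversation above.
# Timestamps (in seconds) and field names are made up for the example.
conversation = [
    {"speaker": "Speaker 1", "start": 0.0, "end": 2.8,
     "text": "Hi Jen, it's good to hear from you again!"},
    {"speaker": "Speaker 2", "start": 3.1, "end": 7.5,
     "text": "It's been too long! I'm actually calling to see if you want to grab some dinner this week."},
    {"speaker": "Speaker 1", "start": 7.9, "end": 11.6,
     "text": "I'd love to. I'm busy tomorrow night, but Wednesday is wide open."},
]

for turn in conversation:
    print(f"[{turn['start']:>5.1f}-{turn['end']:>5.1f}] {turn['speaker']}: {turn['text']}")
```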

What it’s used for: Unscripted or conversational speech data is used to help train AI applications on the dynamics of a multi-speaker conversation.

The first challenge for conversational AI like chatbots and voice assistants is conversational context. These applications need to understand the flow of a natural conversation, which requires different speech input than one-off speech commands.

Humans are really good at making assumptions about context. Take the following example:

Speaker 1: I want to go to the movies.
Speaker 2: I want to see the one with Brad Pitt.

To us super-intelligent humans, there’s nothing complicated happening here. But for a bot, it’s really tricky to understand that when Speaker 2 says “the one”, they’re still referring to a movie.

This is made even more challenging by the fact that people will suddenly shift conversational topics without warning. The machine somehow needs to learn if the next phrase is a new topic or related to something that was previously said. Transcribed conversational speech data helps to train on these cases.

The second major challenge for AI with conversations is teasing apart overlapping speech. When two speakers are speaking on top of each other, the machine has to pick out individual voices. And even in cases where speakers don’t overlap in speech, the AI still has to understand when each speaker has finished their turn in the conversation.
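To make the overlap problem concrete, here's a small sketch of how overlapping spans could be found once each speaker's turns have been labeled with start and end times. It only illustrates the idea; a real system has to separate the voices from the audio itself, which is much harder.

```python
def overlapping_regions(turns_a, turns_b):
    """Return (start, end) spans where two speakers' turns overlap in time.

    Each turn is a (start_seconds, end_seconds) tuple. Illustrative only:
    real diarization works on the audio signal, not on pre-labeled times.
    """
    overlaps = []
    for a_start, a_end in turns_a:
        for b_start, b_end in turns_b:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:
                overlaps.append((start, end))
    return overlaps

# Example: Speaker 2 starts talking before Speaker 1 finishes their turn.
speaker_1 = [(0.0, 3.2), (6.0, 9.5)]
speaker_2 = [(2.8, 6.4)]
print(overlapping_regions(speaker_1, speaker_2))  # [(2.8, 3.2), (6.0, 6.4)]
```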

Advantage: Conversational data builds an understanding of conversational context and of how different sentences relate to each other, which adds a much more realistic dimension to AI.

Disadvantages: Because conversational data is relatively unstructured and unpredictable, it is more difficult to train on and requires a significant amount of data.

Hear it in action: For an example of conversational data in three different languages, check out Globalme’s phone conversation data sample.

Start collecting speech recognition data

This article should have given you a clear picture of which type of speech recognition data is needed for your speech solution.

If you’re ready to start collecting data for your machine learning project, check out Globalme’s speech data collection services to learn more.