
If you’re developing a speech product like a voice assistant or speech recognition software, at some point, you’ll find yourself in need of speech data to train your machine learning algorithms.

But speech comes in many forms—especially when it comes to interacting with an AI. Depending on the type of interaction you’re looking to build, and how robust you want that interaction to be, you will require different types of voice data.

At Globalme, we’ve collected thousands of hours of speech data for our clients. This experience has allowed us to identify the most commonly requested forms of speech recognition data and the main pros and cons of each type.

In this blog, we’ll cover the three most popular types of speech recognition data.

What is speech recognition data?

By speech recognition data, we mean audio recordings of human speech that are used to train a voice recognition system. This audio data is typically paired with a text transcription of the speech.

The audio and transcription are fed to a machine learning algorithm as training data so the system can learn how to identify the acoustics of certain speech sounds as well as the meaning behind the words.
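To make that concrete, here's a minimal sketch of what one paired training example might look like once it's been organized for training: an audio file referenced alongside its transcription and a little speaker metadata. The field names and JSON-lines layout are illustrative assumptions rather than any fixed standard.

```python
import json

# One training example: an audio recording paired with its transcription.
# Field names here are illustrative; real datasets use many different schemas.
example = {
    "audio_path": "recordings/speaker_0042/utt_0001.wav",
    "transcript": "Alexa, turn off the TV.",
    "language": "en-US",
    "speaker_id": "speaker_0042",
    "duration_seconds": 2.3,
}

# Datasets are commonly stored as one JSON object per line (a "manifest").
with open("train_manifest.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")
```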

There are many readily available sources of speech data, including public speech corpora or pre-packaged datasets, but in most cases, you will need to work with a data services provider to collect your own speech data through remote collection or in-person collection.

Collecting your own data allows you to customize your speech dataset by variables like language, speaker demographics, audio requirements, or collection size.

But what does speech recognition data actually look like—or should we say: sound like?

The Speech Data Spectrum

You can think of speech data as existing on a spectrum from unnatural to natural speech.

On one end of the spectrum, you have the “unnatural” case of someone reading directly from a script. The speaker is restricted in what they can say so that we can capture variance in how a particular phrase is read.

On the other end of the spectrum, you have the completely natural case of someone speaking spontaneously and freely as part of a back-and-forth conversation with another person. In this case, we lose the ability to closely measure the variance in one variable, but we get a more realistic picture of natural speech.

In the middle of the spectrum, you have cases where speakers are prompted to imagine themselves speaking naturally in a particular scenario: their speech is not scripted, but it is still controlled in some way.

This spectrum allows us to bin speech recognition data into three broad categories:

  1. Controlled: Scripted speech data
  2. Semi-controlled: Scenario-based speech data
  3. Natural: Unscripted or conversational speech data

Here’s a closer look at each type of data and when you would use them:

1. Scripted Speech Data

Scripted speech data is the most controlled form of speech data. In this format, speakers are asked to record specific utterances from a script.

For speech recognition purposes, scripted speech data typically includes voice commands, wake words, or a combination of the two.

For example, a participant could be asked to read a list of scripted wake word + command sentences, written to capture a variety of wording options:

Sentence 1: “Alexa, turn off the TV.”
Sentence 2: “Alexa, please turn off the TV.”
Sentence 3: “Alexa, turn off the television.”
Sentence 4: “Alexa, please turn off the television.”
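If it helps to picture how a script like this comes together, here's a toy sketch that builds the four variants above by combining a wake word with a couple of phrasing options. It's purely illustrative; in a real collection the wording list comes from the commands the developer has already chosen.

```python
from itertools import product

# Illustrative script generator: combine a wake word, an optional politeness
# marker, and two ways of naming the device to get the four variants above.
wake_word = "Alexa"
politeness = ["", "please "]
objects = ["the TV", "the television"]

script = [
    f"{wake_word}, {polite}turn off {obj}."
    for polite, obj in product(politeness, objects)
]

for i, sentence in enumerate(script, start=1):
    print(f'Sentence {i}: "{sentence}"')
```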

What it’s used for: Scripted speech data is used when developers need speech samples that vary not in what is said, but in how it’s said.

In this case, the developer has already chosen or researched the most common speech commands for their technology (oftentimes from a preceding natural language collection) and wants to ensure their speech recognition will work for a wide variety of pronunciations.

Scripted speech data helps the developer achieve variety both across speaker groups and within speaker groups.

For example, if a developer were creating a voice assistant, they would need to capture voice commands across a variety of speaker accents (e.g. Spanish-accented English, British-accented English, and American-accented English).

But they would also need to collect from a large number of speakers within each of those accent groups to ensure enough speaker-to-speaker variety.
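In practice, that usually means tracking how many distinct speakers have been collected in each accent group against a target. The sketch below is a toy version of that bookkeeping; the accent labels and the quota of 500 speakers are assumptions, not recommendations.

```python
from collections import defaultdict

# Illustrative check: how many distinct speakers do we have per accent group?
target_per_group = 500  # assumed quota, purely for illustration
recordings = [
    {"speaker_id": "spk_001", "accent": "American-accented English"},
    {"speaker_id": "spk_001", "accent": "American-accented English"},
    {"speaker_id": "spk_002", "accent": "British-accented English"},
    {"speaker_id": "spk_003", "accent": "Spanish-accented English"},
    # ... thousands more records in a real collection
]

speakers_by_accent = defaultdict(set)
for r in recordings:
    speakers_by_accent[r["accent"]].add(r["speaker_id"])

for accent, speakers in speakers_by_accent.items():
    shortfall = max(0, target_per_group - len(speakers))
    print(f"{accent}: {len(speakers)} speakers, {shortfall} more needed")
```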

Advantage: Scripted speech has the advantage of controlling the exact words that are used, so the only variance is in how the words are pronounced.

Disadvantage: By restricting what the speaker can say, scripted speech misses out on the natural variety of language. In the case of voice commands, this could mean ignoring other phrasing structures, or failing to capture artifacts like the hemming and hawing found in natural speech.

Hear it in action: For an example of scripted speech data in four different languages, download Globalme’s Alexa wake word sample dataset.

2. Scenario-Based Speech Data

Scenario-based speech data is a form of natural language collection where speakers are asked to come up with their own voice commands based on a provided scenario.

For example, a participant could be asked to come up with a variety of ways of asking Alexa for directions, or to spontaneously come up with a list of commands they may give to a banking app.

Prompt: You would like to be taken to your favorite restaurant. How would you ask your device to navigate to the restaurant?

Speaker 1: Take me to Pizza Hut.
Speaker 2: Give me directions to Pizza Hut.
Speaker 3: Can I get directions to Pizza Hut?

What it’s used for: Scenario-based speech data is collected when developers need a natural sampling of different ways of asking for the same thing, or when they need a wider variety of command intentions (i.e. asking for different things).

Therefore, scenario-based speech data provides variety both in what is said and how it’s said.
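As a rough illustration of that distinction, collected scenario responses are often stored under the scenario or intent they were elicited for, so the different phrasings of the same request sit side by side. The intent label and field names below are assumptions made for the example.

```python
from collections import defaultdict

# Illustrative scenario-based responses: same intent, different phrasings.
responses = [
    {"intent": "navigate_to_place", "transcript": "Take me to Pizza Hut."},
    {"intent": "navigate_to_place", "transcript": "Give me directions to Pizza Hut."},
    {"intent": "navigate_to_place", "transcript": "Can I get directions to Pizza Hut?"},
]

# Group phrasings by intent to see the "what is said" vs "how it's said" spread.
by_intent = defaultdict(list)
for r in responses:
    by_intent[r["intent"]].append(r["transcript"])

for intent, phrasings in by_intent.items():
    print(f"{intent}: {len(phrasings)} distinct phrasings")
```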

Advantages: If a device is designed to understand everyday speech and all the nuances that come with it, then a scenario-based collection is critical.

Unlike scripted speech, which can only train for a subset of commands (e.g. “Turn on”, “Turn off”), scenario-based data can account for all the different ways your customer may phrase their request.

Disadvantages: Because there is increased variety in the actual words used, scenario-based data is less useful than scripted speech for training on acoustic variance between speakers. And because there are so many possible ways of phrasing the same request, scenario-based collections require far more data.

3. Unscripted or Conversational Speech Data

Unscripted or conversational speech data is a recording of a conversation between two or more speakers—the most “natural” form of speech.

Natural speech comes in many forms in the real world, and so does unscripted speech data. For example, this data could take the form of phone conversation recordings or recordings of people speaking to each other in a crowded room.

If a developer is looking for conversational data around a particular topic (e.g. music), two speakers may be prompted to have a conversation about that specific subject.

Speaker 1: Hi Jen, it’s good to hear from you again!
Speaker 2: It’s been too long! I’m actually calling to see if you want to grab some dinner this week.
Speaker 1: I’d love to. I’m busy tomorrow night, but Wednesday is wide open.
Speaker 2: Great! I can be off work around six and come swing by to grab you if that works?
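Transcripts of conversational recordings like this are typically broken into speaker turns, often with timestamps, so a model can learn who said what and when. The structure below is a simplified sketch of that idea, not any particular annotation format.

```python
# Illustrative turn-level transcript of the conversation above.
# Timestamps (in seconds) and field names are made up for the example.
conversation = [
    {"speaker": "Speaker 1", "start": 0.0, "end": 2.8,
     "text": "Hi Jen, it's good to hear from you again!"},
    {"speaker": "Speaker 2", "start": 3.1, "end": 7.5,
     "text": "It's been too long! I'm actually calling to see if you want to grab some dinner this week."},
    {"speaker": "Speaker 1", "start": 7.9, "end": 11.6,
     "text": "I'd love to. I'm busy tomorrow night, but Wednesday is wide open."},
]

for turn in conversation:
    print(f"[{turn['start']:>5.1f}-{turn['end']:>5.1f}] {turn['speaker']}: {turn['text']}")
```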

What it’s used for: Unscripted or conversational speech data is used to help train AI applications on the dynamics of a multi-speaker conversation.

The first challenge for conversational AI like chatbots and voice assistants is conversational context. These applications need to understand the flow of a natural conversation, which requires different speech input than one-off speech commands.

Humans are really good at making assumptions about context. Take the following example:

Speaker 1: I want to go to the movies.
Speaker 2: I want to see the one with Brad Pitt.

To us super-intelligent humans, there’s nothing complicated happening here. But for a bot, it’s really tricky to understand that when Speaker 2 says “the one”, they’re still referring to a movie.

This is made even more challenging by the fact that people will suddenly shift conversational topics without warning. The machine somehow needs to learn if the next phrase is a new topic or related to something that was previously said. Transcribed conversational speech data helps to train on these cases.

The second major challenge for AI with conversations is teasing apart overlapping speech. When two speakers are speaking on top of each other, the machine has to pick out individual voices. And even in cases where speakers don’t overlap in speech, the AI still has to understand when each speaker has finished their turn in the conversation.
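To make the overlap problem concrete, here's a small sketch of how overlapping spans could be found once each speaker's turns have been labeled with start and end times. It only illustrates the idea; a real system has to separate the voices from the audio itself, which is much harder.

```python
def overlapping_regions(turns_a, turns_b):
    """Return (start, end) spans where two speakers' turns overlap in time.

    Each turn is a (start_seconds, end_seconds) tuple. Illustrative only:
    real diarization works on the audio signal, not on pre-labeled times.
    """
    overlaps = []
    for a_start, a_end in turns_a:
        for b_start, b_end in turns_b:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:
                overlaps.append((start, end))
    return overlaps

# Example: Speaker 2 starts talking before Speaker 1 finishes their turn.
speaker_1 = [(0.0, 3.2), (6.0, 9.5)]
speaker_2 = [(2.8, 6.4)]
print(overlapping_regions(speaker_1, speaker_2))  # [(2.8, 3.2), (6.0, 6.4)]
```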

Advantage: Conversational data builds an understanding of conversational context and of how different sentences relate to each other, which adds a much more realistic dimension to AI.

Disadvantages: Because conversational data is relatively unstructured and unpredictable, it is more difficult to train on and requires a significant amount of data.

Hear it in action: For an example of conversational data in three different languages, check out Globalme’s phone conversation data sample.

Start collecting speech recognition data

This article should have given you a clear picture of which type of speech recognition data is needed for your speech solution.

If you’re ready to start collecting data for your machine learning project, check out Globalme’s speech data collection services to learn more.