
Methodology


Validation sets

Each validation set should consist of audio files that meet the following conditions:

  • The model has never seen them during training (or, in the case of huge datasets, seeing them is improbable or irrelevant);
  • Audio samples are representative of the domain, i.e.:
    • Have the same quality/codec/sampling rate/etc. as the main part of the domain;
    • Share the domain-specific vocabulary: for example, medical terms for doctors' conversations;
  • Also, the validation set should contain at least 2-5 hours of speech in total (see the sanity-check sketch below).
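
Most of the criteria above are judgment calls, but total duration and sampling-rate consistency are easy to check automatically. Below is a minimal sanity-check sketch in Python; the soundfile package and the val/*.wav path pattern are our assumptions for the example, not part of the original setup.

```python
import glob

import soundfile as sf  # pip install soundfile

def validation_set_stats(pattern: str = "val/*.wav") -> None:
    """Print total duration and the set of sample rates for a validation set."""
    paths = glob.glob(pattern)
    infos = [sf.info(p) for p in paths]  # header-only reads, no full decode
    total_hours = sum(i.duration for i in infos) / 3600
    sample_rates = {i.samplerate for i in infos}
    print(f"{len(paths)} files, {total_hours:.1f} h total, sample rates: {sample_rates}")
    if total_hours < 2:
        print("Warning: less than the recommended 2-5 hours of speech")

validation_set_stats()
```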

To avoid misleading errors, we pre-process audio files and transcripts as follows (a minimal transcript-normalization sketch follows this list):

  • Remove too short, broken, or invalid files;
  • Divide long audio files (>20 s) into smaller chunks of speech;
  • Trim long silences;
  • Normalize transcripts:
    • Lowercase;
    • Remove punctuation/unspoken symbols;
    • Spell out all numbers in words: June 2 -> june second;
    • Expand abbreviations and contractions whose spoken and written forms differ (Mr -> mister) or that have several written forms (that's -> that is).
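
As a rough illustration of the normalization steps above, here is a self-contained Python sketch. It is not the pipeline we actually use: the num2words package, the abbreviation table, and the function name are assumptions made for the example.

```python
import re

from num2words import num2words  # pip install num2words

# Hypothetical expansion table; extend it per domain/language.
ABBREVIATIONS = {
    "mr": "mister",
    "dr": "doctor",
    "that's": "that is",
}

def normalize_transcript(text: str) -> str:
    """Lowercase, drop punctuation, spell out numbers, expand abbreviations."""
    text = text.lower()
    # Spell out standalone numbers as cardinals, e.g. "2" -> "two".
    # Date-aware ordinals ("june second") would need extra logic,
    # e.g. num2words(2, to="ordinal").
    text = re.sub(r"\d+", lambda m: num2words(int(m.group())), text)
    # Keep only letters and apostrophes, then expand abbreviations/contractions.
    words = [ABBREVIATIONS.get(w, w) for w in re.findall(r"[a-z']+", text)]
    return " ".join(words)

print(normalize_transcript("Mr Smith arrived on June 2"))
# -> "mister smith arrived on june two"
```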

Metrics

For benchmarks we use the standard STT metric, Word Error Rate (WER).

In plain terms, WER can be interpreted as the approximate percentage of INCORRECTLY recognized words. So, if WER = 50%, roughly every second word is wrong; if WER = 20%, every fifth, and so on.

In reality, WER is a bit more complicated, since substitutions, insertions, and deletions of words are all taken into account. If you are familiar with Levenshtein distance, WER is simply that distance computed over sequences of words, normalized by the length of the reference.
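
In other words, WER is the word-level edit distance divided by the number of words in the reference transcript. A minimal Python sketch of this definition (the function name and example strings are ours):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over word sequences, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
# 2 errors (1 substitution + 1 deletion) over 6 reference words -> ~0.33
```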

Metrics Achievable in Real Life

According to [1] and [2], humans typically make 4-12% errors when transcribing audio. Specifically, the WER of manual English annotation is:

  • 4-5% on clean speech;
  • 10-12% on speech with accents/noise/defects.

We estimate human WER on clean Russian speech to be at least 6-7% due to grammatical cases and other language-specific rules.

What does this mean for speech recognition systems? Without going into the details of tuning a system for a specific domain, it is reasonable to expect the following metrics:

Domain                English    Russian
Manual transcription  4-5%       5-7%
Clean speech          5-10%      7-10%
Phone calls           10-15%     10-20%
Noisy speech          20-30%     20-40%

So, if an ASR system or paper reports WER < 4%, that is at least a reason to take its claims with a grain of salt. An implausibly low error rate usually means over-fitting to a specific dataset (for example, LibriSpeech) and, as a result, poor overall performance. Read more about this in our paper on The Gradient.
