
🥇 Quality Benchmarks

For your convenience, we provide a set of benchmarks on publicly available datasets. We chose Google's STT as a decent approximation of a high-quality enterprise solution that is commercially available in many languages.

Methodology

Our approach is described in this article.

Caveats

Overall Quality

Unlike many off-the-shelf solutions, our models (especially the Enterprise Edition models) generalize across the following domains:

  • Video;
  • Lectures;
  • Narration;
  • Phone calls;
  • Various noises, codecs, recording methods and conditions;

Any "in-the-wild" speech with sufficient SNR and recording quality should work reasonably well by design. The main caveat is that our models perform poorly on far-field and extremely noisy audio.

Though our models work fine with 8 kHz audio (phone calls), for simplicity we always resample to 16 kHz; the robustness is built into the models themselves.
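
For reference, a minimal loading-and-resampling sketch using torchaudio (`read_audio_16k` is just a helper name for this example, not a function from the repo):

```python
import torch
import torchaudio

def read_audio_16k(path: str) -> torch.Tensor:
    """Load an audio file, downmix to mono and resample to 16 kHz."""
    waveform, sample_rate = torchaudio.load(path)  # [channels, samples]
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(
            waveform, orig_freq=sample_rate, new_freq=16000)
    return waveform

# e.g. an 8 kHz phone-call recording becomes a 16 kHz mono tensor:
# wav = read_audio_16k('call_8khz.wav')
```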

Visually Pleasing Transcriptions

Be prepared that the CE models sometimes have a hard time producing visually pleasing transcriptions, even though the results are phonetically similar.

This is usually solved one way or another:

  1. Limiting the model to a very narrow domain (i.e. speech commands);
  2. Adding an external traditional (n-gram) or more modern (DL-based) language model and performing some sort of fusion / re-scoring (sketched below);
  3. Using a much larger (hence slower) model;

Options (1) and (3) contradict our design philosophy and in general limit the real-life applicability of the models. We are firm believers that technology should be embarrassingly simple to use (i.e. one line of code). Naturally, we have solved these challenges in the EE edition of our models, but at this stage we are not yet ready to publish embarrassingly simple EE models that fulfill the same criteria (i.e. the whole compute graph triggered by one line of code).
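
To make option (2) concrete, here is a deliberately toy re-scoring sketch: each N-best hypothesis gets its acoustic log-probability combined with a simple bigram language-model score, and the best-scoring hypothesis wins. The hypotheses, scores and bigram table below are invented for illustration; a real system would use a proper n-gram LM (e.g. KenLM) or a neural LM.

```python
# Toy shallow-fusion / re-scoring example: combine acoustic scores
# of N-best hypotheses with a bigram language-model score.
nbest = [
    ("eye sea the see", -4.1),  # (hypothesis, acoustic log-probability)
    ("i see the sea",   -4.3),
    ("i sea the see",   -4.2),
]

# Invented bigram log-probabilities; unseen bigrams get a flat penalty.
bigram_logp = {
    ("<s>", "i"): -0.5, ("i", "see"): -0.7, ("see", "the"): -0.9,
    ("the", "sea"): -1.0, ("sea", "</s>"): -0.8,
}
UNSEEN = -5.0

def lm_score(sentence: str) -> float:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(bigram_logp.get(pair, UNSEEN) for pair in zip(tokens, tokens[1:]))

def rescore(nbest, lm_weight: float = 0.5):
    # Linear interpolation of acoustic and LM scores.
    return max(nbest, key=lambda h: h[1] + lm_weight * lm_score(h[0]))

print(rescore(nbest)[0])  # -> "i see the sea"
```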

Models

  • Google was used as the main reference in terms of quality;
  • CE = Community Edition;
  • EE = Enterprise Edition;

All of the metrics below are WER (word error rate) values, in percent.
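
Since every number below is a WER value, here is a minimal sketch of how it can be computed: the word-level Levenshtein distance divided by the number of reference words:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return 100.0 * dp[-1][-1] / len(ref)

print(wer("i see the sea", "i sea the sea"))  # 25.0
```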

English Version Comparison

Simple WER Version Comparison Table

| Dataset | V1 | V2 | V3 | V4 | V5 |
|---|---|---|---|---|---|
| **AudioBooks** | | | | | |
| lj | | | 5.4 | 5.6 | 5.1 |
| librispeech_test_clean | 6.9 | 6.9 | 5.9 | 6.1 | 5.5 |
| librispeech_val | 11.5 | 11.7 | 9.7 | 10 | 8.8 |
| librispeech_test_other | 17.1 | 17.4 | 15.1 | 15.2 | 13.5 |
| mls_test | | | 17.9 | 17.1 | 14.8 |
| mls_dev | | | 15.8 | 15.3 | 13.3 |
| **Lecture / speech** | | | | | |
| multi_ted_test_he | 12 | 11.5 | 12.1 | 10.2 | 8.4 |
| multi_ted_test_common | 17.6 | 17.3 | 18.3 | 15.5 | 14 |
| multi_ted_val | 20 | 19.9 | 21 | 18.4 | 16.9 |
| voxpopuli_dev | | | 20.8 | 19.4 | 16 |
| voxpopuli_test | | | 21.4 | 20.5 | 16.4 |
| **Finance** | | | | | |
| kensho | | | 8.1 | 5.9 | 4.3 |
| **In the wild** | | | | | |
| common_voice_val | 20.6 | 20.3 | 18.5 | 15.8 | 15 |
| common_voice_test | 25.5 | 25.3 | 23.3 | 20.3 | 19.2 |
| gigaspeech | | | | | 20.7 |
| **VOIP / calls** | | | | | |
| voip_test | | | 21 | 18.7 | 18.3 |
| **Dialects** | | | | | |
| UK dialects mean | 14.6 | 14.6 | 12.6 | 10.9 | 10.4 |

EN V1

All of these tests were run in early September 2020.

| Dataset | Silero CE | Silero EE | Google Video Premium | Google Phone Premium |
|---|---|---|---|---|
| **AudioBooks** | | | | |
| en_v001_librispeech_test_clean | 8.6 | 6.9 | 7.8 | 8.7 |
| en_librispeech_val | 14.4 | 11.5 | 11.3 | 13.1 |
| en_librispeech_test_other | 20.6 | 17.1 | 16.2 | 19.1 |
| **Lecture / speech** | | | | |
| en_multi_ted_test_he | 16.6 | 12.0 | 15.3 | 14.1 |
| en_multi_ted_test_common | 21.2 | 17.6 | 16.9 | 16 |
| en_multi_ted_val | 23.5 | 20 | 22.7 | 20.8 |
| **In the wild** | | | | |
| en_common_voice_val | 27.5 | 20.6 | 20.8 | 20.8 |
| en_common_voice_test | 32.6 | 25.5 | 22.2 | 24 |
| **VOIP / calls** | | | | |
| en_voip_test | 9 | 8.6 | 19.7 | 18.3 |
| **British Dialects** | | | | |
| en_uk_dialects_midlands_english_female | 16.7 | 10.8 | 9.6 | 8.4 |
| en_uk_dialects_southern_english_female | 16.7 | 11.4 | 10.8 | 9.3 |
| en_uk_dialects_welsh_english_female | 17.1 | 12.1 | 20.5 | 10.5 |
| en_uk_dialects_southern_english_male | 17.9 | 12.7 | 11.5 | 10.6 |
| en_uk_dialects_welsh_english_male | 18.6 | 13.2 | 12.1 | |
| en_uk_dialects_northern_english_male | 20 | 13.9 | 15.5 | 11.7 |
| en_uk_dialects_scottish_english_male | 21.3 | 15.1 | 10 | 11.3 |
| en_uk_dialects_midlands_english_male | 21.7 | 15.1 | 11.8 | 10.3 |
| en_uk_dialects_northern_english_female | 22 | 15.2 | 15 | 12.7 |
| en_uk_dialects_scottish_english_female | 22.2 | 15.7 | 13.5 | 12.6 |
| en_uk_dialects_irish_english_male | 32.7 | 25.5 | 25.5 | 21.9 |
| **Far-field / very noisy** | | | | |
| en_voices_rm2_clo_none_stu_manifest | 17.3 | 13.7 | 21.5 | 27 |
| en_voices_rm2_far_none_lav_manifest | 31.4 | 26.5 | 27.5 | 42.3 |
| en_voices_rm4_far_none_stu_manifest | 33.5 | 28.7 | 43.2 | 43.2 |
| en_voices_rm3_clo_none_stu_manifest | 34.5 | 29.9 | 28.6 | 40.8 |
| en_voices_rm2_far_musi_stu_manifest | 35.4 | 30.9 | 30.6 | 42.4 |
| en_voices_rm2_far_babb_stu_manifest | 39.3 | 35.0 | 38.5 | 48.2 |
| en_voices_rm3_clo_musi_stu_manifest | 46.9 | 43 | 38.1 | 51.8 |
| en_voices_rm4_ceo_none_lav_manifest | 50.3 | 46.4 | 42.9 | 52.5 |
| en_voices_rm3_far_none_stu_manifest | 78.9 | 78.3 | 68.8 | 81.6 |
| en_nsc_val_manifest_part1 | 31.7 | 24.4 | NA | NA |
| en_nsc_val_manifest_part2 | 67.0 | 60.9 | NA | NA |

EN V2

Google tests were run in early September 2020.

EN V2 metrics updated in early November 2020.

| Dataset | Silero CE | Silero EE | Google Video Premium | Google Phone Premium |
|---|---|---|---|---|
| **AudioBooks** | | | | |
| en_v001_librispeech_test_clean | 8.7 | 6.9 | 7.8 | 8.7 |
| en_librispeech_val | 14.5 | 11.7 | 11.3 | 13.1 |
| en_librispeech_test_other | 20.6 | 17.4 | 16.2 | 19.1 |
| **Lecture / speech** | | | | |
| en_multi_ted_test_he | 15.0 | 11.5 | 15.3 | 14.1 |
| en_multi_ted_test_common | 20.7 | 17.3 | 16.9 | 16 |
| en_multi_ted_val | 22.9 | 19.9 | 22.7 | 20.8 |
| **In the wild** | | | | |
| en_common_voice_val | 27.1 | 20.3 | 20.8 | 20.8 |
| en_common_voice_test | 32.1 | 25.3 | 22.2 | 24 |
| **VOIP / calls** | | | | |
| en_voip_test | 11.4 | 10.8 | 19.7 | 18.3 |
| **British Dialects** | | | | |
| en_uk_dialects_midlands_english_female | 15.7 | 10.4 | 9.6 | 8.4 |
| en_uk_dialects_southern_english_female | 16.6 | 11.6 | 10.8 | 9.3 |
| en_uk_dialects_welsh_english_female | 16.9 | 11.9 | 20.5 | 10.5 |
| en_uk_dialects_southern_english_male | 17.4 | 12.6 | 11.5 | 10.6 |
| en_uk_dialects_welsh_english_male | 17.8 | 13.1 | 12.1 | |
| en_uk_dialects_northern_english_male | 19.7 | 13.7 | 15.5 | 11.7 |
| en_uk_dialects_scottish_english_male | 20.5 | 14.6 | 10 | 11.3 |
| en_uk_dialects_midlands_english_male | 21.4 | 16.1 | 11.8 | 10.3 |
| en_uk_dialects_northern_english_female | 21.3 | 15.5 | 15 | 12.7 |
| en_uk_dialects_scottish_english_female | 21.8 | 15.4 | 13.5 | 12.6 |
| en_uk_dialects_irish_english_male | 32.5 | 25.7 | 25.5 | 21.9 |
| **Far-field / very noisy** | | | | |
| en_voices_rm2_clo_none_stu_manifest | 17.5 | 14.1 | 21.5 | 27 |
| en_voices_rm2_far_none_lav_manifest | 31.6 | 27.0 | 27.5 | 42.3 |
| en_voices_rm4_far_none_stu_manifest | 33.7 | 29.3 | 43.2 | 43.2 |
| en_voices_rm3_clo_none_stu_manifest | 34.7 | 30.4 | 28.6 | 40.8 |
| en_voices_rm2_far_musi_stu_manifest | 35.9 | 31.5 | 30.6 | 42.4 |
| en_voices_rm2_far_babb_stu_manifest | 39.8 | 35.7 | 38.5 | 48.2 |
| en_voices_rm3_clo_musi_stu_manifest | 47.2 | 43.5 | 38.1 | 51.8 |
| en_voices_rm4_ceo_none_lav_manifest | 50.0 | 46.3 | 42.9 | 52.5 |
| en_voices_rm3_far_none_stu_manifest | 78.3 | 78.0 | 68.8 | 81.6 |
| en_nsc_val_manifest_part1 | 18.3 | 13.9 | NA | NA |
| en_nsc_val_manifest_part2 | 31.7 | 28.5 | NA | NA |

EN V3

Google tests were run in early September 2020.

EN V3 metrics updated in April 2021.

| Dataset | xsmall_q CE | xsmall CE | small_q CE | small CE | large CE | Google Video Premium | Google Phone Premium |
|---|---|---|---|---|---|---|---|
| **AudioBooks / narration** | | | | | | | |
| lj | 11.5 | 10.2 | 8.6 | 7.9 | 6.6 | | |
| librispeech_test_clean | 14.3 | 12.1 | 11.1 | 9.7 | 7.4 | 7.8 | 8.7 |
| librispeech_val | 21.0 | 18.4 | 16.9 | 15.2 | 11.9 | 11.3 | 13.1 |
| librispeech_test_other | 29.0 | 25.7 | 23.8 | 21.6 | 17.9 | 16.2 | 19.1 |
| aru | 21.3 | 18.5 | 16.9 | 14.4 | 11.1 | 16.2 | 19.1 |
| mls_test | 32.0 | 29.2 | 27.3 | 25.2 | 22.0 | | |
| mls_dev | 29.6 | 26.7 | 24.6 | 22.7 | 19.7 | | |
| **Lecture / speech** | | | | | | | |
| multi_ted_test_he | 25.9 | 23.1 | 20.6 | 19.0 | 15.8 | 15.3 | 14.1 |
| multi_ted_test_common | 34.3 | 30.9 | 28.1 | 25.8 | 21.5 | 16.9 | 16.0 |
| multi_ted_val | 34.6 | 31.5 | 29.4 | 27.7 | 23.9 | 22.7 | 20.8 |
| voxpopuli_dev | 35.2 | 32.6 | 30.6 | 28.7 | 25.0 | | |
| voxpopuli_test | 36.3 | 34.1 | 31.7 | 30.1 | 26.4 | | |
| **Finance** | | | | | | | |
| kensho | 21.3 | 18.8 | 15.3 | 13.8 | 10.0 | | |
| **In the wild** | | | | | | | |
| common_voice_val | 37.8 | 35.1 | 31.2 | 28.8 | 25.3 | 20.8 | 20.8 |
| common_voice_test | 42.2 | 39.5 | 35.9 | 33.5 | 30.1 | 22.2 | 24 |
| **VOIP / calls** | | | | | | | |
| voip_test | 32.7 | 31.7 | 23.7 | 23.7 | 21.2 | 19.7 | 18.3 |
| **Dialects** | | | | | | | |
| uk_dialects_midlands_english_female | 26.0 | 23.1 | 21.3 | 19.6 | 13.6 | 9.6 | 8.4 |
| uk_dialects_southern_english_female | 26.7 | 23.6 | 20.9 | 18.9 | 14.2 | 10.8 | 9.3 |
| uk_dialects_welsh_english_female | 25.6 | 22.6 | 19.8 | 18.3 | 14.2 | 20.5 | 10.5 |
| uk_dialects_southern_english_male | 27.7 | 24.7 | 22.2 | 20.0 | 15.0 | 11.5 | 10.6 |
| uk_dialects_welsh_english_male | 27.8 | 25.3 | 22.6 | 20.5 | 16.6 | 12.1 | |
| uk_dialects_northern_english_male | 31.3 | 28.2 | 24.8 | 23.0 | 17.2 | 15.5 | 11.7 |
| uk_dialects_scottish_english_male | 32.0 | 28.8 | 25.1 | 23.2 | 17.8 | 10 | 11.3 |
| uk_dialects_midlands_english_male | 33.1 | 30.2 | 26.5 | 24.3 | 18.0 | 11.8 | 10.3 |
| uk_dialects_northern_english_female | 33.2 | 30.1 | 26.6 | 24.3 | 19.3 | 15 | 12.7 |
| uk_dialects_scottish_english_female | 31.3 | 28.6 | 25.4 | 23.5 | 18.6 | 13.5 | 12.6 |
| uk_dialects_irish_english_male | 42.7 | 40.2 | 36.8 | 34.1 | 29.3 | 25.5 | 21.9 |
| nsc_val_manifest_part1 | | | | | | | |
| **Far-field / very noisy** | | | | | | | |
| voices_rm2_clo_none_stu | 25.6 | 22.4 | 19.7 | 17.5 | 14.2 | 21.5 | 27 |
| voices_rm2_far_none_lav | 41.5 | 37.2 | 32.1 | 29.0 | 25.7 | 27.5 | 42.3 |
| voices_rm4_far_none_stu | 46.1 | 41.4 | 36.5 | 33.1 | 30.1 | 43.2 | 43.2 |
| voices_rm3_clo_none_stu | 43.2 | 38.9 | 35.0 | 32.1 | 28.9 | 28.6 | 40.8 |
| voices_rm2_far_musi_stu | 46.0 | 41.6 | 37.0 | 33.6 | 30.3 | 30.6 | 42.4 |
| voices_rm2_far_babb_stu | 50.6 | 46.3 | 41.0 | 37.9 | 34.7 | 38.5 | 48.2 |
| voices_rm3_clo_musi_stu | 55.1 | 51.0 | 47.6 | 44.7 | 41.7 | 38.1 | 51.8 |
| voices_rm4_ceo_none_lav | 60.7 | 56.2 | 52.4 | 49.0 | 45.3 | 42.9 | 52.5 |
| voices_rm3_far_none_stu | 82.0 | 79.5 | 76.4 | 73.8 | 71.8 | 68.8 | 81.6 |
| Dataset | xsmall_q EE | xsmall EE | small_q EE | small EE | large EE | Google Video Premium | Google Phone Premium |
|---|---|---|---|---|---|---|---|
| **AudioBooks / narration** | | | | | | | |
| lj | 6.8 | 6.3 | 5.9 | 5.6 | 5.4 | | |
| librispeech_test_clean | 9.6 | 8.3 | 7.7 | 7.0 | 5.9 | 7.8 | 8.7 |
| librispeech_val | 15.0 | 13.2 | 12.4 | 11.2 | 9.7 | 11.3 | 13.1 |
| librispeech_test_other | 21.7 | 19.2 | 17.9 | 16.5 | 15.1 | 16.2 | 19.1 |
| aru | 13.7 | 11.7 | 11.0 | 9.7 | 8.2 | 16.2 | 19.1 |
| mls_test | 24.4 | 22.1 | 20.9 | 19.3 | 17.9 | | |
| mls_dev | 22.0 | 19.8 | 18.5 | 17.2 | 15.8 | | |
| **Lecture / speech** | | | | | | | |
| multi_ted_test_he | 19.0 | 16.6 | 14.8 | 14.1 | 12.1 | 15.3 | 14.1 |
| multi_ted_test_common | 28.1 | 24.9 | 22.7 | 21.1 | 18.3 | 16.9 | 16.0 |
| multi_ted_val | 29.3 | 26.2 | 24.8 | 23.2 | 21.0 | 22.7 | 20.8 |
| voxpopuli_dev | 25.7 | 24.4 | 23.5 | 22.4 | 20.8 | | |
| voxpopuli_test | 26.1 | 25.0 | 24.1 | 23.0 | 21.4 | | |
| **Finance** | | | | | | | |
| kensho | 14.0 | 12.3 | 10.6 | 9.7 | 8.1 | | |
| **In the wild** | | | | | | | |
| common_voice_val | 25.7 | 24.0 | 21.4 | 20.1 | 18.5 | 20.8 | 20.8 |
| common_voice_test | 30.9 | 29.0 | 26.4 | 24.9 | 23.3 | 22.2 | 24 |
| **VOIP / calls** | | | | | | | |
| voip_test | 29.1 | 29.0 | 24.0 | 23.6 | 21.0 | 19.7 | 18.3 |
| **Dialects** | | | | | | | |
| uk_dialects_midlands_english_female | 15.5 | 13.8 | 12.5 | 10.8 | 8.8 | 9.6 | 8.4 |
| uk_dialects_southern_english_female | 16.4 | 14.7 | 13.1 | 11.9 | 9.9 | 10.8 | 9.3 |
| uk_dialects_welsh_english_female | 15.8 | 14.3 | 12.0 | 12.8 | 10.7 | 20.5 | 10.5 |
| uk_dialects_southern_english_male | 17.6 | 15.7 | 14.1 | 12.9 | 10.6 | 11.5 | 10.6 |
| uk_dialects_welsh_english_male | 17.9 | 16.4 | 14.5 | 13.9 | 12.1 | 12.1 | |
| uk_dialects_northern_english_male | 19.8 | 17.9 | 15.7 | 14.6 | 12.0 | 15.5 | 11.7 |
| uk_dialects_scottish_english_male | 20.5 | 18.4 | 15.9 | 14.9 | 12.7 | 10 | 11.3 |
| uk_dialects_midlands_english_male | 22.6 | 20.2 | 17.6 | 16.0 | 12.2 | 11.8 | 10.3 |
| uk_dialects_northern_english_female | 21.1 | 18.9 | 16.3 | 15.8 | 13.4 | 15 | 12.7 |
| uk_dialects_scottish_english_female | 20.1 | 18.2 | 16.5 | 15.2 | 12.8 | 13.5 | 12.6 |
| uk_dialects_irish_english_male | 31.4 | 29.6 | 28.1 | 26.3 | 23.7 | 25.5 | 21.9 |
| nsc_val_manifest_part1 | | | 10.0 | 9.3 | 8.3 | | |
| **Far-field / very noisy** | | | | | | | |
| voices_rm2_clo_none_stu_manifest | 18.5 | 15.9 | 14.2 | 12.6 | 11.2 | 21.5 | 27 |
| voices_rm2_far_none_lav_manifest | 34.3 | 29.7 | 25.4 | 22.8 | 21.5 | 27.5 | 42.3 |
| voices_rm4_far_none_stu_manifest | 39.5 | 34.4 | 28.6 | 26.2 | 24.7 | 43.2 | 43.2 |
| voices_rm3_clo_none_stu_manifest | 36.8 | 32.1 | 41.9 | 39.2 | 37.9 | 28.6 | 40.8 |
| voices_rm2_far_musi_stu_manifest | 39.1 | 34.3 | 30.2 | 27.5 | 26.1 | 30.6 | 42.4 |
| voices_rm2_far_babb_stu_manifest | 44.8 | 39.3 | 34.6 | 31.8 | 30.9 | 38.5 | 48.2 |
| voices_rm3_clo_musi_stu_manifest | 49.8 | 45.1 | 29.8 | 27.0 | 26.0 | 38.1 | 51.8 |
| voices_rm4_ceo_none_lav_manifest | 56.3 | 50.7 | 46.9 | 43.7 | 41.3 | 42.9 | 52.5 |
| voices_rm3_far_none_stu_manifest | 80.9 | 78.0 | 74.2 | 71.5 | 70.0 | 68.8 | 81.6 |

EN V4

Google tests were run in early September 2020. EN V4 metrics updated in June 2021.

| Dataset | Silero CE | Google Video Premium | Google Phone Premium |
|---|---|---|---|
| **AudioBooks / narration** | | | |
| lj | 6.6 | | |
| librispeech_test_clean | 6.8 | 7.8 | 8.7 |
| librispeech_val | 11.7 | 11.3 | 13.1 |
| librispeech_test_other | 17.5 | 16.2 | 19.1 |
| aru | 10.6 | 16.2 | 19.1 |
| mls_test | 20.6 | | |
| mls_dev | 18.7 | | |
| **Lecture / speech** | | | |
| multi_ted_test_he | 12.2 | 15.3 | 14.1 |
| multi_ted_test_common | 17.4 | 16.9 | 16 |
| multi_ted_val | 20.4 | 22.7 | 20.8 |
| voxpopuli_dev | 21.2 | | |
| voxpopuli_test | 22.6 | | |
| **Finance** | | | |
| kensho | 6.5 | | |
| **In the wild** | | | |
| common_voice_val | 21.6 | 20.8 | 20.8 |
| common_voice_test | 26.4 | 22.2 | 24 |
| **VOIP / calls** | | | |
| voip_test | 21.2 | 19.7 | 18.3 |
| **Dialects** | | | |
| uk_dialects_midlands_english_female | 10.8 | 9.6 | 8.4 |
| uk_dialects_southern_english_female | 11.8 | 10.8 | 9.3 |
| uk_dialects_welsh_english_female | 12.2 | 20.5 | 10.5 |
| uk_dialects_southern_english_male | 12.6 | 11.5 | 10.6 |
| uk_dialects_welsh_english_male | 14.1 | 12.1 | |
| uk_dialects_northern_english_male | 14.0 | 15.5 | 11.7 |
| uk_dialects_scottish_english_male | 15.1 | 10 | 11.3 |
| uk_dialects_midlands_english_male | 13.7 | 11.8 | 10.3 |
| uk_dialects_northern_english_female | 16.0 | 15 | 12.7 |
| uk_dialects_scottish_english_female | 15.8 | 13.5 | 12.6 |
| uk_dialects_irish_english_male | 25.8 | 25.5 | 21.9 |
| **Far-field / very noisy** | | | |
| voices_rm2_clo_none_stu | 13.7 | 21.5 | 27 |
| voices_rm2_far_none_lav | 25.0 | 27.5 | 42.3 |
| voices_rm4_far_none_stu | 30.0 | 43.2 | 43.2 |
| voices_rm3_clo_none_stu | 28.0 | 28.6 | 40.8 |
| voices_rm2_far_musi_stu | 29.7 | 30.6 | 42.4 |
| voices_rm2_far_babb_stu | 34.7 | 38.5 | 48.2 |
| voices_rm3_clo_musi_stu | 41.3 | 38.1 | 51.8 |
| voices_rm4_ceo_none_lav | 44.5 | 42.9 | 52.5 |
| voices_rm3_far_none_stu | 70.7 | 68.8 | 81.6 |
| Dataset | Silero EE | Google Video Premium | Google Phone Premium |
|---|---|---|---|
| **AudioBooks / narration** | | | |
| lj | 5.6 | | |
| librispeech_test_clean | 6.1 | 7.8 | 8.7 |
| librispeech_val | 10.0 | 11.3 | 13.1 |
| librispeech_test_other | 15.2 | 16.2 | 19.1 |
| aru | 8.0 | 16.2 | 19.1 |
| mls_test | 17.1 | | |
| mls_dev | 15.3 | | |
| **Lecture / speech** | | | |
| multi_ted_test_he | 10.2 | 15.3 | 14.1 |
| multi_ted_test_common | 15.5 | 16.9 | 16 |
| multi_ted_val | 18.4 | 22.7 | 20.8 |
| voxpopuli_dev | 19.4 | | |
| voxpopuli_test | 20.5 | | |
| **Finance** | | | |
| kensho | 5.9 | | |
| **In the wild** | | | |
| common_voice_val | 15.8 | 20.8 | 20.8 |
| common_voice_test | 20.3 | 22.2 | 24 |
| **VOIP / calls** | | | |
| voip_test | 18.7 | 19.7 | 18.3 |
| **Dialects** | | | |
| uk_dialects_midlands_english_female | 7.8 | 9.6 | 8.4 |
| uk_dialects_southern_english_female | 8.3 | 10.8 | 9.3 |
| uk_dialects_welsh_english_female | 8.9 | 20.5 | 10.5 |
| uk_dialects_southern_english_male | 9.2 | 11.5 | 10.6 |
| uk_dialects_welsh_english_male | 10.9 | 12.1 | |
| uk_dialects_northern_english_male | 10.0 | 15.5 | 11.7 |
| uk_dialects_scottish_english_male | 11.1 | 10 | 11.3 |
| uk_dialects_midlands_english_male | 9.8 | 11.8 | 10.3 |
| uk_dialects_northern_english_female | 11.3 | 15 | 12.7 |
| uk_dialects_scottish_english_female | 11.7 | 13.5 | 12.6 |
| uk_dialects_irish_english_male | 21.2 | 25.5 | 21.9 |
| **Far-field / very noisy** | | | |
| voices_rm2_clo_none_stu | 10.8 | 21.5 | 27 |
| voices_rm2_far_none_lav | 21.1 | 27.5 | 42.3 |
| voices_rm4_far_none_stu | 26.0 | 43.2 | 43.2 |
| voices_rm3_clo_none_stu | 24.1 | 28.6 | 40.8 |
| voices_rm2_far_musi_stu | 26.1 | 30.6 | 42.4 |
| voices_rm2_far_babb_stu | 31.5 | 38.5 | 48.2 |
| voices_rm3_clo_musi_stu | 38 | 38.1 | 51.8 |
| voices_rm4_ceo_none_lav | 41.3 | 42.9 | 52.5 |
| voices_rm3_far_none_stu | 69.3 | 68.8 | 81.6 |

EN V5

Google tests were run in early September 2020. EN V5 metrics updated in September 2021.

| Dataset | xsmall_q CE | xsmall CE | xlarge CE | Google Video Premium | Google Phone Premium |
|---|---|---|---|---|---|
| **AudioBooks / narration** | | | | | |
| lj | 9.2 | 8.4 | 5.9 | | |
| librispeech_test_clean | 11.6 | 10.2 | 6.1 | 7.8 | 8.7 |
| librispeech_val | 17.7 | 15.9 | 10.3 | 11.3 | 13.1 |
| librispeech_test_other | 24 | 22.2 | 15.7 | 16.2 | 19.1 |
| aru | 17.8 | 15.4 | 9.3 | 16.2 | 19.1 |
| mls_test | 26.2 | 23.9 | 17.9 | | |
| mls_dev | 23.9 | 21.8 | 16.1 | | |
| **Lecture / speech** | | | | | |
| multi_ted_test_he | 18.3 | 16.7 | 10.3 | 15.3 | 14.1 |
| multi_ted_test_common | 25.4 | 23.2 | 16.1 | 16.9 | 16 |
| multi_ted_val | 27.2 | 25.5 | 18.8 | 22.7 | 20.8 |
| voxpopuli_dev | 22.8 | 21.4 | 17.2 | | |
| voxpopuli_test | 23.3 | 22.3 | 17.9 | | |
| **Finance** | | | | | |
| kensho | 10.5 | 9.3 | 4.7 | | |
| **In the wild** | | | | | |
| common_voice_val | 28.5 | 26.3 | 20.2 | 20.8 | 20.8 |
| common_voice_test | 33.2 | 30.9 | 24.6 | 22.2 | 24 |
| gigaspeech_test | 30.5 | 28.6 | 22.4 | | |
| **VOIP / calls** | | | | | |
| voip_test | 19.4 | 19.5 | 18.3 | 19.7 | 18.3 |
| **Dialects** | | | | | |
| uk_dialects_midlands_english_female | 19.3 | 17.2 | 9.1 | 9.6 | 8.4 |
| uk_dialects_southern_english_female | 19.6 | 17.5 | 11.2 | 10.8 | 9.3 |
| uk_dialects_welsh_english_female | 18.7 | 16.6 | 11.9 | 20.5 | 10.5 |
| uk_dialects_southern_english_male | 20.2 | 18.5 | 11.7 | 11.5 | 10.6 |
| uk_dialects_welsh_english_male | 20.6 | 18.9 | 13.4 | 12.1 | |
| uk_dialects_northern_english_male | 23.7 | 21.1 | 12.9 | 15.5 | 11.7 |
| uk_dialects_scottish_english_male | 23 | 21 | 14.5 | 10 | 11.3 |
| uk_dialects_midlands_english_male | 24.6 | 23.2 | 13.1 | 11.8 | 10.3 |
| uk_dialects_northern_english_female | 24.4 | 22.3 | 15.7 | 15 | 12.7 |
| uk_dialects_scottish_english_female | 23.5 | 21.7 | 15.2 | 13.5 | 12.6 |
| uk_dialects_irish_english_male | 35.6 | 33.6 | 25.3 | 25.5 | 21.9 |
| **Far-field / very noisy** | | | | | |
| voices_rm2_clo_none_stu | 20.7 | 18.2 | 11.5 | 21.5 | 27 |
| voices_rm2_far_none_lav | 34 | 30.7 | 22 | 27.5 | 42.3 |
| voices_rm4_far_none_stu | 37.9 | 34.3 | 26 | 43.2 | 43.2 |
| voices_rm3_clo_none_stu | 36.5 | 33.4 | 25.1 | 28.6 | 40.8 |
| voices_rm2_far_musi_stu | 39 | 35.8 | 26.5 | 30.6 | 42.4 |
| voices_rm2_far_babb_stu | 44.5 | 41.2 | 30.8 | 38.5 | 48.2 |
| voices_rm3_clo_musi_stu | 49.5 | 46.6 | 38.5 | 38.1 | 51.8 |
| voices_rm4_ceo_none_lav | 54.3 | 50.9 | 40.6 | 42.9 | 52.5 |
| voices_rm3_far_none_stu | 76 | 74.5 | 69.9 | 68.8 | 81.6 |
| Dataset | xsmall_q EE | xsmall EE | xlarge EE | Google Video Premium | Google Phone Premium |
|---|---|---|---|---|---|
| **AudioBooks / narration** | | | | | |
| lj | 6.1 | 5.8 | 5.1 | | |
| librispeech_test_clean | 8.3 | 7.5 | 5.5 | 7.8 | 8.7 |
| librispeech_val | 12.8 | 11.9 | 8.8 | 11.3 | 13.1 |
| librispeech_test_other | 18.6 | 17.3 | 13.5 | 16.2 | 19.1 |
| aru | 11.6 | 10.3 | 7 | 16.2 | 19.1 |
| mls_test | 20.1 | 18.5 | 14.8 | | |
| mls_dev | 18 | 16.6 | 13.3 | | |
| **Lecture / speech** | | | | | |
| multi_ted_test_he | 12.9 | 11.9 | 8.4 | 15.3 | 14.1 |
| multi_ted_test_common | 20.2 | 18.6 | 14 | 16.9 | 16 |
| multi_ted_val | 22.4 | 21.1 | 16.9 | 22.7 | 20.8 |
| voxpopuli_dev | 18.6 | 17.9 | 16 | | |
| voxpopuli_test | 18.9 | 18.3 | 16.4 | | |
| **Finance** | | | | | |
| kensho | 7.5 | 6.7 | 4.3 | | |
| **In the wild** | | | | | |
| common_voice_val | 19.9 | 18.5 | 15 | 20.8 | 20.8 |
| common_voice_test | 24.4 | 22.9 | 19.2 | 22.2 | 24 |
| gigaspeech_test | 26.2 | 24.5 | 20.7 | | |
| **VOIP / calls** | | | | | |
| voip_test | 18.7 | 20.2 | 18.3 | 19.7 | 18.3 |
| **Dialects** | | | | | |
| uk_dialects_midlands_english_female | 11.9 | 10.8 | 6.8 | 9.6 | 8.4 |
| uk_dialects_southern_english_female | 12.4 | 11.2 | 8 | 10.8 | 9.3 |
| uk_dialects_welsh_english_female | 12.5 | 11.3 | 8.7 | 20.5 | 10.5 |
| uk_dialects_southern_english_male | 13.3 | 12.2 | 8.6 | 11.5 | 10.6 |
| uk_dialects_welsh_english_male | 13.8 | 12.9 | 10.1 | 12.1 | |
| uk_dialects_northern_english_male | 14.9 | 13.8 | 9.6 | 15.5 | 11.7 |
| uk_dialects_scottish_english_male | 14.9 | 13.8 | 10.6 | 10 | 11.3 |
| uk_dialects_midlands_english_male | 15.5 | 14.5 | 9.2 | 11.8 | 10.3 |
| uk_dialects_northern_english_female | 15.9 | 14.9 | 11.2 | 15 | 12.7 |
| uk_dialects_scottish_english_female | 15.5 | 14.5 | 11.3 | 13.5 | 12.6 |
| uk_dialects_irish_english_male | 26.5 | 25.1 | 20.3 | 25.5 | 21.9 |
| **Far-field / very noisy** | | | | | |
| voices_rm2_clo_none_stu | 15 | 13.4 | 9.3 | 21.5 | 27 |
| voices_rm2_far_none_lav | 27.4 | 24.7 | 18.6 | 27.5 | 42.3 |
| voices_rm4_far_none_stu | 31.3 | 28.4 | 22.5 | 43.2 | 43.2 |
| voices_rm3_clo_none_stu | 30 | 27.7 | 21.7 | 28.6 | 40.8 |
| voices_rm2_far_musi_stu | 32.5 | 29.7 | 23.1 | 30.6 | 42.4 |
| voices_rm2_far_babb_stu | 38.1 | 35.1 | 27.5 | 38.5 | 48.2 |
| voices_rm3_clo_musi_stu | 44 | 41.5 | 35.3 | 38.1 | 51.8 |
| voices_rm4_ceo_none_lav | 48.9 | 45.8 | 37.4 | 42.9 | 52.5 |
| voices_rm3_far_none_stu | 73.7 | 72.2 | 68.3 | 68.8 | 81.6 |

EN V6

Google tests were run in early September 2020. EN V6 metrics updated in February 2022.

| Dataset | small CE | xlarge CE | Google Video Premium | Google Phone Premium |
|---|---|---|---|---|
| **AudioBooks / narration** | | | | |
| lj | 7.7 | 5.8 | | |
| librispeech_test_clean | 10.0 | 6.1 | 7.8 | 8.7 |
| librispeech_val | 15.5 | 10.4 | 11.3 | 13.1 |
| librispeech_test_other | 21.9 | 15.7 | 16.2 | 19.1 |
| aru | 16.1 | 9.6 | 16.2 | 19.1 |
| mls_test | 23.1 | 17.6 | | |
| mls_dev | 21.1 | 15.9 | | |
| **Lecture / speech** | | | | |
| multi_ted_test_he | 15.7 | 9.9 | 15.3 | 14.1 |
| multi_ted_test_common | 22.5 | 16.0 | 16.9 | 16 |
| multi_ted_val | 23.9 | 18.5 | 22.7 | 20.8 |
| voxpopuli_dev | 21.0 | 16.8 | | |
| voxpopuli_test | 21.9 | 17.4 | | |
| **Finance** | | | | |
| kensho | 8.4 | 4.6 | | |
| **In the wild** | | | | |
| common_voice_val | 25.9 | 19.9 | 20.8 | 20.8 |
| common_voice_test | 30.4 | 24.4 | 22.2 | 24 |
| gigaspeech_test | 27.5 | 22.1 | | |
| gigaspeech_2s_test | 26.1 | 20.5 | | |
| fluent_ai_speech_commands | 23.6 | 18.6 | | |
| speech_commands | 17.0 | 15.1 | | |
| **VOIP / calls** | | | | |
| voip_test | 19.7 | 17.5 | 19.7 | 18.3 |
| voip_val | 19.3 | 17.8 | | |
| vystadial_dev | 9.3 | 6.1 | | |
| vystadial_test | 9.1 | 5.6 | | |
| vystadial_train | 9.1 | 6.1 | | |
| **Dialects** | | | | |
| uk_dialects | 19.3 | 13.0 | | |
| uk_dialects_midlands_english_female | 16.7 | 8.7 | 9.6 | 8.4 |
| uk_dialects_southern_english_female | 17.4 | 11.3 | 10.8 | 9.3 |
| uk_dialects_welsh_english_female | 16.5 | 11.8 | 20.5 | 10.5 |
| uk_dialects_southern_english_male | 18.2 | 11.9 | 11.5 | 10.6 |
| uk_dialects_welsh_english_male | 18.7 | 13.3 | 12.1 | |
| uk_dialects_northern_english_male | 20.6 | 13.1 | 15.5 | 11.7 |
| uk_dialects_scottish_english_male | 20.8 | 14.7 | 10 | 11.3 |
| uk_dialects_midlands_english_male | 22.4 | 13.4 | 11.8 | 10.3 |
| uk_dialects_northern_english_female | 21.8 | 15.7 | 15 | 12.7 |
| uk_dialects_scottish_english_female | 21.5 | 15.5 | 13.5 | 12.6 |
| uk_dialects_irish_english_male | 33.4 | 25.9 | 25.5 | 21.9 |
| cmu_arctic_val | 10.5 | 6.2 | | |
| l2arctic_arabic | 30.1 | 24.2 | | |
| l2arctic_chinese | 34.1 | 27.5 | | |
| l2arctic_hindi | 19.1 | 14.0 | | |
| l2arctic_korean | 23.9 | 17.6 | | |
| l2arctic_spanish | 28.7 | 22.6 | | |
| l2arctic_vietnamese | 39.4 | 33.8 | | |
| **Far-field / very noisy** | | | | |
| voices_rm2_clo_none_stu | 17.2 | 11.1 | 21.5 | 27 |
| voices_rm2_far_none_lav | 30.5 | 21.4 | 27.5 | 42.3 |
| voices_rm4_far_none_stu | 34.3 | 25.5 | 43.2 | 43.2 |
| voices_rm3_clo_none_stu | 32.3 | 24.1 | 28.6 | 40.8 |
| voices_rm2_far_musi_stu | 35.3 | 25.7 | 30.6 | 42.4 |
| voices_rm2_far_babb_stu | 42.1 | 31.0 | 38.5 | 48.2 |
| voices_rm3_clo_musi_stu | 45.2 | 36.7 | 38.1 | 51.8 |
| voices_rm4_ceo_none_lav | 48.9 | 38.9 | 42.9 | 52.5 |
| voices_rm3_far_none_stu | 73.8 | 65.5 | 68.8 | 81.6 |
| Dataset | small EE | xlarge EE | Google Video Premium | Google Phone Premium |
|---|---|---|---|---|
| **AudioBooks / narration** | | | | |
| lj | 5.7 | 5.0 | | |
| librispeech_test_clean | 7.5 | 5.4 | 7.8 | 8.7 |
| librispeech_val | 11.6 | 8.8 | 11.3 | 13.1 |
| librispeech_test_other | 17.3 | 13.6 | 16.2 | 19.1 |
| aru | 10.6 | 7.2 | 16.2 | 19.1 |
| mls_test | 18.3 | 14.8 | | |
| mls_dev | 16.6 | 13.4 | | |
| **Lecture / speech** | | | | |
| multi_ted_test_he | 11.3 | 8.4 | 15.3 | 14.1 |
| multi_ted_test_common | 17.7 | 13.9 | 16.9 | 16 |
| multi_ted_val | 20.6 | 16.8 | 22.7 | 20.8 |
| voxpopuli_dev | 17.8 | 15.8 | | |
| voxpopuli_test | 18.3 | 16.2 | | |
| **Finance** | | | | |
| kensho | 6.3 | 4.3 | | |
| **In the wild** | | | | |
| common_voice_val | 18.3 | 14.9 | 20.8 | 20.8 |
| common_voice_test | 22.6 | 19.1 | 22.2 | 24 |
| gigaspeech_test | 23.6 | 20.6 | | |
| gigaspeech_2s_test | 22.1 | 19.1 | | |
| fluent_ai_speech_commands | 17.2 | 15.3 | | |
| speech_commands | 16.6 | 12.0 | | |
| **VOIP / calls** | | | | |
| voip_test | 19.6 | 18.2 | 19.7 | 18.3 |
| voip_val | 18.4 | 18.3 | | |
| vystadial_dev | 8.2 | 6.1 | | |
| vystadial_test | 8.3 | 5.8 | | |
| vystadial_train | 8.7 | 6.0 | | |
| **Dialects** | | | | |
| uk_dialects | 13.1 | 9.7 | | |
| uk_dialects_midlands_english_female | 10.1 | 6.3 | 9.6 | 8.4 |
| uk_dialects_southern_english_female | 11.6 | 8.2 | 10.8 | 9.3 |
| uk_dialects_welsh_english_female | 11.4 | 8.8 | 20.5 | 10.5 |
| uk_dialects_southern_english_male | 12.2 | 8.8 | 11.5 | 10.6 |
| uk_dialects_welsh_english_male | 13.1 | 10.2 | 12.1 | |
| uk_dialects_northern_english_male | 13.7 | 9.8 | 15.5 | 11.7 |
| uk_dialects_scottish_english_male | 14.2 | 10.9 | 10 | 11.3 |
| uk_dialects_midlands_english_male | 14.6 | 9.2 | 11.8 | 10.3 |
| uk_dialects_northern_english_female | 15.2 | 11.3 | 15 | 12.7 |
| uk_dialects_scottish_english_female | 14.8 | 11.5 | 13.5 | 12.6 |
| uk_dialects_irish_english_male | 25.1 | 21.2 | 25.5 | 21.9 |
| cmu_arctic_val | 7.6 | 5.1 | | |
| l2arctic_arabic | 23.1 | 19.4 | | |
| l2arctic_chinese | 26.8 | 22.4 | | |
| l2arctic_hindi | 14.1 | 11.3 | | |
| l2arctic_korean | 17.5 | 13.9 | | |
| l2arctic_spanish | 22.1 | 18.5 | | |
| l2arctic_vietnamese | 32.1 | 28.3 | | |
| **Far-field / very noisy** | | | | |
| voices_rm2_clo_none_stu | 13.1 | 9.3 | 21.5 | 27 |
| voices_rm2_far_none_lav | 25.2 | 18.5 | 27.5 | 42.3 |
| voices_rm4_far_none_stu | 29.0 | 22.4 | 43.2 | 43.2 |
| voices_rm3_clo_none_stu | 27.2 | 21.1 | 28.6 | 40.8 |
| voices_rm2_far_musi_stu | 30.0 | 22.6 | 30.6 | 42.4 |
| voices_rm2_far_babb_stu | 37.1 | 28.0 | 38.5 | 48.2 |
| voices_rm3_clo_musi_stu | 40.7 | 33.7 | 38.1 | 51.8 |
| voices_rm4_ceo_none_lav | 44.5 | 36.0 | 42.9 | 52.5 |
| voices_rm3_far_none_stu | 71.8 | 63.9 | 68.8 | 81.6 |

DE V1

All of these tests were run in early September 2020.

At the time of this test, there was no premium Google model available for German. There were several models for several regions, but the differences were minor, so we chose the default German model.

| Dataset | CE | EE | Google |
|---|---|---|---|
| **AudioBooks** | | | |
| de_caito_manifest_val | 12.5 | 8.7 | 19.5 |
| **Narration** | | | |
| de_voxforge_manifest_val | 3.8 | 2.3 | 5.9 |
| **In the wild** | | | |
| de_common_voice_test_manifest | 28.0 | 17.6 | 16.1 |
| de_common_voice_val_manifest | 24.9 | 15.0 | 14.0 |
| de_telekinect_dev_manifest | 28.1 | 18.6 | 13.5 |
| de_telekinect_test_manifest | 28.3 | 19.4 | 15.7 |

DE V3

All of these tests were run in early September 2020.

At the time of this test, there was no premium Google model available for German. There were several models for several regions, but the differences were minor, so we chose the default German model.

| Dataset | CE | EE | Google |
|---|---|---|---|
| **Books** | | | |
| de_mls_test | 19.5 | 15.0 | N/A |
| de_mls_val | 16.6 | 12.7 | N/A |
| **Narration** | | | |
| de_voxforge_manifest_val | 7.4 | 5.2 | 5.9 |
| **Public speech** | | | |
| de_voxpopuli_dev | 27.0 | 24.6 | N/A |
| de_voxpopuli_test | 25.0 | 22.8 | N/A |
| **In the wild** | | | |
| de_common_voice_test_manifest | 21.0 | 14.3 | 16.1 |
| de_common_voice_val_manifest | 18.8 | 12.5 | 14.0 |
| de_telekinect_dev_manifest | 16.6 | 11.6 | 13.5 |
| de_telekinect_test_manifest | 17.3 | 12.1 | 15.7 |

DE V4

Google tests were run in early September 2020.

At the time of this test, there was no premium Google model available for German. There were several models for several regions, but the differences were minor, so we chose the default German model.

| Dataset | CE | EE | Google |
|---|---|---|---|
| **Books** | | | |
| de_mls_test | 16.3 | 12.8 | N/A |
| de_mls_val | 13.3 | 10.5 | N/A |
| **Narration** | | | |
| de_voxforge_val | 5.8 | 4.4 | 5.9 |
| **Public speech** | | | |
| de_voxpopuli_dev | 26.3 | 23.8 | N/A |
| de_voxpopuli_test | 24 | 21.6 | N/A |
| **In the wild** | | | |
| de_common_voice_test | 20.6 | 14.1 | 16.1 |
| de_common_voice_val | 18.4 | 12.3 | 14 |
| de_telekinect_dev | 16.2 | 11.3 | 13.5 |
| de_telekinect_test | 16.4 | 12 | 15.7 |

ES V1

All of these tests were run in early September 2020.

For Spanish, we chose the region (US) where a premium model was available. Judging by the benchmark results, Google relies heavily on the data it sources from Android, most likely due to the large population and lighter regulation. Note that most "dialect" recordings are quite clean; it is the pronunciation that varies.

| Dataset | CE | EE | Google | Google Phone Premium |
|---|---|---|---|---|
| **AudioBooks** | | | | |
| es_caito_val | 7.7 | 5.7 | 20.3 | 22.3 |
| **Narration** | | | | |
| es_voxforge_val | 1.4 | 1.1 | 18.1 | 19.4 |
| **In the wild** | | | | |
| es_common_voice_test | 22.0 | 14.4 | 27.2 | 23.1 |
| es_common_voice_val | 20.1 | 13.0 | 24.5 | 19.6 |
| **Dialects** | | | | |
| es_dialects_argentinian_val | 19.0 | 12.9 | 11.8 | 6.7 |
| es_dialects_chilean_val | 19.8 | 13.7 | 8.9 | 6.6 |
| es_dialects_columbian_val | 18.4 | 11.9 | 7.8 | 5.4 |
| es_dialects_peruvian_val | 14.4 | 9.1 | 6.2 | 4.7 |
| es_dialects_puerto_rico_val | 21.1 | 14.5 | 7.9 | 6.0 |
| es_dialects_venezuela_val | 19.2 | 13.2 | 8.2 | 6.4 |

TTS Models

RU V1

We decided to keep the quality assessment really simple: we generated audio from the validation subsets of our data (~200 files per speaker), shuffled them with the original recordings of the same speakers, and gave the result to a group of 24 assessors to rate the sound quality on a five-point scale. The 8 kHz and 16 kHz scores were collected separately (both for synthesized and original speech). For simplicity we used the following grades: [1, 2, 3, 4-, 4, 4+, 5-, 5]; the higher the quality, the more fine-grained the scale. Then, for each speaker, we simply calculated the mean.

In total, the assessors submitted 37,403 scores. Twelve people annotated the whole dataset; twelve more managed to annotate between 10% and 75% of the audio. For each speaker we calculated the mean score (standard deviation is shown in brackets). We also tried first calculating the median score for each audio clip and then averaging; this only inflates the mean values without affecting the ratios, so in the end we used plain averages. The key metric here is, of course, the ratio between the mean score for synthesis and for the original audio. Some assessors gave much lower scores overall (hence the high dispersion), but we decided to keep all scores as-is without removing outliers.
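
For illustration, the per-speaker aggregation can be sketched as follows; the records and the numeric mapping of the intermediate grades (e.g. 4- to 3.7) are our assumptions for the example, not the exact values used:

```python
from statistics import mean, stdev

# Assumed numeric mapping for the grade scale [1, 2, 3, 4-, 4, 4+, 5-, 5];
# the exact values behind "4-", "4+" etc. are our guess for illustration.
GRADE = {"1": 1.0, "2": 2.0, "3": 3.0, "4-": 3.7, "4": 4.0,
         "4+": 4.3, "5-": 4.7, "5": 5.0}

# Hypothetical (speaker, kind, grade) records; kind is "original" / "synthesis".
scores = [
    ("aidar_8khz", "original", "5"), ("aidar_8khz", "synthesis", "4+"),
    ("aidar_8khz", "original", "4"), ("aidar_8khz", "synthesis", "5-"),
]

def speaker_stats(records, speaker, kind):
    vals = [GRADE[g] for s, k, g in records if s == speaker and k == kind]
    return mean(vals), (stdev(vals) if len(vals) > 1 else 0.0)

orig_mean, orig_std = speaker_stats(scores, "aidar_8khz", "original")
synth_mean, synth_std = speaker_stats(scores, "aidar_8khz", "synthesis")
print(f"ratio: {synth_mean / orig_mean:.1%}")  # synthesis-to-original mean ratio
```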

| Speaker | Original | Synthesis | Ratio | Examples |
|---|---|---|---|---|
| aidar_8khz | 4.67 (.45) | 4.52 (.55) | 96.8% | link |
| baya_8khz | 4.52 (.57) | 4.25 (.76) | 94.0% | link |
| kseniya_8khz | 4.80 (.40) | 4.54 (.60) | 94.5% | link |
| aidar_16khz | 4.72 (.43) | 4.53 (.55) | 95.9% | link |
| baya_16khz | 4.59 (.55) | 4.18 (.76) | 91.1% | link |
| kseniya_16khz | 4.84 (.37) | 4.54 (.59) | 93.9% | link |

We asked our assessors to rate the "naturalness of the speech" (not the audio quality). Nevertheless, we were surprised that, anecdotally, people cannot tell 8 kHz from 16 kHz audio on their everyday devices (which the metrics also confirm). Baya has the lowest absolute and relative scores; Kseniya has the highest absolute scores, and Aidar has the highest relative scores. Baya also shows higher score dispersion.

Manually inspecting clips with high score dispersion reveals several patterns: speaker errors, tacotron errors (pauses), proper names, and hard-to-read words are the most common causes. Predictably, about 75% of such cases occur in synthesized audio, and the sampling rate does not seem to affect this.

We tried to rate "naturalness", but it is only natural to also try estimating "unnaturalness" or "robotness". This can be measured by asking people to choose between two audio clips. We went one step further and essentially applied a double-blind test: we asked our assessors to rate the same audio four times in random order, as original and synthesis at the two sampling rates. For the assessors who annotated the whole dataset we calculated the following table:

| Comparison | Worse | Same | Better |
|---|---|---|---|
| 16k vs 8k, original | 957 | 4811 | 1512 |
| 16k vs 8k, synthesis | 1668 | 4061 | 1551 |
| Original vs synthesis, 8k | 816 | 3697 | 2767 |
| Original vs synthesis, 16k | 674 | 3462 | 3144 |

Several conclusions can be drawn:

  • In 66% of cases people cannot hear the difference between 8 kHz and 16 kHz;
  • With synthesis, 8 kHz helps to hide some errors;
  • In about 60% of cases the synthesis is rated the same as or better than the original;
  • The last two conclusions hold regardless of the sampling rate, with 8 kHz having a slight advantage;
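
For illustration, such a table can be derived from repeated ratings of the same utterance roughly as follows (the records are hypothetical; exact ties count as "same"):

```python
from collections import Counter

# Hypothetical records: one dict per (assessor, utterance), holding the four
# scores given to the same audio in its four variants.
ratings = [
    {"orig_8k": 4.3, "orig_16k": 4.3, "synth_8k": 4.0, "synth_16k": 3.7},
    {"orig_8k": 4.0, "orig_16k": 4.7, "synth_8k": 4.0, "synth_16k": 4.0},
]

def compare(records, a, b):
    """Count how often variant `a` is rated worse / same / better than `b`."""
    counts = Counter()
    for r in records:
        diff = r[a] - r[b]
        counts["same" if diff == 0 else ("better" if diff > 0 else "worse")] += 1
    return counts

print(compare(ratings, "orig_16k", "orig_8k"))    # 16k vs 8k, original
print(compare(ratings, "synth_16k", "synth_8k"))  # 16k vs 8k, synthesis
print(compare(ratings, "synth_8k", "orig_8k"))    # original vs synthesis, 8k
```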

You can judge for yourself how it sounds, both for our unique voices and for speakers from external sources (more audio for each speaker can be synthesized with the Colab notebook in our repo).

TE Models

Contrary to popular trends, we aim to provide metrics that are as detailed, informative, and honest as possible. In this particular case, we used the following datasets for validation:

  • Validation subsets of our private text corpora (5,000 sentences per language);
  • Audiobooks: we use the caito dataset, which has texts in all of the languages the model was trained on (20,000 random sentences per language);

We use the following metrics:

  • WER (word error rate) as a percentage, calculated separately for repunctuation (WER_p, with both sentences lowercased) and for recapitalization (WER_c, with all punctuation marks thrown out); see the sketch after this list;
  • Precision / recall / F1 to check the quality of classification (i) between the space and the punctuation marks mentioned above (.,—!?-) and (ii) for the restoration of capital letters, between three classes: an all-lowercase token, a token starting with a capital, and an all-caps token. We also provide confusion matrices for visualization;
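
A minimal sketch of the two WER variants (the normalization only; `wer` stands for any standard word-error-rate routine, e.g. the one sketched earlier on this page):

```python
import re

PUNCT = ".,—!?-"  # the punctuation marks the model places

def wer_p_inputs(reference: str, hypothesis: str):
    # Repunctuation WER: case is ignored, punctuation is kept.
    return reference.lower(), hypothesis.lower()

def wer_c_inputs(reference: str, hypothesis: str):
    # Recapitalization WER: punctuation is thrown out, case is kept.
    pattern = f"[{re.escape(PUNCT)}]"
    return re.sub(pattern, "", reference), re.sub(pattern, "", hypothesis)

# WER_p = wer(*wer_p_inputs(ref, hyp)); WER_c = wer(*wer_c_inputs(ref, hyp))
```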

Results

For correct and informative metric calculation, the following transformations were applied to the texts beforehand (a sketch follows the list):

  • Punctuation characters other than .,—!?- were removed;
  • Punctuation at the beginning of a sentence was removed;
  • In the case of multiple consecutive punctuation marks, only the first one was kept;
  • For Spanish, ¿ and ¡ were discarded from the model predictions because they do not occur in the book texts, although in general the model places them as well;
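
A regex-based sketch of this normalization (our own illustration, not the exact evaluation code):

```python
import re

KEEP = ".,—!?-"

def normalize(text: str, drop_inverted: bool = False) -> str:
    if drop_inverted:                                        # Spanish-only rule
        text = re.sub(r"[¿¡]", "", text)
    text = re.sub(f"[^\\w\\s{re.escape(KEEP)}]", "", text)   # keep .,—!?- only
    text = re.sub(f"^[{re.escape(KEEP)}\\s]+", "", text)     # strip leading punctuation
    text = re.sub(f"([{re.escape(KEEP)}])[{re.escape(KEEP)}]+", r"\1", text)  # keep first mark of a run
    return text

print(normalize('"Wait... really?!" - he said.'))  # Wait. really? - he said.
```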

EN DE RU ES V2

WER_p / WER_c are specified in the cells below. The baseline metrics are calculated for a naive algorithm that starts the text with a capital letter and ends it with a full stop.

Metrics on Paragraphs

Domain - validation data:

| Languages | en | de | ru | es |
|---|---|---|---|---|
| baseline | 14 / 19 | 13 / 41 | 17 / 20 | 10 / 16 |
| model | 6 / 6 | 5 / 5 | 7 / 7 | 5 / 5 |

Domain - books:

| Languages | en | de | ru | es |
|---|---|---|---|---|
| baseline | 14 / 13 | 15 / 26 | 23 / 14 | 13 / 8 |
| model | 12 / 7 | 11 / 8 | 18 / 10 | 12 / 6 |

Metrics on Sentences

Domain - validation data:

| Languages | en | de | ru | es |
|---|---|---|---|---|
| baseline | 12 / 18 | 10 / 33 | 13 / 12 | 8 / 11 |
| model | 5 / 4 | 5 / 4 | 7 / 4 | 5 / 4 |

Domain - books:

| Languages | en | de | ru | es |
|---|---|---|---|---|
| baseline | 12 / 10 | 12 / 22 | 19 / 9 | 15 / 7 |
| model | 12 / 6 | 10 / 6 | 17 / 7 | 13 / 5 |

EN DE RU ES V1

Metrics on Sentences

WER

WER_p / WER_c are specified in the cells below. The baseline metrics are calculated for a naive algorithm that starts the sentence with a capital letter and ends it with a full stop.

Domain - validation data:

| Languages | en | de | ru | es |
|---|---|---|---|---|
| baseline | 20 / 26 | 13 / 36 | 18 / 17 | 8 / 13 |
| model | 8 / 8 | 7 / 7 | 13 / 6 | 6 / 5 |

Domain - books:

| Languages | en | de | ru | es |
|---|---|---|---|---|
| baseline | 14 / 13 | 13 / 22 | 20 / 11 | 14 / 7 |
| model | 14 / 8 | 11 / 6 | 21 / 7 | 13 / 6 |

Precision / Recall / F1

Domain - validation data:

| Metric | ' ' | . | , | — | ! | ? | - |
|---|---|---|---|---|---|---|---|
| **en** | | | | | | | |
| precision | 0.98 | 0.97 | 0.78 | 0.91 | 0.80 | 0.89 | nan |
| recall | 0.99 | 0.98 | 0.64 | 0.75 | 0.67 | 0.78 | nan |
| f1 | 0.98 | 0.98 | 0.71 | 0.82 | 0.73 | 0.84 | nan |
| **de** | | | | | | | |
| precision | 0.98 | 0.98 | 0.86 | 0.81 | 0.74 | 0.90 | nan |
| recall | 0.99 | 0.99 | 0.68 | 0.60 | 0.70 | 0.71 | nan |
| f1 | 0.99 | 0.98 | 0.76 | 0.69 | 0.72 | 0.79 | nan |
| **ru** | | | | | | | |
| precision | 0.98 | 0.97 | 0.80 | 0.90 | 0.80 | 0.84 | 0 |
| recall | 0.98 | 0.99 | 0.74 | 0.70 | 0.58 | 0.78 | nan |
| f1 | 0.98 | 0.98 | 0.77 | 0.78 | 0.67 | 0.81 | nan |
| **es** | | | | | | | |
| precision | 0.98 | 0.96 | 0.70 | 0.74 | 0.85 | 0.83 | 0 |
| recall | 0.99 | 0.98 | 0.60 | 0.29 | 0.60 | 0.70 | nan |
| f1 | 0.98 | 0.98 | 0.64 | 0.42 | 0.70 | 0.76 | nan |

| Metric | a | A | AAA |
|---|---|---|---|
| **en** | | | |
| precision | 0.98 | 0.94 | 0.97 |
| recall | 0.99 | 0.91 | 0.70 |
| f1 | 0.98 | 0.92 | 0.81 |
| **de** | | | |
| precision | 0.99 | 0.98 | 0.89 |
| recall | 0.99 | 0.98 | 0.53 |
| f1 | 0.99 | 0.98 | 0.66 |
| **ru** | | | |
| precision | 0.99 | 0.96 | 0.99 |
| recall | 0.99 | 0.92 | 0.99 |
| f1 | 0.99 | 0.94 | 0.99 |
| **es** | | | |
| precision | 0.99 | 0.95 | 0.98 |
| recall | 0.99 | 0.90 | 0.82 |
| f1 | 0.99 | 0.92 | 0.89 |

Domain - books:

| Metric | ' ' | . | , | — | ! | ? | - |
|---|---|---|---|---|---|---|---|
| **en** | | | | | | | |
| precision | 0.96 | 0.80 | 0.59 | 0.82 | 0.23 | 0.39 | nan |
| recall | 0.99 | 0.73 | 0.23 | 0.13 | 0.58 | 0.85 | 0 |
| f1 | 0.97 | 0.77 | 0.33 | 0.22 | 0.33 | 0.53 | nan |
| **de** | | | | | | | |
| precision | 0.97 | 0.75 | 0.80 | 0.55 | 0.21 | 0.41 | nan |
| recall | 0.99 | 0.71 | 0.49 | 0.35 | 0.58 | 0.67 | 0 |
| f1 | 0.98 | 0.73 | 0.61 | 0.43 | 0.30 | 0.51 | nan |
| **ru** | | | | | | | |
| precision | 0.97 | 0.77 | 0.69 | 0.90 | 0.17 | 0.49 | 0 |
| recall | 0.98 | 0.60 | 0.55 | 0.61 | 0.68 | 0.75 | nan |
| f1 | 0.98 | 0.68 | 0.61 | 0.72 | 0.28 | 0.60 | nan |
| **es** | | | | | | | |
| precision | 0.96 | 0.57 | 0.59 | 0.96 | 0.30 | 0.24 | nan |
| recall | 0.98 | 0.70 | 0.29 | 0.02 | 0.40 | 0.68 | 0 |
| f1 | 0.97 | 0.63 | 0.38 | 0.04 | 0.34 | 0.36 | nan |

| Metric | a | A | AAA |
|---|---|---|---|
| **en** | | | |
| precision | 0.99 | 0.80 | 0.94 |
| recall | 0.98 | 0.89 | 0.95 |
| f1 | 0.98 | 0.85 | 0.94 |
| **de** | | | |
| precision | 0.99 | 0.90 | 0.77 |
| recall | 0.98 | 0.94 | 0.62 |
| f1 | 0.98 | 0.92 | 0.70 |
| **ru** | | | |
| precision | 0.99 | 0.81 | 0.82 |
| recall | 0.99 | 0.87 | 0.96 |
| f1 | 0.99 | 0.84 | 0.89 |
| **es** | | | |
| precision | 0.99 | 0.71 | 0.45 |
| recall | 0.98 | 0.82 | 0.91 |
| f1 | 0.98 | 0.76 | 0.60 |

As one can see from the tables above, the hyphen columns remained empty even for Russian: on the data used for calculating the metrics, the model preferred either not to place a hyphen at all or to replace it with some other mark. The dash fares better, apparently because it is most often placed in definition-style sentences.