AESRC2020

Introduction

Data preparation scripts and training pipeline for the Interspeech 2020 Accented English Speech Recognition Challenge (AESRC).

Dependencies

  1. Install Kaldi (data preparation scripts, Track2 hybrid ASR model training) Github Link
  2. Install ESPnet (Track1 end-to-end accent recognition model training, Track2 end-to-end ASR Transformer training) Github Link
  3. (Optional) Install Google SentencePiece (builds the sub-word modeling units for the Track2 E2E ASR model) Github Link
  4. (Optional) Install KenLM (n-gram language model training) Github Link
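Before running the recipes it can help to check which of the dependency CLIs are already on PATH. `compute-mfcc-feats` is a Kaldi binary, `spm_train` ships with SentencePiece, and `lmplz` with KenLM; the report file name below is arbitrary.

```shell
# Report which dependency CLI tools are visible on PATH; one line per tool.
: > demo_tools.txt
for tool in compute-mfcc-feats spm_train lmplz; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found:   $tool" >> demo_tools.txt
    else
        echo "missing: $tool" >> demo_tools.txt
    fi
done
cat demo_tools.txt
```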

Usage

Data Preparation

  1. Download challenge data
  2. Run data preparation: split off the CV set, extract features, and train the BPE model: `./local/prepare_data.sh`
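A minimal sketch of the Kaldi-style data directory that `./local/prepare_data.sh` is expected to build; the utterance/speaker IDs and wav paths here are made up for illustration.

```shell
# Build a tiny Kaldi-style data directory with fabricated IDs and paths.
dir=demo_data/train
mkdir -p "$dir"

# wav.scp: <utt-id> <wav-path>
cat > "$dir/wav.scp" <<'EOF'
US-SPK001-0001 /path/to/wav/US-SPK001-0001.wav
US-SPK001-0002 /path/to/wav/US-SPK001-0002.wav
EOF

# text: <utt-id> <transcript>
cat > "$dir/text" <<'EOF'
US-SPK001-0001 THE WEATHER IS NICE TODAY
US-SPK001-0002 PLEASE TURN ON THE LIGHT
EOF

# utt2spk: <utt-id> <spk-id>; spk2utt is its inversion
cat > "$dir/utt2spk" <<'EOF'
US-SPK001-0001 US-SPK001
US-SPK001-0002 US-SPK001
EOF
awk '{u[$2] = u[$2] " " $1} END {for (s in u) print s u[s]}' \
    "$dir/utt2spk" > "$dir/spk2utt"
```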

AR Track

Train the Track1 ESPnet accent recognition (AR) model: `./local/track1_espnet_transformer_train.sh`
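A hypothetical label mapping for the Track1 accent-recognition target: the eight accent codes come from the baseline results below, but the file name and integer IDs are assumptions, not part of the released scripts.

```shell
# Map the eight challenge accents to integer class IDs (made-up file name).
mkdir -p demo_data
i=0
for accent in RU KR US PT JPN UK CHN IND; do
    echo "$accent $i"
    i=$((i + 1))
done > demo_data/accent2id
cat demo_data/accent2id
```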

ASR Track

  1. Train the Track2 Kaldi GMM alignment model: `./local/track2_kaldi_gmm_train.sh`
  2. Generate lattices and the decision tree, then train the Track2 Kaldi chain model: `./local/track2_kaldi_chain_train.sh`
  3. Train the Track2 ESPnet Transformer model (and the Track2 ESPnet RNN language model): `./local/track2_espnet_transformer_train.sh`
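The steps above run in order; a sketch of chaining them that skips any stage whose script is missing from the working copy (the log file name is just for illustration):

```shell
# Run the Track2 stages in order, skipping stages whose script is absent.
: > demo_run.log
for stage in track2_kaldi_gmm_train \
             track2_kaldi_chain_train \
             track2_espnet_transformer_train; do
    script=./local/${stage}.sh
    if [ -x "$script" ]; then
        "$script" && echo "done: $stage" >> demo_run.log
    else
        echo "skip: $stage (script not found)" >> demo_run.log
    fi
done
cat demo_run.log
```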

Notes

  1. No lexicon is provided; please prepare one yourself.
  2. Data augmentation methods are not included in the scripts.
  3. Install Kaldi and ESPnet and activate their environments before running the scripts.
  4. The Track2 baseline experiments cover several ways of using the data.
  5. Participants must strictly follow the challenge rules on data usage.
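For note 1, one common starting point is a CMU-dict-style pronunciation dictionary. A sketch that turns such entries into a Kaldi-style `lexicon.txt` (the sample entries are inlined here, and stripping the stress digits is a design choice, not a challenge requirement):

```shell
# Turn CMU-dict-style entries into a Kaldi-style lexicon.txt: drop comment
# lines, drop "(2)"-style variant markers, strip stress digits, dedupe.
# Sample entries only; a real run would read the full dictionary filtered
# by the training vocabulary.
mkdir -p demo_data
cat > demo_data/cmudict.sample <<'EOF'
;;; comment line
HELLO  HH AH0 L OW1
WORLD  W ER1 L D
WORLD(2)  W ER1 L D
EOF
grep -v '^;;;' demo_data/cmudict.sample \
    | sed -e 's/([0-9]*)//' -e 's/[0-9]//g' \
    | sort -u > demo_data/lexicon.txt
cat demo_data/lexicon.txt
```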

Baseline Experiments Results

Track1

| Model | RU | KR | US | PT | JPN | UK | CHN | IND | AVE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Transformer-3L | 30.0 | 45.0 | 45.7 | 57.2 | 48.5 | 70.0 | 56.2 | 83.5 | 54.1 |
| Transformer-6L | 34.0 | 43.7 | 30.6 | 65.7 | 44.0 | 74.5 | 50.9 | 75.2 | 52.2 |
| Transformer-12L | 49.6 | 26.0 | 21.2 | 51.8 | 42.7 | 85.0 | 38.2 | 66.1 | 47.8 |
| + ASR-init | 75.7 | 55.6 | 60.2 | 85.5 | 73.2 | 93.9 | 67.0 | 97.0 | 76.1 |

Transformer-3L, Transformer-6L, and Transformer-12L all use `./local/track1_espnet_transformer_train.sh` (elayers: 3, 6, 12)

ASR-init uses the Track2 encoder to initialize the self-attention parameters

*On the cv set, we found that accent-recognition accuracy for some accents is strongly speaker-dependent. Since the cv set contains few speakers, the absolute values above are not statistically significant; the test set will contain more speakers.

Track2

Kaldi hybrid chain model: CNN + 18 TDNN layers. *Based on an internal, non-open-source dictionary. *Results with the CMU dictionary are coming soon.

ESPnet Transformer model: 12 encoder + 6 decoder layers (plain self-attention, CTC joint training, 1k sub-word BPE units)

Detailed hyperparameter settings can be found in `./local/files/conf/` and in the training scripts

WER on the cv set:

| System | Data | Decode | RU | KR | US | PT | JPN | UK | CHN | IND | AVE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Kaldi | Accent160 | - | 6.67 | 11.46 | 15.95 | 10.27 | 9.78 | 16.88 | 20.97 | 17.48 | 13.68 |
| Kaldi | Libri960 ~ Accent160 | - | 6.61 | 10.95 | 15.33 | 9.79 | 9.75 | 16.03 | 19.68 | 16.93 | 13.13 |
| Kaldi | Accent160 + Libri160 | - | 6.95 | 11.76 | 13.05 | 9.96 | 10.15 | 14.21 | 20.76 | 18.26 | 13.14 |
| ESPnet | Accent160 | +0.3RNNLM | 5.26 | 7.69 | 9.96 | 7.45 | 6.79 | 10.06 | 11.77 | 10.05 | 8.63 |
| ESPnet | Libri960 ~ Accent160 | +0.3RNNLM | 4.6 | 6.4 | 7.42 | 5.9 | 5.71 | 7.64 | 9.87 | 7.85 | 6.92 |
| ESPnet | Accent160 + Libri160 | - | 5.35 | 9.07 | 8.52 | 7.13 | 7.29 | 8.6 | 12.03 | 9.05 | 8.38 |
| ESPnet | Accent160 + Libri160 | +0.3RNNLM | 4.68 | 7.59 | 7.7 | 6.42 | 6.37 | 7.76 | 10.88 | 8.41 | 7.48 |
| ESPnet | Accent160 + Libri160 | +0.3RNNLM+0.3CTC | 4.76 | 7.81 | 7.71 | 6.36 | 6.4 | 7.23 | 10.77 | 8.01 | 7.38 |
* "Data A ~ Data B" means the model trained on Data A is fine-tuned with Data B
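The AVE column above appears to be the unweighted mean of the eight per-accent WERs; for example, for the Kaldi Accent160 row:

```shell
# Mean of the eight per-accent WERs from the Kaldi Accent160 row;
# prints 13.68, matching the AVE column.
echo "6.67 11.46 15.95 10.27 9.78 16.88 20.97 17.48" \
    | awk '{s = 0; for (i = 1; i <= NF; i++) s += $i; printf "%.2f\n", s / NF}'
```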