Data preparation scripts and training pipeline for the Interspeech 2020 Accented English Speech Recognition Challenge (AESRC).
Requirements
- Install Kaldi (data preparation scripts, Track2 traditional ASR model training) Github Link
- Install ESPnet (Track1 E2E AR model training, Track2 E2E ASR Transformer training) Github Link
- (Optional) Install Google SentencePiece (Track2 E2E ASR modeling unit building) Github Link
- (Optional) Install KenLM (n-gram language model training) Github Link
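If you use KenLM for the optional n-gram language model, the basic commands look roughly like this. This is a minimal sketch: the file paths and the 3-gram order are assumptions, not the challenge baseline settings.

```bash
# Minimal KenLM sketch (assumed paths and order, not the official baseline recipe):
# train a 3-gram LM on transcript text and binarize it for faster loading.
mkdir -p lm
lmplz -o 3 --text data/train/text_only.txt --arpa lm/3gram.arpa
build_binary lm/3gram.arpa lm/3gram.bin
```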
Data Preparation
- Download the challenge data
- Prepare the data: split off the cv set, extract features, and train the BPE model
./local/prepare_data.sh
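If you need to (re)build the sub-word units outside this script, SentencePiece BPE training looks roughly like the sketch below. The paths are assumptions and the 1k vocabulary simply matches the unit count quoted in the Track2 section; check ./local/prepare_data.sh for the actual settings.

```bash
# Sketch of BPE unit training with SentencePiece (paths are assumptions).
mkdir -p data/lang_bpe
spm_train --input=data/train/text_only.txt \
          --model_prefix=data/lang_bpe/bpe1k \
          --vocab_size=1000 \
          --model_type=bpe
# Encode transcripts into sub-word sequences with the trained model.
spm_encode --model=data/lang_bpe/bpe1k.model \
    < data/train/text_only.txt > data/train/text_bpe.txt
```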
AR Track
- Train the Track1 ESPnet AR (accent recognition) model
./local/track1_espnet_transformer_train.sh
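The Track1 baselines reported below differ only in encoder depth (3/6/12 layers). One way to run all three variants is sketched here; the --config option and config file names are assumptions, and the script may instead expect you to edit an ESPnet yaml under ./local/files/conf/.

```bash
# Hypothetical loop over the three encoder depths reported in the Track1 table.
# The --config flag and file names are assumptions; check the script's actual interface.
for nlayers in 3 6 12; do
    ./local/track1_espnet_transformer_train.sh \
        --config conf/train_transformer_${nlayers}L.yaml
done
```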
ASR Track
- Train the Track2 Kaldi GMM alignment model
./local/track2_kaldi_gmm_train.sh
- Generate lattices and the decision tree, then train the Track2 Kaldi chain model
./local/track2_kaldi_chain_train.sh
- Train the Track2 ESPnet Transformer model (and the Track2 ESPnet RNN language model)
./local/track2_espnet_transformer_train.sh
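Run end to end, the Track2 baseline is simply these scripts in order after data preparation. The sketch below passes no extra flags and assumes the scripts read their settings from ./local/files/conf/.

```bash
# Track2 baseline pipeline, run in order after ./local/prepare_data.sh.
./local/track2_kaldi_gmm_train.sh            # GMM-HMM system for alignments
./local/track2_kaldi_chain_train.sh          # lattices, decision tree, chain (TDNN) model
./local/track2_espnet_transformer_train.sh   # E2E Transformer (+ RNN language model)
```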
Notice
- No lexicon is provided; please prepare one yourself (see the sketch after this list).
- Data augmentation methods are not included in the scripts.
- Install Kaldi and ESPnet and activate their environments before running the scripts.
- The Track2 baseline experiments compare several ways of using the training data (see the Track2 table below).
- Participants must strictly follow the challenge rules on data usage.
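For example, a Kaldi-style lexicon can be bootstrapped from the public CMU pronouncing dictionary. This is a rough sketch only: the URL, paths, and cleanup steps are assumptions, and you should filter the result to your own word list and handle OOV words.

```bash
# Hypothetical lexicon bootstrap from CMUdict (not part of the official recipe).
mkdir -p data/local/dict
wget -O cmudict.dict https://raw.githubusercontent.com/cmusphinx/cmudict/master/cmudict.dict
# Drop trailing comments and alternate-pronunciation markers such as "(2)",
# keeping "WORD PHONES..." lines for data/local/dict/lexicon.txt.
sed -e 's/ #.*$//' -e 's/([0-9]*)//' cmudict.dict | sort -u > data/local/dict/lexicon.txt
```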
Track1
Accent recognition accuracy (%) on the cv set:

Model | RU | KR | US | PT | JPN | UK | CHN | IND | AVE |
---|---|---|---|---|---|---|---|---|---|
Transformer-3L | 30.0 | 45.0 | 45.7 | 57.2 | 48.5 | 70.0 | 56.2 | 83.5 | 54.1 |
Transformer-6L | 34.0 | 43.7 | 30.6 | 65.7 | 44.0 | 74.5 | 50.9 | 75.2 | 52.2 |
Transformer-12L | 49.6 | 26.0 | 21.2 | 51.8 | 42.7 | 85.0 | 38.2 | 66.1 | 47.8 |
+ ASR-init | 75.7 | 55.6 | 60.2 | 85.5 | 73.2 | 93.9 | 67.0 | 97.0 | 76.1 |
Transformer-3L, Transformer-6L, and Transformer-12L all use ./local/track1_espnet_transformer_train.sh (elayers: 3, 6, 12, respectively), as sketched in the AR Track section above.
ASR-init uses the Track2 ASR encoder to initialize the self-attention parameters.
*On the cv set, we found that the accuracy for some accents is strongly speaker-dependent. Since the cv set contains only a few speakers, the absolute values above are not statistically significant; the test set will contain more speakers.
Track2
Kaldi hybrid chain model: CNN + 18 TDNN layers. *Based on an internal, non-open-source dictionary; results with the CMU dictionary are coming soon.
ESPnet Transformer model: 12 encoder + 6 decoder layers (plain self-attention, joint CTC training, 1k sub-word BPE units).
Detailed hyperparameter settings can be found in ./local/files/conf/ and in the training scripts.
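To check the exact values used by the baselines (layer counts, CTC weight, LM fusion weight), you can grep the config directory. The key names below follow ESPnet1 conventions and are assumptions about how the repo's configs are written.

```bash
# Quick look at the baseline hyperparameters shipped under ./local/files/conf/.
# Key names (elayers, dlayers, mtlalpha, lm-weight, ctc-weight) are assumptions.
grep -rE "elayers|dlayers|mtlalpha|lm-weight|ctc-weight" ./local/files/conf/
```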
WER (%) on the cv set:

Model | Data | Decode Related | RU | KR | US | PT | JPN | UK | CHN | IND | AVE |
---|---|---|---|---|---|---|---|---|---|---|---|
Kaldi | Accent160 | - | 6.67 | 11.46 | 15.95 | 10.27 | 9.78 | 16.88 | 20.97 | 17.48 | 13.68 |
Kaldi | Libri960 ~ Accent160 | - | 6.61 | 10.95 | 15.33 | 9.79 | 9.75 | 16.03 | 19.68 | 16.93 | 13.13 |
Kaldi | Accent160 + Libri160 | - | 6.95 | 11.76 | 13.05 | 9.96 | 10.15 | 14.21 | 20.76 | 18.26 | 13.14 |
ESPnet | Accent160 | +0.3RNNLM | 5.26 | 7.69 | 9.96 | 7.45 | 6.79 | 10.06 | 11.77 | 10.05 | 8.63 |
ESPnet | Libri960 ~ Accent160 | +0.3RNNLM | 4.6 | 6.4 | 7.42 | 5.9 | 5.71 | 7.64 | 9.87 | 7.85 | 6.92 |
ESPnet | Accent160 + Libri160 | - | 5.35 | 9.07 | 8.52 | 7.13 | 7.29 | 8.6 | 12.03 | 9.05 | 8.38 |
ESPnet | Accent160 + Libri160 | +0.3RNNLM | 4.68 | 7.59 | 7.7 | 6.42 | 6.37 | 7.76 | 10.88 | 8.41 | 7.48 |
ESPnet | Accent160 + Libri160 | +0.3RNNLM+0.3CTC | 4.76 | 7.81 | 7.71 | 6.36 | 6.4 | 7.23 | 10.77 | 8.01 | 7.38 |
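The "Decode Related" column refers to decoding-time weights: +0.3RNNLM presumably fuses the RNN language model with weight 0.3, and +0.3CTC adds CTC scoring with weight 0.3. In ESPnet1 these usually map to a decode config along the lines of the sketch below; the key names follow ESPnet1 defaults and the actual files live under ./local/files/conf/.

```bash
# Hypothetical ESPnet1-style decode config matching the "+0.3RNNLM+0.3CTC" row.
# Key names (lm-weight, ctc-weight) are assumptions based on ESPnet1 conventions.
cat > conf/decode_example.yaml <<EOF
lm-weight: 0.3
ctc-weight: 0.3
EOF
```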