Data preparation scripts and training pipeline for the Interspeech 2020 Accented English Speech Recognition Challenge (AESRC).
Requirements
- Install Kaldi (data preparation scripts, Track2 traditional ASR model training) Github Link
- Install ESPnet (Track1 E2E AR model training, Track2 E2E ASR Transformer training) Github Link
- (Optional) Install Google SentencePiece (Track2 E2E ASR modeling unit building) Github Link
- (Optional) Install KenLM (n-gram language model training) Github Link
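If you use KenLM for the optional n-gram language model, the basic commands look roughly like this. This is a minimal sketch: the file paths and the 3-gram order are assumptions, not the challenge baseline settings.

```bash
# Minimal KenLM sketch (assumed paths and order, not the official baseline recipe):
# train a 3-gram LM on transcript text and binarize it for faster loading.
mkdir -p lm
lmplz -o 3 --text data/train/text_only.txt --arpa lm/3gram.arpa
build_binary lm/3gram.arpa lm/3gram.bin
```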
Data Preparation
- Download the challenge data
- Prepare the data: split off the cv set, extract features, and train the BPE model
./local/prepare_data.sh
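If you need to (re)build the sub-word units outside this script, SentencePiece BPE training looks roughly like the sketch below. The paths are assumptions and the 1k vocabulary simply matches the unit count quoted in the Track2 section; check ./local/prepare_data.sh for the actual settings.

```bash
# Sketch of BPE unit training with SentencePiece (paths are assumptions).
mkdir -p data/lang_bpe
spm_train --input=data/train/text_only.txt \
          --model_prefix=data/lang_bpe/bpe1k \
          --vocab_size=1000 \
          --model_type=bpe
# Encode transcripts into sub-word sequences with the trained model.
spm_encode --model=data/lang_bpe/bpe1k.model \
    < data/train/text_only.txt > data/train/text_bpe.txt
```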
AR Track
- Train the Track1 ESPnet AR (accent recognition) model
./local/track1_espnet_transformer_train.sh
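The Track1 baselines reported below differ only in encoder depth (3/6/12 layers). One way to run all three variants is sketched here; the --config option and config file names are assumptions, and the script may instead expect you to edit an ESPnet yaml under ./local/files/conf/.

```bash
# Hypothetical loop over the three encoder depths reported in the Track1 table.
# The --config flag and file names are assumptions; check the script's actual interface.
for nlayers in 3 6 12; do
    ./local/track1_espnet_transformer_train.sh \
        --config conf/train_transformer_${nlayers}L.yaml
done
```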
ASR Track
- Train the Track2 Kaldi GMM alignment model
./local/track2_kaldi_gmm_train.sh
- Generate lattices and the decision tree, then train the Track2 Kaldi chain model
./local/track2_kaldi_chain_train.sh
- Train the Track2 ESPnet Transformer model (and the Track2 ESPnet RNN language model)
./local/track2_espnet_transformer_train.sh
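Run end to end, the Track2 baseline is simply these scripts in order after data preparation. The sketch below passes no extra flags and assumes the scripts read their settings from ./local/files/conf/.

```bash
# Track2 baseline pipeline, run in order after ./local/prepare_data.sh.
./local/track2_kaldi_gmm_train.sh            # GMM-HMM system for alignments
./local/track2_kaldi_chain_train.sh          # lattices, decision tree, chain (TDNN) model
./local/track2_espnet_transformer_train.sh   # E2E Transformer (+ RNN language model)
```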
Notice
- No lexicon is provided; please prepare one yourself (see the sketch after this list).
- Data augmentation methods are not included in the scripts.
- Install Kaldi and ESPnet and activate their environments before running the scripts.
- The Track2 baseline experiments compare several ways of using the training data (see the Track2 table below).
- Participants must strictly follow the challenge rules on data usage.
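For example, a Kaldi-style lexicon can be bootstrapped from the public CMU pronouncing dictionary. This is a rough sketch only: the URL, paths, and cleanup steps are assumptions, and you should filter the result to your own word list and handle OOV words.

```bash
# Hypothetical lexicon bootstrap from CMUdict (not part of the official recipe).
mkdir -p data/local/dict
wget -O cmudict.dict https://raw.githubusercontent.com/cmusphinx/cmudict/master/cmudict.dict
# Drop trailing comments and alternate-pronunciation markers such as "(2)",
# keeping "WORD PHONES..." lines for data/local/dict/lexicon.txt.
sed -e 's/ #.*$//' -e 's/([0-9]*)//' cmudict.dict | sort -u > data/local/dict/lexicon.txt
```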
Track1
Accent recognition accuracy (%) on the cv set:

Model | RU | KR | US | PT | JPN | UK | CHN | IND | AVE |
---|---|---|---|---|---|---|---|---|---|
Transformer-3L | 30.0 | 45.0 | 45.7 | 57.2 | 48.5 | 70.0 | 56.2 | 83.5 | 54.1 |
Transformer-6L | 34.0 | 43.7 | 30.6 | 65.7 | 44.0 | 74.5 | 50.9 | 75.2 | 52.2 |
Transformer-12L | 49.6 | 26.0 | 21.2 | 51.8 | 42.7 | 85.0 | 38.2 | 66.1 | 47.8 |
+ ASR-init | 75.7 | 55.6 | 60.2 | 85.5 | 73.2 | 93.9 | 67.0 | 97.0 | 76.1 |
Transformer-3L, Transformer-6L, and Transformer-12L all use ./local/track1_espnet_transformer_train.sh (elayers: 3, 6, 12, respectively), as sketched in the AR Track section above.
ASR-init uses the Track2 ASR encoder to initialize the self-attention parameters.
*On the cv set, we found that the accuracy for some accents is strongly speaker-dependent. Since the cv set contains only a few speakers, the absolute values above are not statistically significant; the test set will contain more speakers.
Track2
Kaldi hybrid chain model: CNN + 18 TDNN layers. *Based on an internal, non-open-source dictionary; results with the CMU dictionary are coming soon.
ESPnet Transformer model: 12 encoder + 6 decoder layers (plain self-attention, joint CTC training, 1k sub-word BPE units).
Detailed hyperparameter settings can be found in ./local/files/conf/ and in the training scripts.
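To check the exact values used by the baselines (layer counts, CTC weight, LM fusion weight), you can grep the config directory. The key names below follow ESPnet1 conventions and are assumptions about how the repo's configs are written.

```bash
# Quick look at the baseline hyperparameters shipped under ./local/files/conf/.
# Key names (elayers, dlayers, mtlalpha, lm-weight, ctc-weight) are assumptions.
grep -rE "elayers|dlayers|mtlalpha|lm-weight|ctc-weight" ./local/files/conf/
```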
WER (%) on the cv set:

Model | Data | Decode Related | RU | KR | US | PT | JPN | UK | CHN | IND | AVE |
---|---|---|---|---|---|---|---|---|---|---|---|
Kaldi | Accent160 | - | 6.67 | 11.46 | 15.95 | 10.27 | 9.78 | 16.88 | 20.97 | 17.48 | 13.68 |
Kaldi | Libri960 ~ Accent160 | - | 6.61 | 10.95 | 15.33 | 9.79 | 9.75 | 16.03 | 19.68 | 16.93 | 13.13 |
Kaldi | Accent160 + Libri160 | - | 6.95 | 11.76 | 13.05 | 9.96 | 10.15 | 14.21 | 20.76 | 18.26 | 13.14 |
ESPnet | Accent160 | +0.3RNNLM | 5.26 | 7.69 | 9.96 | 7.45 | 6.79 | 10.06 | 11.77 | 10.05 | 8.63 |
ESPnet | Libri960 ~ Accent160 | +0.3RNNLM | 4.6 | 6.4 | 7.42 | 5.9 | 5.71 | 7.64 | 9.87 | 7.85 | 6.92 |
ESPnet | Accent160 + Libri160 | - | 5.35 | 9.07 | 8.52 | 7.13 | 7.29 | 8.6 | 12.03 | 9.05 | 8.38 |
ESPnet | Accent160 + Libri160 | +0.3RNNLM | 4.68 | 7.59 | 7.7 | 6.42 | 6.37 | 7.76 | 10.88 | 8.41 | 7.48 |
ESPnet | Accent160 + Libri160 | +0.3RNNLM+0.3CTC | 4.76 | 7.81 | 7.71 | 6.36 | 6.4 | 7.23 | 10.77 | 8.01 | 7.38 |
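The "Decode Related" column refers to decoding-time weights: +0.3RNNLM presumably fuses the RNN language model with weight 0.3, and +0.3CTC adds CTC scoring with weight 0.3. In ESPnet1 these usually map to a decode config along the lines of the sketch below; the key names follow ESPnet1 defaults and the actual files live under ./local/files/conf/.

```bash
# Hypothetical ESPnet1-style decode config matching the "+0.3RNNLM+0.3CTC" row.
# Key names (lm-weight, ctc-weight) are assumptions based on ESPnet1 conventions.
cat > conf/decode_example.yaml <<EOF
lm-weight: 0.3
ctc-weight: 0.3
EOF
```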