This is a PyTorch implementation of the ACL 2022 main conference paper Learning When to Translate for Streaming Speech.
Take German as an example.
First, download the MuST-C v1.0 archive MUSTC_v1.0_en-de.tar.gz to the ${MUSTC_ROOT} path and uncompress it:
LANG=de
MUSTC_ROOT=/path/data/en-${LANG}
tar -xzvf MUSTC_v1.0_en-de.tar.gz
Then run the script to prepare the data manifest.
python3 examples/speech_to_text/prep_mustc_data_raw.py --data-root ${MUSTC_ROOT} \
--tgt-lang ${LANG}
The generated .tsv file should be extended with the source-language fields (src_text and src_lang) and doubled with ASR-task rows, in which the target text is the source transcript. Here are some examples from the .tsv file.
id audio n_frames tgt_text speaker tgt_lang src_text src_lang
ted_2529_66 /xxx/en-de/data/train/wav/ted_2529.wav:9517120:61760 61760 Ich hatte den Vorteil einer Perspektive von dieser Breite. spk.2529 de I had the benefit of a spectrum this wide. en
ted_1257_134 /xxx/en-de/data/train/wav/ted_1257.wav:13876160:80960 80960 And outside the library, I wanted to make a place to cultivate your mind. spk.1257 en And outside the library, I wanted to make a place to cultivate your mind. en
ted_362_30 /xxx/en-de/data/train/wav/ted_362.wav:488959:156960 156960 Ich lebe genau hier im West Village, die Rauchwolke wurde zum Glück westwärts geweht, weg von uns. spk.362 de I live right there in the West Village, so the plume was luckily blowing west, away from us. en
...
ted_526_7 /xxx/en-de/data/train/wav/ted_526.wav:16538720:19360 19360 It can also happen in the brain. spk.526 en It can also happen in the brain. en
ted_190_62 /xxx/en-de/data/train/wav/ted_190.wav:7045920:47360 47360 Simple question: if you can't read and write, how do you manage your contact information? spk.190 en Simple question: if you can't read and write, how do you manage your contact information? en
ted_1771_81 /xxx/en-de/data/train/wav/ted_1771.wav:9624320:25600 25600 This is my message to you. spk.1771 en This is my message to you. en
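The doubling step can be sketched in Python. This is a minimal illustration under stated assumptions, not the repository's preprocessing code: the helper names `double_with_asr` and `to_tsv_line` are hypothetical, and each ST row is assumed to already carry `src_text` and `src_lang`. Each derived ASR row reuses the audio of its ST row and takes the source transcript as its target, which matches the `en` rows in the examples above.

```python
FIELDS = ["id", "audio", "n_frames", "tgt_text", "speaker", "tgt_lang", "src_text", "src_lang"]


def double_with_asr(st_rows):
    """Return the ST rows followed by one derived ASR row per ST row."""
    asr_rows = []
    for row in st_rows:
        asr_row = dict(row)  # copy so the ST row is left untouched
        asr_row["tgt_text"] = row["src_text"]  # ASR target is the transcript
        asr_row["tgt_lang"] = row["src_lang"]  # language tag becomes the source language
        asr_rows.append(asr_row)
    return list(st_rows) + asr_rows


def to_tsv_line(row):
    """Join one row into a tab-separated line in manifest column order."""
    return "\t".join(str(row[name]) for name in FIELDS)


# One ST row copied from the examples above.
st_row = {
    "id": "ted_2529_66",
    "audio": "/xxx/en-de/data/train/wav/ted_2529.wav:9517120:61760",
    "n_frames": 61760,
    "tgt_text": "Ich hatte den Vorteil einer Perspektive von dieser Breite.",
    "speaker": "spk.2529",
    "tgt_lang": "de",
    "src_text": "I had the benefit of a spectrum this wide.",
    "src_lang": "en",
}
rows = double_with_asr([st_row])
for row in rows:
    print(to_tsv_line(row))
```

Whether the ST and ASR rows are kept in this order or shuffled in the released manifests is not stated here, so treat the ordering as arbitrary.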
The preprocessed directory ${MUSTC_ROOT} should look as follows:
.
├── en-de
│ ├── config_wave.yaml
│ ├── data
│ ├── dev_wavecif_joint.tsv
│ ├── docs
│ ├── segment
│ ├── spm_unigram10000_st.model
│ ├── spm_unigram10000_st.txt
│ ├── spm_unigram10000_st.vocab
│ ├── train_wavecif_joint.tsv
│ ├── tst-COMMON_wavecif_joint.tsv
│ ├── tst-HE_wavecif_joint.tsv
└── MUSTC_v1.0_en-de.tar.gz
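Before training, it can help to confirm that the layout matches the tree above. The following is a small sketch, assuming `root` is the directory shown as `.` in the tree; the helper name `missing_files` is hypothetical and the file list is copied from the tree.

```python
from pathlib import Path

# File names taken from the directory tree above.
EXPECTED_FILES = [
    "config_wave.yaml",
    "spm_unigram10000_st.model",
    "spm_unigram10000_st.txt",
    "spm_unigram10000_st.vocab",
    "train_wavecif_joint.tsv",
    "dev_wavecif_joint.tsv",
    "tst-COMMON_wavecif_joint.tsv",
    "tst-HE_wavecif_joint.tsv",
]


def missing_files(root, lang="de"):
    """List expected files that are absent from <root>/en-<lang>."""
    pair_dir = Path(root) / f"en-{lang}"
    return [name for name in EXPECTED_FILES if not (pair_dir / name).is_file()]
```

Calling `missing_files("/path/data")` returns an empty list when every expected file is present, and the names of the absent files otherwise.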
The sentencepiece model and vocabulary files for En-De can be downloaded at: spm_unigram10000_st.model, spm_unigram10000_st.txt, spm_unigram10000_st.vocab.
The sentencepiece model and vocabulary files for En-Fr can be downloaded at: spm_unigram10000_st.model, spm_unigram10000_st.txt, spm_unigram10000_st.vocab.
The sentencepiece model for generating the labels of the monotonic segmentation module (MSM) can be downloaded at: spm_unigram5000_asr.model, which should be placed at /path/spm_unigram5000_asr.model.
The generated config_wave.yaml should look as follows:
bpe_tokenizer:
  bpe: sentencepiece
  sentencepiece_model: spm_unigram10000_st.model
input_channels: 1
input_feat_per_channel: 80
sampling_alpha: 1.0
vocab_filename: spm_unigram10000_st.txt
use_audio_input: true
prepend_tgt_lang_tag: true
- Training with multitask learning.
fairseq-train ${MUSTC_ROOT} \
--config-yaml config_wave.yaml \
--train-subset train_wave_joint \
--valid-subset dev_wave_joint \
--save-dir /path/${LANG}/pretrain \
--max-tokens 3200000 \
--update-freq 1 \
--max-update 3200000 \
--task speech_to_text_wav2vec \
--criterion label_smoothed_cross_entropy \
--report-accuracy \
--arch convtransformer_espnet_wav2vec \
--w2v2-model-path /path/wav2vec_small.pt \
--optimizer adam \
--lr 0.0001 \
--lr-scheduler inverse_sqrt \
--warmup-updates 25000 \
--clip-norm 10.0 \
--seed 1 \
--ddp-backend=no_c10d \
--keep-best-checkpoints 10 \
--best-checkpoint-metric accuracy \
--maximize-best-checkpoint-metric \
--patience 15 \
--max-source-positions 3200000 \
--skip-invalid-size-inputs-valid-test \
--dropout 0.0 --activation-dropout 0.1 --attention-dropout 0.1 \
--encoder-layers 8 \
--empty-cache-freq 100 \
--ignore-prefix-size 1 \
--fp16
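The effect of `--lr 0.0001`, `--lr-scheduler inverse_sqrt`, and `--warmup-updates 25000` can be illustrated with a short sketch. It assumes a linear warmup that starts from 0 (believed to be fairseq's behaviour when `--warmup-init-lr` is not set) followed by decay proportional to the inverse square root of the update count. The function name `inverse_sqrt_lr` is hypothetical; this is an approximation of the schedule, not fairseq's own code.

```python
def inverse_sqrt_lr(num_updates, peak_lr=1e-4, warmup_updates=25000, warmup_init_lr=0.0):
    """Learning rate after `num_updates` optimizer steps (illustrative only)."""
    if num_updates < warmup_updates:
        # Linear warmup from warmup_init_lr to peak_lr.
        step = (peak_lr - warmup_init_lr) / warmup_updates
        return warmup_init_lr + num_updates * step
    # After warmup, decay with the inverse square root of the update count.
    return peak_lr * (warmup_updates / num_updates) ** 0.5


for n in (0, 12500, 25000, 100000):
    print(n, inverse_sqrt_lr(n))
```

Under these assumptions the rate peaks at update 25000 and has halved again by update 100000, four times the warmup length.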
id audio n_frames tgt_text speaker tgt_lang
ted_878_142 /xxx/en-de/data/train/wav/ted_878.wav:1216800:161760 161760 But we too rarely articulate and defend and argue about those big moral questions in our politics. spk.878 en
ted_1776_86 /xxx/en-de/data/train/wav/ted_1776.wav:8300639:39040 39040 Ich bin also so etwas wie ein Humoranalyst. spk.1776 de
ted_1312_6 /xxx/en-de/data/train/wav/ted_1312.wav:1980000:31200 31200 And I just finished a couple of months ago. spk.1312 en
ted_2889_24 /xxx/en-de/data/train/wav/ted_2889.wav:3703360:139840 139840 One reason is the stigma, with 63 percent of black Americans mistaking depression for a weakness. spk.2889 en
ted_445_163 /xxx/en-de/data/train/wav/ted_445.wav:14420960:88160 88160 They all have the same virus, but they're different enough that there's reason to believe that they've been independently acquired. spk.445 en
ted_424_60 /xxx/en-de/data/train/wav/ted_424.wav:9106080:83840 83840 Lem Sen: "I would've made this money, too, but I spent all this time looking for the American man who stole my recipe. spk.424 en
ted_1489_67 /xxx/en-de/data/train/wav/ted_1489.wav:12616000:39519 39519 India has the youngest growing population in the world. spk.1489 en
ted_1258_76 /xxx/en-de/data/train/wav/ted_1258.wav:7939040:18400 18400 I spend a lot of time on the road. spk.1258 en
ted_1513_11 /xxx/en-de/data/train/wav/ted_1513.wav:2869919:28000 28000 It's active in the Gulf of Guinea. spk.1513 en
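The `audio` column in these rows packs three values into one string: the path of the wav file, an offset into it, and the segment length, where the last value repeats the `n_frames` column. The sketch below splits that field and converts the length to seconds. The 16 kHz sample rate is an assumption (it is the rate wav2vec 2.0 models are usually trained on), and the names `AudioSlice`, `parse_audio_field`, and `duration_seconds` are hypothetical.

```python
from typing import NamedTuple

SAMPLE_RATE = 16_000  # assumption: 16 kHz audio, not stated in this README


class AudioSlice(NamedTuple):
    path: str
    offset: int    # start of the segment, in samples
    n_frames: int  # length of the segment, in samples


def parse_audio_field(field):
    """Split 'path:offset:n_frames' from the right so colons in the path survive."""
    path, offset, n_frames = field.rsplit(":", 2)
    return AudioSlice(path, int(offset), int(n_frames))


def duration_seconds(audio, sample_rate=SAMPLE_RATE):
    """Segment length in seconds at the given sample rate."""
    return audio.n_frames / sample_rate


segment = parse_audio_field("/xxx/en-de/data/train/wav/ted_878.wav:1216800:161760")
print(segment.path, segment.offset, duration_seconds(segment))
```

If batch sizes are counted in the same units, `--max-tokens 3200000` would correspond to roughly 200 seconds of audio per batch at 16 kHz.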
We use the pre-trained Wav2vec 2.0 as the acoustic encoder.
- Fine-tuning with monotonic segmentation module.
fairseq-train ${MUSTC_ROOT} \
--config-yaml config_wave.yaml \
--train-subset train_wavecif_joint \
--valid-subset dev_wavecif_joint \
--save-dir /path/${LANG}/finetune/ \
--max-tokens 3200000 \
--update-freq 1 \
--max-update 3200000 \
--task speech_to_text_wav2vec_cif \
--criterion qua_ce_acc_v2 \
--arch convtransformer_espnet_wav2vec_cif \
--w2v2-model-path /path/wav2vec_small.pt \
--optimizer adam \
--lr 0.0001 \
--lr-scheduler inverse_sqrt \
--warmup-updates 10000 \
--clip-norm 10.0 \
--seed 1 \
--ddp-backend=no_c10d \
--keep-best-checkpoints 10 \
--best-checkpoint-metric accuracy \
--maximize-best-checkpoint-metric \
--patience 15 \
--max-source-positions 3200000 \
--skip-invalid-size-inputs-valid-test \
--dropout 0.0 --activation-dropout 0.1 --attention-dropout 0.1 \
--encoder-layers 8 \
--ignore-prefix-size 1 --log-interval 20 --fp16 \
--load-pretrained-encoder-from /path/${LANG}/pretrain/checkpoint.pt \
--load-pretrained-decoder-from /path/${LANG}/pretrain/checkpoint.pt
Our released models (En-De and En-Fr) can be downloaded to run the evaluation directly.
fairseq-generate ${MUSTC_ROOT} \
--config-yaml config_wave.yaml \
--gen-subset tst-COMMON_wavecif_joint_st \
--task speech_to_text_wav2vec_cif \
--path /path/${LANG}/finetune/checkpoint.pt \
--max-tokens 3200000 \
--beam 5 \
--scoring sacrebleu \
--max-source-positions 3200000 \
--prefix-size 1
Note that the offline models need to be converted to support the streaming translation task. Our model (En-De) can be downloaded to test streaming translation.
- Prefix-decision
lagging=5
fixed_pre_decision_ratio=7
simuleval --agent mosst/examples/speech_to_text/simultaneous_translation/agents/fairseq_simul_st_agent_wav2vec.py \
--source /path/data/tst-COMMON.wavurl \
--target /path/data/tst-COMMON.${LANG} \
--data-bin /path/data/en-${LANG}/ \
--config config_wave.yaml \
--model-path /path/${LANG}/finetune/checkpoint.pt \
--output /path/${LANG}/finetune/simuleval/ \
--waitk-lagging ${lagging} \
--fixed-pre-decision-ratio ${fixed_pre_decision_ratio} \
--scores \
--port 1234
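The two variables in the command above configure a wait-k policy over fixed-size chunks. The following is a minimal sketch of such a policy, not the agent shipped in this repository: it assumes that `fixed_pre_decision_ratio` encoder states form one decision chunk and that the next token is written once the number of chunks read exceeds the number of tokens written by at least `lagging`. The function names are hypothetical.

```python
def num_chunks(num_encoder_states, ratio):
    """Number of complete fixed-size chunks among the encoder states seen so far."""
    return num_encoder_states // ratio


def waitk_action(num_encoder_states, num_target_tokens, k, ratio, source_finished=False):
    """Return "WRITE" to emit the next token or "READ" to wait for more speech."""
    if source_finished:
        return "WRITE"  # nothing left to read, so finish the translation
    lag = num_chunks(num_encoder_states, ratio) - num_target_tokens
    return "WRITE" if lag >= k else "READ"


# Values from the command above.
lagging = 5
fixed_pre_decision_ratio = 7

written = 0
for states in range(1, 50):
    if waitk_action(states, written, lagging, fixed_pre_decision_ratio) == "WRITE":
        written += 1
        print(f"encoder states={states}: write token {written}")
```

A larger `lagging` delays the first output and generally trades latency for translation quality; a larger ratio makes each read/write decision cover more audio.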
- Dynamic-decision
simuleval --agent mosst/examples/speech_to_text/simultaneous_translation/agents/fairseq_simul_st_agent_wav2vec_cif.py \
--source /path/data/tst-COMMON.wavurl \
--target /path/data/tst-COMMON.${LANG} \
--data-bin /path/data/en-${LANG}/ \
--config config_wave.yaml \
--model-path /path/${LANG}/finetune/checkpoint.pt \
--output /path/${LANG}/finetune/simuleval/ \
--scores \
--max-source-positions 3200000 \
--port 1234
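Dynamic-decision needs no lagging or chunk-size setting because the model itself decides where segments end. The `cif` suffix in the agent and task names suggests an integrate-and-fire style segmenter: per-frame weights are accumulated, and a boundary fires whenever the running sum reaches a threshold. The sketch below illustrates that idea only. The function name `fire_boundaries`, the threshold of 1.0, and the example weights are assumptions, and the repository's module may differ in detail.

```python
def fire_boundaries(weights, threshold=1.0):
    """Return the frame indices at which the accumulated weight reaches the threshold."""
    boundaries = []
    accumulated = 0.0
    for index, weight in enumerate(weights):
        accumulated += weight
        if accumulated >= threshold:
            boundaries.append(index)  # fire: a segment ends at this frame
            accumulated -= threshold  # carry the remainder into the next segment
    return boundaries


frame_weights = [0.3, 0.4, 0.5, 0.2, 0.9]  # made-up per-frame weights in (0, 1)
print(fire_boundaries(frame_weights))
```

In a streaming setting, each fired boundary would be a natural point at which to hand the accumulated segment to the decoder, so the amount of audio read before each write varies with the input.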
Please consider citing our paper in your publications if the project helps your research. The BibTeX reference is as follows.
@inproceedings{dong-etal-2022-Learning,
title = {Learning When to Translate for Streaming Speech},
author = {Dong, Qianqian and Zhu, Yaoming and Wang, Mingxuan and Li, Lei},
booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics},
year = {2022},
}