dqqcasia/mosst

Learning When to Translate for Streaming Speech

This is a PyTorch implementation of the ACL 2022 main conference paper "Learning When to Translate for Streaming Speech".

Data Processing

Take German as an example. First, download the MuST-C v1.0 archive MUSTC_v1.0_en-de.tar.gz to the ${MUSTC_ROOT} path and uncompress it:

LANG=de
MUSTC_ROOT=/path/data/en-${LANG}
tar -xzvf MUSTC_v1.0_en-de.tar.gz

Then run the script to prepare the data manifest:

python3 examples/speech_to_text/prep_mustc_data_raw.py --data-root ${MUSTC_ROOT} \
  --tgt-lang ${LANG}

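The generated .tsv file should then be extended with the source-language fields (src_text and src_lang) and doubled with the ASR task: each utterance also appears as a row whose tgt_text is the English transcript and whose tgt_lang is en. Here are some examples from the .tsv file: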

id      audio   n_frames        tgt_text        speaker tgt_lang        src_text        src_lang
ted_2529_66     /xxx/en-de/data/train/wav/ted_2529.wav:9517120:61760      61760   Ich hatte den Vorteil einer Perspektive von dieser Breite.  spk.2529        de      I had the benefit of a spectrum this wide.      en
ted_1257_134    /xxx/en-de/data/train/wav/ted_1257.wav:13876160:80960     80960   And outside the library, I wanted to make a place to cultivate your mind.   spk.1257        en      And outside the library, I wanted to make a place to cultivate your mind.       en
ted_362_30      /xxx/en-de/data/train/wav/ted_362.wav:488959:156960       156960  Ich lebe genau hier im West Village, die Rauchwolke wurde zum Glück westwärts geweht, weg von uns.  spk.362 de      I live right there in the West Village, so the plume was luckily blowing west, away from us.        en
...
ted_526_7       /xxx/en-de/data/train/wav/ted_526.wav:16538720:19360      19360   It can also happen in the brain.    spk.526 en      It can also happen in the brain.        en
ted_190_62      /xxx/en-de/data/train/wav/ted_190.wav:7045920:47360       47360   Simple question: if you can't read and write, how do you manage your contact information?   spk.190 en      Simple question: if you can't read and write, how do you manage your contact information?   en
ted_1771_81     /xxx/en-de/data/train/wav/ted_1771.wav:9624320:25600      25600   This is my message to you. spk.1771 en      This is my message to you.      en
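
The expansion described above can be sketched in a few lines of Python. This is a minimal illustration, not the script used in this repository: the helper names expand_manifest and to_tsv and the transcripts mapping (utterance id to English transcript) are hypothetical, and only the column names are taken from the example rows.

```python
FIELDS = ["id", "audio", "n_frames", "tgt_text", "speaker",
          "tgt_lang", "src_text", "src_lang"]


def expand_manifest(rows, transcripts, src_lang="en"):
    """Add source-text fields to each ST row and append an ASR copy of it.

    rows: dicts with the six original columns (id ... tgt_lang).
    transcripts: hypothetical mapping from utterance id to its transcript.
    """
    expanded = []
    for row in rows:
        src_text = transcripts[row["id"]]
        st_row = dict(row, src_text=src_text, src_lang=src_lang)
        # ASR copy: same audio, but the target is the transcript itself.
        asr_row = dict(st_row, tgt_text=src_text, tgt_lang=src_lang)
        expanded += [st_row, asr_row]
    return expanded


def to_tsv(rows):
    """Serialize rows as tab-separated text with a header line."""
    lines = ["\t".join(FIELDS)]
    lines += ["\t".join(str(row[name]) for name in FIELDS) for row in rows]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    st_rows = [{
        "id": "ted_2529_66",
        "audio": "/xxx/en-de/data/train/wav/ted_2529.wav:9517120:61760",
        "n_frames": 61760,
        "tgt_text": "Ich hatte den Vorteil einer Perspektive von dieser Breite.",
        "speaker": "spk.2529",
        "tgt_lang": "de",
    }]
    transcripts = {"ted_2529_66": "I had the benefit of a spectrum this wide."}
    print(to_tsv(expand_manifest(st_rows, transcripts)), end="")
```

The sketch keeps the ST row and its ASR copy next to each other; the rows in the example above are not ordered this way, so the order is presumably not significant.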

The preprocessed directory ${MUSTC_ROOT} should look as follows:

.
├── en-de
│   ├── config_wave.yaml
│   ├── data
│   ├── dev_wavecif_joint.tsv
│   ├── docs
│   ├── segment
│   ├── spm_unigram10000_st.model
│   ├── spm_unigram10000_st.txt
│   ├── spm_unigram10000_st.vocab
│   ├── train_wavecif_joint.tsv
│   ├── tst-COMMON_wavecif_joint.tsv
│   ├── tst-HE_wavecif_joint.tsv
└── MUSTC_v1.0_en-de.tar.gz
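
Before training, it can help to check that the layout matches the tree above. The following is a small sketch under the assumption that ${MUSTC_ROOT} contains the en-${LANG} directory exactly as drawn; missing_entries is a hypothetical helper, not part of this repository.

```python
from pathlib import Path

# File and directory names copied from the tree above.
EXPECTED_FILES = [
    "config_wave.yaml",
    "dev_wavecif_joint.tsv",
    "spm_unigram10000_st.model",
    "spm_unigram10000_st.txt",
    "spm_unigram10000_st.vocab",
    "train_wavecif_joint.tsv",
    "tst-COMMON_wavecif_joint.tsv",
    "tst-HE_wavecif_joint.tsv",
]
EXPECTED_DIRS = ["data", "docs", "segment"]


def missing_entries(mustc_root, lang="de"):
    """Return the expected files and directories that are not present."""
    pair_dir = Path(mustc_root) / f"en-{lang}"
    missing = [name for name in EXPECTED_FILES
               if not (pair_dir / name).is_file()]
    missing += [name + "/" for name in EXPECTED_DIRS
                if not (pair_dir / name).is_dir()]
    return missing


if __name__ == "__main__":
    # Placeholder path; replace it with your own ${MUSTC_ROOT}.
    for name in missing_entries("/path/data"):
        print("missing:", name)
```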

The sentencepiece model and vocabulary files for En-De can be downloaded at: spm_unigram10000_st.model, spm_unigram10000_st.txt, spm_unigram10000_st.vocab.

The sentencepiece model and vocabulary files for En-Fr can be downloaded at: spm_unigram10000_st.model, spm_unigram10000_st.txt, spm_unigram10000_st.vocab.

The sentencepiece model for generating the labels of the monotonic segmentation module (MSM) can be downloaded at: spm_unigram5000_asr.model. It should be placed at /path/spm_unigram5000_asr.model.

The generated config_wave.yaml should look as follows:

bpe_tokenizer:
  bpe: sentencepiece
  sentencepiece_model: spm_unigram10000_st.model
input_channels: 1
input_feat_per_channel: 80
sampling_alpha: 1.0
vocab_filename: spm_unigram10000_st.txt
use_audio_input: true
prepend_tgt_lang_tag: true
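
The prepend_tgt_lang_tag option is what lets one model serve both the translation rows (tgt_lang de) and the ASR rows (tgt_lang en): every target sequence starts with a language tag. The commands in this README pass --ignore-prefix-size 1 during training and --prefix-size 1 during generation, which matches a one-token tag. The sketch below only illustrates that relationship; the tag format `<lang:xx>` is an assumption based on fairseq's speech-to-text data code, and all function names are hypothetical.

```python
def lang_tag(lang):
    # Assumed tag format (fairseq's speech-to-text code uses "<lang:xx>").
    return f"<lang:{lang}>"


def prepend_tag(tokens, lang):
    """Target sequence as seen by the decoder when the tag is prepended."""
    return [lang_tag(lang)] + list(tokens)


def loss_positions(target, ignore_prefix_size=1):
    """Tokens that still contribute to the loss once the prefix is ignored."""
    return target[ignore_prefix_size:]


def forced_prefix(target, prefix_size=1):
    """Tokens handed to the decoder as a fixed prefix at generation time."""
    return target[:prefix_size]


if __name__ == "__main__":
    st_target = prepend_tag(["Ich", "bin", "hier"], "de")
    asr_target = prepend_tag(["I", "am", "here"], "en")
    print(st_target, loss_positions(st_target), forced_prefix(st_target))
    print(asr_target, loss_positions(asr_target), forced_prefix(asr_target))
```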

Training

  • Training with multitask learning.
fairseq-train ${MUSTC_ROOT} \
  --config-yaml config_wave.yaml \
  --train-subset train_wave_joint \
  --valid-subset dev_wave_joint \
  --save-dir /path/${LANG}/pretrain \
  --max-tokens 3200000  \
  --update-freq 1 \
  --max-update 3200000 \
  --task speech_to_text_wav2vec \
  --criterion label_smoothed_cross_entropy \
  --report-accuracy \
  --arch convtransformer_espnet_wav2vec \
  --w2v2-model-path /path/wav2vec_small.pt \
  --optimizer adam \
  --lr 0.0001 \
  --lr-scheduler inverse_sqrt \
  --warmup-updates 25000 \
  --clip-norm 10.0 \
  --seed 1 \
  --ddp-backend=no_c10d \
  --keep-best-checkpoints 10 \
  --best-checkpoint-metric accuracy \
  --maximize-best-checkpoint-metric \
  --patience 15 \
  --max-source-positions 3200000 \
  --skip-invalid-size-inputs-valid-test \
  --dropout 0.0 --activation-dropout 0.1 --attention-dropout 0.1 \
  --encoder-layers 8 \
  --empty-cache-freq 100 \
  --ignore-prefix-size 1 \
  --fp16

Example manifest rows without the src_text and src_lang fields:

id      audio   n_frames        tgt_text        speaker tgt_lang
ted_878_142     /xxx/en-de/data/train/wav/ted_878.wav:1216800:161760      161760  But we too rarely articulate and defend and argue about those big moral questions in our politics.   spk.878 en
ted_1776_86     /xxx/en-de/data/train/wav/ted_1776.wav:8300639:39040      39040   Ich bin also so etwas wie ein Humoranalyst.  spk.1776        de
ted_1312_6      /xxx/en-de/data/train/wav/ted_1312.wav:1980000:31200      31200   And I just finished a couple of months ago.  spk.1312        en
ted_2889_24     /xxx/en-de/data/train/wav/ted_2889.wav:3703360:139840     139840  One reason is the stigma, with 63 percent of black Americans mistaking depression for a weakness.    spk.2889        en
ted_445_163     /xxx/en-de/data/train/wav/ted_445.wav:14420960:88160      88160   They all have the same virus, but they're different enough that there's reason to believe that they've been independently acquired.  spk.445 en
ted_424_60      /xxx/en-de/data/train/wav/ted_424.wav:9106080:83840       83840   Lem Sen: "I would've made this money, too, but I spent all this time looking for the American man who stole my recipe.       spk.424 en
ted_1489_67     /xxx/en-de/data/train/wav/ted_1489.wav:12616000:39519     39519   India has the youngest growing population in the world.      spk.1489        en
ted_1258_76     /xxx/en-de/data/train/wav/ted_1258.wav:7939040:18400      18400   I spend a lot of time on the road.   spk.1258        en
ted_1513_11     /xxx/en-de/data/train/wav/ted_1513.wav:2869919:28000      28000   It's active in the Gulf of Guinea.   spk.1513        en
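
The audio column in the rows above packs three values into one string. A minimal sketch of how to read it, assuming the two trailing integers are a start offset and a length in samples and that the audio is sampled at 16 kHz (both are assumptions, not stated in this README); the function names are hypothetical.

```python
def parse_audio_field(field):
    """Split 'path:offset:n_frames' into (path, offset, n_frames)."""
    # rsplit keeps any colon inside the path itself intact.
    path, offset, n_frames = field.rsplit(":", 2)
    return path, int(offset), int(n_frames)


def duration_seconds(n_frames, sample_rate=16000):
    """Segment length in seconds for the assumed sample rate."""
    return n_frames / sample_rate


if __name__ == "__main__":
    field = "/xxx/en-de/data/train/wav/ted_878.wav:1216800:161760"
    path, offset, n_frames = parse_audio_field(field)
    print(path, offset, n_frames, duration_seconds(n_frames))
```

Note that the third value repeats the n_frames column of the same row.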

We use the pre-trained wav2vec 2.0 model (passed via --w2v2-model-path) as the acoustic encoder.

  • Fine-tuning with the monotonic segmentation module.
fairseq-train ${MUSTC_ROOT} \
  --config-yaml config_wave.yaml \
  --train-subset train_wavecif_joint \
  --valid-subset dev_wavecif_joint \
  --save-dir /path/${LANG}/finetune/ \
  --max-tokens 3200000  \
  --update-freq 1 \
  --max-update 3200000 \
  --task speech_to_text_wav2vec_cif \
  --criterion qua_ce_acc_v2 \
  --arch convtransformer_espnet_wav2vec_cif \
  --w2v2-model-path /path/wav2vec_small.pt \
  --optimizer adam \
  --lr 0.0001 \
  --lr-scheduler inverse_sqrt \
  --warmup-updates 10000 \
  --clip-norm 10.0 \
  --seed 1 \
  --ddp-backend=no_c10d \
  --keep-best-checkpoints 10 \
  --best-checkpoint-metric accuracy \
  --maximize-best-checkpoint-metric \
  --patience 15 \
  --max-source-positions 3200000 \
  --skip-invalid-size-inputs-valid-test \
  --dropout 0.0 --activation-dropout 0.1 --attention-dropout 0.1 \
  --encoder-layers 8 \
  --ignore-prefix-size 1 --log-interval 20  --fp16 \
  --load-pretrained-encoder-from /path/${LANG}/pretrain/checkpoint.pt \
  --load-pretrained-decoder-from /path/${LANG}/pretrain/checkpoint.pt
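
Both training commands use --lr-scheduler inverse_sqrt with --lr 0.0001; they differ in --warmup-updates (25000 for multitask training, 10000 for fine-tuning). The sketch below reproduces the usual inverse-square-root schedule to show what these values imply. It assumes that warm-up starts from a learning rate of 0 and is a plain re-implementation for illustration, not code from fairseq or this repository.

```python
def inverse_sqrt_lr(step, peak_lr=1e-4, warmup_updates=25000,
                    warmup_init_lr=0.0):
    """Learning rate after `step` updates.

    Linear warm-up from warmup_init_lr to peak_lr, then decay
    proportional to 1 / sqrt(step).
    """
    if step < warmup_updates:
        slope = (peak_lr - warmup_init_lr) / warmup_updates
        return warmup_init_lr + step * slope
    return peak_lr * (warmup_updates / step) ** 0.5


if __name__ == "__main__":
    for step in (0, 12500, 25000, 100000):
        print("multitask", step, inverse_sqrt_lr(step))
    for step in (0, 5000, 10000, 40000):
        print("finetune ", step, inverse_sqrt_lr(step, warmup_updates=10000))
```

Under these assumptions the learning rate falls to half of its peak once the step count reaches four times the warm-up length.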

Evaluation

Offline Translation

Our released models (En-De and En-Fr) can be downloaded to run the evaluation directly.

fairseq-generate ${MUSTC_ROOT} \
  --config-yaml config_wave.yaml \
  --gen-subset tst-COMMON_wavecif_joint_st \
  --task speech_to_text_wav2vec_cif \
  --path /path/${LANG}/finetune/checkpoint.pt \
  --max-tokens 3200000 \
  --beam 5 \
  --scoring sacrebleu \
  --max-source-positions 3200000 \
  --prefix-size 1

Streaming Translation

Note that the offline models need to be converted before they can be used for the streaming translation task. Our model (En-De) can be downloaded to test streaming translation.

  • Prefix-decision
lagging=5
fixed_pre_decision_ratio=7
simuleval --agent mosst/examples/speech_to_text/simultaneous_translation/agents/fairseq_simul_st_agent_wav2vec.py \
  --source /path/data/tst-COMMON.wavurl \
  --target /path/data/tst-COMMON.${LANG} \
  --data-bin /path/data/en-${LANG}/ \
  --config config_wave.yaml \
  --model-path /path/${LANG}/finetune/checkpoint.pt \
  --output /path/${LANG}/finetune/simuleval/ \
  --waitk-lagging ${lagging} \
  --fixed-pre-decision-ratio ${fixed_pre_decision_ratio} \
  --scores \
  --port 1234
  • Dynamic-decision
simuleval --agent mosst/examples/speech_to_text/simultaneous_translation/agents/fairseq_simul_st_agent_wav2vec_cif.py \
  --source /path/data/tst-COMMON.wavurl \
  --target /path/data/tst-COMMON.${LANG} \
  --data-bin /path/data/en-${LANG}/ \
  --config config_wave.yaml \
  --model-path /path/${LANG}/finetune/checkpoint.pt \
  --output /path/${LANG}/finetune/simuleval/ \
  --scores \
  --max-source-positions 3200000 \
  --port 1234
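
The two decision strategies above differ in when the decoder is allowed to emit a token. The sketch below is a simplified illustration, not the logic of the agents in this repository. It assumes that prefix-decision means a wait-k policy applied to fixed-size chunks of encoder states (using the lagging and fixed_pre_decision_ratio values from the command), and that dynamic-decision relies on an integrate-and-fire rule, as the cif suffix in the task and agent names suggests. All function names are hypothetical.

```python
def waitk_action(num_encoder_states, num_target_tokens,
                 lagging=5, pre_decision_ratio=7):
    """Wait-k over fixed chunks: WRITE once the source leads by `lagging` chunks.

    End-of-source handling is left out of this sketch.
    """
    chunks_read = num_encoder_states // pre_decision_ratio
    if chunks_read - num_target_tokens >= lagging:
        return "WRITE"
    return "READ"


def integrate_and_fire(weights, threshold=1.0):
    """Return the frame indices at which the accumulated weight fires.

    Assumes each weight lies in [0, 1], so at most one boundary
    can fire per frame.
    """
    boundaries = []
    accumulated = 0.0
    for index, weight in enumerate(weights):
        accumulated += weight
        if accumulated >= threshold:
            boundaries.append(index)
            # Carry the remainder over into the next segment.
            accumulated -= threshold
    return boundaries


if __name__ == "__main__":
    print(waitk_action(num_encoder_states=34, num_target_tokens=0))
    print(waitk_action(num_encoder_states=35, num_target_tokens=0))
    print(integrate_and_fire([0.5, 0.5, 0.25, 0.25, 0.5, 0.75]))
```

In the fixed variant the decision points depend only on how many encoder states have arrived, whereas in the integrate-and-fire variant they depend on the predicted weights, so segment lengths can follow the speech content.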

Citation

Please consider citing our paper in your publications if the project helps your research. The BibTeX reference is as follows.

@inproceedings{dong-etal-2022-Learning,
	title = {Learning When to Translate for Streaming Speech},
	author = {Qianqian Dong and Yaoming Zhu and Mingxuan Wang and Lei Li},
	booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics},
	year = {2022},
}
