Code for the EMNLP 2022 main conference paper Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation, which trains an end-to-end speech translation model in a zero-shot fashion (only ASR and MT data are available).
The instructions below take training an En-De model as an example.
The following environments are required.
- Python == 3.7
- torch == 1.8, torchaudio == 0.8.0, cuda==10.1
- Python libraries
pip install pandas sentencepiece editdistance PyYAML tqdm soundfile
- fairseq == 1.0.0a0+741fd13
Download the corresponding fairseq and install it:
cd fairseq
pip install --editable ./
cd ..
NOTE: fairseq == 1.0.0a0 is not a stable release, and our code is not compatible with the latest fairseq version. Please install the specific version we provide; otherwise you will need to modify the model code yourself.
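An optional sanity check that the editable install is the one being picked up (the expected version string comes from the requirement above):
pip show fairseq | grep Version   # expect something like: Version: 1.0.0a0+741fd13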
- Set configuration:
Please set the global variables WMT_DATA_ROOT, SPEECH_DATA_ROOT and SAVE_ROOT, which specify where to put the WMT datasets, the MuST-C dataset and the checkpoints, respectively. Set the global variable target to specify the translation direction.
For example:
export WMT_DATA_ROOT=~/WMT
export SPEECH_DATA_ROOT=~/MUSTC
export SAVE_ROOT=~/checkpoints
export target=de
mkdir -p $WMT_DATA_ROOT $SPEECH_DATA_ROOT $SAVE_ROOT
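The recipe below uses En-De throughout. Since the helper scripts are named en2any, other MuST-C language pairs should only require changing target; this is an assumption we have not verified here. For example:
export target=fr   # assumption: the en2any scripts also cover other MuST-C targets such as En-Fr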
- Download and uncompress the En-De MuST-C dataset to $SPEECH_DATA_ROOT/en-$target (a minimal extraction sketch follows).
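A minimal extraction sketch, assuming the MUSTC_v1.0_en-de.tar.gz archive (shown in the directory tree below) was downloaded into $SPEECH_DATA_ROOT:
cd $SPEECH_DATA_ROOT && tar -xzvf MUSTC_v1.0_en-de.tar.gz   # creates $SPEECH_DATA_ROOT/en-de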
- Download the WMT data to $WMT_DATA_ROOT/orig via:
bash egs/prepare_data/download-wmt.sh --wmt14 --data-dir $WMT_DATA_ROOT --target $target
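Optionally, verify that the raw WMT files ended up where the preparation script expects them:
ls $WMT_DATA_ROOT/orig | head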
- Prepare the MuST-C datasets and produce a joint SentencePiece (spm) dictionary:
bash egs/prepare_data/prepare-mustc-en2any.sh \
--speech-data-root $SPEECH_DATA_ROOT --subword unigram --subword-tokens 10000
After this step, the directory $SPEECH_DATA_ROOT should look like:
├── en-de
│ ├── config_wave.yaml
│ ├── data
│ ├── para_text
│ ├── spm_unigram10000_wave_joint.model
│ ├── spm_unigram10000_wave_joint.txt
│ ├── spm_unigram10000_wave_joint.vocab
│ ├── train_wave_triple.tsv
│ ├── train_wave_en_asr.tsv
│ ├── dev_wave_triple.tsv
│ ├── dev_wave_en_asr.tsv
│ ├── tst-COMMON_wave_triple.tsv
│ ├── tst-COMMON_wave_en_asr.tsv
│ ├── tst-HE_wave_triple.tsv
│ ├── tst-HE_wave_en_asr.tsv
└── MUSTC_v1.0_en-de.tar.gz
Each .tsv file has the following columns:
id audio n_frames src_text tgt_text speaker
The tgt_text column contains the source transcription in XXX_en_asr.tsv and the target translation in XXX_wave_triple.tsv.
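A quick way to eyeball the column layout (a sketch assuming a tab-separated file with a header row, as described above; column 4 is src_text and column 5 is tgt_text):
head -n 2 $SPEECH_DATA_ROOT/en-$target/train_wave_en_asr.tsv | cut -f 1,4,5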
- Prepare the WMT datasets:
bash egs/prepare_data/prepare-wmt-en2any.sh
This step will produce two folders, mt_data and mt_data_expand, in $SPEECH_DATA_ROOT/en-$target. The former contains only the parallel text from the WMT datasets, while the latter additionally contains the in-domain MuST-C text data.
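To confirm the step succeeded, the two text-only folders it produces should now be present:
ls -d $SPEECH_DATA_ROOT/en-$target/mt_data $SPEECH_DATA_ROOT/en-$target/mt_data_expand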
- Pre-training on MT data:
export CUDA_VISIBLE_DEVICES=0,1
bash egs/scripts/train-en2any-MT.sh --save-root $SAVE_ROOT
Our experiments are carried out on 2 V100 GPUs. If you want to use more or fewer GPUs, please adjust update-freq in the training scripts accordingly, keeping the product of GPU count and update-freq constant (update-freq is fairseq's gradient-accumulation factor); a sketch for 4 GPUs follows the commands below.
Also, if you want to leverage the in-domain parallel text from MuST-C, simply add the --expand argument like this:
bash egs/scripts/train-en2any-MT.sh --save-root $SAVE_ROOT --expand
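A sketch for scaling to 4 GPUs (the default update-freq value lives inside the training script and is not reproduced here; the rule of thumb is to keep GPU count × update-freq constant):
export CUDA_VISIBLE_DEVICES=0,1,2,3   # 4 GPUs -> halve update-freq in the script
bash egs/scripts/train-en2any-MT.sh --save-root $SAVE_ROOT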
- Zero-shot fine-tuning:
bash egs/scripts/train-en2any-zero-shot-ST.sh
Note: This step will automatically use the same MT data as the MT pre-training step above.
- Averaging checkpoints and evaluation:
We average the last 5 checkpoints and evaluate the resulting model:
bash egs/scripts/eval-en2any-ST.sh
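If you need to average checkpoints by hand (for example, over a different number of checkpoints), fairseq ships scripts/average_checkpoints.py. The checkpoint directory name below is a placeholder; point it at wherever your ST fine-tuning run saved its checkpoints under $SAVE_ROOT:
CKPT_DIR=$SAVE_ROOT/your-st-run          # placeholder, not the actual directory name
python fairseq/scripts/average_checkpoints.py --inputs $CKPT_DIR \
    --num-epoch-checkpoints 5 --output $CKPT_DIR/avg_last5.pt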
The code is built on the fairseq codebase and therefore carries the MIT license of the original code.
If you have any questions, please feel free to contact us by email at wangchen2020@ia.ac.cn.