Code for the EMNLP 2022 main conference paper Discrete Cross-Modal Alignment Enables Zero-Shot Speech Translation, which trains an end-to-end speech translation model in a zero-shot fashion (only ASR and MT data are available).
The instructions below take training an En-De model as an example.
The following environments are required.
- Python == 3.7
- torch == 1.8, torchaudio == 0.8.0, cuda==10.1
- Python libraries
pip install pandas sentencepiece editdistance PyYAML tqdm soundfile
- fairseq == 1.0.0a0+741fd13
Download the corresponding fairseq and install it:
cd fairseq
pip install --editable ./
cd ..
NOTE: fairseq == 1.0.0a0 is not a stable release, and our code is not compatible with the latest fairseq version. Please install the specific version we provide; otherwise you will need to modify the model code yourself.
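An optional sanity check that the editable install is the one being picked up (the expected version string comes from the requirement above):
pip show fairseq | grep Version   # expect something like: Version: 1.0.0a0+741fd13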
- Set configuration:
Please set the global variables WMT_DATA_ROOT, SPEECH_DATA_ROOT and SAVE_ROOT, which specify where to put the WMT datasets, the MuST-C dataset and the checkpoints, respectively. Set the global variable target to specify the translation direction.
For example:
export WMT_DATA_ROOT=~/WMT
export SPEECH_DATA_ROOT=~/MUSTC
export SAVE_ROOT=~/checkpoints
export target=de
mkdir -p $WMT_DATA_ROOT $SPEECH_DATA_ROOT $SAVE_ROOT
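The recipe below uses En-De throughout. Since the helper scripts are named en2any, other MuST-C language pairs should only require changing target; this is an assumption we have not verified here. For example:
export target=fr   # assumption: the en2any scripts also cover other MuST-C targets such as En-Fr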
- Download and uncompress the En-De MuST-C dataset to $SPEECH_DATA_ROOT/en-$target (a minimal extraction sketch follows).
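A minimal extraction sketch, assuming the MUSTC_v1.0_en-de.tar.gz archive (shown in the directory tree below) was downloaded into $SPEECH_DATA_ROOT:
cd $SPEECH_DATA_ROOT && tar -xzvf MUSTC_v1.0_en-de.tar.gz   # creates $SPEECH_DATA_ROOT/en-de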
- Download the WMT data to $WMT_DATA_ROOT/orig via:
bash egs/prepare_data/download-wmt.sh --wmt14 --data-dir $WMT_DATA_ROOT --target $target
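Optionally, verify that the raw WMT files ended up where the preparation script expects them:
ls $WMT_DATA_ROOT/orig | head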
- Prepare the MuST-C datasets and produce a joint SentencePiece (spm) dictionary:
bash egs/prepare_data/prepare-mustc-en2any.sh \
--speech-data-root $SPEECH_DATA_ROOT --subword unigram --subword-tokens 10000
After this step, the directory $SPEECH_DATA_ROOT should look like:
├── en-de
│ ├── config_wave.yaml
│ ├── data
│ ├── para_text
│ ├── spm_unigram10000_wave_joint.model
│ ├── spm_unigram10000_wave_joint.txt
│ ├── spm_unigram10000_wave_joint.vocab
│ ├── train_wave_triple.tsv
│ ├── train_wave_en_asr.tsv
│ ├── dev_wave_triple.tsv
│ ├── dev_wave_en_asr.tsv
│ ├── tst-COMMON_wave_triple.tsv
│ ├── tst-COMMON_wave_en_asr.tsv
│ ├── tst-HE_wave_triple.tsv
│ ├── tst-HE_wave_en_asr.tsv
└── MUSTC_v1.0_en-de.tar.gz
Each .tsv file has the following columns:
id audio n_frames src_text tgt_text speaker
The tgt_text column contains the source transcription in XXX_en_asr.tsv and the target translation in XXX_wave_triple.tsv.
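A quick way to eyeball the column layout (a sketch assuming a tab-separated file with a header row, as described above; column 4 is src_text and column 5 is tgt_text):
head -n 2 $SPEECH_DATA_ROOT/en-$target/train_wave_en_asr.tsv | cut -f 1,4,5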
- Prepare the WMT datasets:
bash egs/prepare_data/prepare-wmt-en2any.sh
This step will produce two folders, mt_data and mt_data_expand, in $SPEECH_DATA_ROOT/en-$target. The former contains only the parallel text from the WMT datasets, while the latter additionally contains the in-domain MuST-C text data.
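To confirm the step succeeded, the two text-only folders it produces should now be present:
ls -d $SPEECH_DATA_ROOT/en-$target/mt_data $SPEECH_DATA_ROOT/en-$target/mt_data_expand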
- Pre-training on MT data:
export CUDA_VISIBLE_DEVICES=0,1
bash egs/scripts/train-en2any-MT.sh --save-root $SAVE_ROOT
Our experiments are carried out on 2 V100 GPUs. If you want to use more or fewer GPUs, please adjust update-freq in the training scripts accordingly, keeping the product of GPU count and update-freq constant (update-freq is fairseq's gradient-accumulation factor); a sketch for 4 GPUs follows the commands below.
Also, if you want to leverage the in-domain parallel text from MuST-C, simply add the --expand argument like this:
bash egs/scripts/train-en2any-MT.sh --save-root $SAVE_ROOT --expand
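A sketch for scaling to 4 GPUs (the default update-freq value lives inside the training script and is not reproduced here; the rule of thumb is to keep GPU count × update-freq constant):
export CUDA_VISIBLE_DEVICES=0,1,2,3   # 4 GPUs -> halve update-freq in the script
bash egs/scripts/train-en2any-MT.sh --save-root $SAVE_ROOT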
- Zero-shot fine-tuning:
bash egs/scripts/train-en2any-zero-shot-ST.sh
Note: This step will automatically use the same MT data as the MT pre-training step above.
- Averaging checkpoints and evaluation:
We average the last 5 checkpoints and evaluate the resulting model:
bash egs/scripts/eval-en2any-ST.sh
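If you need to average checkpoints by hand (for example, over a different number of checkpoints), fairseq ships scripts/average_checkpoints.py. The checkpoint directory name below is a placeholder; point it at wherever your ST fine-tuning run saved its checkpoints under $SAVE_ROOT:
CKPT_DIR=$SAVE_ROOT/your-st-run          # placeholder, not the actual directory name
python fairseq/scripts/average_checkpoints.py --inputs $CKPT_DIR \
    --num-epoch-checkpoints 5 --output $CKPT_DIR/avg_last5.pt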
The code is built on the fairseq codebase and therefore carries the MIT license of the original code.
If you have any questions, please feel free to contact us by email at wangchen2020@ia.ac.cn.