diff --git a/.github/workflows/docker.yml b/.github/workflows/docker.yml new file mode 100644 index 00000000000..049bdabafa1 --- /dev/null +++ b/.github/workflows/docker.yml @@ -0,0 +1,48 @@ +name: docker-builder + +on: + pull_request: + types: [closed] + branches: + - master + paths: + - 'tools/**' + - setup.py + +jobs: + docker: + runs-on: ubuntu-latest + if: github.event.pull_request.merged == true + steps: + - uses: actions/checkout@v2 + + - name: Set up QEMU + uses: docker/setup-qemu-action@v1 + + - name: Set up Docker Buildx + uses: docker/setup-buildx-action@v1 + + - name: Login to DockerHub + uses: docker/login-action@v1 + with: + username: ${{ secrets.DOCKERHUB_USERNAME }} + password: ${{ secrets.DOCKERHUB_TOKEN }} + + - name: Build and push CPU container + run: | + cd docker + docker build --build-arg FROM_TAG=runtime-latest \ + -f prebuilt/devel.dockerfile \ + --target devel \ + -t espnet/espnet:cpu-latest . + docker push espnet/espnet:cpu-latest + + - name: Build and push GPU container + run: | + cd docker + docker build --build-arg FROM_TAG=cuda-latest \ + --build-arg CUDA_VER=11.1 \ + -f prebuilt/devel.dockerfile \ + --target devel \ + -t espnet/espnet:gpu-latest . + docker push espnet/espnet:gpu-latest diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 979e7397012..9036a09b66d 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -151,6 +151,11 @@ we recommend using small model parameters and avoiding dynamic imports, file acc more running time, you can annotate your test with `@pytest.mark.execution_timeout(sec)`. - For test initialization (parameters, modules, etc), you can use pytest fixtures. Refer to [pytest fixtures](https://docs.pytest.org/en/latest/fixture.html#using-fixtures-from-classes-modules-or-projects) for more information. +In addition, please follow the [PEP 8 convention](https://peps.python.org/pep-0008/) for the coding style and [Google's convention for docstrings](https://google.github.io/styleguide/pyguide.html#383-functions-and-methods). +Below are some specific points that require particular care: +- [import ordering](https://peps.python.org/pep-0008/#imports) +- Avoid writing Python 2-style code. For example, `super().__init__()` is preferred over `super(CLASS_NAME, self).__init__()`. + ### 4.2 Bash scripts diff --git a/README.md b/README.md index 082e5450f78..678c52103f5 100644 --- a/README.md +++ b/README.md @@ -77,12 +77,12 @@ ESPnet uses [pytorch](http://pytorch.org/) as a deep learning engine and also fo - Self-supervised learning representations as features, using upstream models in [S3PRL](https://github.com/s3prl/s3prl) in frontend. - Set `frontend` to be `s3prl` - Select any upstream model by setting the `frontend_conf` to the corresponding name. +- Transfer Learning: + - Easy usage and transfer from models previously trained by your group, or from the [ESPnet Hugging Face repository](https://huggingface.co/espnet). + - [Documentation](https://github.com/espnet/espnet/tree/master/egs2/mini_an4/asr1/transfer_learning.md) and [toy example runnable on colab](https://github.com/espnet/notebook/blob/master/espnet2_asr_transfer_learning_demo.ipynb). - Streaming Transformer/Conformer ASR with blockwise synchronous beam search.
- Restricted Self-Attention based on [Longformer](https://arxiv.org/abs/2004.05150) as an encoder for long sequences -### SUM: Speech Summarization -- End to End Speech Summarization Recipe for Instructional Videos using Restricted Self-Attention [[Sharma et al., 2022]](https://arxiv.org/abs/2110.06263) - Demonstration - Real-time ASR demo with ESPnet2 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_asr_realtime_demo.ipynb) - [Gradio](https://github.com/gradio-app/gradio) Web Demo on [Huggingface Spaces](https://huggingface.co/docs/hub/spaces). Check out the [Web Demo](https://huggingface.co/spaces/akhaliq/espnet2_asr) @@ -133,7 +133,7 @@ To train the neural vocoder, please check the following repositories: - Multi-speaker speech separation - Unified encoder-separator-decoder structure for time-domain and frequency-domain models - Encoder/Decoder: STFT/iSTFT, Convolution/Transposed-Convolution - - Separators: BLSTM, Transformer, Conformer, [TasNet](https://arxiv.org/abs/1809.07454), [DPRNN](https://arxiv.org/abs/1910.06379), [DC-CRN](https://web.cse.ohio-state.edu/~wang.77/papers/TZW.taslp21.pdf), [DCCRN](https://arxiv.org/abs/2008.00264), Neural Beamformers, etc. + - Separators: BLSTM, Transformer, Conformer, [TasNet](https://arxiv.org/abs/1809.07454), [DPRNN](https://arxiv.org/abs/1910.06379), [SkiM](https://arxiv.org/abs/2201.10800), [SVoice](https://arxiv.org/abs/2011.02329), [DC-CRN](https://web.cse.ohio-state.edu/~wang.77/papers/TZW.taslp21.pdf), [DCCRN](https://arxiv.org/abs/2008.00264), [Deep Clustering](https://ieeexplore.ieee.org/document/7471631), [Deep Attractor Network](https://pubmed.ncbi.nlm.nih.gov/29430212/), [FaSNet](https://arxiv.org/abs/1909.13387), [iFaSNet](https://arxiv.org/abs/1910.14104), Neural Beamformers, etc. - Flexible ASR integration: working as an individual task or as the ASR frontend - Easy to import pretrained models from [Asteroid](https://github.com/asteroid-team/asteroid) - Both the pre-trained models from Asteroid and the specific configuration are supported. @@ -141,7 +141,6 @@ To train the neural vocoder, please check the following repositories: Demonstration - Interactive SE demo with ESPnet2 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1fjRJCh96SoYLZPRxsjF9VDv4Q2VoIckI?usp=sharing) - ### ST: Speech Translation & MT: Machine Translation - **State-of-the-art performance** in several ST benchmarks (comparable/superior to cascaded ASR and MT) - Transformer based end-to-end ST (new!) @@ -152,9 +151,34 @@ Demonstration - End-to-end VC based on cascaded ASR+TTS (Baseline system for Voice Conversion Challenge 2020!) ### SLU: Speech Language Understanding -- Predicting intent by directly classifying it as one of intent or decoding by character -- Transformer & RNN based encoder-decoder model -- Establish SOTA results with spectral augmentation (Performs better than reported results of pretrained model on Fluent Speech Command Dataset) +- Architecture + - Transformer based Encoder + - Conformer based Encoder + - RNN based Decoder + - Transformer based Decoder +- Support Multitasking with ASR + - Predict both intent and ASR transcript +- Support Multitasking with NLU + - Deliberation encoder based 2 pass model +- Support using pretrained ASR models + - Hubert + - Wav2vec2 + - VQ-APC + - TERA and more ... +- Support using pretrained NLP models + - BERT + - MPNet And more... 
+- Various language support + - En / Jp / Zn / Nl / and more... +- Supports using context from previous utterances +- Supports using other tasks, such as SE, in a pipeline manner +Demonstration +- Performing noisy spoken language understanding using a speech enhancement model followed by a spoken language understanding model. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/14nCrJ05vJcQX0cJuXjbMVFWUHJ3Wfb6N?usp=sharing) +- Integrated into [Huggingface Spaces](https://huggingface.co/spaces) with [Gradio](https://github.com/gradio-app/gradio). See the SLU demo on multiple languages: [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Siddhant/ESPnet2-SLU) + + +### SUM: Speech Summarization +- End to End Speech Summarization Recipe for Instructional Videos using Restricted Self-Attention [[Sharma et al., 2022]](https://arxiv.org/abs/2110.06263) ### DNN Framework - Flexible network architecture thanks to chainer and pytorch @@ -532,11 +556,33 @@ You can download converted samples of the cascade ASR+TTS baseline system [here] ### SLU results -<details><summary>ESPnet2</summary><div> +<details><summary>expand</summary><div>
+ + +We list the performance on various SLU tasks and dataset using the metric reported in the original dataset paper + +| Task | Dataset | Metric | Result | Pretrained Model | +| ----------------------------------------------------------------- | :-------------: | :-------------: | :-------------: | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | +| Intent Classification | SLURP | Acc | 86.3 | [link](https://github.com/espnet/espnet/tree/master/egs2/slurp/asr1/README.md) | +| Intent Classification | FSC | Acc | 99.6 | [link](https://github.com/espnet/espnet/tree/master/egs2/fsc/asr1/README.md) | +| Intent Classification | FSC Unseen Speaker Set | Acc | 98.6 | [link](https://github.com/espnet/espnet/tree/master/egs2/fsc_unseen/asr1/README.md) | +| Intent Classification | FSC Unseen Utterance Set | Acc | 86.4 | [link](https://github.com/espnet/espnet/tree/master/egs2/fsc_unseen/asr1/README.md) | +| Intent Classification | FSC Challenge Speaker Set | Acc | 97.5 | [link](https://github.com/espnet/espnet/tree/master/egs2/fsc_challenge/asr1/README.md) | +| Intent Classification | FSC Challenge Utterance Set | Acc | 78.5 | [link](https://github.com/espnet/espnet/tree/master/egs2/fsc_challenge/asr1/README.md) | +| Intent Classification | SNIPS | F1 | 91.7 | [link](https://github.com/espnet/espnet/tree/master/egs2/snips/asr1/README.md) | +| Intent Classification | Grabo (Nl) | Acc | 97.2 | [link](https://github.com/espnet/espnet/tree/master/egs2/grabo/asr1/README.md) | +| Intent Classification | CAT SLU MAP (Zn) | Acc | 78.9 | [link](https://github.com/espnet/espnet/tree/master/egs2/catslu/asr1/README.md) | +| Intent Classification | Google Speech Commands | Acc | 98.4 | [link](https://github.com/espnet/espnet/tree/master/egs2/speechcommands/asr1/README.md) | +| Slot Filling | SLURP | SLU-F1 | 71.9 | [link](https://github.com/espnet/espnet/tree/master/egs2/slurp_entity/asr1/README.md) | +| Dialogue Act Classification | Switchboard | Acc | 67.5 | [link](https://github.com/espnet/espnet/tree/master/egs2/swbd_da/asr1/README.md) | +| Dialogue Act Classification | Jdcinal (Jp) | Acc | 67.4 | [link](https://github.com/espnet/espnet/tree/master/egs2/jdcinal/asr1/README.md) | +| Emotion Recognition | IEMOCAP | Acc | 69.4 | [link](https://github.com/espnet/espnet/tree/master/egs2/iemocap/asr1/README.md) | +| Emotion Recognition | swbd_sentiment | Macro F1 | 61.4 | [link](https://github.com/espnet/espnet/tree/master/egs2/swbd_sentiment/asr1/README.md) | +| Emotion Recognition | slue_voxceleb | Macro F1 | 44.0 | [link](https://github.com/espnet/espnet/tree/master/egs2/slue-voxceleb/asr1/README.md) | -- Transformer based SLU for Fluent Speech Command Dataset + +If you want to check the results of the other recipes, please check `egs2//asr1/RESULTS.md`. -In SLU, The objective is to infer the meaning or intent of spoken utterance. The [Fluent Speech Command Dataset](https://fluent.ai/fluent-speech-commands-a-dataset-for-spoken-language-understanding-research/) describes an intent as combination of 3 slot values: action, object and location. You can see baseline results on this dataset [here](https://github.com/espnet/espnet/blob/master/egs2/fsc/asr1/RESULTS.md)
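The per-task links in the table above point at standard `egs2` recipes, so each number can in principle be reproduced with the usual recipe workflow. The following is only a rough sketch under that assumption: the SLURP recipe path comes from the table, while the stage range follows the generic egs2 ASR template and should be checked against the recipe's own `run.sh` and README before running.

```bash
# Rough sketch (not part of this patch): reproduce the SLURP intent-classification
# entry from the table above with the standard egs2 recipe layout.
# Stage numbers follow the generic egs2 ASR template; consult
# egs2/slurp/asr1/run.sh for the recipe-specific options.
cd egs2/slurp/asr1
./run.sh --ngpu 1 --stage 1 --stop-stage 13
```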
@@ -689,6 +735,8 @@ See the module documentation for more information. It is recommended to use models with RNN-based encoders (such as BLSTMP) for aligning large audio files, rather than Transformer models, which have high memory consumption on longer audio data. The sample rate of the audio must be consistent with that of the data used in training; adjust with `sox` if needed. + +Also, this tool can provide token-level segmentation information if we prepare a list of tokens instead of a list of utterances in the `text` file. See the discussion in https://github.com/espnet/espnet/issues/4278#issuecomment-1100756463.
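To make the utterance-level vs. token-level distinction concrete, here is a hypothetical sketch of the two `text` file variants. It assumes the usual Kaldi-style `<segment-id> <content>` layout used by the alignment tool; the IDs and tokens below are made up for illustration and are not taken from the linked issue.

```bash
# Utterance-level alignment (the default described above):
# one transcription per segment id in the `text` file.
cat > text <<'EOF'
utt_0001 THE SALE OF THE HOTELS
utt_0002 IS PART OF HOLIDAYS STRATEGY
EOF

# Token-level alignment (the variant discussed in the linked issue):
# give every token its own entry instead, so the aligner returns one segment
# per token. The id scheme here is only an example; use whichever variant
# fits your use case (both write to the same `text` file).
cat > text <<'EOF'
utt_0001_0001 THE
utt_0001_0002 SALE
utt_0001_0003 OF
utt_0001_0004 THE
utt_0001_0005 HOTELS
EOF
```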
diff --git a/ci/test_integration_espnet2.sh b/ci/test_integration_espnet2.sh index 78086272af7..58951c04011 100755 --- a/ci/test_integration_espnet2.sh +++ b/ci/test_integration_espnet2.sh @@ -100,6 +100,50 @@ if python3 -c "import fairseq" &> /dev/null; then cd "${cwd}" fi +# [ESPnet2] test enh_asr1 recipe +if python -c 'import torch as t; from distutils.version import LooseVersion as L; assert L(t.__version__) >= L("1.2.0")' &> /dev/null; then + cd ./egs2/mini_an4/enh_asr1 + echo "==== [ESPnet2] ENH_ASR ===" + ./run.sh --ngpu 0 --stage 0 --stop-stage 15 --skip-upload_hf false --feats-type "raw" --spk-num 1 --enh_asr_args "--max_epoch=1 --enh_separator_conf num_spk=1" --python "${python}" + # Remove generated files in order to reduce the disk usage + rm -rf exp dump data + cd "${cwd}" +fi + +# [ESPnet2] test st recipe +cd ./egs2/mini_an4/st1 +echo "==== [ESPnet2] ST ===" +./run.sh --stage 1 --stop-stage 1 +feats_types="raw fbank_pitch" +token_types="bpe char" +for t in ${feats_types}; do + ./run.sh --stage 2 --stop-stage 4 --feats-type "${t}" --python "${python}" +done +for t in ${token_types}; do + ./run.sh --stage 5 --stop-stage 5 --tgt_token_type "${t}" --src_token_type "${t}" --python "${python}" +done +for t in ${feats_types}; do + for t2 in ${token_types}; do + echo "==== feats_type=${t}, token_types=${t2} ===" + ./run.sh --ngpu 0 --stage 6 --stop-stage 13 --skip-upload false --feats-type "${t}" --tgt_token_type "${t2}" --src_token_type "${t2}" \ + --st-args "--max_epoch=1" --lm-args "--max_epoch=1" --inference_args "--beam_size 5" --python "${python}" + done +done +echo "==== feats_type=raw, token_types=bpe, model_conf.extract_feats_in_collect_stats=False, normalize=utt_mvn ===" +./run.sh --ngpu 0 --stage 10 --stop-stage 13 --skip-upload false --feats-type "raw" --tgt_token_type "bpe" --src_token_type "bpe" \ + --feats_normalize "utterance_mvn" --lm-args "--max_epoch=1" --inference_args "--beam_size 5" --python "${python}" \ + --st-args "--model_conf extract_feats_in_collect_stats=false --max_epoch=1" + +echo "==== use_streaming, feats_type=raw, token_types=bpe, model_conf.extract_feats_in_collect_stats=False, normalize=utt_mvn ===" +./run.sh --use_streaming true --ngpu 0 --stage 6 --stop-stage 13 --skip-upload false --feats-type "raw" --tgt_token_type "bpe" --src_token_type "bpe" \ + --feats_normalize "utterance_mvn" --lm-args "--max_epoch=1" --inference_args "--beam_size 5" --python "${python}" \ + --st-args "--model_conf extract_feats_in_collect_stats=false --max_epoch=1 --encoder=contextual_block_transformer --decoder=transformer + --encoder_conf block_size=40 --encoder_conf hop_size=16 --encoder_conf look_ahead=16" + +# Remove generated files in order to reduce the disk usage +rm -rf exp dump data +cd "${cwd}" + # [ESPnet2] Validate configuration files echo "" > dummy_token_list echo "==== [ESPnet2] Validation configuration files ===" @@ -124,6 +168,9 @@ if python3 -c 'import torch as t; from distutils.version import LooseVersion as for f in egs2/*/ssl1/conf/train*.yaml; do ${python} -m espnet2.bin.hubert_train --config "${f}" --iterator_type none --normalize none --dry_run true --output_dir out --token_list dummy_token_list done + for f in egs2/*/enh_asr1/conf/train_enh_asr*.yaml; do + ${python} -m espnet2.bin.enh_s2t_train --config "${f}" --iterator_type none --dry_run true --output_dir out --token_list dummy_token_list + done fi # These files must be same each other. 
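For local debugging of the new CI coverage added above, the same checks can be run by hand. A minimal sketch, assuming an activated ESPnet environment and the repository root as working directory; it simply mirrors the commands added to `ci/test_integration_espnet2.sh`.

```bash
# Run the new mini_an4 ENH_ASR integration test outside of CI.
cd egs2/mini_an4/enh_asr1
./run.sh --ngpu 0 --stage 0 --stop-stage 15 --skip-upload_hf false \
    --feats-type "raw" --spk-num 1 \
    --enh_asr_args "--max_epoch=1 --enh_separator_conf num_spk=1"
cd -

# Dry-run validation of the new enh_asr1 training configs.
echo "" > dummy_token_list
for f in egs2/*/enh_asr1/conf/train_enh_asr*.yaml; do
    python -m espnet2.bin.enh_s2t_train --config "${f}" --iterator_type none \
        --dry_run true --output_dir out --token_list dummy_token_list
done
```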
diff --git a/egs/README.md b/egs/README.md index 61951b84d47..78fa57049ae 100755 --- a/egs/README.md +++ b/egs/README.md @@ -8,6 +8,7 @@ See: https://espnet.github.io/espnet/tutorial.html | Directory name | Corpus name | Task | Language | URL | Note | | ----------------------- | ------------------------------------------------------------ | ------------------------------------------ | -------------- | ------------------------------------------------------------ | ----------------------------- | |||| +| aesrc2020 | Accented English Speech Recognition Challenge 2020 | ASR | EN | https://arxiv.org/abs/2102.10233 | | | aidatatang_200zh | Aidatatang_200zh A free Chinese Mandarin speech corpus | ASR | ZH | http://www.openslr.org/62/ | | | aishell | AISHELL-ASR0009-OS1 Open Source Mandarin Speech Corpus | ASR | ZH | http://www.aishelltech.com/kysjcp | | | aishell2 | AISHELL-2 Open Source Mandarin Speech Corpus | ASR | ZH | http://www.aishelltech.com/aishell_2 | @@ -49,7 +50,8 @@ See: https://espnet.github.io/espnet/tutorial.html | librispeech | LibriSpeech ASR corpus | ASR | EN | http://www.openslr.org/12 | | | libritts | LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech | TTS | EN | http://www.openslr.org/60/ | | | ljspeech | The LJ Speech Dataset | TTS | EN | https://keithito.com/LJ-Speech-Dataset/ | | -| lrs | The Lip Reading Sentences Dataset | ASR/AVSR | EN | https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html | | +| lrs2 | The Lip Reading Sentences 2 Dataset | ASR | ENG | https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html | | +| lrs | The Lip Reading Sentences 2 and 3 Dataset | AVSR | ENG | https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html | | | m_ailabs | The M-AILABS Speech Dataset | TTS | ~5 languages | https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/ | | mucs_2021 | MUCS 2021: MUltilingual and Code-Switching ASR Challenges for Low Resource Indian Languages | ASR/Code Switching | HI, MR, OR, TA, TE, GU, HI-EN, BN-EN | https://navana-tech.github.io/MUCS2021/data.html | | | mtedx | Multilingual TEDx | ASR/Machine Translation/Speech Translation | 13 Language pairs | http://www.openslr.org/100/ | diff --git a/egs/aesrc2020/asr1/RESULTS.md b/egs/aesrc2020/asr1/RESULTS.md new file mode 100644 index 00000000000..e69de29bb2d diff --git a/egs/lrs/asr1/cmd.sh b/egs/aesrc2020/asr1/cmd.sh similarity index 100% rename from egs/lrs/asr1/cmd.sh rename to egs/aesrc2020/asr1/cmd.sh diff --git a/egs/aesrc2020/asr1/conf/decode.yaml b/egs/aesrc2020/asr1/conf/decode.yaml new file mode 120000 index 00000000000..1f358f011d4 --- /dev/null +++ b/egs/aesrc2020/asr1/conf/decode.yaml @@ -0,0 +1 @@ +tuning/decode_pytorch_transformer.yaml \ No newline at end of file diff --git a/egs/lrs/asr1/conf/fbank.conf b/egs/aesrc2020/asr1/conf/fbank.conf similarity index 100% rename from egs/lrs/asr1/conf/fbank.conf rename to egs/aesrc2020/asr1/conf/fbank.conf diff --git a/egs/lrs/asr1/conf/gpu.conf b/egs/aesrc2020/asr1/conf/gpu.conf similarity index 100% rename from egs/lrs/asr1/conf/gpu.conf rename to egs/aesrc2020/asr1/conf/gpu.conf diff --git a/egs/aesrc2020/asr1/conf/lm.yaml b/egs/aesrc2020/asr1/conf/lm.yaml new file mode 100644 index 00000000000..ea738c16807 --- /dev/null +++ b/egs/aesrc2020/asr1/conf/lm.yaml @@ -0,0 +1,8 @@ +# rnnlm related +layer: 2 +unit: 650 +opt: sgd # or adam +batchsize: 64 # batch size in LM training +epoch: 20 # if the data size is large, we can reduce this +patience: 3 +maxlen: 100 # if 
sentence length > lm_maxlen, lm_batchsize is automatically reduced diff --git a/egs/lrs/asr1/conf/pitch.conf b/egs/aesrc2020/asr1/conf/pitch.conf similarity index 100% rename from egs/lrs/asr1/conf/pitch.conf rename to egs/aesrc2020/asr1/conf/pitch.conf diff --git a/egs/lrs/asr1/conf/queue.conf b/egs/aesrc2020/asr1/conf/queue.conf similarity index 100% rename from egs/lrs/asr1/conf/queue.conf rename to egs/aesrc2020/asr1/conf/queue.conf diff --git a/egs/lrs/asr1/conf/slurm.conf b/egs/aesrc2020/asr1/conf/slurm.conf similarity index 100% rename from egs/lrs/asr1/conf/slurm.conf rename to egs/aesrc2020/asr1/conf/slurm.conf diff --git a/egs/aesrc2020/asr1/conf/specaug.yaml b/egs/aesrc2020/asr1/conf/specaug.yaml new file mode 100644 index 00000000000..3351630d2f3 --- /dev/null +++ b/egs/aesrc2020/asr1/conf/specaug.yaml @@ -0,0 +1,16 @@ +process: + # these three processes are a.k.a. SpecAugument + - type: "time_warp" + max_time_warp: 5 + inplace: true + mode: "PIL" + - type: "freq_mask" + F: 30 + n_mask: 2 + inplace: true + replace_with_zero: false + - type: "time_mask" + T: 40 + n_mask: 2 + inplace: true + replace_with_zero: false \ No newline at end of file diff --git a/egs/aesrc2020/asr1/conf/train.yaml b/egs/aesrc2020/asr1/conf/train.yaml new file mode 120000 index 00000000000..5e11a9c3db2 --- /dev/null +++ b/egs/aesrc2020/asr1/conf/train.yaml @@ -0,0 +1 @@ +tuning/train_pytorch_conformer_kernel15.yaml \ No newline at end of file diff --git a/egs/aesrc2020/asr1/conf/tuning/decode_pytorch_transformer.yaml b/egs/aesrc2020/asr1/conf/tuning/decode_pytorch_transformer.yaml new file mode 100644 index 00000000000..2ece5128686 --- /dev/null +++ b/egs/aesrc2020/asr1/conf/tuning/decode_pytorch_transformer.yaml @@ -0,0 +1,8 @@ +batchsize: 0 +beam-size: 10 +penalty: 0.0 +maxlenratio: 0.0 +minlenratio: 0.0 +ctc-weight: 0.5 +lm-weight: 0.3 +ngram-weight: 0.3 diff --git a/egs/aesrc2020/asr1/conf/tuning/decode_rnn.yaml b/egs/aesrc2020/asr1/conf/tuning/decode_rnn.yaml new file mode 100644 index 00000000000..739044dce1a --- /dev/null +++ b/egs/aesrc2020/asr1/conf/tuning/decode_rnn.yaml @@ -0,0 +1,6 @@ +beam-size: 20 +penalty: 0.0 +maxlenratio: 0.0 +minlenratio: 0.0 +ctc-weight: 0.6 +lm-weight: 0.3 diff --git a/egs/aesrc2020/asr1/conf/tuning/train_pytorch_conformer_kernel15.yaml b/egs/aesrc2020/asr1/conf/tuning/train_pytorch_conformer_kernel15.yaml new file mode 100644 index 00000000000..8769ba67139 --- /dev/null +++ b/egs/aesrc2020/asr1/conf/tuning/train_pytorch_conformer_kernel15.yaml @@ -0,0 +1,47 @@ +# network architecture +# encoder related +elayers: 12 +eunits: 2048 +# decoder related +dlayers: 6 +dunits: 2048 +# attention related +adim: 256 +aheads: 4 + +# hybrid CTC/attention +mtlalpha: 0.3 + +# label smoothing +lsm-weight: 0.1 + +# minibatch related +batch-size: 32 +maxlen-in: 512 # if input length > maxlen-in, batchsize is automatically reduced +maxlen-out: 150 # if output length > maxlen-out, batchsize is automatically reduced + +# optimization related +sortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs +opt: noam +accum-grad: 2 +grad-clip: 5 +patience: 0 +epochs: 50 +dropout-rate: 0.1 + +# transformer specific setting +backend: pytorch +model-module: "espnet.nets.pytorch_backend.e2e_asr_conformer:E2E" +transformer-input-layer: conv2d # encoder architecture type +transformer-lr: 1.0 +transformer-warmup-steps: 25000 +transformer-attn-dropout-rate: 0.0 +transformer-length-normalized-loss: false +transformer-init: pytorch + 
+# conformer specific setting +transformer-encoder-pos-enc-layer-type: rel_pos +transformer-encoder-selfattn-layer-type: rel_selfattn +macaron-style: true +use-cnn-module: true +cnn-module-kernel: 15 diff --git a/egs/aesrc2020/asr1/conf/tuning/train_pytorch_conformer_kernel31.yaml b/egs/aesrc2020/asr1/conf/tuning/train_pytorch_conformer_kernel31.yaml new file mode 100644 index 00000000000..50d44abb5ab --- /dev/null +++ b/egs/aesrc2020/asr1/conf/tuning/train_pytorch_conformer_kernel31.yaml @@ -0,0 +1,47 @@ +# network architecture +# encoder related +elayers: 12 +eunits: 2048 +# decoder related +dlayers: 6 +dunits: 2048 +# attention related +adim: 256 +aheads: 4 + +# hybrid CTC/attention +mtlalpha: 0.3 + +# label smoothing +lsm-weight: 0.1 + +# minibatch related +batch-size: 32 +maxlen-in: 512 # if input length > maxlen-in, batchsize is automatically reduced +maxlen-out: 150 # if output length > maxlen-out, batchsize is automatically reduced + +# optimization related +sortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs +opt: noam +accum-grad: 2 +grad-clip: 5 +patience: 0 +epochs: 50 +dropout-rate: 0.1 + +# transformer specific setting +backend: pytorch +model-module: "espnet.nets.pytorch_backend.e2e_asr_conformer:E2E" +transformer-input-layer: conv2d # encoder architecture type +transformer-lr: 1.0 +transformer-warmup-steps: 25000 +transformer-attn-dropout-rate: 0.0 +transformer-length-normalized-loss: false +transformer-init: pytorch + +# conformer specific setting +transformer-encoder-pos-enc-layer-type: rel_pos +transformer-encoder-selfattn-layer-type: rel_selfattn +macaron-style: true +use-cnn-module: true +cnn-module-kernel: 31 diff --git a/egs/aesrc2020/asr1/conf/tuning/train_pytorch_transformer.yaml b/egs/aesrc2020/asr1/conf/tuning/train_pytorch_transformer.yaml new file mode 100644 index 00000000000..4dd0b4e8247 --- /dev/null +++ b/egs/aesrc2020/asr1/conf/tuning/train_pytorch_transformer.yaml @@ -0,0 +1,40 @@ +# network architecture +# encoder related +elayers: 12 +eunits: 2048 +# decoder related +dlayers: 6 +dunits: 2048 +# attention related +adim: 256 +aheads: 4 + +# hybrid CTC/attention +mtlalpha: 0.3 + +# label smoothing +lsm-weight: 0.1 + +# minibatch related +batch-size: 32 +maxlen-in: 512 # if input length > maxlen-in, batchsize is automatically reduced +maxlen-out: 150 # if output length > maxlen-out, batchsize is automatically reduced + +# optimization related +sortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs +opt: noam +accum-grad: 2 +grad-clip: 5 +patience: 0 +epochs: 50 +dropout-rate: 0.1 + +# transformer specific setting +backend: pytorch +model-module: "espnet.nets.pytorch_backend.e2e_asr_transformer:E2E" +transformer-input-layer: conv2d # encoder architecture type +transformer-lr: 2.0 +transformer-warmup-steps: 25000 +transformer-attn-dropout-rate: 0.0 +transformer-length-normalized-loss: false +transformer-init: pytorch diff --git a/egs/aesrc2020/asr1/conf/tuning/train_rnn.yaml b/egs/aesrc2020/asr1/conf/tuning/train_rnn.yaml new file mode 100644 index 00000000000..ca5e99fa320 --- /dev/null +++ b/egs/aesrc2020/asr1/conf/tuning/train_rnn.yaml @@ -0,0 +1,31 @@ +# network architecture +# encoder related +etype: vggblstm # encoder architecture type +elayers: 3 +eunits: 1024 +eprojs: 1024 +subsample: "1_2_2_1_1" # skip every n frame from input to nth layers +# decoder related +dlayers: 2 +dunits: 1024 +# attention 
related +atype: location +adim: 1024 +aconv-chans: 10 +aconv-filts: 100 + +# hybrid CTC/attention +mtlalpha: 0.5 + +# minibatch related +batch-size: 30 +maxlen-in: 800 # if input length > maxlen_in, batchsize is automatically reduced +maxlen-out: 150 # if output length > maxlen_out, batchsize is automatically reduced + +# optimization related +opt: adadelta +epochs: 10 +patience: 0 + +# scheduled sampling option +sampling-probability: 0.0 diff --git a/egs/aesrc2020/asr1/local/create_subsets.sh b/egs/aesrc2020/asr1/local/create_subsets.sh new file mode 100755 index 00000000000..f2667260c7b --- /dev/null +++ b/egs/aesrc2020/asr1/local/create_subsets.sh @@ -0,0 +1,24 @@ +#!/bin/bash + +. ./path.sh || exit 1; +. ./cmd.sh || exit 1; + +data=$1 # data transformed into kaldi format + + # divide development set for cross validation + if [ -d ${data} ];then + for i in US UK IND CHN JPN PT RU KR CA ES;do + ./utils/subset_data_dir.sh --spk-list local/files/cvlist/${i}_cv_spk $data/data_all $data/cv/$i + cat $data/cv/$i/feats.scp >> $data/cv.scp + done + ./utils/filter_scp.pl --exclude $data/cv.scp $data/data_all/feats.scp > $data/train_and_dev.scp + #95-5 split for dev set + sed -n '0~20p' $data/train_and_dev.scp > $data/dev.scp + ./utils/filter_scp.pl --exclude $data/dev.scp $data/train_and_dev.scp > $data/train.scp + ./utils/subset_data_dir.sh --utt-list $data/train.scp $data/data_all $data/train_org + ./utils/subset_data_dir.sh --utt-list $data/dev.scp $data/data_all $data/dev_org + ./utils/subset_data_dir.sh --utt-list $data/cv.scp $data/data_all $data/cv_all + fi + +echo "local/subset_data.sh succeeded" +exit 0; diff --git a/egs/aesrc2020/asr1/local/data_prep.sh b/egs/aesrc2020/asr1/local/data_prep.sh new file mode 100755 index 00000000000..4d5b26bd217 --- /dev/null +++ b/egs/aesrc2020/asr1/local/data_prep.sh @@ -0,0 +1,45 @@ +#!/bin/bash + +# Copyright 2020 Audio, Speech and Language Processing Group @ NWPU (Author: Xian Shi) +# Apache 2.0 + +. ./path.sh || exit 1; +. ./cmd.sh || exit 1; + +raw_data=$1 # raw data with metadata, txt and wav +data=$2 # data transformed into kaldi format + +# generate kaldi format data for all +if [ -d ${raw_data} ];then + echo "Generating kaldi format data." + mkdir -p $data/data_all + find $raw_data -type f -name "*.wav" > $data/data_all/wavpath + awk -F'/' '{print $(NF-2)"-"$(NF-1)"-"$NF}' $data/data_all/wavpath | sed 's:\.wav::g' > $data/data_all/uttlist + paste $data/data_all/uttlist $data/data_all/wavpath > $data/data_all/wav.scp + python local/preprocess.py $data/data_all/wav.scp $data/data_all/trans $data/data_all/utt2spk # faster than for in shell + ./utils/utt2spk_to_spk2utt.pl $data/data_all/utt2spk > $data/data_all/spk2utt +fi + +# clean transcription +if [ -d $data/data_all ];then + echo "Cleaning transcription." + tr '[a-z]' '[A-Z]' < $data/data_all/trans > $data/data_all/trans_upper + # turn "." 
in specific abbreviations into "" tag + sed -i -e 's: MR\.: MR:g' -e 's: MRS\.: MRS:g' -e 's: MS\.: MS:g' \ + -e 's:^MR\.:MR:g' -e 's:^MRS\.:MRS:g' -e 's:^MS\.:MS:g' $data/data_all/trans_upper + # fix bug + sed -i 's:^ST\.:STREET:g' $data/data_all/trans_upper + sed -i 's: ST\.: STREET:g' $data/data_all/trans_upper + # punctuation marks + sed -i "s%,\|\.\|?\|!\|;\|-\|:\|,'\|\.'\|?'\|!'\| '% %g" $data/data_all/trans_upper + sed -i 's::.:g' $data/data_all/trans_upper + # blank + sed -i 's:[ ][ ]*: :g' $data/data_all/trans_upper + paste $data/data_all/uttlist $data/data_all/trans_upper > $data/data_all/text + + # critally, must replace tab with space between uttid and text + sed -e "s/\t/ /g" -i $data/data_all/text +fi + +echo "local/data_prep.sh succeeded" +exit 0; diff --git a/egs/aesrc2020/asr1/local/download_and_untar.sh b/egs/aesrc2020/asr1/local/download_and_untar.sh new file mode 100755 index 00000000000..046ce35bb1b --- /dev/null +++ b/egs/aesrc2020/asr1/local/download_and_untar.sh @@ -0,0 +1,23 @@ +#!/usr/bin/env bash + +. ./path.sh || exit 1; +. ./cmd.sh || exit 1; + +zipped_data=$1 +raw_data=$2/Datatang-English/data + +# unzip and rename each accent +unzip $zipped_data -d ${2} +mv $raw_data/American\ English\ Speech\ Data $raw_data/US +mv $raw_data/British\ English\ Speech\ Data $raw_data/UK +mv $raw_data/Chinese\ Speaking\ English\ Speech\ Data $raw_data/CHN +mv $raw_data/Indian\ English\ Speech\ Data $raw_data/IND +mv $raw_data/Portuguese\ Speaking\ English\ Speech\ Data $raw_data/PT +mv $raw_data/Russian\ Speaking\ English\ Speech\ Data $raw_data/RU +mv $raw_data/Japanese\ Speaking\ English\ Speech\ Data $raw_data/JPN +mv $raw_data/Korean\ Speaking\ English\ Speech\ Data $raw_data/KR +mv $raw_data/Canadian\ English\ Speech\ Data $raw_data/CA +mv $raw_data/Spanish\ Speaking\ English\ Speech\ Data $raw_data/ES + +echo "local/download_and_untar.sh succeeded" +exit 0; diff --git a/egs/aesrc2020/asr1/local/files/ar.dict b/egs/aesrc2020/asr1/local/files/ar.dict new file mode 100644 index 00000000000..d17cfb0a5e0 --- /dev/null +++ b/egs/aesrc2020/asr1/local/files/ar.dict @@ -0,0 +1,8 @@ + 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 diff --git a/egs/aesrc2020/asr1/local/files/cvlist/CA_cv_spk b/egs/aesrc2020/asr1/local/files/cvlist/CA_cv_spk new file mode 100644 index 00000000000..9362f7dc693 --- /dev/null +++ b/egs/aesrc2020/asr1/local/files/cvlist/CA_cv_spk @@ -0,0 +1,4 @@ +CA-G00034 +CA-G00086 +CA-G00414 +CA-G20113 diff --git a/egs/aesrc2020/asr1/local/files/cvlist/CHN_cv_spk b/egs/aesrc2020/asr1/local/files/cvlist/CHN_cv_spk new file mode 100644 index 00000000000..f5ed8b6241c --- /dev/null +++ b/egs/aesrc2020/asr1/local/files/cvlist/CHN_cv_spk @@ -0,0 +1,4 @@ +CHN-G00190 +CHN-G00992 +CHN-G61365 +CHN-G01372 \ No newline at end of file diff --git a/egs/aesrc2020/asr1/local/files/cvlist/ES_cv_spk b/egs/aesrc2020/asr1/local/files/cvlist/ES_cv_spk new file mode 100644 index 00000000000..509dd652f44 --- /dev/null +++ b/egs/aesrc2020/asr1/local/files/cvlist/ES_cv_spk @@ -0,0 +1,4 @@ +ES-G00714 +ES-G01878 +ES-G11701 +ES-G20575 diff --git a/egs/aesrc2020/asr1/local/files/cvlist/IND_cv_spk b/egs/aesrc2020/asr1/local/files/cvlist/IND_cv_spk new file mode 100644 index 00000000000..72b5df67cf8 --- /dev/null +++ b/egs/aesrc2020/asr1/local/files/cvlist/IND_cv_spk @@ -0,0 +1,4 @@ +IND-G00892 +IND-G01006 +IND-G01501 +IND-G0760 \ No newline at end of file diff --git a/egs/aesrc2020/asr1/local/files/cvlist/JPN_cv_spk b/egs/aesrc2020/asr1/local/files/cvlist/JPN_cv_spk new file mode 100644 index 
00000000000..957a43af30b --- /dev/null +++ b/egs/aesrc2020/asr1/local/files/cvlist/JPN_cv_spk @@ -0,0 +1,4 @@ +JPN-G00040 +JPN-G00125 +JPN-G00354 +JPN-G20194 \ No newline at end of file diff --git a/egs/aesrc2020/asr1/local/files/cvlist/KR_cv_spk b/egs/aesrc2020/asr1/local/files/cvlist/KR_cv_spk new file mode 100644 index 00000000000..0e078514d72 --- /dev/null +++ b/egs/aesrc2020/asr1/local/files/cvlist/KR_cv_spk @@ -0,0 +1,4 @@ +KR-G00022 +KR-G00276 +KR-G10029 +KR-G10122 \ No newline at end of file diff --git a/egs/aesrc2020/asr1/local/files/cvlist/PT_cv_spk b/egs/aesrc2020/asr1/local/files/cvlist/PT_cv_spk new file mode 100644 index 00000000000..89f09e4756e --- /dev/null +++ b/egs/aesrc2020/asr1/local/files/cvlist/PT_cv_spk @@ -0,0 +1,5 @@ +PT-G00600 +PT-G00643 +PT-G00963 +PT-G10618 +PT-G20539 \ No newline at end of file diff --git a/egs/aesrc2020/asr1/local/files/cvlist/RU_cv_spk b/egs/aesrc2020/asr1/local/files/cvlist/RU_cv_spk new file mode 100644 index 00000000000..3069b2e4f6f --- /dev/null +++ b/egs/aesrc2020/asr1/local/files/cvlist/RU_cv_spk @@ -0,0 +1,4 @@ +RU-G00163 +RU-G00196 +RU-G00439 +RU-G10416 \ No newline at end of file diff --git a/egs/aesrc2020/asr1/local/files/cvlist/UK_cv_spk b/egs/aesrc2020/asr1/local/files/cvlist/UK_cv_spk new file mode 100644 index 00000000000..fe7cd8b43cd --- /dev/null +++ b/egs/aesrc2020/asr1/local/files/cvlist/UK_cv_spk @@ -0,0 +1,8 @@ +UK-G00025 +UK-G00808 +UK-G01337 +UK-G01807 +UK-G10261 +UK-G11032 +UK-G11739 +UK-G40517 \ No newline at end of file diff --git a/egs/aesrc2020/asr1/local/files/cvlist/US_cv_spk b/egs/aesrc2020/asr1/local/files/cvlist/US_cv_spk new file mode 100644 index 00000000000..760edcea2a0 --- /dev/null +++ b/egs/aesrc2020/asr1/local/files/cvlist/US_cv_spk @@ -0,0 +1,6 @@ +US-G00007 +US-G01459 +US-G10948 +US-G20537 +US-G20939 +US-G30201 \ No newline at end of file diff --git a/egs/aesrc2020/asr1/local/preprocess.py b/egs/aesrc2020/asr1/local/preprocess.py new file mode 100755 index 00000000000..f5939848f4e --- /dev/null +++ b/egs/aesrc2020/asr1/local/preprocess.py @@ -0,0 +1,18 @@ +# Copyright 2020 Audio, Speech and Language Processing Group @ NWPU (Author: Xian Shi) +# Apache 2.0 + +import sys + +fin = open(sys.argv[1], "r") +fout_text = open(sys.argv[2], "w") +fout_utt2spk = open(sys.argv[3], "w") + +for line in fin.readlines(): + uttid, path = line.strip("\n").split("\t") + text_path = path.replace(".wav", ".txt") + text_ori = open(text_path, "r").readlines()[0].strip("\n") + feild = path.split("/") + accid = feild[-3] + spkid = accid + "-" + feild[-2] + fout_utt2spk.write(uttid + "\t" + spkid + "\n") + fout_text.write(text_ori + "\n") diff --git a/egs/aesrc2020/asr1/path.sh b/egs/aesrc2020/asr1/path.sh new file mode 100644 index 00000000000..d405bf59826 --- /dev/null +++ b/egs/aesrc2020/asr1/path.sh @@ -0,0 +1,17 @@ +MAIN_ROOT=$PWD/../../.. +KALDI_ROOT=$MAIN_ROOT/tools/kaldi + + +export PATH=$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$PATH +[ ! -f $KALDI_ROOT/tools/config/common_path.sh ] && echo >&2 "The standard file $KALDI_ROOT/tools/config/common_path.sh is not present -> Exit!" && exit 1 +. $KALDI_ROOT/tools/config/common_path.sh +export LC_ALL=C + +export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:$MAIN_ROOT/tools/chainer_ctc/ext/warp-ctc/build +. "${MAIN_ROOT}"/tools/activate_python.sh && . 
"${MAIN_ROOT}"/tools/extra_path.sh +export PATH=$MAIN_ROOT/utils:$MAIN_ROOT/espnet/bin:$PATH + +export OMP_NUM_THREADS=1 + +# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C +export PYTHONIOENCODING=UTF-8 diff --git a/egs/aesrc2020/asr1/run.sh b/egs/aesrc2020/asr1/run.sh new file mode 100755 index 00000000000..1cd0d51791f --- /dev/null +++ b/egs/aesrc2020/asr1/run.sh @@ -0,0 +1,322 @@ +#!/usr/bin/env bash + +# Copyright 2017 Johns Hopkins University (Shinji Watanabe) +# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0) + +. ./path.sh || exit 1; +. ./cmd.sh || exit 1; + +# general configuration +backend=pytorch +stage=-1 # start from -1 if you need to start from data download +stop_stage=100 +ngpu=8 # number of gpus ("0" uses cpu, otherwise use gpu) +nj=32 +debugmode=1 +dumpdir=dump # directory to dump full features +N=0 # number of minibatches to be used (mainly for debugging). "0" uses all minibatches. +verbose=0 # verbose option +resume= # Resume the training from snapshot + +# feature configuration +do_delta=false + +preprocess_config=conf/specaug.yaml +train_config=conf/train.yaml # current default recipe requires 4 gpus. + # if you do not have 4 gpus, please reconfigure the `batch-bins` and `accum-grad` parameters in config. +lm_config=conf/lm.yaml +decode_config=conf/decode.yaml + +# rnnlm related +lm_resume= # specify a snapshot file to resume LM training +lmtag= # tag for managing LMs + +# decoding parameter +recog_model=model.acc.best # set a model to be used for decoding: 'model.acc.best' or 'model.loss.best' +lang_model=rnnlm.model.best # set a language model to be used for decoding + +# model average realted (only for transformer) +n_average=5 # the number of ASR models to be averaged +use_valbest_average=true # if true, the validation `n_average`-best ASR models will be averaged. + # if false, the last `n_average` ASR models will be averaged. +lm_n_average=0 # the number of languge models to be averaged +use_lm_valbest_average=false # if true, the validation `lm_n_average`-best language models will be averaged. + # if false, the last `lm_n_average` language models will be averaged. + +# Set this to somewhere where you want to put your data, or where +# someone else has already put it. You'll want to change this +# if you're not on the CLSP grid. +datadir= + +# The AESRC2020 data needs to be requested via services@datatang.com +# The provided data will be a zip +datazip= + +# bpemode (unigram or bpe) +nbpe=5000 +bpemode=unigram + +# exp tag +tag="" # tag for managing experiments. + +. utils/parse_options.sh || exit 1; + +# Set bash to 'debug' mode, it will exit on : +# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands', +set -e +set -u +set -o pipefail + +train_set=train +train_sp=train_sp +train_dev=dev +recog_set="US UK IND CHN JPN PT RU KR CA ES" + +if [ ${stage} -le -1 ] && [ ${stop_stage} -ge -1 ]; then + echo "stage -1: Data Download" + if [ ! -f ${datazip} ]; then + echo "The AESRC2020 data needs to be requested via services@datatang.com" + exit 1 + fi + local/download_and_untar.sh ${datazip} ${datadir} +fi + +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + ### Task dependent. You have to make data the following preparation part by yourself. 
+ ### But you can utilize Kaldi recipes in most cases + echo "stage 0: Data preparation" + local/data_prep.sh $datadir/Datatang-English/data data + ./utils/fix_data_dir.sh data/data_all +fi + +feat_tr_dir=${dumpdir}/${train_set}/delta${do_delta}; mkdir -p ${feat_tr_dir} +feat_sp_dir=${dumpdir}/${train_sp}/delta${do_delta}; mkdir -p ${feat_sp_dir} +feat_dt_dir=${dumpdir}/${train_dev}/delta${do_delta}; mkdir -p ${feat_dt_dir} +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + ### Task dependent. You have to design training and dev sets by yourself. + ### But you can utilize Kaldi recipes in most cases + echo "stage 1: Feature Generation" + fbankdir=fbank + # Generate the fbank features; by default 80-dimensional fbanks with pitch on each frame + steps/make_fbank_pitch.sh --cmd "$train_cmd" --nj ${nj} --write_utt2num_frames true \ + data/data_all exp/make_fbank/data_all ${fbankdir} + utils/fix_data_dir.sh data/data_all + + # Data splits + local/create_subsets.sh data + + utils/perturb_data_dir_speed.sh 0.9 data/${train_set}_org data/temp1 + utils/perturb_data_dir_speed.sh 1.0 data/${train_set}_org data/temp2 + utils/perturb_data_dir_speed.sh 1.1 data/${train_set}_org data/temp3 + + utils/combine_data.sh --extra-files utt2uniq data/${train_sp}_org data/temp1 data/temp2 data/temp3 + + # remove utt having more than 3000 frames + # remove utt having more than 400 characters + remove_longshortdata.sh --maxframes 3000 --maxchars 400 data/${train_set}_org data/${train_set} + remove_longshortdata.sh --maxframes 3000 --maxchars 400 data/${train_sp}_org data/${train_sp} + remove_longshortdata.sh --maxframes 3000 --maxchars 400 data/${train_dev}_org data/${train_dev} + steps/make_fbank_pitch.sh --cmd "$train_cmd" --nj $nj --write_utt2num_frames true \ + data/$train_sp exp/make_fbank/$train_sp ${fbankdir} + rm data/train_sp/utt2dur #hacked + utils/fix_data_dir.sh data/train_sp + # compute global CMVN + compute-cmvn-stats scp:data/${train_sp}/feats.scp data/${train_sp}/cmvn.ark + + # dump features for training + if [[ $(hostname -f) == *.clsp.jhu.edu ]] && [ ! -d ${feat_tr_dir}/storage ]; then + utils/create_split_dir.pl \ + /export/b{14,15,16,17}/${USER}/espnet-data/egs/librispeech/asr1/dump/${train_set}/delta${do_delta}/storage \ + ${feat_tr_dir}/storage + fi + if [[ $(hostname -f) == *.clsp.jhu.edu ]] && [ ! -d ${feat_dt_dir}/storage ]; then + utils/create_split_dir.pl \ + /export/b{14,15,16,17}/${USER}/espnet-data/egs/librispeech/asr1/dump/${train_dev}/delta${do_delta}/storage \ + ${feat_dt_dir}/storage + fi + dump.sh --cmd "$train_cmd" --nj ${nj} --do_delta ${do_delta} \ + data/${train_sp}/feats.scp data/${train_sp}/cmvn.ark exp/dump_feats/$train_sp ${feat_sp_dir} + dump.sh --cmd "$train_cmd" --nj ${nj} --do_delta ${do_delta} \ + data/${train_dev}/feats.scp data/${train_sp}/cmvn.ark exp/dump_feats/data_all ${feat_dt_dir} + for rtask in ${recog_set}; do + feat_recog_dir=${dumpdir}/${rtask}/delta${do_delta}; mkdir -p ${feat_recog_dir} + dump.sh --cmd "$train_cmd" --nj ${nj} --do_delta ${do_delta} \ + data/cv/${rtask}/feats.scp data/${train_sp}/cmvn.ark exp/dump_feats/recog/data_all \ + ${feat_recog_dir} + done +fi + +dict=data/lang_char/${train_set}_${bpemode}${nbpe}_units.txt +bpemodel=data/lang_char/${train_set}_${bpemode}${nbpe} +echo "dictionary: ${dict}" +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + ### Task dependent. You have to check non-linguistic symbols used in the corpus. 
+ echo "stage 2: Dictionary and Json Data Preparation" + mkdir -p data/lang_char/ + echo " 1" > ${dict} # must be 1, 0 will be used for "blank" in CTC + cut -f 2- -d" " data/${train_set}/text > data/lang_char/input.txt + spm_train --input=data/lang_char/input.txt --vocab_size=${nbpe} --model_type=${bpemode} --model_prefix=${bpemodel} --input_sentence_size=100000000 + spm_encode --model=${bpemodel}.model --output_format=piece < data/lang_char/input.txt | tr ' ' '\n' | sort | uniq | awk '{print $0 " " NR+1}' >> ${dict} + wc -l ${dict} + + # make json labels + data2json.sh --nj ${nj} --feat ${feat_sp_dir}/feats.scp --bpecode ${bpemodel}.model \ + data/${train_sp} ${dict} > ${feat_sp_dir}/data_${bpemode}${nbpe}.json + data2json.sh --nj ${nj} --feat ${feat_dt_dir}/feats.scp --bpecode ${bpemodel}.model \ + data/${train_dev} ${dict} > ${feat_dt_dir}/data_${bpemode}${nbpe}.json + + for rtask in ${recog_set}; do + feat_recog_dir=${dumpdir}/${rtask}/delta${do_delta} + data2json.sh --nj ${nj} --feat ${feat_recog_dir}/feats.scp --bpecode ${bpemodel}.model \ + data/cv/${rtask} ${dict} > ${feat_recog_dir}/data_${bpemode}${nbpe}.json + done +fi + +# You can skip this and remove --rnnlm option in the recognition (stage 5) +if [ -z ${lmtag} ]; then + lmtag=$(basename ${lm_config%.*}) +fi +lmexpname=train_rnnlm_${backend}_${lmtag}_${bpemode}${nbpe}_ngpu${ngpu} +lmexpdir=exp/${lmexpname} +mkdir -p ${lmexpdir} + +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then + echo "stage 3: LM Preparation" + lmdatadir=data/local/lm_train_${bpemode}${nbpe} + # use external data + if [ ! -e data/local/lm_train/librispeech-lm-norm.txt.gz ]; then + wget http://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz -P data/local/lm_train/ + fi + if [ ! -e ${lmdatadir} ]; then + mkdir -p ${lmdatadir} + cut -f 2- -d" " data/${train_set}/text | gzip -c > data/local/lm_train/${train_set}_text.gz + # combine external text and transcriptions and shuffle them with seed 777 + zcat data/local/lm_train/librispeech-lm-norm.txt.gz data/local/lm_train/${train_set}_text.gz |\ + spm_encode --model=${bpemodel}.model --output_format=piece > ${lmdatadir}/train.txt + cut -f 2- -d" " data/${train_dev}/text | spm_encode --model=${bpemodel}.model --output_format=piece \ + > ${lmdatadir}/valid.txt + fi + ${cuda_cmd} --gpu ${ngpu} ${lmexpdir}/train.log \ + lm_train.py \ + --config ${lm_config} \ + --ngpu ${ngpu} \ + --backend ${backend} \ + --verbose 1 \ + --outdir ${lmexpdir} \ + --tensorboard-dir tensorboard/${lmexpname} \ + --train-label ${lmdatadir}/train.txt \ + --valid-label ${lmdatadir}/valid.txt \ + --resume ${lm_resume} \ + --dict ${dict} \ + --dump-hdf5-path ${lmdatadir} +fi + +if [ -z ${tag} ]; then + expname=${train_set}_${backend}_$(basename ${train_config%.*}) + if ${do_delta}; then + expname=${expname}_delta + fi + if [ -n "${preprocess_config}" ]; then + expname=${expname}_$(basename ${preprocess_config%.*}) + fi +else + expname=${train_set}_${backend}_${tag} +fi +expdir=exp/${expname} +mkdir -p ${expdir} + +if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then + echo "stage 4: Network Training" + ${cuda_cmd} --gpu ${ngpu} ${expdir}/train.log \ + asr_train.py \ + --config ${train_config} \ + --preprocess-conf ${preprocess_config} \ + --ngpu ${ngpu} \ + --backend ${backend} \ + --outdir ${expdir}/results \ + --tensorboard-dir tensorboard/${expname} \ + --debugmode ${debugmode} \ + --dict ${dict} \ + --debugdir ${expdir} \ + --minibatches ${N} \ + --verbose ${verbose} \ + --resume ${resume} \ + --train-json 
${feat_sp_dir}/data_${bpemode}${nbpe}.json \ + --valid-json ${feat_dt_dir}/data_${bpemode}${nbpe}.json +fi + +if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then + echo "stage 5: Decoding" + if [[ $(get_yaml.py ${train_config} model-module) = *transformer* ]] || \ + [[ $(get_yaml.py ${train_config} model-module) = *conformer* ]] || \ + [[ $(get_yaml.py ${train_config} etype) = custom ]] || \ + [[ $(get_yaml.py ${train_config} dtype) = custom ]]; then + # Average ASR models + if ${use_valbest_average}; then + recog_model=model.val${n_average}.avg.best + opt="--log ${expdir}/results/log" + else + recog_model=model.last${n_average}.avg.best + opt="--log" + fi + average_checkpoints.py \ + ${opt} \ + --backend ${backend} \ + --snapshots ${expdir}/results/snapshot.ep.* \ + --out ${expdir}/results/${recog_model} \ + --num ${n_average} + + # Average LM models + if [ ${lm_n_average} -eq 0 ]; then + lang_model=rnnlm.model.best + else + if ${use_lm_valbest_average}; then + lang_model=rnnlm.val${lm_n_average}.avg.best + opt="--log ${lmexpdir}/log" + else + lang_model=rnnlm.last${lm_n_average}.avg.best + opt="--log" + fi + average_checkpoints.py \ + ${opt} \ + --backend ${backend} \ + --snapshots ${lmexpdir}/snapshot.ep.* \ + --out ${lmexpdir}/${lang_model} \ + --num ${lm_n_average} + fi + fi + + pids=() # initialize pids + for rtask in ${recog_set}; do + ( + decode_dir=decode_${rtask}_${recog_model}_$(basename ${decode_config%.*})_${lmtag} + feat_recog_dir=${dumpdir}/${rtask}/delta${do_delta} + + # split data + splitjson.py --parts ${nj} ${feat_recog_dir}/data_${bpemode}${nbpe}.json + + #### use CPU for decoding + ngpu=0 + + # set batchsize 0 to disable batch decoding + ${decode_cmd} JOB=1:${nj} ${expdir}/${decode_dir}/log/decode.JOB.log \ + asr_recog.py \ + --config ${decode_config} \ + --ngpu ${ngpu} \ + --backend ${backend} \ + --batchsize 0 \ + --recog-json ${feat_recog_dir}/split${nj}utt/data_${bpemode}${nbpe}.JOB.json \ + --result-label ${expdir}/${decode_dir}/data.JOB.json \ + --model ${expdir}/results/${recog_model} \ + --rnnlm ${lmexpdir}/${lang_model} \ + --api v2 + + score_sclite.sh --bpe ${nbpe} --bpemodel ${bpemodel}.model --wer true ${expdir}/${decode_dir} ${dict} + + ) & + pids+=($!) # store background pids + done + i=0; for pid in "${pids[@]}"; do wait ${pid} || ((++i)); done + [ ${i} -gt 0 ] && echo "$0: ${i} background jobs are failed." && false + echo "Finished" +fi diff --git a/egs/lrs/asr1/steps b/egs/aesrc2020/asr1/steps similarity index 100% rename from egs/lrs/asr1/steps rename to egs/aesrc2020/asr1/steps diff --git a/egs/lrs/asr1/utils b/egs/aesrc2020/asr1/utils similarity index 100% rename from egs/lrs/asr1/utils rename to egs/aesrc2020/asr1/utils diff --git a/egs/commonvoice/asr1/local/download_and_untar.sh b/egs/commonvoice/asr1/local/download_and_untar.sh index 1f5c40d9b0e..cce26302127 100755 --- a/egs/commonvoice/asr1/local/download_and_untar.sh +++ b/egs/commonvoice/asr1/local/download_and_untar.sh @@ -16,7 +16,7 @@ fi if [ $# -ne 3 ]; then echo "Usage: $0 [--remove-archive] " - echo "e.g.: $0 /export/data/ https://common-voice-data-download.s3.amazonaws.com/cv_corpus_v1.tar.gz cv_corpus_v1.tar.gz" + echo "e.g.: $0 /export/data/ https://us.openslr.org/resources/108/FR.tgz" echo "With --remove-archive it will remove the archive after successfully un-tarring it." 
exit 0; fi diff --git a/egs/lrs/README.md new file mode 100644 index 00000000000..26f623cd08b --- /dev/null +++ b/egs/lrs/README.md @@ -0,0 +1,335 @@ +# ESPnet-AVSR + +## Introduction +This repository contains an implementation of end-to-end (E2E) audio-visual speech recognition (AVSR) based on the ESPnet ASR toolkit. The new fusion strategy follows the paper "Fusing information streams in end-to-end audio-visual speech recognition" (https://ieeexplore.ieee.org/document/9414553) [[1]](#literature). A broad range of reliability measures is used to help the integration model improve the performance of the AVSR model. We use two large-vocabulary datasets, the Lip Reading Sentences 2 and 3 corpora, for all our experiments. +In addition, this project also contains an audio-only model for comparison. + +## Table of Contents +- [Installation](#installation-of-required-packages) + * [Requirements](#requirements) +- [Project Structure](#project-structure) + * [Basics](#project-structure) + * [AVSR1](#detailed-description-of-avsr1) +- [Usage of the scripts](#running-the-script) + + [Notes](#notes) + + +## Installation of required packages + +### Requirements + +For installation, approximately 40GB of free disk space is needed. avsr1/run.sh stage 0 installs all required packages in avsr1/local/installations: + +**Required Packages:** +1. ESPnet: https://github.com/espnet/espnet +2. OpenFace: https://github.com/TadasBaltrusaitis/OpenFace +3. DeepXi: https://github.com/anicolson/DeepXi +4. Vidaug: https://github.com/okankop/vidaug + + + +## Project structure +The main folder, avsr1/, contains the code for the audio-visual speech recognition system, trained on the LRS2 [[2]](#literature) dataset together with the LRS3 dataset (https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html) [[3]](#literature). It follows the basic ESPnet structure. +The main code for the recognition system is the run.sh script. In the script, the workflow of the system is performed in multiple stages: + +| AVSR | |-------------------------------------------------------------| | Stage 0: Install required packages | | Stage 1: Data Download and preparation | | Stage 2: Audio augmentation | | Stage 3: MP3 files and Feature Generation | | Stage 4: Dictionary and JSON data preparation | | Stage 5: Reliability measures generation | | Stage 6: Language model training | | Stage 7: Training of the E2E-AVSR model and Decoding | + + + + + + +### Detailed description of AVSR1: + +##### Stage 0: Package installation + * Install the required packages: ESPnet, OpenFace, DeepXi, Vidaug in avsr1/local/installations. To install OpenFace, you will need sudo rights. + +##### Stage 1: Data preparation + * The LRS2 dataset [2] must be downloaded in advance by yourself. For downloading the dataset, please visit https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html/ [2]. You will need to sign a data-sharing agreement with BBC Research & Development before getting access. After downloading, please edit the path.sh file and assign the dataset directory path to the DATA_DIR variable + * The same applies to the LRS3 dataset https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html [3].
After downloading, please edit the path.sh file and assign the dataset directory path to the DATALRS3_DIR variable + * Download the Musan dataset for audio data augmentation and save it under the ${MUSAN_DIR} directory + * Download the Room Impulse Response and Noise Database (RIRS-Noises) and save it under the RIRS_NOISES/ directory + * Run the audio_data_prep.sh script: create file lists for the given part of the dataset and prepare the Kaldi files + * Dump useful data for training + +##### Stage 2: Audio Augmentation + * Augment the audio data with RIRS noise + * Augment the audio data with Musan noise + * The augmented files are saved under data/audio/augment, whereas the clean audio files can be found in data/audio/clear for all the used datasets (Test, Validation (Val), Train and optional Pretrain) + +##### Stage 3: Feature Generation + * Make augmented MP3 files + * Generate the fbank and MFCC features for the audio signals. By default, 80-dimensional filterbanks with pitch on each frame are used + * Compute global cepstral mean and variance normalization (CMVN) statistics over the training features (https://kaldi-asr.org/doc/compute-cmvn-stats_8cc.html). + +##### Stage 4: Dictionary and JSON data preparation + * Build the dictionary and prepare the JSON data + * Build a tokenizer using SentencePiece: https://github.com/google/sentencepiece + +##### Stage 5: Reliability measures generation + * Stage 5.0: Create dump file for MFCC features + * Stage 5.1: Video augmentation with Gaussian blur and salt&pepper noise + * Stage 5.2: OpenFace facial landmark extraction (especially the mouth region; for further details, see the documentation in the avsr1/local folder) + * Stage 5.3: Extract video frames + * Stage 5.4: Estimate SNRs using the DeepXi framework + * Stage 5.5: Extract video features with a pretrained video feature extractor [[4]](#literature) + * Stage 5.6: Make video .ark files + * Stage 5.7: Remake audio and video dump files + * Stage 5.8: Split test decode dump files by different signal-to-noise ratios + +##### Stage 6: Language Model Training + * Train your own language model on the LibriSpeech dataset (https://www.openslr.org/11/) or use a pretrained language model + * It is possible to skip the language model and use the system without an external language model. + +##### Stage 7: Network Training + * Train the audio model + * Pretrain the video model + * Finetune the video model + * Pretrain the AV model + * Finetune the AV model (model used for decoding) + +##### Other important references: + * Explanation of the CSV file for OpenFace: https://github.com/TadasBaltrusaitis/OpenFace/wiki/Output-Format#featureextraction + + +## Running the script +The runtime script is **run.sh**, which can be found in the avsr1/ directory. +> Before running the script, please download the LRS2 (https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html) [[2]](#literature) and LRS3 (https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html) [[3]](#literature) datasets by yourself and save the download paths to the variables DATA_DIR (LRS2 path) and DATALRS3_DIR (LRS3 path) inside the run.sh file. + +### Notes +Due to the long runtime, it can be useful to run the script with the screen command, monitor it in a terminal window, and redirect the output to a log file. + +Screen is a terminal multiplexer, which means that you can start any number of virtual terminals inside the current terminal session.
The advantage is that you can detach virtual terminals so that they keep running in the background. The processes keep running even if you close the main session or an ssh connection while working remotely on a server. +Screen can be installed from the official package repositories via +```console +foo@bar:~$ sudo apt install screen +``` +As an example, to redirect the output into a file named "log_run_sh.txt", the script could be started with: +```console +foo@bar:~/avsr1$ screen bash -c 'bash run.sh |& tee -a log_run_sh.txt' +``` +This will start a virtual terminal session that executes and monitors the run.sh file. The output is printed to this session as well as saved into the file "log_run_sh.txt". You can leave the monitoring session by pressing ctrl+A+D. If you want to return to the process, simply type +```console +foo@bar:~$ screen -ls +``` +into a terminal to see all running screen processes with their corresponding IDs. Then execute +```console +foo@bar:~$ screen -r [ID] +``` +to return to the process. +Source: https://wiki.ubuntuusers.de/Screen/ *** ### Literature [1] W. Yu, S. Zeiler and D. Kolossa, "Fusing Information Streams in End-to-End Audio-Visual Speech Recognition," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 3430-3434, doi: 10.1109/ICASSP39728.2021.9414553. [2] T. Afouras, J. S. Chung, A. Senior, O. Vinyals, A. Zisserman
+
+***
+### Literature
+
+[1] W. Yu, S. Zeiler and D. Kolossa, "Fusing Information Streams in End-to-End Audio-Visual Speech Recognition," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 3430-3434, doi: 10.1109/ICASSP39728.2021.9414553.
+
+[2] T. Afouras, J. S. Chung, A. Senior, O. Vinyals, A. Zisserman
+Deep Audio-Visual Speech Recognition
+arXiv: 1809.02108
+
+[3] T. Afouras, J. S. Chung, A. Zisserman
+LRS3-TED: a large-scale dataset for visual speech recognition +arXiv preprint arXiv: 1809.00496 + +[4] S. Petridis, T. Stafylakis, P. Ma, G. Tzimiropoulos, andM. Pantic, “Audio-visual speech recognition with a hybridCTC/Attention architecture,” in IEEE SLT. IEEE, 2018. + diff --git a/egs/lrs/avsr1/RESULTS.md b/egs/lrs/avsr1/RESULTS.md new file mode 100755 index 00000000000..2615db795f8 --- /dev/null +++ b/egs/lrs/avsr1/RESULTS.md @@ -0,0 +1,294 @@ +## pretrain_Train_pytorch_audio_delta_specaug (Audio-Only) + +* Model files (archived to model.tar.gz by $ pack_model.sh) + - download link: https://drive.google.com/file/d/1ITgdZoa8vQ7lDwi1jLziYGXOyUtgE2ow/view + - training config file: conf/train.yaml + - decoding config file: conf/decode.yaml + - preprocess config file: conf/specaug.yaml + - lm config file: conf/lm.yaml + - cmvn file: data/train/cmvn.ark + - e2e file: exp/audio/model.last10.avg.best + - e2e json file: exp/audio/model.json + - lm file: exp/train_rnnlm_pytorch_lm_unigram500/rnnlm.model.best + - lm JSON file: exp/train_rnnlm_pytorch_lm_unigram500/model.json + - dict file: data/lang_char/train_unigram500_units.txt + +## Environments +- date: `Mon Feb 21 11:52:07 UTC 2022` +- python version: `3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]` +- espnet version: `espnet 0.6.0` +- chainer version: `chainer 6.0.0` +- pytorch version: `pytorch 1.0.1.post2` + +### CER + +|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err| +|---|---|---|---|---|---|---|---|---|---| +|music noise|-12|171|1669|82.0|11.2|6.8|2.2|20.3|38.6| +||-9|187|1897|87.0|8.3|4.7|0.8|13.8|33.2| +||-6|176|1821|92.0|5.5|2.5|1.1|9.1|26.7| +||-3|201|2096|94.4|2.2|3.3|0.2|5.8|20.4| +||0|158|1611|95.0|3.0|2.0|0.4|5.4|19.0| +||3|173|1710|94.7|2.7|2.6|0.4|5.7|24.9| +||6|185|1920|96.2|1.8|2.0|0.5|4.3|17.8| +||9|157|1533|97.6|1.0|1.4|0.5|2.9|13.4| +||12|150|1536|96.4|1.6|2.1|0.3|4.0|20.7| +||clean|138|1390|96.7|1.4|1.9|0.4|3.7|17.4| +||reverb|177|1755|93.7|3.6|2.7|0.7|7.0|23.2| +|ambient noise|-12|187|1873|76.4|16.3|7.3|2.3|25.9|51.9| +||-9 |193|1965|84.2|10.3|5.4|1.8|17.6|40.4| +||-6 |176|1883|90.2|5.8|4.0|1.3|11.2|26.1| +||-3 |173|1851|91.2|4.8|4.0|1.0|9.8|32.9| +|| 0 |148|1470|94.8|3.0|2.2|0.7|5.9|23.6| +|| 3 |176|1718|96.0|2.1|1.9|0.3|4.3|17.0| +|| 6 |166|1714|93.7|2.9|3.4|0.5|6.8|20.5| +|| 9 |170|1601|96.9|1.5|1.6|0.3|3.4|18.2| +||12 |169|1718|95.9|2.5|1.6|0.2|4.3|20.1| +||clean |138|1390|96.7|1.4|1.9|0.4|3.7|17.4| +||reverb |177|1755|93.7|3.6|2.7|0.7|7.0|23.2| + +### WER + +|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err| +|---|---|---|---|---|---|---|---|---|---| +|music noise|-12|171|912|83.4|12.5|4.1|2.4|19.0|38.6| +||-9 |187|1005|87.6|8.6|3.9|1.9|14.3|33.2| +||-6 |176|951|90.6|5.9|3.5|0.8|10.2|26.7| +||-3 |201|1097|94.4|3.3|2.3|0.6|6.2|20.4| +|| 0 |158|847|94.9|3.2|1.9|0.4|5.4|19.0| +|| 3 |173|884|94.2|3.8|1.9|0.6|6.3|24.9| +|| 6 |185|997|96.3|2.7|1.0|0.7|4.4|17.8| +|| 9 |157|817|96.9|1.7|1.3|0.4|3.4|13.4| +||12 |150|832|95.2|2.9|1.9|0.5|5.3|20.7| +||clean |138|739|95.7|2.4|1.9|0.4|4.7|17.4| +||reverb |177|943|93.6|4.0|2.3|0.4|6.8|23.2| +|ambient noise|-12|187|995|73.7|18.4|7.9|1.7|28.0|51.9| +||-9 |193|1060|83.0|11.7|5.3|1.4|18.4|40.4| +||-6 |176|971|90.2|6.8|3.0|1.4|11.2|26.1| +||-3 |173|972|90.0|6.9|3.1|1.0|11.0|32.9| +|| 0 |148|838|94.0|4.1|1.9|0.4|6.3|23.6| +|| 3 |176|909|95.5|2.9|1.7|0.3|4.8|17.0| +|| 6 |166|830|94.1|3.3|2.7|1.0|6.9|20.5| +|| 9 |170|872|95.4|3.1|1.5|0.2|4.8|18.2| +||12 |169|895|95.0|4.0|1.0|0.2|5.3|20.1| +||clean |138|739|95.7|2.4|1.9|0.4|4.7|17.4| +||reverb 
|177|943|93.6|4.0|2.3|0.4|6.8|23.2| + +## Train_pytorch_trainvideo_delta_specaug (Video-Only) + +* Model files (archived to model.tar.gz by $ pack_model.sh) + - download link: https://drive.google.com/file/d/1ZXXCXSbbFS2PDlrs9kbJL9pE6-5nPPxi/view + - training config file: conf/finetunevideo/trainvideo.yaml + - decoding config file: conf/decode.yaml + - preprocess config file: conf/specaug.yaml + - lm config file: conf/lm.yaml + - e2e file: exp/vfintune/model.last10.avg.best + - e2e json file: exp/vfintune/model.json + - lm file: exp/train_rnnlm_pytorch_lm_unigram500/rnnlm.model.best + - lm JSON file: exp/train_rnnlm_pytorch_lm_unigram500/model.json + - dict file: data/lang_char/train_unigram500_units.txt + +## Environments +- date: `Mon Feb 21 11:52:07 UTC 2022` +- python version: `3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]` +- espnet version: `espnet 0.6.0` +- chainer version: `chainer 6.0.0` +- pytorch version: `pytorch 1.0.1.post2` + + +### CER + +|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err| +|---|---|---|---|---|---|---|---|---|---| +|clean visual data|171|1669|42.3|42.5|15.2|6.4|64.1|91.8| +||-9 |187|1897|46.4|38.8|14.8|8.5|62.2|90.9| +||-6 |176|1821|48.1|37.7|14.2|9.2|61.1|92.0| +||-3 |201|2096|41.7|46.4|11.9|8.9|67.2|90.0| +|| 0 |158|1611|43.4|42.6|14.0|7.1|63.7|94.9| +|| 3 |173|1710|49.2|37.6|13.2|8.9|59.7|91.9| +|| 6 |185|1920|39.3|45.6|15.2|9.4|70.2|95.1| +|| 9 |157|1533|46.2|39.1|14.7|8.5|62.3|89.2| +||12 |150|1536|49.5|37.6|12.9|7.2|57.7|87.3| +||clean |138|1390|44.2|42.3|13.5|7.8|63.7|92.8| +||reverb |177|1755|44.8|41.5|13.6|7.5|62.7|92.1| +|visual gaussian blur|-12|187|1873|37.3|46.6|16.1|9.0|71.6|93.0| +||-9 |193|1965|43.0|44.1|13.0|11.0|68.1|93.8| +||-6 |176|1883|39.9|43.3|16.7|7.5|67.6|93.8| +||-3 |173|1851|43.7|43.8|12.5|8.2|64.5|91.9| +|| 0 |148|1470|42.3|45.4|12.3|8.2|65.9|93.9| +|| 3 |176|1718|44.8|41.5|13.7|7.9|63.1|89.2| +|| 6 |166|1714|38.5|45.4|16.0|10.7|72.2|94.6| +|| 9 |170|1601|45.1|42.8|12.1|11.7|66.6|91.2| +||12 |169|1718|42.0|40.1|17.9|8.2|66.2|92.3| +||clean |138|1390|40.4|45.5|14.2|8.7|68.3|93.5| +||reverb |177|1755|40.2|45.6|14.2|8.5|68.3|92.7| +|visual salt and pepper noise|-12|187|1873|36.2|48.1|15.8|9.9|73.7|92.0| +||-9 |193|1965|41.7|44.6|13.7|10.6|68.9|92.7| +||-6 |176|1883|36.5|47.2|16.4|8.6|72.1|93.2| +||-3 |173|1851|42.1|45.4|12.5|10.8|68.6|92.5| +|| 0 |148|1470|42.3|45.1|12.6|9.5|67.2|91.9| +|| 3 |176|1718|40.0|45.1|15.0|7.6|67.6|92.0| +|| 6 |166|1714|38.1|45.2|16.7|10.1|72.0|94.0| +|| 9 |170|1601|40.2|45.9|13.9|12.0|71.8|92.9| +||12 |169|1718|37.5|46.8|15.7|8.7|71.2|94.1| +||clean |138|1390|39.9|46.0|14.0|9.1|69.1|92.8| +||reverb |177|1755|39.9|46.2|13.9|9.1|69.2|92.7| + +### WER + +|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err| +|---|---|---|---|---|---|---|---|---|---| +|clean visual data|-12|171|912|39.4|42.7|18.0|4.3|64.9|89.5| +||-9 |187|1005|43.7|40.6|15.7|5.4|61.7|86.1| +||-6 |176|951|43.3|42.6|14.1|4.1|60.8|88.6| +||-3 |201|1097|41.3|44.2|14.5|5.3|64.0|85.6| +|| 0 |158|847|44.3|37.8|17.9|6.1|61.9|85.4| +|| 3 |173|884|44.2|39.7|16.1|5.3|61.1|84.4| +|| 6 |185|997|38.2|44.8|17.0|3.9|65.7|84.9| +|| 9 |157|817|47.9|37.1|15.1|5.5|57.6|80.3| +||12 |150|832|42.9|37.6|19.5|5.3|62.4|84.0| +||clean |138|739|45.9|39.1|15.0|5.3|59.4|85.5| +||reverb |177|943|43.4|40.5|16.1|5.3|61.9|85.9| +|visual Gaussian blur|-12|187|995|35.9|45.4|18.7|5.3|69.4|86.6| +||-9 |193|1060|35.0|44.2|20.8|5.0|70.0|92.2| +||-6 |176|971|38.2|43.2|18.6|4.6|66.4|87.5| +||-3 |173|972|37.9|45.5|16.7|4.8|67.0|86.1| +|| 0 
|148|838|38.1|40.7|21.2|4.2|66.1|89.2| +|| 3 |176|909|36.0|48.5|15.5|5.9|70.0|88.6| +|| 6 |166|830|36.7|46.6|16.6|6.1|69.4|89.8| +|| 9 |170|872|39.0|45.5|15.5|4.7|65.7|87.6| +||12 |169|895|35.2|46.8|18.0|4.6|69.4|89.9| +||clean |138|739|40.7|42.2|17.1|5.0|64.3|88.4| +||reverb |177|943|38.0|44.3|17.7|5.0|67.0|89.3| +|visual salt and pepper noise|-12|187|995|32.5|48.9|18.6|4.6|72.2|83.4| +||-9 |193|1060|32.3|51.5|16.2|6.1|73.9|92.2| +||-6 |176|971|36.5|47.3|16.3|7.2|70.8|86.4| +||-3 |173|972|35.5|47.2|17.3|4.6|69.1|88.4| +|| 0 |148|838|36.9|41.5|21.6|3.7|66.8|88.5| +|| 3 |176|909|33.0|51.9|15.1|5.4|72.4|88.6| +|| 6 |166|830|35.3|49.9|14.8|8.8|73.5|88.0| +|| 9 |170|872|41.2|43.3|15.5|5.6|64.4|84.7| +||12 |169|895|34.2|47.8|18.0|7.3|73.1|91.1| +||clean |138|739|37.5|47.8|14.7|7.3|69.8|86.2| +||reverb |177|943|35.9|47.9|16.1|6.7|70.7|87.0| + +## Train_pytorch_trainavs_delta_specaug (Audio-Visual) + +* Model files (archived to model.tar.gz by $ pack_model.sh) + - download link: https://drive.google.com/file/d/1ZXXCXSbbFS2PDlrs9kbJL9pE6-5nPPxi/view + - training config file: conf/finetuneav/trainavs.yaml + - decoding config file: conf/decode.yaml + - preprocess config file: conf/specaug.yaml + - lm config file: conf/lm.yaml + - cmvn file: data/train/cmvn.ark + - e2e file: exp/avfintune/model.last10.avg.best + - e2e json file: exp/avfintune/model.json + - lm file: exp/train_rnnlm_pytorch_lm_unigram500/rnnlm.model.best + - lm JSON file: exp/train_rnnlm_pytorch_lm_unigram500/model.json + - dict file: data/lang_char/train_unigram500_units.txt + +## Environments +- date: `Mon Feb 21 11:52:07 UTC 2022` +- python version: `3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]` +- espnet version: `espnet 0.6.0` +- chainer version: `chainer 6.0.0` +- pytorch version: `pytorch 1.0.1.post2` + + +### CER + +|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err| +|---|---|---|---|---|---|---|---|---|---| +|music noise with clean visual data |-12|171|1669|90.7|5.4|3.9|0.7|9.9|26.3| +||-9 |187|1897|93.7|3.5|2.7|0.4|6.7|25.1| +||-6 |176|1821|95.1|2.9|2.0|0.4|5.4|18.8| +||-3 |201|2096|96.2|1.6|2.2|0.3|4.2|15.9| +|| 0 |158|1611|96.4|1.9|1.7|0.2|3.8|13.9| +|| 3 |173|1710|96.7|1.7|1.6|0.2|3.6|17.9| +|| 6 |185|1920|96.1|1.6|2.2|0.5|4.3|18.9| +|| 9 |157|1533|96.9|1.4|1.7|0.5|3.6|14.0| +||12 |150|1536|96.5|1.4|2.1|0.5|4.0|21.3| +||clean |138|1390|97.9|0.9|1.2|0.2|2.3|13.8| +||reverb |177|1755|96.8|1.5|1.8|0.2|3.5|16.4| +|ambient noise with clean visual data |-12|187|1873|89.6|5.8|4.6|1.2|11.5|31.0| +||-9 |193|1965|91.2|5.0|3.8|0.9|9.6|29.0| +||-6 |176|1883|94.3|1.9|3.8|0.3|6.0|21.0| +||-3 |173|1851|94.8|2.7|2.5|0.9|6.1|22.0| +|| 0 |148|1470|96.3|1.6|2.0|0.1|3.8|16.9| +|| 3 |176|1718|97.7|1.5|0.8|0.1|2.4|12.5| +|| 6 |166|1714|96.6|1.6|1.8|0.2|3.6|16.3| +|| 9 |170|1601|97.0|1.6|1.4|0.3|3.3|17.1| +||12 |169|1718|95.4|2.6|2.0|0.1|4.7|20.7| +||clean |138|1390|97.9|0.9|1.2|0.2|2.3|13.8| +||reverb |177|1755|96.8|1.5|1.8|0.2|3.5|16.4| +|ambient noise with visual Gaussian blur|-12|187|1873|86.9|7.3|5.8|1.1|14.2|35.8| +||-9 |193|1965|91.1|5.4|3.5|1.0|9.9|30.1| +||-6 |176|1883|93.3|2.7|4.0|0.3|7.0|24.4| +||-3 |173|1851|95.1|2.5|2.4|0.8|5.7|21.4| +|| 0 |148|1470|96.3|1.6|2.1|0.1|3.8|17.6| +|| 3 |176|1718|97.3|1.6|1.2|0.2|2.9|13.6| +|| 6 |166|1714|96.2|1.8|2.0|0.2|4.0|18.1| +|| 9 |170|1601|97.0|1.4|1.6|0.2|3.2|16.5| +||12 |169|1718|94.9|2.8|2.3|0.3|5.4|23.1| +||clean |138|1390|97.8|0.9|1.3|0.2|2.4|14.5| +||reverb |177|1755|96.5|1.5|2.1|0.2|3.7|16.9| +|ambient noise with visual salt and pepper 
noise|-12|187|1873|87.6|7.0|5.4|1.3|13.8|35.8| +||-9 |193|1965|91.0|5.8|3.2|1.3|10.3|30.6| +||-6 |176|1883|93.6|2.0|4.4|0.4|6.9|24.4| +||-3 |173|1851|95.6|2.9|1.6|0.8|5.2|20.2| +|| 0 |148|1470|95.9|1.9|2.2|0.1|4.2|18.2| +|| 3 |176|1718|98.0|1.0|1.0|0.3|2.3|13.1| +|| 6 |166|1714|96.4|1.8|1.8|0.2|3.7|17.5| +|| 9 |170|1601|97.0|1.4|1.6|0.4|3.4|16.5| +||12 |169|1718|96.2|2.2|1.6|0.2|4.1|18.9| +||clean |138|1390|98.1|0.9|1.1|0.2|2.2|13.0| +||reverb |177|1755|96.6|1.5|1.9|0.2|3.6|16.9| + +### WER + +|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err| +|---|---|---|---|---|---|---|---|---|---| +|music noise with clean visual data |-12|171|912|91.2|6.0|2.7|1.5|10.3|26.3| +||-9 |187|1005|93.2|4.5|2.3|0.4|7.2|25.1| +||-6 |176|951|94.1|3.7|2.2|0.3|6.2|18.8| +||-3 |201|1097|95.2|2.7|2.1|0.4|5.2|15.9| +|| 0 |158|847|96.7|2.2|1.1|0.4|3.7|13.9| +|| 3 |173|884|95.6|2.6|1.8|0.3|4.8|17.9| +|| 6 |185|997|95.5|2.3|2.2|0.7|5.2|18.9| +|| 9 |157|817|96.2|2.1|1.7|0.7|4.5|14.0| +||12 |150|832|95.1|2.4|2.5|0.2|5.2|21.3| +||clean |138|739|97.2|1.5|1.4|0.4|3.2|13.8| +||reverb |177|943|96.0|1.8|2.2|0.3|4.3|16.4| +|ambient noise with clean visual data |-12|187|995|90.4|6.9|2.7|1.1|10.8|31.0| +||-9 |193|1060|91.3|5.6|3.1|1.4|10.1|29.0| +||-6 |176|971|94.4|2.9|2.7|0.3|5.9|21.0| +||-3 |173|972|93.7|3.7|2.6|0.1|6.4|22.0| +|| 0 |148|838|95.7|2.0|2.3|0.1|4.4|16.9| +|| 3 |176|909|97.0|1.5|1.4|0.3|3.3|12.5| +|| 6 |166|830|96.0|1.9|2.0|0.6|4.6|16.3| +|| 9 |170|872|95.6|3.4|0.9|0.2|4.6|17.1| +||12 |169|895|94.0|3.7|2.3|0.4|6.5|20.7| +||clean |138|739|97.2|1.5|1.4|0.4|3.2|13.8| +||reverb |177|943|96.0|1.8|2.2|0.3|4.3|16.4| +|ambient noise with visual Gaussian blur|-12|187|995|87.0|9.1|3.8|1.0|14.0|35.8| +||-9 |193|1060|90.6|6.2|3.2|1.1|10.6|30.1| +||-6 |176|971|93.2|3.6|3.2|0.3|7.1|24.4| +||-3 |173|972|94.0|3.6|2.4|0.1|6.1|21.4| +|| 0 |148|838|95.6|2.3|2.1|0.2|4.7|17.6| +|| 3 |176|909|96.3|1.7|2.1|0.3|4.1|13.6| +|| 6 |166|830|95.4|2.3|2.3|0.6|5.2|18.1| +|| 9 |170|872|95.6|3.1|1.3|0.2|4.6|16.5| +||12 |169|895|93.2|4.4|2.5|0.4|7.3|23.1| +||clean |138|739|97.0|1.5|1.5|0.4|3.4|14.5| +||reverb |177|943|95.7|1.7|2.7|0.3|4.7|16.9| +|ambient noise with visual salt and pepper noise|-12|187|995|87.1|8.8|4.0|0.9|13.8|35.8| +||-9 |193|1060|90.5|6.3|3.2|1.1|10.7|30.6| +||-6 |176|971|93.3|3.2|3.5|0.3|7.0|24.4| +||-3 |173|972|94.7|3.8|1.5|0.2|5.6|20.2| +|| 0 |148|838|95.3|2.4|2.3|0.2|4.9|18.2| +|| 3 |176|909|96.8|1.4|1.8|0.3|3.5|13.1| +|| 6 |166|830|95.9|2.2|1.9|0.7|4.8|17.5| +|| 9 |170|872|95.6|3.1|1.3|0.2|4.6|16.5| +||12 |169|895|94.7|3.5|1.8|0.3|5.6|18.9| +||clean |138|739|97.4|1.5|1.1|0.4|3.0|13.0| +||average |177|943|95.8|1.9|2.3|0.4|4.7|16.9| diff --git a/egs/lrs/avsr1/cmd.sh b/egs/lrs/avsr1/cmd.sh new file mode 100755 index 00000000000..4d70c9c7a79 --- /dev/null +++ b/egs/lrs/avsr1/cmd.sh @@ -0,0 +1,89 @@ +# ====== About run.pl, queue.pl, slurm.pl, and ssh.pl ====== +# Usage: .pl [options] JOB=1: +# e.g. +# run.pl --mem 4G JOB=1:10 echo.JOB.log echo JOB +# +# Options: +# --time