diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 979e7397012..9036a09b66d 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -151,6 +151,11 @@
we recommend using small model parameters and avoiding dynamic imports and file access. If your test requires more running time, you can annotate your test with `@pytest.mark.execution_timeout(sec)`.
- For test initialization (parameters, modules, etc.), you can use pytest fixtures. Refer to [pytest fixtures](https://docs.pytest.org/en/latest/fixture.html#using-fixtures-from-classes-modules-or-projects) for more information.

In addition, please follow the [PEP 8 convention](https://peps.python.org/pep-0008/) for the coding style and [Google's convention for docstrings](https://google.github.io/styleguide/pyguide.html#383-functions-and-methods).
Below are some specific points that should be taken care of in particular:
- [Import ordering](https://peps.python.org/pep-0008/#imports)
- Avoid writing Python 2-style code. For example, `super().__init__()` is preferred over `super(CLASS_NAME, self).__init__()` (see the sketch below).
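As a hedged illustration of these two points, here is a minimal sketch; the module and class names are invented for the example and are not part of the codebase:

```python
# PEP 8 import ordering: standard library first, then third-party packages.
import math

import torch


class ExampleEncoder(torch.nn.Module):
    """Encode input features with a single linear layer.

    Args:
        input_dim: Dimension of the input features.
    """

    def __init__(self, input_dim: int):
        # Python 3 style; preferred over super(ExampleEncoder, self).__init__().
        super().__init__()
        self.scale = math.sqrt(input_dim)
        self.linear = torch.nn.Linear(input_dim, input_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Return the scaled linear projection of ``x``."""
        return self.linear(x) / self.scale
```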
### 4.2 Bash scripts

diff --git a/egs/README.md b/egs/README.md
index 61951b84d47..f95dc5233d9 100755
--- a/egs/README.md
+++ b/egs/README.md
@@ -49,7 +49,8 @@
See: https://espnet.github.io/espnet/tutorial.html
| librispeech | LibriSpeech ASR corpus | ASR | EN | http://www.openslr.org/12 | |
| libritts | LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech | TTS | EN | http://www.openslr.org/60/ | |
| ljspeech | The LJ Speech Dataset | TTS | EN | https://keithito.com/LJ-Speech-Dataset/ | |
-| lrs | The Lip Reading Sentences Dataset | ASR/AVSR | EN | https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html | |
+| lrs2 | The Lip Reading Sentences 2 Dataset | ASR | EN | https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html | |
+| lrs | The Lip Reading Sentences 2 and 3 Datasets | AVSR | EN | https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html | |
| m_ailabs | The M-AILABS Speech Dataset | TTS | ~5 languages | https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/ |
| mucs_2021 | MUCS 2021: MUltilingual and Code-Switching ASR Challenges for Low Resource Indian Languages | ASR/Code Switching | HI, MR, OR, TA, TE, GU, HI-EN, BN-EN | https://navana-tech.github.io/MUCS2021/data.html | |
| mtedx | Multilingual TEDx | ASR/Machine Translation/Speech Translation | 13 Language pairs | http://www.openslr.org/100/ |

diff --git a/egs/lrs/README.md b/egs/lrs/README.md
new file mode 100644
index 00000000000..26f623cd08b
--- /dev/null
+++ b/egs/lrs/README.md
@@ -0,0 +1,335 @@
# ESPnet-AVSR

## Introduction
This repository contains an implementation of end-to-end (E2E) audio-visual speech recognition (AVSR) based on the ESPnet ASR toolkit. The fusion strategy follows the paper "Fusing Information Streams in End-to-End Audio-Visual Speech Recognition" (https://ieeexplore.ieee.org/document/9414553) [[1]](#literature). A broad range of reliability measures is used to help the integration model improve the performance of the AVSR system. We use two large-vocabulary datasets, the Lip Reading Sentences 2 and 3 corpora, for all our experiments. In addition, this project also contains an audio-only model for comparison.

## Table of Contents
- [Installation](#installation-of-required-packages)
  * [Requirements](#requirements)
- [Project Structure](#project-structure)
  * [Basics](#project-structure)
  * [AVSR1](#detailed-description-of-avsr1)
- [Usage of the scripts](#running-the-script)
  + [Notes](#notes)


## Installation of required packages

### Requirements

For installation, approximately 40 GB of free disk space is needed. Stage 0 of avsr1/run.sh installs all required packages in avsr1/local/installations:

**Required packages:**
1. ESPnet: https://github.com/espnet/espnet
2. OpenFace: https://github.com/TadasBaltrusaitis/OpenFace
3. DeepXi: https://github.com/anicolson/DeepXi
4. Vidaug: https://github.com/okankop/vidaug


## Project structure
The main folder, avsr1/, contains the code for the audio-visual speech recognition system, trained on the LRS2 dataset [[2]](#literature) together with the LRS3 dataset (https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html) [[3]](#literature). It follows the basic ESPnet structure. The main code for the recognition system is the run.sh script, whose workflow is executed in multiple stages:

| AVSR |
|-------------------------------------------------------------|
| Stage 0: Install required packages |
| Stage 1: Data download and preparation |
| Stage 2: Audio augmentation |
| Stage 3: MP3 files and feature generation |
| Stage 4: Dictionary and JSON data preparation |
| Stage 5: Reliability measures generation |
| Stage 6: Language model training |
| Stage 7: Training of the E2E-AVSR model and decoding |


### Detailed description of AVSR1:

##### Stage 0: Package installation
 * Install the required packages ESPnet, OpenFace, DeepXi, and Vidaug in avsr1/local/installations. To install OpenFace, you will need sudo rights.

##### Stage 1: Data preparation
 * The LRS2 dataset [2] must be downloaded in advance by yourself. To download the dataset, please visit https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html [2]. You will need to sign a data-sharing agreement with BBC Research & Development before getting access. After downloading, please edit the path.sh file and assign the dataset directory path to the DATA_DIR variable.
 * The same applies to the LRS3 dataset (https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html) [3]. After downloading, please edit the path.sh file and assign the dataset directory path to the DATALRS3_DIR variable.
 * Download the MUSAN dataset for audio data augmentation and save it under the ${MUSAN_DIR} directory.
 * Download the Room Impulse Response and Noise Database (RIRS-Noises) and save it under the RIRS_NOISES/ directory.
 * Run the audio_data_prep.sh script: it creates file lists for the given part of the dataset and prepares the Kaldi files.
 * Dump useful data for training.

##### Stage 2: Audio augmentation
 * Augment the audio data with RIRS noise.
 * Augment the audio data with MUSAN noise.
 * The augmented files are saved under data/audio/augment, whereas the clean audio files can be found in data/audio/clear for all the used datasets (Test, Validation (Val), Train, and the optional Pretrain set).

##### Stage 3: Feature generation
 * Make augmented MP3 files.
 * Generate the fbank and MFCC features for the audio signals. By default, 80-dimensional filterbanks with pitch are extracted for each frame.
 * Compute global cepstral mean and variance normalization (CMVN) statistics, which are later used to normalize the acoustic features (https://kaldi-asr.org/doc/compute-cmvn-stats_8cc.html); see the sketch after Stage 4.

##### Stage 4: Dictionary and JSON data preparation
 * Build the dictionary and prepare the data in JSON format.
 * Build a tokenizer using SentencePiece (https://github.com/google/sentencepiece); see the sketch below.
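As a rough illustration of what stages 3 and 4 run under the hood, here is a hedged sketch of the two underlying tools. The cmvn.ark path and the unigram vocabulary size of 500 match the file names listed in RESULTS.md; the remaining paths are illustrative assumptions, not the recipe's exact settings.

```console
foo@bar:~/avsr1$ # Stage 3: global CMVN statistics over the training features (Kaldi)
foo@bar:~/avsr1$ compute-cmvn-stats scp:data/train/feats.scp data/train/cmvn.ark
foo@bar:~/avsr1$ # Stage 4: unigram subword model with a 500-token vocabulary (SentencePiece)
foo@bar:~/avsr1$ spm_train --input=data/lang_char/input.txt \
      --model_prefix=data/lang_char/train_unigram500 \
      --vocab_size=500 --model_type=unigram
```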
##### Stage 5: Reliability measures generation
 * Stage 5.0: Create dump files for the MFCC features
 * Stage 5.1: Video augmentation with Gaussian blur and salt-and-pepper noise
 * Stage 5.2: Run OpenFace to detect and track the face (especially the mouth region; for further details, see the documentation in the avsr1/local folder)
 * Stage 5.3: Extract video frames
 * Stage 5.4: Estimate SNRs using the DeepXi framework
 * Stage 5.5: Extract video features with a pretrained video feature extractor [[4]](#literature)
 * Stage 5.6: Make video .ark files
 * Stage 5.7: Remake the audio and video dump files
 * Stage 5.8: Split the test decode dump files by signal-to-noise ratio

##### Stage 6: Language model training
 * Train your own language model on the LibriSpeech dataset (https://www.openslr.org/11/) or use a pretrained language model.
 * It is also possible to skip this stage and use the system without an external language model.

##### Stage 7: Network training
 * Train the audio model
 * Pretrain the video model
 * Finetune the video model
 * Pretrain the audio-visual (AV) model
 * Finetune the AV model (the model used for decoding)

##### Other important references:
 * Explanation of the OpenFace CSV output format: https://github.com/TadasBaltrusaitis/OpenFace/wiki/Output-Format#featureextraction


## Running the script
The main script is **run.sh**; it can be found in the avsr1/ directory.
> Before running the script, please download the LRS2 (https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html) [[2]](#literature) and LRS3 (https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html) [[3]](#literature) datasets by yourself and save the download paths to the variables DATA_DIR (LRS2 path) and DATALRS3_DIR (LRS3 path) inside the run.sh file.

### Notes
Due to the long runtime, it can be useful to run the script inside a screen session, monitor it in a terminal window, and redirect the output to a log file.

Screen is a terminal multiplexer, which means that you can start any number of virtual terminals inside the current terminal session. The advantage is that you can detach virtual terminals so that they run in the background. Furthermore, the processes keep running even if you close the main session or an SSH connection while working remotely on a server.
Screen can be installed from the official package repositories via
```console
foo@bar:~$ sudo apt install screen
```
As an example, to redirect the output into a file named "log_run_sh.txt", the script could be started with:
```console
foo@bar:~/avsr1$ screen bash -c 'bash run.sh |& tee -a log_run_sh.txt'
```
This starts a virtual terminal session, which executes and monitors the run.sh file. The output is printed to this session as well as saved into the file "log_run_sh.txt". You can leave the monitoring session by pressing Ctrl+A followed by D. If you want to return to the process, simply type
```console
foo@bar:~$ screen -ls
```
into a terminal to see all running screen processes with their corresponding IDs. Then execute
```console
foo@bar:~$ screen -r [ID]
```
to return to the process.

Source: https://wiki.ubuntuusers.de/Screen/
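One more practical note: if a long run is interrupted, you usually do not have to restart from scratch. Like most ESPnet recipes, run.sh is expected to parse --stage and --stop_stage options via utils/parse_options.sh; below is a hedged sketch under that assumption (check the variables at the top of run.sh to confirm).

```console
foo@bar:~/avsr1$ bash run.sh --stage 5 --stop_stage 5   # rerun only the reliability measures
foo@bar:~/avsr1$ bash run.sh --stage 7                  # resume from network training
```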
***
### Literature

[1] W. Yu, S. Zeiler and D. Kolossa, "Fusing Information Streams in End-to-End Audio-Visual Speech Recognition," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 3430-3434, doi: 10.1109/ICASSP39728.2021.9414553.

[2] T. Afouras, J. S. Chung, A. Senior, O. Vinyals and A. Zisserman, "Deep Audio-Visual Speech Recognition," arXiv preprint arXiv:1809.02108, 2018.

[3] T. Afouras, J. S. Chung and A. Zisserman, "LRS3-TED: A Large-Scale Dataset for Visual Speech Recognition," arXiv preprint arXiv:1809.00496, 2018.

[4] S. Petridis, T. Stafylakis, P. Ma, G. Tzimiropoulos and M. Pantic, "Audio-Visual Speech Recognition with a Hybrid CTC/Attention Architecture," in IEEE SLT, 2018.
diff --git a/egs/lrs/avsr1/RESULTS.md b/egs/lrs/avsr1/RESULTS.md
new file mode 100755
index 00000000000..2615db795f8
--- /dev/null
+++ b/egs/lrs/avsr1/RESULTS.md
@@ -0,0 +1,294 @@
## pretrain_Train_pytorch_audio_delta_specaug (Audio-Only)

* Model files (archived to model.tar.gz by $ pack_model.sh)
  - download link: https://drive.google.com/file/d/1ITgdZoa8vQ7lDwi1jLziYGXOyUtgE2ow/view
  - training config file: conf/train.yaml
  - decoding config file: conf/decode.yaml
  - preprocess config file: conf/specaug.yaml
  - lm config file: conf/lm.yaml
  - cmvn file: data/train/cmvn.ark
  - e2e file: exp/audio/model.last10.avg.best
  - e2e JSON file: exp/audio/model.json
  - lm file: exp/train_rnnlm_pytorch_lm_unigram500/rnnlm.model.best
  - lm JSON file: exp/train_rnnlm_pytorch_lm_unigram500/model.json
  - dict file: data/lang_char/train_unigram500_units.txt

## Environments
- date: `Mon Feb 21 11:52:07 UTC 2022`
- python version: `3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]`
- espnet version: `espnet 0.6.0`
- chainer version: `chainer 6.0.0`
- pytorch version: `pytorch 1.0.1.post2`

### CER

|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|---|
|music noise|-12|171|1669|82.0|11.2|6.8|2.2|20.3|38.6|
||-9|187|1897|87.0|8.3|4.7|0.8|13.8|33.2|
||-6|176|1821|92.0|5.5|2.5|1.1|9.1|26.7|
||-3|201|2096|94.4|2.2|3.3|0.2|5.8|20.4|
||0|158|1611|95.0|3.0|2.0|0.4|5.4|19.0|
||3|173|1710|94.7|2.7|2.6|0.4|5.7|24.9|
||6|185|1920|96.2|1.8|2.0|0.5|4.3|17.8|
||9|157|1533|97.6|1.0|1.4|0.5|2.9|13.4|
||12|150|1536|96.4|1.6|2.1|0.3|4.0|20.7|
||clean|138|1390|96.7|1.4|1.9|0.4|3.7|17.4|
||reverb|177|1755|93.7|3.6|2.7|0.7|7.0|23.2|
|ambient noise|-12|187|1873|76.4|16.3|7.3|2.3|25.9|51.9|
||-9|193|1965|84.2|10.3|5.4|1.8|17.6|40.4|
||-6|176|1883|90.2|5.8|4.0|1.3|11.2|26.1|
||-3|173|1851|91.2|4.8|4.0|1.0|9.8|32.9|
||0|148|1470|94.8|3.0|2.2|0.7|5.9|23.6|
||3|176|1718|96.0|2.1|1.9|0.3|4.3|17.0|
||6|166|1714|93.7|2.9|3.4|0.5|6.8|20.5|
||9|170|1601|96.9|1.5|1.6|0.3|3.4|18.2|
||12|169|1718|95.9|2.5|1.6|0.2|4.3|20.1|
||clean|138|1390|96.7|1.4|1.9|0.4|3.7|17.4|
||reverb|177|1755|93.7|3.6|2.7|0.7|7.0|23.2|

### WER

|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|---|
|music noise|-12|171|912|83.4|12.5|4.1|2.4|19.0|38.6|
||-9|187|1005|87.6|8.6|3.9|1.9|14.3|33.2|
||-6|176|951|90.6|5.9|3.5|0.8|10.2|26.7|
||-3|201|1097|94.4|3.3|2.3|0.6|6.2|20.4|
||0|158|847|94.9|3.2|1.9|0.4|5.4|19.0|
||3|173|884|94.2|3.8|1.9|0.6|6.3|24.9|
||6|185|997|96.3|2.7|1.0|0.7|4.4|17.8|
||9|157|817|96.9|1.7|1.3|0.4|3.4|13.4|
||12|150|832|95.2|2.9|1.9|0.5|5.3|20.7|
||clean|138|739|95.7|2.4|1.9|0.4|4.7|17.4|
||reverb|177|943|93.6|4.0|2.3|0.4|6.8|23.2|
|ambient noise|-12|187|995|73.7|18.4|7.9|1.7|28.0|51.9|
||-9|193|1060|83.0|11.7|5.3|1.4|18.4|40.4|
||-6|176|971|90.2|6.8|3.0|1.4|11.2|26.1|
||-3|173|972|90.0|6.9|3.1|1.0|11.0|32.9|
||0|148|838|94.0|4.1|1.9|0.4|6.3|23.6|
||3|176|909|95.5|2.9|1.7|0.3|4.8|17.0|
||6|166|830|94.1|3.3|2.7|1.0|6.9|20.5|
||9|170|872|95.4|3.1|1.5|0.2|4.8|18.2|
||12|169|895|95.0|4.0|1.0|0.2|5.3|20.1|
||clean|138|739|95.7|2.4|1.9|0.4|4.7|17.4|
||reverb|177|943|93.6|4.0|2.3|0.4|6.8|23.2|
## Train_pytorch_trainvideo_delta_specaug (Video-Only)

* Model files (archived to model.tar.gz by $ pack_model.sh)
  - download link: https://drive.google.com/file/d/1ZXXCXSbbFS2PDlrs9kbJL9pE6-5nPPxi/view
  - training config file: conf/finetunevideo/trainvideo.yaml
  - decoding config file: conf/decode.yaml
  - preprocess config file: conf/specaug.yaml
  - lm config file: conf/lm.yaml
  - e2e file: exp/vfintune/model.last10.avg.best
  - e2e JSON file: exp/vfintune/model.json
  - lm file: exp/train_rnnlm_pytorch_lm_unigram500/rnnlm.model.best
  - lm JSON file: exp/train_rnnlm_pytorch_lm_unigram500/model.json
  - dict file: data/lang_char/train_unigram500_units.txt

## Environments
- date: `Mon Feb 21 11:52:07 UTC 2022`
- python version: `3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]`
- espnet version: `espnet 0.6.0`
- chainer version: `chainer 6.0.0`
- pytorch version: `pytorch 1.0.1.post2`


### CER

|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|---|
|clean visual data|-12|171|1669|42.3|42.5|15.2|6.4|64.1|91.8|
||-9|187|1897|46.4|38.8|14.8|8.5|62.2|90.9|
||-6|176|1821|48.1|37.7|14.2|9.2|61.1|92.0|
||-3|201|2096|41.7|46.4|11.9|8.9|67.2|90.0|
||0|158|1611|43.4|42.6|14.0|7.1|63.7|94.9|
||3|173|1710|49.2|37.6|13.2|8.9|59.7|91.9|
||6|185|1920|39.3|45.6|15.2|9.4|70.2|95.1|
||9|157|1533|46.2|39.1|14.7|8.5|62.3|89.2|
||12|150|1536|49.5|37.6|12.9|7.2|57.7|87.3|
||clean|138|1390|44.2|42.3|13.5|7.8|63.7|92.8|
||reverb|177|1755|44.8|41.5|13.6|7.5|62.7|92.1|
|visual Gaussian blur|-12|187|1873|37.3|46.6|16.1|9.0|71.6|93.0|
||-9|193|1965|43.0|44.1|13.0|11.0|68.1|93.8|
||-6|176|1883|39.9|43.3|16.7|7.5|67.6|93.8|
||-3|173|1851|43.7|43.8|12.5|8.2|64.5|91.9|
||0|148|1470|42.3|45.4|12.3|8.2|65.9|93.9|
||3|176|1718|44.8|41.5|13.7|7.9|63.1|89.2|
||6|166|1714|38.5|45.4|16.0|10.7|72.2|94.6|
||9|170|1601|45.1|42.8|12.1|11.7|66.6|91.2|
||12|169|1718|42.0|40.1|17.9|8.2|66.2|92.3|
||clean|138|1390|40.4|45.5|14.2|8.7|68.3|93.5|
||reverb|177|1755|40.2|45.6|14.2|8.5|68.3|92.7|
|visual salt and pepper noise|-12|187|1873|36.2|48.1|15.8|9.9|73.7|92.0|
||-9|193|1965|41.7|44.6|13.7|10.6|68.9|92.7|
||-6|176|1883|36.5|47.2|16.4|8.6|72.1|93.2|
||-3|173|1851|42.1|45.4|12.5|10.8|68.6|92.5|
||0|148|1470|42.3|45.1|12.6|9.5|67.2|91.9|
||3|176|1718|40.0|45.1|15.0|7.6|67.6|92.0|
||6|166|1714|38.1|45.2|16.7|10.1|72.0|94.0|
||9|170|1601|40.2|45.9|13.9|12.0|71.8|92.9|
||12|169|1718|37.5|46.8|15.7|8.7|71.2|94.1|
||clean|138|1390|39.9|46.0|14.0|9.1|69.1|92.8|
||reverb|177|1755|39.9|46.2|13.9|9.1|69.2|92.7|

### WER

|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|---|
|clean visual data|-12|171|912|39.4|42.7|18.0|4.3|64.9|89.5|
||-9|187|1005|43.7|40.6|15.7|5.4|61.7|86.1|
||-6|176|951|43.3|42.6|14.1|4.1|60.8|88.6|
||-3|201|1097|41.3|44.2|14.5|5.3|64.0|85.6|
||0|158|847|44.3|37.8|17.9|6.1|61.9|85.4|
||3|173|884|44.2|39.7|16.1|5.3|61.1|84.4|
||6|185|997|38.2|44.8|17.0|3.9|65.7|84.9|
||9|157|817|47.9|37.1|15.1|5.5|57.6|80.3|
||12|150|832|42.9|37.6|19.5|5.3|62.4|84.0|
||clean|138|739|45.9|39.1|15.0|5.3|59.4|85.5|
||reverb|177|943|43.4|40.5|16.1|5.3|61.9|85.9|
|visual Gaussian blur|-12|187|995|35.9|45.4|18.7|5.3|69.4|86.6|
||-9|193|1060|35.0|44.2|20.8|5.0|70.0|92.2|
||-6|176|971|38.2|43.2|18.6|4.6|66.4|87.5|
||-3|173|972|37.9|45.5|16.7|4.8|67.0|86.1|
||0|148|838|38.1|40.7|21.2|4.2|66.1|89.2|
||3|176|909|36.0|48.5|15.5|5.9|70.0|88.6|
||6|166|830|36.7|46.6|16.6|6.1|69.4|89.8|
||9|170|872|39.0|45.5|15.5|4.7|65.7|87.6|
||12|169|895|35.2|46.8|18.0|4.6|69.4|89.9|
||clean|138|739|40.7|42.2|17.1|5.0|64.3|88.4|
||reverb|177|943|38.0|44.3|17.7|5.0|67.0|89.3|
|visual salt and pepper noise|-12|187|995|32.5|48.9|18.6|4.6|72.2|83.4|
||-9|193|1060|32.3|51.5|16.2|6.1|73.9|92.2|
||-6|176|971|36.5|47.3|16.3|7.2|70.8|86.4|
||-3|173|972|35.5|47.2|17.3|4.6|69.1|88.4|
||0|148|838|36.9|41.5|21.6|3.7|66.8|88.5|
||3|176|909|33.0|51.9|15.1|5.4|72.4|88.6|
||6|166|830|35.3|49.9|14.8|8.8|73.5|88.0|
||9|170|872|41.2|43.3|15.5|5.6|64.4|84.7|
||12|169|895|34.2|47.8|18.0|7.3|73.1|91.1|
||clean|138|739|37.5|47.8|14.7|7.3|69.8|86.2|
||reverb|177|943|35.9|47.9|16.1|6.7|70.7|87.0|

## Train_pytorch_trainavs_delta_specaug (Audio-Visual)

* Model files (archived to model.tar.gz by $ pack_model.sh)
  - download link: https://drive.google.com/file/d/1ZXXCXSbbFS2PDlrs9kbJL9pE6-5nPPxi/view
  - training config file: conf/finetuneav/trainavs.yaml
  - decoding config file: conf/decode.yaml
  - preprocess config file: conf/specaug.yaml
  - lm config file: conf/lm.yaml
  - cmvn file: data/train/cmvn.ark
  - e2e file: exp/avfintune/model.last10.avg.best
  - e2e JSON file: exp/avfintune/model.json
  - lm file: exp/train_rnnlm_pytorch_lm_unigram500/rnnlm.model.best
  - lm JSON file: exp/train_rnnlm_pytorch_lm_unigram500/model.json
  - dict file: data/lang_char/train_unigram500_units.txt

## Environments
- date: `Mon Feb 21 11:52:07 UTC 2022`
- python version: `3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]`
- espnet version: `espnet 0.6.0`
- chainer version: `chainer 6.0.0`
- pytorch version: `pytorch 1.0.1.post2`


### CER

|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|---|
|music noise with clean visual data|-12|171|1669|90.7|5.4|3.9|0.7|9.9|26.3|
||-9|187|1897|93.7|3.5|2.7|0.4|6.7|25.1|
||-6|176|1821|95.1|2.9|2.0|0.4|5.4|18.8|
||-3|201|2096|96.2|1.6|2.2|0.3|4.2|15.9|
||0|158|1611|96.4|1.9|1.7|0.2|3.8|13.9|
||3|173|1710|96.7|1.7|1.6|0.2|3.6|17.9|
||6|185|1920|96.1|1.6|2.2|0.5|4.3|18.9|
||9|157|1533|96.9|1.4|1.7|0.5|3.6|14.0|
||12|150|1536|96.5|1.4|2.1|0.5|4.0|21.3|
||clean|138|1390|97.9|0.9|1.2|0.2|2.3|13.8|
||reverb|177|1755|96.8|1.5|1.8|0.2|3.5|16.4|
|ambient noise with clean visual data|-12|187|1873|89.6|5.8|4.6|1.2|11.5|31.0|
||-9|193|1965|91.2|5.0|3.8|0.9|9.6|29.0|
||-6|176|1883|94.3|1.9|3.8|0.3|6.0|21.0|
||-3|173|1851|94.8|2.7|2.5|0.9|6.1|22.0|
||0|148|1470|96.3|1.6|2.0|0.1|3.8|16.9|
||3|176|1718|97.7|1.5|0.8|0.1|2.4|12.5|
||6|166|1714|96.6|1.6|1.8|0.2|3.6|16.3|
||9|170|1601|97.0|1.6|1.4|0.3|3.3|17.1|
||12|169|1718|95.4|2.6|2.0|0.1|4.7|20.7|
||clean|138|1390|97.9|0.9|1.2|0.2|2.3|13.8|
||reverb|177|1755|96.8|1.5|1.8|0.2|3.5|16.4|
|ambient noise with visual Gaussian blur|-12|187|1873|86.9|7.3|5.8|1.1|14.2|35.8|
||-9|193|1965|91.1|5.4|3.5|1.0|9.9|30.1|
||-6|176|1883|93.3|2.7|4.0|0.3|7.0|24.4|
||-3|173|1851|95.1|2.5|2.4|0.8|5.7|21.4|
||0|148|1470|96.3|1.6|2.1|0.1|3.8|17.6|
||3|176|1718|97.3|1.6|1.2|0.2|2.9|13.6|
||6|166|1714|96.2|1.8|2.0|0.2|4.0|18.1|
||9|170|1601|97.0|1.4|1.6|0.2|3.2|16.5|
||12|169|1718|94.9|2.8|2.3|0.3|5.4|23.1|
||clean|138|1390|97.8|0.9|1.3|0.2|2.4|14.5|
||reverb|177|1755|96.5|1.5|2.1|0.2|3.7|16.9|
|ambient noise with visual salt and pepper noise|-12|187|1873|87.6|7.0|5.4|1.3|13.8|35.8|
||-9|193|1965|91.0|5.8|3.2|1.3|10.3|30.6|
||-6|176|1883|93.6|2.0|4.4|0.4|6.9|24.4|
||-3|173|1851|95.6|2.9|1.6|0.8|5.2|20.2|
||0|148|1470|95.9|1.9|2.2|0.1|4.2|18.2|
||3|176|1718|98.0|1.0|1.0|0.3|2.3|13.1|
||6|166|1714|96.4|1.8|1.8|0.2|3.7|17.5|
||9|170|1601|97.0|1.4|1.6|0.4|3.4|16.5|
||12|169|1718|96.2|2.2|1.6|0.2|4.1|18.9|
||clean|138|1390|98.1|0.9|1.1|0.2|2.2|13.0|
||reverb|177|1755|96.6|1.5|1.9|0.2|3.6|16.9|

### WER

|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|---|
|music noise with clean visual data|-12|171|912|91.2|6.0|2.7|1.5|10.3|26.3|
||-9|187|1005|93.2|4.5|2.3|0.4|7.2|25.1|
||-6|176|951|94.1|3.7|2.2|0.3|6.2|18.8|
||-3|201|1097|95.2|2.7|2.1|0.4|5.2|15.9|
||0|158|847|96.7|2.2|1.1|0.4|3.7|13.9|
||3|173|884|95.6|2.6|1.8|0.3|4.8|17.9|
||6|185|997|95.5|2.3|2.2|0.7|5.2|18.9|
||9|157|817|96.2|2.1|1.7|0.7|4.5|14.0|
||12|150|832|95.1|2.4|2.5|0.2|5.2|21.3|
||clean|138|739|97.2|1.5|1.4|0.4|3.2|13.8|
||reverb|177|943|96.0|1.8|2.2|0.3|4.3|16.4|
|ambient noise with clean visual data|-12|187|995|90.4|6.9|2.7|1.1|10.8|31.0|
||-9|193|1060|91.3|5.6|3.1|1.4|10.1|29.0|
||-6|176|971|94.4|2.9|2.7|0.3|5.9|21.0|
||-3|173|972|93.7|3.7|2.6|0.1|6.4|22.0|
||0|148|838|95.7|2.0|2.3|0.1|4.4|16.9|
||3|176|909|97.0|1.5|1.4|0.3|3.3|12.5|
||6|166|830|96.0|1.9|2.0|0.6|4.6|16.3|
||9|170|872|95.6|3.4|0.9|0.2|4.6|17.1|
||12|169|895|94.0|3.7|2.3|0.4|6.5|20.7|
||clean|138|739|97.2|1.5|1.4|0.4|3.2|13.8|
||reverb|177|943|96.0|1.8|2.2|0.3|4.3|16.4|
|ambient noise with visual Gaussian blur|-12|187|995|87.0|9.1|3.8|1.0|14.0|35.8|
||-9|193|1060|90.6|6.2|3.2|1.1|10.6|30.1|
||-6|176|971|93.2|3.6|3.2|0.3|7.1|24.4|
||-3|173|972|94.0|3.6|2.4|0.1|6.1|21.4|
||0|148|838|95.6|2.3|2.1|0.2|4.7|17.6|
||3|176|909|96.3|1.7|2.1|0.3|4.1|13.6|
||6|166|830|95.4|2.3|2.3|0.6|5.2|18.1|
||9|170|872|95.6|3.1|1.3|0.2|4.6|16.5|
||12|169|895|93.2|4.4|2.5|0.4|7.3|23.1|
||clean|138|739|97.0|1.5|1.5|0.4|3.4|14.5|
||reverb|177|943|95.7|1.7|2.7|0.3|4.7|16.9|
|ambient noise with visual salt and pepper noise|-12|187|995|87.1|8.8|4.0|0.9|13.8|35.8|
||-9|193|1060|90.5|6.3|3.2|1.1|10.7|30.6|
||-6|176|971|93.3|3.2|3.5|0.3|7.0|24.4|
||-3|173|972|94.7|3.8|1.5|0.2|5.6|20.2|
||0|148|838|95.3|2.4|2.3|0.2|4.9|18.2|
||3|176|909|96.8|1.4|1.8|0.3|3.5|13.1|
||6|166|830|95.9|2.2|1.9|0.7|4.8|17.5|
||9|170|872|95.6|3.1|1.3|0.2|4.6|16.5|
||12|169|895|94.7|3.5|1.8|0.3|5.6|18.9|
||clean|138|739|97.4|1.5|1.1|0.4|3.0|13.0|
||reverb|177|943|95.8|1.9|2.3|0.4|4.7|16.9|

diff --git a/egs/lrs/avsr1/cmd.sh b/egs/lrs/avsr1/cmd.sh
new file mode 100755
index 00000000000..4d70c9c7a79
--- /dev/null
+++ b/egs/lrs/avsr1/cmd.sh
@@ -0,0 +1,89 @@
# ====== About run.pl, queue.pl, slurm.pl, and ssh.pl ======
# Usage: <cmd>.pl [options] JOB=1:<nj> <log> <command...>
# e.g.
#  run.pl --mem 4G JOB=1:10 echo.JOB.log echo JOB
#
# Options:
#    --time