+The avsr1/ directory contains the code for the audio-visual speech recognition system, also trained on the LRS2 dataset (https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html) [[2]](#literature) together with the LRS3 dataset (https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html) [[3]](#literature). It follows the basic ESPnet structure. The main code for the recognition system is the run.sh script, which performs the workflow of the system in multiple stages:
+
+| AVSR |
+|-------------------------------------------------------------|
+| Stage 0: Install required packages |
+| Stage 1: Data Download and preparation |
+| Stage 2: Audio augmentation |
+| Stage 3: MP3 files and Feature Generation |
+| Stage 4: Dictionary and JSON data preparation |
+| Stage 5: Reliability measures generation |
+| Stage 6: Language model training |
+| Stage 7: Training of the E2E-AVSR model and Decoding |
+
+### Detailed description of AVSR1:
+
+##### Stage 0: Package installation
+ * Install the required packages ESPnet, OpenFace, DeepXi, and Vidaug using the scripts in avsr1/local/installations. To install OpenFace, you will need sudo rights.
+
+##### Stage 1: Data preparation
+ * The LRS2 dataset [[2]](#literature) must be downloaded in advance by yourself. To download the dataset, please visit https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html [[2]](#literature). You will need to sign a data-sharing agreement with BBC Research & Development before getting access. After downloading, please edit the path.sh file and assign the dataset directory path to the DATA_DIR variable
+ * The same applies to the LRS3 dataset (https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html) [[3]](#literature). After downloading, please edit the path.sh file and assign the dataset directory path to the DATALRS3_DIR variable
+ * Download the MUSAN dataset for audio data augmentation and save it under the ${MUSAN_DIR} directory
+ * Download the Room Impulse Response and Noise Database (RIRS-Noises) and save it under the RIRS_NOISES/ directory
+ * Run the audio_data_prep.sh script, which creates file lists for the given part of the dataset and prepares the Kaldi files
+ * Dump useful data for training
+
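For illustration, the relevant variable assignments in path.sh might look like this (the paths below are hypothetical placeholders, not part of the recipe):

```shell
# Hypothetical dataset locations -- replace with your own download paths.
DATA_DIR=/data/corpora/LRS2        # LRS2 dataset root
DATALRS3_DIR=/data/corpora/LRS3    # LRS3 dataset root
MUSAN_DIR=/data/corpora/musan      # MUSAN noise corpus
```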
+##### Stage 2: Audio Augmentation
+ * Augment the audio data with RIRS Noise
+ * Augment the audio data with Musan Noise
+ * The augmented files are saved under data/audio/augment, whereas the clean audio files can be found in data/audio/clear, for all used dataset parts (Test, Validation (Val), Train, and optional Pretrain)
+
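Conceptually, adding noise at a chosen SNR means scaling the noise so that the resulting signal-to-noise power ratio matches the target. A plain-Python sketch of the idea (illustrative only, not the recipe's actual augmentation code):

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that mixing it with `speech` yields the target SNR."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Required noise scaling for the target SNR: P_s / (scale^2 * P_n) = 10^(SNR/10)
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]

speech = [math.sin(0.1 * t) for t in range(1000)]          # toy "speech" signal
noise = [((t * 1103515245 + 12345) % 2**31) / 2**30 - 1    # crude deterministic pseudo-noise
         for t in range(1000)]
mixed = mix_at_snr(speech, noise, snr_db=0)                # mix at 0 dB SNR
```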
+##### Stage 3: Feature Generation
+ * Make augmented MP3 files
+ * Generate the filterbank (fbank) and MFCC features for the audio signals. By default, 80-dimensional filterbanks with pitch are extracted for each frame
+ * Compute global cepstral mean and variance normalization (CMVN) statistics, which are used to normalize the acoustic features (https://kaldi-asr.org/doc/compute-cmvn-stats_8cc.html)
+
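Conceptually, global CMVN shifts and scales each feature dimension to zero mean and unit variance, with statistics accumulated over the whole training set. A minimal plain-Python sketch (not Kaldi's implementation):

```python
import math

def global_cmvn(features):
    """Normalize each feature dimension to zero mean and unit variance.

    `features` is a list of frames; each frame is a list of feature values.
    """
    n_frames = len(features)
    n_dims = len(features[0])
    means = [sum(f[d] for f in features) / n_frames for d in range(n_dims)]
    variances = [sum((f[d] - means[d]) ** 2 for f in features) / n_frames
                 for d in range(n_dims)]
    stds = [math.sqrt(v) if v > 0 else 1.0 for v in variances]
    return [[(f[d] - means[d]) / stds[d] for d in range(n_dims)]
            for f in features]

frames = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # 3 frames, 2 feature dims
normalized = global_cmvn(frames)
```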
+##### Stage 4: Dictionary and JSON data preparation
+ * Build the dictionary and prepare the data in JSON format
+ * Build a tokenizer using SentencePiece: https://github.com/google/sentencepiece
+
+##### Stage 5: Reliability measures generation
+ * Stage 5.0: Create dump files for the MFCC features
+ * Stage 5.1: Video augmentation with Gaussian blur and salt-and-pepper noise
+ * Stage 5.2: OpenFace face detection and landmark extraction (especially of the mouth region; for further details, see the documentation in the avsr1/local folder)
+ * Stage 5.3: Extract video frames
+ * Stage 5.4: Estimate SNRs using the DeepXi framework
+ * Stage 5.5: Extract video features with a pretrained video feature extractor [[4]](#literature)
+ * Stage 5.6: Make video .ark files
+ * Stage 5.7: Remake audio and video dump files
+ * Stage 5.8: Split test decode dump files by different signal-to-noise ratios
+
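The SNR values estimated in Stage 5.4 (and used to split the test data in Stage 5.8) are on the decibel scale: the ratio of signal power to noise power. DeepXi estimates this from the noisy input alone; the sketch below assumes the clean signal and noise components are both known, which is only the case for simulated data:

```python
import math

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB for known signal and noise components."""
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(x * x for x in noise) / len(noise)
    return 10 * math.log10(p_signal / p_noise)

sig = [10.0] * 100   # constant signal, power 100
noi = [1.0] * 100    # constant noise, power 1
ratio = snr_db(sig, noi)  # 10 * log10(100) = 20.0 dB
```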
+##### Stage 6: Language Model Training
+ * Train your own language model on the LibriSpeech dataset (https://www.openslr.org/11/) or use a pretrained language model
+ * It is also possible to skip this stage and use the system without an external language model
+
+##### Stage 7: Network Training
+ * Train the audio model
+ * Pretrain the video model
+ * Fine-tune the video model
+ * Pretrain the audio-visual (AV) model
+ * Fine-tune the AV model (the model used for decoding)
+
+##### Other important references:
+ * Explanation of the CSV-file for OpenFace: https://github.com/TadasBaltrusaitis/OpenFace/wiki/Output-Format#featureextraction
+
+
+## Running the script
+The runtime script is **run.sh**. It can be found in the avsr1/ directory.
+> Before running the script, please download the LRS2 (https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html) [[2]](#literature) and LRS3 (https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html) [[3]](#literature) datasets yourself and save the download paths to the variables DATA_DIR (LRS2 path) and DATALRS3_DIR (LRS3 path) inside the run.sh file.
+
+### Notes
+Due to the long runtime, it can be useful to run the script with the screen command, monitor it in a terminal window, and redirect the output to a log file.
+
+Screen is a terminal multiplexer, which means that you can start any number of virtual terminals inside the current terminal session. The advantage is that you can detach virtual terminals so that they run in the background. The processes keep running even if you close the main session or an SSH connection while working remotely on a server.
+Screen can be installed from the official package repositories via
+```console
+foo@bar:~$ sudo apt install screen
+```
+As an example, to redirect the output into a file named "log_run_sh.txt", the script could be started with:
+```console
+foo@bar:~/avsr1$ screen bash -c 'bash run.sh |& tee -a log_run_sh.txt'
+```
+This will start a virtual terminal session which executes and monitors run.sh. The output is printed to this session as well as saved to the file "log_run_sh.txt". You can detach from the monitoring session by pressing Ctrl+A followed by D. If you want to return to the process, simply type
+```console
+foo@bar:~$ screen -ls
+```
+into a terminal to see all running screen processes with their corresponding ID. Then execute
+```console
+foo@bar:~$ screen -r [ID]
+```
+to return to the process.
+Source: https://wiki.ubuntuusers.de/Screen/
+
+***
+### Literature
+
+[1] W. Yu, S. Zeiler and D. Kolossa, "Fusing Information Streams in End-to-End Audio-Visual Speech Recognition," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 3430-3434, doi: 10.1109/ICASSP39728.2021.9414553.
+
+[2] T. Afouras, J. S. Chung, A. Senior, O. Vinyals and A. Zisserman, "Deep Audio-Visual Speech Recognition," in IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
+
+[3] T. Afouras, J. S. Chung and A. Zisserman, "LRS3-TED: a large-scale dataset for visual speech recognition," arXiv preprint arXiv:1809.00407, 2018.
+
+## Train_pytorch_train_delta_specaug (Audio-Only)
+
+* Model files (archived to model.tar.gz by $ pack_model.sh)
+ - download link: https://drive.google.com/file/d/1ITgdZoa8vQ7lDwi1jLziYGXOyUtgE2ow/view
+ - training config file: conf/train.yaml
+ - decoding config file: conf/decode.yaml
+ - preprocess config file: conf/specaug.yaml
+ - lm config file: conf/lm.yaml
+ - cmvn file: data/train/cmvn.ark
+ - e2e file: exp/audio/model.last10.avg.best
+ - e2e json file: exp/audio/model.json
+ - lm file: exp/train_rnnlm_pytorch_lm_unigram500/rnnlm.model.best
+ - lm JSON file: exp/train_rnnlm_pytorch_lm_unigram500/model.json
+ - dict file: data/lang_char/train_unigram500_units.txt
+
+## Environments
+- date: `Mon Feb 21 11:52:07 UTC 2022`
+- python version: `3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]`
+- espnet version: `espnet 0.6.0`
+- chainer version: `chainer 6.0.0`
+- pytorch version: `pytorch 1.0.1.post2`
+
+### CER
+
+|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|---|
+|music noise|-12|171|1669|82.0|11.2|6.8|2.2|20.3|38.6|
+||-9|187|1897|87.0|8.3|4.7|0.8|13.8|33.2|
+||-6|176|1821|92.0|5.5|2.5|1.1|9.1|26.7|
+||-3|201|2096|94.4|2.2|3.3|0.2|5.8|20.4|
+||0|158|1611|95.0|3.0|2.0|0.4|5.4|19.0|
+||3|173|1710|94.7|2.7|2.6|0.4|5.7|24.9|
+||6|185|1920|96.2|1.8|2.0|0.5|4.3|17.8|
+||9|157|1533|97.6|1.0|1.4|0.5|2.9|13.4|
+||12|150|1536|96.4|1.6|2.1|0.3|4.0|20.7|
+||clean|138|1390|96.7|1.4|1.9|0.4|3.7|17.4|
+||reverb|177|1755|93.7|3.6|2.7|0.7|7.0|23.2|
+|ambient noise|-12|187|1873|76.4|16.3|7.3|2.3|25.9|51.9|
+||-9 |193|1965|84.2|10.3|5.4|1.8|17.6|40.4|
+||-6 |176|1883|90.2|5.8|4.0|1.3|11.2|26.1|
+||-3 |173|1851|91.2|4.8|4.0|1.0|9.8|32.9|
+|| 0 |148|1470|94.8|3.0|2.2|0.7|5.9|23.6|
+|| 3 |176|1718|96.0|2.1|1.9|0.3|4.3|17.0|
+|| 6 |166|1714|93.7|2.9|3.4|0.5|6.8|20.5|
+|| 9 |170|1601|96.9|1.5|1.6|0.3|3.4|18.2|
+||12 |169|1718|95.9|2.5|1.6|0.2|4.3|20.1|
+||clean |138|1390|96.7|1.4|1.9|0.4|3.7|17.4|
+||reverb |177|1755|93.7|3.6|2.7|0.7|7.0|23.2|
+
+### WER
+
+|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|---|
+|music noise|-12|171|912|83.4|12.5|4.1|2.4|19.0|38.6|
+||-9 |187|1005|87.6|8.6|3.9|1.9|14.3|33.2|
+||-6 |176|951|90.6|5.9|3.5|0.8|10.2|26.7|
+||-3 |201|1097|94.4|3.3|2.3|0.6|6.2|20.4|
+|| 0 |158|847|94.9|3.2|1.9|0.4|5.4|19.0|
+|| 3 |173|884|94.2|3.8|1.9|0.6|6.3|24.9|
+|| 6 |185|997|96.3|2.7|1.0|0.7|4.4|17.8|
+|| 9 |157|817|96.9|1.7|1.3|0.4|3.4|13.4|
+||12 |150|832|95.2|2.9|1.9|0.5|5.3|20.7|
+||clean |138|739|95.7|2.4|1.9|0.4|4.7|17.4|
+||reverb |177|943|93.6|4.0|2.3|0.4|6.8|23.2|
+|ambient noise|-12|187|995|73.7|18.4|7.9|1.7|28.0|51.9|
+||-9 |193|1060|83.0|11.7|5.3|1.4|18.4|40.4|
+||-6 |176|971|90.2|6.8|3.0|1.4|11.2|26.1|
+||-3 |173|972|90.0|6.9|3.1|1.0|11.0|32.9|
+|| 0 |148|838|94.0|4.1|1.9|0.4|6.3|23.6|
+|| 3 |176|909|95.5|2.9|1.7|0.3|4.8|17.0|
+|| 6 |166|830|94.1|3.3|2.7|1.0|6.9|20.5|
+|| 9 |170|872|95.4|3.1|1.5|0.2|4.8|18.2|
+||12 |169|895|95.0|4.0|1.0|0.2|5.3|20.1|
+||clean |138|739|95.7|2.4|1.9|0.4|4.7|17.4|
+||reverb |177|943|93.6|4.0|2.3|0.4|6.8|23.2|
+
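The Corr/Sub/Del/Ins columns in the tables above come from a Levenshtein alignment between reference and hypothesis: Err is the sum of the substitution, deletion, and insertion rates, and S.Err is the fraction of sentences containing at least one error. A minimal sketch of the underlying word error rate computation (illustrative, not the scoring script used by the recipe):

```python
def word_error_rate(reference, hypothesis):
    """WER via dynamic-programming edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j]: edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub,                  # substitution or match
                             dist[i - 1][j] + 1,   # deletion
                             dist[i][j - 1] + 1)   # insertion
    return dist[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
```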
+## Train_pytorch_trainvideo_delta_specaug (Video-Only)
+
+* Model files (archived to model.tar.gz by $ pack_model.sh)
+ - download link: https://drive.google.com/file/d/1ZXXCXSbbFS2PDlrs9kbJL9pE6-5nPPxi/view
+ - training config file: conf/finetunevideo/trainvideo.yaml
+ - decoding config file: conf/decode.yaml
+ - preprocess config file: conf/specaug.yaml
+ - lm config file: conf/lm.yaml
+ - e2e file: exp/vfintune/model.last10.avg.best
+ - e2e json file: exp/vfintune/model.json
+ - lm file: exp/train_rnnlm_pytorch_lm_unigram500/rnnlm.model.best
+ - lm JSON file: exp/train_rnnlm_pytorch_lm_unigram500/model.json
+ - dict file: data/lang_char/train_unigram500_units.txt
+
+## Environments
+- date: `Mon Feb 21 11:52:07 UTC 2022`
+- python version: `3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]`
+- espnet version: `espnet 0.6.0`
+- chainer version: `chainer 6.0.0`
+- pytorch version: `pytorch 1.0.1.post2`
+
+
+### CER
+
+|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|---|
+|clean visual data|-12|171|1669|42.3|42.5|15.2|6.4|64.1|91.8|
+||-9 |187|1897|46.4|38.8|14.8|8.5|62.2|90.9|
+||-6 |176|1821|48.1|37.7|14.2|9.2|61.1|92.0|
+||-3 |201|2096|41.7|46.4|11.9|8.9|67.2|90.0|
+|| 0 |158|1611|43.4|42.6|14.0|7.1|63.7|94.9|
+|| 3 |173|1710|49.2|37.6|13.2|8.9|59.7|91.9|
+|| 6 |185|1920|39.3|45.6|15.2|9.4|70.2|95.1|
+|| 9 |157|1533|46.2|39.1|14.7|8.5|62.3|89.2|
+||12 |150|1536|49.5|37.6|12.9|7.2|57.7|87.3|
+||clean |138|1390|44.2|42.3|13.5|7.8|63.7|92.8|
+||reverb |177|1755|44.8|41.5|13.6|7.5|62.7|92.1|
+|visual gaussian blur|-12|187|1873|37.3|46.6|16.1|9.0|71.6|93.0|
+||-9 |193|1965|43.0|44.1|13.0|11.0|68.1|93.8|
+||-6 |176|1883|39.9|43.3|16.7|7.5|67.6|93.8|
+||-3 |173|1851|43.7|43.8|12.5|8.2|64.5|91.9|
+|| 0 |148|1470|42.3|45.4|12.3|8.2|65.9|93.9|
+|| 3 |176|1718|44.8|41.5|13.7|7.9|63.1|89.2|
+|| 6 |166|1714|38.5|45.4|16.0|10.7|72.2|94.6|
+|| 9 |170|1601|45.1|42.8|12.1|11.7|66.6|91.2|
+||12 |169|1718|42.0|40.1|17.9|8.2|66.2|92.3|
+||clean |138|1390|40.4|45.5|14.2|8.7|68.3|93.5|
+||reverb |177|1755|40.2|45.6|14.2|8.5|68.3|92.7|
+|visual salt and pepper noise|-12|187|1873|36.2|48.1|15.8|9.9|73.7|92.0|
+||-9 |193|1965|41.7|44.6|13.7|10.6|68.9|92.7|
+||-6 |176|1883|36.5|47.2|16.4|8.6|72.1|93.2|
+||-3 |173|1851|42.1|45.4|12.5|10.8|68.6|92.5|
+|| 0 |148|1470|42.3|45.1|12.6|9.5|67.2|91.9|
+|| 3 |176|1718|40.0|45.1|15.0|7.6|67.6|92.0|
+|| 6 |166|1714|38.1|45.2|16.7|10.1|72.0|94.0|
+|| 9 |170|1601|40.2|45.9|13.9|12.0|71.8|92.9|
+||12 |169|1718|37.5|46.8|15.7|8.7|71.2|94.1|
+||clean |138|1390|39.9|46.0|14.0|9.1|69.1|92.8|
+||reverb |177|1755|39.9|46.2|13.9|9.1|69.2|92.7|
+
+### WER
+
+|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|---|
+|clean visual data|-12|171|912|39.4|42.7|18.0|4.3|64.9|89.5|
+||-9 |187|1005|43.7|40.6|15.7|5.4|61.7|86.1|
+||-6 |176|951|43.3|42.6|14.1|4.1|60.8|88.6|
+||-3 |201|1097|41.3|44.2|14.5|5.3|64.0|85.6|
+|| 0 |158|847|44.3|37.8|17.9|6.1|61.9|85.4|
+|| 3 |173|884|44.2|39.7|16.1|5.3|61.1|84.4|
+|| 6 |185|997|38.2|44.8|17.0|3.9|65.7|84.9|
+|| 9 |157|817|47.9|37.1|15.1|5.5|57.6|80.3|
+||12 |150|832|42.9|37.6|19.5|5.3|62.4|84.0|
+||clean |138|739|45.9|39.1|15.0|5.3|59.4|85.5|
+||reverb |177|943|43.4|40.5|16.1|5.3|61.9|85.9|
+|visual Gaussian blur|-12|187|995|35.9|45.4|18.7|5.3|69.4|86.6|
+||-9 |193|1060|35.0|44.2|20.8|5.0|70.0|92.2|
+||-6 |176|971|38.2|43.2|18.6|4.6|66.4|87.5|
+||-3 |173|972|37.9|45.5|16.7|4.8|67.0|86.1|
+|| 0 |148|838|38.1|40.7|21.2|4.2|66.1|89.2|
+|| 3 |176|909|36.0|48.5|15.5|5.9|70.0|88.6|
+|| 6 |166|830|36.7|46.6|16.6|6.1|69.4|89.8|
+|| 9 |170|872|39.0|45.5|15.5|4.7|65.7|87.6|
+||12 |169|895|35.2|46.8|18.0|4.6|69.4|89.9|
+||clean |138|739|40.7|42.2|17.1|5.0|64.3|88.4|
+||reverb |177|943|38.0|44.3|17.7|5.0|67.0|89.3|
+|visual salt and pepper noise|-12|187|995|32.5|48.9|18.6|4.6|72.2|83.4|
+||-9 |193|1060|32.3|51.5|16.2|6.1|73.9|92.2|
+||-6 |176|971|36.5|47.3|16.3|7.2|70.8|86.4|
+||-3 |173|972|35.5|47.2|17.3|4.6|69.1|88.4|
+|| 0 |148|838|36.9|41.5|21.6|3.7|66.8|88.5|
+|| 3 |176|909|33.0|51.9|15.1|5.4|72.4|88.6|
+|| 6 |166|830|35.3|49.9|14.8|8.8|73.5|88.0|
+|| 9 |170|872|41.2|43.3|15.5|5.6|64.4|84.7|
+||12 |169|895|34.2|47.8|18.0|7.3|73.1|91.1|
+||clean |138|739|37.5|47.8|14.7|7.3|69.8|86.2|
+||reverb |177|943|35.9|47.9|16.1|6.7|70.7|87.0|
+
+## Train_pytorch_trainavs_delta_specaug (Audio-Visual)
+
+* Model files (archived to model.tar.gz by $ pack_model.sh)
+ - download link: https://drive.google.com/file/d/1ZXXCXSbbFS2PDlrs9kbJL9pE6-5nPPxi/view
+ - training config file: conf/finetuneav/trainavs.yaml
+ - decoding config file: conf/decode.yaml
+ - preprocess config file: conf/specaug.yaml
+ - lm config file: conf/lm.yaml
+ - cmvn file: data/train/cmvn.ark
+ - e2e file: exp/avfintune/model.last10.avg.best
+ - e2e json file: exp/avfintune/model.json
+ - lm file: exp/train_rnnlm_pytorch_lm_unigram500/rnnlm.model.best
+ - lm JSON file: exp/train_rnnlm_pytorch_lm_unigram500/model.json
+ - dict file: data/lang_char/train_unigram500_units.txt
+
+## Environments
+- date: `Mon Feb 21 11:52:07 UTC 2022`
+- python version: `3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]`
+- espnet version: `espnet 0.6.0`
+- chainer version: `chainer 6.0.0`
+- pytorch version: `pytorch 1.0.1.post2`
+
+
+### CER
+
+|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|---|
+|music noise with clean visual data |-12|171|1669|90.7|5.4|3.9|0.7|9.9|26.3|
+||-9 |187|1897|93.7|3.5|2.7|0.4|6.7|25.1|
+||-6 |176|1821|95.1|2.9|2.0|0.4|5.4|18.8|
+||-3 |201|2096|96.2|1.6|2.2|0.3|4.2|15.9|
+|| 0 |158|1611|96.4|1.9|1.7|0.2|3.8|13.9|
+|| 3 |173|1710|96.7|1.7|1.6|0.2|3.6|17.9|
+|| 6 |185|1920|96.1|1.6|2.2|0.5|4.3|18.9|
+|| 9 |157|1533|96.9|1.4|1.7|0.5|3.6|14.0|
+||12 |150|1536|96.5|1.4|2.1|0.5|4.0|21.3|
+||clean |138|1390|97.9|0.9|1.2|0.2|2.3|13.8|
+||reverb |177|1755|96.8|1.5|1.8|0.2|3.5|16.4|
+|ambient noise with clean visual data |-12|187|1873|89.6|5.8|4.6|1.2|11.5|31.0|
+||-9 |193|1965|91.2|5.0|3.8|0.9|9.6|29.0|
+||-6 |176|1883|94.3|1.9|3.8|0.3|6.0|21.0|
+||-3 |173|1851|94.8|2.7|2.5|0.9|6.1|22.0|
+|| 0 |148|1470|96.3|1.6|2.0|0.1|3.8|16.9|
+|| 3 |176|1718|97.7|1.5|0.8|0.1|2.4|12.5|
+|| 6 |166|1714|96.6|1.6|1.8|0.2|3.6|16.3|
+|| 9 |170|1601|97.0|1.6|1.4|0.3|3.3|17.1|
+||12 |169|1718|95.4|2.6|2.0|0.1|4.7|20.7|
+||clean |138|1390|97.9|0.9|1.2|0.2|2.3|13.8|
+||reverb |177|1755|96.8|1.5|1.8|0.2|3.5|16.4|
+|ambient noise with visual Gaussian blur|-12|187|1873|86.9|7.3|5.8|1.1|14.2|35.8|
+||-9 |193|1965|91.1|5.4|3.5|1.0|9.9|30.1|
+||-6 |176|1883|93.3|2.7|4.0|0.3|7.0|24.4|
+||-3 |173|1851|95.1|2.5|2.4|0.8|5.7|21.4|
+|| 0 |148|1470|96.3|1.6|2.1|0.1|3.8|17.6|
+|| 3 |176|1718|97.3|1.6|1.2|0.2|2.9|13.6|
+|| 6 |166|1714|96.2|1.8|2.0|0.2|4.0|18.1|
+|| 9 |170|1601|97.0|1.4|1.6|0.2|3.2|16.5|
+||12 |169|1718|94.9|2.8|2.3|0.3|5.4|23.1|
+||clean |138|1390|97.8|0.9|1.3|0.2|2.4|14.5|
+||reverb |177|1755|96.5|1.5|2.1|0.2|3.7|16.9|
+|ambient noise with visual salt and pepper noise|-12|187|1873|87.6|7.0|5.4|1.3|13.8|35.8|
+||-9 |193|1965|91.0|5.8|3.2|1.3|10.3|30.6|
+||-6 |176|1883|93.6|2.0|4.4|0.4|6.9|24.4|
+||-3 |173|1851|95.6|2.9|1.6|0.8|5.2|20.2|
+|| 0 |148|1470|95.9|1.9|2.2|0.1|4.2|18.2|
+|| 3 |176|1718|98.0|1.0|1.0|0.3|2.3|13.1|
+|| 6 |166|1714|96.4|1.8|1.8|0.2|3.7|17.5|
+|| 9 |170|1601|97.0|1.4|1.6|0.4|3.4|16.5|
+||12 |169|1718|96.2|2.2|1.6|0.2|4.1|18.9|
+||clean |138|1390|98.1|0.9|1.1|0.2|2.2|13.0|
+||reverb |177|1755|96.6|1.5|1.9|0.2|3.6|16.9|
+
+### WER
+
+|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|---|
+|music noise with clean visual data |-12|171|912|91.2|6.0|2.7|1.5|10.3|26.3|
+||-9 |187|1005|93.2|4.5|2.3|0.4|7.2|25.1|
+||-6 |176|951|94.1|3.7|2.2|0.3|6.2|18.8|
+||-3 |201|1097|95.2|2.7|2.1|0.4|5.2|15.9|
+|| 0 |158|847|96.7|2.2|1.1|0.4|3.7|13.9|
+|| 3 |173|884|95.6|2.6|1.8|0.3|4.8|17.9|
+|| 6 |185|997|95.5|2.3|2.2|0.7|5.2|18.9|
+|| 9 |157|817|96.2|2.1|1.7|0.7|4.5|14.0|
+||12 |150|832|95.1|2.4|2.5|0.2|5.2|21.3|
+||clean |138|739|97.2|1.5|1.4|0.4|3.2|13.8|
+||reverb |177|943|96.0|1.8|2.2|0.3|4.3|16.4|
+|ambient noise with clean visual data |-12|187|995|90.4|6.9|2.7|1.1|10.8|31.0|
+||-9 |193|1060|91.3|5.6|3.1|1.4|10.1|29.0|
+||-6 |176|971|94.4|2.9|2.7|0.3|5.9|21.0|
+||-3 |173|972|93.7|3.7|2.6|0.1|6.4|22.0|
+|| 0 |148|838|95.7|2.0|2.3|0.1|4.4|16.9|
+|| 3 |176|909|97.0|1.5|1.4|0.3|3.3|12.5|
+|| 6 |166|830|96.0|1.9|2.0|0.6|4.6|16.3|
+|| 9 |170|872|95.6|3.4|0.9|0.2|4.6|17.1|
+||12 |169|895|94.0|3.7|2.3|0.4|6.5|20.7|
+||clean |138|739|97.2|1.5|1.4|0.4|3.2|13.8|
+||reverb |177|943|96.0|1.8|2.2|0.3|4.3|16.4|
+|ambient noise with visual Gaussian blur|-12|187|995|87.0|9.1|3.8|1.0|14.0|35.8|
+||-9 |193|1060|90.6|6.2|3.2|1.1|10.6|30.1|
+||-6 |176|971|93.2|3.6|3.2|0.3|7.1|24.4|
+||-3 |173|972|94.0|3.6|2.4|0.1|6.1|21.4|
+|| 0 |148|838|95.6|2.3|2.1|0.2|4.7|17.6|
+|| 3 |176|909|96.3|1.7|2.1|0.3|4.1|13.6|
+|| 6 |166|830|95.4|2.3|2.3|0.6|5.2|18.1|
+|| 9 |170|872|95.6|3.1|1.3|0.2|4.6|16.5|
+||12 |169|895|93.2|4.4|2.5|0.4|7.3|23.1|
+||clean |138|739|97.0|1.5|1.5|0.4|3.4|14.5|
+||reverb |177|943|95.7|1.7|2.7|0.3|4.7|16.9|
+|ambient noise with visual salt and pepper noise|-12|187|995|87.1|8.8|4.0|0.9|13.8|35.8|
+||-9 |193|1060|90.5|6.3|3.2|1.1|10.7|30.6|
+||-6 |176|971|93.3|3.2|3.5|0.3|7.0|24.4|
+||-3 |173|972|94.7|3.8|1.5|0.2|5.6|20.2|
+|| 0 |148|838|95.3|2.4|2.3|0.2|4.9|18.2|
+|| 3 |176|909|96.8|1.4|1.8|0.3|3.5|13.1|
+|| 6 |166|830|95.9|2.2|1.9|0.7|4.8|17.5|
+|| 9 |170|872|95.6|3.1|1.3|0.2|4.6|16.5|
+||12 |169|895|94.7|3.5|1.8|0.3|5.6|18.9|
+||clean |138|739|97.4|1.5|1.1|0.4|3.0|13.0|
+||reverb |177|943|95.8|1.9|2.3|0.4|4.7|16.9|
diff --git a/egs/lrs/avsr1/cmd.sh b/egs/lrs/avsr1/cmd.sh
new file mode 100755
index 00000000000..4d70c9c7a79
--- /dev/null
+++ b/egs/lrs/avsr1/cmd.sh
@@ -0,0 +1,89 @@
+# ====== About run.pl, queue.pl, slurm.pl, and ssh.pl ======
+# Usage: