diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 979e7397012..9036a09b66d 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -151,6 +151,11 @@ we recommend using small model parameters and avoiding dynamic imports, file acc
more running time, you can annotate your test with `@pytest.mark.execution_timeout(sec)`.
- For test initialization (parameters, modules, etc), you can use pytest fixtures. Refer to [pytest fixtures](https://docs.pytest.org/en/latest/fixture.html#using-fixtures-from-classes-modules-or-projects) for more information.
+In addition, please follow the [PEP 8 convention](https://peps.python.org/pep-0008/) for coding style and [Google's convention for docstrings](https://google.github.io/styleguide/pyguide.html#383-functions-and-methods).
+In particular, please take care of the following points:
+- [import ordering](https://peps.python.org/pep-0008/#imports)
+- Avoid writing Python 2-style code. For example, `super().__init__()` is preferred over `super(CLASS_NAME, self).__init__()`.
+
### 4.2 Bash scripts
diff --git a/egs/README.md b/egs/README.md
index 61951b84d47..f95dc5233d9 100755
--- a/egs/README.md
+++ b/egs/README.md
@@ -49,7 +49,8 @@ See: https://espnet.github.io/espnet/tutorial.html
| librispeech | LibriSpeech ASR corpus | ASR | EN | http://www.openslr.org/12 | |
| libritts | LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech | TTS | EN | http://www.openslr.org/60/ | |
| ljspeech | The LJ Speech Dataset | TTS | EN | https://keithito.com/LJ-Speech-Dataset/ | |
-| lrs | The Lip Reading Sentences Dataset | ASR/AVSR | EN | https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html | |
+| lrs2 | The Lip Reading Sentences 2 Dataset | ASR | EN | https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html | |
+| lrs | The Lip Reading Sentences 2 and 3 Datasets | AVSR | EN | https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html | |
| m_ailabs | The M-AILABS Speech Dataset | TTS | ~5 languages | https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/ |
| mucs_2021 | MUCS 2021: MUltilingual and Code-Switching ASR Challenges for Low Resource Indian Languages | ASR/Code Switching | HI, MR, OR, TA, TE, GU, HI-EN, BN-EN | https://navana-tech.github.io/MUCS2021/data.html | |
| mtedx | Multilingual TEDx | ASR/Machine Translation/Speech Translation | 13 Language pairs | http://www.openslr.org/100/ |
diff --git a/egs/lrs/README.md b/egs/lrs/README.md
new file mode 100644
index 00000000000..26f623cd08b
--- /dev/null
+++ b/egs/lrs/README.md
@@ -0,0 +1,335 @@
+# ESPnet-AVSR
+
+## Introduction
+This repository contains an implementation of end-to-end (E2E) audio-visual speech recognition (AVSR) based on the ESPnet ASR toolkit. The fusion strategy follows the paper "Fusing Information Streams in End-to-End Audio-Visual Speech Recognition" (https://ieeexplore.ieee.org/document/9414553) [[1]](#literature). A broad range of reliability measures is used to help the integration model improve the performance of the AVSR system. We use two large-vocabulary datasets, the Lip Reading Sentences 2 and 3 corpora, for all our experiments.
+In addition, this project contains an audio-only model for comparison.
+
+## Table of Contents
+- [Installation](#installation-of-required-packages)
+ * [Requirements](#requirements)
+- [Project Structure](#project-structure)
+ * [Basics](#project-structure)
+ * [AVSR1](#detailed-description-of-avsr1)
+- [Usage of the scripts](#running-the-script)
+  * [Notes](#notes)
+
+
+## Installation of required packages
+
+### Requirements
+
+For installation, approximately 40 GB of free disk space is needed. Stage 0 of `avsr1/run.sh` installs all required packages under `avsr1/local/installations` (see the example after the package list):
+
+**Required Packages:**
+1. ESPnet: https://github.com/espnet/espnet
+2. OpenFace: https://github.com/TadasBaltrusaitis/OpenFace
+3. DeepXi: https://github.com/anicolson/DeepXi
+4. Vidaug: https://github.com/okankop/vidaug
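+
+For example, stage 0 can be run on its own from the recipe directory (`--stage`/`--stop_stage` are recipe options parsed by `run.sh`):
+```console
+foo@bar:~/espnet/egs/lrs/avsr1$ ./run.sh --stage 0 --stop_stage 0
+```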
+
+
+
+## Project structure
+The main folder, `avsr1/`, contains the code for the audio-visual speech recognition system, trained on the LRS2 [[2]](#literature) dataset together with the LRS3 dataset (https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html) [[3]](#literature). It follows the basic ESPnet structure.
+The main code for the recognition system is the `run.sh` script, in which the workflow of the system is performed in multiple stages:
+
+| AVSR |
+|-------------------------------------------------------------|
+| Stage 0: Install required packages |
+| Stage 1: Data Download and preparation |
+| Stage 2: Audio augmentation |
+| Stage 3: MP3 files and Feature Generation |
+| Stage 4: Dictionary and JSON data preparation |
+| Stage 5: Reliability measures generation |
+| Stage 6: Language model training                            |
+| Stage 7: Training of the E2E-AVSR model and Decoding |
+
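+Because `run.sh` parses the `stage` and `stop_stage` variables via `utils/parse_options.sh`, an interrupted run can be resumed at any stage, for example:
+```console
+foo@bar:~/avsr1$ ./run.sh --stage 5 --stop_stage 7
+```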
+
+
+
+
+
+### Detailed description of AVSR1:
+
+##### Stage 0: Packages installations
+ * Install the required packages ESPnet, OpenFace, DeepXi, and Vidaug under `avsr1/local/installations`. To install OpenFace, you will need sudo rights.
+
+##### Stage 1: Data preparation
+ * The LRS2 dataset [2] must be downloaded in advance by yourself. For downloading the dataset, please visit https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html [2]. You will need to sign a data-sharing agreement with BBC Research & Development before getting access. After downloading, please edit the `run.sh` file and assign the dataset directory path to the `DATA_DIR` variable (see the sketch after this list)
+ * The same applies to the LRS3 dataset, https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html [3]. After downloading, please edit the `run.sh` file and assign the dataset directory path to the `DATALRS3_DIR` variable
+ * Download the MUSAN dataset for audio data augmentation and save it under the `${MUSAN_DIR}` directory
+ * Download the Room Impulse Response and Noise Database (RIRS-NOISES) and save it under the `RIRS_NOISES/` directory
+ * Run the `audio_data_prep.sh` script: create file lists for the given part of the dataset and prepare the Kaldi files
+ * Dump useful data for training
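+
+A minimal sketch of the corresponding variable assignments inside `run.sh` (the paths shown are placeholders):
+```bash
+DATA_DIR=/home/foo/LRS2      # The LRS2 dataset directory
+DATALRS3_DIR=/home/foo/LRS3  # The LRS3 dataset directory
+MUSAN_DIR=musan              # The noise dataset directory
+```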
+
+##### Stage 2: Audio Augmentation
+ * Augment the audio data with RIRS Noise
+ * Augment the audio data with Musan Noise
+ * The augmented files are saved under `data/audio/augment`, whereas the clean audio files can be found in `data/audio/clean` for all datasets used (Test, Validation (Val), Train, and optionally pretrain)
+
+##### Stage 3: Feature Generation
+ * Make augmented MP3 files
+ * Generate the fbank and MFCC features for the audio signals. By default, 80-dimensional filterbanks with pitch are computed for each frame
+ * Compute global cepstral mean and variance normalization (CMVN) statistics over the training features (https://kaldi-asr.org/doc/compute-cmvn-stats_8cc.html); an example invocation is sketched below
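+
+As a sketch, the global CMVN statistics can be computed with the standard Kaldi tool (the paths follow the cmvn file listed in RESULTS.md):
+```console
+foo@bar:~/avsr1$ compute-cmvn-stats scp:data/train/feats.scp data/train/cmvn.ark
+```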
+
+##### Stage 4: Dictionary and JSON data preparation
+ * Build the dictionary and prepare the JSON data
+ * Build a tokenizer using SentencePiece (example below): https://github.com/google/sentencepiece
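+
+A hedged example of the SentencePiece training call (a unigram model with 500 units, matching the `nbpe` and `bpemode` settings in `run.sh`; the input and prefix paths are placeholders):
+```console
+foo@bar:~/avsr1$ spm_train --input=data/lang_char/input.txt --model_prefix=data/lang_char/train_unigram500 --vocab_size=500 --model_type=unigram
+```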
+
+##### Stage 5: Reliability measures generation
+ * Stage 5.0: Create dump files for MFCC features
+ * Stage 5.1: Video augmentation with Gaussian blur and salt&pepper noise
+ * Stage 5.2: OpenFace facial feature extraction (especially of the mouth region; for further details, see the documentation in the `avsr1/local` folder)
+ * Stage 5.3: Extract video frames
+ * Stage 5.4: Estimate SNRs using DeepXi framework
+ * Stage 5.5: Extract video features by pretrained video feature extractor [[4]](#literature)
+ * Stage 5.6: Make video .ark files
+ * Stage 5.7: Remake audio and video dump files
+ * Stage 5.8: Split test decode dump files by different signal-to-noise ratios
+
+##### Stage 6: Language Model Training
+ * Train your own language model on the LibriSpeech LM corpus (https://www.openslr.org/11/) or use a pretrained language model, as shown below
+ * It is also possible to skip the language model and use the system without an external language model.
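+
+The choice is controlled by the `train_lm` variable in `run.sh` (false uses the pretrained LibriSpeech LM, true trains your own), which can also be overridden on the command line:
+```console
+foo@bar:~/avsr1$ ./run.sh --train_lm true
+```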
+
+##### Stage 7: Network Training
+ * Train the audio model
+ * Pretrain the video model
+ * Fine-tune the video model
+ * Pretrain the audio-visual (AV) model
+ * Fine-tune the AV model (this model is used for decoding)
+
+##### Other important references:
+ * Explanation of the CSV-file for OpenFace: https://github.com/TadasBaltrusaitis/OpenFace/wiki/Output-Format#featureextraction
+
+
+## Running the script
+The runtime script is **run.sh**, which can be found in the `avsr1/` directory.
+> Before running the script, please download the LRS2 (https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html) [[2]](#literature) and LRS3 (https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html) [[3]](#literature) datasets yourself and assign the download paths to the variables `DATA_DIR` (LRS2 path) and `DATALRS3_DIR` (LRS3 path) inside the `run.sh` file.
+
+### Notes
+Due to the long runtime, it can be useful to run the script with the `screen` command, monitor it in a terminal window, and redirect the output to a log file.
+
+Screen is a terminal multiplexer, which means that you can start any number of virtual terminals inside the current terminal session. The advantage is that you can detach virtual terminals so that they run in the background. Furthermore, their processes keep running even if you close the main session or an SSH connection while working remotely on a server.
+Screen can be installed from the official package repositories via
+```console
+foo@bar:~$ sudo apt install screen
+```
+As an example, to redirect the output into a file named "log_run_sh.txt", the script could be started with:
+```console
+foo@bar:~/avsr1$ screen bash -c 'bash run.sh |& tee -a log_run_sh.txt'
+```
+This will start a virtual terminal session that executes and monitors the run.sh file. The output is printed to this session and also saved to the file "log_run_sh.txt". You can leave the monitoring session by pressing Ctrl+A, then D. If you want to return to the process, simply type
+```console
+foo@bar:~$ screen -ls
+```
+into a terminal to see all running screen processes with their corresponding ID. Then execute
+```console
+foo@bar:~$ screen -r [ID]
+```
+to return to the process.
+Source: https://wiki.ubuntuusers.de/Screen/
+
+***
+### Literature
+
+[1] W. Yu, S. Zeiler and D. Kolossa, "Fusing Information Streams in End-to-End Audio-Visual Speech Recognition," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 3430-3434, doi: 10.1109/ICASSP39728.2021.9414553.
+
+[2] T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, "Deep Audio-Visual Speech Recognition," arXiv preprint arXiv:1809.02108.
+
+[3] T. Afouras, J. S. Chung, and A. Zisserman, "LRS3-TED: a large-scale dataset for visual speech recognition," arXiv preprint arXiv:1809.00496.
+
+[4] S. Petridis, T. Stafylakis, P. Ma, G. Tzimiropoulos, and M. Pantic, "Audio-visual speech recognition with a hybrid CTC/Attention architecture," in IEEE SLT, 2018.
+
diff --git a/egs/lrs/avsr1/RESULTS.md b/egs/lrs/avsr1/RESULTS.md
new file mode 100755
index 00000000000..2615db795f8
--- /dev/null
+++ b/egs/lrs/avsr1/RESULTS.md
@@ -0,0 +1,294 @@
+## pretrain_Train_pytorch_audio_delta_specaug (Audio-Only)
+
+* Model files (archived to model.tar.gz by `$ pack_model.sh`)
+ - download link: https://drive.google.com/file/d/1ITgdZoa8vQ7lDwi1jLziYGXOyUtgE2ow/view
+ - training config file: conf/train.yaml
+ - decoding config file: conf/decode.yaml
+ - preprocess config file: conf/specaug.yaml
+ - lm config file: conf/lm.yaml
+ - cmvn file: data/train/cmvn.ark
+ - e2e file: exp/audio/model.last10.avg.best
+ - e2e json file: exp/audio/model.json
+ - lm file: exp/train_rnnlm_pytorch_lm_unigram500/rnnlm.model.best
+ - lm JSON file: exp/train_rnnlm_pytorch_lm_unigram500/model.json
+ - dict file: data/lang_char/train_unigram500_units.txt
+
+## Environments
+- date: `Mon Feb 21 11:52:07 UTC 2022`
+- python version: `3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]`
+- espnet version: `espnet 0.6.0`
+- chainer version: `chainer 6.0.0`
+- pytorch version: `pytorch 1.0.1.post2`
+
+### CER
+
+|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|---|
+|music noise|-12|171|1669|82.0|11.2|6.8|2.2|20.3|38.6|
+||-9|187|1897|87.0|8.3|4.7|0.8|13.8|33.2|
+||-6|176|1821|92.0|5.5|2.5|1.1|9.1|26.7|
+||-3|201|2096|94.4|2.2|3.3|0.2|5.8|20.4|
+||0|158|1611|95.0|3.0|2.0|0.4|5.4|19.0|
+||3|173|1710|94.7|2.7|2.6|0.4|5.7|24.9|
+||6|185|1920|96.2|1.8|2.0|0.5|4.3|17.8|
+||9|157|1533|97.6|1.0|1.4|0.5|2.9|13.4|
+||12|150|1536|96.4|1.6|2.1|0.3|4.0|20.7|
+||clean|138|1390|96.7|1.4|1.9|0.4|3.7|17.4|
+||reverb|177|1755|93.7|3.6|2.7|0.7|7.0|23.2|
+|ambient noise|-12|187|1873|76.4|16.3|7.3|2.3|25.9|51.9|
+||-9 |193|1965|84.2|10.3|5.4|1.8|17.6|40.4|
+||-6 |176|1883|90.2|5.8|4.0|1.3|11.2|26.1|
+||-3 |173|1851|91.2|4.8|4.0|1.0|9.8|32.9|
+|| 0 |148|1470|94.8|3.0|2.2|0.7|5.9|23.6|
+|| 3 |176|1718|96.0|2.1|1.9|0.3|4.3|17.0|
+|| 6 |166|1714|93.7|2.9|3.4|0.5|6.8|20.5|
+|| 9 |170|1601|96.9|1.5|1.6|0.3|3.4|18.2|
+||12 |169|1718|95.9|2.5|1.6|0.2|4.3|20.1|
+||clean |138|1390|96.7|1.4|1.9|0.4|3.7|17.4|
+||reverb |177|1755|93.7|3.6|2.7|0.7|7.0|23.2|
+
+### WER
+
+|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|---|
+|music noise|-12|171|912|83.4|12.5|4.1|2.4|19.0|38.6|
+||-9 |187|1005|87.6|8.6|3.9|1.9|14.3|33.2|
+||-6 |176|951|90.6|5.9|3.5|0.8|10.2|26.7|
+||-3 |201|1097|94.4|3.3|2.3|0.6|6.2|20.4|
+|| 0 |158|847|94.9|3.2|1.9|0.4|5.4|19.0|
+|| 3 |173|884|94.2|3.8|1.9|0.6|6.3|24.9|
+|| 6 |185|997|96.3|2.7|1.0|0.7|4.4|17.8|
+|| 9 |157|817|96.9|1.7|1.3|0.4|3.4|13.4|
+||12 |150|832|95.2|2.9|1.9|0.5|5.3|20.7|
+||clean |138|739|95.7|2.4|1.9|0.4|4.7|17.4|
+||reverb |177|943|93.6|4.0|2.3|0.4|6.8|23.2|
+|ambient noise|-12|187|995|73.7|18.4|7.9|1.7|28.0|51.9|
+||-9 |193|1060|83.0|11.7|5.3|1.4|18.4|40.4|
+||-6 |176|971|90.2|6.8|3.0|1.4|11.2|26.1|
+||-3 |173|972|90.0|6.9|3.1|1.0|11.0|32.9|
+|| 0 |148|838|94.0|4.1|1.9|0.4|6.3|23.6|
+|| 3 |176|909|95.5|2.9|1.7|0.3|4.8|17.0|
+|| 6 |166|830|94.1|3.3|2.7|1.0|6.9|20.5|
+|| 9 |170|872|95.4|3.1|1.5|0.2|4.8|18.2|
+||12 |169|895|95.0|4.0|1.0|0.2|5.3|20.1|
+||clean |138|739|95.7|2.4|1.9|0.4|4.7|17.4|
+||reverb |177|943|93.6|4.0|2.3|0.4|6.8|23.2|
+
+## Train_pytorch_trainvideo_delta_specaug (Video-Only)
+
+* Model files (archived to model.tar.gz by `$ pack_model.sh`)
+ - download link: https://drive.google.com/file/d/1ZXXCXSbbFS2PDlrs9kbJL9pE6-5nPPxi/view
+ - training config file: conf/finetunevideo/trainvideo.yaml
+ - decoding config file: conf/decode.yaml
+ - preprocess config file: conf/specaug.yaml
+ - lm config file: conf/lm.yaml
+ - e2e file: exp/vfintune/model.last10.avg.best
+ - e2e json file: exp/vfintune/model.json
+ - lm file: exp/train_rnnlm_pytorch_lm_unigram500/rnnlm.model.best
+ - lm JSON file: exp/train_rnnlm_pytorch_lm_unigram500/model.json
+ - dict file: data/lang_char/train_unigram500_units.txt
+
+## Environments
+- date: `Mon Feb 21 11:52:07 UTC 2022`
+- python version: `3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]`
+- espnet version: `espnet 0.6.0`
+- chainer version: `chainer 6.0.0`
+- pytorch version: `pytorch 1.0.1.post2`
+
+
+### CER
+
+|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|---|
+|clean visual data|-12|171|1669|42.3|42.5|15.2|6.4|64.1|91.8|
+||-9 |187|1897|46.4|38.8|14.8|8.5|62.2|90.9|
+||-6 |176|1821|48.1|37.7|14.2|9.2|61.1|92.0|
+||-3 |201|2096|41.7|46.4|11.9|8.9|67.2|90.0|
+|| 0 |158|1611|43.4|42.6|14.0|7.1|63.7|94.9|
+|| 3 |173|1710|49.2|37.6|13.2|8.9|59.7|91.9|
+|| 6 |185|1920|39.3|45.6|15.2|9.4|70.2|95.1|
+|| 9 |157|1533|46.2|39.1|14.7|8.5|62.3|89.2|
+||12 |150|1536|49.5|37.6|12.9|7.2|57.7|87.3|
+||clean |138|1390|44.2|42.3|13.5|7.8|63.7|92.8|
+||reverb |177|1755|44.8|41.5|13.6|7.5|62.7|92.1|
+|visual Gaussian blur|-12|187|1873|37.3|46.6|16.1|9.0|71.6|93.0|
+||-9 |193|1965|43.0|44.1|13.0|11.0|68.1|93.8|
+||-6 |176|1883|39.9|43.3|16.7|7.5|67.6|93.8|
+||-3 |173|1851|43.7|43.8|12.5|8.2|64.5|91.9|
+|| 0 |148|1470|42.3|45.4|12.3|8.2|65.9|93.9|
+|| 3 |176|1718|44.8|41.5|13.7|7.9|63.1|89.2|
+|| 6 |166|1714|38.5|45.4|16.0|10.7|72.2|94.6|
+|| 9 |170|1601|45.1|42.8|12.1|11.7|66.6|91.2|
+||12 |169|1718|42.0|40.1|17.9|8.2|66.2|92.3|
+||clean |138|1390|40.4|45.5|14.2|8.7|68.3|93.5|
+||reverb |177|1755|40.2|45.6|14.2|8.5|68.3|92.7|
+|visual salt and pepper noise|-12|187|1873|36.2|48.1|15.8|9.9|73.7|92.0|
+||-9 |193|1965|41.7|44.6|13.7|10.6|68.9|92.7|
+||-6 |176|1883|36.5|47.2|16.4|8.6|72.1|93.2|
+||-3 |173|1851|42.1|45.4|12.5|10.8|68.6|92.5|
+|| 0 |148|1470|42.3|45.1|12.6|9.5|67.2|91.9|
+|| 3 |176|1718|40.0|45.1|15.0|7.6|67.6|92.0|
+|| 6 |166|1714|38.1|45.2|16.7|10.1|72.0|94.0|
+|| 9 |170|1601|40.2|45.9|13.9|12.0|71.8|92.9|
+||12 |169|1718|37.5|46.8|15.7|8.7|71.2|94.1|
+||clean |138|1390|39.9|46.0|14.0|9.1|69.1|92.8|
+||reverb |177|1755|39.9|46.2|13.9|9.1|69.2|92.7|
+
+### WER
+
+|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|---|
+|clean visual data|-12|171|912|39.4|42.7|18.0|4.3|64.9|89.5|
+||-9 |187|1005|43.7|40.6|15.7|5.4|61.7|86.1|
+||-6 |176|951|43.3|42.6|14.1|4.1|60.8|88.6|
+||-3 |201|1097|41.3|44.2|14.5|5.3|64.0|85.6|
+|| 0 |158|847|44.3|37.8|17.9|6.1|61.9|85.4|
+|| 3 |173|884|44.2|39.7|16.1|5.3|61.1|84.4|
+|| 6 |185|997|38.2|44.8|17.0|3.9|65.7|84.9|
+|| 9 |157|817|47.9|37.1|15.1|5.5|57.6|80.3|
+||12 |150|832|42.9|37.6|19.5|5.3|62.4|84.0|
+||clean |138|739|45.9|39.1|15.0|5.3|59.4|85.5|
+||reverb |177|943|43.4|40.5|16.1|5.3|61.9|85.9|
+|visual Gaussian blur|-12|187|995|35.9|45.4|18.7|5.3|69.4|86.6|
+||-9 |193|1060|35.0|44.2|20.8|5.0|70.0|92.2|
+||-6 |176|971|38.2|43.2|18.6|4.6|66.4|87.5|
+||-3 |173|972|37.9|45.5|16.7|4.8|67.0|86.1|
+|| 0 |148|838|38.1|40.7|21.2|4.2|66.1|89.2|
+|| 3 |176|909|36.0|48.5|15.5|5.9|70.0|88.6|
+|| 6 |166|830|36.7|46.6|16.6|6.1|69.4|89.8|
+|| 9 |170|872|39.0|45.5|15.5|4.7|65.7|87.6|
+||12 |169|895|35.2|46.8|18.0|4.6|69.4|89.9|
+||clean |138|739|40.7|42.2|17.1|5.0|64.3|88.4|
+||reverb |177|943|38.0|44.3|17.7|5.0|67.0|89.3|
+|visual salt and pepper noise|-12|187|995|32.5|48.9|18.6|4.6|72.2|83.4|
+||-9 |193|1060|32.3|51.5|16.2|6.1|73.9|92.2|
+||-6 |176|971|36.5|47.3|16.3|7.2|70.8|86.4|
+||-3 |173|972|35.5|47.2|17.3|4.6|69.1|88.4|
+|| 0 |148|838|36.9|41.5|21.6|3.7|66.8|88.5|
+|| 3 |176|909|33.0|51.9|15.1|5.4|72.4|88.6|
+|| 6 |166|830|35.3|49.9|14.8|8.8|73.5|88.0|
+|| 9 |170|872|41.2|43.3|15.5|5.6|64.4|84.7|
+||12 |169|895|34.2|47.8|18.0|7.3|73.1|91.1|
+||clean |138|739|37.5|47.8|14.7|7.3|69.8|86.2|
+||reverb |177|943|35.9|47.9|16.1|6.7|70.7|87.0|
+
+## Train_pytorch_trainavs_delta_specaug (Audio-Visual)
+
+* Model files (archived to model.tar.gz by `$ pack_model.sh`)
+ - download link: https://drive.google.com/file/d/1ZXXCXSbbFS2PDlrs9kbJL9pE6-5nPPxi/view
+ - training config file: conf/finetuneav/trainavs.yaml
+ - decoding config file: conf/decode.yaml
+ - preprocess config file: conf/specaug.yaml
+ - lm config file: conf/lm.yaml
+ - cmvn file: data/train/cmvn.ark
+ - e2e file: exp/avfintune/model.last10.avg.best
+ - e2e json file: exp/avfintune/model.json
+ - lm file: exp/train_rnnlm_pytorch_lm_unigram500/rnnlm.model.best
+ - lm JSON file: exp/train_rnnlm_pytorch_lm_unigram500/model.json
+ - dict file: data/lang_char/train_unigram500_units.txt
+
+## Environments
+- date: `Mon Feb 21 11:52:07 UTC 2022`
+- python version: `3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]`
+- espnet version: `espnet 0.6.0`
+- chainer version: `chainer 6.0.0`
+- pytorch version: `pytorch 1.0.1.post2`
+
+
+### CER
+
+|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|---|
+|music noise with clean visual data |-12|171|1669|90.7|5.4|3.9|0.7|9.9|26.3|
+||-9 |187|1897|93.7|3.5|2.7|0.4|6.7|25.1|
+||-6 |176|1821|95.1|2.9|2.0|0.4|5.4|18.8|
+||-3 |201|2096|96.2|1.6|2.2|0.3|4.2|15.9|
+|| 0 |158|1611|96.4|1.9|1.7|0.2|3.8|13.9|
+|| 3 |173|1710|96.7|1.7|1.6|0.2|3.6|17.9|
+|| 6 |185|1920|96.1|1.6|2.2|0.5|4.3|18.9|
+|| 9 |157|1533|96.9|1.4|1.7|0.5|3.6|14.0|
+||12 |150|1536|96.5|1.4|2.1|0.5|4.0|21.3|
+||clean |138|1390|97.9|0.9|1.2|0.2|2.3|13.8|
+||reverb |177|1755|96.8|1.5|1.8|0.2|3.5|16.4|
+|ambient noise with clean visual data |-12|187|1873|89.6|5.8|4.6|1.2|11.5|31.0|
+||-9 |193|1965|91.2|5.0|3.8|0.9|9.6|29.0|
+||-6 |176|1883|94.3|1.9|3.8|0.3|6.0|21.0|
+||-3 |173|1851|94.8|2.7|2.5|0.9|6.1|22.0|
+|| 0 |148|1470|96.3|1.6|2.0|0.1|3.8|16.9|
+|| 3 |176|1718|97.7|1.5|0.8|0.1|2.4|12.5|
+|| 6 |166|1714|96.6|1.6|1.8|0.2|3.6|16.3|
+|| 9 |170|1601|97.0|1.6|1.4|0.3|3.3|17.1|
+||12 |169|1718|95.4|2.6|2.0|0.1|4.7|20.7|
+||clean |138|1390|97.9|0.9|1.2|0.2|2.3|13.8|
+||reverb |177|1755|96.8|1.5|1.8|0.2|3.5|16.4|
+|ambient noise with visual Gaussian blur|-12|187|1873|86.9|7.3|5.8|1.1|14.2|35.8|
+||-9 |193|1965|91.1|5.4|3.5|1.0|9.9|30.1|
+||-6 |176|1883|93.3|2.7|4.0|0.3|7.0|24.4|
+||-3 |173|1851|95.1|2.5|2.4|0.8|5.7|21.4|
+|| 0 |148|1470|96.3|1.6|2.1|0.1|3.8|17.6|
+|| 3 |176|1718|97.3|1.6|1.2|0.2|2.9|13.6|
+|| 6 |166|1714|96.2|1.8|2.0|0.2|4.0|18.1|
+|| 9 |170|1601|97.0|1.4|1.6|0.2|3.2|16.5|
+||12 |169|1718|94.9|2.8|2.3|0.3|5.4|23.1|
+||clean |138|1390|97.8|0.9|1.3|0.2|2.4|14.5|
+||reverb |177|1755|96.5|1.5|2.1|0.2|3.7|16.9|
+|ambient noise with visual salt and pepper noise|-12|187|1873|87.6|7.0|5.4|1.3|13.8|35.8|
+||-9 |193|1965|91.0|5.8|3.2|1.3|10.3|30.6|
+||-6 |176|1883|93.6|2.0|4.4|0.4|6.9|24.4|
+||-3 |173|1851|95.6|2.9|1.6|0.8|5.2|20.2|
+|| 0 |148|1470|95.9|1.9|2.2|0.1|4.2|18.2|
+|| 3 |176|1718|98.0|1.0|1.0|0.3|2.3|13.1|
+|| 6 |166|1714|96.4|1.8|1.8|0.2|3.7|17.5|
+|| 9 |170|1601|97.0|1.4|1.6|0.4|3.4|16.5|
+||12 |169|1718|96.2|2.2|1.6|0.2|4.1|18.9|
+||clean |138|1390|98.1|0.9|1.1|0.2|2.2|13.0|
+||reverb |177|1755|96.6|1.5|1.9|0.2|3.6|16.9|
+
+### WER
+
+|dataset|SNR in dB|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|---|
+|music noise with clean visual data |-12|171|912|91.2|6.0|2.7|1.5|10.3|26.3|
+||-9 |187|1005|93.2|4.5|2.3|0.4|7.2|25.1|
+||-6 |176|951|94.1|3.7|2.2|0.3|6.2|18.8|
+||-3 |201|1097|95.2|2.7|2.1|0.4|5.2|15.9|
+|| 0 |158|847|96.7|2.2|1.1|0.4|3.7|13.9|
+|| 3 |173|884|95.6|2.6|1.8|0.3|4.8|17.9|
+|| 6 |185|997|95.5|2.3|2.2|0.7|5.2|18.9|
+|| 9 |157|817|96.2|2.1|1.7|0.7|4.5|14.0|
+||12 |150|832|95.1|2.4|2.5|0.2|5.2|21.3|
+||clean |138|739|97.2|1.5|1.4|0.4|3.2|13.8|
+||reverb |177|943|96.0|1.8|2.2|0.3|4.3|16.4|
+|ambient noise with clean visual data |-12|187|995|90.4|6.9|2.7|1.1|10.8|31.0|
+||-9 |193|1060|91.3|5.6|3.1|1.4|10.1|29.0|
+||-6 |176|971|94.4|2.9|2.7|0.3|5.9|21.0|
+||-3 |173|972|93.7|3.7|2.6|0.1|6.4|22.0|
+|| 0 |148|838|95.7|2.0|2.3|0.1|4.4|16.9|
+|| 3 |176|909|97.0|1.5|1.4|0.3|3.3|12.5|
+|| 6 |166|830|96.0|1.9|2.0|0.6|4.6|16.3|
+|| 9 |170|872|95.6|3.4|0.9|0.2|4.6|17.1|
+||12 |169|895|94.0|3.7|2.3|0.4|6.5|20.7|
+||clean |138|739|97.2|1.5|1.4|0.4|3.2|13.8|
+||reverb |177|943|96.0|1.8|2.2|0.3|4.3|16.4|
+|ambient noise with visual Gaussian blur|-12|187|995|87.0|9.1|3.8|1.0|14.0|35.8|
+||-9 |193|1060|90.6|6.2|3.2|1.1|10.6|30.1|
+||-6 |176|971|93.2|3.6|3.2|0.3|7.1|24.4|
+||-3 |173|972|94.0|3.6|2.4|0.1|6.1|21.4|
+|| 0 |148|838|95.6|2.3|2.1|0.2|4.7|17.6|
+|| 3 |176|909|96.3|1.7|2.1|0.3|4.1|13.6|
+|| 6 |166|830|95.4|2.3|2.3|0.6|5.2|18.1|
+|| 9 |170|872|95.6|3.1|1.3|0.2|4.6|16.5|
+||12 |169|895|93.2|4.4|2.5|0.4|7.3|23.1|
+||clean |138|739|97.0|1.5|1.5|0.4|3.4|14.5|
+||reverb |177|943|95.7|1.7|2.7|0.3|4.7|16.9|
+|ambient noise with visual salt and pepper noise|-12|187|995|87.1|8.8|4.0|0.9|13.8|35.8|
+||-9 |193|1060|90.5|6.3|3.2|1.1|10.7|30.6|
+||-6 |176|971|93.3|3.2|3.5|0.3|7.0|24.4|
+||-3 |173|972|94.7|3.8|1.5|0.2|5.6|20.2|
+|| 0 |148|838|95.3|2.4|2.3|0.2|4.9|18.2|
+|| 3 |176|909|96.8|1.4|1.8|0.3|3.5|13.1|
+|| 6 |166|830|95.9|2.2|1.9|0.7|4.8|17.5|
+|| 9 |170|872|95.6|3.1|1.3|0.2|4.6|16.5|
+||12 |169|895|94.7|3.5|1.8|0.3|5.6|18.9|
+||clean |138|739|97.4|1.5|1.1|0.4|3.0|13.0|
+||reverb |177|943|95.8|1.9|2.3|0.4|4.7|16.9|
diff --git a/egs/lrs/avsr1/cmd.sh b/egs/lrs/avsr1/cmd.sh
new file mode 100755
index 00000000000..4d70c9c7a79
--- /dev/null
+++ b/egs/lrs/avsr1/cmd.sh
@@ -0,0 +1,89 @@
+# ====== About run.pl, queue.pl, slurm.pl, and ssh.pl ======
+# Usage: <cmd>.pl [options] JOB=1:<N> <log> <command...>
+# e.g.
+# run.pl --mem 4G JOB=1:10 echo.JOB.log echo JOB
+#
+# Options:
+# --time : Limit the maximum time to execute.
+# --mem : Limit the maximum memory usage.
+# --max-jobs-run : Limit the number of parallel jobs. This is ignored for non-array jobs.
+# --num-threads : Specify the number of CPU cores.
+# --gpu : Specify the number of GPU devices.
+# --config: Change the configuration file from default.
+#
+# "JOB=1:10" is used for "array jobs" and it can control the number of parallel jobs.
+# The left string of "=", i.e. "JOB", is replaced by (Nth job) in the command and the log file name,
+# e.g. "echo JOB" is changed to "echo 3" for the 3rd job and "echo 8" for 8th job respectively.
+# Note that the number must start with a positive number, so you can't use "JOB=0:10" for example.
+#
+# run.pl, queue.pl, slurm.pl, and ssh.pl have a unified interface that does not depend on the backend.
+# These options are mapped to backend-specific options and
+# are configured by "conf/queue.conf" and "conf/slurm.conf" by default.
+# If jobs failed, your configuration might be wrong for your environment.
+#
+#
+# The official documentation for run.pl, queue.pl, slurm.pl, and ssh.pl:
+# "Parallelization in Kaldi": http://kaldi-asr.org/doc/queue.html
+# =========================================================
+
+
+# Select the backend used by run.sh from "local", "sge", "slurm", or "ssh"
+cmd_backend='local'
+
+# Local machine, without any Job scheduling system
+if [ "${cmd_backend}" = local ]; then
+
+ # The other usage
+ export train_cmd="run.pl"
+ # Used for "*_train.py": "--gpu" is appended optionally by run.sh
+ export cuda_cmd="run.pl"
+ # Used for "*_recog.py"
+ export decode_cmd="run.pl"
+
+# "qsub" (SGE, Torque, PBS, etc.)
+elif [ "${cmd_backend}" = sge ]; then
+ # The default setting is written in conf/queue.conf.
+ # You must change "-q g.q" for the "queue" for your environment.
+ # To know the "queue" names, type "qhost -q"
+ # Note that to use "--gpu *", you have to setup "complex_value" for the system scheduler.
+
+ export train_cmd="queue.pl"
+ export cuda_cmd="queue.pl"
+ export decode_cmd="queue.pl"
+
+# "sbatch" (Slurm)
+elif [ "${cmd_backend}" = slurm ]; then
+ # The default setting is written in conf/slurm.conf.
+    # You must change "-p cpu" and "-p gpu" to the "partition" names for your environment.
+    # To know the "partition" names, type "sinfo".
+    # You can use "--gpu *" by default for slurm and it is interpreted as "--gres gpu:*".
+ # The devices are allocated exclusively using "${CUDA_VISIBLE_DEVICES}".
+
+ export train_cmd="slurm.pl"
+ export cuda_cmd="slurm.pl"
+ export decode_cmd="slurm.pl"
+
+elif [ "${cmd_backend}" = ssh ]; then
+ # You have to create ".queue/machines" to specify the host to execute jobs.
+ # e.g. .queue/machines
+ # host1
+ # host2
+ # host3
+    # This assumes that you can log in to them without a password, i.e. you have set up SSH keys.
+
+ export train_cmd="ssh.pl"
+ export cuda_cmd="ssh.pl"
+ export decode_cmd="ssh.pl"
+
+# This is an example of specifying several unique options in the JHU CLSP cluster setup.
+# Users can modify/add their own command options according to their cluster environments.
+elif [ "${cmd_backend}" = jhu ]; then
+
+ export train_cmd="queue.pl --mem 2G"
+ export cuda_cmd="queue-freegpu.pl --mem 2G --gpu 1 --config conf/gpu.conf"
+ export decode_cmd="queue.pl --mem 4G"
+
+else
+ echo "$0: Error: Unknown cmd_backend=${cmd_backend}" 1>&2
+ return 1
+fi
diff --git a/egs/lrs/asr1/conf/decode.yaml b/egs/lrs/avsr1/conf/decode.yaml
old mode 100644
new mode 100755
similarity index 100%
rename from egs/lrs/asr1/conf/decode.yaml
rename to egs/lrs/avsr1/conf/decode.yaml
diff --git a/egs/lrs/asr1/conf/fbank.conf b/egs/lrs/avsr1/conf/fbank.conf
old mode 100644
new mode 100755
similarity index 100%
rename from egs/lrs/asr1/conf/fbank.conf
rename to egs/lrs/avsr1/conf/fbank.conf
diff --git a/egs/lrs/asr1/conf/gpu.conf b/egs/lrs/avsr1/conf/gpu.conf
old mode 100644
new mode 100755
similarity index 100%
rename from egs/lrs/asr1/conf/gpu.conf
rename to egs/lrs/avsr1/conf/gpu.conf
diff --git a/egs/lrs/avsr1/conf/lm.yaml b/egs/lrs/avsr1/conf/lm.yaml
new file mode 100755
index 00000000000..94918a470ae
--- /dev/null
+++ b/egs/lrs/avsr1/conf/lm.yaml
@@ -0,0 +1,9 @@
+layer: 4
+dropout: 0
+unit: 2048
+opt: sgd # or adam
+sortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
+batchsize: 128 # batch size in LM training
+epoch: 2 # if the data size is large, we can reduce this
+patience: 3
+maxlen: 150 # if sentence length > lm_maxlen, lm_batchsize is automatically reduced
diff --git a/egs/lrs/avsr1/conf/mfcc.conf b/egs/lrs/avsr1/conf/mfcc.conf
new file mode 100755
index 00000000000..a1aa3d6c158
--- /dev/null
+++ b/egs/lrs/avsr1/conf/mfcc.conf
@@ -0,0 +1,2 @@
+--use-energy=false # only non-default option.
+--sample-frequency=16000
diff --git a/egs/lrs/avsr1/conf/mfcc_hires.conf b/egs/lrs/avsr1/conf/mfcc_hires.conf
new file mode 100755
index 00000000000..434834a6725
--- /dev/null
+++ b/egs/lrs/avsr1/conf/mfcc_hires.conf
@@ -0,0 +1,10 @@
+# config for high-resolution MFCC features, intended for neural network training
+# Note: we keep all cepstra, so it has the same info as filterbank features,
+# but MFCC is more easily compressible (because less correlated) which is why
+# we prefer this method.
+--use-energy=false # use average of log energy, not energy.
+--num-mel-bins=40 # similar to Google's setup.
+--num-ceps=40 # there is no dimensionality reduction.
+--low-freq=20 # low cutoff frequency for mel bins... this is high-bandwidth data, so
+ # there might be some information at the low end.
+--high-freq=-400 # high cutoff frequency, relative to Nyquist of 8000 (=7600)
diff --git a/egs/lrs/asr1/conf/pitch.conf b/egs/lrs/avsr1/conf/pitch.conf
old mode 100644
new mode 100755
similarity index 100%
rename from egs/lrs/asr1/conf/pitch.conf
rename to egs/lrs/avsr1/conf/pitch.conf
diff --git a/egs/lrs/asr1/conf/queue.conf b/egs/lrs/avsr1/conf/queue.conf
old mode 100644
new mode 100755
similarity index 100%
rename from egs/lrs/asr1/conf/queue.conf
rename to egs/lrs/avsr1/conf/queue.conf
diff --git a/egs/lrs/avsr1/conf/slurm.conf b/egs/lrs/avsr1/conf/slurm.conf
new file mode 100755
index 00000000000..cefd21f031d
--- /dev/null
+++ b/egs/lrs/avsr1/conf/slurm.conf
@@ -0,0 +1,12 @@
+# Default configuration
+command sbatch --export=PATH --ntasks-per-node=1
+option time=* --time $0
+option mem=* --mem-per-cpu $0
+option mem=0 # Do not add anything to qsub_opts
+option num_threads=* --cpus-per-task $0 --ntasks-per-node=1
+option num_threads=1 --cpus-per-task 1 --ntasks-per-node=1 # Do not add anything to qsub_opts
+default gpu=0
+option gpu=0 -p cpu
+option gpu=* -p gpu --gres=gpu:$0
+# note: the --max-jobs-run option is supported as a special case
+# by slurm.pl and you don't have to handle it in the config file.
diff --git a/egs/lrs/asr1/conf/specaug.yaml b/egs/lrs/avsr1/conf/specaug.yaml
old mode 100644
new mode 100755
similarity index 100%
rename from egs/lrs/asr1/conf/specaug.yaml
rename to egs/lrs/avsr1/conf/specaug.yaml
diff --git a/egs/lrs/avsr1/conf/train.yaml b/egs/lrs/avsr1/conf/train.yaml
new file mode 100755
index 00000000000..53fd0572132
--- /dev/null
+++ b/egs/lrs/avsr1/conf/train.yaml
@@ -0,0 +1,39 @@
+# network architecture
+# encoder related
+transformer-input-layer: conv2d
+elayers: 12
+eunits: 2048
+# decoder related
+dlayers: 6
+dunits: 2048
+# attention related
+adim: 256
+aheads: 4
+# transformer related
+model-module: "espnet.trainaudio.e2e_asr_transformer:E2E"
+
+# hybrid CTC/attention
+mtlalpha: 0.3
+
+# label smoothing
+lsm-type: unigram
+lsm-weight: 0.1
+
+# minibatch related
+batch-size: 32
+maxlen-in: 512 # if input length > maxlen_in, batchsize is automatically reduced
+maxlen-out: 150 # if output length > maxlen_out, batchsize is automatically reduced
+
+# optimization related
+sortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
+opt: noam
+epochs: 100
+dropout-rate: 0.1
+accum-grad: 2
+grad-clip: 5
+patience: 0
+transformer-lr: 5.0
+transformer-warmup-steps: 25000
+transformer-attn-dropout-rate: 0.0
+transformer-length-normalized-loss: False
+transformer-init: pytorch
diff --git a/egs/lrs/avsr1/local/CMakeLists.txt b/egs/lrs/avsr1/local/CMakeLists.txt
new file mode 100644
index 00000000000..107f5c1e76d
--- /dev/null
+++ b/egs/lrs/avsr1/local/CMakeLists.txt
@@ -0,0 +1,248 @@
+cmake_minimum_required (VERSION 3.8)
+set(CMAKE_CXX_STANDARD 17)
+
+project(OpenFace VERSION 2.0.2)
+
+set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/bin/)
+
+set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_SOURCE_DIR}/cmake/modules/")
+
+set(CMAKE_CONFIG_DIR etc/OpenFace)
+set(CONFIG_DIR "${CMAKE_INSTALL_PREFIX}/${CMAKE_CONFIG_DIR}")
+add_definitions(-DCONFIG_DIR="${CONFIG_DIR}")
+
+# make sure we'll use OpenBLAS only: there's a header file naming difference between
+# implementations, and OpenFace requires OpenBLAS;
+find_package(Threads)
+find_package(OpenBLAS REQUIRED)
+if ( ${OpenBLAS_FOUND} )
+ MESSAGE("OpenBLAS information:")
+ MESSAGE(" OpenBLAS_LIBRARIES: ${OpenBLAS_LIB}")
+else()
+ MESSAGE(FATAL_ERROR "OpenBLAS not found in the system.")
+endif()
+
+if ( ${OpenBLAS_INCLUDE_FOUND} )
+ MESSAGE(" OpenBLAS_INCLUDE: ${OpenBLAS_INCLUDE_DIR}")
+else()
+ MESSAGE(WARNING "OpenBLAS include not found in the system. Using the one vended with OpenFace.")
+ set(OpenBLAS_INCLUDE_DIR "${CMAKE_SOURCE_DIR}/lib/3rdParty/OpenBLAS/include")
+ MESSAGE(" OpenBLAS_INCLUDE: ${OpenBLAS_INCLUDE_DIR}")
+endif()
+
+find_package( OpenCV 4.0 REQUIRED COMPONENTS core imgproc calib3d highgui objdetect)
+if(${OpenCV_FOUND})
+ MESSAGE("OpenCV information:")
+ MESSAGE(" OpenCV_INCLUDE_DIRS: ${OpenCV_INCLUDE_DIRS}")
+ MESSAGE(" OpenCV_LIBRARIES: ${OpenCV_LIBRARIES}")
+ MESSAGE(" OpenCV_LIBRARY_DIRS: ${OpenCV_LINK_DIRECTORIES}")
+else()
+ MESSAGE(FATAL_ERROR "OpenCV not found in the system.")
+endif()
+
+find_package( Boost 1.5.9 COMPONENTS filesystem system)
+if(${Boost_FOUND})
+ MESSAGE("Boost information:")
+ MESSAGE(" Boost_VERSION: ${Boost_VERSION}")
+ MESSAGE(" Boost_INCLUDE_DIRS: ${Boost_INCLUDE_DIRS}")
+ MESSAGE(" Boost_LIBRARIES: ${Boost_LIBRARIES}")
+ MESSAGE(" Boost_LIBRARY_DIRS: ${Boost_LIBRARY_DIRS}")
+else()
+ MESSAGE("Boost not found in the system.")
+endif()
+
+
+# Move LandmarkDetector model
+file(GLOB files "lib/local/LandmarkDetector/model/*.txt")
+foreach(file ${files})
+ file(COPY ${file} DESTINATION ${CMAKE_BINARY_DIR}/bin/model)
+ install(FILES ${file} DESTINATION ${CMAKE_CONFIG_DIR}/model)
+endforeach()
+
+# Move the hierarchical LandmarkDetector models
+file(GLOB files "lib/local/LandmarkDetector/model/model*")
+foreach(file ${files})
+ file(COPY ${file} DESTINATION ${CMAKE_BINARY_DIR}/bin/model)
+ install(DIRECTORY ${file} DESTINATION ${CMAKE_CONFIG_DIR}/model)
+endforeach()
+
+# Move detection validation models
+file(GLOB files "lib/local/LandmarkDetector/model/detection_validation/*.txt")
+foreach(file ${files})
+ file(COPY ${file} DESTINATION ${CMAKE_BINARY_DIR}/bin/model/detection_validation)
+ install(FILES ${file} DESTINATION ${CMAKE_CONFIG_DIR}/model/detection_validation)
+endforeach()
+
+# Move patch experts
+file(GLOB files "lib/local/LandmarkDetector/model/patch_experts/*.txt")
+foreach(file ${files})
+ file(COPY ${file} DESTINATION ${CMAKE_BINARY_DIR}/bin/model/patch_experts)
+ install(FILES ${file} DESTINATION ${CMAKE_CONFIG_DIR}/model/patch_experts)
+endforeach()
+
+# Move CEN patch experts
+file(GLOB files "lib/local/LandmarkDetector/model/patch_experts/*.dat")
+foreach(file ${files})
+ file(COPY ${file} DESTINATION ${CMAKE_BINARY_DIR}/bin/model/patch_experts)
+ install(FILES ${file} DESTINATION ${CMAKE_CONFIG_DIR}/model/patch_experts)
+endforeach()
+
+# Move MTCNN face detector
+file(GLOB files "lib/local/LandmarkDetector/model/mtcnn_detector/*.txt")
+foreach(file ${files})
+ file(COPY ${file} DESTINATION ${CMAKE_BINARY_DIR}/bin/model/mtcnn_detector)
+ install(FILES ${file} DESTINATION ${CMAKE_CONFIG_DIR}/model/mtcnn_detector)
+endforeach()
+
+# Move MTCNN face detector
+file(GLOB files "lib/local/LandmarkDetector/model/mtcnn_detector/*.dat")
+foreach(file ${files})
+ file(COPY ${file} DESTINATION ${CMAKE_BINARY_DIR}/bin/model/mtcnn_detector)
+ install(FILES ${file} DESTINATION ${CMAKE_CONFIG_DIR}/model/mtcnn_detector)
+endforeach()
+
+# Move Point Distribution models
+file(GLOB files "lib/local/LandmarkDetector/model/pdms/*.txt")
+foreach(file ${files})
+ file(COPY ${file} DESTINATION ${CMAKE_BINARY_DIR}/bin/model/pdms)
+ install(FILES ${file} DESTINATION ${CMAKE_CONFIG_DIR}/model/pdms)
+endforeach()
+
+# Move OpenCV classifiers
+file(GLOB files "lib/3rdParty/OpenCV3.4/classifiers/*.xml")
+foreach(file ${files})
+ file(COPY ${file} DESTINATION ${CMAKE_BINARY_DIR}/bin/classifiers)
+ install(FILES ${file} DESTINATION ${CMAKE_CONFIG_DIR}/classifiers)
+endforeach()
+
+# Move AU prediction modules
+file(GLOB files "lib/local/FaceAnalyser/AU_predictors/*.txt")
+foreach(file ${files})
+ file(COPY ${file} DESTINATION ${CMAKE_BINARY_DIR}/bin/AU_predictors)
+ install(FILES ${file} DESTINATION ${CMAKE_CONFIG_DIR}/AU_predictors)
+endforeach()
+
+# Move AU prediction modules
+file(GLOB files "lib/local/FaceAnalyser/AU_predictors/svr*")
+foreach(file ${files})
+ file(COPY ${file} DESTINATION ${CMAKE_BINARY_DIR}/bin/AU_predictors)
+ install(DIRECTORY ${file} DESTINATION ${CMAKE_CONFIG_DIR}/AU_predictors)
+endforeach()
+
+# Move AU prediction modules
+file(GLOB files "lib/local/FaceAnalyser/AU_predictors/svm*")
+foreach(file ${files})
+ file(COPY ${file} DESTINATION ${CMAKE_BINARY_DIR}/bin/AU_predictors)
+ install(DIRECTORY ${file} DESTINATION ${CMAKE_CONFIG_DIR}/AU_predictors)
+endforeach()
+
+if (${CMAKE_CXX_COMPILER_ID} STREQUAL "GNU")
+ execute_process(COMMAND ${CMAKE_CXX_COMPILER} -dumpversion OUTPUT_VARIABLE GCC_VERSION)
+ if (GCC_VERSION VERSION_LESS 8.0)
+        MESSAGE(FATAL_ERROR "Need GCC 8.0 or newer. Current GCC: ${GCC_VERSION}")
+ else ()
+ set (CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -msse -msse2 -msse3")
+ endif ()
+endif ()
+
+# dlib
+find_package(dlib 19.13)
+if(${dlib_FOUND})
+ message("dlib information:")
+ message(" dlib version: ${dlib_VERSION}")
+
+ if (NOT TARGET dlib)
+ add_library(dlib INTERFACE IMPORTED GLOBAL)
+ endif()
+else()
+ message(FATAL_ERROR "dlib not found in the system, please install dlib")
+endif()
+
+# suppress auto_ptr deprecation warnings
+if ("${CMAKE_CXX_COMPILER_ID}" STREQUAL "Clang" OR "${CMAKE_CXX_COMPILER_ID}" STREQUAL "GNU")
+ add_compile_options("-Wno-deprecated-declarations")
+endif()
+
+# LandmarkDetector library
+add_subdirectory(lib/local/LandmarkDetector)
+# Facial Expression analysis library
+add_subdirectory(lib/local/FaceAnalyser)
+# Gaze estimation library
+add_subdirectory(lib/local/GazeAnalyser)
+# Utilities library
+add_subdirectory(lib/local/Utilities)
+
+# test if this file is a top list file
+# thus we're building an OpenFace as a standalone
+# project; otherwise OpenFace is being built as a
+# part of a larger tree;
+if(CMAKE_CURRENT_SOURCE_DIR STREQUAL "${CMAKE_SOURCE_DIR}")
+
+ # for a standalone builds - allow installing package configs;
+ message(STATUS "Standalone mode detected; Enabling configuration/targets export.")
+
+ # export libraries for reuse
+ include(CMakePackageConfigHelpers)
+
+ set(LIB_INSTALL_DIR lib)
+ set(CONFIG_DEST_DIR ${LIB_INSTALL_DIR}/cmake/OpenFace/)
+ set(OpenFace_LIBRARIES OpenFace::GazeAnalyser OpenFace::FaceAnalyser OpenFace::LandmarkDetector OpenFace::Utilities)
+
+ # export targets [build tree]
+ export(EXPORT OpenFaceTargets
+ NAMESPACE OpenFace::
+ FILE "${CMAKE_CURRENT_BINARY_DIR}/${CONFIG_DEST_DIR}/OpenFaceTargets.cmake")
+
+ # write package version file
+ write_basic_package_version_file(
+ "${CMAKE_CURRENT_BINARY_DIR}/${CONFIG_DEST_DIR}/OpenFaceConfigVersion.cmake"
+ COMPATIBILITY AnyNewerVersion)
+
+ # define [build tree] bindir relative include dir
+ foreach(lib ${OpenFace_LIBRARIES})
+ if(TARGET ${lib})
+ get_target_property(libname ${lib} "NAME")
+ file(RELATIVE_PATH rel_incdir ${CMAKE_CURRENT_BINARY_DIR} "${CMAKE_CURRENT_SOURCE_DIR}/lib/local/${libname}/include")
+ list(APPEND OPENFACE_INCLUDE_DIRS ${rel_incdir})
+ endif()
+ endforeach()
+ list(REMOVE_DUPLICATES OPENFACE_INCLUDE_DIRS)
+
+ # write package config file from template [build tree]
+ # all PATH_VARS should be relative to a ${CMAKE_CURRENT_BINARY_DIR}
+ # as it's the "prefix" of our non installed package in the build tree
+ configure_package_config_file(cmake/OpenFaceConfig.cmake.in
+ "${CMAKE_CURRENT_BINARY_DIR}/${CONFIG_DEST_DIR}/OpenFaceConfig.cmake"
+ INSTALL_DESTINATION ${CONFIG_DEST_DIR}
+ PATH_VARS OPENFACE_INCLUDE_DIRS)
+
+ # store current build dir in the CMake package registry
+ # export(PACKAGE OpenFace)
+
+ # install exported targets [install tree]
+ install(EXPORT OpenFaceTargets
+ FILE OpenFaceTargets.cmake
+ NAMESPACE OpenFace::
+ DESTINATION ${CONFIG_DEST_DIR})
+
+ # redefine [install tree] prefix relative include dir
+ set(OPENFACE_INCLUDE_DIRS "include/OpenFace")
+
+ # write package config file from template [install tree]
+ configure_package_config_file(cmake/OpenFaceConfig.cmake.in
+ "${CMAKE_CURRENT_BINARY_DIR}/OpenFace/OpenFaceConfig.cmake"
+ INSTALL_DESTINATION ${CONFIG_DEST_DIR}
+ PATH_VARS OPENFACE_INCLUDE_DIRS)
+
+ # install package configs
+ install(FILES
+ "${CMAKE_CURRENT_BINARY_DIR}/OpenFace/OpenFaceConfig.cmake"
+ "${CMAKE_CURRENT_BINARY_DIR}/${CONFIG_DEST_DIR}/OpenFaceConfigVersion.cmake"
+ DESTINATION ${CONFIG_DEST_DIR})
+endif()
+
+# executables
+add_subdirectory(exe/FaceLandmarkImg)
+add_subdirectory(exe/FaceLandmarkVid)
+add_subdirectory(exe/FaceLandmarkVidMulti)
+add_subdirectory(exe/FeatureExtraction)
diff --git a/egs/lrs/avsr1/local/download.sh b/egs/lrs/avsr1/local/download.sh
new file mode 100644
index 00000000000..8d17609a9b2
--- /dev/null
+++ b/egs/lrs/avsr1/local/download.sh
@@ -0,0 +1,13 @@
+#! /usr/bin/env bash
+
+# Copyright 2020 Ruhr-University (Wentao Yu)
+
+. ./cmd.sh
+. ./path.sh
+
+git clone https://github.com/rub-ksv/lrs_avsr1_local.git
+for file in data_prepare dump extract_reliability training; do
+ cp -R lrs_avsr1_local/$file local
+done
+rm -rf lrs_avsr1_local
+exit 0
diff --git a/egs/lrs/avsr1/local/installpackage.sh b/egs/lrs/avsr1/local/installpackage.sh
new file mode 100755
index 00000000000..066a396101a
--- /dev/null
+++ b/egs/lrs/avsr1/local/installpackage.sh
@@ -0,0 +1,82 @@
+#! /usr/bin/env bash
+
+# Copyright 2020 Ruhr-University (Wentao Yu)
+
+. ./cmd.sh
+. ./path.sh
+
+# hand over parameters
+OPENFACE_DIR=$1 # Path to OpenFace build directory
+VIDAUG_DIR=$2 # Path to vidaug directory
+DEEPXI_DIR=$3 # DeepXi directory
+
+conda install -n espnet_venv tensorflow tqdm pysoundfile boost
+conda install -n espnet_venv dlib pythran-openblas==0.3.6 opencv-python
+conda install -c esri tensorflow-addons
+
+mkdir -p local/installations
+if [ -d "$OPENFACE_DIR" ] ; then
+ echo "OpenFace already installed."
+else
+ while true
+ do
+ read -r -p "Have you already installed OpenFace on your computer [Y/n] " input
+ case $input in
+ [yY][eE][sS]|[yY])
+ echo "Please path OpenFace directory"
+ exit 1;
+ ;;
+ [nN][oO]|[nN])
+ cd local/installations
+ $MAIN_ROOT/tools/installers/install_openface.sh || exit 1;
+ cd ../..
+ break
+ ;;
+ esac
+ done
+fi
+
+if [ -d "$VIDAUG_DIR" ] ; then
+ echo "Vidaug already installed."
+else
+ while true
+ do
+ read -r -p "Have you already installed Vidaug on your computer [Y/n] " input
+ case $input in
+ [yY][eE][sS]|[yY])
+ echo "Please path Vidaug directory"
+ exit 1;
+ ;;
+ [nN][oO]|[nN])
+ cd local/installations
+ $MAIN_ROOT/tools/installers/install_vidaug.sh $MAIN_ROOT || exit 1;
+ cd ../..
+ break
+ ;;
+ esac
+ done
+fi
+
+if [ -d "$DEEPXI_DIR" ] ; then
+ echo "DeepXi already installed."
+else
+ while true
+ do
+ read -r -p "Have you already installed DeepXi on your computer [Y/n] " input
+ case $input in
+ [yY][eE][sS]|[yY])
+ echo "Please path DeepXi directory"
+ exit 1;
+ ;;
+ [nN][oO]|[nN])
+ cd local/installations
+ $MAIN_ROOT/tools/installers/install_deepxi.sh || exit 1;
+ cd ../..
+ break
+ ;;
+ esac
+ done
+fi
+exit 0
diff --git a/egs/lrs/avsr1/local/se_batch.py b/egs/lrs/avsr1/local/se_batch.py
new file mode 100755
index 00000000000..c5f0a58bf6b
--- /dev/null
+++ b/egs/lrs/avsr1/local/se_batch.py
@@ -0,0 +1,61 @@
+""" AUTHOR: Aaron Nicolson
+AFFILIATION: Signal Processing Laboratory, Griffith University.
+
+This Source Code Form is subject to the terms of the Mozilla Public
+License, v. 2.0. If a copy of the MPL was not distributed with this
+file, You can obtain one at http://mozilla.org/MPL/2.0/."""
+
+from deepxi.utils import read_wav
+import glob
+import numpy as np
+import os
+
+
+def Batch(fdir, snr_l=None):
+ """REQUIRES REWRITING. WILL BE MOVED TO deepxi/utils.py
+
+ Places all of the test waveforms from the list into a numpy array.
+ SPHERE format cannot be used. 'glob' is used to support Unix style pathname
+ pattern expansions. Waveforms are padded to the maximum waveform length. The
+ waveform lengths are recorded so that the correct lengths can be sliced
+ for feature extraction. The SNR levels of each test file are placed into a
+ numpy array. Also returns a list of the file names.
+
+    Inputs:
+        fdir - directory containing the waveforms.
+        snr_l - list of the SNR levels used.
+
+    Outputs:
+        wav_np - matrix of padded waveforms stored as a numpy array.
+        len_np - length of each waveform stored as a numpy array.
+        snr_test_np - numpy array of all the SNR levels for the test set.
+        fname_l - list of filenames.
+
+ """
+    if snr_l is None:  # avoid a mutable default argument
+        snr_l = []
+    fname_l = []  # list of file names.
+ wav_l = [] # list for waveforms.
+ snr_test_l = [] # list of SNR levels for the test set.
+ # if isinstance(fnames, str): fnames = [fnames] # if string, put into list.
+ fnames = ["*.wav", "*.flac", "*.mp3"]
+ for fname in fnames:
+ for fpath in glob.glob(os.path.join(fdir, fname)):
+ for snr in snr_l:
+ if fpath.find("_" + str(snr) + "dB") != -1:
+ snr_test_l.append(snr) # append SNR level.
+ (wav, _) = read_wav(fpath) # read waveform from given file path.
+ if len(wav.shape) == 2:
+ wav = wav[:, 0]
+ if np.isnan(wav).any() or np.isinf(wav).any():
+ raise ValueError("Error: NaN or Inf value.")
+ wav_l.append(wav) # append.
+ fname_l.append(os.path.basename(os.path.splitext(fpath)[0])) # append name.
+ len_l = [] # list of the waveform lengths.
+ maxlen = max(len(wav) for wav in wav_l) # maximum length of waveforms.
+ wav_np = np.zeros(
+ [len(wav_l), maxlen], np.int16
+ ) # numpy array for waveform matrix.
+ for (i, wav) in zip(range(len(wav_l)), wav_l):
+ wav_np[i, : len(wav)] = wav # add waveform to numpy array.
+ len_l.append(len(wav)) # append length of waveform to list.
+ return wav_np, np.array(len_l, np.int32), np.array(snr_test_l, np.int32), fname_l
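+
+
+# Example usage (hypothetical directory and SNR list):
+#     wavs, lens, snrs, names = Batch("wavs/test", snr_l=[-6, 0, 6])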
diff --git a/egs/lrs/avsr1/local/show_result.sh b/egs/lrs/avsr1/local/show_result.sh
new file mode 100755
index 00000000000..35f5915cfbf
--- /dev/null
+++ b/egs/lrs/avsr1/local/show_result.sh
@@ -0,0 +1,77 @@
+#!/bin/bash
+mindepth=0
+maxdepth=1
+
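+# Example invocation (assumed paths; appends the Markdown report to the save file):
+#     local/show_result.sh --mindepth 0 --maxdepth 1 exp exp/RESULTS.md
+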
+. utils/parse_options.sh
+
+if [ $# -gt 2 ]; then
+ echo "Usage: $0 --mindepth 0 --maxdepth 1 [exp]" 1>&2
+ echo ""
+ echo "Show the system environments and the evaluation results in Markdown format."
+ echo 'The default of is "exp/".'
+ exit 1
+fi
+
+[ -f ./path.sh ] && . ./path.sh
+set -euo pipefail
+if [ $# -eq 1 ]; then
+    exp=$1
+    savedir=/dev/stdout  # no save file given: print the report to stdout
+else
+    exp=$1
+    savedir=$2
+fi
+
+
+cat << EOF
+
+# RESULTS
+## Environments
+- date: \`$(LC_ALL=C date)\`
+EOF
+
+python << EOF
+import sys, espnet, chainer, torch
+pyversion = sys.version.replace('\n', ' ')
+
+print(f"""- python version: \`{pyversion}\`
+- espnet version: \`espnet {espnet.__version__}\`
+- chainer version: \`chainer {chainer.__version__}\`
+- pytorch version: \`pytorch {torch.__version__}\`""")
+EOF
+
+cat << EOF
+- Git hash: \`$(git rev-parse HEAD)\`
+ - Commit date: \`$(git log -1 --format='%cd')\`
+
+EOF
+
+while IFS= read -r expdir; do
+ if ls ${expdir}/decode_*/result.txt &> /dev/null; then
+ # 1. Show the result table
+ cat << EOF
+## $(basename ${expdir})
+### CER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+EOF
+ grep -e Avg ${expdir}/decode_*/result.txt \
+ | sed -e "s#${expdir}/\([^/]*\)/result.txt:#|\1#g" \
+ | sed -e 's#Sum/Avg##g' | tr '|' ' ' | tr -s ' ' '|'
+ echo
+
+ # 2. Show the result table for WER
+ if ls ${expdir}/decode_*/result.wrd.txt &> /dev/null; then
+ cat << EOF
+### WER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+EOF
+ grep -e Avg ${expdir}/decode_*/result.wrd.txt \
+ | sed -e "s#${expdir}/\([^/]*\)/result.wrd.txt:#|\1#g" \
+ | sed -e 's#Sum/Avg##g' | tr '|' ' ' | tr -s ' ' '|'
+ echo
+ fi
+ fi
+done < <(find ${exp} -mindepth ${mindepth} -maxdepth ${maxdepth} -type d) >>$savedir
diff --git a/egs/lrs/avsr1/path.sh b/egs/lrs/avsr1/path.sh
new file mode 100755
index 00000000000..aa33934494e
--- /dev/null
+++ b/egs/lrs/avsr1/path.sh
@@ -0,0 +1,18 @@
+MAIN_ROOT=$PWD/../../..
+KALDI_ROOT=$MAIN_ROOT/tools/kaldi
+
+
+
+export PATH=$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$KALDI_ROOT/src/featbin:$PATH
+[ ! -f $KALDI_ROOT/tools/config/common_path.sh ] && echo >&2 "The standard file $KALDI_ROOT/tools/config/common_path.sh is not present -> Exit!" && exit 1
+. $KALDI_ROOT/tools/config/common_path.sh
+export LC_ALL=C
+
+export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:$MAIN_ROOT/tools/chainer_ctc/ext/warp-ctc/build
+. "${MAIN_ROOT}"/tools/activate_python.sh && . "${MAIN_ROOT}"/tools/extra_path.sh
+export PATH=$MAIN_ROOT/utils:$MAIN_ROOT/espnet/bin:$PATH
+
+export OMP_NUM_THREADS=1
+
+# NOTE(kan-bayashi): Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
+export PYTHONIOENCODING=UTF-8
diff --git a/egs/lrs/avsr1/run.sh b/egs/lrs/avsr1/run.sh
new file mode 100755
index 00000000000..6e295ad5290
--- /dev/null
+++ b/egs/lrs/avsr1/run.sh
@@ -0,0 +1,1050 @@
+#!/usr/bin/env bash
+
+# Copyright 2020 Ruhr-University (Wentao Yu)
+
+. ./path.sh || exit 1;
+. ./cmd.sh || exit 1;
+
+
+# general configuration
+ifpretrain=true # if use LRS2 pretrain set
+iflrs3pretrain=true # if use LRS3 pretrain set
+ifsegment=true # if do segmentation for pretrain set
+ifcuda=true # if use cuda
+ifmulticore=true # if multi cpu processing, default is true in all scripts
+num= # number of utterances used in debug mode; only applies when ifdebug=true
+ifdebug=false # with debug, we only use $num Utts from pretrain and $num Utts from Train set
+backend=pytorch
+stage=-1 # start from -1 if you need to start from data download
+stop_stage=100 # stage at which to stop
+dataprocessingstage=0 # stage for data processing in stage 3
+stop_dataprocessingstage=100 # stage at which to stop
+ngpu=1 # number of gpus ("0" uses cpu, otherwise use gpu)
+nj=16
+debugmode=1
+dumpdir=dump # directory to dump full features
+N=0 # number of minibatches to be used (mainly for debugging). "0" uses all minibatches.
+verbose=0 # verbose option
+train_lm=false # true: Train own language model, false: use pretrained librispeech LM model
+
+# Setting path variables for dataset, OpenFace, DeepXi, pretrained model and musan
+# Change these variables to match your folder structure
+DATA_DIR= # The LRS2 dataset directory e.g. "/home/foo/LRS2"
+DATALRS3_DIR= # The LRS3 dataset directory e.g. "/home/foo/LRS3"
+PRETRAINEDMODEL=pretrainedvideomodel/Video_only_model.pt # Path to pretrained video model e.g. "pretrainedvideomodel/Video_only_model.pt"
+MUSAN_DIR="musan" # The noise dataset directory e.g. "musan"
+
+# feature configuration
+do_delta=false
+
+preprocess_config=conf/specaug.yaml
+train_config=conf/train.yaml
+lm_config=conf/lm.yaml
+decode_config=conf/decode.yaml
+
+# rnnlm related
+lm_resume= # specify a snapshot file to resume LM training
+lmtag= # tag for managing LMs
+
+# bpemode (unigram or bpe)
+nbpe=500
+bpemode=unigram
+
+# exp tag
+tag="" # tag for managing experiments.
+
+. utils/parse_options.sh || exit 1;
+
+## Function for pretrained Librispeech language model:
+function gdrive_download () {
+ CONFIRM=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate \
+ "https://docs.google.com/uc?export=download&id=$1" -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')
+ wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$CONFIRM&id=$1" -O $2
+ rm -rf /tmp/cookies.txt
+}
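+# Example use (this file ID is the pretrained video model fetched in stage 1):
+#   gdrive_download '1ITgdZoa8vQ7lDwi1jLziYGXOyUtgE2ow' 'model.v1.tar.gz'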
+
+# Set bash to 'debug' mode, it will exit on :
+# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands',
+set -e
+set -u
+set -o pipefail
+
+# define sets
+if [ "$ifpretrain" = true ] ; then
+ train_set="pretrain_Train"
+else
+ train_set="Train"
+fi
+train_dev="Val"
+recog_set="Val Test"
+
+
+
+
+# Stage -1: download local folder
+if [ ${stage} -le -1 ] && [ ${stop_stage} -ge -1 ]; then
+ # download required files for data processing
+ local/download.sh
+fi
+
+# Stage 0: install software
+OPENFACE_DIR=local/installations/OpenFace/build/bin # Path to OpenFace build directory
+VIDAUG_DIR=local/installations/vidaug # Path to vidaug directory
+DEEPXI_DIR=local/installations/DeepXi # DeepXi directory
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+ # Install required softwares
+ local/installpackage.sh $OPENFACE_DIR $VIDAUG_DIR $DEEPXI_DIR
+fi
+
+# Stage 1: Data preparation
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    ### Task dependent. You have to do the following data preparation yourself.
+ ### But you can utilize Kaldi recipes in most cases
+ echo "stage 1: Data preparation"
+
+ echo "Download pretrained video feature extractor and check directory configuration"
+ if [ -f "$PRETRAINEDMODEL" ] ; then
+ echo "pretrained video feature extractor already exists"
+ else
+ gdrive_download '1ITgdZoa8vQ7lDwi1jLziYGXOyUtgE2ow' 'model.v1.tar.gz' || exit 1;
+ tar -xf model.v1.tar.gz || exit 1;
+ mv model.v1/avsrlrs2_3/pretrainedvideomodel ./
+ rm -rf model.v1
+ rm -rf model.v1.tar.gz
+ fi
+
+ if [ -d "$DATA_DIR" ] ; then
+ echo "Dataset already exists."
+ else
+ echo "For downloading the data, please visit 'https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html'."
+ echo "You will need to sign a Data Sharing agreement with BBC Research & Development before getting access."
+ echo "Please download the dataset by yourself and save the dataset directory in path.sh file"
+ echo "Thanks!"
+ fi
+
+ if [ "$iflrs3pretrain" = true ] ; then
+ if [ -d "$DATALRS3_DIR" ]; then
+ echo "Dataset already exists."
+ else
+ echo "For downloading the data, please visit 'https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html'."
+ echo "You will need to sign a Data Sharing agreement with BBC Research & Development before getting access."
+ echo "Please download the dataset by yourself and save the dataset directory in path.sh file"
+ echo "Thanks!"
+ fi
+ fi
+
+ # Create Musan directory
+ if [ -d "${MUSAN_DIR}" ]; then
+ echo "MUSAN dataset is in ${MUSAN_DIR}..."
+ else
+ echo "Download MUSAN dataset"
+ wget --no-check-certificate http://www.openslr.org/resources/17/musan.tar.gz
+ echo "Download finished"
+ echo "Unzip MUSAN dataset"
+ tar -xf musan.tar.gz
+ rm -rf musan.tar.gz
+ echo "Unzipping finished"
+ fi
+ # Create RIRS_NOISES Dataset
+ if [ -d "RIRS_NOISES" ]; then
+ echo "RIRS_NOISES dataset is in RIRS_NOISES..."
+ else
+ # Download the package that includes the real RIRs, simulated RIRs, isotropic noises and point-source noises
+ echo "Download RIRS_NOISES dataset"
+ wget --no-check-certificate http://www.openslr.org/resources/28/rirs_noises.zip
+ echo "Download finished"
+ echo "Unzip RIRS_NOISES dataset"
+ unzip rirs_noises.zip
+ rm -rf rirs_noises.zip
+ echo "Unzipping finished"
+ fi
+
+ for part in Test Val Train; do
+ # use underscore-separated names in data directories. #Problem: Filelist_Val is readonly
+ local/data_prepare/lrs2_audio_data_prep.sh ${DATA_DIR} $part $ifsegment $ifmulticore $ifdebug $num $nj || exit 1;
+ done
+ if [ "$ifpretrain" = true ] ; then
+ part=pretrain
+ local/data_prepare/lrs2_audio_data_prep.sh ${DATA_DIR} $part $ifsegment $ifmulticore $ifdebug $num $nj || exit 1;
+ fi
+
+ if [ "$iflrs3pretrain" = true ] ; then
+
+ ## embedding LRS3 code
+ python3 -m venv --system-site-packages ./LRS3-env
+ source ./LRS3-env/bin/activate
+ pip3 install pydub
+ local/data_prepare/lrs3_audio_data_prep.sh $DATALRS3_DIR pretrain $ifmulticore $ifsegment $ifdebug $num
+ deactivate
+ rm -rf ./LRS3-env
+ mkdir -p data/audio/clean/LRS3/pretrain
+ mv Dataset_processing/LRS3/kaldi/pretrainsegment/* data/audio/clean/LRS3/pretrain
+ cp Dataset_processing/LRS3/audio/pretrain/Filelist_pretrain Dataset_processing/LRS3/audio/pretrain/Filelist_LRS3pretrain
+ mv Dataset_processing/LRS3/audio/pretrain/Filelist_LRS3pretrain data/METADATA
+ fi
+ echo "stage 1: Data preparation finished"
+
+fi
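+# For reference, the LRS2 corpus is expected in its released layout (a sketch only;
+# verify against your download):
+#   ${DATA_DIR}/main/<id>/<clip>.mp4 and <clip>.txt      # Train/Val/Test clips
+#   ${DATA_DIR}/pretrain/<id>/<clip>.mp4 and <clip>.txt  # pretrain subset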
+
+# Stage 2: Audio augmentation
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+ ### Task dependent. You have to prepare the data in the following part by yourself.
+ ### But you can utilize Kaldi recipes in most cases
+ echo "stage 2: Audio augmentation"
+ for part in Test Val Train; do
+ # use underscore-separated names in data directories.
+ local/extract_reliability/audio_augmentation.sh $MUSAN_DIR $part LRS2 || exit 1;
+ done
+
+ if [ "$ifpretrain" = true ] ; then
+ part=pretrain
+ local/extract_reliability/audio_augmentation.sh $MUSAN_DIR $part LRS2 || exit 1;
+ fi
+ if [ "$iflrs3pretrain" = true ] ; then
+ part=pretrain
+ local/extract_reliability/audio_augmentation.sh $MUSAN_DIR $part LRS3 || exit 1;
+ fi
+ # The Test set is augmented with ambient and music noise at SNRs from -12 to 12 dB
+ local/extract_reliability/audio_augmentation_recog.sh $MUSAN_DIR Test LRS2 || exit 1;
+ echo "Datasets Combination"
+ if [[ "$ifpretrain" = true || "$iflrs3pretrain" = true ]] ; then ## combine pretrain and train set
+ if [[ "$ifpretrain" = true && "$iflrs3pretrain" = false ]] ; then
+ utils/combine_data.sh data/audio/augment/pretrain_Train_aug \
+ data/audio/augment/LRS2_Train_aug \
+ data/audio/augment/LRS2_pretrain_aug || exit 1;
+ utils/combine_data.sh data/audio/augment/pretrain_aug \
+ data/audio/augment/LRS2_pretrain_aug || exit 1;
+ elif [[ "$ifpretrain" = false && "$iflrs3pretrain" = true ]] ; then
+ utils/combine_data.sh data/audio/augment/pretrain_Train_aug \
+ data/audio/augment/LRS2_Train_aug \
+ data/audio/augment/LRS3_pretrain_aug || exit 1;
+ utils/combine_data.sh data/audio/augment/pretrain_aug \
+ data/audio/augment/LRS3_pretrain_aug || exit 1;
+ elif [[ "$ifpretrain" = true && "$iflrs3pretrain" = true ]] ; then
+ utils/combine_data.sh data/audio/augment/pretrain_Train_aug \
+ data/audio/augment/LRS2_Train_aug \
+ data/audio/augment/LRS2_pretrain_aug \
+ data/audio/augment/LRS3_pretrain_aug || exit 1;
+ utils/combine_data.sh data/audio/augment/pretrain_aug \
+ data/audio/augment/LRS2_pretrain_aug \
+ data/audio/augment/LRS3_pretrain_aug || exit 1;
+ fi
+ fi
+ mv data/audio/augment/LRS2_Test_aug data/audio/augment/Test_aug
+ mv data/audio/augment/LRS2_Val_aug data/audio/augment/Val_aug
+ mv data/audio/augment/LRS2_Train_aug data/audio/augment/Train_aug
+
+ echo "stage 2: Audio augmentation finished"
+
+fi
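+# Optional sanity check (a sketch): the augmented data directories can be validated
+# with the standard Kaldi utility before feature extraction, e.g.
+#   utils/validate_data_dir.sh --no-feats data/audio/augment/Train_aug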
+
+mp3files=Dataset_processing/Audioaugments
+feat_tr_dir=${dumpdir}/audio_org/${train_set}/delta${do_delta}; mkdir -p ${feat_tr_dir}
+# Stage 3: Feature Generation for audio features
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+ echo "stage 3: Feature Generation"
+ echo "stage 3.1: Make augmented mp3 files"
+ mkdir -p $mp3files
+ if [ "$ifpretrain" = false ] && [ "$iflrs3pretrain" = false ] ; then
+ for part in Test Val Train; do
+ echo "Run audioaugwav frames for ${part} set!"
+ mkdir -p ${mp3files}/$part
+ local/extract_reliability/audioaugwav.sh data/audio/augment/${part}_aug $mp3files/$part || exit 1;
+ done
+
+ else
+ for part in Test Val Train pretrain; do
+ echo "Run audioaugwav for the ${part} set!"
+ mkdir -p $mp3files/$part
+ local/extract_reliability/audioaugwav.sh data/audio/augment/${part}_aug $mp3files/$part || exit 1;
+ done
+
+ part=pretrain
+ python3 local/extract_reliability/segaugaudio.py $mp3files data/audio/augment $part $ifmulticore
+ rm -r ${mp3files:?}/${part:?}
+ mv ${mp3files}/${part}_aug $mp3files/${part}
+ fi
+ nameambient=noise
+ namemusic=music
+ name_list="${nameambient} ${namemusic}"
+ for name in ${name_list};do
+ dset=Test
+ mkdir -p ${mp3files}/${dset}_${name} || exit 1;
+ local/extract_reliability/audioaugwav.sh data/audio/augment/LRS2_decode/${dset}_aug_${name} $mp3files/${dset}_${name} || exit 1;
+ done
+ echo "stage 3.1: Make augmented mp3 files finished"
+
+ ### Task dependent. You have to design training and dev sets by yourself.
+ ### But you can utilize Kaldi recipes in most cases
+ echo "stage 3.2: Feature Generation"
+
+ fbankdir=fbank
+ mfccdir=mfccs
+ if [[ "$ifpretrain" = true || "$iflrs3pretrain" = true ]] ; then ## combine pretrain and train set
+ # Generate the fbank and mfcc features; by default 80-dimensional fbanks with pitch on each frame
+
+ mv data/audio/augment/pretrain_aug/segments data/audio/augment/pretrain_aug/segments_old
+ mv data/audio/augment/pretrain_Train_aug/segments data/audio/augment/pretrain_Train_aug/segments_old
+ for x in pretrain Train Test Val; do
+ mv data/audio/augment/${x}_aug/wav.scp data/audio/augment/${x}_aug/wavnew.scp
+ python3 local/extract_reliability/remakewav.py data/audio/augment/${x}_aug/wavnew.scp data/audio/augment/${x}_aug/wav.scp Dataset_processing/Audioaugments/$x
+ cp -R data/audio/augment/${x}_aug data/audio/augment/${x}mfccs_aug
+ mv data/audio/augment/${x}_aug data/audio/augment/${x}fbank_aug
+ steps/make_mfcc.sh \
+ --cmd "$train_cmd" \
+ --nj $nj \
+ --write_utt2num_frames true \
+ data/audio/augment/${x}mfccs_aug \
+ exp/make_mfcc/${x} \
+ ${mfccdir} || exit 1;
+ utils/fix_data_dir.sh data/audio/augment/${x}mfccs_aug || exit 1;
+ steps/make_fbank_pitch.sh \
+ --cmd "$train_cmd" \
+ --nj $nj \
+ --write_utt2num_frames true \
+ data/audio/augment/${x}fbank_aug \
+ exp/make_fbank/${x} \
+ ${fbankdir} || exit 1;
+ utils/fix_data_dir.sh data/audio/augment/${x}fbank_aug || exit 1;
+ done
+
+ utils/combine_data.sh data/audio/augment/pretrain_Trainfbank_aug \
+ data/audio/augment/pretrainfbank_aug \
+ data/audio/augment/Trainfbank_aug || exit 1;
+ utils/combine_data.sh data/audio/augment/pretrain_Trainmfccs_aug \
+ data/audio/augment/pretrainmfccs_aug \
+ data/audio/augment/Trainmfccs_aug || exit 1;
+ else
+ # Generate the fbank and mfcc features; by default 80-dimensional fbanks with pitch on each frame
+ for x in Train Val Test; do
+ cp -R data/audio/augment/${x}_aug data/audio/augment/${x}mfccs_aug
+ mv data/audio/augment/${x}_aug data/audio/augment/${x}fbank_aug
+ steps/make_mfcc.sh \
+ --cmd "$train_cmd" \
+ --nj $nj \
+ --write_utt2num_frames true \
+ data/audio/augment/${x}mfccs_aug \
+ exp/make_mfcc/${x} \
+ ${mfccdir} || exit 1;
+ utils/fix_data_dir.sh data/audio/augment/${x}mfccs_aug || exit 1;
+ steps/make_fbank_pitch.sh \
+ --cmd "$train_cmd" \
+ --nj $nj \
+ --write_utt2num_frames true \
+ data/audio/augment/${x}fbank_aug \
+ exp/make_fbank/${x} \
+ ${fbankdir} || exit 1;
+ utils/fix_data_dir.sh data/audio/augment/${x}fbank_aug || exit 1;
+ done
+ fi
+
+ ## make fbank and mfcc features for the test decode dataset
+ x=Test
+ nameambient=noise
+ namemusic=music
+ name_list="${nameambient} ${namemusic}"
+ for name in ${name_list};do
+ rm -rf data/audio/augment/LRS2_decode/${x}mfccs_aug_${name}
+ rm -rf data/audio/augment/LRS2_decode/${x}fbank_aug_${name}
+ cp -R data/audio/augment/LRS2_decode/${x}_aug_${name} data/audio/augment/LRS2_decode/${x}mfccs_aug_${name} || exit 1;
+ mv data/audio/augment/LRS2_decode/${x}_aug_${name} data/audio/augment/LRS2_decode/${x}fbank_aug_${name} || exit 1;
+ steps/make_mfcc.sh \
+ --cmd "$train_cmd" \
+ --nj $nj \
+ --write_utt2num_frames true \
+ data/audio/augment/LRS2_decode/${x}mfccs_aug_${name} \
+ exp/make_mfcc/${x}_${name} ${mfccdir} || exit 1;
+ utils/fix_data_dir.sh data/audio/augment/LRS2_decode/${x}mfccs_aug_${name} || exit 1;
+ steps/make_fbank_pitch.sh \
+ --cmd "$train_cmd" \
+ --nj $nj \
+ --write_utt2num_frames true \
+ data/audio/augment/LRS2_decode/${x}fbank_aug_${name} \
+ exp/make_fbank/${x}_${name} ${fbankdir} || exit 1;
+ utils/fix_data_dir.sh data/audio/augment/LRS2_decode/${x}fbank_aug_${name} || exit 1;
+ done
+
+ # compute global CMVN
+ compute-cmvn-stats scp:data/audio/augment/${train_set}fbank_aug/feats.scp data/audio/augment/${train_set}fbank_aug/cmvn.ark || exit 1;
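+ # The resulting global CMVN statistics can be inspected if needed (optional sketch):
+ #   copy-matrix --binary=false data/audio/augment/${train_set}fbank_aug/cmvn.ark -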
+
+ # dump features
+ dump.sh \
+ --cmd "$train_cmd" \
+ --nj $nj \
+ --do_delta ${do_delta} \
+ data/audio/augment/${train_set}fbank_aug/feats.scp \
+ data/audio/augment/${train_set}fbank_aug/cmvn.ark \
+ exp/dump_feats/${train_set}fbank_aug ${feat_tr_dir} || exit 1;
+
+ for rtask in ${recog_set} Train pretrain; do
+ feat_recog_dir=${dumpdir}/audio_org/${rtask}/delta${do_delta}; mkdir -p ${feat_recog_dir}
+ dump.sh \
+ --cmd "$train_cmd" \
+ --nj $nj \
+ --do_delta ${do_delta} data/audio/augment/${rtask}fbank_aug/feats.scp \
+ data/audio/augment/${train_set}fbank_aug/cmvn.ark \
+ exp/dump_feats/recog/${rtask} \
+ ${feat_recog_dir} || exit 1;
+ done
+
+ # make dump file for Test decode File
+ nameambient=noise
+ namemusic=music
+ name_list="${nameambient} ${namemusic}"
+ for name in ${name_list};do
+ feat_recog_dir=${dumpdir}/audio_org/Test_decode_${name}/delta${do_delta}; mkdir -p ${feat_recog_dir}
+ dump.sh \
+ --cmd "$train_cmd" \
+ --nj $nj \
+ --do_delta ${do_delta} data/audio/augment/LRS2_decode/Testfbank_aug_${name}/feats.scp \
+ data/audio/augment/${train_set}fbank_aug/cmvn.ark \
+ exp/dump_feats/recog/Test_${name} \
+ ${feat_recog_dir} || exit 1;
+ done
+
+ echo "stage 3.2: Audio Feature Generation finished"
+ echo "stage 3: Feature Generation finished"
+fi
+
+
+dict=data/lang_char/${train_set}_${bpemode}${nbpe}_units.txt
+bpemodel=data/lang_char/${train_set}_${bpemode}${nbpe}
+echo "dictionary: ${dict}"
+# Stage 4: Dictionary and JSON Data Preparation
+if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
+ ### Task dependent. You have to check non-linguistic symbols used in the corpus.
+ echo "stage 4: Dictionary and Json Data Preparation"
+ if [ "$train_lm" = true ] ; then
+ mkdir -p data/lang_char/
+ echo " 1" > ${dict} # must be 1, 0 will be used for "blank" in CTC
+ cut -f 2- -d" " data/${train_set}/text > data/lang_char/input.txt
+ spm_train --input=data/lang_char/input.txt \
+ --vocab_size=${nbpe} \
+ --model_type=${bpemode} \
+ --model_prefix=${bpemodel} \
+ --input_sentence_size=100000000 || exit 1;
+ spm_encode --model=${bpemodel}.model \
+ --output_format=piece < data/lang_char/input.txt | tr ' ' '\n' | sort | uniq | awk '{print $0 " " NR+1}' >> ${dict} || exit 1;
+ wc -l ${dict}
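+ # The resulting unit file maps one BPE token per line to an integer ID, e.g. (illustrative):
+ #   <unk> 1
+ #   ▁a 2
+ #   ▁about 3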
+ else
+ # if using external librispeech lm
+ gdrive_download '1ITgdZoa8vQ7lDwi1jLziYGXOyUtgE2ow' 'model.v1.tar.gz' || exit 1;
+ tar -xf model.v1.tar.gz || exit 1;
+ mv model.v1/avsrlrs2_3/exp/train_rnnlm_pytorch_lm_unigram500 exp/train_rnnlm_pytorch_lm_unigram500
+ mv model.v1/avsrlrs2_3/data/lang_char data/
+ mv data/lang_char/train_unigram500.model data/lang_char/${train_set}_unigram500.model
+ mv data/lang_char/train_unigram500.vocab data/lang_char/${train_set}_unigram500.vocab
+ mv data/lang_char/train_unigram500_units.txt data/lang_char/${train_set}_unigram500_units.txt
+ rm -rf model.v1
+ rm -rf model.v1.tar.gz
+
+ ##### This depends on your corpus: if the corpus transcriptions are uppercase, use this to convert them to lowercase
+
+ textfilenames="data/audio/augment/*/text"
+ textdecodefilenames="data/audio/augment/LRS2_decode/*/text"
+ textcleanfilenames="data/audio/clean/*/*/text"
+ for textfilename in $textfilenames $textdecodefilenames $textcleanfilenames; do
+ sed -r 's/([^ \t]+\s)(.*)/\1\L\2/' $textfilename > ${textfilename}1 || exit 1;
+ rm -rf $textfilename || exit 1;
+ mv ${textfilename}1 $textfilename || exit 1;
+ done
+ fi
+
+ # make json labels
+ data2json.sh --feat ${feat_tr_dir}/feats.scp --bpecode ${bpemodel}.model \
+ data/audio/augment/${train_set}fbank_aug ${dict} > ${feat_tr_dir}/data_${bpemode}${nbpe}.json || exit 1;
+ for rtask in ${recog_set} Train pretrain; do
+ sed -r 's/([^ \t]+\s)(.*)/\1\L\2/' data/audio/augment/${rtask}fbank_aug/text > data/audio/augment/${rtask}fbank_aug/text1 || exit 1;
+ rm -rf data/audio/augment/${rtask}fbank_aug/text || exit 1;
+ mv data/audio/augment/${rtask}fbank_aug/text1 data/audio/augment/${rtask}fbank_aug/text || exit 1;
+
+ feat_recog_dir=${dumpdir}/audio_org/${rtask}/delta${do_delta}
+ data2json.sh --feat ${feat_recog_dir}/feats.scp --bpecode ${bpemodel}.model \
+ data/audio/augment/${rtask}fbank_aug ${dict} > ${feat_recog_dir}/data_${bpemode}${nbpe}.json || exit 1;
+ done
+
+ ### make dump files for the Test decode set
+ nameambient=noise
+ namemusic=music
+ name_list="${nameambient} ${namemusic}"
+ for name in ${name_list};do
+ feat_recog_dir=${dumpdir}/audio_org/Test_decode_${name}/delta${do_delta}
+ data2json.sh --feat ${feat_recog_dir}/feats.scp --bpecode ${bpemodel}.model \
+ data/audio/augment/LRS2_decode/Testfbank_aug_${name} ${dict} > ${feat_recog_dir}/data_${bpemode}${nbpe}.json || exit 1;
+ done
+
+ echo "stage 4: Dictionary and Json Data Preparation finished"
+fi
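+# Each generated data_${bpemode}${nbpe}.json follows the usual ESPnet data.json schema,
+# roughly (an illustrative sketch; exact fields may vary):
+#   {"utts": {"<utt_id>": {"input":  [{"name": "input1", "feat": "<ark path>", "shape": [T, 83]}],
+#                          "output": [{"name": "target1", "text": "...", "tokenid": "...", "shape": [L, odim]}]}}}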
+
+
+# Define new paths
+facerecog=Dataset_processing/Facerecog
+videoframe=Dataset_processing/Videodata
+videoaug=Dataset_processing/Videoaug
+videofeature=Dataset_processing/Videofeature
+SNRdir=Dataset_processing/SNRsmat
+SNRptdir=Dataset_processing/SNRs
+
+if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
+ echo "stage 5: Extract reliability measures"
+ if [ ${dataprocessingstage} -le 0 ] && [ ${stop_dataprocessingstage} -ge 0 ]; then
+ # make MFCC dump files
+ mkdir -p ${dumpdir}/mfcc/${train_set}/delta${do_delta}/ || exit 1;
+ cp data/audio/augment/${train_set}mfccs_aug/feats.scp ${dumpdir}/mfcc/${train_set}/delta${do_delta}/ || exit 1;
+ data2json.sh --feat ${dumpdir}/mfcc/${train_set}/delta${do_delta}/feats.scp \
+ --bpecode ${bpemodel}.model data/audio/augment/${train_set}mfccs_aug ${dict} \
+ > ${dumpdir}/mfcc/${train_set}/delta${do_delta}/data_${bpemode}${nbpe}.json || exit 1;
+ for rtask in ${recog_set} Train pretrain; do
+ feat_recog_dir=${dumpdir}/mfcc/${rtask}/delta${do_delta}
+ mkdir -p $feat_recog_dir || exit 1;
+ cp data/audio/augment/${rtask}mfccs_aug/feats.scp ${dumpdir}/mfcc/${rtask}/delta${do_delta}/ || exit 1;
+ data2json.sh --feat ${feat_recog_dir}/feats.scp \
+ --bpecode ${bpemodel}.model data/audio/augment/${rtask}mfccs_aug ${dict} \
+ > ${dumpdir}/mfcc/${rtask}/delta${do_delta}/data_${bpemode}${nbpe}.json || exit 1;
+ done
+
+ nameambient=noise
+ namemusic=music
+ name_list="${nameambient} ${namemusic}"
+ for name in ${name_list};do
+ dset=Test
+ feat_recog_dir=${dumpdir}/mfcc/Test_decode_${name}/delta${do_delta}
+ mkdir -p $feat_recog_dir || exit 1;
+ cp data/audio/augment/LRS2_decode/Testmfccs_aug_${name}/feats.scp ${dumpdir}/mfcc/Test_decode_${name}/delta${do_delta}/ || exit 1;
+ data2json.sh --feat ${feat_recog_dir}/feats.scp \
+ --bpecode ${bpemodel}.model data/audio/augment/LRS2_decode/Testmfccs_aug_${name} ${dict} \
+ > ${dumpdir}/mfcc/Test_decode_${name}/delta${do_delta}/data_${bpemode}${nbpe}.json || exit 1;
+
+ done
+ fi
+
+ if [ ${dataprocessingstage} -le 1 ] && [ ${stop_dataprocessingstage} -ge 1 ]; then
+ # Stage 5.1: Video augmentation with Gaussian blur and salt & pepper noise
+ if [ -d vidaug ]; then
+ echo "vidaug already exists..."
+ else
+ ln -s $VIDAUG_DIR vidaug
+ ln -rsf local/extract_reliability/videoaug.py vidaug/videoaug.py
+ fi
+ python3 vidaug/videoaug.py data/METADATA/Filelist_Test $DATA_DIR $videoaug blur # video augmentation with Gaussian blur
+ python3 vidaug/videoaug.py data/METADATA/Filelist_Test $DATA_DIR $videoaug saltandpepper # video augmentation with salt and pepper noise
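+ # (For reference: videoaug.py is invoked as <filelist> <video_root> <output_dir> <noisetype>, as above.)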
+ unlink ./vidaug
+ fi
+
+ if [ ${dataprocessingstage} -le 2 ] && [ ${stop_dataprocessingstage} -ge 2 ]; then
+ # Stage 5.2: Video stream processing, using OpenFace for face recognition
+ echo "stage 5.2: OpenFace face recognition"
+ mkdir -p $facerecog
+ for part in Test Val Train; do
+ echo "Starting OpenFace background processes for ${part} set!"
+ mkdir -p $facerecog/LRS2${part}
+ local/extract_reliability/Openface.sh $DATA_DIR $facerecog/LRS2${part} $part $OPENFACE_DIR \
+ LRS2 $nj $ifdebug || exit 1;
+ done
+ if [ "$ifpretrain" = true ] ; then
+ part=pretrain
+ echo "Starting OpenFace background processes for ${part} set!"
+ mkdir -p $facerecog/LRS2${part}
+ local/extract_reliability/Openface.sh $DATA_DIR $facerecog/LRS2${part} $part $OPENFACE_DIR \
+ LRS2 $nj $ifdebug || exit 1;
+ fi
+ if [ "$iflrs3pretrain" = true ] ; then
+ part=pretrain
+ echo "Starting OpenFace background processes for LRS3 ${part} set!"
+ mkdir -p $facerecog/LRS3${part}
+ local/extract_reliability/Openface.sh $DATALRS3_DIR $facerecog/LRS3${part} $part $OPENFACE_DIR \
+ LRS3 $nj $ifdebug || exit 1;
+ fi
+ part=Test
+ for noisetype in blur saltandpepper; do
+ echo "Starting OpenFace background processes for ${part} set!"
+ mkdir -p $facerecog/LRS2${part}_$noisetype
+ local/extract_reliability/Openface_vidaug.sh $videoaug $facerecog/LRS2${part}_$noisetype \
+ $part $OPENFACE_DIR LRS2 $noisetype $nj $ifdebug || exit 1;
+ done
+
+ echo "All OpenFace background processes for all sets are done!"
+ fi
+
+ if [ ${dataprocessingstage} -le 3 ] && [ ${stop_dataprocessingstage} -ge 3 ]; then
+ # Stage 5.3: Extract video frames from the MP4 files using the OpenFace results
+ echo "stage 5.3: Extract Frames"
+ mkdir -p $videoframe
+
+ if [ "$ifpretrain" = true ] ; then
+ part=pretrain
+ echo "Extracting frames for ${part} set!"
+ mkdir -p $videoframe/LRS2${part}
+ local/extract_reliability/extractframs.sh $DATA_DIR \
+ $videoframe \
+ $facerecog \
+ data/audio/clean/LRS2 \
+ $part \
+ LRS2 \
+ $ifsegment \
+ $ifmulticore || exit 1;
+ fi
+
+ if [ "$iflrs3pretrain" = true ] ; then
+ part=pretrain
+ echo "Extracting frames for ${part} set!"
+ mkdir -p $videoframe/LRS3${part}
+ local/extract_reliability/extractframs.sh $DATALRS3_DIR \
+ $videoframe \
+ $facerecog \
+ data/audio/clean/LRS3 \
+ $part \
+ LRS3 \
+ $ifsegment \
+ $ifmulticore || exit 1;
+ fi
+
+ for part in Test Val Train; do
+ echo "Extracting frames for ${part} set!"
+ mkdir -p $videoframe/LRS2${part}
+ local/extract_reliability/extractframs.sh $DATA_DIR \
+ $videoframe \
+ $facerecog \
+ data/audio/clean/LRS2 \
+ $part \
+ LRS2 \
+ $ifsegment \
+ $ifmulticore || exit 1;
+ done
+ part=Test
+ for noisetype in blur saltandpepper; do
+ echo "Extracting frames for augumented ${part} set!"
+ mkdir -p $videoframe/LRS2${part}_$noisetype
+ local/extract_reliability/extractframs.sh $videoaug \
+ $videoframe \
+ $facerecog \
+ data/audio/clean/LRS2 \
+ $part \
+ LRS2 \
+ $ifsegment \
+ $ifmulticore \
+ $noisetype || exit 1;
+ done
+ echo "Extract Frames finished"
+ fi
+
+ if [ ${dataprocessingstage} -le 4 ] && [ ${stop_dataprocessingstage} -ge 4 ]; then
+ # Stage 5.4: Use DeepXi to estimate SNR
+ echo "stage 5.4: Estimate SNRs using DeepXi framework"
+ if [ -d DeepXi ]; then
+ echo "Deepxi already exist..."
+ else
+ ln -s $DEEPXI_DIR DeepXi
+ fi
+ rm -rf DeepXi/set/test_noisy_speech
+ rm -rf DeepXi/deepxi/se_batch.py
+ cp local/se_batch.py DeepXi/deepxi
+ if [ "$ifpretrain" = false ] && [ "$iflrs3pretrain" = false ] ; then
+ for part in Test Val Train; do
+ echo "Extract SNR for ${part} set!"
+ mkdir -p $SNRdir/$part
+ mkdir -p $SNRptdir/$part
+ local/extract_reliability/extractsnr.sh $SNRdir $SNRptdir $mp3files $part $ifmulticore || exit 1;
+ done
+ else
+ for part in Train pretrain Test Val; do
+ echo "Extract SNR for ${part} set!"
+ mkdir -p $SNRdir/$part
+ mkdir -p $SNRptdir/$part
+ local/extract_reliability/extractsnr.sh $SNRdir $SNRptdir $mp3files $part $ifmulticore || exit 1;
+ done
+ fi
+ nameambient=noise
+ namemusic=music
+ name_list="${nameambient} ${namemusic}"
+ for name in ${name_list};do
+ dset=Test
+ mkdir -p $SNRdir/${dset}_${name}
+ mkdir -p $SNRptdir/${dset}_${name} || exit 1;
+ local/extract_reliability/extractsnr.sh $SNRdir $SNRptdir $mp3files ${dset}_${name} $ifmulticore || exit 1;
+ done
+
+ # Clean up: unlink DeepXi and remove the intermediate SNR .mat directory
+ unlink ./DeepXi
+ rm -rf $SNRdir
+ fi
+
+ if [ ${dataprocessingstage} -le 5 ] && [ ${stop_dataprocessingstage} -ge 5 ]; then
+ # Extract video features from the video frames, if necessary
+ echo "stage 5.5: Extract video features"
+ mkdir -p $videofeature
+ for part in Test Val; do
+ echo "Extract video features for ${part} set!"
+ mkdir -p $videofeature/LRS2${part}
+ local/extract_reliability/extractfeatures.sh $videoframe/LRS2${part}/Pics \
+ $videofeature/LRS2${part} \
+ $PRETRAINEDMODEL \
+ $part \
+ $ifcuda \
+ $ifdebug || exit 1;
+ done
+
+ if [ "$ifpretrain" = true ] ; then
+ part=pretrain
+ echo "Extract video features for ${part} set!"
+ mkdir -p $videofeature/LRS2${part}
+ local/extract_reliability/extractfeatures.sh $videoframe/LRS2${part}/Pics \
+ $videofeature/LRS2${part} \
+ $PRETRAINEDMODEL \
+ $part \
+ $ifcuda \
+ $ifdebug || exit 1;
+ fi
+
+ if [ "$iflrs3pretrain" = true ] ; then
+ part=pretrain
+ echo "Extract video features for ${part} set!"
+ mkdir -p $videofeature/LRS3${part}
+ local/extract_reliability/extractfeatures.sh $videoframe/LRS3${part}/Pics \
+ $videofeature/LRS3${part} \
+ $PRETRAINEDMODEL \
+ $part \
+ $ifcuda \
+ $ifdebug || exit 1;
+ fi
+ part=Test
+ for noisetype in blur saltandpepper; do
+ echo "Extract video features for augmented ${part} set!"
+ mkdir -p $videofeature/LRS2${part}_$noisetype
+ local/extract_reliability/extractfeatures.sh $videoframe/LRS2${part}_$noisetype/Pics \
+ $videofeature/LRS2${part}_$noisetype \
+ $PRETRAINEDMODEL \
+ $part \
+ $ifcuda \
+ $ifdebug || exit 1;
+ done
+ fi
+
+ if [ ${dataprocessingstage} -le 6 ] && [ ${stop_dataprocessingstage} -ge 6 ]; then
+ # Make video ark files
+ echo "stage 5.6: Make video ark files"
+
+ rm -rf data/video
+ python3 local/extract_reliability/tensor2ark.py $videofeature data/video $nj
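+ # tensor2ark.py (invoked above) is assumed to write Kaldi ark/scp shards per dataset
+ # directory, split across $nj jobs as feats_<n>.scp; the shards are merged below.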
+ for part in Test Val; do
+ echo "Make video dump files for LRS2 ${part} set!"
+ cat data/video/LRS2${part}/feats_*.scp > data/video/LRS2${part}/feats.scp || exit 1;
+ sort data/video/LRS2${part}/feats.scp -o data/video/LRS2${part}/feats.scp
+ mkdir -p ${dumpdir}/video/${part} || exit 1;
+ for files in text wav.scp utt2spk; do
+ cp data/audio/clean/LRS2/${part}/${files} data/video/LRS2${part} || exit 1;
+ done
+ utils/fix_data_dir.sh data/video/LRS2${part} || exit 1;
+ cp data/video/LRS2${part}/feats.scp ${dumpdir}/video/${part} || exit 1;
+ data2json.sh --feat ${dumpdir}/video/${part}/feats.scp --bpecode ${bpemodel}.model \
+ data/video/LRS2${part} ${dict} > ${dumpdir}/video/${part}/data_${bpemode}${nbpe}.json || exit 1;
+ done
+
+ if [[ "$ifpretrain" = true || "$iflrs3pretrain" = true ]] ; then
+ part=pretrain
+ if [[ "$ifpretrain" = true && "$iflrs3pretrain" = false ]] || [[ "$ifpretrain" = false && "$iflrs3pretrain" = true ]]; then
+ if [[ "$ifpretrain" = true && "$iflrs3pretrain" = false ]] ; then
+ dataset=LRS2
+ elif [[ "$ifpretrain" = false && "$iflrs3pretrain" = true ]] ; then
+ dataset=LRS3
+ fi
+ echo "Make video dump files for ${dataset} ${part} set!"
+ mkdir -p data/video/${part}
+ # feats_*.scp shards are written per dataset directory (LRS2/LRS3) by tensor2ark.py above
+ cat data/video/${dataset}${part}/feats_*.scp > data/video/${part}/feats.scp || exit 1;
+ sort data/video/${part}/feats.scp -o data/video/${part}/feats.scp
+ mkdir -p ${dumpdir}/video/${part} || exit 1;
+ for files in text wav.scp utt2spk; do
+ cp data/audio/clean/${dataset}/${part}/${files} data/video/${part} || exit 1;
+ done
+ utils/fix_data_dir.sh data/video/${part} || exit 1;
+ cp data/video/${part}/feats.scp ${dumpdir}/video/${part} || exit 1;
+ elif [[ "$ifpretrain" = true && "$iflrs3pretrain" = true ]] ; then
+ echo "Make video dump files for LRS2 and LRS3 ${part} set!"
+ cat data/video/LRS2${part}/feats_*.scp > data/video/LRS2${part}/feats.scp || exit 1;
+ cat data/video/LRS3${part}/feats_*.scp > data/video/LRS3${part}/feats.scp || exit 1;
+ mkdir -p data/video/${part}
+ mkdir -p ${dumpdir}/video/${part} || exit 1;
+ for files in text wav.scp utt2spk; do
+ cat data/audio/clean/LRS2/${part}/${files} data/audio/clean/LRS3/${part}/${files} > data/video/${part}/${files} || exit 1;
+ sort data/video/${part}/${files} -o data/video/${part}/${files}
+ done
+ utils/fix_data_dir.sh data/video/${part} || exit 1;
+ cat data/video/LRS2${part}/feats.scp data/video/LRS3${part}/feats.scp > ${dumpdir}/video/${part}/feats.scp || exit 1;
+ sort ${dumpdir}/video/${part}/feats.scp -o ${dumpdir}/video/${part}/feats.scp
+ fi
+
+ data2json.sh --feat ${dumpdir}/video/${part}/feats.scp --bpecode ${bpemodel}.model \
+ data/video/${part} ${dict} > ${dumpdir}/video/${part}/data_${bpemode}${nbpe}.json || exit 1;
+
+ fi
+
+ part=Test
+ for noisetype in blur saltandpepper; do
+ echo "Make video dump files for augmented ${part} set!"
+ cat data/video/LRS2${part}_${noisetype}/feats_*.scp > data/video/LRS2${part}_${noisetype}/feats.scp || exit 1;
+ sort data/video/LRS2${part}_${noisetype}/feats.scp -o data/video/LRS2${part}_${noisetype}/feats.scp
+ mkdir -p ${dumpdir}/video/${part}_decode_${noisetype} || exit 1;
+ for files in text wav.scp utt2spk; do
+ cp data/audio/clean/LRS2/${part}/${files} data/video/LRS2${part}_${noisetype} || exit 1;
+ done
+ utils/fix_data_dir.sh data/video/LRS2${part}_${noisetype} || exit 1;
+ cp data/video/LRS2${part}_${noisetype}/feats.scp ${dumpdir}/video/${part}_decode_${noisetype} || exit 1;
+ data2json.sh --feat ${dumpdir}/video/${part}_decode_${noisetype}/feats.scp --bpecode ${bpemodel}.model \
+ data/video/LRS2${part}_${noisetype} ${dict} > ${dumpdir}/video/${part}_decode_${noisetype}/data_${bpemode}${nbpe}.json \
+ || exit 1;
+ done
+ fi
+
+ if [ ${dataprocessingstage} -le 7 ] && [ ${stop_dataprocessingstage} -ge 7 ]; then
+ # Remake dump files
+ echo "stage 5.7: Remake audio and video dump files"
+
+ for dset in pretrain_Train Val Test Test_decode_music Test_decode_noise; do
+ rm -rf dump/audio/$dset
+ python3 local/dump/audiodump.py dump/audio dump/audio_org $dset $ifmulticore || exit 1;
+ done
+
+ for dset in pretrain Val Test; do
+ rm -rf dump/avpretrain/$dset
+ python3 local/dump/avpretraindump.py dump/avpretrain dump/audio_org dump/video \
+ $SNRptdir $videoframe dump/mfcc \
+ $dset $ifmulticore || exit 1;
+ done
+
+ for dset in Train Val Test; do
+ rm -rf dump/avtrain/$dset
+ python3 local/dump/avtraindump.py dump/avtrain dump/audio_org $videofeature \
+ $SNRptdir $videoframe dump/mfcc \
+ $dset $ifmulticore || exit 1;
+ done
+
+ # Create video dump files
+ for dset in pretrain Val Test; do
+ rm -rf dump/videopretrain/$dset
+ python3 local/dump/videodump.py dump/avpretrain dump/videopretrain $dset || exit 1;
+ done
+
+ for dset in Train Val Test; do
+ rm -rf dump/videotrain/$dset
+ python3 local/dump/videodump.py dump/avtrain dump/videotrain $dset || exit 1;
+ done
+
+ dset=Test
+ rm -rf dump/avpretraindecode
+ rm -rf dump/avtraindecode
+ for noisecombination in 'noise_None' 'music_None' 'noise_blur' 'noise_saltandpepper'; do
+ python3 local/dump/avpretraindecodedump.py dump/avpretraindecode dump/audio_org dump/video \
+ $SNRptdir $videoframe dump/mfcc \
+ $dset $noisecombination $ifmulticore || exit 1;
+ python3 local/dump/avtraindecodedump.py dump/avtraindecode dump/audio_org dump/video \
+ $videofeature $SNRptdir $videoframe dump/mfcc \
+ $dset $noisecombination $ifmulticore || exit 1;
+ done
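+ # At this point the remade dump trees produced above are:
+ # dump/audio, dump/avpretrain, dump/avtrain, dump/videopretrain,
+ # dump/videotrain, dump/avpretraindecode, and dump/avtraindecode.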
+
+ fi
+
+ if [ ${dataprocessingstage} -le 8 ] && [ ${stop_dataprocessingstage} -ge 8 ]; then
+ echo "stage 5.8: Split Test decode dump files"
+ for audionoise in noise music; do
+ python3 local/extract_reliability/extractsnr.py data/audio/augment/LRS2_decode $audionoise $ifmulticore || exit 1;
+ done
+ for noisecombination in 'noise_None' 'music_None' 'noise_blur' 'noise_saltandpepper'; do
+ python3 local/extract_reliability/splitsnr.py dump/avpretraindecode $noisecombination data/audio/augment/LRS2_decode || exit 1;
+ python3 local/extract_reliability/splitsnr.py dump/avtraindecode $noisecombination data/audio/augment/LRS2_decode || exit 1;
+ done
+ fi
+
+ echo "stage 5: Reliability measures generation finished"
+fi
+
+# Training the LM takes a few days. If you just want end-to-end ASR without an LM,
+# you can skip this stage and remove the --rnnlm option during recognition (stage 7).
+# Otherwise, the pretrained Librispeech LM can be used (train_lm=false).
+
+if [ "$train_lm" = false ] ; then
+ lmexpname=train_rnnlm_pytorch_lm_unigram500
+ lmexpdir=exp/${lmexpname}
+else
+ if [ -z ${lmtag} ]; then
+ lmtag=$(basename ${lm_config%.*})
+ fi
+ lmexpname=train_rnnlm_${backend}_${lmtag}_${bpemode}${nbpe}_ngpu${ngpu}
+ lmexpdir=exp/${lmexpname}
+ mkdir -p ${lmexpdir}
+fi
+
+# Stage 6: Language Model (LM) preparation
+if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
+ if [ "$train_lm" = false ] ; then
+ echo "stage 6: Use pretrained LM"
+ else
+ echo "stage 6: LM Preparation"
+ lmdatadir=data/local/lm_train_${bpemode}${nbpe}
+ # use external data
+ if [ ! -e data/local/lm_train/librispeech-lm-norm.txt.gz ]; then
+ echo "Download Librispeech normnalized language model (LM) training text"
+ wget http://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz -P data/local/lm_train/
+ echo "Download finished"
+ fi
+
+ if [ ! -e ${lmdatadir} ]; then
+ echo "Prepare LM data"
+ mkdir -p ${lmdatadir}
+ # build gzip archive for language data out of the utterances in the LRS dataset
+ cut -f 2- -d" " data/${train_set}/text | gzip -c > data/local/lm_train/${train_set}_text.gz
+ # combine the external text and the LRS transcriptions
+ zcat data/local/lm_train/librispeech-lm-norm.txt.gz data/local/lm_train/${train_set}_text.gz |\
+ spm_encode \
+ --model=${bpemodel}.model \
+ --output_format=piece \
+ > ${lmdatadir}/train.txt
+ cut -f 2- -d" " data/audio/augment/${train_dev}fbank_aug/text | \
+ spm_encode \
+ --model=${bpemodel}.model \
+ --output_format=piece \
+ > ${lmdatadir}/valid.txt
+ echo "Preparation step done"
+ fi
+ echo "Start training Language Model"
+ ${cuda_cmd} --gpu ${ngpu} ${lmexpdir}/train.log \
+ lm_train.py \
+ --config ${lm_config} \
+ --ngpu ${ngpu} \
+ --backend ${backend} \
+ --verbose 1 \
+ --outdir ${lmexpdir} \
+ --tensorboard-dir tensorboard/${lmexpname} \
+ --train-label ${lmdatadir}/train.txt \
+ --valid-label ${lmdatadir}/valid.txt \
+ --resume ${lm_resume} \
+ --dict ${dict} \
+ --dump-hdf5-path ${lmdatadir}
+ echo "stage 6: LM Preparation finished"
+ fi
+fi
+
+if [ -z ${tag} ]; then
+ expname=${train_set}_${backend} #_$(basename ${train_config%.*})
+ if ${do_delta}; then
+ expname=${expname}_delta
+ fi
+ if [ -n "${preprocess_config}" ]; then
+ expname=${expname}_$(basename ${preprocess_config%.*})
+ fi
+else
+ expname=${train_set}_${backend}_${tag}
+fi
+
+# ToDo: Hand over parameters for subscripts
+if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
+ echo "Stage 7: Network Training"
+ # train audio model
+ expdirapretrain=exp/pretrain/A
+ mkdir -p ${expdirapretrain}
+ echo ${expdirapretrain}
+ noisetype=noise # noise type used for decoding; possible values: noise, music, blur, saltandpepper
+ local/training/train_audio.sh --backend $backend \
+ --ngpu $ngpu \
+ --debugmode $debugmode \
+ --N $N \
+ --verbose $verbose \
+ --nbpe $nbpe \
+ --bpemode $bpemode \
+ --nj $nj \
+ --do_delta $do_delta \
+ --train_set $train_set \
+ --train_dev $train_dev \
+ --preprocess_config $preprocess_config \
+ --train_config $train_config \
+ --lm_config $lm_config \
+ --decode_config $decode_config \
+ $expdirapretrain dump/audio dump/avpretraindecode $lmexpdir $noisetype $dict $bpemodel || exit 1;
+
+ # pretrain video model
+ expdirvpretrain=exp/pretrain/V
+ mkdir -p ${expdirvpretrain}
+ echo ${expdirvpretrain}
+ noisetype=blur # noise type used for decoding; possible values: noise, music, blur, saltandpepper
+ local/training/pretrain_video.sh --backend $backend \
+ --ngpu $ngpu \
+ --debugmode $debugmode \
+ --N $N \
+ --verbose $verbose \
+ --nbpe $nbpe \
+ --bpemode $bpemode \
+ --nj $nj \
+ --do_delta $do_delta \
+ --preprocess_config $preprocess_config \
+ --train_config $train_config \
+ --lm_config $lm_config \
+ --decode_config $decode_config \
+ $expdirvpretrain dump/videopretrain dump/avpretraindecode $lmexpdir $noisetype $dict $bpemodel || exit 1;
+
+ # finetune video model
+ expdirvfine=exp/fine/V
+ mkdir -p ${expdirvfine}
+ echo ${expdirvfine}
+ noisetype=blur # noise type used for decoding; possible values: noise, music, blur, saltandpepper
+ local/training/finetune_video.sh --backend $backend \
+ --ngpu $ngpu \
+ --debugmode $debugmode \
+ --N $N \
+ --verbose $verbose \
+ --nbpe $nbpe \
+ --bpemode $bpemode \
+ --nj $nj \
+ --do_delta $do_delta \
+ --preprocess_config $preprocess_config \
+ --train_config $train_config \
+ --lm_config $lm_config \
+ --decode_config $decode_config \
+ $expdirvfine $expdirvpretrain dump/videotrain dump/avtraindecode $PRETRAINEDMODEL $lmexpdir $noisetype $dict $bpemodel || exit 1;
+
+ # pretrain audio-visual model
+ expdiravpretrain=exp/pretrain/AV
+ mkdir -p ${expdiravpretrain}
+ echo ${expdiravpretrain}
+ noisetype=noise # noise type used for decoding; possible values: noise, music, blur, saltandpepper
+ local/training/pretrain_av.sh --backend $backend \
+ --ngpu $ngpu \
+ --debugmode $debugmode \
+ --N $N \
+ --verbose $verbose \
+ --nbpe $nbpe \
+ --bpemode $bpemode \
+ --nj $nj \
+ --do_delta $do_delta \
+ --preprocess_config $preprocess_config \
+ --train_config $train_config \
+ --lm_config $lm_config \
+ --decode_config $decode_config \
+ $expdiravpretrain dump/avpretrain dump/avpretraindecode $lmexpdir \
+ $noisetype $dict $bpemodel $expdirapretrain $expdirvpretrain || exit 1;
+
+ # finetune audio-visual model (final network used for decoding)
+ expdiravfine=exp/fine/AV
+ mkdir -p ${expdiravfine}
+ echo ${expdiravfine}
+ noisetype=noise # noise type used for decoding; possible values: noise, music, blur, saltandpepper
+ local/training/finetune_av.sh --backend $backend \
+ --ngpu $ngpu \
+ --debugmode $debugmode \
+ --N $N \
+ --verbose $verbose \
+ --nbpe $nbpe \
+ --bpemode $bpemode \
+ --nj $nj \
+ --do_delta $do_delta \
+ --preprocess_config $preprocess_config \
+ --train_config $train_config \
+ --lm_config $lm_config \
+ --decode_config $decode_config \
+ $expdiravfine dump/avtrain dump/avtraindecode $PRETRAINEDMODEL $lmexpdir \
+ $noisetype $dict $bpemodel $expdiravpretrain || exit 1;
+
+fi
+
+exit 0
diff --git a/egs/lrs/asr1/steps b/egs/lrs/avsr1/steps
similarity index 100%
rename from egs/lrs/asr1/steps
rename to egs/lrs/avsr1/steps
diff --git a/egs/lrs/asr1/utils b/egs/lrs/avsr1/utils
similarity index 100%
rename from egs/lrs/asr1/utils
rename to egs/lrs/avsr1/utils
diff --git a/egs/lrs/asr1/RESULTS.md b/egs/lrs2/asr1/RESULTS.md
similarity index 100%
rename from egs/lrs/asr1/RESULTS.md
rename to egs/lrs2/asr1/RESULTS.md
diff --git a/egs/lrs/asr1/cmd.sh b/egs/lrs2/asr1/cmd.sh
similarity index 100%
rename from egs/lrs/asr1/cmd.sh
rename to egs/lrs2/asr1/cmd.sh
diff --git a/egs/lrs2/asr1/conf/decode.yaml b/egs/lrs2/asr1/conf/decode.yaml
new file mode 100644
index 00000000000..98b36d1752e
--- /dev/null
+++ b/egs/lrs2/asr1/conf/decode.yaml
@@ -0,0 +1,7 @@
+batchsize: 0
+beam-size: 60
+ctc-weight: 0.4
+lm-weight: 0.6
+maxlenratio: 0.0
+minlenratio: 0.0
+penalty: 0.0
diff --git a/egs/lrs2/asr1/conf/fbank.conf b/egs/lrs2/asr1/conf/fbank.conf
new file mode 100644
index 00000000000..82ac7bd0dbc
--- /dev/null
+++ b/egs/lrs2/asr1/conf/fbank.conf
@@ -0,0 +1,2 @@
+--sample-frequency=16000
+--num-mel-bins=80
diff --git a/egs/lrs2/asr1/conf/gpu.conf b/egs/lrs2/asr1/conf/gpu.conf
new file mode 100644
index 00000000000..6d0a75b067a
--- /dev/null
+++ b/egs/lrs2/asr1/conf/gpu.conf
@@ -0,0 +1,10 @@
+# Default configuration
+command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*
+option mem=* -l mem_free=$0,ram_free=$0
+option mem=0 # Do not add anything to qsub_opts
+option num_threads=* -pe smp $0
+option num_threads=1 # Do not add anything to qsub_opts
+option max_jobs_run=* -tc $0
+default gpu=0
+option gpu=0
+option gpu=* -l 'hostname=b1[12345678]*|c*,gpu=$0' -q g.q
\ No newline at end of file
diff --git a/egs/lrs/asr1/conf/lm.yaml b/egs/lrs2/asr1/conf/lm.yaml
similarity index 100%
rename from egs/lrs/asr1/conf/lm.yaml
rename to egs/lrs2/asr1/conf/lm.yaml
diff --git a/egs/lrs2/asr1/conf/pitch.conf b/egs/lrs2/asr1/conf/pitch.conf
new file mode 100644
index 00000000000..e959a19d5b8
--- /dev/null
+++ b/egs/lrs2/asr1/conf/pitch.conf
@@ -0,0 +1 @@
+--sample-frequency=16000
diff --git a/egs/lrs2/asr1/conf/queue.conf b/egs/lrs2/asr1/conf/queue.conf
new file mode 100644
index 00000000000..257d7b7b3aa
--- /dev/null
+++ b/egs/lrs2/asr1/conf/queue.conf
@@ -0,0 +1,10 @@
+# Default configuration
+command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*
+option mem=* -l mem_free=$0,ram_free=$0
+option mem=0 # Do not add anything to qsub_opts
+option num_threads=* -pe smp $0
+option num_threads=1 # Do not add anything to qsub_opts
+option max_jobs_run=* -tc $0
+default gpu=0
+option gpu=0
+option gpu=* -l gpu=$0 -q g.q
diff --git a/egs/lrs/asr1/conf/slurm.conf b/egs/lrs2/asr1/conf/slurm.conf
similarity index 100%
rename from egs/lrs/asr1/conf/slurm.conf
rename to egs/lrs2/asr1/conf/slurm.conf
diff --git a/egs/lrs2/asr1/conf/specaug.yaml b/egs/lrs2/asr1/conf/specaug.yaml
new file mode 100644
index 00000000000..c0643d38597
--- /dev/null
+++ b/egs/lrs2/asr1/conf/specaug.yaml
@@ -0,0 +1,16 @@
+process:
+ # these three processes are a.k.a. SpecAugment
+ - type: "time_warp"
+ max_time_warp: 5
+ inplace: true
+ mode: "PIL"
+ - type: "freq_mask"
+ F: 30
+ n_mask: 2
+ inplace: true
+ replace_with_zero: false
+ - type: "time_mask"
+ T: 40
+ n_mask: 2
+ inplace: true
+ replace_with_zero: false
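+# (For reference: time_warp, freq_mask, and time_mask are the three deformations of
+# SpecAugment; F and T bound the mask widths and n_mask sets how many masks are applied.)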
diff --git a/egs/lrs/asr1/conf/train.yaml b/egs/lrs2/asr1/conf/train.yaml
similarity index 100%
rename from egs/lrs/asr1/conf/train.yaml
rename to egs/lrs2/asr1/conf/train.yaml
diff --git a/egs/lrs/asr1/local/README.md b/egs/lrs2/asr1/local/README.md
similarity index 100%
rename from egs/lrs/asr1/local/README.md
rename to egs/lrs2/asr1/local/README.md
diff --git a/egs/lrs/asr1/local/data_preparation.sh b/egs/lrs2/asr1/local/data_preparation.sh
similarity index 100%
rename from egs/lrs/asr1/local/data_preparation.sh
rename to egs/lrs2/asr1/local/data_preparation.sh
diff --git a/egs/lrs/asr1/local/make_files.py b/egs/lrs2/asr1/local/make_files.py
similarity index 100%
rename from egs/lrs/asr1/local/make_files.py
rename to egs/lrs2/asr1/local/make_files.py
diff --git a/egs/lrs/asr1/local/pretrain.py b/egs/lrs2/asr1/local/pretrain.py
similarity index 100%
rename from egs/lrs/asr1/local/pretrain.py
rename to egs/lrs2/asr1/local/pretrain.py
diff --git a/egs/lrs/asr1/path.sh b/egs/lrs2/asr1/path.sh
similarity index 100%
rename from egs/lrs/asr1/path.sh
rename to egs/lrs2/asr1/path.sh
diff --git a/egs/lrs/asr1/run.sh b/egs/lrs2/asr1/run.sh
similarity index 98%
rename from egs/lrs/asr1/run.sh
rename to egs/lrs2/asr1/run.sh
index 06d0773f17e..e4958dd9359 100644
--- a/egs/lrs/asr1/run.sh
+++ b/egs/lrs2/asr1/run.sh
@@ -167,14 +167,14 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
wc -l ${dict}
else
- gdrive_download '1ZXXCXSbbFS2PDlrs9kbJL9pE6-5nPPxi' 'model.v1.tar.gz'
+ gdrive_download '1ITgdZoa8vQ7lDwi1jLziYGXOyUtgE2ow' 'model.v1.tar.gz'
tar -xf model.v1.tar.gz
- mv avsrlrs2_3/exp/train_rnnlm_pytorch_lm_unigram500 exp/pretrainedlm
- mv avsrlrs2_3/data/lang_char data/
+ mv model.v1/avsrlrs2_3/exp/train_rnnlm_pytorch_lm_unigram500 exp/pretrainedlm
+ mv model.v1/avsrlrs2_3/data/lang_char data/
mv data/lang_char/train_unigram500.model data/lang_char/${train_set}_unigram500.model
mv data/lang_char/train_unigram500.vocab data/lang_char/${train_set}_unigram500.vocab
mv data/lang_char/train_unigram500_units.txt data/lang_char/${train_set}_unigram500_units.txt
- rm -rf avsrlrs2_3
+ rm -rf model.v1
rm -rf model.v1.tar.gz
##### it is depands on your corpus, if the corpus text transcription is uppercase, use this to convert to lowercase
diff --git a/egs/lrs2/asr1/steps b/egs/lrs2/asr1/steps
new file mode 120000
index 00000000000..91f2d234e20
--- /dev/null
+++ b/egs/lrs2/asr1/steps
@@ -0,0 +1 @@
+../../../tools/kaldi/egs/wsj/s5/steps
\ No newline at end of file
diff --git a/egs/lrs2/asr1/utils b/egs/lrs2/asr1/utils
new file mode 120000
index 00000000000..f49247da827
--- /dev/null
+++ b/egs/lrs2/asr1/utils
@@ -0,0 +1 @@
+../../../tools/kaldi/egs/wsj/s5/utils
\ No newline at end of file
diff --git a/egs2/README.md b/egs2/README.md
index 8da8f300214..c3dfd2d1478 100755
--- a/egs2/README.md
+++ b/egs2/README.md
@@ -19,6 +19,7 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
| chime4 | The 4th CHiME Speech Separation and Recognition Challenge | ASR/Multichannel ASR | ENG | http://spandh.dcs.shef.ac.uk/chime_challenge/chime2016/ | |
| cmu_indic | CMU INDIC | TTS | 7 languages | http://festvox.org/cmu_indic/ | |
| commonvoice | The Mozilla Common Voice | ASR | 13 languages | https://voice.mozilla.org/datasets | |
+| conferencingspeech21 | Far-field Multi-channel Speech Enhancement Challenge for Video Conferencing (ConferencingSpeech 2021) | SE | ENG, CMN | https://tea-lab.qq.com/conferencingspeech-2021 | |
| csj | Corpus of Spontaneous Japanese | ASR | JPN | https://pj.ninjal.ac.jp/corpus_center/csj/en/ | |
| csmsc | Chinese Standard Mandarin Speech Copus | TTS | CMN | https://www.data-baker.com/open_source.html | |
| css10 | CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages | TTS | 10 langauges | https://github.com/Kyubyong/css10 | |
@@ -56,6 +57,7 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
| lrs2 | The Oxford-BBC Lip Reading Sentences 2 (LRS2) Dataset | Lipreading/ASR | ENG | https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html | |
| mini_an4 | Mini version of CMU AN4 database for the integration test | ASR/TTS/SE | ENG | http://www.speech.cs.cmu.edu/databases/an4/ | |
| mini_librispeech | Mini version of Librispeech corpus | DIAR | ENG | https://openslr.org/31/ | |
+| ml_openslr63 | Crowdsourced high-quality Malayalam multi-speaker speech data | ASR | MAL | https://openslr.org/63/ | |
| mls | MLS (A large multilingual corpus derived from LibriVox audiobooks) | ASR | 8 languages | http://www.openslr.org/94/ | |
| mr_openslr64 | OpenSLR Marathi Corpus | ASR | MAR | http://www.openslr.org/64/ | |
| ms_indic_is18 | Microsoft Speech Corpus (Indian languages) | ASR | 3 langs: TEL TAM GUJ | https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e | |
@@ -69,7 +71,7 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
| ruslan | RUSLAN: Russian Spoken Language Corpus For Speech Synthesis | TTS | RUS | https://ruslan-corpus.github.io/ | |
| snips | SNIPS: A dataset for spoken language understanding | SLU | ENG | https://github.com/sonos/spoken-language-understanding-research-datasets | |
| seame | SEAME: a Mandarin-English Code-switching Speech Corpus in South-East Asia | ASR | ENG + CMN | https://catalog.ldc.upenn.edu/LDC2015S04 | |
-| siwis | SIWIS: Spoken Interaction with Interpretation in Switzerland | TTS | FRA | https://https://datashare.ed.ac.uk/handle/10283/2353 | |
+| siwis | SIWIS: Spoken Interaction with Interpretation in Switzerland | TTS | FRA | https://datashare.ed.ac.uk/handle/10283/2353 | |
| slue-voxceleb | SLUE: Spoken Language Understanding Evaluation | SLU | ENG | https://github.com/asappresearch/slue-toolkit | |
| slurp | SLURP: A Spoken Language Understanding Resource Package | SLU | ENG | https://github.com/pswietojanski/slurp | |
| slurp_entity | SLURP: A Spoken Language Understanding Resource Package | SLU/Entity Classifi. | ENG | https://github.com/pswietojanski/slurp | |
diff --git a/egs2/TEMPLATE/asr1/db.sh b/egs2/TEMPLATE/asr1/db.sh
index 31008b9502c..4e9b9d01ac4 100755
--- a/egs2/TEMPLATE/asr1/db.sh
+++ b/egs2/TEMPLATE/asr1/db.sh
@@ -7,6 +7,7 @@ AISHELL3=downloads
AISHELL4=downloads
ALFFA=downloads
AN4=downloads
+AUDIOSET=
DIRHA_ENGLISH_PHDEV=
DIRHA_WSJ=
DIRHA_WSJ_PROCESSED="${PWD}/data/local/dirha_wsj_processed" # Output file path
@@ -47,6 +48,7 @@ MISP2021=
LIBRIMIX=downloads
LIBRITTS=
LJSPEECH=downloads
+MUSAN=
NSC=
JMD=downloads
JSSS=downloads
@@ -128,6 +130,7 @@ PRIMEWORDS_CHINESE=downloads
SEAME=
BENGALI=downloads
IWSLT14=
+MALAYALAM=downloads
ST_CMDS=downloads
MS_INDIC_IS18=
MARATHI=downloads
@@ -273,4 +276,6 @@ if [[ "$(hostname -d)" == clsp.jhu.edu ]]; then
IWSLT21LR=downloads/iwslt21
TOTONAC=downloads
GOOGLEI18N=downloads
+ MALAYALAM=
+
fi
diff --git a/egs2/TEMPLATE/asr1/pyscripts/utils/extract_xvectors.py b/egs2/TEMPLATE/asr1/pyscripts/utils/extract_xvectors.py
index d7e7804a13f..e64b82dc515 100755
--- a/egs2/TEMPLATE/asr1/pyscripts/utils/extract_xvectors.py
+++ b/egs2/TEMPLATE/asr1/pyscripts/utils/extract_xvectors.py
@@ -105,7 +105,7 @@ def main(argv):
xvectors.append(embeds)
# Speaker Normalization
- xvectors = np.mean(np.concatenate(xvectors, 0), 0)
+ embeds = np.mean(np.stack(xvectors, 0), 0)
writer_spk[speaker] = embeds
writer_utt.close()
writer_spk.close()
diff --git a/egs2/TEMPLATE/asr1/scripts/utils/show_translation_result.sh b/egs2/TEMPLATE/asr1/scripts/utils/show_translation_result.sh
new file mode 100755
index 00000000000..c1c1bdf0882
--- /dev/null
+++ b/egs2/TEMPLATE/asr1/scripts/utils/show_translation_result.sh
@@ -0,0 +1,67 @@
+#!/usr/bin/env bash
+mindepth=0
+maxdepth=3
+case=tc
+
+. utils/parse_options.sh
+
+if [ $# -gt 1 ]; then
+ echo "Usage: $0 --mindepth 0 --maxdepth 1 [exp]" 1>&2
+ echo ""
+ echo "Show the system environments and the evaluation results in Markdown format."
+ echo 'The default of <exp> is "exp/".'
+ exit 1
+fi
+
+[ -f ./path.sh ] && . ./path.sh
+set -euo pipefail
+if [ $# -eq 1 ]; then
+ exp=$1
+else
+ exp=exp
+fi
+
+
+cat << EOF
+
+# RESULTS
+## Environments
+- date: \`$(LC_ALL=C date)\`
+EOF
+
+python3 << EOF
+import sys, espnet, torch
+pyversion = sys.version.replace('\n', ' ')
+
+print(f"""- python version: \`{pyversion}\`
+- espnet version: \`espnet {espnet.__version__}\`
+- pytorch version: \`pytorch {torch.__version__}\`""")
+EOF
+
+cat << EOF
+- Git hash: \`$(git rev-parse HEAD)\`
+ - Commit date: \`$(git log -1 --format='%cd')\`
+
+EOF
+
+metrics="bleu"
+
+while IFS= read -r expdir; do
+ if ls "${expdir}"/*/*/score_*/result.${case}.txt &> /dev/null; then
+ echo "## $(basename ${expdir})"
+ for type in $metrics; do
+ cat << EOF
+### ${type^^}
+
+|dataset|bleu_score|verbose_score|
+|---|---|---|
+EOF
+ data=$(echo "${expdir}"/*/*/score_*/result.${case}.txt | cut -d '/' -f4)
+ bleu=$(sed -n '5p' "${expdir}"/*/*/score_*/result.${case}.txt | cut -d ' ' -f 3 | tr -d ',')
+ verbose=$(sed -n '7p' "${expdir}"/*/*/score_*/result.${case}.txt | cut -d ' ' -f 3- | tr -d '",')
+ echo "${data}|${bleu}|${verbose}"
+
+ done
+ fi
+
+done < <(find ${exp} -mindepth ${mindepth} -maxdepth ${maxdepth} -type d)
diff --git a/egs2/TEMPLATE/diar1/scripts/utils/create_README_file.py b/egs2/TEMPLATE/diar1/scripts/utils/create_README_file.py
index a18aed64ab6..0fe3405603d 120000
--- a/egs2/TEMPLATE/diar1/scripts/utils/create_README_file.py
+++ b/egs2/TEMPLATE/diar1/scripts/utils/create_README_file.py
@@ -1 +1 @@
-egs2/TEMPLATE/asr1/scripts/utils/create_README_file.py
\ No newline at end of file
+../../../asr1/scripts/utils/create_README_file.py
\ No newline at end of file
diff --git a/egs2/TEMPLATE/diar1/scripts/utils/get_model_names.py b/egs2/TEMPLATE/diar1/scripts/utils/get_model_names.py
index 0b4eaaf09a8..b163314a6c5 120000
--- a/egs2/TEMPLATE/diar1/scripts/utils/get_model_names.py
+++ b/egs2/TEMPLATE/diar1/scripts/utils/get_model_names.py
@@ -1 +1 @@
-egs2/TEMPLATE/asr1/scripts/utils/get_model_names.py
\ No newline at end of file
+../../../asr1/scripts/utils/get_model_names.py
\ No newline at end of file
diff --git a/egs2/TEMPLATE/enh1/enh.sh b/egs2/TEMPLATE/enh1/enh.sh
index cb6e9e8503b..fcb4f324f15 100755
--- a/egs2/TEMPLATE/enh1/enh.sh
+++ b/egs2/TEMPLATE/enh1/enh.sh
@@ -76,7 +76,7 @@ inference_model=valid.loss.ave.pth
download_model=
# Evaluation related
-scoring_protocol="STOI SDR SAR SIR"
+scoring_protocol="STOI SDR SAR SIR SI_SNR"
ref_channel=0
score_with_asr=false
asr_exp="" # asr model for scoring WER
diff --git a/egs2/TEMPLATE/enh1/scripts/utils/create_README_file.py b/egs2/TEMPLATE/enh1/scripts/utils/create_README_file.py
index a18aed64ab6..0fe3405603d 120000
--- a/egs2/TEMPLATE/enh1/scripts/utils/create_README_file.py
+++ b/egs2/TEMPLATE/enh1/scripts/utils/create_README_file.py
@@ -1 +1 @@
-egs2/TEMPLATE/asr1/scripts/utils/create_README_file.py
\ No newline at end of file
+../../../asr1/scripts/utils/create_README_file.py
\ No newline at end of file
diff --git a/egs2/TEMPLATE/enh1/scripts/utils/get_model_names.py b/egs2/TEMPLATE/enh1/scripts/utils/get_model_names.py
index 0b4eaaf09a8..b163314a6c5 120000
--- a/egs2/TEMPLATE/enh1/scripts/utils/get_model_names.py
+++ b/egs2/TEMPLATE/enh1/scripts/utils/get_model_names.py
@@ -1 +1 @@
-egs2/TEMPLATE/asr1/scripts/utils/get_model_names.py
\ No newline at end of file
+../../../asr1/scripts/utils/get_model_names.py
\ No newline at end of file
diff --git a/egs2/TEMPLATE/enh1/scripts/utils/show_enh_score.sh b/egs2/TEMPLATE/enh1/scripts/utils/show_enh_score.sh
index 66fb9bc81c2..291d67078c3 100755
--- a/egs2/TEMPLATE/enh1/scripts/utils/show_enh_score.sh
+++ b/egs2/TEMPLATE/enh1/scripts/utils/show_enh_score.sh
@@ -51,7 +51,7 @@ while IFS= read -r expdir; do
metrics=()
heading="\n|dataset|"
sep="|---|"
- for type in pesq stoi sar sdr sir si_snr; do
+ for type in pesq estoi stoi sar sdr sir si_snr; do
if ls "${expdir}"/*/scoring/result_${type}.txt &> /dev/null; then
metrics+=("$type")
heading+="${type^^}|"
diff --git a/egs2/TEMPLATE/mt1/mt.sh b/egs2/TEMPLATE/mt1/mt.sh
index 35c6ab276c3..587b4ebf534 100755
--- a/egs2/TEMPLATE/mt1/mt.sh
+++ b/egs2/TEMPLATE/mt1/mt.sh
@@ -299,7 +299,7 @@ if "${token_joint}"; then
src_bpetoken_list="${tgt_bpetoken_list}"
src_chartoken_list="${tgt_chartoken_list}"
else
- src_bpedir="${token_listdir}/src_bpe_${tgt_bpemode}${tgt_nbpe}"
+ src_bpedir="${token_listdir}/src_bpe_${src_bpemode}${src_nbpe}"
src_bpeprefix="${src_bpedir}"/bpe
src_bpemodel="${src_bpeprefix}".model
src_bpetoken_list="${src_bpedir}"/tokens.txt
diff --git a/egs2/TEMPLATE/st1/st.sh b/egs2/TEMPLATE/st1/st.sh
index e90ca0b45ab..c8d4ce55c49 100755
--- a/egs2/TEMPLATE/st1/st.sh
+++ b/egs2/TEMPLATE/st1/st.sh
@@ -326,7 +326,7 @@ if "${token_joint}"; then
src_bpetoken_list="${tgt_bpetoken_list}"
src_chartoken_list="${tgt_chartoken_list}"
else
- src_bpedir="${token_listdir}/src_bpe_${tgt_bpemode}${tgt_nbpe}"
+ src_bpedir="${token_listdir}/src_bpe_${src_bpemode}${src_nbpe}"
src_bpeprefix="${src_bpedir}"/bpe
src_bpemodel="${src_bpeprefix}".model
src_bpetoken_list="${src_bpedir}"/tokens.txt
diff --git a/egs2/conferencingspeech21/enh1/cmd.sh b/egs2/conferencingspeech21/enh1/cmd.sh
new file mode 100644
index 00000000000..2aae6919fef
--- /dev/null
+++ b/egs2/conferencingspeech21/enh1/cmd.sh
@@ -0,0 +1,110 @@
+# ====== About run.pl, queue.pl, slurm.pl, and ssh.pl ======
+# Usage: <cmd>.pl [options] JOB=1:<nj> <log> <command...>
+# e.g.
+# run.pl --mem 4G JOB=1:10 echo.JOB.log echo JOB
+#
+# Options:
+#   --time <time>: Limit the maximum time to execute.
+#   --mem <mem>: Limit the maximum memory usage.
+#   --max-jobs-run <num>: Limit the number of parallel jobs. This is ignored for non-array jobs.
+#   --num-threads <num>: Specify the number of CPU cores.
+#   --gpu <ngpu>: Specify the number of GPU devices.
+#   --config: Change the configuration file from default.
+#
+# "JOB=1:10" is used for "array jobs" and it can control the number of parallel jobs.
+# The left string of "=", i.e. "JOB", is replaced by (Nth job) in the command and the log file name,
+# e.g. "echo JOB" is changed to "echo 3" for the 3rd job and "echo 8" for 8th job respectively.
+# Note that the number must start with a positive number, so you can't use "JOB=0:10" for example.
+#
+# run.pl, queue.pl, slurm.pl, and ssh.pl have a unified interface that does not depend on the backend.
+# These options are mapped to backend-specific options, as configured
+# by "conf/queue.conf" and "conf/slurm.conf" by default.
+# If jobs fail, your configuration might be wrong for your environment.
+#
+#
+# The official documentation for run.pl, queue.pl, slurm.pl, and ssh.pl:
+# "Parallelization in Kaldi": http://kaldi-asr.org/doc/queue.html
+# =========================================================
+
+
+# Select the backend used by run.sh from "local", "stdout", "sge", "pbs", "slurm", "ssh", or "jhu"
+cmd_backend='local'
+
+# Local machine, without any Job scheduling system
+if [ "${cmd_backend}" = local ]; then
+
+ # Used for all other tasks
+ export train_cmd="run.pl"
+ # Used for "*_train.py": "--gpu" is appended optionally by run.sh
+ export cuda_cmd="run.pl"
+ # Used for "*_recog.py"
+ export decode_cmd="run.pl"
+
+# Local machine logging to stdout and log file, without any Job scheduling system
+elif [ "${cmd_backend}" = stdout ]; then
+
+ # Used for all other tasks
+ export train_cmd="stdout.pl"
+ # Used for "*_train.py": "--gpu" is appended optionally by run.sh
+ export cuda_cmd="stdout.pl"
+ # Used for "*_recog.py"
+ export decode_cmd="stdout.pl"
+
+
+# "qsub" (Sun Grid Engine, or derivation of it)
+elif [ "${cmd_backend}" = sge ]; then
+ # The default setting is written in conf/queue.conf.
+ # You must change "-q g.q" for the "queue" for your environment.
+ # To know the "queue" names, type "qhost -q"
+ # Note that to use "--gpu *", you have to setup "complex_value" for the system scheduler.
+
+ export train_cmd="queue.pl"
+ export cuda_cmd="queue.pl"
+ export decode_cmd="queue.pl"
+
+
+# "qsub" (Torque/PBS.)
+elif [ "${cmd_backend}" = pbs ]; then
+ # The default setting is written in conf/pbs.conf.
+
+ export train_cmd="pbs.pl"
+ export cuda_cmd="pbs.pl"
+ export decode_cmd="pbs.pl"
+
+
+# "sbatch" (Slurm)
+elif [ "${cmd_backend}" = slurm ]; then
+ # The default setting is written in conf/slurm.conf.
+ # You must change "-p cpu" and "-p gpu" for the "partition" for your environment.
+ # To know the "partion" names, type "sinfo".
+ # You can use "--gpu * " by default for slurm and it is interpreted as "--gres gpu:*"
+ # The devices are allocated exclusively using "${CUDA_VISIBLE_DEVICES}".
+
+ export train_cmd="slurm.pl"
+ export cuda_cmd="slurm.pl"
+ export decode_cmd="slurm.pl"
+
+elif [ "${cmd_backend}" = ssh ]; then
+ # You have to create ".queue/machines" to specify the host to execute jobs.
+ # e.g. .queue/machines
+ # host1
+ # host2
+ # host3
+ # Assuming you can log in to them without a password, i.e., you have to set up ssh keys.
+
+ export train_cmd="ssh.pl"
+ export cuda_cmd="ssh.pl"
+ export decode_cmd="ssh.pl"
+
+# This is an example of specifying several unique options in the JHU CLSP cluster setup.
+# Users can modify/add their own command options according to their cluster environments.
+elif [ "${cmd_backend}" = jhu ]; then
+
+ export train_cmd="queue.pl --mem 2G"
+ export cuda_cmd="queue-freegpu.pl --mem 2G --gpu 1 --config conf/queue.conf"
+ export decode_cmd="queue.pl --mem 4G"
+
+else
+ echo "$0: Error: Unknown cmd_backend=${cmd_backend}" 1>&2
+ return 1
+fi
diff --git a/egs2/conferencingspeech21/enh1/conf/pbs.conf b/egs2/conferencingspeech21/enh1/conf/pbs.conf
new file mode 100644
index 00000000000..119509938ce
--- /dev/null
+++ b/egs2/conferencingspeech21/enh1/conf/pbs.conf
@@ -0,0 +1,11 @@
+# Default configuration
+command qsub -V -v PATH -S /bin/bash
+option name=* -N $0
+option mem=* -l mem=$0
+option mem=0 # Do not add anything to qsub_opts
+option num_threads=* -l ncpus=$0
+option num_threads=1 # Do not add anything to qsub_opts
+option num_nodes=* -l nodes=$0:ppn=1
+default gpu=0
+option gpu=0
+option gpu=* -l ngpus=$0
diff --git a/egs2/conferencingspeech21/enh1/conf/queue.conf b/egs2/conferencingspeech21/enh1/conf/queue.conf
new file mode 100644
index 00000000000..500582fab31
--- /dev/null
+++ b/egs2/conferencingspeech21/enh1/conf/queue.conf
@@ -0,0 +1,12 @@
+# Default configuration
+command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*
+option name=* -N $0
+option mem=* -l mem_free=$0,ram_free=$0
+option mem=0 # Do not add anything to qsub_opts
+option num_threads=* -pe smp $0
+option num_threads=1 # Do not add anything to qsub_opts
+option max_jobs_run=* -tc $0
+option num_nodes=* -pe mpi $0 # You must set this PE as allocation_rule=1
+default gpu=0
+option gpu=0
+option gpu=* -l gpu=$0 -q g.q
diff --git a/egs2/conferencingspeech21/enh1/conf/slurm.conf b/egs2/conferencingspeech21/enh1/conf/slurm.conf
new file mode 100644
index 00000000000..3b229673638
--- /dev/null
+++ b/egs2/conferencingspeech21/enh1/conf/slurm.conf
@@ -0,0 +1,14 @@
+# Default configuration
+command sbatch --export=PATH
+option name=* --job-name $0
+option time=* --time $0
+option mem=* --mem-per-cpu $0
+option mem=0
+option num_threads=* --cpus-per-task $0
+option num_threads=1 --cpus-per-task 1
+option num_nodes=* --nodes $0
+default gpu=0
+option gpu=0 -p cpu
+option gpu=* -p gpu --gres=gpu:$0 -c $0 # Recommended: allocate at least as many CPUs as GPUs
+# note: the --max-jobs-run option is supported as a special case
+# by slurm.pl and you don't have to handle it in the config file.
diff --git a/egs2/conferencingspeech21/enh1/conf/tuning/train_enh_beamformer_mvdr.yaml b/egs2/conferencingspeech21/enh1/conf/tuning/train_enh_beamformer_mvdr.yaml
new file mode 100644
index 00000000000..46784a14064
--- /dev/null
+++ b/egs2/conferencingspeech21/enh1/conf/tuning/train_enh_beamformer_mvdr.yaml
@@ -0,0 +1,72 @@
+optim: adam
+init: xavier_uniform
+max_epoch: 70
+batch_type: folded
+batch_size: 8
+num_workers: 4
+optim_conf:
+ lr: 1.0e-03
+ eps: 1.0e-08
+ weight_decay: 0
+patience: 4
+val_scheduler_criterion:
+- valid
+- loss
+best_model_criterion:
+- - valid
+ - si_snr
+ - max
+- - valid
+ - loss
+ - min
+keep_nbest_models: 1
+scheduler: reducelronplateau
+scheduler_conf:
+ mode: min
+ factor: 0.5
+ patience: 1
+encoder: stft
+encoder_conf:
+ n_fft: 512
+ hop_length: 128
+ use_builtin_complex: False
+decoder: stft
+decoder_conf:
+ n_fft: 512
+ hop_length: 128
+separator: wpe_beamformer
+separator_conf:
+ num_spk: 1
+ loss_type: mask_mse
+ use_wpe: True
+ wnet_type: blstmp
+ wlayers: 3
+ wunits: 300
+ wprojs: 320
+ wdropout_rate: 0.0
+ taps: 5
+ delay: 3
+ use_dnn_mask_for_wpe: False
+ use_beamformer: True
+ bnet_type: blstmp
+ blayers: 3
+ bunits: 512
+ bprojs: 512
+ badim: 320
+ ref_channel: 0
+ use_noise_mask: False
+ beamformer_type: mvdr_souden
+ bdropout_rate: 0.0
+
+
+criterions:
+ # The first criterion
+ - name: mse
+ conf:
+ compute_on_mask: True
+ mask_type: PSM^2
+ # the wrapper for the current criterion
+    # for the single-talker case, we simply use the fixed_order wrapper
+ wrapper: fixed_order
+ wrapper_conf:
+ weight: 1.0
diff --git a/egs2/conferencingspeech21/enh1/db.sh b/egs2/conferencingspeech21/enh1/db.sh
new file mode 120000
index 00000000000..50d86130898
--- /dev/null
+++ b/egs2/conferencingspeech21/enh1/db.sh
@@ -0,0 +1 @@
+../../TEMPLATE/asr1/db.sh
\ No newline at end of file
diff --git a/egs2/conferencingspeech21/enh1/enh.sh b/egs2/conferencingspeech21/enh1/enh.sh
new file mode 120000
index 00000000000..8fd33b0b191
--- /dev/null
+++ b/egs2/conferencingspeech21/enh1/enh.sh
@@ -0,0 +1 @@
+../../TEMPLATE/enh1/enh.sh
\ No newline at end of file
diff --git a/egs2/conferencingspeech21/enh1/local/config_from_generated.py b/egs2/conferencingspeech21/enh1/local/config_from_generated.py
new file mode 100755
index 00000000000..25758da1024
--- /dev/null
+++ b/egs2/conferencingspeech21/enh1/local/config_from_generated.py
@@ -0,0 +1,80 @@
+#!/usr/bin/env python
+
+# Copyright 2021 Shanghai Jiao Tong University (Authors: Wangyou Zhang)
+# Apache 2.0
+import argparse
+from pathlib import Path
+
+
+def construct_path_dict(wav_list):
+ path_dict = {}
+ with wav_list.open("r") as f:
+ for wavpath in f:
+ wavpath = wavpath.strip()
+ if not wavpath:
+ continue
+ wavname = Path(wavpath).expanduser().resolve().with_suffix("").name
+ path_dict[wavname] = wavpath
+ return path_dict
+
+
+def prepare_config(args):
+ audiodir = Path(args.audiodir).expanduser()
+ clean_list = Path(args.clean_list).expanduser().resolve()
+ noise_list = Path(args.noise_list).expanduser().resolve()
+ outfile = Path(args.outfile).expanduser().resolve()
+
+ speech_data = construct_path_dict(clean_list)
+ noise_data = construct_path_dict(noise_list)
+ audios = {
+ folder: {
+ path.with_suffix("").name: str(path)
+ for path in (audiodir / folder).rglob("*." + args.audio_format)
+ }
+ for folder in ("mix", "noreverb_ref", "reverb_ref")
+ }
+ keys = audios["mix"].keys()
+ assert keys == audios["noreverb_ref"].keys() == audios["reverb_ref"].keys()
+
+ with outfile.open("w") as out:
+ for name in keys:
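+            # Each simulated file name encodes its metadata, joined by "#":
+            # "<clean>#<noise>#<rir>#<start_time>#<snr>#<scale>"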
+ path_clean, path_noise, path_rir, start_time, snr, scale = name.split("#")
+ path_clean = speech_data[path_clean]
+ path_noise = noise_data[path_noise]
+ out.write(
+ f"{path_clean} {start_time} {path_noise} "
+ f"/path/{args.tag}/{path_rir}.wav {snr} {scale}\n"
+ )
+
+
+def get_parser():
+ """Argument parser."""
+ parser = argparse.ArgumentParser()
+ parser.add_argument(
+ "--audiodir",
+ type=str,
+ required=True,
+ help="Paths to the directory containing simulated audio files",
+ )
+ parser.add_argument("--audio-format", type=str, default="wav")
+ parser.add_argument(
+ "--clean_list",
+ type=str,
+ required=True,
+ help="Path to the list of clean speech audio file for simulation",
+ )
+ parser.add_argument(
+ "--noise_list",
+ type=str,
+ required=True,
+ help="Path to the list of noise audio file for simulation",
+ )
+ parser.add_argument("--outfile", type=str, required=True)
+ parser.add_argument("--tag", type=str, default="linear")
+ return parser
+
+
+if __name__ == "__main__":
+ parser = get_parser()
+ args = parser.parse_args()
+ prepare_config(args)
diff --git a/egs2/conferencingspeech21/enh1/local/data.sh b/egs2/conferencingspeech21/enh1/local/data.sh
new file mode 100755
index 00000000000..578370cbaa9
--- /dev/null
+++ b/egs2/conferencingspeech21/enh1/local/data.sh
@@ -0,0 +1,359 @@
+#!/bin/bash
+
+set -e
+set -u
+set -o pipefail
+
+log() {
+ local fname=${BASH_SOURCE[1]##*/}
+ echo -e "$(date '+%Y-%m-%dT%H:%M:%S') (${fname}:${BASH_LINENO[0]}:${FUNCNAME[1]}) $*"
+}
+SECONDS=0
+
+help_message=$(cat << EOF
+Usage: $0 [--stage <stage>] [--stop_stage <stop_stage>] [--use_reverb_ref <true/false>] --official-data-dir <dir>
+
+ required argument:
+  --official-data-dir: path to the directory of official data for ConferencingSpeech2021 with the following structure:
+
+
+ |-- Development_test_set/
+ | |-- playback+noise/
+ | |-- readme.txt
+ | |-- realrecording_cut/
+ | |-- semireal+noise/
+ | |-- simu_multiple_MA/
+ | \-- simu_single_MA/
+ |
+ |-- Training_set/
+ | |-- circle_rir/
+ | |-- linear_rir/
+ | |-- non_uniform_linear_rir/
+ | |-- readme.txt
+ | |-- selected_lists/
+ | \-- train_record_noise/
+ |
+ |-- Evaluation_set/
+ | |-- eval_data/
+  |  | |--task1/
+ | | \--task2/
+ | \-- Readme.txt
+ |
+ \-- config_files_simulation_train/
+ |-- train_simu_circle.config
+ |-- train_simu_linear.config
+ \-- train_simu_non_uniform.config
+
+ optional arguments:
+  [--stage]: 1 (default) to 4
+  [--stop_stage]: 1 to 4 (default)
+  [--use_reverb_ref]: true (default) or false
+EOF
+)
+
+
+stage=1
+stop_stage=4
+official_data_dir=
+use_official_dev=true
+use_reverb_ref=true
+
+log "$0 $*"
+. utils/parse_options.sh
+
+
+. ./path.sh || exit 1;
+. ./cmd.sh || exit 1;
+. ./db.sh || exit 1;
+
+if [ $# -gt 0 ]; then
+ log "${help_message}"
+ exit 2
+fi
+
+if [ ! -e "${official_data_dir}" ]; then
+ log "${help_message}"
+ log "No such directory for --official-data-dir: '${official_data_dir}'"
+ exit 1
+fi
+
+if [ ! -e "${AISHELL}" ]; then
+ log "Fill the value of 'AISHELL' in db.sh"
+ log "(available at http://openslr.org/33/)"
+ exit 1
+fi
+
+if [ ! -e "${AISHELL3}" ]; then
+ log "Fill the value of 'AISHELL3' in db.sh"
+ log "(available at http://openslr.org/93/)"
+ exit 1
+fi
+
+if [ ! -e "${LIBRISPEECH}" ]; then
+ log "Fill the value of 'LIBRISPEECH' in db.sh"
+ log "(available at http://openslr.org/12/)"
+ exit 1
+elif [ ! -e "${LIBRISPEECH}/train-clean-360" ]; then
+ log "Please ensure '${LIBRISPEECH}/train-clean-360' exists"
+ exit 1
+fi
+
+if [ ! -e "${VCTK}" ]; then
+ log "Fill the value of 'VCTK' in db.sh"
+ log "(Version 0.80, available at https://datashare.ed.ac.uk/handle/10283/2651)"
+ exit 1
+fi
+
+if [ ! -e "${MUSAN}" ]; then
+ log "Fill the value of 'MUSAN' in db.sh"
+ log "(available at http://openslr.org/17/)"
+ exit 1
+fi
+
+if [ ! -e "${AUDIOSET}" ]; then
+ log "Fill the value of 'AUDIOSET' in db.sh"
+ log "(available at https://github.com/marc-moreaux/audioset_raw)"
+ exit 1
+fi
+
+
+odir="${PWD}/local"
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+ log "stage 1: Prepare Training and Dev Data for Simulation"
+
+ if [ ! -d "${odir}/ConferencingSpeech2021" ]; then
+ git clone https://github.com/ConferencingSpeech/ConferencingSpeech2021.git "${odir}/ConferencingSpeech2021"
+ fi
+ (
+ cd "${odir}/ConferencingSpeech2021"
+ # This patch is for simulation/mix_wav.py at commit 49d3b2fc47
+ git apply "${odir}/fix_simulation_script.patch"
+ python -m pip install -r requirements.txt
+ )
+
+ rir_dir="${official_data_dir}/Training_set"
+
+ # make symbolic links for each corpus to match the data preparation script
+ corpora_dir="${odir}/ConferencingSpeech2021/corpora"
+ mkdir -p "${corpora_dir}"
+ ln -s "${AISHELL}" "${corpora_dir}/aishell_1"
+ ln -s "${AISHELL3}" "${corpora_dir}/aishell_3"
+ ln -s "${VCTK}" "${corpora_dir}/vctk"
+ ln -s "${LIBRISPEECH}/train-clean-360" "${corpora_dir}/librispeech_360"
+ ln -s "${MUSAN}" "${corpora_dir}/musan"
+ ln -s "${AUDIOSET}" "${corpora_dir}/audioset"
+ ln -s "${rir_dir}/linear_rir" "${corpora_dir}/linear"
+ ln -s "${rir_dir}/circle_rir" "${corpora_dir}/circle"
+ ln -s "${rir_dir}/non_uniform_linear_rir" "${corpora_dir}/non_uniform"
+
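+    # Rewrite the corpus path variables inside the upstream prepare.sh so that
+    # they point at the symbolic links created above.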
+ sed -i -e "s#aishell_1='.*'#aishell_1='${corpora_dir}/aishell_1'#g" \
+ -e "s#aishell_3='.*'#aishell_3='${corpora_dir}/aishell_3'#g" \
+ -e "s#vctk='.*'#vctk='${corpora_dir}/vctk'#g" \
+ -e "s#librispeech='.*'#librispeech='${corpora_dir}/librispeech_360'#g" \
+ -e "s#musan='.*'#musan='${corpora_dir}/musan'#g" \
+ -e "s#audioset='.*'#audioset='${corpora_dir}/audioset'#g" \
+ -e "s#linear='.*'#linear='${corpora_dir}/linear'#g" \
+ -e "s#circle='.*'#circle='${corpora_dir}/circle'#g" \
+ -e "s#non_uniform='.*'#non_uniform='${corpora_dir}/non_uniform'#g" \
+ -e "s#find \${name_path} #find \${name_path}/ #g" \
+ "${odir}/ConferencingSpeech2021/simulation/prepare.sh"
+
+ # This script will generate ${odir}/ConferencingSpeech2021/simulation/data/{train,dev}_*.config
+ (
+ cd "${odir}/ConferencingSpeech2021/simulation"
+ # NOTE (wangyou): 1000+ samples in ConferencingSpeech2021/selected_list/train/audioset.name
+        # might be unavailable from YouTube due to policy violations, copyright claims, or other causes.
+ # In this case, you may want to remove them from the list.
+ bash ./prepare.sh
+ )
+    # If the above script fails to finish successfully, please use the following command instead:
+ #
+ # local/prepare.sh \
+ # --corpora_dir "${corpora_dir}" \
+ # --selected_list_dir "${odir}/ConferencingSpeech2021/selected_lists" \
+ # --outdir "${odir}/ConferencingSpeech2021/simulation/data"
+
+ # Fill ${odir}/ConferencingSpeech2021/simulation/data/dev_*.config with real paths
+ simu_data_path="${odir}/ConferencingSpeech2021/simulation/data"
+ for name in linear circle non_uniform; do
+ python local/prepare_simu_config.py \
+ "${simu_data_path}/dev_${name}_simu_mix.config" \
+ --clean_list "${simu_data_path}/dev_clean.lst" \
+ --noise_list "${simu_data_path}/dev_noise.lst" \
+ --rir_list "${simu_data_path}/dev_${name}_rir.lst" \
+ --outfile "${simu_data_path}/dev_${name}_simu_mix.config"
+ done
+fi
+
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+ log "stage 2: Data Simulation"
+
+ if ${use_official_dev}; then
+ log "Skip simulation (using official development data in track2)"
+
+ datadir="${odir}/ConferencingSpeech2021/simulation/data/wavs/dev"
+ for name in linear circle non_uniform; do
+ mkdir -p "${datadir}/simu_${name}"
+ done
+ track2dir="${official_data_dir}/Development_test_set/simu_multiple_MA"
+ for folder in mix reverb_ref noreverb_ref; do
+ ln -s "${track2dir}/dev_simu_linear_uniform_track2/${folder}" "${datadir}/simu_linear/${folder}"
+ ln -s "${track2dir}/dev_simu_circular_track2/${folder}" "${datadir}/simu_circle/${folder}"
+ ln -s "${track2dir}/dev_simu_linear_nonuniform_track2/${folder}" "${datadir}/simu_non_uniform/${folder}"
+ done
+ simu_data_path="${odir}/ConferencingSpeech2021/simulation/data"
+ for name in linear circle non_uniform; do
+ python local/config_from_generated.py \
+ --audiodir "${datadir}/simu_${name}" \
+ --audio-format wav \
+ --clean_list "${simu_data_path}/dev_clean.lst" \
+ --noise_list "${simu_data_path}/dev_noise.lst" \
+ --tag ${name} \
+ --outfile "${datadir}/simu_${name}/dev_${name}_simu_mix.config"
+ done
+ else
+ # Expected data to be generated:
+        # ${odir}/ConferencingSpeech2021/simulation/data/wavs/dev/
+ # |-- simu_circle/
+ # | |-- dev_circle_simu_mix.config
+ # | |-- mix/*.wav (1588 samples * 8 ch * 6 sec)
+ # | |-- noreverb_ref/*.wav (1588 samples * 8 ch * 6 sec)
+ # | \-- reverb_ref/*.wav (1588 samples * 8 ch * 6 sec)
+ # |-- simu_linear/
+ # | |-- dev_linear_simu_mix.config
+ # | |-- mix/*.wav (1588 samples * 8 ch * 6 sec)
+ # | |-- noreverb_ref/*.wav (1588 samples * 8 ch * 6 sec)
+ # | \-- reverb_ref/*.wav (1588 samples * 8 ch * 6 sec)
+ # \-- simu_non_uniform/
+ # |-- dev_non_uniform_simu_mix.config
+ # |-- mix/*.wav (1588 samples * 8 ch * 6 sec)
+ # |-- noreverb_ref/*.wav (1588 samples * 8 ch * 6 sec)
+ # \-- reverb_ref/*.wav (1588 samples * 8 ch * 6 sec)
+ (
+ cd "${odir}/ConferencingSpeech2021/simulation"
+ for name in linear circle non_uniform; do
+ log "Simulating with dev_${name}_simu_mix.config"
+ python mix_wav.py \
+ --mix_config_path data/dev_${name}_simu_mix.config \
+ --save_dir data/wavs/dev/simu_${name}/ \
+ --chunk_len 6 \
+ --generate_config False
+ done
+ )
+ fi
+fi
+
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+ log "stage 3: Prepare data directory"
+
+ tmpdir=$(mktemp -d /tmp/conferencingspeech.XXXX)
+ ##############################################
+ # Training data will be generated on the fly #
+ ##############################################
+ mkdir -p data/train
+ simu_data_path="${odir}/ConferencingSpeech2021/simulation/data"
+
+ # Prepare wav.scp and spk1.scp
+ sed -e 's/\.\(wav\|flac\)//' "${simu_data_path}/train_clean.lst" | \
+ awk -F '/' '{print $NF}' > "${tmpdir}/utt_clean.list"
+ paste -d' ' "${tmpdir}/utt_clean.list" "${simu_data_path}/train_clean.lst" | sort -u > data/train/wav.scp
+ cp data/train/wav.scp data/train/spk1.scp
+
+ # Prepare utt2spk for data from aishell_1, aishell_3, librispeech_360, and vctk
+ # path -> spkid (aishell_1): .../S0724/BAC009S0724W0121.wav -> S0724
+ # path -> spkid (aishell_3): .../SSB0261/SSB02610250.wav -> SSB0261
+ # path -> spkid (librispeech_360): .../7932/93470/7932-93470-0006.flac -> 7932-93470
+ # path -> spkid (vctk): .../p278/p278_202.wav -> p278
+ sed -e 's/\.\(wav\|flac\)//' "${simu_data_path}/train_clean.lst" | \
+ awk 'BEGIN{ FS="/" } {
+ if(match($0, "librispeech_360")) {i=NF-2; j=NF-1; printf("%s %s-%s\n",$NF,$i,$j)}
+ else {i=NF-1; printf("%s %s\n",$NF,$i)}
+ }' | sort -u > data/train/utt2spk
+ utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
+
+ # Prepare scp files of noises and RIRs for training (used for on-the-fly mixing)
+ # * The noise set is composed of two parts:
+ # (1) selected from MUSAN and Audioset (25390 samples, ~120 hours)
+ # (2) real meeting room noises recorded by high fidelity devices (98 clips, ~13 hours)
+ # NOTE: different noise data may have different sample rates.
+ # * 28914 RIRs are simulated using the image method.
+ sed -e 's/\.\(wav\|flac\)//' "${simu_data_path}/train_noise.lst" | \
+ awk -F '/' '{print $NF}' > "${tmpdir}/utt_noise.list"
+ paste -d' ' "${tmpdir}/utt_noise.list" "${simu_data_path}/train_noise.lst" > data/train/noises.scp
+ find "${official_data_dir}/Training_set/train_record_noise/" -iname "*.wav" > "${tmpdir}/train_record_noise.list"
+ sed -e 's/\.\(wav\|flac\)//' "${tmpdir}/train_record_noise.list" | \
+ awk -F '/' '{print $NF}' > "${tmpdir}/utt_record_noise.list"
+ paste -d' ' "${tmpdir}/utt_record_noise.list" "${tmpdir}/train_record_noise.list" >> data/train/noises.scp
+
+ # NOTE: different RIRs may have different numbers of channels.
+ cat "${simu_data_path}"/train_{circle,linear,non_uniform}_rir.lst > "${tmpdir}/train_rir.list"
+ sed -e 's/\.wav//' "${tmpdir}/train_rir.list" | \
+ awk -F '/' '{print $NF}' > "${tmpdir}/utt_rir.list"
+ paste -d' ' "${tmpdir}/utt_rir.list" "${tmpdir}/train_rir.list" > data/train/rirs.scp
+
+ utils/validate_data_dir.sh --no-feats --no-text data/train
+
+ ####################
+ # Development data #
+ ####################
+ mkdir -p data/dev
+ if ${use_official_dev}; then
+ mkdir -p "${tmpdir}"/dev_{simu_circle,simu_linear,simu_non_uniform}
+ for name in linear circle non_uniform; do
+ python local/prepare_dev_data.py \
+ --audiodirs "${simu_data_path}"/wavs/dev/simu_${name}/mix \
+ --use_reverb_ref ${use_reverb_ref} \
+ --outdir "${tmpdir}"/dev_simu_${name} \
+ --uttid_suffix ${name} \
+ "${simu_data_path}"/wavs/dev/simu_${name}/dev_${name}_simu_mix.config
+ done
+ for f in spk1.scp utt2spk wav.scp; do
+ cat "${tmpdir}"/dev_{simu_circle,simu_linear,simu_non_uniform}/${f} | sort > data/dev/${f}
+ done
+ else
+ cat "${simu_data_path}"/wavs/dev/simu_circle/dev_circle_simu_mix.config \
+ "${simu_data_path}"/wavs/dev/simu_linear/dev_linear_simu_mix.config \
+ "${simu_data_path}"/wavs/dev/simu_non_uniform/dev_non_uniform_simu_mix.config \
+ > ${tmpdir}/dev.config
+ python local/prepare_dev_data.py \
+ --audiodirs "${simu_data_path}"/wavs/dev/{simu_circle,simu_linear,simu_non_uniform}/mix \
+ --use_reverb_ref ${use_reverb_ref} \
+ --outdir data/dev \
+ ${tmpdir}/dev.config
+
+ for f in spk1.scp utt2spk wav.scp; do
+ mv data/dev/${f} data/dev/.${f}
+ sort data/dev/.${f} > data/dev/${f}
+ rm data/dev/.${f}
+ done
+ fi
+ utils/utt2spk_to_spk2utt.pl data/dev/utt2spk > data/dev/spk2utt
+ utils/validate_data_dir.sh --no-feats --no-text data/dev
+
+ rm -rf "$tmpdir"
+fi
+
+if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
+ log "stage 4: Prepare test data"
+ tmpdir=$(mktemp -d /tmp/conferencingspeech.XXXX)
+ ########################
+ # Evaluation test data #
+ ########################
+ mkdir -p data/test
+ mkdir -p "${tmpdir}"/test_{simu_circle,simu_linear,simu_non_uniform}
+ for name in real-recording semi-real-playback semi-real-realspk; do
+ python local/prepare_test_data.py \
+ --audiodirs "${official_data_dir}"/Evaluation_set/eval_data/task2/${name} \
+ --outdir "${tmpdir}"/test_${name} \
+ --uttid_prefix "task2_${name}"
+ done
+ for f in spk1.scp utt2spk wav.scp; do
+ cat "${tmpdir}"/test_{real-recording,semi-real-playback,semi-real-realspk}/${f} | sort > data/test/${f}
+ done
+ utils/utt2spk_to_spk2utt.pl data/test/utt2spk > data/test/spk2utt
+ utils/validate_data_dir.sh --no-feats --no-text data/test
+
+ rm -rf "$tmpdir"
+fi
+
+log "Successfully finished. [elapsed=${SECONDS}s]"
diff --git a/egs2/conferencingspeech21/enh1/local/fix_simulation_script.patch b/egs2/conferencingspeech21/enh1/local/fix_simulation_script.patch
new file mode 100644
index 00000000000..e9ec83dc19a
--- /dev/null
+++ b/egs2/conferencingspeech21/enh1/local/fix_simulation_script.patch
@@ -0,0 +1,98 @@
+diff --git a/requirements.txt b/requirements.txt
+index 8054697..3012192 100644
+--- a/requirements.txt
++++ b/requirements.txt
+@@ -3,5 +3,4 @@ numpy
+ pystoi
+ pesq
+ scipy
+-pyrirgen
+ librosa
+diff --git a/simulation/mix_wav.py b/simulation/mix_wav.py
+index 1cd05c1..d1cb9b6 100644
+--- a/simulation/mix_wav.py
++++ b/simulation/mix_wav.py
+@@ -79,8 +79,8 @@ def clip_data(data, start, segment_length):
+ tgt[st:st+data.shape[0]] += data
+ st = segment_length//3 * 2
+ tgt[st:st+data.shape[0]] += data
+-
+- else:
++
++ elif data_len < segment_length//2:
+ """
+ padding to A_A
+ """
+@@ -92,25 +92,60 @@ def clip_data(data, start, segment_length):
+ tgt[:segment_length//2] += data
+ st = segment_length//2
+ tgt[st:st+data.shape[0]] += data
+-
++
++ elif data_len < segment_length:
++ # (wangyou) in case of outliers
++ """same as (start == -1)--if"""
++ if data_len % 4 == 0:
++ tgt[:data_len] += data
++ tgt[data_len:] += data[:segment_length-data_len]
++ elif data_len % 4 == 1:
++ tgt[:data_len] += data
++ elif data_len % 4 == 2:
++ tgt[-data_len:] += data
++ elif data_len % 4 == 3:
++ tgt[(segment_length-data_len)//2:(segment_length-data_len)//2+data_len] += data
++
++ else:
++ # (wangyou) in case of outliers
++ """same as (start == -1)--else"""
++ if data_len % 4 == 0 or data_len % 4 == 3:
++ tgt += data[(data_len-segment_length)//2:(data_len-segment_length)//2+segment_length]
++ elif data_len % 4 == 1:
++ tgt += data[:segment_length]
++ elif data_len % 4 == 2:
++ tgt += data[-segment_length:]
++
+ elif start == -1:
+ '''
+ this means segment_length < data_len*2
+ padding to A_A
+ '''
+- if data_len % 4 == 0:
+- tgt[:data_len] += data
+- tgt[data_len:] += data[:segment_length-data_len]
+- elif data_len % 4 == 1:
+- tgt[:data_len] += data
+- elif data_len % 4 == 2:
+- tgt[-data_len:] += data
+- elif data_len % 4 == 3:
+- tgt[(segment_length-data_len)//2:(segment_length-data_len)//2+data_len] += data
+-
++ if data_len < segment_length:
++ if data_len % 4 == 0:
++ tgt[:data_len] += data
++ tgt[data_len:] += data[:segment_length-data_len]
++ elif data_len % 4 == 1:
++ tgt[:data_len] += data
++ elif data_len % 4 == 2:
++ tgt[-data_len:] += data
++ elif data_len % 4 == 3:
++ tgt[(segment_length-data_len)//2:(segment_length-data_len)//2+data_len] += data
++
++ else:
++ # (wangyou) in case of outliers
++ if data_len % 4 == 0 or data_len % 4 == 3:
++ tgt += data[(data_len-segment_length)//2:(data_len-segment_length)//2+segment_length]
++ elif data_len % 4 == 1:
++ tgt += data[:segment_length]
++ elif data_len % 4 == 2:
++ tgt += data[-segment_length:]
++
+ else:
++ if start + segment_length > data_len:
++ data = np.pad(data, [0, start + segment_length - data_len], 'constant')
+ tgt += data[start:start+segment_length]
+-
++
+ return tgt
+
+ def rms(data):
diff --git a/egs2/conferencingspeech21/enh1/local/path.sh b/egs2/conferencingspeech21/enh1/local/path.sh
new file mode 100644
index 00000000000..e69de29bb2d
diff --git a/egs2/conferencingspeech21/enh1/local/prepare.sh b/egs2/conferencingspeech21/enh1/local/prepare.sh
new file mode 100755
index 00000000000..ea17270abe4
--- /dev/null
+++ b/egs2/conferencingspeech21/enh1/local/prepare.sh
@@ -0,0 +1,80 @@
+#!/bin/bash
+
+set -e
+set -u
+set -o pipefail
+
+
+corpora_dir=
+selected_list_dir=
+outdir=
+
+echo "$0 $*"
+. utils/parse_options.sh
+
+
+. ./path.sh || exit 1;
+. ./db.sh || exit 1;
+
+
+mkdir -p "${outdir}"
+tmpdir=$(mktemp -d /tmp/cs21.XXXX)
+trap 'rm -rf "$tmpdir"' EXIT
+
+# prepare speech for training
+for name in aishell_3 aishell_1 vctk; do
+ python local/prepare_data_list.py \
+ --outfile "${tmpdir}/train_${name}.lst" \
+ --audiodirs "${corpora_dir}/${name}" \
+ --audio-format "wav" \
+ "${selected_list_dir}/train/${name}.name"
+done
+
+python local/prepare_data_list.py \
+ --outfile "${tmpdir}/train_librispeech_360.lst" \
+ --audiodirs "${corpora_dir}/librispeech_360" \
+ --audio-format "flac" \
+ "${selected_list_dir}/train/librispeech_360.name"
+
+cat "${tmpdir}"/train_{aishell_3,aishell_1,vctk,librispeech_360}.lst > "${outdir}/train_clean.lst"
+
+# prepare noise for training
+python local/prepare_data_list.py \
+ --outfile "${tmpdir}/musan.lst" \
+ --audiodirs "${corpora_dir}/musan" \
+ --audio-format "wav" \
+ "${selected_list_dir}/train/musan.name"
+
+python local/prepare_data_list.py \
+ --outfile "${tmpdir}/audioset.lst" \
+ --audiodirs "${corpora_dir}/audioset" \
+ --audio-format "wav" \
+ --ignore-missing-files True \
+ "${selected_list_dir}/train/audioset.name"
+
+cat "${tmpdir}"/{musan,audioset}.lst > "${outdir}/train_noise.lst"
+
+# prepare speech for development
+python local/prepare_data_list.py \
+ --outfile "${outdir}/dev_clean.lst" \
+ --audiodirs "${corpora_dir}/aishell_1" "${corpora_dir}/vctk" "${corpora_dir}/aishell_3" \
+ --audio-format "wav" \
+ "${selected_list_dir}/dev/clean.name"
+
+# prepare noise for development
+python local/prepare_data_list.py \
+ --outfile "${outdir}/dev_noise.lst" \
+ --audiodirs "${corpora_dir}/musan" \
+ --audio-format "wav" \
+ "${selected_list_dir}/dev/noise.name"
+
+# Prepare the simulated RIR lists for training and development
+for name in linear circle non_uniform; do
+ for mode in train dev; do
+ python local/prepare_data_list.py \
+ --outfile "${outdir}/${mode}_${name}_rir.lst" \
+ --audiodirs "${corpora_dir}/${name}" \
+ --audio-format "wav" \
+ "${selected_list_dir}/${mode}/${name}.name"
+ done
+done
diff --git a/egs2/conferencingspeech21/enh1/local/prepare_data_list.py b/egs2/conferencingspeech21/enh1/local/prepare_data_list.py
new file mode 100755
index 00000000000..5c330d6e99c
--- /dev/null
+++ b/egs2/conferencingspeech21/enh1/local/prepare_data_list.py
@@ -0,0 +1,67 @@
+#!/usr/bin/env python
+
+# Copyright 2021 Shanghai Jiao Tong University (Authors: Wangyou Zhang)
+# Apache 2.0
+import argparse
+from pathlib import Path
+
+from espnet2.utils.types import str2bool
+
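+# Resolve each selected file name listed in `datalist` to its full path under
+# one of the `--audiodirs`, writing "<full_path> <remaining_fields>" lines to
+# `--outfile`.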
+
+def prepare_data(args):
+ datalist = Path(args.datalist).expanduser().resolve()
+ audiodirs = [Path(audiodir).expanduser() for audiodir in args.audiodirs]
+ outfile = Path(args.outfile).expanduser().resolve()
+ audios = {
+ path.name: str(path)
+ for audiodir in audiodirs
+ for path in audiodir.rglob("*." + args.audio_format)
+ }
+ missing_files = []
+ with outfile.open("w") as out, datalist.open("r") as f:
+ for line in f:
+ line = line.strip()
+ if not line:
+ continue
+ wavname, others = line.split(maxsplit=1)
+ if args.ignore_missing_files:
+ if wavname not in audios:
+ missing_files.append(wavname)
+ continue
+ else:
+ assert wavname in audios, "No such file %s in %s" % (
+ wavname,
+ str([str(p) for p in audiodirs]),
+ )
+ out.write(audios[wavname] + " " + others + "\n")
+ if args.ignore_missing_files and len(missing_files) > 0:
+ print(
+ "{} wav missing files are skipped:\n{}".format(
+ len(missing_files), "\n ".join(missing_files)
+ )
+ )
+
+
+def get_parser():
+ """Argument parser."""
+ parser = argparse.ArgumentParser()
+ parser.add_argument(
+ "datalist", type=str, help="Path to the list of audio files for training"
+ )
+ parser.add_argument("--outfile", type=str, required=True)
+ parser.add_argument(
+ "--audiodirs",
+ type=str,
+ nargs="+",
+ required=True,
+ help="Paths to the directories containing audio files",
+ )
+ parser.add_argument("--audio-format", type=str, default="wav")
+ parser.add_argument("--ignore-missing-files", type=str2bool, default=False)
+ return parser
+
+
+if __name__ == "__main__":
+ parser = get_parser()
+ args = parser.parse_args()
+ prepare_data(args)
diff --git a/egs2/conferencingspeech21/enh1/local/prepare_dev_data.py b/egs2/conferencingspeech21/enh1/local/prepare_dev_data.py
new file mode 100755
index 00000000000..7ea801511b8
--- /dev/null
+++ b/egs2/conferencingspeech21/enh1/local/prepare_dev_data.py
@@ -0,0 +1,91 @@
+#!/usr/bin/env python
+
+# Copyright 2021 Shanghai Jiao Tong University (Authors: Wangyou Zhang)
+# Apache 2.0
+import argparse
+from pathlib import Path
+import re
+
+from espnet2.fileio.datadir_writer import DatadirWriter
+from espnet2.utils.types import str2bool
+
+
+def prepare_data(args):
+ config_file = Path(args.config_file).expanduser().resolve()
+ audiodirs = [Path(audiodir).expanduser().resolve() for audiodir in args.audiodirs]
+ audios = {
+ path.stem: str(path)
+ for audiodir in audiodirs
+ for path in audiodir.rglob("*.wav")
+ }
+ suffix = "_" + args.uttid_suffix if args.uttid_suffix else ""
+ with DatadirWriter(args.outdir) as writer, config_file.open("r") as f:
+ for line in f:
+ line = line.strip()
+ if not line:
+ continue
+
+ path_clean, start_time, path_noise, path_rir, snr, scale = line.split()
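+            # The simulated wav files are named by joining these fields with
+            # "#", so rebuilding that name recovers the utterance ID used to
+            # look up the audio path.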
+ uttid = "#".join(
+ [
+ Path(path_clean).stem,
+ Path(path_noise).stem,
+ Path(path_rir).stem,
+ start_time,
+ snr,
+ scale,
+ ]
+ )
+ writer["wav.scp"][uttid + suffix] = audios[uttid]
+ if args.use_reverb_ref:
+ repl = r"/reverb_ref/\1"
+ else:
+ repl = r"/noreverb_ref/\1"
+ writer["spk1.scp"][uttid + suffix] = re.sub(
+ r"/mix/([^\\]+\.wav$)", repl, audios[uttid]
+ )
+ if "librispeech" in path_clean:
+ spkid = "-".join(path_clean.split("/")[-3:-1])
+ else:
+ spkid = path_clean.split("/")[-2]
+ writer["utt2spk"][uttid + suffix] = spkid
+
+
+def get_parser():
+ """Argument parser."""
+ parser = argparse.ArgumentParser()
+ parser.add_argument(
+ "config_file", type=str, help="Path to the list of audio files for training"
+ )
+ parser.add_argument(
+ "--audiodirs",
+ type=str,
+ nargs="+",
+ required=True,
+ help="Paths to the directories containing simulated audio files",
+ )
+ parser.add_argument(
+ "--uttid_suffix",
+ type=str,
+ default="",
+ help="suffix to be appended to each utterance ID",
+ )
+ parser.add_argument(
+ "--outdir",
+ type=str,
+ required=True,
+ help="Paths to the directory for storing *.scp, utt2spk, spk2utt",
+ )
+ parser.add_argument(
+ "--use_reverb_ref",
+ type=str2bool,
+ default=True,
+ help="True to use reverberant references, False to use non-reverberant ones",
+ )
+ return parser
+
+
+if __name__ == "__main__":
+ parser = get_parser()
+ args = parser.parse_args()
+ prepare_data(args)
diff --git a/egs2/conferencingspeech21/enh1/local/prepare_simu_config.py b/egs2/conferencingspeech21/enh1/local/prepare_simu_config.py
new file mode 100755
index 00000000000..ec6ab395f42
--- /dev/null
+++ b/egs2/conferencingspeech21/enh1/local/prepare_simu_config.py
@@ -0,0 +1,85 @@
+#!/usr/bin/env python
+
+# Copyright 2021 Shanghai Jiao Tong University (Authors: Wangyou Zhang)
+# Apache 2.0
+import argparse
+from pathlib import Path
+
+
+def construct_path_dict(wav_list):
+ path_dict = {}
+ with wav_list.open("r") as f:
+ for wavpath in f:
+ wavpath = wavpath.strip()
+ if not wavpath:
+ continue
+ wavname = Path(wavpath).expanduser().resolve().name
+ path_dict[wavname] = wavpath
+ return path_dict
+
+
+def prepare_config(args):
+ config = Path(args.config).expanduser().resolve()
+ clean_list = Path(args.clean_list).expanduser().resolve()
+ noise_list = Path(args.noise_list).expanduser().resolve()
+ rir_list = Path(args.rir_list).expanduser().resolve()
+ outfile = Path(args.outfile).expanduser().resolve()
+
+ speech_data = construct_path_dict(clean_list)
+ noise_data = construct_path_dict(noise_list)
+ rir_data = construct_path_dict(rir_list)
+
+ lines = []
+ with config.open("r") as f:
+ for line in f:
+ line = line.strip()
+ if not line:
+ continue
+
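+            # Each config line is:
+            # "<clean_path> <start_time> <noise_path> <rir_path> <snr> <scale>"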
+ path_clean, start_time, path_noise, path_rir, snr, scale = line.split()
+ path_clean = speech_data[Path(path_clean).name]
+ path_noise = noise_data[Path(path_noise).name]
+ path_rir = rir_data[Path(path_rir).name]
+ lines.append(
+ f"{path_clean} {start_time} {path_noise} {path_rir} {snr} {scale}\n"
+ )
+
+ with outfile.open("w") as out:
+ for line in lines:
+ out.write(line)
+
+
+def get_parser():
+ """Argument parser."""
+ parser = argparse.ArgumentParser()
+ parser.add_argument(
+ "config",
+ type=str,
+ help="Path to the config file for simulation",
+ )
+ parser.add_argument(
+ "--clean_list",
+ type=str,
+ required=True,
+ help="Path to the list of clean speech audio file for simulation",
+ )
+ parser.add_argument(
+ "--noise_list",
+ type=str,
+ required=True,
+ help="Path to the list of noise audio file for simulation",
+ )
+ parser.add_argument(
+ "--rir_list",
+ type=str,
+ required=True,
+ help="Path to the list of RIR audio file for simulation",
+ )
+ parser.add_argument("--outfile", type=str, required=True)
+ return parser
+
+
+if __name__ == "__main__":
+ parser = get_parser()
+ args = parser.parse_args()
+ prepare_config(args)
diff --git a/egs2/conferencingspeech21/enh1/local/prepare_test_data.py b/egs2/conferencingspeech21/enh1/local/prepare_test_data.py
new file mode 100755
index 00000000000..974c25d812c
--- /dev/null
+++ b/egs2/conferencingspeech21/enh1/local/prepare_test_data.py
@@ -0,0 +1,62 @@
+#!/usr/bin/env python
+
+# Copyright 2021 Shanghai Jiao Tong University (Authors: Jing Shi)
+# Apache 2.0
+import argparse
+from pathlib import Path
+
+from espnet2.fileio.datadir_writer import DatadirWriter
+
+
+def prepare_data(args):
+ audiodirs = [Path(audiodir).expanduser().resolve() for audiodir in args.audiodirs]
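+    # Utterance IDs are "<prefix>_<parent_dir>_<file_stem>", so files with the
+    # same stem under different subdirectories do not collide.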
+ if args.uttid_prefix:
+ audios = {
+ "_".join([args.uttid_prefix, str(path.parent.stem), str(path.stem)]): str(
+ path
+ )
+ for audiodir in audiodirs
+ for path in audiodir.rglob("*.wav")
+ }
+ else:
+ audios = {
+ "_".join([path.parent, path.stem]): str(path)
+ for audiodir in audiodirs
+ for path in audiodir.rglob("*.wav")
+ }
+ with DatadirWriter(args.outdir) as writer:
+ for uttid, utt_path in audios.items():
+ writer["wav.scp"][uttid] = utt_path
+ writer["spk1.scp"][uttid] = utt_path
+ writer["utt2spk"][uttid] = uttid
+
+
+def get_parser():
+ """Argument parser."""
+ parser = argparse.ArgumentParser()
+ parser.add_argument(
+ "--audiodirs",
+ type=str,
+ nargs="+",
+ required=True,
+ help="Paths to the directories containing simulated audio files",
+ )
+ parser.add_argument(
+ "--uttid_prefix",
+ type=str,
+ default="",
+ help="Prefix to be appended to each utterance ID",
+ )
+ parser.add_argument(
+ "--outdir",
+ type=str,
+ required=True,
+ help="Paths to the directory for storing *.scp, utt2spk, spk2utt",
+ )
+ return parser
+
+
+if __name__ == "__main__":
+ parser = get_parser()
+ args = parser.parse_args()
+ prepare_data(args)
diff --git a/egs2/conferencingspeech21/enh1/path.sh b/egs2/conferencingspeech21/enh1/path.sh
new file mode 120000
index 00000000000..eb217d35673
--- /dev/null
+++ b/egs2/conferencingspeech21/enh1/path.sh
@@ -0,0 +1 @@
+../../TEMPLATE/enh1/path.sh
\ No newline at end of file
diff --git a/egs2/conferencingspeech21/enh1/pyscripts b/egs2/conferencingspeech21/enh1/pyscripts
new file mode 120000
index 00000000000..ac68ad75b60
--- /dev/null
+++ b/egs2/conferencingspeech21/enh1/pyscripts
@@ -0,0 +1 @@
+../../TEMPLATE/asr1/pyscripts
\ No newline at end of file
diff --git a/egs2/conferencingspeech21/enh1/run.sh b/egs2/conferencingspeech21/enh1/run.sh
new file mode 100755
index 00000000000..b978635a8e2
--- /dev/null
+++ b/egs2/conferencingspeech21/enh1/run.sh
@@ -0,0 +1,30 @@
+#!/bin/bash
+# Set bash to 'debug' mode; it will exit on:
+# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands'
+set -e
+set -u
+set -o pipefail
+
+
+# See local/data.sh for more information.
+official_data_dir=/path/to/ConferencingSpeech2021_data
+sample_rate=16k
+
+train_set=train
+valid_set=dev
+test_sets="test"
+
+./enh.sh \
+ --audio_format wav \
+ --train_set "${train_set}" \
+ --valid_set "${valid_set}" \
+ --test_sets "${test_sets}" \
+ --fs ${sample_rate} \
+ --ngpu 1 \
+ --spk_num 1 \
+ --local_data_opts "--official_data_dir ${official_data_dir}" \
+ --enh_config conf/tuning/train_enh_beamformer_mvdr.yaml \
+ --use_dereverb_ref false \
+ --use_noise_ref false \
+ --inference_model "valid.loss.best.pth" \
+ "$@"
diff --git a/egs2/conferencingspeech21/enh1/scripts b/egs2/conferencingspeech21/enh1/scripts
new file mode 120000
index 00000000000..9aeb3f26509
--- /dev/null
+++ b/egs2/conferencingspeech21/enh1/scripts
@@ -0,0 +1 @@
+../../TEMPLATE/enh1/scripts
\ No newline at end of file
diff --git a/egs2/conferencingspeech21/enh1/steps b/egs2/conferencingspeech21/enh1/steps
new file mode 120000
index 00000000000..91f2d234e20
--- /dev/null
+++ b/egs2/conferencingspeech21/enh1/steps
@@ -0,0 +1 @@
+../../../tools/kaldi/egs/wsj/s5/steps
\ No newline at end of file
diff --git a/egs2/conferencingspeech21/enh1/utils b/egs2/conferencingspeech21/enh1/utils
new file mode 120000
index 00000000000..f49247da827
--- /dev/null
+++ b/egs2/conferencingspeech21/enh1/utils
@@ -0,0 +1 @@
+../../../tools/kaldi/egs/wsj/s5/utils
\ No newline at end of file
diff --git a/egs2/ml_openslr63/asr1/README.md b/egs2/ml_openslr63/asr1/README.md
new file mode 100644
index 00000000000..35485aec30c
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/README.md
@@ -0,0 +1,99 @@
+
+# RESULTS
+## Environments
+- date: `Sat Mar 19 20:34:49 UTC 2022`
+- python version: `3.9.10 | packaged by conda-forge | (main, Feb 1 2022, 21:24:11) [GCC 9.4.0]`
+- espnet version: `espnet 0.10.7a1`
+- pytorch version: `pytorch 1.10.1`
+- Git hash: `d2410457152872f63c51ee76ed746a6ea3153f09`
+ - Commit date: `Sat Mar 19 09:04:54 2022 +0000`
+- Pretrained Model
+ - Hugging Face Hub:
+ https://huggingface.co/espnet/ml_openslr63
+
+## asr_train_asr_conformer_s3prlfrontend_hubert_fused_raw_ml_bpe150_sp
+### WER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|decode_asr_lm_lm_train_lm_ml_bpe150_valid.loss.ave_asr_model_valid.acc.ave/dev_ml|369|2345|75.2|21.8|3.0|2.4|27.2|71.5|
+|decode_asr_lm_lm_train_lm_ml_bpe150_valid.loss.ave_asr_model_valid.acc.ave/test_ml|1062|6136|67.0|28.7|4.3|2.6|35.6|71.8|
+
+### CER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|decode_asr_lm_lm_train_lm_ml_bpe150_valid.loss.ave_asr_model_valid.acc.ave/dev_ml|369|21321|96.1|2.2|1.7|0.9|4.7|71.5|
+|decode_asr_lm_lm_train_lm_ml_bpe150_valid.loss.ave_asr_model_valid.acc.ave/test_ml|1062|57065|93.5|3.2|3.3|1.3|7.7|71.8|
+
+### TER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|decode_asr_lm_lm_train_lm_ml_bpe150_valid.loss.ave_asr_model_valid.acc.ave/dev_ml|369|13402|93.5|4.4|2.1|0.9|7.4|71.3|
+|decode_asr_lm_lm_train_lm_ml_bpe150_valid.loss.ave_asr_model_valid.acc.ave/test_ml|1062|35911|89.9|6.3|3.8|1.3|11.4|70.4|
+
+
+# RESULTS
+## Environments
+- date: `Sat Mar 19 07:22:48 UTC 2022`
+- python version: `3.9.10 | packaged by conda-forge | (main, Feb 1 2022, 21:24:11) [GCC 9.4.0]`
+- espnet version: `espnet 0.10.7a1`
+- pytorch version: `pytorch 1.10.1`
+- Git hash: `813ee348e36db8a6f8d0d717be8767f938b2e62b`
+ - Commit date: `Fri Mar 18 11:12:20 2022 -0400`
+
+## asr_train_asr_conformer_s3prlfrontend_hubert_raw_ml_bpe150_sp
+### WER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|decode_asr_lm_lm_train_lm_ml_bpe150_valid.loss.ave_asr_model_valid.acc.ave/dev_ml|369|2345|71.4|24.4|4.2|2.5|31.1|72.6|
+|decode_asr_lm_lm_train_lm_ml_bpe150_valid.loss.ave_asr_model_valid.acc.ave/test_ml|1062|6136|61.8|32.1|6.1|2.0|40.3|73.5|
+
+### CER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|decode_asr_lm_lm_train_lm_ml_bpe150_valid.loss.ave_asr_model_valid.acc.ave/dev_ml|369|21321|94.5|2.3|3.3|1.0|6.5|72.6|
+|decode_asr_lm_lm_train_lm_ml_bpe150_valid.loss.ave_asr_model_valid.acc.ave/test_ml|1062|57065|90.9|3.4|5.8|1.1|10.3|73.5|
+
+### TER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|decode_asr_lm_lm_train_lm_ml_bpe150_valid.loss.ave_asr_model_valid.acc.ave/dev_ml|369|13402|91.3|4.5|4.1|0.9|9.6|72.6|
+|decode_asr_lm_lm_train_lm_ml_bpe150_valid.loss.ave_asr_model_valid.acc.ave/test_ml|1062|35911|86.7|6.6|6.7|0.9|14.1|72.1|
+
+
+# RESULTS
+## Environments
+- date: `Fri Mar 18 17:25:39 UTC 2022`
+- python version: `3.9.10 | packaged by conda-forge | (main, Feb 1 2022, 21:24:11) [GCC 9.4.0]`
+- espnet version: `espnet 0.10.7a1`
+- pytorch version: `pytorch 1.10.1`
+- Git hash: `9cb00370db63ced70ee39e1a2ba3137311842d44`
+ - Commit date: `Fri Mar 18 10:47:05 2022 -0400`
+
+## asr_train_asr_conformer5_raw_ml_bpe150_sp
+### WER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|decode_asr_lm_lm_train_lm_ml_bpe150_valid.loss.ave_asr_model_valid.acc.ave/dev_ml|369|2345|71.0|25.5|3.5|2.4|31.4|73.2|
+|decode_asr_lm_lm_train_lm_ml_bpe150_valid.loss.ave_asr_model_valid.acc.ave/test_ml|1062|6136|63.0|32.1|4.9|2.2|39.2|73.2|
+
+### CER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|decode_asr_lm_lm_train_lm_ml_bpe150_valid.loss.ave_asr_model_valid.acc.ave/dev_ml|369|21321|94.3|3.3|2.4|1.3|7.0|73.2|
+|decode_asr_lm_lm_train_lm_ml_bpe150_valid.loss.ave_asr_model_valid.acc.ave/test_ml|1062|57065|91.1|4.8|4.0|1.5|10.4|73.2|
+
+### TER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|decode_asr_lm_lm_train_lm_ml_bpe150_valid.loss.ave_asr_model_valid.acc.ave/dev_ml|369|13402|90.7|6.2|3.1|1.4|10.6|72.9|
+|decode_asr_lm_lm_train_lm_ml_bpe150_valid.loss.ave_asr_model_valid.acc.ave/test_ml|1062|35911|86.7|8.6|4.6|1.6|14.8|71.8|
+
diff --git a/egs2/ml_openslr63/asr1/asr.sh b/egs2/ml_openslr63/asr1/asr.sh
new file mode 120000
index 00000000000..60b05122cfd
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/asr.sh
@@ -0,0 +1 @@
+../../TEMPLATE/asr1/asr.sh
\ No newline at end of file
diff --git a/egs2/ml_openslr63/asr1/cmd.sh b/egs2/ml_openslr63/asr1/cmd.sh
new file mode 120000
index 00000000000..f77e339f822
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/cmd.sh
@@ -0,0 +1 @@
+../../TEMPLATE/asr1/cmd.sh
\ No newline at end of file
diff --git a/egs2/ml_openslr63/asr1/conf/decode_asr.yaml b/egs2/ml_openslr63/asr1/conf/decode_asr.yaml
new file mode 120000
index 00000000000..f3f59d5ac2b
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/conf/decode_asr.yaml
@@ -0,0 +1 @@
+tuning/decode_transformer.yaml
\ No newline at end of file
diff --git a/egs2/ml_openslr63/asr1/conf/fbank.conf b/egs2/ml_openslr63/asr1/conf/fbank.conf
new file mode 100644
index 00000000000..82ac7bd0dbc
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/conf/fbank.conf
@@ -0,0 +1,2 @@
+--sample-frequency=16000
+--num-mel-bins=80
diff --git a/egs2/ml_openslr63/asr1/conf/pbs.conf b/egs2/ml_openslr63/asr1/conf/pbs.conf
new file mode 100644
index 00000000000..119509938ce
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/conf/pbs.conf
@@ -0,0 +1,11 @@
+# Default configuration
+command qsub -V -v PATH -S /bin/bash
+option name=* -N $0
+option mem=* -l mem=$0
+option mem=0 # Do not add anything to qsub_opts
+option num_threads=* -l ncpus=$0
+option num_threads=1 # Do not add anything to qsub_opts
+option num_nodes=* -l nodes=$0:ppn=1
+default gpu=0
+option gpu=0
+option gpu=* -l ngpus=$0
diff --git a/egs2/ml_openslr63/asr1/conf/pitch.conf b/egs2/ml_openslr63/asr1/conf/pitch.conf
new file mode 100644
index 00000000000..e959a19d5b8
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/conf/pitch.conf
@@ -0,0 +1 @@
+--sample-frequency=16000
diff --git a/egs2/ml_openslr63/asr1/conf/queue.conf b/egs2/ml_openslr63/asr1/conf/queue.conf
new file mode 100644
index 00000000000..500582fab31
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/conf/queue.conf
@@ -0,0 +1,12 @@
+# Default configuration
+command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*
+option name=* -N $0
+option mem=* -l mem_free=$0,ram_free=$0
+option mem=0 # Do not add anything to qsub_opts
+option num_threads=* -pe smp $0
+option num_threads=1 # Do not add anything to qsub_opts
+option max_jobs_run=* -tc $0
+option num_nodes=* -pe mpi $0 # This PE must be configured with allocation_rule=1
+default gpu=0
+option gpu=0
+option gpu=* -l gpu=$0 -q g.q
diff --git a/egs2/ml_openslr63/asr1/conf/slurm.conf b/egs2/ml_openslr63/asr1/conf/slurm.conf
new file mode 100644
index 00000000000..3b229673638
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/conf/slurm.conf
@@ -0,0 +1,14 @@
+# Default configuration
+command sbatch --export=PATH
+option name=* --job-name $0
+option time=* --time $0
+option mem=* --mem-per-cpu $0
+option mem=0
+option num_threads=* --cpus-per-task $0
+option num_threads=1 --cpus-per-task 1
+option num_nodes=* --nodes $0
+default gpu=0
+option gpu=0 -p cpu
+option gpu=* -p gpu --gres=gpu:$0 -c $0 # It is recommended to allocate at least as many CPUs as GPUs
+# note: the --max-jobs-run option is supported as a special case
+# by slurm.pl and you don't have to handle it in the config file.
diff --git a/egs2/ml_openslr63/asr1/conf/train_asr.yaml b/egs2/ml_openslr63/asr1/conf/train_asr.yaml
new file mode 120000
index 00000000000..56ea1bf0c00
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/conf/train_asr.yaml
@@ -0,0 +1 @@
+./tuning/train_asr_conformer.yaml
\ No newline at end of file
diff --git a/egs2/ml_openslr63/asr1/conf/train_lm.yaml b/egs2/ml_openslr63/asr1/conf/train_lm.yaml
new file mode 100644
index 00000000000..bda020d1c57
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/conf/train_lm.yaml
@@ -0,0 +1,14 @@
+lm_conf:
+ nlayers: 2
+ unit: 650
+optim: sgd # or adam
+batch_type: folded
+batch_size: 64 # batch size in LM training
+max_epoch: 30 # if the data size is large, we can reduce this
+patience: 3
+
+best_model_criterion:
+- - valid
+ - loss
+ - min
+keep_nbest_models: 1
diff --git a/egs2/ml_openslr63/asr1/conf/tuning/decode_transformer.yaml b/egs2/ml_openslr63/asr1/conf/tuning/decode_transformer.yaml
new file mode 100644
index 00000000000..d89db079882
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/conf/tuning/decode_transformer.yaml
@@ -0,0 +1,7 @@
+batch_size: 1
+beam_size: 10
+penalty: 0.0
+maxlenratio: 0.0
+minlenratio: 0.0
+ctc_weight: 0.5
+lm_weight: 0.3
diff --git a/egs2/ml_openslr63/asr1/conf/tuning/train_asr_conformer.yaml b/egs2/ml_openslr63/asr1/conf/tuning/train_asr_conformer.yaml
new file mode 100644
index 00000000000..b226d1b519f
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/conf/tuning/train_asr_conformer.yaml
@@ -0,0 +1,78 @@
+# network architecture
+
+# frontend related
+frontend: default
+frontend_conf:
+ n_fft: 512
+ win_length: 400
+ hop_length: 160
+
+# encoder related
+encoder: conformer
+encoder_conf:
+ input_layer: conv2d
+ num_blocks: 12
+ linear_units: 2048
+ dropout_rate: 0.1
+ output_size: 256
+ attention_heads: 4
+ attention_dropout_rate: 0.0
+ pos_enc_layer_type: rel_pos
+ selfattention_layer_type: rel_selfattn
+ activation_type: swish
+ macaron_style: true
+ use_cnn_module: true
+ cnn_module_kernel: 15
+
+
+# decoder related
+decoder: transformer
+decoder_conf:
+ input_layer: embed
+ num_blocks: 6
+ linear_units: 2048
+ dropout_rate: 0.1
+
+# hybrid CTC/attention
+model_conf:
+ ctc_weight: 0.3
+ lsm_weight: 0.1
+ length_normalized_loss: false
+
+# optimization related
+optim: adam
+accum_grad: 1
+grad_clip: 3
+max_epoch: 50
+optim_conf:
+ lr: 4.0
+scheduler: noamlr
+scheduler_conf:
+ model_size: 256
+ warmup_steps: 25000
+
+# minibatch related
+batch_type: numel
+batch_bins: 2000000
+
+best_model_criterion:
+- - valid
+ - acc
+ - max
+keep_nbest_models: 10
+
+specaug: specaug
+specaug_conf:
+ apply_time_warp: true
+ time_warp_window: 5
+ time_warp_mode: bicubic
+ apply_freq_mask: true
+ freq_mask_width_range:
+ - 0
+ - 30
+ num_freq_mask: 2
+ apply_time_mask: true
+ time_mask_width_range:
+ - 0
+ - 40
+ num_time_mask: 2
diff --git a/egs2/ml_openslr63/asr1/conf/tuning/train_asr_conformer_s3prlfrontend_hubert.yaml b/egs2/ml_openslr63/asr1/conf/tuning/train_asr_conformer_s3prlfrontend_hubert.yaml
new file mode 100644
index 00000000000..6266111739d
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/conf/tuning/train_asr_conformer_s3prlfrontend_hubert.yaml
@@ -0,0 +1,87 @@
+# network architecture
+
+freeze_param: [
+"frontend.upstream"
+]
+
+# frontend related
+frontend: s3prl
+frontend_conf:
+ frontend_conf:
+ upstream: hubert_large_ll60k # Note: If the upstream is changed, please change the input_size in the preencoder.
+ download_dir: ./hub
+ multilayer_feature: True
+
+preencoder: linear
+preencoder_conf:
+ input_size: 1024 # Note: If the upstream is changed, please change this value accordingly.
+ output_size: 80
+
+# encoder related
+encoder: conformer
+encoder_conf:
+ input_layer: conv2d
+ num_blocks: 12
+ linear_units: 2048
+ dropout_rate: 0.1
+ output_size: 256
+ attention_heads: 4
+ attention_dropout_rate: 0.0
+ pos_enc_layer_type: rel_pos
+ selfattention_layer_type: rel_selfattn
+ activation_type: swish
+ macaron_style: true
+ use_cnn_module: true
+ cnn_module_kernel: 15
+
+
+# decoder related
+decoder: transformer
+decoder_conf:
+ input_layer: embed
+ num_blocks: 6
+ linear_units: 2048
+ dropout_rate: 0.1
+
+# hybrid CTC/attention
+model_conf:
+ ctc_weight: 0.3
+ lsm_weight: 0.1
+ length_normalized_loss: false
+
+# optimization related
+optim: adam
+accum_grad: 1
+grad_clip: 3
+max_epoch: 50
+optim_conf:
+ lr: 4.0
+scheduler: noamlr
+scheduler_conf:
+ warmup_steps: 25000
+
+# minibatch related
+batch_type: numel
+batch_bins: 2000000
+
+best_model_criterion:
+- - valid
+ - acc
+ - max
+keep_nbest_models: 10
+
+specaug: specaug
+specaug_conf:
+ apply_time_warp: true
+ time_warp_window: 5
+ time_warp_mode: bicubic
+ apply_freq_mask: true
+ freq_mask_width_range:
+ - 0
+ - 30
+ num_freq_mask: 2
+ apply_time_mask: true
+ time_mask_width_range:
+ - 0
+ - 40
+ num_time_mask: 2
diff --git a/egs2/ml_openslr63/asr1/conf/tuning/train_asr_conformer_s3prlfrontend_hubert_fused.yaml b/egs2/ml_openslr63/asr1/conf/tuning/train_asr_conformer_s3prlfrontend_hubert_fused.yaml
new file mode 100644
index 00000000000..9998618bf28
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/conf/tuning/train_asr_conformer_s3prlfrontend_hubert_fused.yaml
@@ -0,0 +1,93 @@
+# network architecture
+
+# frontend related
+frontend: fused
+frontend_conf:
+ frontends:
+ - frontend_type: s3prl
+ frontend_conf:
+ upstream: hubert_large_ll60k
+ download_dir: ./hub
+ multilayer_feature: True
+
+ - frontend_type: default
+ n_fft: 512
+ win_length: 400
+ hop_length: 160
+
+ align_method: linear_projection
+ proj_dim: 100
+
+preencoder: linear
+preencoder_conf:
+ input_size: 200 # Note: If the upstream is changed, please change this value accordingly.
+ output_size: 80
+
+# encoder related
+encoder: conformer
+encoder_conf:
+ input_layer: conv2d
+ num_blocks: 12
+ linear_units: 2048
+ dropout_rate: 0.1
+ output_size: 256
+ attention_heads: 4
+ attention_dropout_rate: 0.0
+ pos_enc_layer_type: rel_pos
+ selfattention_layer_type: rel_selfattn
+ activation_type: swish
+ macaron_style: true
+ use_cnn_module: true
+ cnn_module_kernel: 15
+
+
+# decoder related
+decoder: transformer
+decoder_conf:
+ input_layer: embed
+ num_blocks: 6
+ linear_units: 2048
+ dropout_rate: 0.1
+
+# hybrid CTC/attention
+model_conf:
+ ctc_weight: 0.3
+ lsm_weight: 0.1
+ length_normalized_loss: false
+
+# optimization related
+optim: adam
+accum_grad: 1
+grad_clip: 3
+max_epoch: 50
+optim_conf:
+ lr: 0.5
+scheduler: noamlr
+scheduler_conf:
+ warmup_steps: 2500
+
+# minibatch related
+batch_type: numel
+batch_bins: 2000000
+
+best_model_criterion:
+- - valid
+ - acc
+ - max
+keep_nbest_models: 10
+
+specaug: specaug
+specaug_conf:
+ apply_time_warp: true
+ time_warp_window: 5
+ time_warp_mode: bicubic
+ apply_freq_mask: true
+ freq_mask_width_range:
+ - 0
+ - 30
+ num_freq_mask: 2
+ apply_time_mask: true
+ time_mask_width_range:
+ - 0
+ - 40
+ num_time_mask: 2
diff --git a/egs2/ml_openslr63/asr1/db.sh b/egs2/ml_openslr63/asr1/db.sh
new file mode 120000
index 00000000000..50d86130898
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/db.sh
@@ -0,0 +1 @@
+../../TEMPLATE/asr1/db.sh
\ No newline at end of file
diff --git a/egs2/ml_openslr63/asr1/local/data.sh b/egs2/ml_openslr63/asr1/local/data.sh
new file mode 100755
index 00000000000..6b549f1f096
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/local/data.sh
@@ -0,0 +1,67 @@
+#!/bin/bash
+
+
+. ./path.sh || exit 1;
+. ./cmd.sh || exit 1;
+. ./db.sh || exit 1;
+
+# general configuration
+stage=0 # start from 0 if you need to start from data preparation
+stop_stage=1 # inclusive
+SECONDS=0
+
+log() {
+ local fname=${BASH_SOURCE[1]##*/}
+ echo -e "$(date '+%Y-%m-%dT%H:%M:%S') (${fname}:${BASH_LINENO[0]}:${FUNCNAME[1]}) $*"
+}
+
+
+# Set bash to 'debug' mode; it will exit on:
+# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands'
+set -e
+set -u
+set -o pipefail
+
+. utils/parse_options.sh
+
+log "data preparation started"
+
+if [ -z "${MALAYALAM}" ]; then
+    log "Fill the value of 'MALAYALAM' in db.sh"
+    exit 1
+fi
+mkdir -p "${MALAYALAM}"
+
+workspace=$PWD
+
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+ log "sub-stage 0: Download Data to downloads"
+
+ cd ${MALAYALAM}
+ wget https://www.openslr.org/resources/63/ml_in_female.zip
+ unzip -o ml_in_female.zip
+ rm -f ml_in_female.zip
+ wget https://www.openslr.org/resources/63/ml_in_male.zip
+ unzip -o ml_in_male.zip
+ rm -f ml_in_male.zip
+
+ wget https://www.openslr.org/resources/63/line_index_female.tsv
+ wget https://www.openslr.org/resources/63/line_index_male.tsv
+ cat line_index_female.tsv line_index_male.tsv > line_index_all.tsv
+ cd $workspace
+fi
+
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+ log "sub-stage 1: Preparing Data for openslr"
+
+ python3 local/data_prep.py -d ${MALAYALAM}
+ utils/spk2utt_to_utt2spk.pl data/train_ml/spk2utt > data/train_ml/utt2spk
+ utils/spk2utt_to_utt2spk.pl data/dev_ml/spk2utt > data/dev_ml/utt2spk
+ utils/spk2utt_to_utt2spk.pl data/test_ml/spk2utt > data/test_ml/utt2spk
+ utils/fix_data_dir.sh data/train_ml
+ utils/fix_data_dir.sh data/dev_ml
+ utils/fix_data_dir.sh data/test_ml
+fi
+
+log "Successfully finished. [elapsed=${SECONDS}s]"
diff --git a/egs2/ml_openslr63/asr1/local/data_prep.py b/egs2/ml_openslr63/asr1/local/data_prep.py
new file mode 100644
index 00000000000..bd174f75e68
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/local/data_prep.py
@@ -0,0 +1,99 @@
+#!/usr/bin/env python3
+
+# Adapted from data_prep.py in jv_openslr35 in ESPnet:
+# https://github.com/espnet/espnet/blob/master/egs2/jv_openslr35/
+# asr1/local/data_prep.py
+
+
+import argparse
+import os
+import random
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ parser.add_argument("-d", help="downloads directory", type=str, default="downloads")
+ args = parser.parse_args()
+
+ tsv_path = "%s/line_index_all.tsv" % args.d
+
+ with open(tsv_path, "r", encoding="utf-8") as inf:
+ tsv_lines = inf.readlines()
+ tsv_lines = [line.strip() for line in tsv_lines]
+
+ spk2utt = {}
+ utt2text = {}
+ for line in tsv_lines:
+ l_list = line.split("\t")
+ fid = l_list[0]
+ spk = fid.split("_")[1]
+ text = l_list[1]
+ text = text.replace(".", "")
+ text = text.replace(",", "")
+ text = text.lower()
+ path = "%s/%s.wav" % (args.d, fid)
+ if os.path.exists(path):
+ utt2text[fid] = text
+ if spk in spk2utt:
+ spk2utt[spk].append(fid)
+ else:
+ spk2utt[spk] = [fid]
+
+ spks = sorted(list(spk2utt.keys()))
+ num_fids = 0
+ num_test_spks = 0
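+    # Hold out speakers (in sorted order) until they cover at least 1000
+    # utterances; these held-out speakers form the test set.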
+ for spk in spks:
+ num_test_spks += 1
+ fids = sorted(list(set(spk2utt[spk])))
+ num_fids += len(fids)
+ if num_fids >= 1000:
+ break
+ test_spks = spks[:num_test_spks]
+ train_dev_spks = spks[num_test_spks:]
+ random.Random(0).shuffle(train_dev_spks)
+ num_train = int(len(train_dev_spks) * 0.9)
+ train_spks = train_dev_spks[:num_train]
+ dev_spks = train_dev_spks[num_train:]
+
+ spks_by_phase = {"train": train_spks, "dev": dev_spks, "test": test_spks}
+ flac_dir = "%s" % args.d
+ sr = 16000
+ for phase in spks_by_phase:
+ spks = spks_by_phase[phase]
+ text_strs = []
+ wav_scp_strs = []
+ spk2utt_strs = []
+ num_fids = 0
+ for spk in spks:
+ fids = sorted(list(set(spk2utt[spk])))
+ num_fids += len(fids)
+ if phase == "test" and num_fids > 1000:
+ curr_num_fids = num_fids - 1000
+ random.Random(1).shuffle(fids)
+ fids = fids[:curr_num_fids]
+ utts = [spk + "-" + f for f in fids]
+ utts_str = " ".join(utts)
+ spk2utt_strs.append("%s %s" % (spk, utts_str))
+ for fid, utt in zip(fids, utts):
+ cmd = "ffmpeg -i %s/%s.wav -f wav -ar %d -ab 16 -ac 1 - |" % (
+ flac_dir,
+ fid,
+ sr,
+ )
+ text_strs.append("%s %s" % (utt, utt2text[fid]))
+ wav_scp_strs.append("%s %s" % (utt, cmd))
+ phase_dir = "data/%s_ml" % phase
+ if not os.path.exists(phase_dir):
+ os.makedirs(phase_dir)
+ text_strs = sorted(text_strs)
+ wav_scp_strs = sorted(wav_scp_strs)
+ spk2utt_strs = sorted(spk2utt_strs)
+ with open(os.path.join(phase_dir, "text"), "w+") as ouf:
+ for s in text_strs:
+ ouf.write("%s\n" % s)
+ with open(os.path.join(phase_dir, "wav.scp"), "w+") as ouf:
+ for s in wav_scp_strs:
+ ouf.write("%s\n" % s)
+ with open(os.path.join(phase_dir, "spk2utt"), "w+") as ouf:
+ for s in spk2utt_strs:
+ ouf.write("%s\n" % s)
diff --git a/egs2/ml_openslr63/asr1/local/path.sh b/egs2/ml_openslr63/asr1/local/path.sh
new file mode 100644
index 00000000000..e69de29bb2d
diff --git a/egs2/ml_openslr63/asr1/path.sh b/egs2/ml_openslr63/asr1/path.sh
new file mode 120000
index 00000000000..c9ac0a75bc6
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/path.sh
@@ -0,0 +1 @@
+../../TEMPLATE/asr1/path.sh
\ No newline at end of file
diff --git a/egs2/ml_openslr63/asr1/pyscripts b/egs2/ml_openslr63/asr1/pyscripts
new file mode 120000
index 00000000000..ac68ad75b60
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/pyscripts
@@ -0,0 +1 @@
+../../TEMPLATE/asr1/pyscripts
\ No newline at end of file
diff --git a/egs2/ml_openslr63/asr1/run.sh b/egs2/ml_openslr63/asr1/run.sh
new file mode 100644
index 00000000000..e085b5f0002
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/run.sh
@@ -0,0 +1,33 @@
+#!/usr/bin/env bash
+# Set bash to 'debug' mode; it will exit on:
+# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands'
+set -e
+set -u
+set -o pipefail
+
+train_set="train_ml"
+train_dev="dev_ml"
+test_set="test_ml"
+
+asr_config=conf/train_asr.yaml
+inference_config=conf/decode_asr.yaml
+lm_config=conf/train_lm.yaml
+
+./asr.sh \
+ --ngpu 1 \
+ --lang "ml" \
+ --use_lm true \
+ --lm_config "${lm_config}" \
+ --token_type bpe \
+ --nbpe 150 \
+ --bpemode "unigram" \
+ --feats_type raw \
+ --speed_perturb_factors "0.9 1.0 1.1" \
+ --gpu_inference true \
+ --asr_config "${asr_config}" \
+ --inference_config "${inference_config}" \
+ --train_set "${train_set}" \
+ --valid_set "${train_dev}" \
+ --test_sets "${train_dev} ${test_set}" \
+ --bpe_train_text "data/${train_set}/text" \
+ --lm_train_text "data/${train_set}/text"
diff --git a/egs2/ml_openslr63/asr1/scripts b/egs2/ml_openslr63/asr1/scripts
new file mode 120000
index 00000000000..b25829705dc
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/scripts
@@ -0,0 +1 @@
+../../TEMPLATE/asr1/scripts
\ No newline at end of file
diff --git a/egs2/ml_openslr63/asr1/steps b/egs2/ml_openslr63/asr1/steps
new file mode 120000
index 00000000000..69ab7056139
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/steps
@@ -0,0 +1 @@
+../../TEMPLATE/asr1/steps
\ No newline at end of file
diff --git a/egs2/ml_openslr63/asr1/utils b/egs2/ml_openslr63/asr1/utils
new file mode 120000
index 00000000000..e18ae14b549
--- /dev/null
+++ b/egs2/ml_openslr63/asr1/utils
@@ -0,0 +1 @@
+../../TEMPLATE/asr1/utils
\ No newline at end of file
diff --git a/egs2/wsj/asr1/README.md b/egs2/wsj/asr1/README.md
index f87e60e1991..95f4d70d278 100644
--- a/egs2/wsj/asr1/README.md
+++ b/egs2/wsj/asr1/README.md
@@ -53,6 +53,38 @@
|decode_lm_lm_train_lm_transformer_en_char_valid.loss.ave_asr_model_valid.acc.ave/test_eval92|333|33341|99.3|0.3|0.4|0.1|0.8|32.4|
+## Mask-CTC
+
+- Training config: [conf/tuning/train_asr_transformer_maskctc.yaml](conf/tuning/train_asr_transformer_maskctc.yaml)
+- Inference config: [conf/tuning/inference_asr_maskctc.yaml](conf/tuning/inference_asr_maskctc.yaml)
+- Pretrained model: https://huggingface.co/espnet/YosukeHiguchi_espnet2_wsj_asr_transformer_maskctc
+
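+A minimal decoding sketch with this model (an illustration: it assumes the
+Mask-CTC inference class `espnet2.bin.asr_inference_maskctc.Speech2Text`
+exposes the same `from_pretrained` helper as the standard `Speech2Text`, and
+that `sample.wav` is a local 16 kHz mono recording):
+
+```python
+import soundfile as sf
+
+from espnet2.bin.asr_inference_maskctc import Speech2Text
+
+# downloads and caches the model from the Hugging Face tag above
+speech2text = Speech2Text.from_pretrained(
+    "espnet/YosukeHiguchi_espnet2_wsj_asr_transformer_maskctc"
+)
+speech, rate = sf.read("sample.wav")  # 16 kHz mono is assumed
+text, *_ = speech2text(speech)[0]
+print(text)
+```
+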
+### Environments
+
+- date: `Wed Mar 23 04:54:11 JST 2022`
+- python version: `3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]`
+- espnet version: `espnet 0.10.7a1`
+- chainer version: `chainer 6.0.0`
+- pytorch version: `pytorch 1.10.1`
+- Git hash: `f29fc9d34f98635bca9e9f7860f3f6cb04300146`
+ - Commit date: `Tue Mar 22 05:48:17 2022 +0900`
+
+
+### WER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|inference_asr_maskctc_asr_model_valid.cer_ctc.ave_10best/test_dev93|503|8234|87.2|11.6|1.2|1.0|13.9|79.3|
+|inference_asr_maskctc_asr_model_valid.cer_ctc.ave_10best/test_eval92|333|5643|90.1|9.2|0.7|1.1|11.0|71.5|
+
+### CER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|inference_asr_maskctc_asr_model_valid.cer_ctc.ave_10best/test_dev93|503|48634|96.7|1.7|1.6|1.0|4.2|81.3|
+|inference_asr_maskctc_asr_model_valid.cer_ctc.ave_10best/test_eval92|333|33341|97.7|1.3|1.1|1.0|3.3|76.0|
+
+
## Using Transformer LM (ASR model is same as the above): lm_weight=1.2, ctc_weight=0.3, beam_size=20
- ASR config: [conf/tuning/train_asr_transformer2.yaml](conf/tuning/train_asr_transformer2.yaml)
diff --git a/egs2/wsj/asr1/conf/tuning/train_asr_transformer_ctc.yaml b/egs2/wsj/asr1/conf/tuning/train_asr_transformer_ctc.yaml
new file mode 100644
index 00000000000..63989a65e7e
--- /dev/null
+++ b/egs2/wsj/asr1/conf/tuning/train_asr_transformer_ctc.yaml
@@ -0,0 +1,55 @@
+batch_type: folded
+batch_size: 32
+accum_grad: 8
+max_epoch: 300
+patience: none
+init: none
+best_model_criterion:
+- - valid
+  - cer_ctc
+  - min
+keep_nbest_models: 10
+
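+# ctc_weight 1.0 trains a pure-CTC model (the attention-decoder loss is
+# disabled), which is why checkpoints are selected by cer_ctc above.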
+model: espnet
+model_conf:
+ ctc_weight: 1.0
+ lsm_weight: 0.1
+ length_normalized_loss: false
+
+encoder: transformer
+encoder_conf:
+ output_size: 256
+ attention_heads: 4
+ linear_units: 2048
+ num_blocks: 12
+ dropout_rate: 0.1
+ positional_dropout_rate: 0.1
+ attention_dropout_rate: 0.0
+ input_layer: conv2d
+ normalize_before: true
+
+optim: adam
+optim_conf:
+ lr: 0.002
+ weight_decay: 0.000001
+scheduler: warmuplr
+scheduler_conf:
+ warmup_steps: 15000
+
+num_att_plot: 0
+
+specaug: specaug
+specaug_conf:
+ apply_time_warp: true
+ time_warp_window: 5
+ time_warp_mode: bicubic
+ apply_freq_mask: true
+ freq_mask_width_range:
+ - 0
+ - 27
+ num_freq_mask: 2
+ apply_time_mask: true
+ time_mask_width_ratio_range:
+ - 0.
+ - 0.05
+ num_time_mask: 5
\ No newline at end of file
diff --git a/egs2/wsj/asr1/conf/tuning/train_asr_transformer_maskctc.yaml b/egs2/wsj/asr1/conf/tuning/train_asr_transformer_maskctc.yaml
index 8f5204bef97..fa4cd100542 100644
--- a/egs2/wsj/asr1/conf/tuning/train_asr_transformer_maskctc.yaml
+++ b/egs2/wsj/asr1/conf/tuning/train_asr_transformer_maskctc.yaml
@@ -1,15 +1,16 @@
batch_type: folded
batch_size: 32
accum_grad: 8
-max_epoch: 100
+max_epoch: 300
patience: none
init: none
best_model_criterion:
- - valid
-  - acc_mlm
-  - max
+  - cer_ctc
+  - min
keep_nbest_models: 10
+# specify model type as "maskctc"
model: maskctc
model_conf:
    ctc_weight: 0.3
@@ -28,6 +29,7 @@ encoder_conf:
input_layer: conv2d
normalize_before: true
+# Masked Language Model (MLM)-based decoder
decoder: mlm
decoder_conf:
    attention_heads: 4
diff --git a/espnet/version.txt b/espnet/version.txt
index 574cb0d455e..94306f7cdd7 100644
--- a/espnet/version.txt
+++ b/espnet/version.txt
@@ -1 +1 @@
-0.10.7a1
+202204
diff --git a/espnet2/bin/tts_inference.py b/espnet2/bin/tts_inference.py
index 338ce8a016b..683074d2eb0 100755
--- a/espnet2/bin/tts_inference.py
+++ b/espnet2/bin/tts_inference.py
@@ -92,6 +92,7 @@ def __init__(
        device: str = "cpu",
        seed: int = 777,
        always_fix_seed: bool = False,
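+        # If True, feed the normalized features (feat_gen) to the vocoder
+        # rather than the denormalized ones (feat_gen_denorm).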
+        prefer_normalized_feats: bool = False,
    ):
        """Initialize Text2Speech module."""
        assert check_argument_types()
@@ -114,6 +115,7 @@ def __init__(
        self.seed = seed
        self.always_fix_seed = always_fix_seed
        self.vocoder = None
+        self.prefer_normalized_feats = prefer_normalized_feats
        if self.tts.require_vocoder:
            vocoder = TTSTask.build_vocoder_from_file(
                vocoder_config, vocoder_file, model, device
@@ -209,10 +211,13 @@ def __call__(
        # apply vocoder (mel-to-wav)
        if self.vocoder is not None:
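+            # Use the normalized features either when explicitly requested
+            # or when the model produced no denormalized features.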
-            if output_dict.get("feat_gen_denorm") is not None:
-                input_feat = output_dict["feat_gen_denorm"]
-            else:
+            if (
+                self.prefer_normalized_feats
+                or output_dict.get("feat_gen_denorm") is None
+            ):
                input_feat = output_dict["feat_gen"]
+            else:
+                input_feat = output_dict["feat_gen_denorm"]
            wav = self.vocoder(input_feat)
            output_dict.update(wav=wav)
diff --git a/espnet2/st/espnet_model.py b/espnet2/st/espnet_model.py
index f4d59d1a0cc..eb4a707f6ca 100644
--- a/espnet2/st/espnet_model.py
+++ b/espnet2/st/espnet_model.py
@@ -78,6 +78,8 @@ def __init__(
        # note that eos is the same as sos (equivalent ID)
        self.sos = vocab_size - 1
        self.eos = vocab_size - 1
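+        # source-vocabulary sos/eos for the auxiliary ASR (source-language) loss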
+        self.src_sos = src_vocab_size - 1
+        self.src_eos = src_vocab_size - 1
        self.vocab_size = vocab_size
        self.src_vocab_size = src_vocab_size
        self.ignore_id = ignore_id
@@ -409,7 +411,9 @@ def _calc_asr_att_loss(
        ys_pad: torch.Tensor,
        ys_pad_lens: torch.Tensor,
    ):
-        ys_in_pad, ys_out_pad = add_sos_eos(ys_pad, self.sos, self.eos, self.ignore_id)
+        ys_in_pad, ys_out_pad = add_sos_eos(
+            ys_pad, self.src_sos, self.src_eos, self.ignore_id
+        )
        ys_in_lens = ys_pad_lens + 1

        # 1. Forward decoder
@@ -420,7 +424,7 @@ def _calc_asr_att_loss(
        # 2. Compute attention loss
        loss_att = self.criterion_asr(decoder_out, ys_out_pad)
        acc_att = th_accuracy(
-            decoder_out.view(-1, self.vocab_size),
+            decoder_out.view(-1, self.src_vocab_size),
            ys_out_pad,
            ignore_label=self.ignore_id,
        )
diff --git a/espnet2/tasks/enh.py b/espnet2/tasks/enh.py
index 633bcf1114c..c2ba18b4c01 100644
--- a/espnet2/tasks/enh.py
+++ b/espnet2/tasks/enh.py
@@ -233,12 +233,16 @@ def build_model(cls, args: argparse.Namespace) -> ESPnetEnhancementModel:
        decoder = decoder_choices.get_class(args.decoder)(**args.decoder_conf)
        loss_wrappers = []
-        for ctr in args.criterions:
-            criterion = criterion_choices.get_class(ctr["name"])(**ctr["conf"])
-            loss_wrapper = loss_wrapper_choices.get_class(ctr["wrapper"])(
-                criterion=criterion, **ctr["wrapper_conf"]
-            )
-            loss_wrappers.append(loss_wrapper)
+
+        if getattr(args, "criterions", None) is not None:
+            # This check keeps compatibility with models packaged
+            # by older versions.
+            for ctr in args.criterions:
+                criterion = criterion_choices.get_class(ctr["name"])(**ctr["conf"])
+                loss_wrapper = loss_wrapper_choices.get_class(ctr["wrapper"])(
+                    criterion=criterion, **ctr["wrapper_conf"]
+                )
+                loss_wrappers.append(loss_wrapper)
        # 1. Build model
        model = ESPnetEnhancementModel(
diff --git a/setup.py b/setup.py
index 1fb22d2c7c2..054ac22e5cc 100644
--- a/setup.py
+++ b/setup.py
@@ -85,7 +85,7 @@
"hacking>=2.0.0",
"mock>=2.0.0",
"pycodestyle",
- "jsondiff>=1.2.0",
+ "jsondiff<2.0.0,>=1.2.0",
"flake8>=3.7.8",
"flake8-docstrings>=1.3.1",
"black",
diff --git a/test_utils/test_evaluate_asr.bats b/test_utils/test_evaluate_asr.bats
index 3b8b51da792..4831d409412 100644
--- a/test_utils/test_evaluate_asr.bats
+++ b/test_utils/test_evaluate_asr.bats
@@ -15,7 +15,7 @@ EOF
@test "evaluate_asr" {
cd egs2/mini_an4/asr1
- model_tag="kamo-naoyuki/mini_an4_asr_train_raw_bpe_valid.acc.best"
+ model_tag="espnet/kamo-naoyuki-mini_an4_asr_train_raw_bpe_valid.acc.best"
scripts/utils/evaluate_asr.sh \
--stop-stage 3 \
--model_tag "${model_tag}" \
diff --git a/test_utils/test_evaluate_asr_hf.bats b/test_utils/test_evaluate_asr_hf.bats
deleted file mode 100644
index 598455b8529..00000000000
--- a/test_utils/test_evaluate_asr_hf.bats
+++ /dev/null
@@ -1,29 +0,0 @@
-#!/usr/bin/env bats
-
-setup() {
- tmpdir=/tmp/espnet2-test-evaluate-asr-hf-${RANDOM}
- # Create dummy data
- mkdir -p ${tmpdir}/data
- echo "dummy A" > ${tmpdir}/data/text
- echo "dummy ${tmpdir}/data/dummy.wav" > ${tmpdir}/data/wav.scp
- python << EOF
-import numpy as np
-import soundfile as sf
-sf.write("${tmpdir}/data/dummy.wav", np.zeros(16000 * 2,), 16000, "PCM_16")
-EOF
-}
-
-@test "evaluate_asr_hf" {
- cd egs2/mini_an4/asr1
- model_tag="espnet/kamo-naoyuki-mini_an4_asr_train_raw_bpe_valid.acc.best"
- scripts/utils/evaluate_asr.sh \
- --stop-stage 3 \
- --model_tag "${model_tag}" \
- --gt_text "${tmpdir}/data/text" \
- --inference_args "--beam_size 1" \
- "${tmpdir}/data/wav.scp" "${tmpdir}/asr_results"
-}
-
-teardown() {
- rm -r $tmpdir
-}
diff --git a/tools/installers/install_deepxi.sh b/tools/installers/install_deepxi.sh
new file mode 100755
index 00000000000..43d49f29ace
--- /dev/null
+++ b/tools/installers/install_deepxi.sh
@@ -0,0 +1,29 @@
+#!/bin/bash
+#==============================================================================
+# Title: install_deepxi.sh
+# Description: Install everything necessary to set up DeepXi.
+# Author: Fabian Hörst, based on the DeepXi GitHub page
+# DeepXi GitHub: https://github.com/anicolson/DeepXi
+# Date: 2021-12-04
+# Version: 1.0
+# Usage: bash install_deepxi.sh
+# Python environment: the DeepXi Python environment is saved under
+# ~/venv/DeepXi in your home directory
+#==============================================================================
+
+# Exit script if any command fails
+set -e
+set -o pipefail
+
+echo "Installing DeepXi"
+
+# If the directory already exists, pull any missing files
+if [ -d "DeepXi" ]; then
+ cd DeepXi
+ git pull https://github.com/anicolson/DeepXi.git
+ cd ..
+# Otherwise, clone the repository into the current directory
+else
+ git clone https://github.com/anicolson/DeepXi.git
+fi
+echo "DeepXi installed"
diff --git a/tools/installers/install_openface.sh b/tools/installers/install_openface.sh
new file mode 100755
index 00000000000..8b589ef824e
--- /dev/null
+++ b/tools/installers/install_openface.sh
@@ -0,0 +1,51 @@
+#!/bin/bash
+#==============================================================================
+# Title: install_openface.sh
+# Description: Install everything necessary for OpenFace to compile.
+# Installs all required dependencies; use it only if you do not have them
+# installed already, or if you do not mind specific versions of gcc, g++,
+# cmake, OpenCV, etc. being installed.
+# Author: Fabian Hörst
+# Reference: Thanks to Daniyal Shahrokhian and Tadas Baltrusaitis, whose
+# scripts this one is based on
+# OpenFace GitHub: https://github.com/TadasBaltrusaitis/OpenFace
+# Date: 2021-03-30
+# Version: 1.0
+# Usage: bash install_openface.sh (intended for Ubuntu 18.04 or 20.04 only)
+#==============================================================================
+
+# Exit script if any command fails
+set -e
+set -o pipefail
+
+# Get current directory
+DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
+
+# Check the Ubuntu version
+ubuntu_version=$(lsb_release -rs)
+if [ "${ubuntu_version}" != "18.04" ] && [ "${ubuntu_version}" != "20.04" ]; then
+    echo "This script does not support your Ubuntu version. Please install manually. Further information can be found here:"
+    echo "https://github.com/TadasBaltrusaitis/OpenFace/wiki/Unix-Installation"
+    exit 1
+fi
+
+
+# OpenFace installation
+echo "Downloading OpenFace"
+git clone https://github.com/TadasBaltrusaitis/OpenFace.git
+cd OpenFace
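+# The upstream CMakeLists.txt is replaced below with a patched copy; this
+# script assumes it is run from inside an installations/ directory whose
+# parent provides that replacement CMakeLists.txt.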
+rm -rf CMakeLists.txt
+cd ../..
+cp CMakeLists.txt installations/OpenFace
+cd installations/OpenFace
+echo "Installing OpenFace..."
+mkdir -p build
+cd build
+cmake -D CMAKE_CXX_COMPILER=g++-8 -D CMAKE_C_COMPILER=gcc-8 -D CMAKE_BUILD_TYPE=RELEASE ..
+make
+
+# download_models.sh and the patch-expert files live in the OpenFace root,
+# so leave the build directory first
+cd ..
+./download_models.sh
+cp lib/local/LandmarkDetector/model/patch_experts/cen_* build/bin/model/patch_experts/
+
+cd ..
+echo "OpenFace successfully installed."
+
+
diff --git a/tools/installers/install_pesq.sh b/tools/installers/install_pesq.sh
index 29677c5c32e..5e707e9151d 100755
--- a/tools/installers/install_pesq.sh
+++ b/tools/installers/install_pesq.sh
@@ -9,7 +9,7 @@ fi
if [ ! -e PESQ.zip ]; then
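+    # mirror of the ITU-T P.862 reference software, used in place of the
+    # direct itu.int link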
wget --tries=3 --no-check-certificate \
- 'http://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-P.862-200511-I!Amd2!SOFT-ZST-E&type=items' -O PESQ.zip
+ 'https://github.com/LiChenda/itu_pesq/raw/main/T-REC-P.862-200511.zip' -O PESQ.zip
fi
if [ ! -e PESQ ]; then
mkdir -p PESQ_P.862.2
diff --git a/tools/installers/install_vidaug.sh b/tools/installers/install_vidaug.sh
new file mode 100755
index 00000000000..71b27a95b2a
--- /dev/null
+++ b/tools/installers/install_vidaug.sh
@@ -0,0 +1,26 @@
+#!/bin/bash
+#==============================================================================
+# Title: install_vidaug.sh
+# Description: Install the vidaug video augmentation library into the ESPnet
+# Python environment, together with its required dependencies.
+# Author: Fabian Hörst
+# Vidaug GitHub: https://github.com/okankop/vidaug
+# Date: 2021-07-19
+# Version: 1.0
+# Usage: bash install_vidaug.sh PATH_TO_ESPNET_MAIN_FOLDER (intended for
+# Ubuntu 18.04 or 20.04)
+#==============================================================================
+
+# Get the ESPnet path, e.g. "/home/fabian/AVSR/espnet", from the first argument
+ESPNET=$1
+. "${ESPNET}"/tools/activate_python.sh
+
+# Install required packages
+pip3 install numpy
+pip3 install scipy
+pip3 install scikit-image
+pip3 install pillow
+
+git clone https://github.com/okankop/vidaug
+cd vidaug
+python3 setup.py sdist && pip3 install dist/vidaug-0.1.tar.gz
+