Merge branch 'master' of github.com:espnet/espnet into dpclanddan
earthmanylf committed Feb 27, 2022
2 parents d3acdcc + 637d8c3 commit 5f7e2e7
Showing 80 changed files with 3,378 additions and 519 deletions.
19 changes: 17 additions & 2 deletions CONTRIBUTING.md
@@ -53,9 +53,9 @@ ESPnet2's recipes correspond to `egs2`. ESPnet2 applies a new paradigm without d
For ESPnet2, we do not recommend preparing each corpus's recipe stages separately; instead, use the common pipelines we provide in `asr.sh`, `tts.sh`, and
`enh.sh`. For details on creating ESPnet2 recipes, please refer to [egs2-readme](https://github.com/espnet/espnet/blob/master/egs2/TEMPLATE/README.md).

-The common pipeline of ESPnet2 recipes will take care of the `RESULTS.md` generation, model packing, and uploading. ESPnet2 models are maintained at Zenodo and Hugging Face.
+The common pipeline of ESPnet2 recipes will take care of the `RESULTS.md` generation, model packing, and uploading. ESPnet2 models are maintained at Hugging Face and Zenodo (deprecated).
You can also refer to the documentation at https://github.com/espnet/espnet_model_zoo
-To upload your model, you need first:
+To upload your model, you first need to (this is currently deprecated; uploading to the Hugging Face Hub is preferred):
1. Sign up to Zenodo: https://zenodo.org/
2. Create access token: https://zenodo.org/account/settings/applications/tokens/new/
3. Set your environment: % export ACCESS_TOKEN="<your token>"
@@ -64,6 +64,21 @@ To port models from Zenodo using the Hugging Face Hub,
1. Create a Hugging Face account - https://huggingface.co/
2. Request to be added to espnet organisation - https://huggingface.co/espnet
3. Go to `egs2/RECIPE/*/scripts/utils` and run `./upload_models_to_hub.sh "ZENODO_MODEL_NAME"`
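
For example, a hypothetical invocation (the recipe path is a placeholder; the quoted string is whatever name your model was uploaded to Zenodo under):

```sh
# Hypothetical example: port a model previously uploaded to Zenodo.
# Replace the recipe path and the quoted name with your own Zenodo model name.
cd egs2/librispeech/asr1/scripts/utils
./upload_models_to_hub.sh "ZENODO_MODEL_NAME"
```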

To upload models using `huggingface-cli`, follow these steps (a condensed command sequence is sketched after the list):
You can also refer to https://huggingface.co/docs/transformers/model_sharing
1. Create a Hugging Face account - https://huggingface.co/
2. Request to be added to espnet organisation - https://huggingface.co/espnet
3. Run `huggingface-cli login` (the token requested at this step can be created under Settings > Access Tokens > espnet token)
4. `huggingface-cli repo create your-model-name --organization espnet`
5. `git clone https://huggingface.co/username/your-model-name` (clone this outside the ESPnet tree to avoid issues, as this is itself a git repo)
6. `cd your-model-name`
7. `git lfs install`
8. Copy the contents of your recipe's `exp` directory into this directory (check other models for a similar task under the espnet organization to confirm the directory structure)
9. `git add .`
10. `git commit -m "Add model files"`
11. `git push`
12. Check that the inference demo on Hugging Face runs successfully to verify the upload
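
Condensed into a single shell session, the steps above look roughly like this (a sketch; `your-model-name` and the recipe path are placeholders, and the repository is assumed to live under the espnet organization):

```sh
# A sketch of steps 4-11; repository name and paths are placeholders.
huggingface-cli repo create your-model-name --organization espnet
git clone https://huggingface.co/espnet/your-model-name  # clone outside the ESPnet tree
cd your-model-name
git lfs install
cp -r /path/to/espnet/egs2/RECIPE/asr1/exp/* .  # copy your recipe's exp contents
git add .
git commit -m "Add model files"
git push
```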

#### 1.3.3 Additional requirements for new recipe

4 changes: 2 additions & 2 deletions doc/espnet2_tutorial.md
@@ -180,7 +180,7 @@ You need to use one of the following two ways to change the training configuration
```sh
# Give a configuration file
-./run.sh --asr_train_config conf/train_asr.yaml
+./run.sh --asr_config conf/train_asr.yaml
# Give arguments to "espnet2/bin/asr_train.py" directly
./run.sh --asr_args "--foo arg --bar arg2"
```
@@ -291,7 +291,7 @@ To use SSLRs in your task, you need to make several modifications.
### Usage
1. To reduce the time spent in the `collect_stats` step, specify `--feats_normalize uttmvn` in `run.sh` and pass it as an argument to `asr.sh` or other task-specific scripts. (Recommended)
2. In the configuration file, specify the `frontend` and `preencoder`. Taking `HuBERT` as an example:
-The `upsteam` name can be whatever supported in S3PRL. `multilayer-feature=True` means the final representation is a weighted-sum of all layers' hidden states from SSLR model.
+The `upstream` name can be any upstream supported in S3PRL. `multilayer-feature=True` means the final representation is a weighted sum of all layers' hidden states from the SSLR model.
```
frontend: s3prl
frontend_conf:
    ...
```
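
The snippet above is truncated in this diff; a fuller configuration might look like the sketch below (the upstream name, sizes, and the `preencoder` block are illustrative assumptions, not the exact contents of the elided lines):

```yaml
# A sketch of an s3prl frontend config; values are assumptions for illustration.
frontend: s3prl
frontend_conf:
    frontend_conf:
        upstream: hubert_base   # any upstream supported in S3PRL
    multilayer_feature: true    # weighted sum over all layers' hidden states
preencoder: linear              # project SSLR features to the encoder input size
preencoder_conf:
    input_size: 768             # hidden size of the chosen upstream
    output_size: 80
```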
1 change: 1 addition & 0 deletions egs/README.md
@@ -49,6 +49,7 @@ See: https://espnet.github.io/espnet/tutorial.html
| librispeech | LibriSpeech ASR corpus | ASR | EN | http://www.openslr.org/12 | |
| libritts | LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech | TTS | EN | http://www.openslr.org/60/ | |
| ljspeech | The LJ Speech Dataset | TTS | EN | https://keithito.com/LJ-Speech-Dataset/ | |
| lrs | The Lip Reading Sentences Dataset | ASR/AVSR | EN | https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html | |
| m_ailabs | The M-AILABS Speech Dataset | TTS | ~5 languages | https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/ |
| mucs_2021 | MUCS 2021: MUltilingual and Code-Switching ASR Challenges for Low Resource Indian Languages | ASR/Code Switching | HI, MR, OR, TA, TE, GU, HI-EN, BN-EN | https://navana-tech.github.io/MUCS2021/data.html | |
| mtedx | Multilingual TEDx | ASR/Machine Translation/Speech Translation | 13 Language pairs | http://www.openslr.org/100/ |
31 changes: 31 additions & 0 deletions egs/librispeech/asr1/RESULTS.md
@@ -63,6 +63,37 @@ exp/train_960_pytorch_train_pytorch_conformer_large_specaug/decode_test_other_mo
| Sum/Avg | 2939 52343 | 95.3 4.1 0.6 0.6 5.3 44.8 |
```

# pytorch large conformer-transducer with specaug + speed perturbation (4 GPUs)

- Environments
- python version: `3.8.3 (default) [GCC 7.3.0]`
- espnet version: `espnet 0.10.7a1`
- chainer version: `chainer 6.0.0`
- pytorch version: `pytorch 1.10.0`

- Model files (archived to model.tar.gz by `$ pack_model.sh`)
  - model link: [pretrained model](https://drive.google.com/file/d/1fdadICi2w_b6lqb9_7J3wfRJc3LTnnSq/view?usp=sharing)
- training config file: `conf/tuning/transducer/train_conformer-rnn_transducer.yaml`
- decoding config file: `conf/tuning/transducer/decode.yaml`
- cmvn file: `data/train_sp/cmvn.ark`
- e2e file: `exp/train_960_pytorch_transducer_train_conformer-rnn_transducer/results/model.last10.avg.best`
- e2e JSON file: `exp/train_960_pytorch_transducer_train_conformer-rnn_transducer/results/model.json`
- dict file: `data/lang_char`
- Results (paste them yourself or obtain them with `$ pack_model.sh --results <results>`)
```
exp/train_960_pytorch_transducer_train_conformer-rnn_transducer/decode_dev_clean_model.last10.avg.best/result.wrd.txt
| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
| Sum/Avg | 2703 54402 | 97.6 2.2 0.2 0.3 2.7 33.0 |
exp/train_960_pytorch_transducer_train_conformer-rnn_transducer/decode_dev_other_model.last10.avg.best/result.wrd.txt
| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
| Sum/Avg | 2864 50948 | 93.7 5.7 0.6 0.7 7.0 52.8 |
exp/train_960_pytorch_transducer_train_conformer-rnn_transducer/decode_test_clean_model.last10.avg.best/result.wrd.txt
| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
| Sum/Avg | 2620 52576 | 97.4 2.3 0.3 0.3 2.9 33.1 |
exp/train_960_pytorch_transducer_train_conformer-rnn_transducer/decode_test_other_model.last10.avg.best/result.wrd.txt
| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
| Sum/Avg | 2939 52343 | 93.7 5.6 0.7 0.8 7.1 55.1 |
```

# Lightweight/Dynamic convolution results
| | | # Snt | # Wrd |Corr|Sub|Del|Ins|Err|S.Err |
Expand Down
4 changes: 4 additions & 0 deletions egs/librispeech/asr1/conf/tuning/transducer/decode.yaml
@@ -0,0 +1,4 @@
batch: 0
beam-size: 10
search-type: default
score-norm: True
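
This decoding configuration is consumed by the recipe's decoding stage; an assumed typical invocation (the stage number and option name follow common ESPnet1 recipe conventions) would be:

```sh
# Assumed usage: pass the transducer decoding config to the recipe's decoding stage.
./run.sh --stage 5 --decode-config conf/tuning/transducer/decode.yaml
```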
50 changes: 50 additions & 0 deletions egs/librispeech/asr1/conf/tuning/transducer/train_conformer-rnn_transducer.yaml
@@ -0,0 +1,50 @@
# minibatch related
batch-size: 32
maxlen-in: 512
maxlen-out: 150

# optimization related
criterion: loss
early-stop-criterion: "validation/main/loss"
sortagrad: 0
opt: noam
noam-adim: 256
transformer-lr: 1.0
transformer-warmup-steps: 25000
epochs: 100
patience: 0
accum-grad: 4
grad-clip: 5.0

# network architecture
## general
custom-enc-positional-encoding-type: rel_pos
custom-enc-self-attn-type: rel_self_attn
custom-enc-pw-activation-type: swish
## encoder related
etype: custom
custom-enc-input-layer: vgg2l
enc-block-arch:
- type: conformer
d_hidden: 512
d_ff: 2048
heads: 4
macaron_style: True
use_conv_mod: True
conv_mod_kernel: 15
dropout-rate: 0.3
att-dropout-rate: 0.3
enc-block-repeat: 12
## decoder related
dtype: lstm
dlayers: 1
dec-embed-dim: 1024
dunits: 512
dropout-rate-embed-decoder: 0.2
dropout-rate-decoder: 0.1
## joint network related
joint-dim: 512

# transducer related
model-module: "espnet.nets.pytorch_backend.e2e_asr_transducer:E2E"

39 changes: 39 additions & 0 deletions egs/lrs/asr1/RESULTS.md
@@ -0,0 +1,39 @@
## pretrain_Train_pytorch_train_specaug

* Model files (archived to model.tar.gz by <code>$ pack_model.sh</code>)
- download link: <code>https://drive.google.com/file/d/1YUePEjk2Utgznr7sP0x4KdKCcPjbMM7C/view?usp=sharing</code>
- training config file: <code>conf/train.yaml</code>
- decoding config file: <code>conf/decode.yaml</code>
- preprocess config file: <code>conf/specaug.yaml</code>
- lm config file: <code>conf/lm.yaml</code>
- cmvn file: <code>data/pretrain_Train/cmvn.ark</code>
- e2e file: <code>exp/pretrain_Train_pytorch_train_specaug/results/model.val5.avg.best</code>
- e2e json file: <code>exp/pretrain_Train_pytorch_train_specaug/results/model.json</code>
- lm file: <code>exp/pretrainedlm/rnnlm.model.best</code>
- lm JSON file: <code>exp/pretrainedlm/model.json</code>
- dict file: <code>data/lang_char/pretrain_Train_unigram5000_units.txt</code>


## Environments
- date: `Wed Feb 16 09:06:58 CET 2022`
- python version: `3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0]`
- espnet version: `espnet 0.9.8`
- chainer version: `chainer 6.0.0`
- pytorch version: `pytorch 1.4.0`
- Git hash: `19aabb415657c05a45467f9d8bb612db4764f6a1`
- Commit date: `Tue Oct 19 12:00:34 2021 +0200`


### CER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_Test_model.val5.avg.best_decode_|1243|12648|96.3|1.6|2.1|0.2|3.9|15.8|
|decode_Val_model.val5.avg.best_decode_|1082|14858|92.7|3.2|4.1|0.9|8.2|38.2|

### WER

|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
|---|---|---|---|---|---|---|---|---|
|decode_Test_model.val5.avg.best_decode_|1243|6660|96.2|2.1|1.7|0.4|4.2|15.7|
|decode_Val_model.val5.avg.best_decode_|1082|7866|91.6|4.7|3.7|1.0|9.4|38.2|
89 changes: 89 additions & 0 deletions egs/lrs/asr1/cmd.sh
@@ -0,0 +1,89 @@
# ====== About run.pl, queue.pl, slurm.pl, and ssh.pl ======
# Usage: <cmd>.pl [options] JOB=1:<nj> <log> <command...>
# e.g.
# run.pl --mem 4G JOB=1:10 echo.JOB.log echo JOB
#
# Options:
# --time <time>: Limit the maximum time to execute.
# --mem <mem>: Limit the maximum memory usage.
# --max-jobs-run <njob>: Limit the number of parallel jobs. This is ignored for non-array jobs.
# --num-threads <nthreads>: Specify the number of CPU cores.
# --gpu <ngpu>: Specify the number of GPU devices.
# --config: Change the configuration file from default.
#
# "JOB=1:10" is used for "array jobs" and it can control the number of parallel jobs.
# The left string of "=", i.e. "JOB", is replaced by <N> (the Nth job) in the command and the log file name,
# e.g. "echo JOB" becomes "echo 3" for the 3rd job and "echo 8" for the 8th job.
# Note that the range must start with a positive number, so you can't use "JOB=0:10", for example.
#
# run.pl, queue.pl, slurm.pl, and ssh.pl have a unified interface that does not depend on the backend.
# These options are mapped to backend-specific options, as configured by "conf/queue.conf" and "conf/slurm.conf" by default.
# If jobs fail, your configuration might be wrong for your environment.
#
#
# The official documentation for run.pl, queue.pl, slurm.pl, and ssh.pl:
# "Parallelization in Kaldi": http://kaldi-asr.org/doc/queue.html
# =========================================================


# Select the backend used by run.sh from "local", "sge", "slurm", or "ssh"
cmd_backend='local'

# Local machine, without any Job scheduling system
if [ "${cmd_backend}" = local ]; then

# Used for the other tasks
export train_cmd="run.pl"
# Used for "*_train.py": "--gpu" is appended optionally by run.sh
export cuda_cmd="run.pl"
# Used for "*_recog.py"
export decode_cmd="run.pl"

# "qsub" (SGE, Torque, PBS, etc.)
elif [ "${cmd_backend}" = sge ]; then
# The default setting is written in conf/queue.conf.
# You must change "-q g.q" to match a queue in your environment.
# To list the queue names, type "qhost -q".
# Note that to use "--gpu *", you have to set up "complex_value" for the system scheduler.

export train_cmd="queue.pl"
export cuda_cmd="queue.pl"
export decode_cmd="queue.pl"

# "sbatch" (Slurm)
elif [ "${cmd_backend}" = slurm ]; then
# The default setting is written in conf/slurm.conf.
# You must change "-p cpu" and "-p gpu" to match the partitions in your environment.
# To list the partition names, type "sinfo".
# You can use "--gpu * " by default for slurm and it is interpreted as "--gres gpu:*"
# The devices are allocated exclusively using "${CUDA_VISIBLE_DEVICES}".

export train_cmd="slurm.pl"
export cuda_cmd="slurm.pl"
export decode_cmd="slurm.pl"

elif [ "${cmd_backend}" = ssh ]; then
# You have to create ".queue/machines" to specify the host to execute jobs.
# e.g. .queue/machines
# host1
# host2
# host3
# This assumes you can log in to them without a password, i.e., you have set up SSH keys.

export train_cmd="ssh.pl"
export cuda_cmd="ssh.pl"
export decode_cmd="ssh.pl"

# This is an example of specifying several unique options in the JHU CLSP cluster setup.
# Users can modify/add their own command options according to their cluster environments.
elif [ "${cmd_backend}" = jhu ]; then

export train_cmd="queue.pl --mem 2G"
export cuda_cmd="queue-freegpu.pl --mem 2G --gpu 1 --config conf/gpu.conf"
export decode_cmd="queue.pl --mem 4G"

else
echo "$0: Error: Unknown cmd_backend=${cmd_backend}" 1>&2
return 1
fi
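
As a usage sketch (the job count, paths, and data layout are illustrative assumptions), a recipe script sources this file and launches array jobs through whichever backend is selected:

```sh
# A sketch: run 8 parallel Kaldi fbank-extraction jobs via the configured backend.
. ./cmd.sh
${train_cmd} JOB=1:8 exp/make_fbank/log/make_fbank.JOB.log \
    compute-fbank-feats --config=conf/fbank.conf \
    scp:data/train/split8/JOB/wav.scp ark:data/train/raw_fbank.JOB.ark
```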
7 changes: 7 additions & 0 deletions egs/lrs/asr1/conf/decode.yaml
@@ -0,0 +1,7 @@
batchsize: 0
beam-size: 60
ctc-weight: 0.4
lm-weight: 0.6
maxlenratio: 0.0
minlenratio: 0.0
penalty: 0.0
2 changes: 2 additions & 0 deletions egs/lrs/asr1/conf/fbank.conf
@@ -0,0 +1,2 @@
--sample-frequency=16000
--num-mel-bins=80
10 changes: 10 additions & 0 deletions egs/lrs/asr1/conf/gpu.conf
@@ -0,0 +1,10 @@
# Default configuration
command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*
option mem=* -l mem_free=$0,ram_free=$0
option mem=0 # Do not add anything to qsub_opts
option num_threads=* -pe smp $0
option num_threads=1 # Do not add anything to qsub_opts
option max_jobs_run=* -tc $0
default gpu=0
option gpu=0
option gpu=* -l 'hostname=b1[12345678]*|c*,gpu=$0' -q g.q
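
To illustrate how queue.pl applies these mappings (an assumed expansion, derived only from the option lines above):

```sh
# Assumed expansion with the gpu.conf above: a call like
#   queue.pl --gpu 1 --mem 8G --config conf/gpu.conf JOB=1:4 exp/log/train.JOB.log <command>
# submits each array job roughly as:
#   qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64* \
#        -l mem_free=8G,ram_free=8G \
#        -l 'hostname=b1[12345678]*|c*,gpu=1' -q g.q ...
```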
9 changes: 9 additions & 0 deletions egs/lrs/asr1/conf/lm.yaml
@@ -0,0 +1,9 @@
layer: 4
unit: 2048
opt: sgd # or adam
sortagrad: 0 # Feed samples from shortest to longest; -1: enabled for all epochs, 0: disabled, N: enabled for the first N epochs
batchsize: 512 # batch size in LM training
epoch: 20 # if the data size is large, we can reduce this
patience: 3
maxlen: 40 # if sentence length > lm_maxlen, lm_batchsize is automatically reduced
dropout-rate: 0.0
1 change: 1 addition & 0 deletions egs/lrs/asr1/conf/pitch.conf
@@ -0,0 +1 @@
--sample-frequency=16000
10 changes: 10 additions & 0 deletions egs/lrs/asr1/conf/queue.conf
@@ -0,0 +1,10 @@
# Default configuration
command qsub -v PATH -cwd -S /bin/bash -j y -l arch=*64*
option mem=* -l mem_free=$0,ram_free=$0
option mem=0 # Do not add anything to qsub_opts
option num_threads=* -pe smp $0
option num_threads=1 # Do not add anything to qsub_opts
option max_jobs_run=* -tc $0
default gpu=0
option gpu=0
option gpu=* -l gpu=$0 -q g.q
14 changes: 14 additions & 0 deletions egs/lrs/asr1/conf/slurm.conf
@@ -0,0 +1,14 @@
# Default configuration
command sbatch --export=PATH
option name=* --job-name $0
option time=* --time $0
option mem=* --mem-per-cpu $0
option mem=0
option num_threads=* --cpus-per-task $0
option num_threads=1 --cpus-per-task 1
option num_nodes=* --nodes $0
default gpu=0
option gpu=0 -p cpu
option gpu=* -p gpu --gres=gpu:$0 -c $0 # Recommended: allocate at least as many CPUs as GPUs
# note: the --max-jobs-run option is supported as a special case
# by slurm.pl and you don't have to handle it in the config file.
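
Analogously for Slurm (an assumed expansion, derived only from the option lines above):

```sh
# Assumed expansion with the slurm.conf above: a call like
#   slurm.pl --gpu 1 --mem 4G JOB=1:4 exp/log/decode.JOB.log <command>
# submits each array job roughly as:
#   sbatch --export=PATH --mem-per-cpu 4G -p gpu --gres=gpu:1 -c 1 ...
```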
16 changes: 16 additions & 0 deletions egs/lrs/asr1/conf/specaug.yaml
@@ -0,0 +1,16 @@
process:
# these three processes together are known as SpecAugment
- type: "time_warp"
max_time_warp: 5
inplace: true
mode: "PIL"
- type: "freq_mask"
F: 30
n_mask: 2
inplace: true
replace_with_zero: false
- type: "time_mask"
T: 40
n_mask: 2
inplace: true
replace_with_zero: false