diff --git a/.gitmodules b/.gitmodules
index bc771d8c6ee..e69de29bb2d 100644
--- a/.gitmodules
+++ b/.gitmodules
@@ -1,3 +0,0 @@
-[submodule "doc/notebook"]
-    path = doc/notebook
-    url = https://github.com/espnet/notebook
diff --git a/README.md b/README.md
index ff32569cf67..082e5450f78 100644
--- a/README.md
+++ b/README.md
@@ -133,7 +133,7 @@ To train the neural vocoder, please check the following repositories:
 - Multi-speaker speech separation
   - Unified encoder-separator-decoder structure for time-domain and frequency-domain models
   - Encoder/Decoder: STFT/iSTFT, Convolution/Transposed-Convolution
-  - Separators: BLSTM, Transformer, Conformer, DPRNN, [DCCRN](https://arxiv.org/abs/2008.00264), Neural Beamformers, etc.
+  - Separators: BLSTM, Transformer, Conformer, [TasNet](https://arxiv.org/abs/1809.07454), [DPRNN](https://arxiv.org/abs/1910.06379), [DC-CRN](https://web.cse.ohio-state.edu/~wang.77/papers/TZW.taslp21.pdf), [DCCRN](https://arxiv.org/abs/2008.00264), Neural Beamformers, etc.
 - Flexible ASR integration: working as an individual task or as the ASR frontend
 - Easy to import pretrained models from [Asteroid](https://github.com/asteroid-team/asteroid)
   - Both the pre-trained models from Asteroid and the specific configuration are supported.
diff --git a/ci/doc.sh b/ci/doc.sh
index cbcd78f4b21..114bc92b952 100755
--- a/ci/doc.sh
+++ b/ci/doc.sh
@@ -26,6 +26,8 @@ set -euo pipefail
 find ./utils/{*.sh,spm_*} -exec ./doc/usage2rst.sh {} \; | tee ./doc/_gen/utils_sh.rst
 find ./espnet2/bin/*.py -exec ./doc/usage2rst.sh {} \; | tee ./doc/_gen/espnet2_bin.rst
 
+./doc/notebook2rst.sh > ./doc/_gen/notebooks.rst
+
 # generate package doc
 ./doc/module2rst.py --root espnet espnet2 --dst ./doc --exclude espnet.bin
 
diff --git a/doc/.gitignore b/doc/.gitignore
index d4058a5aa91..79f7202744d 100644
--- a/doc/.gitignore
+++ b/doc/.gitignore
@@ -1,4 +1,4 @@
 _gen/
 _build/
 build/
-
+notebook/
\ No newline at end of file
diff --git a/doc/index.rst b/doc/index.rst
index 13f20ab0a96..30cd3d35fd4 100644
--- a/doc/index.rst
+++ b/doc/index.rst
@@ -28,16 +28,7 @@ ESPnet is an end-to-end speech processing toolkit, mainly focuses on end-to-end
    ./espnet2_task.md
    ./espnet2_distributed.md
 
-.. toctree::
-   :maxdepth: 1
-   :caption: Notebook:
-
-   ./notebook/asr_cli.ipynb
-   ./notebook/asr_library.ipynb
-   ./notebook/tts_cli.ipynb
-   ./notebook/pretrained.ipynb
-   ./notebook/tts_realtime_demo.ipynb
-   ./notebook/st_demo.ipynb
+.. include:: ./_gen/notebooks.rst
 
 .. include:: ./_gen/modules.rst
 
diff --git a/doc/installation.md b/doc/installation.md
index 0a1c8acf022..db45a09135b 100644
--- a/doc/installation.md
+++ b/doc/installation.md
@@ -32,14 +32,14 @@ the following packages are installed using Anaconda, so you can skip them.)
     # For CentOS
     $ sudo yum install libsndfile
     ```
-- ffmpeg (This is not required when installataion, but used in some recipes)
+- ffmpeg (This is not required when installing, but used in some recipes)
     ```sh
     # For Ubuntu
     $ sudo apt-get install ffmpeg
     # For CentOS
     $ sudo yum install ffmpeg
     ```
-- flac (This is not required when installataion, but used in some recipes)
+- flac (This is not required when installing, but used in some recipes)
     ```sh
     # For Ubuntu
     $ sudo apt-get install flac
diff --git a/doc/notebook b/doc/notebook
deleted file mode 160000
index ef3cbf880fc..00000000000
--- a/doc/notebook
+++ /dev/null
@@ -1 +0,0 @@
-Subproject commit ef3cbf880fcd725d11021e541a0cdfae4080446d
diff --git a/doc/notebook2rst.sh b/doc/notebook2rst.sh
new file mode 100755
index 00000000000..83bf7d57794
--- /dev/null
+++ b/doc/notebook2rst.sh
@@ -0,0 +1,17 @@
+#!/usr/bin/env bash
+
+set -euo pipefail
+
+cd "$(dirname "$0")"
+
+if [ ! -d notebook ]; then
+    git clone https://github.com/espnet/notebook --depth 1
+fi
+
+echo "\
+.. toctree::
+   :maxdepth: 1
+   :caption: Notebook:
+"
+
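+# emit one toctree entry per notebook; the three leading spaces in "   {}" are
+# the indentation Sphinx expects for entries under a toctree directive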
+find ./notebook/*.ipynb -exec echo "   {}" \;
diff --git a/egs2/README.md b/egs2/README.md
index 2b9bdbbca27..133fc9192f6 100755
--- a/egs2/README.md
+++ b/egs2/README.md
@@ -52,6 +52,7 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
 | librispeech_100 | LibriSpeech ASR corpus 100h subset | ASR | ENG | http://www.openslr.org/12 | |
 | libritts | LibriTTS corpus | TTS | ENG | http://www.openslr.org/60 | |
 | ljspeech | The LJ Speech Dataset | TTS | ENG | https://keithito.com/LJ-Speech-Dataset/ | |
+| lrs3 | The Oxford-BBC Lip Reading Sentences 3 (LRS3) Dataset | ASR | ENG | https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html | |
 | lrs2 | The Oxford-BBC Lip Reading Sentences 2 (LRS2) Dataset | Lipreading/ASR | ENG | https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html | |
 | mini_an4 | Mini version of CMU AN4 database for the integration test | ASR/TTS/SE | ENG | http://www.speech.cs.cmu.edu/databases/an4/ | |
 | mini_librispeech | Mini version of Librispeech corpus | DIAR | ENG | https://openslr.org/31/ | |
@@ -82,7 +83,7 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
 | timit | TIMIT Acoustic-Phonetic Continuous Speech Corpus | ASR | ENG | https://catalog.ldc.upenn.edu/LDC93S1 | |
 | totonac | Highland Totonac corpus (endangered language in central Mexico) | ASR | TOS | http://www.openslr.org/107/ | |
 | tsukuyomi | つくよみちゃんコーパス | TTS | JPN | https://tyc.rei-yumesaki.net/material/corpus | |
-| vctk | English Multi-speaker Corpus for CSTR Voice Cloning Toolkit | TTS | ENG | http://www.udialogue.org/download/cstr-vctk-corpus.html | |
+| vctk | English Multi-speaker Corpus for CSTR Voice Cloning Toolkit | ASR/TTS | ENG | http://www.udialogue.org/download/cstr-vctk-corpus.html | |
 | vctk_noisyreverb | Noisy reverberant speech database (48kHz) | SE | ENG | https://datashare.ed.ac.uk/handle/10283/2826 | |
 | vivos | VIVOS (Vietnamese corpus for ASR) | ASR | VIE | https://ailab.hcmus.edu.vn/vivos/ | |
 | voxforge | VoxForge | ASR | 7 languages | http://www.voxforge.org/ | |
@@ -95,4 +96,3 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2
 | yesno | The "yesno" corpus | ASR | HEB | http://www.openslr.org/1 | |
 | yoloxochitl_mixtec | Yoloxochitl-Mixtec corpus (endangered language in central Mexico) | ASR | XTY | http://www.openslr.org/89 | |
 | zeroth_korean | Zeroth-Korean | ASR | KOR | http://www.openslr.org/40 | |
-
diff --git a/egs2/TEMPLATE/asr1/asr.sh b/egs2/TEMPLATE/asr1/asr.sh
index 04f7578b5b0..f4d7a8ad24a 100755
--- a/egs2/TEMPLATE/asr1/asr.sh
+++ b/egs2/TEMPLATE/asr1/asr.sh
@@ -110,6 +110,8 @@ k2_config=./conf/decode_asr_transformer_with_k2.yaml
 
 use_streaming=false # Whether to use streaming decoding
 
+use_maskctc=false # Whether to use maskctc decoding
+
 batch_size=1
 inference_tag=    # Suffix to the result dir for decoding.
 inference_config= # Config for decoding.
@@ -224,6 +226,7 @@ Options:
     --inference_asr_model # ASR model path for decoding (default="${inference_asr_model}").
     --download_model      # Download a model from Model Zoo and use it for decoding (default="${download_model}").
     --use_streaming       # Whether to use streaming decoding (default="${use_streaming}").
+    --use_maskctc         # Whether to use maskctc decoding (default="${use_maskctc}").
 
     # [Task dependent] Set the datadir name created by local/data.sh
     --train_set           # Name of training set (required).
@@ -895,7 +898,7 @@ if ! "${skip_train}"; then
         if "${use_ngram}"; then
             log "Stage 9: Ngram Training: train_set=${data_feats}/lm_train.txt"
             cut -f 2- -d " " ${data_feats}/lm_train.txt | lmplz -S "20%" --discount_fallback -o ${ngram_num} - >${ngram_exp}/${ngram_num}gram.arpa
-            build_binary -s ${ngram_exp}/${ngram_num}gram.arpa ${ngram_exp}/${ngram_num}gram.bin 
+            build_binary -s ${ngram_exp}/${ngram_num}gram.arpa ${ngram_exp}/${ngram_num}gram.bin
         else
             log "Stage 9: Skip ngram stages: use_ngram=${use_ngram}"
         fi
@@ -1195,6 +1198,8 @@ if ! "${skip_eval}"; then
         else
             if "${use_streaming}"; then
                 asr_inference_tool="espnet2.bin.asr_inference_streaming"
+            elif "${use_maskctc}"; then
+                asr_inference_tool="espnet2.bin.asr_inference_maskctc"
             else
                 asr_inference_tool="espnet2.bin.asr_inference"
             fi
diff --git a/egs2/TEMPLATE/asr1/db.sh b/egs2/TEMPLATE/asr1/db.sh
index 88113b1d547..f7d686fa164 100755
--- a/egs2/TEMPLATE/asr1/db.sh
+++ b/egs2/TEMPLATE/asr1/db.sh
@@ -108,6 +108,7 @@ GOOGLEI18N=downloads
 NOISY_SPEECH=
 NOISY_REVERBERANT_SPEECH=
 LRS2=
+LRS3=
 SUNDA=downloads
 CMU_ARCTIC=downloads
 CMU_INDIC=downloads
diff --git a/egs2/TEMPLATE/asr1/pyscripts/utils/score_intent.py b/egs2/TEMPLATE/asr1/pyscripts/utils/score_intent.py
index 13354637d52..4f0f074c9db 100755
--- a/egs2/TEMPLATE/asr1/pyscripts/utils/score_intent.py
+++ b/egs2/TEMPLATE/asr1/pyscripts/utils/score_intent.py
@@ -12,7 +12,7 @@ import argparse
 
 
-def get_classification_result(hyp_file, ref_file):
+def get_classification_result(hyp_file, ref_file, hyp_write, ref_write):
     hyp_lines = [line for line in hyp_file]
     ref_lines = [line for line in ref_file]
 
@@ -22,6 +22,16 @@ def get_classification_result(hyp_file, ref_file):
         ref_intent = ref_lines[line_count].split(" ")[0]
         if hyp_intent != ref_intent:
             error += 1
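+            # keep a copy of each misclassified pair (utterance text with the
+            # leading intent token stripped) so the transcripts behind intent
+            # errors can be inspected and scored separately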
+            hyp_write.write(
+                " ".join(hyp_lines[line_count].split("\t")[0].split(" ")[1:])
+                + "\t"
+                + hyp_lines[line_count].split("\t")[1]
+            )
+            ref_write.write(
+                " ".join(ref_lines[line_count].split("\t")[0].split(" ")[1:])
+                + "\t"
+                + ref_lines[line_count].split("\t")[1]
+            )
 
     return 1 - (error / len(hyp_lines))
 
@@ -56,7 +66,16 @@ def get_classification_result(hyp_file, ref_file):
     os.path.join(exp_root, valid_inference_folder + "score_wer/ref.trn")
 )
 
-result = get_classification_result(valid_hyp_file, valid_ref_file)
+valid_hyp_write_file = open(
+    os.path.join(exp_root, valid_inference_folder + "score_wer/hyp_asr.trn"), "w"
+)
+valid_ref_write_file = open(
+    os.path.join(exp_root, valid_inference_folder + "score_wer/ref_asr.trn"), "w"
+)
+
+result = get_classification_result(
+    valid_hyp_file, valid_ref_file, valid_hyp_write_file, valid_ref_write_file
+)
 print("Valid Intent Classification Result")
 print(result)
 
@@ -66,8 +85,16 @@ def get_classification_result(hyp_file, ref_file):
 test_ref_file = open(
     os.path.join(exp_root, test_inference_folder + "score_wer/ref.trn")
 )
+test_hyp_write_file = open(
+    os.path.join(exp_root, test_inference_folder + "score_wer/hyp_asr.trn"), "w"
+)
+test_ref_write_file = open(
+    os.path.join(exp_root, test_inference_folder + "score_wer/ref_asr.trn"), "w"
+)
 
-result = get_classification_result(test_hyp_file, test_ref_file)
+result = get_classification_result(
+    test_hyp_file, test_ref_file, test_hyp_write_file, test_ref_write_file
+)
 print("Test Intent Classification Result")
 print(result)
 
@@ -79,6 +106,17 @@ def get_classification_result(hyp_file, ref_file):
     utt_test_ref_file = open(
         os.path.join(exp_root, utt_test_inference_folder + "score_wer/ref.trn")
     )
-    result = get_classification_result(utt_test_hyp_file, utt_test_ref_file)
+    utt_test_hyp_write_file = open(
+        os.path.join(exp_root, utt_test_inference_folder + "score_wer/hyp_asr.trn"), "w"
+    )
+    utt_test_ref_write_file = open(
+        os.path.join(exp_root, utt_test_inference_folder + "score_wer/ref_asr.trn"), "w"
+    )
+    result = get_classification_result(
+        utt_test_hyp_file,
+        utt_test_ref_file,
+        utt_test_hyp_write_file,
+        utt_test_ref_write_file,
+    )
     print("Unseen Utterance Test Intent Classification Result")
     print(result)
diff --git a/egs2/bn_openslr53/asr1/README.md b/egs2/bn_openslr53/asr1/README.md
new file mode 100644
index 00000000000..542c8053339
--- /dev/null
+++ b/egs2/bn_openslr53/asr1/README.md
@@ -0,0 +1,29 @@
+# RESULTS
+## Environments
+- date: `Mon Jan 31 10:53:20 EST 2022`
+- python version: `3.9.5 (default, Jun 4 2021, 12:28:51) [GCC 7.5.0]`
+- espnet version: `espnet 0.10.6a1`
+- pytorch version: `pytorch 1.8.1+cu102`
+- Git hash: `9d09bf551a9fe090973de60e15adec1de6b3d054`
+  - Commit date: `Fri Jan 21 11:43:15 2022 -0500`
+- Pretrained Model: https://huggingface.co/espnet/bn_openslr53
+
+## asr_train_asr_raw_bpe1000
+### WER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|decode_asr_batch_size1_lm_lm_train_lm_bpe1000_valid.loss.ave_asr_model_valid.acc.best/sbn_test|2018|6470|74.2|21.3|4.5|2.2|28.0|48.8|
+
+### CER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|decode_asr_batch_size1_lm_lm_train_lm_bpe1000_valid.loss.ave_asr_model_valid.acc.best/sbn_test|2018|39196|89.4|4.3|6.3|1.4|12.0|48.8|
+
+### TER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|decode_asr_batch_size1_lm_lm_train_lm_bpe1000_valid.loss.ave_asr_model_valid.acc.best/sbn_test|2018|15595|77.6|12.7|9.7|1.6|24.0|48.7|
+
diff --git a/egs2/chime4/enh1/README.md b/egs2/chime4/enh1/README.md
index 9ca905d08cd..886eb0cbf26 100644
--- a/egs2/chime4/enh1/README.md
+++ b/egs2/chime4/enh1/README.md
@@ -6,6 +6,7 @@
 - python version: `3.6.3 |Anaconda, Inc.| (default, Nov 20 2017, 20:41:42) [GCC 7.2.0]`
 - espnet version: `espnet 0.9.7`
 - pytorch version: `pytorch 1.6.0`
+- Note: PESQ is evaluated based on https://github.com/vBaiCai/python-pesq
 
 ## enh_train_enh_conv_tasnet_raw
 
@@ -25,3 +26,36 @@ config: conf/tuning/train_enh_beamformer_mvdr.yaml
 |---|---|---|---|---|---|---|
 |enhanced_dt05_simu_isolated_6ch_track|2.60|0.94|13.67|13.67|0|12.51|
 |enhanced_et05_simu_isolated_6ch_track|2.63|0.95|15.51|15.51|0|14.65|
+
+
+## enh_train_enh_dc_crn_mapping_snr_raw
+
+config: conf/tuning/train_enh_dc_crn_mapping_snr.yaml
+
+|dataset|PESQ|STOI|SAR|SDR|SIR|SI_SNR|
+|---|---|---|---|---|---|---|
+|enhanced_dt05_simu_isolated_6ch_track|3.10|0.96|17.82|17.82|0.00|17.59|
+|enhanced_et05_simu_isolated_6ch_track|2.95|0.95|17.33|17.33|0.00|17.04|
+
+
+# RESULTS
+## Environments
+- date: `Sat Mar 19 07:17:45 CST 2022`
+- python version: `3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]`
+- espnet version: `espnet 0.10.7a1`
+- pytorch version: `pytorch 1.8.1`
+- Git hash: `648b024d8fb262eb9923c06a698b9c6df5b16e51`
+  - Commit date: `Wed Mar 16 18:47:21 2022 +0800`
+
+
+## enh_train_enh_dprnntac_fasnet_raw
+
+config: conf/tuning/train_enh_dprnntac_fasnet.yaml
+
+Pretrained model: https://huggingface.co/lichenda/chime4_fasnet_dprnn_tac
+
+|dataset|STOI|SAR|SDR|SIR|
+|---|---|---|---|---|
+|enhanced_dt05_simu_isolated_6ch_track|0.95|15.75|15.75|0.00|
+|enhanced_et05_simu_isolated_6ch_track|0.94|15.40|15.40|0.00|
+
diff --git a/egs2/chime4/enh1/conf/tuning/train_enh_beamformer_mvdr.yaml b/egs2/chime4/enh1/conf/tuning/train_enh_beamformer_mvdr.yaml
index fc996552cd3..cee051c8ef1 100644
--- a/egs2/chime4/enh1/conf/tuning/train_enh_beamformer_mvdr.yaml
+++ b/egs2/chime4/enh1/conf/tuning/train_enh_beamformer_mvdr.yaml
@@ -53,7 +53,7 @@ separator_conf:
     bunits: 512
     bprojs: 512
     badim: 320
-    ref_channel: 4
+    ref_channel: 3
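+    # assumption, not in the original config: data prep drops CH2, so the five
+    # input channels are CH1,3,4,5,6 and 0-based index 3 selects CH5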
     use_noise_mask: True
     beamformer_type: mvdr_souden
     bdropout_rate: 0.0
diff --git a/egs2/chime4/enh1/conf/tuning/train_enh_dc_crn_mapping_snr.yaml b/egs2/chime4/enh1/conf/tuning/train_enh_dc_crn_mapping_snr.yaml
new file mode 100644
index 00000000000..38d61843282
--- /dev/null
+++ b/egs2/chime4/enh1/conf/tuning/train_enh_dc_crn_mapping_snr.yaml
@@ -0,0 +1,67 @@
+init: xavier_uniform
+max_epoch: 200
+batch_type: folded
+batch_size: 16
+iterator_type: chunk
+chunk_length: 32000
+num_workers: 4
+optim: adam
+optim_conf:
+    lr: 1.0e-03
+    eps: 1.0e-08
+    weight_decay: 1.0e-7
+    amsgrad: true
+patience: 10
+grad_clip: 5
+val_scheduler_criterion:
+- valid
+- loss
+best_model_criterion:
+- - valid
+  - si_snr
+  - max
+- - valid
+  - loss
+  - min
+keep_nbest_models: 1
+scheduler: steplr
+scheduler_conf:
+    step_size: 2
+    gamma: 0.98
+
+# A list for criterions
+# The overall loss in the multi-task learning will be:
+# loss = weight_1 * loss_1 + ... + weight_N * loss_N
+# The default `weight` for each sub-loss is 1.0
+criterions:
+  # The first criterion
+  - name: snr
+    conf:
+      eps: 1.0e-7
+    # the wrapper for the current criterion
+    # PIT is widely used in the speech separation task
+    wrapper: pit
+    wrapper_conf:
+      weight: 1.0
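+  # illustrative sketch only (not used by this recipe): a second weighted
+  # sub-loss would be appended to the same list, giving
+  #   loss = 1.0 * snr + 0.5 * si_snr
+  # - name: si_snr
+  #   conf:
+  #     eps: 1.0e-7
+  #   wrapper: pit
+  #   wrapper_conf:
+  #     weight: 0.5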
+
+
+encoder: stft
+encoder_conf:
+    n_fft: 256
+    hop_length: 128
+decoder: stft
+decoder_conf:
+    n_fft: 256
+    hop_length: 128
+separator: dc_crn
+separator_conf:
+    num_spk: 1
+    input_channels: [10, 16, 32, 64, 128, 256]  # 5x2=10 input channels
+    enc_hid_channels: 8
+    enc_layers: 5
+    glstm_groups: 2
+    glstm_layers: 2
+    glstm_bidirectional: true
+    glstm_rearrange: false
+    mode: mapping
+    ref_channel: 3
diff --git a/egs2/chime4/enh1/conf/tuning/train_enh_dprnntac_fasnet.yaml b/egs2/chime4/enh1/conf/tuning/train_enh_dprnntac_fasnet.yaml
new file mode 100644
index 00000000000..b5dd47ddac7
--- /dev/null
+++ b/egs2/chime4/enh1/conf/tuning/train_enh_dprnntac_fasnet.yaml
@@ -0,0 +1,59 @@
+optim: adam
+init: xavier_uniform
+max_epoch: 100
+batch_type: folded
+batch_size: 8
+iterator_type: chunk
+chunk_length: 32000
+num_workers: 4
+optim_conf:
+    lr: 1.0e-03
+    eps: 1.0e-08
+    weight_decay: 0
+patience: 10
+val_scheduler_criterion:
+- valid
+- loss
+best_model_criterion:
+- - valid
+  - si_snr
+  - max
+- - valid
+  - loss
+  - min
+keep_nbest_models: 1
+scheduler: steplr
+scheduler_conf:
+    step_size: 2
+    gamma: 0.98
+
+encoder: same
+encoder_conf: {}
+decoder: same
+decoder_conf: {}
+separator: fasnet
+separator_conf:
+    enc_dim: 64
+    feature_dim: 64
+    hidden_dim: 128
+    layer: 6
+    segment_size: 24
+    num_spk: 1
+    win_len: 16
+    context_len: 16
+    sr: 16000
+    fasnet_type: 'fasnet'
+    dropout: 0.2
+
+
+
+criterions:
+  # The first criterion
+  - name: si_snr
+    conf:
+      eps: 1.0e-7
+    # the wrapper for the current criterion
+    # for single-talker case, we simply use fixed_order wrapper
+    wrapper: fixed_order
+    wrapper_conf:
+      weight: 1.0
diff --git a/egs2/chime4/enh1/conf/tuning/train_enh_dprnntac_ifasnet.yaml b/egs2/chime4/enh1/conf/tuning/train_enh_dprnntac_ifasnet.yaml
new file mode 100644
index 00000000000..ef1349ad8b9
--- /dev/null
+++ b/egs2/chime4/enh1/conf/tuning/train_enh_dprnntac_ifasnet.yaml
@@ -0,0 +1,58 @@
+optim: adam
+init: xavier_uniform
+max_epoch: 100
+batch_type: folded
+batch_size: 8
+iterator_type: chunk
+chunk_length: 32000
+num_workers: 4
+optim_conf:
+    lr: 1.0e-03
+    eps: 1.0e-08
+    weight_decay: 0
+patience: 10
+val_scheduler_criterion:
+- valid
+- loss
+best_model_criterion:
+- - valid
+  - si_snr
+  - max
+- - valid
+  - loss
+  - min
+keep_nbest_models: 1
+scheduler: steplr
+scheduler_conf:
+    step_size: 2
+    gamma: 0.98
+
+encoder: same
+encoder_conf: {}
+decoder: same
+decoder_conf: {}
+separator: fasnet
+separator_conf:
+    enc_dim: 64
+    feature_dim: 64
+    hidden_dim: 128
+    layer: 6
+    segment_size: 24
+    num_spk: 1
+    win_len: 16
+    context_len: 16
+    sr: 16000
+    fasnet_type: 'ifasnet'
+
+
+
+criterions:
+  # The first criterion
+  - name: si_snr
+    conf:
+      eps: 1.0e-7
+    # the wrapper for the current criterion
+    # for single-talker case, we simply use fixed_order wrapper
+    wrapper: fixed_order
+    wrapper_conf:
+      weight: 1.0
diff --git a/egs2/chime4/enh1/local/simu_ext_chime4_data_prep.sh b/egs2/chime4/enh1/local/simu_ext_chime4_data_prep.sh
index 08df7d0dc4c..5cd50773aeb 100755
--- a/egs2/chime4/enh1/local/simu_ext_chime4_data_prep.sh
+++ b/egs2/chime4/enh1/local/simu_ext_chime4_data_prep.sh
@@ -85,6 +85,8 @@ elif [[ "$track" == "6" ]]; then
     done
 
     for x in $list_set; do
+        # drop the second channel to follow the convention in CHiME-4
+        # see P27 in https://hal.inria.fr/hal-01399180/file/vincent_CSL16.pdf
         mix-mono-wav-scp.py ${x}_wav.CH{1,3,4,5,6}.scp > ${x}_wav.scp
         mix-mono-wav-scp.py ${x}_spk1_wav.CH{1,3,4,5,6}.scp > ${x}_spk1_wav.scp
         sed -E "s#\.Clean\.wav#\.Noise\.wav#g" ${x}_spk1_wav.scp > ${x}_noise_wav.scp
diff --git a/egs2/chime4/enh1/run.sh b/egs2/chime4/enh1/run.sh
index cf95ee85954..60ee54ec435 100755
--- a/egs2/chime4/enh1/run.sh
+++ b/egs2/chime4/enh1/run.sh
@@ -25,7 +25,7 @@ test_sets="et05_simu_isolated_1ch_track"
     --fs ${sample_rate} \
     --ngpu 2 \
     --spk_num 1 \
-    --ref_channel 4 \
+    --ref_channel 3 \
    --local_data_opts "--extra-annotations ${extra_annotations} --stage 1 --stop-stage 2" \
    --enh_config conf/tuning/train_enh_conv_tasnet.yaml \
    --use_dereverb_ref false \
diff --git a/egs2/dsing/asr1/RESULTS.md b/egs2/dsing/asr1/RESULTS.md
new file mode 100644
index 00000000000..0cdd661e049
--- /dev/null
+++ b/egs2/dsing/asr1/RESULTS.md
@@ -0,0 +1,55 @@
+
+# RESULTS
+## Environments
+- date: `Sat Mar 19 23:02:37 EDT 2022`
+- python version: `3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]`
+- espnet version: `espnet 0.10.7a1`
+- pytorch version: `pytorch 1.10.1`
+- Git hash: `c1ed71c6899e54c0b3dad82687886b1183cd0885`
+  - Commit date: `Wed Mar 16 23:34:49 2022 -0400`
+
+## asr_train_asr_conformer7_hubert_ll60k_large_raw_bpe500_sp
+- model: https://huggingface.co/espnet/ftshijt_espnet2_asr_dsing_hubert_conformer
+### WER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_latest/dev|482|4018|83.6|9.4|7.0|6.4|22.8|58.3|
+|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_latest/test|480|4632|81.4|12.3|6.3|4.5|23.1|52.1|
+
+### CER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_latest/dev|482|18692|88.5|3.1|8.4|5.9|17.4|58.3|
+|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_latest/test|480|21787|87.9|4.3|7.8|4.5|16.6|52.1|
+
+### TER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_latest/dev|482|6097|82.2|7.1|10.7|5.7|23.5|58.3|
+|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_latest/test|480|7736|81.7|9.2|9.1|4.0|22.3|52.1|
+
+## asr_train_asr_raw_bpe500_sp
+- model: https://huggingface.co/espnet/ftshijt_espnet2_asr_dsing_transformer
+### WER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_valid.acc.ave/dev|482|4018|77.0|16.2|6.8|4.0|27.0|65.1|
+|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_valid.acc.ave/test|480|4632|76.1|17.3|6.6|3.7|27.6|57.7|
+
+### CER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_valid.acc.ave/dev|482|18692|85.0|5.8|9.2|4.2|19.2|65.1|
+|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_valid.acc.ave/test|480|21787|84.9|6.3|8.8|4.2|19.3|57.7|
+
+### TER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_valid.acc.ave/dev|482|6097|75.2|12.8|12.0|4.1|28.9|65.1|
+|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_valid.acc.ave/test|480|7736|75.3|14.3|10.4|4.1|28.8|57.7|
\ No newline at end of file
diff --git a/egs2/dsing/asr1/conf/pitch.conf b/egs2/dsing/asr1/conf/pitch.conf
index 926bcfca92a..e959a19d5b8 100644
--- a/egs2/dsing/asr1/conf/pitch.conf
+++ b/egs2/dsing/asr1/conf/pitch.conf
@@ -1 +1 @@
---sample-frequency=8000
+--sample-frequency=16000
diff --git a/egs2/dsing/asr1/local/data.sh b/egs2/dsing/asr1/local/data.sh
index ee9c82872b7..26c61801e5f 100644
--- a/egs2/dsing/asr1/local/data.sh
+++ b/egs2/dsing/asr1/local/data.sh
@@ -58,6 +58,7 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
     for datadir in ${train_set} ${train_dev} ${test_set}; do
         python local/data_prep.py data/ ${DSING}/sing_300x30x2 local/dsing_task/DSing\ Kaldi\ Recipe/dsing/s5/conf/${datadir}.json ${datadir}
         utils/utt2spk_to_spk2utt.pl data/${datadir}/utt2spk > data/${datadir}/spk2utt
+        utils/fix_data_dir.sh data/${datadir}
     done
 fi
 
diff --git a/egs2/dsing/asr1/local/data_prep.py b/egs2/dsing/asr1/local/data_prep.py
index 6675d31ae5c..98d82fe1259 100644
--- a/egs2/dsing/asr1/local/data_prep.py
+++ b/egs2/dsing/asr1/local/data_prep.py
@@ -60,11 +60,17 @@ def _add_utt2spk(self, utt_id, spk):
         self.utt2spk.append("{} {}".format(utt_id, spk))
 
     def _add_wavscp(self, rec_id, wavpath):
+        # use ffmpeg or sox (default ffmpeg)
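+        # each wav.scp entry uses Kaldi's piped-command form:
+        #   "<rec_id> <command writing 16 kHz mono WAV to stdout> |"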
         self.wavscp.append(
-            "{} sox {}/{} -G -t wav -r 16000 -c 1 - remix 1 | ".format(
-                rec_id, db_path, wavpath
+            "{} ffmpeg -i {}/{} -f wav -ar 16000 -ac 1 - | ".format(
+                rec_id, self.db_path, wavpath
             )
         )
+        # self.wavscp.append(
+        #     "{} sox {}/{} -G -t wav -r 16000 -c 1 - remix 1 | ".format(
+        #         rec_id, db_path, wavpath
+        #     )
+        # )
 
     def list2file(self, outfile, list_data):
         list_data = list(set(list_data))
diff --git a/egs2/fisher_callhome_spanish/st1/RESULT.md b/egs2/fisher_callhome_spanish/st1/RESULT.md
index 3ab898204f4..6efdcb6d5ef 100644
--- a/egs2/fisher_callhome_spanish/st1/RESULT.md
+++ b/egs2/fisher_callhome_spanish/st1/RESULT.md
@@ -7,3 +7,9 @@
 | RNN (char) [[Weiss et al.]](https://arxiv.org/abs/1703.08581) | 48.3 | 49.1 | 48.7 | 16.8 | 17.4 |
 | Transformer (BPE1k(500ES,500EN)) + ASR-PT + SpecAugment | 48.4 | 49.5 | 48.6 | 19.7 | 19.6 |
 | Conformer (BPE1k(500ES,500EN)) + ASR-PT + SpecAugment | **51.8** | **52.3** | **50.5** | **22.3** | **21.7** |
+
+# Summary (4-gram BLEU, no callhome training)
+
+| model | fisher_dev | fisher_dev2 | fisher_test | callhome_devtest | callhome_evltest |
+| ------------------------------------------------------------- | ---------- | ----------- | ----------- | ---------------- | ---------------- |
+| Transformer (BPE1k(500ES,500EN)) + SpecAugment | 44.7 | 45.6 | 45.1 | 17.3 | 16.8 |
\ No newline at end of file
diff --git a/egs2/librispeech/asr1/README.md b/egs2/librispeech/asr1/README.md
index 986479a9946..ddcb14fce05 100644
--- a/egs2/librispeech/asr1/README.md
+++ b/egs2/librispeech/asr1/README.md
@@ -113,6 +113,62 @@
 |decode_asr_lm_lm_train_lm_transformer2_en_bpe5000_valid.loss.ave_asr_model_valid.acc.ave/test_other|2939|65101|94.5|3.9|1.5|1.0|6.4|45.1|
 
+
+# Conformer, `hop_length=160`
+- Params: 116.15 M
+- ASR config: [conf/tuning/train_asr_conformer10_hop_length160.yaml](conf/tuning/train_asr_conformer10_hop_length160.yaml)
+- LM config: [conf/tuning/train_lm_transformer2.yaml](conf/tuning/train_lm_transformer2.yaml)
+- Pretrained model: [https://huggingface.co/pyf98/librispeech_conformer_hop_length160](https://huggingface.co/pyf98/librispeech_conformer_hop_length160)
+
+# RESULTS
+## Environments
+- date: `Mon Mar 14 12:26:10 EDT 2022`
+- python version: `3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]`
+- espnet version: `espnet 0.10.7a1`
+- pytorch version: `pytorch 1.10.1`
+- Git hash: `467660021998c416ac366aed0f75f3399e321a3a`
+  - Commit date: `Sun Mar 13 17:08:56 2022 -0400`
+
+## asr_train_asr_conformer10_hop_length160_raw_en_bpe5000_sp
+### WER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|beam60_ctc0.3/dev_clean|2703|54402|98.1|1.7|0.2|0.2|2.1|27.7|
+|beam60_ctc0.3/dev_other|2864|50948|95.3|4.3|0.4|0.5|5.2|44.1|
+|beam60_ctc0.3/test_clean|2620|52576|97.9|1.9|0.2|0.3|2.4|27.9|
+|beam60_ctc0.3/test_other|2939|52343|95.4|4.1|0.4|0.6|5.2|44.8|
+|beam60_ctc0.3_lm0.6/dev_clean|2703|54402|98.4|1.4|0.2|0.2|1.8|23.3|
+|beam60_ctc0.3_lm0.6/dev_other|2864|50948|96.4|3.2|0.4|0.4|3.9|36.2|
+|beam60_ctc0.3_lm0.6/test_clean|2620|52576|98.3|1.5|0.2|0.2|2.0|23.7|
+|beam60_ctc0.3_lm0.6/test_other|2939|52343|96.2|3.3|0.4|0.5|4.2|39.6|
+
+### CER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|beam60_ctc0.3/dev_clean|2703|288456|99.5|0.3|0.2|0.2|0.7|27.7|
+|beam60_ctc0.3/dev_other|2864|265951|98.4|1.0|0.6|0.6|2.2|44.1|
+|beam60_ctc0.3/test_clean|2620|281530|99.4|0.3|0.3|0.2|0.8|27.9|
+|beam60_ctc0.3/test_other|2939|272758|98.5|0.9|0.7|0.6|2.1|44.8|
+|beam60_ctc0.3_lm0.6/dev_clean|2703|288456|99.5|0.2|0.2|0.2|0.6|23.3|
+|beam60_ctc0.3_lm0.6/dev_other|2864|265951|98.5|0.8|0.6|0.5|1.9|36.2|
+|beam60_ctc0.3_lm0.6/test_clean|2620|281530|99.5|0.2|0.3|0.2|0.7|23.7|
+|beam60_ctc0.3_lm0.6/test_other|2939|272758|98.6|0.7|0.7|0.5|1.9|39.6|
+
+### TER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|beam60_ctc0.3/dev_clean|2703|68010|97.6|1.7|0.6|0.4|2.7|27.7|
+|beam60_ctc0.3/dev_other|2864|63110|94.2|4.3|1.5|0.9|6.7|44.1|
+|beam60_ctc0.3/test_clean|2620|65818|97.4|1.8|0.8|0.4|3.0|27.9|
+|beam60_ctc0.3/test_other|2939|65101|94.4|3.9|1.7|0.8|6.4|44.8|
+|beam60_ctc0.3_lm0.6/dev_clean|2703|68010|98.0|1.4|0.6|0.3|2.3|23.3|
+|beam60_ctc0.3_lm0.6/dev_other|2864|63110|95.2|3.4|1.4|0.6|5.5|36.2|
+|beam60_ctc0.3_lm0.6/test_clean|2620|65818|97.8|1.4|0.8|0.3|2.5|23.7|
+|beam60_ctc0.3_lm0.6/test_other|2939|65101|95.1|3.2|1.7|0.6|5.5|39.6|
+
+
 # Conformer, using stochastic depth
 - Params: 116.15M
diff --git a/egs2/librispeech/asr1/conf/train_asr_confformer.yaml b/egs2/librispeech/asr1/conf/train_asr_confformer.yaml
deleted file mode 120000
index 2b1e07638c8..00000000000
--- a/egs2/librispeech/asr1/conf/train_asr_confformer.yaml
+++ /dev/null
@@ -1 +0,0 @@
-tuning/train_asr_conformer6_n_fft512_hop_length256.yaml
\ No newline at end of file
diff --git a/egs2/librispeech/asr1/conf/train_asr_conformer.yaml b/egs2/librispeech/asr1/conf/train_asr_conformer.yaml
new file mode 120000
index 00000000000..11b013a3089
--- /dev/null
+++ b/egs2/librispeech/asr1/conf/train_asr_conformer.yaml
@@ -0,0 +1 @@
+tuning/train_asr_conformer10_hop_length160.yaml
\ No newline at end of file
diff --git a/egs2/librispeech/asr1/conf/tuning/train_asr_conformer10_hop_length160.yaml b/egs2/librispeech/asr1/conf/tuning/train_asr_conformer10_hop_length160.yaml
new file mode 100644
index 00000000000..76094f0c4a9
--- /dev/null
+++ b/egs2/librispeech/asr1/conf/tuning/train_asr_conformer10_hop_length160.yaml
@@ -0,0 +1,76 @@
+# Trained with Tesla V100 (32GB) x 4 GPUs. It takes about 3.5 days.
+encoder: conformer
+encoder_conf:
+    output_size: 512
+    attention_heads: 8
+    linear_units: 2048
+    num_blocks: 12
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    attention_dropout_rate: 0.1
+    input_layer: conv2d
+    normalize_before: true
+    macaron_style: true
+    rel_pos_type: latest
+    pos_enc_layer_type: rel_pos
+    selfattention_layer_type: rel_selfattn
+    activation_type: swish
+    use_cnn_module: true
+    cnn_module_kernel: 31
+
+decoder: transformer
+decoder_conf:
+    attention_heads: 8
+    linear_units: 2048
+    num_blocks: 6
+    dropout_rate: 0.1
+    positional_dropout_rate: 0.1
+    self_attention_dropout_rate: 0.1
+    src_attention_dropout_rate: 0.1
+
+model_conf:
+    ctc_weight: 0.3
+    lsm_weight: 0.1
+    length_normalized_loss: false
+
+frontend_conf:
+    n_fft: 512
+    hop_length: 160
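+    # 160 samples = 10 ms frame shift at LibriSpeech's 16 kHz sampling rate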
+
+use_amp: true
+num_workers: 4
+batch_type: numel
+batch_bins: 35000000
+accum_grad: 4
+max_epoch: 50
+patience: none
+init: none
+best_model_criterion:
+- - valid
+  - acc
+  - max
+keep_nbest_models: 10
+
+optim: adam
+optim_conf:
+    lr: 0.0025
+    weight_decay: 0.000001
+scheduler: warmuplr
+scheduler_conf:
+    warmup_steps: 40000
+
+specaug: specaug
+specaug_conf:
+    apply_time_warp: true
+    time_warp_window: 5
+    time_warp_mode: bicubic
+    apply_freq_mask: true
+    freq_mask_width_range:
+    - 0
+    - 27
+    num_freq_mask: 2
+    apply_time_mask: true
+    time_mask_width_ratio_range:
+    - 0.
+    - 0.05
+    num_time_mask: 10
diff --git a/egs2/librispeech/asr1/run.sh b/egs2/librispeech/asr1/run.sh
index 8ca7155d69d..4a457e86a7d 100755
--- a/egs2/librispeech/asr1/run.sh
+++ b/egs2/librispeech/asr1/run.sh
@@ -9,13 +9,13 @@ train_set="train_960"
 valid_set="dev"
 test_sets="test_clean test_other dev_clean dev_other"
 
-asr_config=conf/tuning/train_asr_conformer8.yaml
+asr_config=conf/train_asr_conformer.yaml
 lm_config=conf/tuning/train_lm_transformer2.yaml
 inference_config=conf/decode_asr.yaml
 
 ./asr.sh \
     --lang en \
-    --ngpu 16 \
+    --ngpu 4 \
     --nbpe 5000 \
     --max_wav_duration 30 \
     --speed_perturb_factors "0.9 1.0 1.1" \
diff --git a/egs2/lrs3/asr1/RESULTS.md b/egs2/lrs3/asr1/RESULTS.md
new file mode 100644
index 00000000000..be579a0ee64
--- /dev/null
+++ b/egs2/lrs3/asr1/RESULTS.md
@@ -0,0 +1,32 @@
+
+# RESULTS
+## Environments
+- date: `Mon Mar 7 16:57:48 EST 2022`
+- python version: `3.8.12 (default, Oct 12 2021, 13:49:34) [GCC 7.5.0]`
+- espnet version: `espnet 0.10.7a1`
+- pytorch version: `pytorch 1.10.1`
+- Git hash: `ce48b589cd2d04b00a867a24352fc8d45fc6afc9`
+  - Commit date: `Mon Mar 7 09:20:56 2022 -0500`
+
+## asr_train_asr_transformer_no_lm
+### WER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|inference_asr_model_valid.acc.ave/dev|2686|30060|81.8|15.2|3.0|4.0|22.2|75.3|
+|inference_asr_model_valid.acc.ave/test|1321|9890|90.0|8.9|1.1|1.9|11.9|46.6|
+
+### CER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|inference_asr_model_valid.acc.ave/dev|2686|155720|91.2|4.5|4.3|4.0|12.8|75.3|
+|inference_asr_model_valid.acc.ave/test|1321|49750|95.2|2.7|2.1|1.7|6.5|46.6|
+
+### TER
+
+|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err|
+|---|---|---|---|---|---|---|---|---|
+|inference_asr_model_valid.acc.ave/dev|2686|36737|77.1|13.2|9.7|2.9|25.8|75.3|
+|inference_asr_model_valid.acc.ave/test|1321|11831|86.5|8.0|5.5|1.3|14.7|46.6|
+
diff --git a/egs2/lrs3/asr1/asr.sh b/egs2/lrs3/asr1/asr.sh
new file mode 120000
index 00000000000..60b05122cfd
--- /dev/null
+++ b/egs2/lrs3/asr1/asr.sh
@@ -0,0 +1 @@
+../../TEMPLATE/asr1/asr.sh
\ No newline at end of file
diff --git a/egs2/lrs3/asr1/cmd.sh b/egs2/lrs3/asr1/cmd.sh
new file mode 100644
index 00000000000..2aae6919fef
--- /dev/null
+++ b/egs2/lrs3/asr1/cmd.sh
@@ -0,0 +1,110 @@
+# ====== About run.pl, queue.pl, slurm.pl, and ssh.pl ======
+# Usage: <cmd>.pl [options] JOB=1:<nj> <log> <command...>
+# e.g.
+#     run.pl --mem 4G JOB=1:10 echo.JOB.log echo JOB
+#
+# Options:
+#     --time