diff --git a/.gitmodules b/.gitmodules index bc771d8c6ee..e69de29bb2d 100644 --- a/.gitmodules +++ b/.gitmodules @@ -1,3 +0,0 @@ -[submodule "doc/notebook"] - path = doc/notebook - url = https://github.com/espnet/notebook diff --git a/README.md b/README.md index 3ab5e55649c..22cffb798be 100644 --- a/README.md +++ b/README.md @@ -78,6 +78,10 @@ ESPnet uses [pytorch](http://pytorch.org/) as a deep learning engine and also fo - Set `frontend` to be `s3prl` - Select any upstream model by setting the `frontend_conf` to the corresponding name. - Streaming Transformer/Conformer ASR with blockwise synchronous beam search. +- Restricted Self-Attention based on [Longformer](https://arxiv.org/abs/2004.05150) as an encoder for long sequences + +### SUM: Speech Summarization +- End to End Speech Summarization Recipe for Instructional Videos using Restricted Self-Attention [[Sharma et al., 2022]](https://arxiv.org/abs/2110.06263) Demonstration - Real-time ASR demo with ESPnet2 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_asr_realtime_demo.ipynb) @@ -129,7 +133,7 @@ To train the neural vocoder, please check the following repositories: - Multi-speaker speech separation - Unified encoder-separator-decoder structure for time-domain and frequency-domain models - Encoder/Decoder: STFT/iSTFT, Convolution/Transposed-Convolution - - Separators: BLSTM, Transformer, Conformer, DPRNN, [DCCRN](https://arxiv.org/abs/2008.00264), [Deep Clustering](https://ieeexplore.ieee.org/document/7471631), [Deep Attractor Network](https://pubmed.ncbi.nlm.nih.gov/29430212/), Neural Beamformers, etc. + - Separators: BLSTM, Transformer, Conformer, [TasNet](https://arxiv.org/abs/1809.07454), [DPRNN](https://arxiv.org/abs/1910.06379), [DC-CRN](https://web.cse.ohio-state.edu/~wang.77/papers/TZW.taslp21.pdf), [DCCRN](https://arxiv.org/abs/2008.00264), [Deep Clustering](https://ieeexplore.ieee.org/document/7471631), [Deep Attractor Network](https://pubmed.ncbi.nlm.nih.gov/29430212/), Neural Beamformers, etc. - Flexible ASR integration: working as an individual task or as the ASR frontend - Easy to import pretrained models from [Asteroid](https://github.com/asteroid-team/asteroid) - Both the pre-trained models from Asteroid and the specific configuration are supported. diff --git a/ci/doc.sh b/ci/doc.sh index cbcd78f4b21..114bc92b952 100755 --- a/ci/doc.sh +++ b/ci/doc.sh @@ -26,6 +26,8 @@ set -euo pipefail find ./utils/{*.sh,spm_*} -exec ./doc/usage2rst.sh {} \; | tee ./doc/_gen/utils_sh.rst find ./espnet2/bin/*.py -exec ./doc/usage2rst.sh {} \; | tee ./doc/_gen/espnet2_bin.rst +./doc/notebook2rst.sh > ./doc/_gen/notebooks.rst + # generate package doc ./doc/module2rst.py --root espnet espnet2 --dst ./doc --exclude espnet.bin diff --git a/ci/install.sh b/ci/install.sh index eeb531d7ddd..5bfed7584ad 100755 --- a/ci/install.sh +++ b/ci/install.sh @@ -21,7 +21,7 @@ ${CXX:-g++} -v . ./activate_python.sh make TH_VERSION="${TH_VERSION}" - make warp-ctc.done warp-transducer.done chainer_ctc.done nkf.done moses.done mwerSegmenter.done pesq pyopenjtalk.done py3mmseg.done s3prl.done transformers.done phonemizer.done fairseq.done k2.done gtn.done + make warp-ctc.done warp-transducer.done chainer_ctc.done nkf.done moses.done mwerSegmenter.done pesq pyopenjtalk.done py3mmseg.done s3prl.done transformers.done phonemizer.done fairseq.done k2.done gtn.done longformer.done rm -rf kaldi ) . 
tools/activate_python.sh diff --git a/doc/.gitignore b/doc/.gitignore index d4058a5aa91..79f7202744d 100644 --- a/doc/.gitignore +++ b/doc/.gitignore @@ -1,4 +1,4 @@ _gen/ _build/ build/ - +notebook/ \ No newline at end of file diff --git a/doc/index.rst b/doc/index.rst index 13f20ab0a96..30cd3d35fd4 100644 --- a/doc/index.rst +++ b/doc/index.rst @@ -28,16 +28,7 @@ ESPnet is an end-to-end speech processing toolkit, mainly focuses on end-to-end ./espnet2_task.md ./espnet2_distributed.md -.. toctree:: - :maxdepth: 1 - :caption: Notebook: - - ./notebook/asr_cli.ipynb - ./notebook/asr_library.ipynb - ./notebook/tts_cli.ipynb - ./notebook/pretrained.ipynb - ./notebook/tts_realtime_demo.ipynb - ./notebook/st_demo.ipynb +.. include:: ./_gen/notebooks.rst .. include:: ./_gen/modules.rst diff --git a/doc/installation.md b/doc/installation.md index 0a1c8acf022..db45a09135b 100644 --- a/doc/installation.md +++ b/doc/installation.md @@ -32,14 +32,14 @@ the following packages are installed using Anaconda, so you can skip them.) # For CentOS $ sudo yum install libsndfile ``` -- ffmpeg (This is not required when installataion, but used in some recipes) +- ffmpeg (This is not required when installing, but used in some recipes) ```sh # For Ubuntu $ sudo apt-get install ffmpeg # For CentOS $ sudo yum install ffmpeg ``` -- flac (This is not required when installataion, but used in some recipes) +- flac (This is not required when installing, but used in some recipes) ```sh # For Ubuntu $ sudo apt-get install flac diff --git a/doc/notebook b/doc/notebook deleted file mode 160000 index ef3cbf880fc..00000000000 --- a/doc/notebook +++ /dev/null @@ -1 +0,0 @@ -Subproject commit ef3cbf880fcd725d11021e541a0cdfae4080446d diff --git a/doc/notebook2rst.sh b/doc/notebook2rst.sh new file mode 100755 index 00000000000..83bf7d57794 --- /dev/null +++ b/doc/notebook2rst.sh @@ -0,0 +1,17 @@ +#!/usr/bin/env bash + +set -euo pipefail + +cd "$(dirname "$0")" + +if [ ! -d notebook ]; then + git clone https://github.com/espnet/notebook --depth 1 +fi + +echo "\ +.. 
toctree:: + :maxdepth: 1 + :caption: Notebook: +" + +find ./notebook/*.ipynb -exec echo " {}" \; diff --git a/egs2/README.md b/egs2/README.md index dcbd80bf5b9..8da8f300214 100755 --- a/egs2/README.md +++ b/egs2/README.md @@ -8,39 +8,40 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2 | Directory name | Corpus name | Task | Language | URL | Note | | ----------------------- | --------------------------------------------------------------------------------------- | ----------------------- | --------------------- | ------------------------------------------------------------------------------------------------------------ | ------------ | -| aidatatang_200zh | Aidatatang_200zh A free Chinese Mandarin speech corpus | ASR | CMN | http://www.openslr.org/resources/62 | | -| aishell | AISHELL-ASR0009-OS1 Open Source Mandarin Speech Corpus | ASR | CMN | http://www.aishelltech.com/kysjcp | | -| aishell3 | AISHELL3 Mandarin multi-speaker text-to-speech | TTS | CMN | https://www.openslr.org/93/ | | -| ami | The AMI Meeting Corpus | ASR | ENG | http://groups.inf.ed.ac.uk/ami/corpus/ | | -| an4 | CMU AN4 database | ASR/TTS | ENG | http://www.speech.cs.cmu.edu/databases/an4/ | | -| babel | IARPA Babel corups | ASR | ~20 languages | https://www.iarpa.gov/index.php/research-programs/babel | | -| bn_openslr53 | Large bengali ASR training dataset | ASR | BEN | https://openslr.org/53/ | | +| aidatatang_200zh | Aidatatang_200zh A free Chinese Mandarin speech corpus | ASR | CMN | http://www.openslr.org/resources/62 | | +| aishell | AISHELL-ASR0009-OS1 Open Source Mandarin Speech Corpus | ASR | CMN | http://www.aishelltech.com/kysjcp | | +| aishell3 | AISHELL3 Mandarin multi-speaker text-to-speech | TTS | CMN | https://www.openslr.org/93/ | | +| ami | The AMI Meeting Corpus | ASR | ENG | http://groups.inf.ed.ac.uk/ami/corpus/ | | +| an4 | CMU AN4 database | ASR/TTS | ENG | http://www.speech.cs.cmu.edu/databases/an4/ | | +| babel | IARPA Babel corups | ASR | ~20 languages | https://www.iarpa.gov/index.php/research-programs/babel | | +| bn_openslr53 | Large bengali ASR training dataset | ASR | BEN | https://openslr.org/53/ | | | catslu | CATSLU-MAPS | SLU | CMN | https://sites.google.com/view/catslu/home | | -| chime4 | The 4th CHiME Speech Separation and Recognition Challenge | ASR/Multichannel ASR | ENG | http://spandh.dcs.shef.ac.uk/chime_challenge/chime2016/ | | -| cmu_indic | CMU INDIC | TTS | 7 languages | http://festvox.org/cmu_indic/ | | +| chime4 | The 4th CHiME Speech Separation and Recognition Challenge | ASR/Multichannel ASR | ENG | http://spandh.dcs.shef.ac.uk/chime_challenge/chime2016/ | | +| cmu_indic | CMU INDIC | TTS | 7 languages | http://festvox.org/cmu_indic/ | | | commonvoice | The Mozilla Common Voice | ASR | 13 languages | https://voice.mozilla.org/datasets | | -| csj | Corpus of Spontaneous Japanese | ASR | JPN | https://pj.ninjal.ac.jp/corpus_center/csj/en/ | | -| csmsc | Chinese Standard Mandarin Speech Copus | TTS | CMN | https://www.data-baker.com/open_source.html | | +| csj | Corpus of Spontaneous Japanese | ASR | JPN | https://pj.ninjal.ac.jp/corpus_center/csj/en/ | | +| csmsc | Chinese Standard Mandarin Speech Copus | TTS | CMN | https://www.data-baker.com/open_source.html | | | css10 | CSS10: A Collection of Single Speaker Speech Datasets for 10 Languages | TTS | 10 langauges | https://github.com/Kyubyong/css10 | | -| dirha_wsj | Distant-speech Interaction for Robust Home Applications | Multichannel ASR | ENG | https://dirha.fbk.eu/, 
https://github.com/SHINE-FBK/DIRHA_English_wsj | | -| dns_ins20 | Deep Noise Suppression Challenge – INTERSPEECH 2020 | SE | 7 languages + singing | https://www.microsoft.com/en-us/research/academic-program/deep-noise-suppression-challenge-interspeech-2020/ | | -| fisher_callhome_spanish | Fisher and CALLHOME Spanish--English Speech Translation | ASR/ST | SPA->ENG | https://catalog.ldc.upenn.edu/LDC2014T23 | | -| fsc | Fluent Speech Commands Dataset | SLU | ENG | https://fluent.ai/fluent-speech-commands-a-dataset-for-spoken-language-understanding-research/ | | -| fsc_unseen | Fluent Speech Commands Dataset MASE Eval Unseen splits | SLU | ENG | https://github.com/maseEval/mase | | -| fsc_challenge | Fluent Speech Commands Dataset MASE Eval Challenge splits | SLU | ENG | https://github.com/maseEval/mase | | -| gigaspeech | GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio | ASR | ENG | https://github.com/SpeechColab/GigaSpeech | | -| grabo | Grabo dataset | SLU | ENG + NLD | https://www.esat.kuleuven.be/psi/spraak/downloads/ | | -| hkust | HKUST/MTS: A very large scale Mandarin telephone speech corpus | ASR | CMN | https://catalog.ldc.upenn.edu/LDC2005S15 | | -| hui_acg | HUI-audio-corpus-german | TTS | DEU | https://opendata.iisys.de/datasets.html#hui-audio-corpus-german | | +| dirha_wsj | Distant-speech Interaction for Robust Home Applications | Multichannel ASR | ENG | https://dirha.fbk.eu/, https://github.com/SHINE-FBK/DIRHA_English_wsj | | +| dns_ins20 | Deep Noise Suppression Challenge – INTERSPEECH 2020 | SE | 7 languages + singing | https://www.microsoft.com/en-us/research/academic-program/deep-noise-suppression-challenge-interspeech-2020/ | | +| dsing | Automatic Lyric Transcription from Karaoke Vocal Tracks (From DAMP Sing300x30x2) | ASR (ALT) | ENG singing | https://github.com/groadabike/Kaldi-Dsing-task | | +| fisher_callhome_spanish | Fisher and CALLHOME Spanish--English Speech Translation | ASR/ST | SPA->ENG | https://catalog.ldc.upenn.edu/LDC2014T23 | | +| fsc | Fluent Speech Commands Dataset | SLU | ENG | https://fluent.ai/fluent-speech-commands-a-dataset-for-spoken-language-understanding-research/ | | +| fsc_unseen | Fluent Speech Commands Dataset MASE Eval Unseen splits | SLU | ENG | https://github.com/maseEval/mase | | +| fsc_challenge | Fluent Speech Commands Dataset MASE Eval Challenge splits | SLU | ENG | https://github.com/maseEval/mase | | +| gigaspeech | GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio | ASR | ENG | https://github.com/SpeechColab/GigaSpeech | | +| grabo | Grabo dataset | SLU | ENG + NLD | https://www.esat.kuleuven.be/psi/spraak/downloads/ | | +| hkust | HKUST/MTS: A very large scale Mandarin telephone speech corpus | ASR | CMN | https://catalog.ldc.upenn.edu/LDC2005S15 | | +| hui_acg | HUI-audio-corpus-german | TTS | DEU | https://opendata.iisys.de/datasets.html#hui-audio-corpus-german | | | how2 | How2: A Large-scale Dataset for Multimodal Language Understanding | ASR/MT/ST | ENG->POR | https://github.com/srvk/how2-dataset | | -| iemocap | IEMOCAP database: The Interactive Emotional Dyadic Motion Capture database | SLU | ENG | https://sail.usc.edu/iemocap/ | | -| iwslt21_low_resource | ALFFA, IARPA Babel, Gamayun, IWSLT 2021 | ASR | SWA | http://www.openslr.org/25/ https://catalog.ldc.upenn.edu/LDC2017S05 https://gamayun.translatorswb.org/data/ https://iwslt.org/2021/low-resource | | +| iemocap | IEMOCAP database: The Interactive Emotional Dyadic Motion Capture database | 
SLU | ENG | https://sail.usc.edu/iemocap/ | | +| iwslt21_low_resource | ALFFA, IARPA Babel, Gamayun, IWSLT 2021 | ASR | SWA | http://www.openslr.org/25/ https://catalog.ldc.upenn.edu/LDC2017S05 https://gamayun.translatorswb.org/data/ https://iwslt.org/2021/low-resource | | | jdcinal | Japanese Dialogue Corpus of Information Navigation and Attentive Listening Annotated with Extended ISO-24617-2 Dialogue Act Tags | SLU | JPN | http://www.lrec-conf.org/proceedings/lrec2018/pdf/464.pdf http://tts.speech.cs.cmu.edu/awb/infomation_navigation_and_attentive_listening_0.2.zip | | | jkac | J-KAC: Japanese Kamishibai and audiobook corpus | TTS | JPN | https://sites.google.com/site/shinnosuketakamichi/research-topics/j-kac_corpus | | | jmd | JMD: Japanese multi-dialect corpus for speech synthesis | TTS | JPN | https://sites.google.com/site/shinnosuketakamichi/research-topics/jmd_corpus | | | jsss | JSSS: Japanese speech corpus for summarization and simplification | TTS | JPN | https://sites.google.com/site/shinnosuketakamichi/research-topics/jsss_corpus | | | jsut | Japanese speech corpus of Saruwatari-lab., University of Tokyo | ASR/TTS | JPN | https://sites.google.com/site/shinnosuketakamichi/publication/jsut | | -| jtubespeech | Japanese YouTube Speech corpus | ASR/TTS | JPN | | | +| jtubespeech | Japanese YouTube Speech corpus | ASR/TTS | JPN | | | | jv_openslr35 | Javanese | ASR | JAV | http://www.openslr.org/35 | | | jvs | JVS (Japanese versatile speech) corpus | TTS | JPN | https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus | | | ksponspeech | KsponSpeech (Korean spontaneous speech) corpus | ASR | KOR | https://aihub.or.kr/aidata/105 | | @@ -49,26 +50,29 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2 | librimix | LibriMix: An Open-Source Dataset for Generalizable Speech Separation | SE | ENG | https://github.com/JorisCos/LibriMix | | | librispeech | LibriSpeech ASR corpus | ASR | ENG | http://www.openslr.org/12 | | | librispeech_100 | LibriSpeech ASR corpus 100h subset | ASR | ENG | http://www.openslr.org/12 | | -| libritts | LibriTTS corpus | TTS | ENG | http://www.openslr.org/60 | | +| libritts | LibriTTS corpus | TTS | ENG | http://www.openslr.org/60 | | | ljspeech | The LJ Speech Dataset | TTS | ENG | https://keithito.com/LJ-Speech-Dataset/ | | +| lrs3 | The Oxford-BBC Lip Reading Sentences 3 (LRS3) Dataset | ASR | ENG | https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs3.html | | | lrs2 | The Oxford-BBC Lip Reading Sentences 2 (LRS2) Dataset | Lipreading/ASR | ENG | https://www.robots.ox.ac.uk/~vgg/data/lip_reading/lrs2.html | | | mini_an4 | Mini version of CMU AN4 database for the integration test | ASR/TTS/SE | ENG | http://www.speech.cs.cmu.edu/databases/an4/ | | | mini_librispeech | Mini version of Librispeech corpus | DIAR | ENG | https://openslr.org/31/ | | -| mls | MLS (A large multilingual corpus derived from LibriVox audiobooks) | ASR | 8 languages | http://www.openslr.org/94/ | | +| mls | MLS (A large multilingual corpus derived from LibriVox audiobooks) | ASR | 8 languages | http://www.openslr.org/94/ | | +| mr_openslr64 | OpenSLR Marathi Corpus | ASR | MAR | http://www.openslr.org/64/ | | +| ms_indic_is18 | Microsoft Speech Corpus (Indian languages) | ASR | 3 langs: TEL TAM GUJ | https://msropendata.com/datasets/7230b4b1-912d-400e-be58-f84e0512985e | | | nsc | National Speech Corpus | ASR | ENG-SG | https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus | | -| open_li52 | Corpus 
combination with 52 languages(Commonvocie + voxforge) | Multilingual ASR | 52 languages | | | +| open_li52 | Corpus combination with 52 languages(Commonvocie + voxforge) | Multilingual ASR | 52 languages | | | | polyphone_swiss_french | Swiss French Polyphone corpus | ASR | FRA | http://catalog.elra.info/en-us/repository/browse/ELRA-S0030_02 | | | primewords_chinese | Primewords Chinese Corpus Set 1 | ASR | CMN | https://www.openslr.org/47/ | | -| puebla_nahuatl | Highland Puebla Nahuatl corpus (endangered language in central Mexico) | ASR | HPN | https://www.openslr.org/92/ | | +| puebla_nahuatl | Highland Puebla Nahuatl corpus (endangered language in central Mexico) | ASR | HPN | https://www.openslr.org/92/ | | | reverb | REVERB (REverberant Voice Enhancement and Recognition Benchmark) challenge | ASR | ENG | https://reverb2014.dereverberation.com/ | | | ru_open_stt | Russian Open Speech To Text (STT/ASR) Dataset | ASR | RUS | https://github.com/snakers4/open_stt | | | ruslan | RUSLAN: Russian Spoken Language Corpus For Speech Synthesis | TTS | RUS | https://ruslan-corpus.github.io/ | | | snips | SNIPS: A dataset for spoken language understanding | SLU | ENG | https://github.com/sonos/spoken-language-understanding-research-datasets | | | seame | SEAME: a Mandarin-English Code-switching Speech Corpus in South-East Asia | ASR | ENG + CMN | https://catalog.ldc.upenn.edu/LDC2015S04 | | -| siwis | SIWIS: Spoken Interaction with Interpretation in Switzerland | TTS | FRA | https://https://datashare.ed.ac.uk/handle/10283/2353 | | +| siwis | SIWIS: Spoken Interaction with Interpretation in Switzerland | TTS | FRA | https://https://datashare.ed.ac.uk/handle/10283/2353 | | | slue-voxceleb | SLUE: Spoken Language Understanding Evaluation | SLU | ENG | https://github.com/asappresearch/slue-toolkit | | | slurp | SLURP: A Spoken Language Understanding Resource Package | SLU | ENG | https://github.com/pswietojanski/slurp | | -| slurp_entity | SLURP: A Spoken Language Understanding Resource Package | SLU/Entity Classification | ENG | https://github.com/pswietojanski/slurp | | +| slurp_entity | SLURP: A Spoken Language Understanding Resource Package | SLU/Entity Classifi. 
| ENG | https://github.com/pswietojanski/slurp | | | sms_wsj | SMS-WSJ: A database for in-depth analysis of multi-channel source separation algorithms | SE | ENG | https://github.com/fgnt/sms_wsj | | | speechcommands | Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition | SLU | ENG | https://www.tensorflow.org/datasets/catalog/speech_commands | | | spgispeech | SPGISpeech 5k corpus | ASR | ENG | https://datasets.kensho.com/datasets/scribe | | @@ -79,12 +83,12 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2 | tedlium2 | TED-LIUM corpus release 2 | ASR | ENG | https://www.openslr.org/19/, http://www.lrec-conf.org/proceedings/lrec2014/pdf/1104_Paper.pdf | | | thchs30 | A Free Chinese Speech Corpus Released by CSLT@Tsinghua University | TTS | CMN | https://www.openslr.org/18/ | | | timit | TIMIT Acoustic-Phonetic Continuous Speech Corpus | ASR | ENG | https://catalog.ldc.upenn.edu/LDC93S1 | | -| totonac | Highland Totonac corpus (endangered language in central Mexico) | ASR | TOS | http://www.openslr.org/107/ | | -| tsukuyomi | つくよみちゃんコーパス | TTS | JPN | https://tyc.rei-yumesaki.net/material/corpus | | -| vctk | English Multi-speaker Corpus for CSTR Voice Cloning Toolkit | TTS | ENG | http://www.udialogue.org/download/cstr-vctk-corpus.html | | +| totonac | Highland Totonac corpus (endangered language in central Mexico) | ASR | TOS | http://www.openslr.org/107/ | | +| tsukuyomi | つくよみちゃんコーパス | TTS | JPN | https://tyc.rei-yumesaki.net/material/corpus | | +| vctk | English Multi-speaker Corpus for CSTR Voice Cloning Toolkit | ASR/TTS | ENG | http://www.udialogue.org/download/cstr-vctk-corpus.html | | | vctk_noisyreverb | Noisy reverberant speech database (48kHz) | SE | ENG | https://datashare.ed.ac.uk/handle/10283/2826 | | | vivos | VIVOS (Vietnamese corpus for ASR) | ASR | VIE | https://ailab.hcmus.edu.vn/vivos/ | | -| voxforge | VoxForge | ASR | 7 languages | http://www.voxforge.org/ | | +| voxforge | VoxForge | ASR | 7 languages | http://www.voxforge.org/ | | | wenetspeech | WenetSpeech: A 10000+ Hours Multi-domain Chinese Corpus for Speech Recognition | ASR | CMN | https://wenet-e2e.github.io/WenetSpeech/ | | | wham | The WSJ0 Hipster Ambient Mixtures (WHAM!) dataset | SE | ENG | https://wham.whisper.ai/ | | | whamr | WHAMR!: Noisy and Reverberant Single-Channel Speech Separation | SE | ENG | https://wham.whisper.ai/ | | @@ -94,3 +98,4 @@ See: https://espnet.github.io/espnet/espnet2_tutorial.html#recipes-using-espnet2 | yesno | The "yesno" corpus | ASR | HEB | http://www.openslr.org/1 | | | yoloxochitl_mixtec | Yoloxochitl-Mixtec corpus (endangered language in central Mexico) | ASR | XTY | http://www.openslr.org/89 | | | zeroth_korean | Zeroth-Korean | ASR | KOR | http://www.openslr.org/40 | | +| zh_openslr38 | ST-CMDS-20170001_1, Free ST Chinese Mandarin Corpus | ASR | CMN | http://www.openslr.org/38 | | \ No newline at end of file diff --git a/egs2/TEMPLATE/asr1/asr.sh b/egs2/TEMPLATE/asr1/asr.sh index 04f7578b5b0..f4d7a8ad24a 100755 --- a/egs2/TEMPLATE/asr1/asr.sh +++ b/egs2/TEMPLATE/asr1/asr.sh @@ -110,6 +110,8 @@ k2_config=./conf/decode_asr_transformer_with_k2.yaml use_streaming=false # Whether to use streaming decoding +use_maskctc=false # Whether to use maskctc decoding + batch_size=1 inference_tag= # Suffix to the result dir for decoding. inference_config= # Config for decoding. @@ -224,6 +226,7 @@ Options: --inference_asr_model # ASR model path for decoding (default="${inference_asr_model}"). 
--download_model # Download a model from Model Zoo and use it for decoding (default="${download_model}"). --use_streaming # Whether to use streaming decoding (default="${use_streaming}"). + --use_maskctc # Whether to use maskctc decoding (default="${use_maskctc}"). # [Task dependent] Set the datadir name created by local/data.sh --train_set # Name of training set (required). @@ -895,7 +898,7 @@ if ! "${skip_train}"; then if "${use_ngram}"; then log "Stage 9: Ngram Training: train_set=${data_feats}/lm_train.txt" cut -f 2- -d " " ${data_feats}/lm_train.txt | lmplz -S "20%" --discount_fallback -o ${ngram_num} - >${ngram_exp}/${ngram_num}gram.arpa - build_binary -s ${ngram_exp}/${ngram_num}gram.arpa ${ngram_exp}/${ngram_num}gram.bin + build_binary -s ${ngram_exp}/${ngram_num}gram.arpa ${ngram_exp}/${ngram_num}gram.bin else log "Stage 9: Skip ngram stages: use_ngram=${use_ngram}" fi @@ -1195,6 +1198,8 @@ if ! "${skip_eval}"; then else if "${use_streaming}"; then asr_inference_tool="espnet2.bin.asr_inference_streaming" + elif "${use_maskctc}"; then + asr_inference_tool="espnet2.bin.asr_inference_maskctc" else asr_inference_tool="espnet2.bin.asr_inference" fi diff --git a/egs2/TEMPLATE/asr1/db.sh b/egs2/TEMPLATE/asr1/db.sh index 3785aef57a8..31008b9502c 100755 --- a/egs2/TEMPLATE/asr1/db.sh +++ b/egs2/TEMPLATE/asr1/db.sh @@ -11,6 +11,7 @@ DIRHA_ENGLISH_PHDEV= DIRHA_WSJ= DIRHA_WSJ_PROCESSED="${PWD}/data/local/dirha_wsj_processed" # Output file path DNS= +DSING=downloads WSJ0= WSJ1= WSJCAM0= @@ -107,6 +108,7 @@ GOOGLEI18N=downloads NOISY_SPEECH= NOISY_REVERBERANT_SPEECH= LRS2= +LRS3= SUNDA=downloads CMU_ARCTIC=downloads CMU_INDIC=downloads @@ -126,6 +128,9 @@ PRIMEWORDS_CHINESE=downloads SEAME= BENGALI=downloads IWSLT14= +ST_CMDS=downloads +MS_INDIC_IS18= +MARATHI=downloads # For only CMU TIR environment if [[ "$(hostname)" == tir* ]]; then @@ -159,6 +164,8 @@ if [[ "$(hostname)" == tir* ]]; then IWSLT22_DIALECT=/projects/tir5/data/speech_corpora/LDC2022E01_IWSLT22_Tunisian_Arabic_Shared_Task_Training_Data/ PRIMEWORDS_CHINESE=/projects/tir5/data/speech_corpora/Primewords_Chinese FISHER_CALLHOME_SPANISH=/projects/tir5/data/speech_corpora/fisher_callhome_spanish + DSING=/projects/tir5/data/speech_corpora/sing_300x30x2 + MS_INDIC_IS18=/projects/tir6/general/cnariset/corpora/microsoft_speech_corpus_indian_languages fi # For only JHU environment diff --git a/egs2/TEMPLATE/asr1/pyscripts/utils/score_intent.py b/egs2/TEMPLATE/asr1/pyscripts/utils/score_intent.py index 13354637d52..4f0f074c9db 100755 --- a/egs2/TEMPLATE/asr1/pyscripts/utils/score_intent.py +++ b/egs2/TEMPLATE/asr1/pyscripts/utils/score_intent.py @@ -12,7 +12,7 @@ import argparse -def get_classification_result(hyp_file, ref_file): +def get_classification_result(hyp_file, ref_file, hyp_write, ref_write): hyp_lines = [line for line in hyp_file] ref_lines = [line for line in ref_file] @@ -22,6 +22,16 @@ def get_classification_result(hyp_file, ref_file): ref_intent = ref_lines[line_count].split(" ")[0] if hyp_intent != ref_intent: error += 1 + hyp_write.write( + " ".join(hyp_lines[line_count].split("\t")[0].split(" ")[1:]) + + "\t" + + hyp_lines[line_count].split("\t")[1] + ) + ref_write.write( + " ".join(ref_lines[line_count].split("\t")[0].split(" ")[1:]) + + "\t" + + ref_lines[line_count].split("\t")[1] + ) return 1 - (error / len(hyp_lines)) @@ -56,7 +66,16 @@ def get_classification_result(hyp_file, ref_file): os.path.join(exp_root, valid_inference_folder + "score_wer/ref.trn") ) -result = get_classification_result(valid_hyp_file,
valid_ref_file) +valid_hyp_write_file = open( + os.path.join(exp_root, valid_inference_folder + "score_wer/hyp_asr.trn"), "w" +) +valid_ref_write_file = open( + os.path.join(exp_root, valid_inference_folder + "score_wer/ref_asr.trn"), "w" +) + +result = get_classification_result( + valid_hyp_file, valid_ref_file, valid_hyp_write_file, valid_ref_write_file +) print("Valid Intent Classification Result") print(result) @@ -66,8 +85,16 @@ def get_classification_result(hyp_file, ref_file): test_ref_file = open( os.path.join(exp_root, test_inference_folder + "score_wer/ref.trn") ) +test_hyp_write_file = open( + os.path.join(exp_root, test_inference_folder + "score_wer/hyp_asr.trn"), "w" +) +test_ref_write_file = open( + os.path.join(exp_root, test_inference_folder + "score_wer/ref_asr.trn"), "w" +) -result = get_classification_result(test_hyp_file, test_ref_file) +result = get_classification_result( + test_hyp_file, test_ref_file, test_hyp_write_file, test_ref_write_file +) print("Test Intent Classification Result") print(result) @@ -79,6 +106,17 @@ def get_classification_result(hyp_file, ref_file): utt_test_ref_file = open( os.path.join(exp_root, utt_test_inference_folder + "score_wer/ref.trn") ) - result = get_classification_result(utt_test_hyp_file, utt_test_ref_file) + utt_test_hyp_write_file = open( + os.path.join(exp_root, utt_test_inference_folder + "score_wer/hyp_asr.trn"), "w" + ) + utt_test_ref_write_file = open( + os.path.join(exp_root, utt_test_inference_folder + "score_wer/ref_asr.trn"), "w" + ) + result = get_classification_result( + utt_test_hyp_file, + utt_test_ref_file, + utt_test_hyp_write_file, + utt_test_ref_write_file, + ) print("Unseen Utterance Test Intent Classification Result") print(result) diff --git a/egs2/TEMPLATE/asr1/pyscripts/utils/score_summarization.py b/egs2/TEMPLATE/asr1/pyscripts/utils/score_summarization.py new file mode 100644 index 00000000000..35202f1ce88 --- /dev/null +++ b/egs2/TEMPLATE/asr1/pyscripts/utils/score_summarization.py @@ -0,0 +1,50 @@ +import sys +import os +from datasets import load_metric +import numpy as np +from nlgeval import compute_metrics +from nlgeval import NLGEval + + +ref_file = sys.argv[1] +hyp_file = sys.argv[2] + +with open(ref_file, "r") as f: + ref_dict = { + line.strip().split(" ")[0]: " ".join(line.strip().split(" ")[1:]) + for line in f.readlines() + } + +with open(hyp_file, "r") as f: + hyp_dict = { + line.strip().split(" ")[0]: " ".join(line.strip().split(" ")[1:]) + for line in f.readlines() + } + +keys = [k for k, v in hyp_dict.items()] +labels = [ref_dict[k] for k, _ in hyp_dict.items()] +decoded_preds = [v for k, v in hyp_dict.items()] + +metric = load_metric("bertscore") +result_bert = metric.compute( + predictions=decoded_preds, + references=labels, + lang="en", +) + + +nlg = NLGEval() # loads the models +print("Key", "\t", "METEOR", "\t", "ROUGE-L") +for (key, ref, hyp) in zip(keys, labels, decoded_preds): + metrics_dict = nlg.compute_individual_metrics([ref], hyp) + print(key, "\t", metrics_dict["METEOR"], "\t", metrics_dict["ROUGE_L"]) +refs = [[x] for x in labels] +metrics_dict = nlg.compute_metrics(ref_list=[labels], hyp_list=decoded_preds) +metric = load_metric("rouge") +result = metric.compute(predictions=decoded_preds, references=labels) +result = {key: value.mid.fmeasure * 100 for key, value in result.items()} + +print( + f"RESULT {result['rouge1']} {result['rouge2']} {result['rougeL']} \ + {metrics_dict['METEOR']*100.0} {100*np.mean(result_bert['precision'])}" +) diff --git 
a/egs2/TEMPLATE/asr1/scripts/utils/show_asr_result.sh b/egs2/TEMPLATE/asr1/scripts/utils/show_asr_result.sh index afa768bf5d5..9b8abb9d658 100755 --- a/egs2/TEMPLATE/asr1/scripts/utils/show_asr_result.sh +++ b/egs2/TEMPLATE/asr1/scripts/utils/show_asr_result.sh @@ -44,7 +44,16 @@ cat << EOF EOF while IFS= read -r expdir; do - if ls "${expdir}"/*/*/score_*/result.txt &> /dev/null; then + + if ls "${expdir}"/*/*/result.sum &> /dev/null; then + echo "## $(basename ${expdir})" + cat << EOF +|dataset|ROUGE-1|ROUGE-2|ROUGE-L|METEOR|BERTScore| +|---|---|---|---|---|---| +EOF + grep -H -e "RESULT" "${expdir}"/*/*/result.sum | sed 's=RESULT==g' | cut -d ' ' -f 1,2- | tr ' ' '|' + echo + elif ls "${expdir}"/*/*/score_*/result.txt &> /dev/null; then echo "## $(basename ${expdir})" for type in wer cer ter; do cat << EOF diff --git a/egs2/TEMPLATE/mt1/mt.sh b/egs2/TEMPLATE/mt1/mt.sh index 6164c155558..35c6ab276c3 100755 --- a/egs2/TEMPLATE/mt1/mt.sh +++ b/egs2/TEMPLATE/mt1/mt.sh @@ -1165,37 +1165,54 @@ if ! "${skip_eval}"; then _scoredir="${_dir}/score_bleu" mkdir -p "${_scoredir}" - paste \ - <(<"${_data}/text.${tgt_case}.${tgt_lang}" \ - ${python} -m espnet2.bin.tokenize_text \ - -f 2- --input - --output - \ - --token_type word \ - --non_linguistic_symbols "${nlsyms_txt}" \ - --remove_non_linguistic_symbols true \ - --cleaner "${cleaner}" \ - ) \ - <(<"${_data}/text.${tgt_case}.${tgt_lang}" awk '{ print "(" $2 "-" $1 ")" }') \ - >"${_scoredir}/ref.trn.org" + <"${_data}/text.${tgt_case}.${tgt_lang}" \ + ${python} -m espnet2.bin.tokenize_text \ + -f 2- --input - --output - \ + --token_type word \ + --non_linguistic_symbols "${nlsyms_txt}" \ + --remove_non_linguistic_symbols true \ + --cleaner "${cleaner}" \ + >"${_scoredir}/ref.trn" + + #paste \ + # <(<"${_data}/text.${tgt_case}.${tgt_lang}" \ + # ${python} -m espnet2.bin.tokenize_text \ + # -f 2- --input - --output - \ + # --token_type word \ + # --non_linguistic_symbols "${nlsyms_txt}" \ + # --remove_non_linguistic_symbols true \ + # --cleaner "${cleaner}" \ + # ) \ + # <(<"${_data}/text.${tgt_case}.${tgt_lang}" awk '{ print "(" $2 "-" $1 ")" }') \ + # >"${_scoredir}/ref.trn.org" # NOTE(kamo): Don't use cleaner for hyp - paste \ - <(<"${_dir}/text" \ - ${python} -m espnet2.bin.tokenize_text \ - -f 2- --input - --output - \ - --token_type word \ - --non_linguistic_symbols "${nlsyms_txt}" \ - --remove_non_linguistic_symbols true \ - ) \ - <(<"${_data}/text.${tgt_case}.${tgt_lang}" awk '{ print "(" $2 "-" $1 ")" }') \ - >"${_scoredir}/hyp.trn.org" + <"${_dir}/text" \ + ${python} -m espnet2.bin.tokenize_text \ + -f 2- --input - --output - \ + --token_type word \ + --non_linguistic_symbols "${nlsyms_txt}" \ + --remove_non_linguistic_symbols true \ + >"${_scoredir}/hyp.trn" + + #paste \ + # <(<"${_dir}/text" \ + # ${python} -m espnet2.bin.tokenize_text \ + # -f 2- --input - --output - \ + # --token_type word \ + # --non_linguistic_symbols "${nlsyms_txt}" \ + # --remove_non_linguistic_symbols true \ + # ) \ + # <(<"${_data}/text.${tgt_case}.${tgt_lang}" awk '{ print "(" $2 "-" $1 ")" }') \ + # >"${_scoredir}/hyp.trn.org" # remove utterance id - perl -pe 's/\([^\)]+\)//g;' "${_scoredir}/ref.trn.org" > "${_scoredir}/ref.trn" - perl -pe 's/\([^\)]+\)//g;' "${_scoredir}/hyp.trn.org" > "${_scoredir}/hyp.trn" + #perl -pe 's/\([^\)]+\)//g;' "${_scoredir}/ref.trn.org" > "${_scoredir}/ref.trn" + #perl -pe 's/\([^\)]+\)//g;' "${_scoredir}/hyp.trn.org" > "${_scoredir}/hyp.trn" # detokenizer - detokenizer.perl -l en -q < "${_scoredir}/ref.trn" > 
"${_scoredir}/ref.trn.detok" - detokenizer.perl -l en -q < "${_scoredir}/hyp.trn" > "${_scoredir}/hyp.trn.detok" + detokenizer.perl -l ${tgt_lang} -q < "${_scoredir}/ref.trn" > "${_scoredir}/ref.trn.detok" + detokenizer.perl -l ${tgt_lang} -q < "${_scoredir}/hyp.trn" > "${_scoredir}/hyp.trn.detok" if [ ${tgt_case} = "tc" ]; then echo "Case sensitive BLEU result (single-reference)" >> ${_scoredir}/result.tc.txt @@ -1238,7 +1255,7 @@ if ! "${skip_eval}"; then # perl -pe 's/\([^\)]+\)//g;' "${_scoredir}/ref.trn.org.${ref_idx}" > "${_scoredir}/ref.trn.${ref_idx}" - detokenizer.perl -l en -q < "${_scoredir}/ref.trn.${ref_idx}" > "${_scoredir}/ref.trn.detok.${ref_idx}" + detokenizer.perl -l ${tgt_lang} -q < "${_scoredir}/ref.trn.${ref_idx}" > "${_scoredir}/ref.trn.detok.${ref_idx}" remove_punctuation.pl < "${_scoredir}/ref.trn.detok.${ref_idx}" > "${_scoredir}/ref.trn.detok.lc.rm.${ref_idx}" case_sensitive_refs="${case_sensitive_refs} ${_scoredir}/ref.trn.detok.${ref_idx}" case_insensitive_refs="${case_insensitive_refs} ${_scoredir}/ref.trn.detok.lc.rm.${ref_idx}" diff --git a/egs2/TEMPLATE/ssl1/pyscripts/dump_km_label.py b/egs2/TEMPLATE/ssl1/pyscripts/dump_km_label.py index 6b14ac4ec97..552c84f89ad 100644 --- a/egs2/TEMPLATE/ssl1/pyscripts/dump_km_label.py +++ b/egs2/TEMPLATE/ssl1/pyscripts/dump_km_label.py @@ -39,13 +39,13 @@ class ApplyKmeans(object): def __init__(self, km_path): self.km_model = joblib.load(km_path) self.nc = self.km_model.cluster_centers_.transpose() - self.nc_norm = (self.nc ** 2).sum(0, keepdims=True) + self.nc_norm = (self.nc**2).sum(0, keepdims=True) def __call__(self, x): if isinstance(x, torch.Tensor): x = x.cpu().numpy() probs = ( - (x ** 2).sum(1, keepdims=True) - 2 * np.matmul(x, self.nc) + self.nc_norm + (x**2).sum(1, keepdims=True) - 2 * np.matmul(x, self.nc) + self.nc_norm ) return np.argmin(probs, axis=1) diff --git a/egs2/TEMPLATE/st1/st.sh b/egs2/TEMPLATE/st1/st.sh index 93ffe4d3cf5..9867f341f88 100755 --- a/egs2/TEMPLATE/st1/st.sh +++ b/egs2/TEMPLATE/st1/st.sh @@ -296,18 +296,8 @@ fi # Extra files for translation process utt_extra_files="text.${src_case}.${src_lang} text.${tgt_case}.${tgt_lang}" # Use the same text as ST for bpe training if not specified. -if "${token_joint}"; then - # if token_joint, the bpe training will use both src_lang and tgt_lang to train a single bpe model - [ -z "${src_bpe_train_text}" ] && src_bpe_train_text="${data_feats}/${train_set}/text.${src_case}.${src_lang}" - [ -z "${tgt_bpe_train_text}" ] && tgt_bpe_train_text="${data_feats}/${train_set}/text.${tgt_case}.${tgt_lang}" - - # Prepare data as text.${src_lang}_${tgt_lang}) - cat $src_bpe_train_text $tgt_bpe_train_text > ${data_feats}/${train_set}/text.${src_lang}_${tgt_lang} - tgt_bpe_train_text="${data_feats}/${train_set}/text.${src_lang}_${tgt_lang}" -else - [ -z "${src_bpe_train_text}" ] && src_bpe_train_text="${data_feats}/${train_set}/text.${src_case}.${src_lang}" - [ -z "${tgt_bpe_train_text}" ] && tgt_bpe_train_text="${data_feats}/${train_set}/text.${tgt_case}.${tgt_lang}" -fi +[ -z "${src_bpe_train_text}" ] && src_bpe_train_text="${data_feats}/${train_set}/text.${src_case}.${src_lang}" +[ -z "${tgt_bpe_train_text}" ] && tgt_bpe_train_text="${data_feats}/${train_set}/text.${tgt_case}.${tgt_lang}" # Use the same text as ST for lm training if not specified. [ -z "${lm_train_text}" ] && lm_train_text="${data_feats}/${train_set}/text.${tgt_case}.${tgt_lang}" # Use the same text as ST for lm training if not specified. @@ -743,6 +733,16 @@ if ! 
"${skip_data_prep}"; then fi if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then + # Combine source and target texts when using joint tokenization + if "${token_joint}"; then + log "Merge src and target data if joint BPE" + + cat $tgt_bpe_train_text > ${data_feats}/${train_set}/text.${src_lang}_${tgt_lang} + [ ! -z "${src_bpe_train_text}" ] && cat ${src_bpe_train_text} >> ${data_feats}/${train_set}/text.${src_lang}_${tgt_lang} + # Set the new text as the target text + tgt_bpe_train_text="${data_feats}/${train_set}/text.${src_lang}_${tgt_lang}" + fi + # First generate tgt lang if [ "${tgt_token_type}" = bpe ]; then log "Stage 5a: Generate token_list from ${tgt_bpe_train_text} using BPE for tgt_lang" @@ -1484,8 +1484,8 @@ if ! "${skip_eval}"; then perl -pe 's/\([^\)]+\)//g;' "${_scoredir}/hyp.trn.org" > "${_scoredir}/hyp.trn" # detokenizer - detokenizer.perl -l en -q < "${_scoredir}/ref.trn" > "${_scoredir}/ref.trn.detok" - detokenizer.perl -l en -q < "${_scoredir}/hyp.trn" > "${_scoredir}/hyp.trn.detok" + detokenizer.perl -l ${tgt_lang} -q < "${_scoredir}/ref.trn" > "${_scoredir}/ref.trn.detok" + detokenizer.perl -l ${tgt_lang} -q < "${_scoredir}/hyp.trn" > "${_scoredir}/hyp.trn.detok" if [ ${tgt_case} = "tc" ]; then echo "Case sensitive BLEU result (single-reference)" >> ${_scoredir}/result.tc.txt @@ -1528,7 +1528,7 @@ if ! "${skip_eval}"; then # perl -pe 's/\([^\)]+\)//g;' "${_scoredir}/ref.trn.org.${ref_idx}" > "${_scoredir}/ref.trn.${ref_idx}" - detokenizer.perl -l en -q < "${_scoredir}/ref.trn.${ref_idx}" > "${_scoredir}/ref.trn.detok.${ref_idx}" + detokenizer.perl -l ${tgt_lang} -q < "${_scoredir}/ref.trn.${ref_idx}" > "${_scoredir}/ref.trn.detok.${ref_idx}" remove_punctuation.pl < "${_scoredir}/ref.trn.detok.${ref_idx}" > "${_scoredir}/ref.trn.detok.lc.rm.${ref_idx}" case_sensitive_refs="${case_sensitive_refs} ${_scoredir}/ref.trn.detok.${ref_idx}" case_insensitive_refs="${case_insensitive_refs} ${_scoredir}/ref.trn.detok.lc.rm.${ref_idx}" @@ -1551,7 +1551,7 @@ if ! 
"${skip_eval}"; then done # Show results in Markdown syntax - scripts/utils/show_st_result.sh --case $tgt_case "${st_exp}" > "${st_exp}"/RESULTS.md + scripts/utils/show_translation_result.sh --case $tgt_case "${st_exp}" > "${st_exp}"/RESULTS.md cat "${cat_exp}"/RESULTS.md fi else diff --git a/egs2/bn_openslr53/asr1/README.md b/egs2/bn_openslr53/asr1/README.md new file mode 100644 index 00000000000..542c8053339 --- /dev/null +++ b/egs2/bn_openslr53/asr1/README.md @@ -0,0 +1,29 @@ +# RESULTS +## Environments +- date: `Mon Jan 31 10:53:20 EST 2022` +- python version: `3.9.5 (default, Jun 4 2021, 12:28:51) [GCC 7.5.0]` +- espnet version: `espnet 0.10.6a1` +- pytorch version: `pytorch 1.8.1+cu102` +- Git hash: `9d09bf551a9fe090973de60e15adec1de6b3d054` + - Commit date: `Fri Jan 21 11:43:15 2022 -0500` +- Pretrained Model: https://huggingface.co/espnet/bn_openslr53 + +## asr_train_asr_raw_bpe1000 +### WER + +|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err| +|---|---|---|---|---|---|---|---|---| +|decode_asr_batch_size1_lm_lm_train_lm_bpe1000_valid.loss.ave_asr_model_valid.acc.best/sbn_test|2018|6470|74.2|21.3|4.5|2.2|28.0|48.8| + +### CER + +|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err| +|---|---|---|---|---|---|---|---|---| +|decode_asr_batch_size1_lm_lm_train_lm_bpe1000_valid.loss.ave_asr_model_valid.acc.best/sbn_test|2018|39196|89.4|4.3|6.3|1.4|12.0|48.8| + +### TER + +|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err| +|---|---|---|---|---|---|---|---|---| +|decode_asr_batch_size1_lm_lm_train_lm_bpe1000_valid.loss.ave_asr_model_valid.acc.best/sbn_test|2018|15595|77.6|12.7|9.7|1.6|24.0|48.7| + diff --git a/egs2/chime4/enh1/README.md b/egs2/chime4/enh1/README.md index 9ca905d08cd..886eb0cbf26 100644 --- a/egs2/chime4/enh1/README.md +++ b/egs2/chime4/enh1/README.md @@ -6,6 +6,7 @@ - python version: `3.6.3 |Anaconda, Inc.| (default, Nov 20 2017, 20:41:42) [GCC 7.2.0]` - espnet version: `espnet 0.9.7` - pytorch version: `pytorch 1.6.0` +- Note: PESQ is evaluated based on https://github.com/vBaiCai/python-pesq ## enh_train_enh_conv_tasnet_raw @@ -25,3 +26,36 @@ config: conf/tuning/train_enh_beamformer_mvdr.yaml |---|---|---|---|---|---|---| |enhanced_dt05_simu_isolated_6ch_track|2.60|0.94|13.67|13.67|0|12.51| |enhanced_et05_simu_isolated_6ch_track|2.63|0.95|15.51|15.51|0|14.65| + + +## enh_train_enh_dc_crn_mapping_snr_raw + +config: conf/tuning/train_enh_dc_crn_mapping_snr.yaml + +|dataset|PESQ|STOI|SAR|SDR|SIR|SI_SNR| +|---|---|---|---|---|---|---| +|enhanced_dt05_simu_isolated_6ch_track|3.10|0.96|17.82|17.82|0.00|17.59| +|enhanced_et05_simu_isolated_6ch_track|2.95|0.95|17.33|17.33|0.00|17.04| + + +# RESULTS +## Environments +- date: `Sat Mar 19 07:17:45 CST 2022` +- python version: `3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]` +- espnet version: `espnet 0.10.7a1` +- pytorch version: `pytorch 1.8.1` +- Git hash: `648b024d8fb262eb9923c06a698b9c6df5b16e51` + - Commit date: `Wed Mar 16 18:47:21 2022 +0800` + + +## enh_train_enh_dprnntac_fasnet_raw + +config: conf/tuning/train_enh_dprnntac_fasnet.yaml + +Pretrained model: https://huggingface.co/lichenda/chime4_fasnet_dprnn_tac + +|dataset|STOI|SAR|SDR|SIR| +|---|---|---|---|---| +|enhanced_dt05_simu_isolated_6ch_track|0.95|15.75|15.75|0.00| +|enhanced_et05_simu_isolated_6ch_track|0.94|15.40|15.40|0.00| + diff --git a/egs2/chime4/enh1/conf/tuning/train_enh_beamformer_mvdr.yaml b/egs2/chime4/enh1/conf/tuning/train_enh_beamformer_mvdr.yaml index fc996552cd3..cee051c8ef1 100644 --- a/egs2/chime4/enh1/conf/tuning/train_enh_beamformer_mvdr.yaml +++ 
b/egs2/chime4/enh1/conf/tuning/train_enh_beamformer_mvdr.yaml @@ -53,7 +53,7 @@ separator_conf: bunits: 512 bprojs: 512 badim: 320 - ref_channel: 4 + ref_channel: 3 use_noise_mask: True beamformer_type: mvdr_souden bdropout_rate: 0.0 diff --git a/egs2/chime4/enh1/conf/tuning/train_enh_dc_crn_mapping_snr.yaml b/egs2/chime4/enh1/conf/tuning/train_enh_dc_crn_mapping_snr.yaml new file mode 100644 index 00000000000..38d61843282 --- /dev/null +++ b/egs2/chime4/enh1/conf/tuning/train_enh_dc_crn_mapping_snr.yaml @@ -0,0 +1,67 @@ +init: xavier_uniform +max_epoch: 200 +batch_type: folded +batch_size: 16 +iterator_type: chunk +chunk_length: 32000 +num_workers: 4 +optim: adam +optim_conf: + lr: 1.0e-03 + eps: 1.0e-08 + weight_decay: 1.0e-7 + amsgrad: true +patience: 10 +grad_clip: 5 +val_scheduler_criterion: +- valid +- loss +best_model_criterion: +- - valid + - si_snr + - max +- - valid + - loss + - min +keep_nbest_models: 1 +scheduler: steplr +scheduler_conf: + step_size: 2 + gamma: 0.98 + +# A list for criterions +# The overlall loss in the multi-task learning will be: +# loss = weight_1 * loss_1 + ... + weight_N * loss_N +# The default `weight` for each sub-loss is 1.0 +criterions: + # The first criterion + - name: snr + conf: + eps: 1.0e-7 + # the wrapper for the current criterion + # PIT is widely used in the speech separation task + wrapper: pit + wrapper_conf: + weight: 1.0 + + +encoder: stft +encoder_conf: + n_fft: 256 + hop_length: 128 +decoder: stft +decoder_conf: + n_fft: 256 + hop_length: 128 +separator: dc_crn +separator_conf: + num_spk: 1 + input_channels: [10, 16, 32, 64, 128, 256] # 5x2=10 input channels + enc_hid_channels: 8 + enc_layers: 5 + glstm_groups: 2 + glstm_layers: 2 + glstm_bidirectional: true + glstm_rearrange: false + mode: mapping + ref_channel: 3 diff --git a/egs2/chime4/enh1/conf/tuning/train_enh_dprnntac_fasnet.yaml b/egs2/chime4/enh1/conf/tuning/train_enh_dprnntac_fasnet.yaml new file mode 100644 index 00000000000..b5dd47ddac7 --- /dev/null +++ b/egs2/chime4/enh1/conf/tuning/train_enh_dprnntac_fasnet.yaml @@ -0,0 +1,59 @@ +optim: adam +init: xavier_uniform +max_epoch: 100 +batch_type: folded +batch_size: 8 +iterator_type: chunk +chunk_length: 32000 +num_workers: 4 +optim_conf: + lr: 1.0e-03 + eps: 1.0e-08 + weight_decay: 0 +patience: 10 +val_scheduler_criterion: +- valid +- loss +best_model_criterion: +- - valid + - si_snr + - max +- - valid + - loss + - min +keep_nbest_models: 1 +scheduler: steplr +scheduler_conf: + step_size: 2 + gamma: 0.98 + +encoder: same +encoder_conf: {} +decoder: same +decoder_conf: {} +separator: fasnet +separator_conf: + enc_dim: 64 + feature_dim: 64 + hidden_dim: 128 + layer: 6 + segment_size: 24 + num_spk: 1 + win_len: 16 + context_len: 16 + sr: 16000 + fasnet_type: 'fasnet' + dropout: 0.2 + + + +criterions: + # The first criterion + - name: si_snr + conf: + eps: 1.0e-7 + # the wrapper for the current criterion + # for single-talker case, we simplely use fixed_order wrapper + wrapper: fixed_order + wrapper_conf: + weight: 1.0 diff --git a/egs2/chime4/enh1/conf/tuning/train_enh_dprnntac_ifasnet.yaml b/egs2/chime4/enh1/conf/tuning/train_enh_dprnntac_ifasnet.yaml new file mode 100644 index 00000000000..ef1349ad8b9 --- /dev/null +++ b/egs2/chime4/enh1/conf/tuning/train_enh_dprnntac_ifasnet.yaml @@ -0,0 +1,58 @@ +optim: adam +init: xavier_uniform +max_epoch: 100 +batch_type: folded +batch_size: 8 +iterator_type: chunk +chunk_length: 32000 +num_workers: 4 +optim_conf: + lr: 1.0e-03 + eps: 1.0e-08 + weight_decay: 0 +patience: 10 
+val_scheduler_criterion: +- valid +- loss +best_model_criterion: +- - valid + - si_snr + - max +- - valid + - loss + - min +keep_nbest_models: 1 +scheduler: steplr +scheduler_conf: + step_size: 2 + gamma: 0.98 + +encoder: same +encoder_conf: {} +decoder: same +decoder_conf: {} +separator: fasnet +separator_conf: + enc_dim: 64 + feature_dim: 64 + hidden_dim: 128 + layer: 6 + segment_size: 24 + num_spk: 1 + win_len: 16 + context_len: 16 + sr: 16000 + fasnet_type: 'ifasnet' + + + +criterions: + # The first criterion + - name: si_snr + conf: + eps: 1.0e-7 + # the wrapper for the current criterion + # for single-talker case, we simplely use fixed_order wrapper + wrapper: fixed_order + wrapper_conf: + weight: 1.0 diff --git a/egs2/chime4/enh1/local/simu_ext_chime4_data_prep.sh b/egs2/chime4/enh1/local/simu_ext_chime4_data_prep.sh index 08df7d0dc4c..5cd50773aeb 100755 --- a/egs2/chime4/enh1/local/simu_ext_chime4_data_prep.sh +++ b/egs2/chime4/enh1/local/simu_ext_chime4_data_prep.sh @@ -85,6 +85,8 @@ elif [[ "$track" == "6" ]]; then done for x in $list_set; do + # drop the second channel to follow the convention in CHiME-4 + # see P27 in https://hal.inria.fr/hal-01399180/file/vincent_CSL16.pdf mix-mono-wav-scp.py ${x}_wav.CH{1,3,4,5,6}.scp > ${x}_wav.scp mix-mono-wav-scp.py ${x}_spk1_wav.CH{1,3,4,5,6}.scp > ${x}_spk1_wav.scp sed -E "s#\.Clean\.wav#\.Noise\.wav#g" ${x}_spk1_wav.scp > ${x}_noise_wav.scp diff --git a/egs2/chime4/enh1/run.sh b/egs2/chime4/enh1/run.sh index cf95ee85954..60ee54ec435 100755 --- a/egs2/chime4/enh1/run.sh +++ b/egs2/chime4/enh1/run.sh @@ -25,7 +25,7 @@ test_sets="et05_simu_isolated_1ch_track" --fs ${sample_rate} \ --ngpu 2 \ --spk_num 1 \ - --ref_channel 4 \ + --ref_channel 3 \ --local_data_opts "--extra-annotations ${extra_annotations} --stage 1 --stop-stage 2" \ --enh_config conf/tuning/train_enh_conv_tasnet.yaml \ --use_dereverb_ref false \ diff --git a/egs2/dsing/asr1/RESULTS.md b/egs2/dsing/asr1/RESULTS.md new file mode 100644 index 00000000000..0cdd661e049 --- /dev/null +++ b/egs2/dsing/asr1/RESULTS.md @@ -0,0 +1,55 @@ + +# RESULTS +## Environments +- date: `Sat Mar 19 23:02:37 EDT 2022` +- python version: `3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]` +- espnet version: `espnet 0.10.7a1` +- pytorch version: `pytorch 1.10.1` +- Git hash: `c1ed71c6899e54c0b3dad82687886b1183cd0885` + - Commit date: `Wed Mar 16 23:34:49 2022 -0400` + +## asr_train_asr_conformer7_hubert_ll60k_large_raw_bpe500_sp +- model: https://huggingface.co/espnet/ftshijt_espnet2_asr_dsing_hubert_conformer +### WER + +|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err| +|---|---|---|---|---|---|---|---|---| +|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_latest/dev|482|4018|83.6|9.4|7.0|6.4|22.8|58.3| +|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_latest/test|480|4632|81.4|12.3|6.3|4.5|23.1|52.1| + +### CER + +|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err| +|---|---|---|---|---|---|---|---|---| +|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_latest/dev|482|18692|88.5|3.1|8.4|5.9|17.4|58.3| +|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_latest/test|480|21787|87.9|4.3|7.8|4.5|16.6|52.1| + +### TER + +|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err| +|---|---|---|---|---|---|---|---|---| +|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_latest/dev|482|6097|82.2|7.1|10.7|5.7|23.5|58.3| +|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_latest/test|480|7736|81.7|9.2|9.1|4.0|22.3|52.1| + +## asr_train_asr_raw_bpe500_sp +- model: 
https://huggingface.co/espnet/ftshijt_espnet2_asr_dsing_transformer +### WER + +|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err| +|---|---|---|---|---|---|---|---|---| +|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_valid.acc.ave/dev|482|4018|77.0|16.2|6.8|4.0|27.0|65.1| +|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_valid.acc.ave/test|480|4632|76.1|17.3|6.6|3.7|27.6|57.7| + +### CER + +|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err| +|---|---|---|---|---|---|---|---|---| +|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_valid.acc.ave/dev|482|18692|85.0|5.8|9.2|4.2|19.2|65.1| +|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_valid.acc.ave/test|480|21787|84.9|6.3|8.8|4.2|19.3|57.7| + +### TER + +|dataset|Snt|Wrd|Corr|Sub|Del|Ins|Err|S.Err| +|---|---|---|---|---|---|---|---|---| +|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_valid.acc.ave/dev|482|6097|75.2|12.8|12.0|4.1|28.9|65.1| +|decode_asr_lm_lm_train_lm_bpe500_valid.loss.ave_asr_model_valid.acc.ave/test|480|7736|75.3|14.3|10.4|4.1|28.8|57.7| \ No newline at end of file diff --git a/egs2/dsing/asr1/asr.sh b/egs2/dsing/asr1/asr.sh new file mode 120000 index 00000000000..60b05122cfd --- /dev/null +++ b/egs2/dsing/asr1/asr.sh @@ -0,0 +1 @@ +../../TEMPLATE/asr1/asr.sh \ No newline at end of file diff --git a/egs2/dsing/asr1/cmd.sh b/egs2/dsing/asr1/cmd.sh new file mode 100644 index 00000000000..2aae6919fef --- /dev/null +++ b/egs2/dsing/asr1/cmd.sh @@ -0,0 +1,110 @@ +# ====== About run.pl, queue.pl, slurm.pl, and ssh.pl ====== +# Usage: <cmd>.pl [options] JOB=1:<nj> <log> <command...> +# e.g. +# run.pl --mem 4G JOB=1:10 echo.JOB.log echo JOB +# +# Options: +# --time