Commit
Merge branch 'espnet:master' into master
roshansh-cmu authored Apr 30, 2022
2 parents 835033c + b757b89 commit ffe7c58
Showing 370 changed files with 10,451 additions and 175 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/docker.yml
@@ -12,7 +12,7 @@ on:
jobs:
docker:
runs-on: ubuntu-latest
if: ${{ github.event.pull_request.merged == 'true' }}
if: github.event.pull_request.merged == true
steps:
- uses: actions/checkout@v2

70 changes: 59 additions & 11 deletions README.md
@@ -77,12 +77,12 @@ ESPnet uses [pytorch](http://pytorch.org/) as a deep learning engine and also fo
- Self-supervised learning representations as features, using upstream models in [S3PRL](https://github.com/s3prl/s3prl) in frontend.
- Set `frontend` to be `s3prl`
- Select any upstream model by setting the `frontend_conf` to the corresponding name.
- Transfer Learning :
- easy usage and transfers from models previously trained by your group, or models from [ESPnet huggingface repository](https://huggingface.co/espnet).
- [Documentation](https://github.com/espnet/espnet/tree/master/egs2/mini_an4/asr1/transfer_learning.md) and [toy example runnable on colab](https://github.com/espnet/notebook/blob/master/espnet2_asr_transfer_learning_demo.ipynb).
- Streaming Transformer/Conformer ASR with blockwise synchronous beam search.
- Restricted Self-Attention based on [Longformer](https://arxiv.org/abs/2004.05150) as an encoder for long sequences

### SUM: Speech Summarization
- End to End Speech Summarization Recipe for Instructional Videos using Restricted Self-Attention [[Sharma et al., 2022]](https://arxiv.org/abs/2110.06263)

Demonstration
- Real-time ASR demo with ESPnet2 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/espnet/notebook/blob/master/espnet2_asr_realtime_demo.ipynb)
- [Gradio](https://github.com/gradio-app/gradio) Web Demo on [Huggingface Spaces](https://huggingface.co/docs/hub/spaces). Check out the [Web Demo](https://huggingface.co/spaces/akhaliq/espnet2_asr)
@@ -133,15 +133,14 @@ To train the neural vocoder, please check the following repositories:
- Multi-speaker speech separation
- Unified encoder-separator-decoder structure for time-domain and frequency-domain models
- Encoder/Decoder: STFT/iSTFT, Convolution/Transposed-Convolution
- Separators: BLSTM, Transformer, Conformer, [TasNet](https://arxiv.org/abs/1809.07454), [DPRNN](https://arxiv.org/abs/1910.06379), [DC-CRN](https://web.cse.ohio-state.edu/~wang.77/papers/TZW.taslp21.pdf), [DCCRN](https://arxiv.org/abs/2008.00264), Neural Beamformers, etc.
- Separators: BLSTM, Transformer, Conformer, [TasNet](https://arxiv.org/abs/1809.07454), [DPRNN](https://arxiv.org/abs/1910.06379), [SkiM](https://arxiv.org/abs/2201.10800), [SVoice](https://arxiv.org/abs/2011.02329), [DC-CRN](https://web.cse.ohio-state.edu/~wang.77/papers/TZW.taslp21.pdf), [DCCRN](https://arxiv.org/abs/2008.00264), [Deep Clustering](https://ieeexplore.ieee.org/document/7471631), [Deep Attractor Network](https://pubmed.ncbi.nlm.nih.gov/29430212/), [FaSNet](https://arxiv.org/abs/1909.13387), [iFaSNet](https://arxiv.org/abs/1910.14104), Neural Beamformers, etc.
- Flexible ASR integration: working as an individual task or as the ASR frontend
- Easy to import pretrained models from [Asteroid](https://github.com/asteroid-team/asteroid)
- Both the pre-trained models from Asteroid and the specific configuration are supported.

Demonstration
- Interactive SE demo with ESPnet2 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1fjRJCh96SoYLZPRxsjF9VDv4Q2VoIckI?usp=sharing)


### ST: Speech Translation & MT: Machine Translation
- **State-of-the-art performance** in several ST benchmarks (comparable/superior to cascaded ASR and MT)
- Transformer based end-to-end ST (new!)
@@ -152,9 +151,34 @@ Demonstration
- End-to-end VC based on cascaded ASR+TTS (Baseline system for Voice Conversion Challenge 2020!)

### SLU: Speech Language Understanding
- Predicting intent by directly classifying it as one of intent or decoding by character
- Transformer & RNN based encoder-decoder model
- Establish SOTA results with spectral augmentation (Performs better than reported results of pretrained model on Fluent Speech Command Dataset)
- Architecture
- Transformer based Encoder
- Conformer based Encoder
- RNN based Decoder
- Transformer based Decoder
- Support Multitasking with ASR
- Predict both intent and ASR transcript
- Support Multitasking with NLU
- Deliberation encoder based 2 pass model
- Support using pretrained ASR models
- Hubert
- Wav2vec2
- VQ-APC
- TERA and more ...
- Support using pretrained NLP models
- BERT
- MPNet And more...
- Various language support
- En / Jp / Zn / Nl / And more...
- Supports using context from previous utterances
- Supports using other tasks like SE in pipeline manner
Demonstration
- Performing noisy spoken language understanding using speech enhancement model followed by spoken language understanding model. [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/14nCrJ05vJcQX0cJuXjbMVFWUHJ3Wfb6N?usp=sharing)
- Integrated to [Huggingface Spaces](https://huggingface.co/spaces) with [Gradio](https://github.com/gradio-app/gradio). See SLU demo on multiple languages: [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/Siddhant/ESPnet2-SLU)


### SUM: Speech Summarization
- End to End Speech Summarization Recipe for Instructional Videos using Restricted Self-Attention [[Sharma et al., 2022]](https://arxiv.org/abs/2110.06263)

### DNN Framework
- Flexible network architecture thanks to chainer and pytorch
@@ -532,11 +556,33 @@ You can download converted samples of the cascade ASR+TTS baseline system [here]

### SLU results

<details><summary>ESPnet2</summary><div>
<details><summary>expand</summary><div>


We list the performance on various SLU tasks and datasets using the metric reported in the original dataset paper.

| Task | Dataset | Metric | Result | Pretrained Model |
| ----------------------------------------------------------------- | :-------------: | :-------------: | :-------------: | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| Intent Classification | SLURP | Acc | 86.3 | [link](https://github.com/espnet/espnet/tree/master/egs2/slurp/asr1/README.md) |
| Intent Classification | FSC | Acc | 99.6 | [link](https://github.com/espnet/espnet/tree/master/egs2/fsc/asr1/README.md) |
| Intent Classification | FSC Unseen Speaker Set | Acc | 98.6 | [link](https://github.com/espnet/espnet/tree/master/egs2/fsc_unseen/asr1/README.md) |
| Intent Classification | FSC Unseen Utterance Set | Acc | 86.4 | [link](https://github.com/espnet/espnet/tree/master/egs2/fsc_unseen/asr1/README.md) |
| Intent Classification | FSC Challenge Speaker Set | Acc | 97.5 | [link](https://github.com/espnet/espnet/tree/master/egs2/fsc_challenge/asr1/README.md) |
| Intent Classification | FSC Challenge Utterance Set | Acc | 78.5 | [link](https://github.com/espnet/espnet/tree/master/egs2/fsc_challenge/asr1/README.md) |
| Intent Classification | SNIPS | F1 | 91.7 | [link](https://github.com/espnet/espnet/tree/master/egs2/snips/asr1/README.md) |
| Intent Classification | Grabo (Nl) | Acc | 97.2 | [link](https://github.com/espnet/espnet/tree/master/egs2/grabo/asr1/README.md) |
| Intent Classification | CAT SLU MAP (Zn) | Acc | 78.9 | [link](https://github.com/espnet/espnet/tree/master/egs2/catslu/asr1/README.md) |
| Intent Classification | Google Speech Commands | Acc | 98.4 | [link](https://github.com/espnet/espnet/tree/master/egs2/speechcommands/asr1/README.md) |
| Slot Filling | SLURP | SLU-F1 | 71.9 | [link](https://github.com/espnet/espnet/tree/master/egs2/slurp_entity/asr1/README.md) |
| Dialogue Act Classification | Switchboard | Acc | 67.5 | [link](https://github.com/espnet/espnet/tree/master/egs2/swbd_da/asr1/README.md) |
| Dialogue Act Classification | Jdcinal (Jp) | Acc | 67.4 | [link](https://github.com/espnet/espnet/tree/master/egs2/jdcinal/asr1/README.md) |
| Emotion Recognition | IEMOCAP | Acc | 69.4 | [link](https://github.com/espnet/espnet/tree/master/egs2/iemocap/asr1/README.md) |
| Emotion Recognition | swbd_sentiment | Macro F1 | 61.4 | [link](https://github.com/espnet/espnet/tree/master/egs2/swbd_sentiment/asr1/README.md) |
| Emotion Recognition | slue_voxceleb | Macro F1 | 44.0 | [link](https://github.com/espnet/espnet/tree/master/egs2/slue-voxceleb/asr1/README.md) |

- Transformer based SLU for Fluent Speech Command Dataset

If you want to check the results of the other recipes, please check `egs2/<name_of_recipe>/asr1/RESULTS.md`.

In SLU, the objective is to infer the meaning or intent of a spoken utterance. The [Fluent Speech Command Dataset](https://fluent.ai/fluent-speech-commands-a-dataset-for-spoken-language-understanding-research/) describes an intent as a combination of 3 slot values: action, object and location. You can see baseline results on this dataset [here](https://github.com/espnet/espnet/blob/master/egs2/fsc/asr1/RESULTS.md).


</div></details>
@@ -689,6 +735,8 @@ See the module documentation for more information.
It is recommended to use models with RNN-based encoders (such as BLSTMP) for aligning large audio files;
rather than using Transformer models that have a high memory consumption on longer audio data.
The sample rate of the audio must be consistent with that of the data used in training; adjust with `sox` if needed.
Also, we can use this tool to provide token-level segmentation information if we prepare a list of tokens instead of that of utterances in the `text` file. See the discussion in https://github.com/espnet/espnet/issues/4278#issuecomment-1100756463.
</div></details>
47 changes: 47 additions & 0 deletions ci/test_integration_espnet2.sh
@@ -100,6 +100,50 @@ if python3 -c "import fairseq" &> /dev/null; then
cd "${cwd}"
fi

# [ESPnet2] test enh_asr1 recipe
if python -c 'import torch as t; from distutils.version import LooseVersion as L; assert L(t.__version__) >= L("1.2.0")' &> /dev/null; then
cd ./egs2/mini_an4/enh_asr1
echo "==== [ESPnet2] ENH_ASR ==="
./run.sh --ngpu 0 --stage 0 --stop-stage 15 --skip-upload_hf false --feats-type "raw" --spk-num 1 --enh_asr_args "--max_epoch=1 --enh_separator_conf num_spk=1" --python "${python}"
# Remove generated files in order to reduce the disk usage
rm -rf exp dump data
cd "${cwd}"
fi
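
The guard above compares versions with `distutils.version.LooseVersion`; `distutils` was deprecated in Python 3.10 and removed in 3.12 (PEP 632). As a rough sketch of the same "at least version X" check done in the shell, the helper below relies on `sort -V`. The name `version_ge` and the placeholder version string are hypothetical and not part of the ESPnet CI script, and GNU coreutils is assumed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the ESPnet CI script): succeeds when
# version $1 is greater than or equal to version $2.
# Assumes GNU coreutils, whose `sort -V` orders version strings naturally.
version_ge() {
    local have="$1" want="$2"
    # The older of the two versions sorts first; if that is `want`,
    # then `have` is at least as new.
    [ "$(printf '%s\n' "${want}" "${have}" | sort -V | head -n 1)" = "${want}" ]
}

# Example: guard a recipe on a version string.
torch_version="1.10.0"   # placeholder; a real script would query PyTorch
if version_ge "${torch_version}" "1.2.0"; then
    echo "torch ${torch_version} is new enough"
fi
```

A plain string comparison would rank `1.10.0` below `1.2.0`, which is why a version-aware sort is used here.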

# [ESPnet2] test st recipe
cd ./egs2/mini_an4/st1
echo "==== [ESPnet2] ST ==="
./run.sh --stage 1 --stop-stage 1
feats_types="raw fbank_pitch"
token_types="bpe char"
for t in ${feats_types}; do
./run.sh --stage 2 --stop-stage 4 --feats-type "${t}" --python "${python}"
done
for t in ${token_types}; do
./run.sh --stage 5 --stop-stage 5 --tgt_token_type "${t}" --src_token_type "${t}" --python "${python}"
done
for t in ${feats_types}; do
for t2 in ${token_types}; do
echo "==== feats_type=${t}, token_types=${t2} ==="
./run.sh --ngpu 0 --stage 6 --stop-stage 13 --skip-upload false --feats-type "${t}" --tgt_token_type "${t2}" --src_token_type "${t2}" \
--st-args "--max_epoch=1" --lm-args "--max_epoch=1" --inference_args "--beam_size 5" --python "${python}"
done
done
echo "==== feats_type=raw, token_types=bpe, model_conf.extract_feats_in_collect_stats=False, normalize=utt_mvn ==="
./run.sh --ngpu 0 --stage 10 --stop-stage 13 --skip-upload false --feats-type "raw" --tgt_token_type "bpe" --src_token_type "bpe" \
--feats_normalize "utterance_mvn" --lm-args "--max_epoch=1" --inference_args "--beam_size 5" --python "${python}" \
--st-args "--model_conf extract_feats_in_collect_stats=false --max_epoch=1"

echo "==== use_streaming, feats_type=raw, token_types=bpe, model_conf.extract_feats_in_collect_stats=False, normalize=utt_mvn ==="
./run.sh --use_streaming true --ngpu 0 --stage 6 --stop-stage 13 --skip-upload false --feats-type "raw" --tgt_token_type "bpe" --src_token_type "bpe" \
--feats_normalize "utterance_mvn" --lm-args "--max_epoch=1" --inference_args "--beam_size 5" --python "${python}" \
--st-args "--model_conf extract_feats_in_collect_stats=false --max_epoch=1 --encoder=contextual_block_transformer --decoder=transformer
--encoder_conf block_size=40 --encoder_conf hop_size=16 --encoder_conf look_ahead=16"

# Remove generated files in order to reduce the disk usage
rm -rf exp dump data
cd "${cwd}"
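
The nested loops in the ST test run stages 6 to 13 once per feature type and token type pair. A small dry-run sketch, reusing the two lists from the script, shows which combinations that covers without invoking `run.sh`. The function name `list_st_combinations` is a hypothetical helper introduced only for illustration.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the feature/token combinations covered by the
# nested loops, without invoking run.sh.
feats_types="raw fbank_pitch"
token_types="bpe char"

# Hypothetical helper, not part of the CI script.
list_st_combinations() {
    local t t2
    # The lists are intentionally unquoted so they split on whitespace,
    # mirroring the loops in the CI script.
    for t in ${feats_types}; do
        for t2 in ${token_types}; do
            echo "feats_type=${t} token_type=${t2}"
        done
    done
}

list_st_combinations
```

With two feature types and two token types this yields four training runs, in addition to the two single-configuration runs (`utterance_mvn` and streaming) that follow the loops.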

# [ESPnet2] Validate configuration files
echo "<blank>" > dummy_token_list
echo "==== [ESPnet2] Validation configuration files ==="
@@ -124,6 +168,9 @@ if python3 -c 'import torch as t; from distutils.version import LooseVersion as
for f in egs2/*/ssl1/conf/train*.yaml; do
${python} -m espnet2.bin.hubert_train --config "${f}" --iterator_type none --normalize none --dry_run true --output_dir out --token_list dummy_token_list
done
for f in egs2/*/enh_asr1/conf/train_enh_asr*.yaml; do
${python} -m espnet2.bin.enh_s2t_train --config "${f}" --iterator_type none --dry_run true --output_dir out --token_list dummy_token_list
done
fi
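
One thing to keep in mind with glob-driven loops like the ones above: in bash, a pattern that matches no file is passed through literally unless `nullglob` is set, so the loop body would run once with the unexpanded pattern as `${f}`. A minimal sketch of a guard follows; the helper name `list_train_configs` and its `root` argument are assumptions made for illustration and do not appear in the CI script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the enh_asr1 training configs found under a
# repository root, and print nothing when no file matches.
list_train_configs() {
    local root="$1" f
    for f in "${root}"/egs2/*/enh_asr1/conf/train_enh_asr*.yaml; do
        # Without nullglob an unmatched pattern stays literal, so skip
        # anything that is not an existing file.
        [ -e "${f}" ] || continue
        echo "${f}"
    done
}

# Usage: list_train_configs .
```

The same effect can be had with `shopt -s nullglob`, at the cost of changing glob behaviour for the rest of the script.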

# These files must be same each other.
2 changes: 1 addition & 1 deletion egs/commonvoice/asr1/local/download_and_untar.sh
@@ -16,7 +16,7 @@ fi

if [ $# -ne 3 ]; then
echo "Usage: $0 [--remove-archive] <data-base> <url> <filename>"
echo "e.g.: $0 /export/data/ https://common-voice-data-download.s3.amazonaws.com/cv_corpus_v1.tar.gz cv_corpus_v1.tar.gz"
echo "e.g.: $0 /export/data/ https://us.openslr.org/resources/108/FR.tgz"
echo "With --remove-archive it will remove the archive after successfully un-tarring it."
exit 0;
fi
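
The usage string lists three positional arguments (`<data-base> <url> <filename>`), while the updated example passes only a data directory and a URL. A minimal sketch, assuming the archive name is simply the last path component of the URL, shows how the third argument could be derived for such a call. The function `archive_name_from_url` is hypothetical and is not part of `download_and_untar.sh`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive an archive file name from a download URL by
# keeping only the text after the last slash.
archive_name_from_url() {
    local url="$1"
    # ${url##*/} removes the longest prefix ending in "/".
    echo "${url##*/}"
}

url="https://us.openslr.org/resources/108/FR.tgz"
filename="$(archive_name_from_url "${url}")"
echo "would call: download_and_untar.sh /export/data/ ${url} ${filename}"
```

This only works for URLs whose last component is the file name; URLs with query strings or redirects would need different handling.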