SPGISpeech recipe #334

Merged
merged 20 commits into k2-fsa:master on May 16, 2022

Conversation

desh2608
Collaborator

This is a WIP PR for SPGISpeech. I am opening it here early to discuss some issues I have been facing while training a transducer model.

I first trained a conformer CTC model on the data (without speed perturbation) on 4 GPUs for 20 epochs, and the training curve looked reasonable: tensorboard. I was able to get a 4% WER with CTC decoding.

I then tried training a pruned_transducer_stateless2 model (with speed perturbation), but the training curve looks weird: tensorboard. I am not sure if this is because of the 3x speed perturbation (I'm now training a model without speed perturbation to verify). I was hoping someone could suggest what may be causing these periodic ups and downs.

Adding screenshot of training curve here:

image

(Please ignore the README files etc. for now since this is only a rough draft.)

@pzelasko
Collaborator

pzelasko commented Apr 26, 2022

Try pre-shuffling the input cutset like this: gunzip -c train_cuts.jsonl.gz | shuf | gzip -c > train_cuts_shuf.jsonl.gz and then re-try training. DynamicBucketingSampler has a small-ish reservoir-sampling shuffle buffer of 10k cuts, and the data is likely sorted by recording sessions / speakers / topics / etc., which is why you see these patterns. Pre-shuffling will likely help.

EDIT: alternatively, try setting the shuffle_buffer_size to something larger like 100k+ cuts; the memory usage will increase noticeably but probably not drastically.

EDIT2: you can also try splitting the dataset with cuts.split_lazy and reading it for training with CutSet.mux(CutSet.from_jsonl_lazy(p) for p in split_paths); I think that would take care of ensuring sufficient randomness.
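
For concreteness, here is a rough Python sketch of the second and third options with lhotse (the first option is the shell one-liner above). The argument names and values follow the suggestions here and are not verified against a particular lhotse version:

# Hedged sketch of the two in-library options above (illustrative values only).
from pathlib import Path

from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

cuts = CutSet.from_jsonl_lazy("train_cuts.jsonl.gz")

# Option 2: enlarge the sampler's reservoir-shuffling buffer (default is ~10k cuts).
sampler = DynamicBucketingSampler(
    cuts,
    max_duration=100,  # illustrative value
    shuffle=True,
    shuffle_buffer_size=100_000,
)

# Option 3: pre-split the manifest into shards (done once with cuts.split_lazy),
# then interleave the shards lazily with mux() at training time.
split_paths = sorted(Path("train_split").glob("*.jsonl.gz"))
shuffled = CutSet.mux(*[CutSet.from_jsonl_lazy(p) for p in split_paths])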

@desh2608
Collaborator Author

desh2608 commented Apr 26, 2022

@pzelasko Thanks for the suggestions. The pre-shuffling idea looks like the simplest, so I'll try that first and see how it goes.

@desh2608
Collaborator Author

Pre-shuffling the training cuts seems to have fixed the issue. New training curve after 1 epoch (in blue):
image

@pzelasko
Collaborator

Very cool. Can you please compare the WERs of both runs, at different epochs and after the full training? I'm curious whether it actually affects the performance of the system in a significant way or not.

@desh2608
Collaborator Author

desh2608 commented Apr 29, 2022

Very cool. Can you please compare the WERs of both runs, at different epochs and after the full training? I'm curious whether it actually affects the performance of the system in a significant way or not.

For the original training (orange curve), the WERs on a small dev set (decoded with fast_beam_search) after the first 4 epochs are:

Epoch   WER (old)   WER (new)
1       4.26        4.34
2       3.77        3.86
3       3.58        -
4       3.28        -

As a comparison, I got about 4% WER with a conformer-CTC model (though trained without speed perturbation). The corrected model hasn't trained far enough yet, so I haven't tried decoding with it, but the periodicity doesn't seem to have much (if any) impact on the model performance.

Update: added WERs for the new model.

@desh2608
Collaborator Author

desh2608 commented May 2, 2022

@csukuangfj I have been wondering recently about the batch size (max-duration) that we can use for the pruned transducer models. Even with 4 GPUs of 24G memory each, I am only able to use --max-duration 60, otherwise I run into OOM errors thrown like this. With 8 GPUs of 32G memory each, I run into memory errors even with --max-duration 120. However, I have seen in the LibriSpeech recipe that you use batch sizes of up to 300 when training with 8 GPUs?

(Related to @wgb14's observation here)

@csukuangfj
Collaborator

csukuangfj commented May 2, 2022

@csukuangfj I have been wondering recently about the batch size (max-duration) that we can use for the pruned transducer models. Even with 4 GPUs of 24G memory each, I am only able to use --max-duration 60, otherwise I run into OOM errors thrown like this. With 8 GPUs of 32G memory each, I run into memory errors even with --max-duration 120. However, I have seen in the LibriSpeech recipe that you use batch sizes of up to 300 when training with 8 GPUs?

(Related to @wgb14's observation here)

What's the distribution of your utterance durations? You may need to filter out short and long utterances.
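
For reference, a minimal sketch of such a filter with lhotse; the 1-20 second bounds below are only illustrative, not values taken from this recipe:

from lhotse import CutSet

cuts = CutSet.from_jsonl_lazy("data/manifests/cuts_train.jsonl.gz")

def remove_short_and_long_utt(cut) -> bool:
    # Keep only cuts whose duration falls inside the chosen (illustrative) bounds.
    return 1.0 <= cut.duration <= 20.0

cuts = cuts.filter(remove_short_and_long_utt)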

@desh2608
Collaborator Author

desh2608 commented May 2, 2022

What's the distribution of your utterance durations? You may need to filter out short and long utterances.

In [4]: cuts = load_manifest_lazy('data/manifests/cuts_train.jsonl.gz')

In [5]: cuts.describe()
Cuts count: 5886320
Total duration (hours): 15070.1
Speech duration (hours): 15070.1 (100.0%)
***
Duration statistics (seconds):
mean    9.2
std     2.8
min     4.6
25%     6.9
50%     8.9
75%     11.2
99%     16.0
99.5%   16.3
99.9%   16.6
max     16.7

I would say it's pretty well behaved: there are no very short or very long utterances.

@danpovey
Collaborator

danpovey commented May 3, 2022

When you get an error, please print out the supervision object for the minibatch that fails. Then we can figure out what the issue is specifically, e.g. whether the batch is made of short utterances, long utterances, or whether there is a length mismatch.
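
A minimal sketch of that debugging step (not code from this PR), assuming the usual icefall batch layout where batch["supervisions"] carries the texts and frame counts:

def train_step_with_debug(model, batch, compute_loss_fn):
    # Run one training step; if it fails (e.g. CUDA OOM), dump the supervisions
    # of the offending minibatch so the utterance lengths can be inspected.
    try:
        loss = compute_loss_fn(model, batch)
        loss.backward()
        return loss
    except RuntimeError:
        print("Supervisions of the failing minibatch:")
        print(batch["supervisions"])
        raise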

@desh2608
Collaborator Author

I finished training a pruned transducer model and here are the decoding results on the evaluation set:

Decoding method         val WER
greedy search           2.40
beam search             2.24
modified beam search    2.30
fast beam search        2.35

Here is the tensorboard: link

As a comparison, the SPGISpeech paper reports 2.6% using NeMo's conformer CTC and 2.3% using ESPnet's conformer encoder-decoder model. I believe the recipe is ready to be merged at this point.

@csukuangfj are there some instructions for how to prepare the model to be uploaded to HuggingFace? For instance, the above WERs are using --avg-last-n=10, so I suppose I need to average those models and prepare a "final" checkpoint that should be uploaded?

@csukuangfj
Collaborator

For instance, the above WERs are using --avg-last-n=10, so I suppose I need to average those models and prepare a "final" checkpoint that should be uploaded?

Please have a look at https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless2/export.py

You need to provide --iter xx --avg 10, where xx is your latest checkpoint-xx.pt.

After running export.py, you should get a file pretrained.pt, which can be uploaded to huggingface.

To upload files to huggingface, you first need to register an account. After creating the account, you can create a repo, clone it to your local computer, copy the files into your local repo, and then use git push to upload them to huggingface.

One thing to note is that you have to run sudo apt-get install git-lfs before cloning the repo from huggingface.
(*.pt files are tracked by git lfs by default in the cloned repo.)
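
As a quick sanity check before uploading (a sketch only; it assumes export.py saves the averaged parameters under a "model" key, as the LibriSpeech export.py does):

import torch

# Load the exported checkpoint on CPU and confirm it contains the model weights.
ckpt = torch.load(
    "pruned_transducer_stateless2/exp/pretrained.pt",  # illustrative path
    map_location="cpu",
)
state_dict = ckpt["model"]  # assumption: "model" key, as in the LibriSpeech export.py
print(f"{len(state_dict)} parameter tensors in the exported checkpoint")
# model.load_state_dict(state_dict)  # with the model constructed as in train.py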

@desh2608
Collaborator Author

I have uploaded the pretrained model to HF. This PR is ready for review.

desh2608 changed the title from "[WIP] SPGISpeech recipe" to "SPGISpeech recipe" on May 13, 2022
@csukuangfj
Collaborator

@desh2608

From https://huggingface.co/desh2608/icefall-asr-spgispeech-pruned-transducer-stateless2/blob/main/log/modified_beam_search/errs-dev-beam_size_4-epoch-28-avg-15-beam-4.txt
there are some insertions at the end of utterances.

You can fix the insertion errors at the end of utterances by using #358 and re-running the decoding for greedy search and modified beam search. I think it will help reduce the WERs.

You need to
(1) Use https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless2/beam_search.py
(2) In your decode.py, use

hyp_tokens = greedy_search_batch(
    model=model,
    encoder_out=encoder_out,
    encoder_out_lens=encoder_out_lens,
)

hyp_tokens = modified_beam_search(
    model=model,
    encoder_out=encoder_out,
    encoder_out_lens=encoder_out_lens,
    beam=params.beam_size,
)

That is, add encoder_out_lens=encoder_out_lens.

import torch
import torch.nn as nn
from asr_datamodule import SPGISpeechAsrDataModule
from beam_search import (
Collaborator

If you are going to synchronize with the latest master, please use

vimdiff ./decode.py  /path/to/librispeech/ASR/pruned_transducer_stateless2/decode.py

to find the differences.

Collaborator Author

Okay thanks, I will synchronize and update.

Collaborator

Please also fix the style issues reported by GitHub actions.

You can fix them locally by following
https://github.com/k2-fsa/icefall/blob/master/docs/source/contributing/code-style.rst

Collaborator Author

I think the remaining style issues are from librispeech code that I have soft-linked.

Collaborator

Ah, ok. Thanks! Merging.

@desh2608
Collaborator Author

@desh2608

From https://huggingface.co/desh2608/icefall-asr-spgispeech-pruned-transducer-stateless2/blob/main/log/modified_beam_search/errs-dev-beam_size_4-epoch-28-avg-15-beam-4.txt there are some insertions at the end of utterances.

You can fix the insertion errors at the end of utterances by using #358 and re-running the decoding for greedy search and modified beam search. I think it will help reduce the WERs.

You need to (1) Use https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless2/beam_search.py (2) In your decode.py, use

hyp_tokens = greedy_search_batch(
    model=model,
    encoder_out=encoder_out,
    encoder_out_lens=encoder_out_lens,
)

hyp_tokens = modified_beam_search(
    model=model,
    encoder_out=encoder_out,
    encoder_out_lens=encoder_out_lens,
    beam=params.beam_size,
)

That is, add encoder_out_lens=encoder_out_lens.

The WER for modified beam search improved from 2.30% to 2.24% with this change. WER remained unchanged for greedy search.

csukuangfj merged commit 5aafbb9 into k2-fsa:master on May 16, 2022
@csukuangfj
Collaborator

@desh2608

Could you please upload a torchscript model to
https://huggingface.co/desh2608/icefall-asr-spgispeech-pruned-transducer-stateless2/tree/main/exp

You can use export.py --jit=1 to obtain a torchscript model.

I would like to add the pre-trained torchscript model to
https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition
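
For reference, a torchscript model exported with --jit=1 can be loaded without the original Python model definitions (a sketch; the cpu_jit.pt filename is an assumption based on the LibriSpeech export.py):

import torch

# Load the jitted model on CPU; no icefall model code is needed at this point.
model = torch.jit.load(
    "pruned_transducer_stateless2/exp/cpu_jit.pt",  # illustrative path
    map_location="cpu",
)
model.eval()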

@desh2608
Collaborator Author

@desh2608

Could you please upload a torchscript model to

https://huggingface.co/desh2608/icefall-asr-spgispeech-pruned-transducer-stateless2/tree/main/exp

You can use export.py --jit=1 to obtain a torchscript model.

I would like to add the pre-trained torchscript model to

https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition

I will do it this weekend (I'm on an internship right now and have limited access to the JHU cluster).

@jtrmal
Contributor

jtrmal commented Jul 20, 2022 via email

@csukuangfj
Collaborator

@desh2608

Could you please upload a torchscript model to

https://huggingface.co/desh2608/icefall-asr-spgispeech-pruned-transducer-stateless2/tree/main/exp

You can use export.py --jit=1 to obtain a torchscript model.

I would like to add the pre-trained torchscript model to

https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition

I will do it this weekend (I'm on an internship right now and have limited access to the JHU cluster).

Thanks!

@desh2608
Collaborator Author

@csukuangfj sorry it took a while (I just got back from my internship). I have uploaded a jitted model here.

@csukuangfj
Collaborator

@csukuangfj sorry it took a while (I just got back from my internship). I have uploaded a jitted model here.

Thanks!
