SPGISpeech recipe #334
Conversation
Try pre-shuffling the input cutset like this:
EDIT: alternatively, try setting the
EDIT2: you can also try splitting the dataset with
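Pre-shuffling a cutset can be done with lhotse's `CutSet.shuffle()`, or equivalently by shuffling the lines of the `jsonl.gz` manifest directly. A minimal stdlib sketch of the latter (file names and the `seed` are illustrative):

```python
import gzip
import random

def preshuffle_manifest(in_path: str, out_path: str, seed: int = 42) -> None:
    """Read a jsonl.gz manifest, shuffle its lines, and write it back out."""
    with gzip.open(in_path, "rt") as f:
        lines = f.readlines()  # one JSON record per line
    random.Random(seed).shuffle(lines)
    with gzip.open(out_path, "wt") as f:
        f.writelines(lines)
```

A fixed seed keeps the shuffle reproducible across runs, which matters when comparing training curves.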
@pzelasko Thanks for the suggestions. The pre-shuffling idea looks like the simplest, so I'll try that first and see how it goes.
Very cool. Can you please compare the WER of both runs, at different epochs and after the whole training? I'm curious whether it actually affects the performance of the system in a significant way.
For the original training (orange curve), the WER on a small dev set (decoded with
As a comparison, I got about 4% WER with a conformer-CTC model (but trained without speed perturbation). The corrected model hasn't trained far enough yet, so I haven't tried decoding with that one, but the periodicity doesn't seem to have much (if any) impact on the model's performance. Update: added WER with the new model.
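For reference, the WER figures above are word-level edit distance divided by the reference length; a minimal self-contained sketch (icefall's actual scoring uses kaldialign, not this helper):

```python
def wer(ref: list, hyp: list) -> float:
    """Word error rate: (substitutions + insertions + deletions) / len(ref)."""
    # Standard dynamic-programming edit distance over words, one row at a time.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(
                d[j] + 1,                # deletion of a reference word
                d[j - 1] + 1,            # insertion of a hypothesis word
                prev_diag + (r != h),    # substitution (or free match)
            )
            prev_diag, d[j] = d[j], cur
    return d[-1] / len(ref)
```

For example, `wer("a b c d".split(), "a x c".split())` counts one substitution and one deletion over four reference words, i.e. 0.5.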
@csukuangfj I have been wondering recently about the batch size (
What's the distribution of your utterance duration? You may need to filter out short and long utterances.
In [4]: cuts = load_manifest_lazy('data/manifests/cuts_train.jsonl.gz')
In [5]: cuts.describe()
Cuts count: 5886320
Total duration (hours): 15070.1
Speech duration (hours): 15070.1 (100.0%)
***
Duration statistics (seconds):
mean 9.2
std 2.8
min 4.6
25% 6.9
50% 8.9
75% 11.2
99% 16.0
99.5% 16.3
99.9% 16.6
max 16.7

I would say it's pretty uniform.
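Filtering out short and long utterances, as suggested above, would normally be done with lhotse's `CutSet.filter()`; a minimal stand-alone sketch over plain `{"id", "duration"}` records (the thresholds are illustrative, not recipe values):

```python
def filter_by_duration(cuts, min_dur: float = 1.0, max_dur: float = 20.0):
    """Keep only utterances whose duration (in seconds) lies in [min_dur, max_dur]."""
    return [c for c in cuts if min_dur <= c["duration"] <= max_dur]
```

Given the statistics above (min 4.6 s, max 16.7 s), such a filter would remove almost nothing from this corpus, which supports the "pretty uniform" observation.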
When you get an error, please print out the supervision object for the minibatch that fails. |
I finished training a pruned transducer model and here are the decoding results on the evaluation set:
Here is the tensorboard: link

As a comparison, the SPGISpeech paper reports 2.6% WER using NeMo's conformer CTC model and 2.3% using ESPnet's conformer encoder-decoder model. I believe at this point the recipe is ready to be merged.

@csukuangfj are there some instructions for how to prepare the model to be uploaded to HuggingFace? For instance, the above WERs are using
Please have a look at https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless2/export.py

You need to provide
After running
To upload files to HuggingFace, you need to first register an account. After creating an account, you can create a repo, clone it to your local computer, copy the files into your local repo, and then use
One thing to note is that you have to run
I have uploaded the pretrained model to HF. This PR is ready for review.
From https://huggingface.co/desh2608/icefall-asr-spgispeech-pruned-transducer-stateless2/blob/main/log/modified_beam_search/errs-dev-beam_size_4-epoch-28-avg-15-beam-4.txt

You can fix the insertion errors at the end of utterances by using #358. You need to
icefall/egs/librispeech/ASR/pruned_transducer_stateless2/decode.py Lines 270 to 274 in 0f180b3
icefall/egs/librispeech/ASR/pruned_transducer_stateless2/decode.py Lines 278 to 282 in 0f180b3
That is, add
import torch
import torch.nn as nn
from asr_datamodule import SPGISpeechAsrDataModule
from beam_search import (
If you are going to synchronize with the latest master, please use
vimdiff ./decode.py /path/to/librispeech/ASR/pruned_transducer_stateless2/decode.py
to find the differences.
Okay thanks, I will synchronize and update.
Please also fix the style issues reported by GitHub actions.
You can fix them locally by following
https://github.com/k2-fsa/icefall/blob/master/docs/source/contributing/code-style.rst
I think the remaining style issues are from librispeech code that I have soft-linked.
Ah, ok. Thanks! Merging.
The WER for modified beam search improved from 2.30% to 2.24% with this change. WER remained unchanged for greedy search.
Could you please upload a torchscript model to https://huggingface.co/desh2608/icefall-asr-spgispeech-pruned-transducer-stateless2/tree/main/exp

You can use export.py --jit=1 to obtain a torchscript model.

I would like to add the pre-trained torchscript model to https://huggingface.co/spaces/k2-fsa/automatic-speech-recognition
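The `--jit=1` export produces a TorchScript model via `torch.jit.script`. A minimal sketch of the same save/load round trip with a toy module (`TinyEncoder` and the 80-dim input are illustrative stand-ins, not the actual transducer model):

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Toy stand-in for the trained encoder; the real export scripts the full model."""

    def __init__(self) -> None:
        super().__init__()
        self.proj = nn.Linear(80, 32)  # e.g. 80-dim fbank features in

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.proj(x))

model = TinyEncoder().eval()
scripted = torch.jit.script(model)   # same mechanism export.py --jit=1 relies on
scripted.save("cpu_jit.pt")          # file name used by the icefall recipes
loaded = torch.jit.load("cpu_jit.pt")
```

The saved `cpu_jit.pt` can then be loaded from C++ or Python without the original model code, which is what makes it suitable for the HuggingFace demo space.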
I will do it this weekend (I'm on an internship right now and have limited access to the JHU cluster).
@desh2608 let me know and I can add you to the k2-fsa hf repo
Thanks!
@csukuangfj sorry it took a while (I just got back from my internship). I have uploaded a jitted model here.
Thanks!
This is a WIP PR for SPGISpeech. I am opening this here early to discuss some issues I have been facing training a transducer model.
I first trained a conformer CTC model on the data (without speed perturbation) on 4 GPUs for 20 epochs, and the training curve looked reasonable: tensorboard. I was able to get a 4% WER with CTC decoding.
I then tried training a pruned_transducer_stateless2 model (with speed perturbation), but training looks weird: tensorboard. I am not sure if this is because of the 3x speed perturbation (I'm now training a model without speed perturbation to verify). I was hoping someone would be able to suggest what may be the reason for these periodic ups and downs.
Adding screenshot of training curve here:
(Please ignore the README files etc. for now since this is only a rough draft.)