Duplicate fast-forwarding during restarting training from checkpoint #473
Comments
Maybe we can change …
@zzzacwork you might want to double-check the code and see whether `cur_batch_idx` is actually nonzero.
I think everything related to …
I fixed that locally, and the only place I found is in icefall/egs/librispeech/ASR/pruned_transducer_stateless2/train.py, lines 778 to 784 (at commit c17233e).
I did some tests to compare the checkpoints and found …
Closing via #421
Hi all,
I find that when training is restarted, the fast-forwarding code is executed twice, which causes some batches to be discarded unintentionally.
First, lhotse will fast-forward (skip) that many batches when loading the state dict from the checkpoint:
https://github.com/lhotse-speech/lhotse/blob/7cce647681cece04b40961fc378112d65aafbaa3/lhotse/dataset/sampling/dynamic_bucketing.py#L176-L191
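To make the first skip concrete, here is a minimal, self-contained toy. It is only a stand-in for what the linked lhotse code does, not its actual implementation: once the sampler's state dict is restored, iteration resumes after the batches the previous run already consumed.

```python
# Toy stand-in for a resumable sampler; lhotse's DynamicBucketingSampler does the
# real bookkeeping, this only illustrates the resume-after-N-batches behaviour.
class ToySampler:
    def __init__(self, num_batches: int):
        self.num_batches = num_batches
        self.consumed = 0  # how many batches have been yielded so far

    def state_dict(self):
        return {"consumed": self.consumed}

    def load_state_dict(self, sd):
        # On restore, remember how far the previous run got ...
        self.consumed = sd["consumed"]

    def __iter__(self):
        # ... and fast-forward past the already-consumed batches (skip #1).
        for i in range(self.consumed, self.num_batches):
            self.consumed = i + 1
            yield i

sampler = ToySampler(num_batches=10)
it = iter(sampler)
for _ in range(4):           # pretend the first run processed 4 batches
    next(it)
ckpt = sampler.state_dict()  # this is what gets saved in the checkpoint

restored = ToySampler(num_batches=10)
restored.load_state_dict(ckpt)
print(list(iter(restored)))  # [4, 5, 6, 7, 8, 9] -- resumes after batch 3
```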
Second, inside train.py, the `train_one_epoch` loop will also fast-forward (skip) that many batches:
icefall/egs/librispeech/ASR/pruned_transducer_stateless2/train.py, lines 721 to 723 (at commit f8d28f0)
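Combined with the sampler's own fast-forward above, this second skip discards batches twice over. A self-contained sketch of the effect (here `cur_batch_idx` plays the role of the value restored from the checkpoint, and the batch numbers are made up for illustration):

```python
# After restore, the fast-forwarded sampler already yields only the remaining
# batches (here global batches 4..9), but the training loop skips
# `cur_batch_idx` of them again using the *local* index from enumerate().
remaining_batches = list(range(4, 10))   # what the restored sampler yields
cur_batch_idx = 4                        # restored from the checkpoint

trained = []
for batch_idx, batch in enumerate(remaining_batches):
    if batch_idx < cur_batch_idx:        # skip #2: drops global batches 4-7
        continue
    trained.append(batch)

print(trained)  # [8, 9] -- global batches 4, 5, 6, 7 are silently discarded
```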
I also notice that the `scan_pessimistic_batches_for_oom` function will also cause the sampler to skip batches; probably this function should only be called if we start training from the beginning (see the sketch below).