
Duplicate fast-forwarding when restarting training from a checkpoint #473

Closed
zzzacwork opened this issue Jul 13, 2022 · 6 comments

@zzzacwork

Hi all,
I find that when training is restarted, the fast-forwarding code is executed twice, which causes some batches to be discarded unintentionally.

First, lhotse fast-forwards (skips) that many batches when loading the state dict from the checkpoint:
https://github.com/lhotse-speech/lhotse/blob/7cce647681cece04b40961fc378112d65aafbaa3/lhotse/dataset/sampling/dynamic_bucketing.py#L176-L191

Second, inside train.py, the train_one_epoch loop also fast-forwards (skips) that many batches:

if batch_idx < cur_batch_idx:
    continue
cur_batch_idx = batch_idx

I also notice that the scan_pessimistic_batches_for_oom function causes the sampler to skip batches as well; probably this function should only be called if we start training from the beginning.
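
To make the two skip points concrete, here is a minimal sketch (variable names mirror train.py but are illustrative, not the exact code):

# 1) lhotse's sampler fast-forwards internally when its state is restored
train_dl.sampler.load_state_dict(sampler_state_dict)

# 2) train_one_epoch then skips the same number of batches a second time
for batch_idx, batch in enumerate(train_dl):
    if batch_idx < cur_batch_idx:
        continue  # these batches were already skipped by the sampler above
    cur_batch_idx = batch_idx
    # ... training step ...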

@luomingshuang
Collaborator

Maybe we can change
if not params.print_diagnostics:
to
if not params.print_diagnostics and params.start_epoch == 1 and params.start_batch == 0:
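
A rough sketch of what the guarded call in train.py might look like (the argument list is omitted here, since it differs between recipes):

if not params.print_diagnostics and params.start_epoch == 1 and params.start_batch == 0:
    # only probe for OOM when training starts from scratch, so the sampler
    # is not advanced on a resumed run
    scan_pessimistic_batches_for_oom(...)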

@danpovey
Collaborator

@zzzacwork you might want to double-check the code and see whether cur_batch_idx is actually nonzero.
If it was skipping too much, it might be a bug in lhotse that was recently discussed on an issue there, involving some interaction with multiple jobs.

@csukuangfj
Collaborator

The cur_batch_idx was added when lhotse did not support resuming from a checkpoint.

I think everything related to cur_batch_idx can be safely removed when using the latest lhotse.
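
For reference, a minimal sketch of the resume path with the sampler handling fast-forwarding by itself (assuming the checkpoint stores the sampler state under a key like "sampler"; names are illustrative):

# restore the sampler position once; no manual cur_batch_idx bookkeeping
if checkpoints is not None and "sampler" in checkpoints:
    train_dl.sampler.load_state_dict(checkpoints["sampler"])

for batch_idx, batch in enumerate(train_dl):
    # the sampler already resumes from the saved batch, so just train
    ...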

@zzzacwork
Author

zzzacwork commented Jul 14, 2022

The cur_batch_idx was added when lhotse did not support resuming from a checkpoint.

I think everything related to cur_batch_idx can be safely removed when using the latest lhotse.

I fixed that locally, and the only place I found cur_batch_idx still useful is in logging:

if batch_idx % params.log_interval == 0:
    cur_lr = scheduler.get_last_lr()[0]
    logging.info(
        f"Epoch {params.cur_epoch}, "
        f"batch {batch_idx}, loss[{loss_info}], "
        f"tot_loss[{tot_loss}], batch size: {batch_size}, "
        f"lr: {cur_lr:.2e}"
    )

@zzzacwork
Author

@zzzacwork you might want to double-check the code and see whether cur_batch_idx is actually nonzero. If it was skipping too much, it might be a bug in lhotse that was recently discussed on an issue there, involving some interaction with multiple jobs.

I did some tests to compare the checkpoints and found that cur_batch_idx is nonzero if we restart from the middle of an epoch. I have already installed the fix from lhotse.

@csukuangfj
Collaborator

Closing via #421
