
Duplicate fast-forwarding when restarting training from a checkpoint #473

Closed
zzzacwork opened this issue Jul 13, 2022 · 6 comments

@zzzacwork

Hi all,
I find that when training is restarted, the fast-forwarding code is executed twice, which causes some batches to be discarded unintentionally.

First, lhotse fast-forwards (skips) that many batches when loading the state dict from the checkpoint:
https://github.com/lhotse-speech/lhotse/blob/7cce647681cece04b40961fc378112d65aafbaa3/lhotse/dataset/sampling/dynamic_bucketing.py#L176-L191

Second, inside train.py, the train_one_epoch loop also fast-forwards (skips) that many batches:

if batch_idx < cur_batch_idx:
    continue
cur_batch_idx = batch_idx

I also notice that the scan_pessimistic_batches_for_oom function causes the sampler to skip batches as well; probably this function should only be called if we start training from the beginning.
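
To make the two skip points concrete, here is a minimal sketch (variable names mirror train.py but are illustrative, not the exact code):

# 1) lhotse's sampler fast-forwards internally when its state is restored
train_dl.sampler.load_state_dict(sampler_state_dict)

# 2) train_one_epoch then skips the same number of batches a second time
for batch_idx, batch in enumerate(train_dl):
    if batch_idx < cur_batch_idx:
        continue  # these batches were already skipped by the sampler above
    cur_batch_idx = batch_idx
    # ... training step ...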

@luomingshuang
Collaborator

Maybe we can change
if not params.print_diagnostics:
to
if not params.print_diagnostics and params.start_epoch == 1 and params.start_batch == 0:
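
A rough sketch of what the guarded call in train.py might look like (the argument list is omitted here, since it differs between recipes):

if not params.print_diagnostics and params.start_epoch == 1 and params.start_batch == 0:
    # only probe for OOM when training starts from scratch, so the sampler
    # is not advanced on a resumed run
    scan_pessimistic_batches_for_oom(...)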

@danpovey
Collaborator

@zzzacwork you might want to double-check the code and see whether cur_batch_idx is actually nonzero.
If it was skipping too much, it might be a bug in lhotse that was recently discussed on an issue there, involving some interaction with multiple jobs.

@csukuangfj
Collaborator

The cur_batch_idx was added when lhotse did not support resuming from a checkpoint.

I think everything related to cur_batch_idx can be safely removed when using the latest lhotse.
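
For reference, a minimal sketch of the resume path with the sampler handling fast-forwarding by itself (assuming the checkpoint stores the sampler state under a key like "sampler"; names are illustrative):

# restore the sampler position once; no manual cur_batch_idx bookkeeping
if checkpoints is not None and "sampler" in checkpoints:
    train_dl.sampler.load_state_dict(checkpoints["sampler"])

for batch_idx, batch in enumerate(train_dl):
    # the sampler already resumes from the saved batch, so just train
    ...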

@zzzacwork
Author

zzzacwork commented Jul 14, 2022

The cur_batch_idx was added when lhotse did not support resuming from a checkpoint.

I think everything related to cur_batch_idx can be safely removed when using the latest lhotse.

I fixed that locally, and the only place I found cur_batch_idx still useful is in logging:

if batch_idx % params.log_interval == 0:
    cur_lr = scheduler.get_last_lr()[0]
    logging.info(
        f"Epoch {params.cur_epoch}, "
        f"batch {batch_idx}, loss[{loss_info}], "
        f"tot_loss[{tot_loss}], batch size: {batch_size}, "
        f"lr: {cur_lr:.2e}"
    )

@zzzacwork
Author

@zzzacwork you might want to double-check the code and see whether cur_batch_idx is actually nonzero. If it was skipping too much, it might be a bug in lhotse that was recently discussed on an issue there, involving some interaction with multiple jobs.

I did some tests to compare the checkpoints and found that cur_batch_idx is nonzero if we restart from the middle of an epoch. I have already installed the fix from lhotse.

@csukuangfj
Collaborator

Closing via #421
