Partially missing training_step outputs in training_epoch_end #2320
I dug into the PL training loop and may have found a possible issue. In the `run_training_epoch` method, it seems to me that only the last iteration's `batch_output` is passed to `training_epoch_end`.
@williamFalcon @Borda could you please take a look into it?
I debugged a bit, and I think the problem occurs with multiple optimizers: only the output of the last iteration (over split batches and optimizers) is passed on. If you can confirm this is the problem, I can fix it and open a pull request. Thanks
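A minimal pure-Python sketch of the overwrite pattern being described (hypothetical; this is not the actual Lightning source, and the function names are illustrative): if the inner loop over split batches and optimizers assigns each result to a single variable, only the final `(split, optimizer)` output survives, whereas collecting into a list preserves every optimizer's output.

```python
def run_training_batch_buggy(split_batches, optimizers, training_step):
    """Suspected pattern: each iteration overwrites the previous output."""
    batch_output = None
    for split_batch in split_batches:
        for opt_idx, _opt in enumerate(optimizers):
            # overwrites the previous (split, optimizer) result
            batch_output = training_step(split_batch, opt_idx)
    return batch_output  # only the last iteration's output is returned


def run_training_batch_fixed(split_batches, optimizers, training_step):
    """Sketch of a fix: accumulate every (split, optimizer) output."""
    batch_outputs = []
    for split_batch in split_batches:
        for opt_idx, _opt in enumerate(optimizers):
            batch_outputs.append(training_step(split_batch, opt_idx))
    return batch_outputs
```

With two optimizers, the buggy version always returns the `optimizer_idx=1` result, which matches the symptom reported below (teacher outputs at `optimizer_idx=0` missing).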
@mmiakashs can you check that this works for you now on master?
@williamFalcon Thanks for the update. Sorry, I missed that notification. I will check this after my deadline next week :)
Learning model: the model consists of two branches, teacher and student. Two losses are used to train the two branches, each with its own optimizer.
ISSUE: In `training_step`, I produce two sets of metric outputs, one for the teacher branch (`optimizer_idx=0`), `{loss, log: {acc, f1}}`, and one for the student branch (`optimizer_idx=1`), `{loss, log: {acc, f1, precision}}`. But in `training_epoch_end`, I only get the combined outputs for the student (`optimizer_idx=1`); all the teacher outputs (`optimizer_idx=0`) are missing. I also looked through the PL training loop and didn't observe any issue where it combines the `training_step` outputs. I am not sure what I am missing.