This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Fix distributed loss #5381

Merged: 3 commits merged into main from fix-dist-loss on Aug 27, 2021

Conversation

@AkshitaB (Contributor) commented on Aug 27, 2021

The train and validation losses are now reduced across workers just once, when finalizing the metrics for the epoch, instead of after every batch.
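
As a rough illustration of the change (not the PR's actual diff), the cross-worker reduction can happen a single time at the end of the epoch rather than inside the batch loop. The sketch below assumes `torch.distributed` and a hypothetical `reduce_epoch_loss` helper:

```python
# Minimal sketch, not the PR's actual code: reduce the accumulated epoch
# loss across distributed workers once, instead of after every batch.
import torch
import torch.distributed as dist


def reduce_epoch_loss(total_loss: float, num_batches: int, device: torch.device) -> float:
    """Average the summed epoch loss over all workers, then over batches."""
    if dist.is_available() and dist.is_initialized():
        loss_tensor = torch.tensor(total_loss, device=device)
        # One all_reduce per epoch: sum the per-worker totals, then divide
        # by the world size to get the mean across workers.
        dist.all_reduce(loss_tensor, op=dist.ReduceOp.SUM)
        total_loss = loss_tensor.item() / dist.get_world_size()
    return total_loss / num_batches if num_batches > 0 else 0.0
```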

@AkshitaB mentioned this pull request on Aug 27, 2021
@AkshitaB requested a review from @epwalsh on Aug 27, 2021, 04:31
@epwalsh (Member) left a comment

This looks good to me, but can you just remove the unused arguments (cuda_device and world_size) from the training.util.get_metrics() function?
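
For illustration only (the real allennlp signature may differ), the requested cleanup amounts to dropping keyword arguments that the helper no longer needs once the reduction happens in the trainer:

```python
# Hypothetical before/after sketch of the requested cleanup; names and
# signatures are illustrative, not the library's exact API.

# Before: callers had to pass distributed context the function never used.
# def get_metrics(model, total_loss, num_batches, reset=False,
#                 world_size=1, cuda_device=0): ...

# After: the unused cuda_device and world_size arguments are removed.
def get_metrics(model, total_loss: float, num_batches: int, reset: bool = False) -> dict:
    metrics = model.get_metrics(reset=reset)
    metrics["loss"] = float(total_loss / num_batches) if num_batches > 0 else 0.0
    return metrics
```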

@AkshitaB requested a review from @epwalsh on Aug 27, 2021, 18:06
@epwalsh (Member) left a comment

LGTM!

@AkshitaB merged commit b41cb3e into main on Aug 27, 2021
@AkshitaB deleted the fix-dist-loss branch on Aug 27, 2021, 20:01