This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

Fix distributed loss #5381

Merged: 3 commits merged into main from fix-dist-loss on Aug 27, 2021

Conversation

@AkshitaB (Contributor) commented on Aug 27, 2021

The train and validation losses are now reduced across workers just once, when finalizing the metrics for the epoch, instead of after every batch.
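
As a rough illustration of the change (not the PR's actual diff), the cross-worker reduction can happen a single time at the end of the epoch rather than inside the batch loop. The sketch below assumes `torch.distributed` and a hypothetical `reduce_epoch_loss` helper:

```python
# Minimal sketch, not the PR's actual code: reduce the accumulated epoch
# loss across distributed workers once, instead of after every batch.
import torch
import torch.distributed as dist


def reduce_epoch_loss(total_loss: float, num_batches: int, device: torch.device) -> float:
    """Average the summed epoch loss over all workers, then over batches."""
    if dist.is_available() and dist.is_initialized():
        loss_tensor = torch.tensor(total_loss, device=device)
        # One all_reduce per epoch: sum the per-worker totals, then divide
        # by the world size to get the mean across workers.
        dist.all_reduce(loss_tensor, op=dist.ReduceOp.SUM)
        total_loss = loss_tensor.item() / dist.get_world_size()
    return total_loss / num_batches if num_batches > 0 else 0.0
```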

@AkshitaB mentioned this pull request on Aug 27, 2021
@AkshitaB requested a review from @epwalsh on Aug 27, 2021, 04:31
@epwalsh (Member) left a comment

This looks good to me, but can you just remove the unused arguments (cuda_device and world_size) from the training.util.get_metrics() function?
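
For illustration only (the real allennlp signature may differ), the requested cleanup amounts to dropping keyword arguments that the helper no longer needs once the reduction happens in the trainer:

```python
# Hypothetical before/after sketch of the requested cleanup; names and
# signatures are illustrative, not the library's exact API.

# Before: callers had to pass distributed context the function never used.
# def get_metrics(model, total_loss, num_batches, reset=False,
#                 world_size=1, cuda_device=0): ...

# After: the unused cuda_device and world_size arguments are removed.
def get_metrics(model, total_loss: float, num_batches: int, reset: bool = False) -> dict:
    metrics = model.get_metrics(reset=reset)
    metrics["loss"] = float(total_loss / num_batches) if num_batches > 0 else 0.0
    return metrics
```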

@AkshitaB requested a review from @epwalsh on Aug 27, 2021, 18:06
@epwalsh (Member) left a comment

LGTM!

@AkshitaB merged commit b41cb3e into main on Aug 27, 2021
@AkshitaB deleted the fix-dist-loss branch on Aug 27, 2021, 20:01