Correctly calculate distributed loss average #269

gasteigerjo · 2021-09-06T19:26:11Z

Previously, we first computed the average loss on each DDP replica and then averaged those per-replica values across replicas. This gives wrong results, because replicas do not all hold the same number of atoms, or even the same number of systems, so every replica's mean gets equal weight no matter how many samples it covers. This PR instead computes a single average over all replicas by aggregating the number of samples across replicas and normalizing by that total.
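To illustrate the discrepancy, here is a minimal sketch in plain Python. It is not the code from this PR: the function names and the two-replica numbers are hypothetical, and each replica is represented only by its summed loss and its sample count. In an actual DDP run the two totals would presumably be obtained with an all-reduce (for example `torch.distributed.all_reduce`) rather than a local `sum`.

```python
# Each replica is modelled as (sum of per-sample losses, number of samples).
# The values below are made up for illustration only.

def mean_of_means(replicas):
    """Old behaviour: average per replica, then average across replicas."""
    per_replica_means = [loss_sum / n for loss_sum, n in replicas]
    return sum(per_replica_means) / len(per_replica_means)


def global_mean(replicas):
    """New behaviour: aggregate loss sums and sample counts, then divide once."""
    total_loss = sum(loss_sum for loss_sum, _ in replicas)
    total_samples = sum(n for _, n in replicas)
    return total_loss / total_samples


# Replica 0 holds 10 samples with mean loss 1.0,
# replica 1 holds 20 samples with mean loss 2.0.
replicas = [(10.0, 10), (40.0, 20)]

print(mean_of_means(replicas))  # 1.5, weights both replicas equally
print(global_mean(replicas))    # 50 / 30, weights every sample equally
```

The two results agree only when every replica holds the same number of samples, which matches the motivation given above.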

The change seems to give a tiny improvement in MAE (roughly 0.5%), but training otherwise looks very similar to before.

Co-authored-by: Abhishek Das <das.abhshk@gmail.com>

Correctly calculate distributed loss average

67c1120

gasteigerjo requested review from abhshkdz and anuroopsriram September 6, 2021 20:01

abhshkdz approved these changes Sep 18, 2021


Merge branch 'master' into fix_distributed_aggregation

c8bfbea

abhshkdz merged commit ef98b27 into FAIR-Chem:master Sep 18, 2021

sparticlesteve pushed a commit to sparticlesteve/ocp that referenced this pull request May 23, 2022

Correctly calculate distributed loss average (FAIR-Chem#269)

fb929a9

Co-authored-by: Abhishek Das <das.abhshk@gmail.com>

This was referenced May 23, 2022

Update mlperf reference for MLPerf HPC v2.0 sparticlesteve/ocp#5

Merged

OpenCatalyst reference updates for MLPerf HPC v2.0 mlcommons/hpc#29

Merged

levineds pushed a commit that referenced this pull request Jul 11, 2024

Correctly calculate distributed loss average (#269)

329f43d

Co-authored-by: Abhishek Das <das.abhshk@gmail.com>
