val loss in distribute training #674

LiuSiQi-TJ · 2023-08-15T01:52:24Z

I am using the LibriMix dataset to train DCCRN on 8 GPUs.
I enabled early stopping in the conf.
The model always stops at a very early stage, around 10 or 20 epochs.
In the log, I see that the val loss is calculated separately on each GPU, while early stopping is applied only by GPU 0. I think this is the reason for the very early stop. The log is as follows:

[rank: 5] Metric val_loss improved by 0.433 >= min_delta = 0.0. New best score: -11.178
[rank: 0] Metric val_loss improved by 0.333 >= min_delta = 0.0. New best score: -11.104
[rank: 7] Metric val_loss improved by 0.530 >= min_delta = 0.0. New best score: -10.551
[rank: 4] Metric val_loss improved by 0.408 >= min_delta = 0.0. New best score: -10.931
[rank: 1] Metric val_loss improved by 0.287 >= min_delta = 0.0. New best score: -10.971
[rank: 3] Metric val_loss improved by 0.415 >= min_delta = 0.0. New best score: -11.321
[rank: 2] Metric val_loss improved by 0.418 >= min_delta = 0.0. New best score: -10.858
[rank: 6] Metric val_loss improved by 0.504 >= min_delta = 0.0. New best score: -11.375
Epoch 2, global step 1587: 'val_loss' reached -11.10351 (best -11.10351),
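
One way to read the log above: each rank reports its own `val_loss`, presumably computed on its own shard of the validation set, so the eight "best score" values differ and an early-stopping decision based on one rank may not reflect the full validation set. The sketch below is plain Python, not Asteroid or Lightning code. `all_reduce_mean` and `improved` are hypothetical helpers written for illustration, and the improvement test is a simplified version of a "lower is better" check. It only shows how a mean reduction across ranks gives every rank the same value to compare against.

```python
from statistics import mean

# Per-rank "New best score" values copied from the log above (rank -> val_loss).
per_rank_val_loss = {
    0: -11.104, 1: -10.971, 2: -10.858, 3: -11.321,
    4: -10.931, 5: -11.178, 6: -11.375, 7: -10.551,
}


def all_reduce_mean(values_by_rank):
    """Hypothetical stand-in for a cross-process mean reduction.

    After the reduction, every rank holds the same averaged value.
    """
    avg = mean(values_by_rank.values())
    return {rank: avg for rank in values_by_rank}


def improved(current, best, min_delta=0.0):
    """Simplified early-stopping test for a metric that should decrease."""
    return current < best - min_delta


# Without synchronization, the ranks disagree on the metric.
spread = max(per_rank_val_loss.values()) - min(per_rank_val_loss.values())
print(f"spread across ranks without sync: {spread:.3f}")

# With a mean reduction, all ranks see one shared value.
synced = all_reduce_mean(per_rank_val_loss)
print(f"distinct values after sync: {len(set(synced.values()))}")
```

In PyTorch Lightning, passing `sync_dist=True` to `self.log` is the option that reduces a logged value across processes. Whether the DCCRN recipe sets it is not shown in this thread, so that part is an assumption to check against the installed versions.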

LiuSiQi-TJ · 2023-08-15T02:06:28Z

I set CUDA_VISIBLE_DEVICES = 0,1,2,3,4,5,6,7 in run.sh. Did I do something wrong?

mpariente · 2023-08-19T17:44:53Z

Hello,

I would say you did not do anything wrong. Which version of Lightning are you using?

LiuSiQi-TJ added the "bug" (Something isn't working) and "help wanted" (Extra attention is needed) labels · Aug 15, 2023
