
[Feature Request] EarlyStopping logging on rank 0 only #13162

Closed · austinmw opened this issue May 26, 2022 · 5 comments · Fixed by #13233

Labels: callback: early stopping · feature (Is an improvement or enhancement) · help wanted (Open to be worked on)

Comments

@austinmw commented May 26, 2022

🚀 Feature

A toggle to turn off EarlyStopping logging on processes other than rank 0.

Motivation

EarlyStopping logging can be a bit spammy when viewing aggregate logs across all processes. For example, with my custom CloudWatch logger:

xnpww4j62d-algo-1-vr8o9 | 14:17:49 [INFO] Epoch 9: [ Training | 100%  iter# 49/49    19.28 batches/s ] train/loss_step=0.764418, train/loss_epoch=0.773, train/acc=0.68356
xnpww4j62d-algo-1-vr8o9 | 14:17:55 [INFO] Epoch 9: [ Validation | 100%  iter# 10/10     2.34 batches/s ] val/loss_step=1.253475, val/loss_epoch=1.278802, val/acc=0.6107
xnpww4j62d-algo-1-vr8o9 | 14:17:55 [INFO] [rank: 0] Metric val/acc improved by 0.195 >= min_delta = 0.0. New best score: 0.611
xnpww4j62d-algo-1-vr8o9 | 14:17:55 [INFO] [rank: 2] Metric val/acc improved by 0.195 >= min_delta = 0.0. New best score: 0.611
xnpww4j62d-algo-1-vr8o9 | 14:17:55 [INFO] [rank: 1] Metric val/acc improved by 0.195 >= min_delta = 0.0. New best score: 0.611
xnpww4j62d-algo-1-vr8o9 | 14:17:55 [INFO] [rank: 3] Metric val/acc improved by 0.195 >= min_delta = 0.0. New best score: 0.611
xnpww4j62d-algo-1-vr8o9 | 14:17:55 [INFO] [rank: 4] Metric val/acc improved by 0.195 >= min_delta = 0.0. New best score: 0.611
xnpww4j62d-algo-1-vr8o9 | 14:17:55 [INFO] [rank: 5] Metric val/acc improved by 0.195 >= min_delta = 0.0. New best score: 0.611
xnpww4j62d-algo-1-vr8o9 | 14:17:55 [INFO] [rank: 6] Metric val/acc improved by 0.195 >= min_delta = 0.0. New best score: 0.611
xnpww4j62d-algo-1-vr8o9 | 14:17:55 [INFO] [rank: 7] Metric val/acc improved by 0.195 >= min_delta = 0.0. New best score: 0.611
xnpww4j62d-algo-1-vr8o9 | 14:18:20 [INFO] Epoch 14: [ Training | 100%  iter# 49/49    18.94 batches/s ] train/loss_step=0.611876, train/loss_epoch=0.55, train/acc=0.80096
xnpww4j62d-algo-1-vr8o9 | 14:18:26 [INFO] Epoch 14: [ Validation | 100%  iter# 10/10     2.29 batches/s ] val/loss_step=0.748429, val/loss_epoch=0.828285, val/acc=0.726

Pitch

It would be nice if we could turn off printing of this message on processes other than rank 0. I understand that it is actually useful to monitor in some cases, so the toggle could default to False.

Alternatives

Custom EarlyStopping callback?
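
For reference, a rough sketch of what such a custom callback could look like, assuming the installed EarlyStopping routes its messages through a private _log_info(trainer, message) static helper (as recent 1.6.x releases appear to do; check your installed version before relying on this):

import logging
from typing import Optional

import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

log = logging.getLogger(__name__)


class QuietEarlyStopping(EarlyStopping):
    """EarlyStopping that emits improvement/stop messages on global rank 0 only."""

    @staticmethod
    def _log_info(trainer: Optional["pl.Trainer"], message: str) -> None:
        # Overrides the (assumed) private logging helper: stay silent on
        # non-zero ranks instead of printing once per process.
        if trainer is not None and trainer.global_rank != 0:
            return
        log.info(message)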

cc @Borda @carmocca @awaelchli @rohitgr7

@austinmw added the "needs triage" (Waiting to be triaged by maintainers) label on May 26, 2022
@carmocca added the "feature" (Is an improvement or enhancement) and "callback: early stopping" labels and removed the "needs triage" label on May 26, 2022
@carmocca added this to the 1.7 milestone on May 26, 2022
@carmocca (Contributor)

I think we can add this flag. It's useful for metrics logged with sync_dist=True.

The relevant piece of code is here:

https://github.com/PyTorchLightning/pytorch-lightning/blob/dd475183227644a8d22dca3deb18c99fb0a9b2c4/pytorch_lightning/callbacks/early_stopping.py#L256-L261
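
For readers who do not want to click through, the referenced lines are (roughly; paraphrased rather than copied from the pinned commit) the small helper that every EarlyStopping message goes through, and it logs on every rank, which is what produces the duplicated output above:

import logging
from typing import Optional

import pytorch_lightning as pl

log = logging.getLogger(__name__)


def _log_info(trainer: Optional["pl.Trainer"], message: str) -> None:
    # Paraphrased shape of EarlyStopping._log_info: under a multi-process
    # strategy the message is prefixed with the emitting process's global
    # rank, but it is still logged unconditionally on every rank.
    if trainer is not None and trainer.world_size > 1:
        log.info(f"[rank: {trainer.global_rank}] {message}")
    else:
        log.info(message)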

@carmocca added the "help wanted" (Open to be worked on) label on May 26, 2022
@carmocca modified the milestones: 1.7 → future on May 26, 2022
@austinmw (Author)

Thanks!

@ekagra-ranjan (Contributor)

Hi @carmocca! I would like to take this up.

@ekagra-ranjan (Contributor) commented Jun 1, 2022

Hi @carmocca - I have a question regarding this:

"It's useful for metrics logged with sync_dist=True"

The metrics for callbacks are stored during epoch end:
https://github.com/PyTorchLightning/pytorch-lightning/blob/c1f05021ff0093f720770a6065ab62a70c535add/pytorch_lightning/trainer/connectors/logger_connector/result.py#L584-L587

and sync_dist will always be true for epoch level metric computation as per:
https://github.com/PyTorchLightning/pytorch-lightning/blob/c1f05021ff0093f720770a6065ab62a70c535add/pytorch_lightning/trainer/connectors/logger_connector/result.py#L524-L531

So, will our new flag depend on the value of sync_dist somehow?

Also, if epoch-level metrics are always reduced across processes, won't the logs be identical on all ranks (as seen in the output shared by @austinmw)?
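
For concreteness, here is a minimal module (illustrative, not taken from this issue) where the monitored metric is synced; with sync_dist=True the epoch-level val/acc that EarlyStopping reads is the mean across all DDP processes, so every rank compares against the same number:

import torch
from torch import nn
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(32, 10)

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self(x), y)
        self.log("train/loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        acc = (self(x).argmax(dim=-1) == y).float().mean()
        # on_epoch=True + sync_dist=True: the epoch-level value is reduced
        # across processes, so all ranks monitor the same "val/acc".
        self.log("val/acc", acc, on_epoch=True, sync_dist=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)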

@carmocca (Contributor) commented Jun 2, 2022

@ekagra-ranjan For an initial implementation, I suggest you don't try to decide this automatically. Just allow the user to choose manually by passing rank_0_only=True in the constructor.
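
A sketch of how such a flag might be used; the name rank_0_only follows the suggestion above and is hypothetical here (the flag that actually landed in #13233 may be named differently, so check the EarlyStopping docstring of your installed version):

import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val/acc",
    mode="max",
    min_delta=0.0,
    patience=5,
    verbose=True,
    rank_0_only=True,  # hypothetical flag: log improvement/stop messages on rank 0 only
)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    strategy="ddp",
    callbacks=[early_stop],
)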
