Allow disabling automatic stopping during fitting #8818
Comments
@awaelchli this is similar to #8210 (comment) |
Just as a follow-up, I still think the default should be that jobs are automatically stopped, to avoid forgotten jobs racking up lots of server costs. However, I think the change in behavior should specifically clean up how you specify that you actually want the job to run forever.
I think Lightning's 1000 epochs is a safety feature.

```python
class Trainer:
    def __init__(..., max_epochs: Optional[Union[X, int]] = X):
```

where `X` is a sentinel value (is this the right term??) to differentiate from `None`. When the user sets it to `None` we have the infinite loop, but if not we keep the previous behavior. Not the most elegant, but that's what I can think of for now.
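For reference, here is a minimal sketch of that sentinel-default pattern in plain Python. The `_UNSET` sentinel and the simplified `Trainer` below are illustrative only, not Lightning's actual code:

```python
from typing import Union

class _Unset:
    """Sentinel type: distinguishes 'argument not passed' from an explicit None."""

_UNSET = _Unset()

class Trainer:
    def __init__(self, max_epochs: Union[_Unset, None, int] = _UNSET):
        if max_epochs is _UNSET:
            # user did not pass anything -> keep the old safety default
            self.max_epochs = 1000
        elif max_epochs is None:
            # user explicitly asked for no limit -> train forever
            self.max_epochs = None
        else:
            self.max_epochs = max_epochs
```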
@awaelchli do you think maybe setting max_epochs to a negative value would work?
that could also work yes ^^ |
Awesome. I'll create a PR |
@awaelchli any recommendations for how to check the trainer will train forever if max_epochs is negative? Also, I was debating whether to allow both max_epochs and max_steps to be set to negative/have the other be None to disable automatic stopping. This would mean the following combos would disable automatic stopping: |
@EricWiener actually I believe we should also introduce max_epochs, max_steps, effect |
Current behavior is:

```python
# trainer.py
fit_loop = FitLoop(
    min_epochs=(1 if (min_epochs is None and min_steps is None) else min_epochs),
    max_epochs=(1000 if (max_epochs is None and max_steps is None) else max_epochs),
)

# fit_loop.py
stop_steps = self.max_steps is not None and self.global_step >= self.max_steps
stop_epochs = self.max_epochs is not None and self.current_epoch >= self.max_epochs
```

So if we were to allow (
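For illustration, a self-contained sketch of how those two checks could be rewritten if `-1` were treated as "limit disabled", as proposed above. The `done` helper below is hypothetical, not Lightning's implementation:

```python
def done(global_step: int, current_epoch: int, max_steps: int, max_epochs: int) -> bool:
    # -1 disables the corresponding limit
    stop_steps = max_steps != -1 and global_step >= max_steps
    stop_epochs = max_epochs != -1 and current_epoch >= max_epochs
    return stop_steps or stop_epochs

# with both limits disabled, the fit loop never stops on its own
# (some other signal such as max_time or early stopping would be needed)
assert done(global_step=10_000, current_epoch=500, max_steps=-1, max_epochs=-1) is False
```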
@EricWiener @awaelchli are there other validations we then need to add for |
What would be the issue with allowing |
Yes, you're absolutely right, my mistake. As there are multiple stopping conditions one could set with either, I would prefer to raise a misconfiguration exception in case we have conflicting bounds instead of trying to determine some precedence across these conditions. FWIW, another opportunity for infinite training is online training, where new data is continuously streamed in and we don't want to stop even if the dataloader is temporarily exhausted.
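A minimal sketch of what such a misconfiguration check could look like, assuming a hypothetical `_validate_stopping_args` helper (`MisconfigurationException` is the exception Lightning generally raises for invalid Trainer arguments):

```python
from pytorch_lightning.utilities.exceptions import MisconfigurationException

def _validate_stopping_args(max_epochs, max_steps):
    # hypothetical validation: reject nonsensical values up front instead of
    # silently picking a precedence between conflicting bounds
    for name, value in (("max_epochs", max_epochs), ("max_steps", max_steps)):
        if value is not None and value < -1:
            raise MisconfigurationException(
                f"`{name}` must be a non-negative int, -1 (disabled), or None, got {value}."
            )
```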
@EricWiener yes certainly. we would specifically check for the value -1. |
For "-1, Any, turns off limits for epochs, still limited by max_steps", this would be the same as currently leaving Got it about specifically checking for |
I'm gonna copy my response from my PR over here (and update it to reflect the further discussion):

What are new valid configurations?
Original: The only new valid configuration is

What are new invalid configurations?
Are there configurations which aren't exceptional but which we should raise some sort of warning for?
No. We should not allow any of the above configurations.

What priority do we give among steps/epochs/time?
We should prioritize whichever occurs first with a user-defined value. This means if the user defines

The main intention of this PR was to disable Lightning automatically stopping training after a certain number of epochs/steps is reached, even if the user didn't want this behavior. You might still want a model to train for a certain amount of time, but not be limited by the number of epochs/steps. For instance, if you wanted a model to train for at most two weeks, but you weren't sure how many epochs/steps it would be able to reach, you could disable automatic stopping for epochs/steps but still specify a time constraint. Currently

You could also still achieve no stopping at all if you set

This seems to achieve the best result of:

Do we need to adjust defaults anywhere that depend on max_steps or max_epochs?
All the defaults should remain.
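To make the time-bounded-but-otherwise-unbounded case above concrete, here is a hedged usage sketch assuming the `-1` semantics proposed in this thread together with the existing `max_time` Trainer argument:

```python
from pytorch_lightning import Trainer

# train for at most two weeks, with no limit on epochs or steps
trainer = Trainer(
    max_epochs=-1,           # epoch limit disabled
    max_steps=-1,            # step limit disabled
    max_time="14:00:00:00",  # DD:HH:MM:SS -> stop after 14 days
)
```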
I disagree with the options you declare as invalid.

max_epochs, max_steps, effect
No, that is not correct. max_epochs=X, max_steps=Y should stop at whichever of these two limits gets satisfied first! We call it max_* for a reason. Apart from that, I agree with the rest of your comments, but these functionalities will be satisfied given my comments above, I think.
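Reusing the hypothetical `done()` helper sketched earlier, the "whichever limit is satisfied first" semantics would look like this:

```python
# both limits set: stop as soon as EITHER bound is reached
assert done(global_step=100, current_epoch=3, max_steps=100, max_epochs=10) is True   # steps hit first
assert done(global_step=42, current_epoch=10, max_steps=100, max_epochs=10) is True   # epochs hit first
```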
Thank you for listing that out! Your idea does make more sense. Just to make sure we're all on the same page, I'm gonna flesh this out a little more to include all possible values.
The difference between 2a and 3a was my main concern, but I think the behavior of
I think my intention was not understood. I was saying to keep the current implementation:
Here, precedence is given to whatever comes first with a user-defined value. If the user defines

Points to address
I think the main questions that need to be answered at this point are:
@EricWiener - thank you for listing this out!
One n00b question: if someone is used to specifying purely time-based training (i.e. no steps/epochs), they cannot toggle off the same flag to specify infinite training; they'd have to switch to specifying one of
If we can return early or raise an error, I'd prefer doing so. There is a lot of setup that potentially happens before entering the fit loop, and it's bad UX to waste that time only for nothing to happen.
Hey @EricWiener, I am assigning you to this issue as it seems we have a proposal :) Best, |
-1, -1: infinite training! IMO there is no need to throw an error. If a user provides 0 for one or all, then nothing should happen, as you would expect from such values.
Alright so I'm going to go with:

max_epochs, max_steps, effect
And thanks to @ananthsub's PR (#9072) we also have better behavior re.
Great @EricWiener, the proposal sounds great!
If we consider the product:
(*): Perhaps this case should be equal to (3) in the future.

Note that all cases that have a

This tells me that we should only allow
Pseudocode:

```python
# default
max_epochs: Optional[int] = None
max_steps: int = -1

# conversion
if max_epochs is None:
    max_epochs = 1000 if max_steps == -1 else -1
```
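A runnable version of that conversion, written as a standalone helper for illustration (the name `_resolve_max_epochs` is made up and not part of Lightning's API):

```python
from typing import Optional

def _resolve_max_epochs(max_epochs: Optional[int], max_steps: int = -1) -> int:
    """Apply the proposed defaulting rule: an unset max_epochs keeps the
    1000-epoch safety net only when max_steps is also unlimited."""
    if max_epochs is None:
        return 1000 if max_steps == -1 else -1
    return max_epochs

assert _resolve_max_epochs(None, -1) == 1000   # neither limit given -> safety default
assert _resolve_max_epochs(None, 500) == -1    # steps given -> no epoch limit
assert _resolve_max_epochs(-1, -1) == -1       # explicit -1 -> infinite training
```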
@carmocca I really like this idea. However, this would be a breaking change, since if anyone has explicitly set

Could I please get thoughts from @tchaton @awaelchli @ananthsub @justusschock before changing the PR to this implementation. I want to make sure we're all on the same page that we change the default value of

Could I please get either a response or a 👍/👎 from everyone tagged.
True, but does anybody set |
I think we definitely have to deprecate |
I suggest treating this as a separate issue and keeping the PR focused on one thing. Disallowing None for steps can be a separate PR, IMO.
+1 @awaelchli - we can enable infinite training in #8877 and separately tighten the types offered for the stopping conditions. |
Sounds great! Thanks for all the feedback |
Re-opening to track the type change. Are you interested in doing it @EricWiener?
Sure @carmocca, but might need a bit of time before I get to it |
@awaelchli @carmocca I just realized data published to the trainer through |
At least, I'd recommend always setting

But yes, if you still want to compute at a certain batch index, there's probably no reason to
🚀 Feature
Currently, if neither `max_epochs` nor `max_steps` is set, Lightning defaults to using `max_epochs = 1000`. However, in some situations the user actually doesn't want `max_epochs` or `max_steps` to be set, and automatically stopping after 1000 epochs isn't wanted. It would be great if there were a way to specify that no maximum number of epochs should be set.

Motivation
I was running a very large training job over the weekend with a very large dataset. Because the dataset is so large, I set the number of batches per epoch to be very small (relative to the size of the dataset) so that logging would occur more frequently. I set `max_epochs` and `max_steps` to be `None` because I believed this would disable automatic stopping, but on Monday when I checked on the model again it had exited early after only a couple of hours.

Pitch
Have some way to specify that automatic stopping should be disabled. Having to hard-code a large number like `2**1000` isn't the most elegant (especially compared to how elegant the rest of Lightning is).

Alternatives
The user could pass `float("inf")` as `max_epochs` to disable stopping if not using multiple GPUs. However, if using multiple GPUs they would need to use a very large integer instead (e.g. `2**1000`).

If using a float for `max_epochs`, you will get the following error:

Additional context
https://pytorch-lightning.slack.com/archives/CRBLFHY79/p1627425271133200