Allow disabling automatic stopping during fitting #8818

Closed
EricWiener opened this issue Aug 9, 2021 · 34 comments · Fixed by #8877 or #9460
Labels: feature (Is an improvement or enhancement), good first issue (Good for newcomers), help wanted (Open to be worked on)

Comments

@EricWiener
Contributor

🚀 Feature

Currently, if neither max_epochs nor max_steps is set, Lightning defaults to max_epochs = 1000. However, in some situations the user doesn't want max_epochs or max_steps to be set at all, and automatically stopping after 1000 epochs is unwanted. It would be great if there were a way to specify that no epoch limit should be applied.

Motivation

I was running a very large training job over the weekend with a very large dataset. Because the dataset is so large, I set the number of batches per epoch to be very small (relative to the size of the dataset) so that logging would occur more frequently. I set max_epochs and max_steps to None because I believed this would disable automatic stopping. When I checked on the model again on Monday, it had exited early after only a couple of hours.

Pitch

Have some way to specify that automatic stopping should be disabled. Having to hard-code a large number like 2**1000 isn't the most elegant (especially compared to how elegant the rest of Lightning is).

Alternatives

The user could pass float("inf") as max_epochs to disable stopping if not using multiple GPUs. However, if using multiple GPUs, they would need to use a very large integer instead (e.g. 2**1000).
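For illustration, a hedged sketch of the two workarounds (the values are arbitrary; only the `Trainer` import is assumed):

```python
from pytorch_lightning import Trainer

# works, but inelegant: a huge integer so the epoch limit is never reached
trainer = Trainer(max_epochs=2**1000)

# a float cannot work in general: the training loop builds a range() over
# epochs, and Python's range() only accepts integers
# range(0, float("inf"))  # TypeError: 'float' object cannot be interpreted as an integer
```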

If using a float for max_epochs, you will get the following error:

```
Training: 0it [00:00, ?it/s]Traceback (most recent call last):
  File "/home/eric/.../train.runfiles/train.py", line 371, in <module>
    main(sys.argv[1:])
  File "/home/eric/.../train.runfiles/train.py", line 362, in main
    trainer.fit(model, data_module)
  File "/home/eric/.../train.runfiles/pypi__pytorch_lightning_python3_deps/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self._run(model)
  File "/home/eric/.../train.runfiles/pypi__pytorch_lightning_python3_deps/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/home/eric/.../train.runfiles/pypi__pytorch_lightning_python3_deps/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
    self.accelerator.start_training(self)
  File "/home/eric/.../train.runfiles/pypi__pytorch_lightning_python3_deps/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/eric/.../train.runfiles/pypi__pytorch_lightning_python3_deps/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/home/eric/.../train.runfiles/pypi__pytorch_lightning_python3_deps/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()
  File "/home/eric/.../train.runfiles/pypi__pytorch_lightning_python3_deps/pytorch_lightning/trainer/trainer.py", line 861, in run_train
    epochs = range(self.current_epoch, self.max_epochs) if self.max_epochs else count(self.current_epoch)
TypeError: 'float' object cannot be interpreted as an integer
```

Additional context

https://pytorch-lightning.slack.com/archives/CRBLFHY79/p1627425271133200

EricWiener added the feature and help wanted labels on Aug 9, 2021
@ananthsub
Contributor

@awaelchli this is similar to #8210 (comment)

@EricWiener
Contributor Author

Just as a follow-up, I still think the default should be that jobs are automatically stopped, to avoid forgotten jobs racking up lots of server costs. However, I think the change should specifically clean up how you specify that you actually want the job to run forever.

@awaelchli
Contributor

I think Lightning's 1000 epochs is a safety feature.
The only idea I have without changing this behavior is to add a new special default:

```python
class Trainer:
    def __init__(..., max_epochs: Optional[Union[X, int]] = X):
```

Where X is a sentinel value (is this the right term??) to differentiate from None. When the user sets it to None we get the infinite loop, but otherwise we keep the previous behavior. Not the most elegant, but that's what I can think of right now.
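For example, a minimal sketch of the sentinel pattern (the names `_DefaultValue` and `_UNSET` are hypothetical, not Lightning API):

```python
from typing import Union

class _DefaultValue:
    """Sentinel type distinguishing 'argument not passed' from an explicit None."""

_UNSET = _DefaultValue()

class Trainer:
    def __init__(self, max_epochs: Union[_DefaultValue, int, None] = _UNSET):
        if max_epochs is _UNSET:
            max_epochs = 1000  # nothing passed: keep the 1000-epoch safety default
        # an explicit None now unambiguously requests infinite training
        self.max_epochs = max_epochs
```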

@EricWiener
Contributor Author

@awaelchli what do you think about setting max_epochs to a negative value instead?

@awaelchli
Contributor

that could also work yes ^^

@EricWiener
Contributor Author

Awesome. I'll create a PR

@EricWiener
Contributor Author

@awaelchli any recommendations for how to check the trainer will train forever if max_epochs is negative?

Also, I was debating whether to allow both max_epochs and max_steps to be set to negative values (or have one be negative and the other None) to disable automatic stopping. That would mean the following combos all disable automatic stopping: (negative, negative), (negative, None), (None, negative). This seemed like a mess to maintain and not worth it, so I left it as: max_epochs set to a negative value + max_steps = None. Thoughts?

@awaelchli
Contributor

@EricWiener actually I believe we should also introduce max_steps=-1. Then:

| max_epochs | max_steps | effect |
|---|---|---|
| -1 | -1 | trainer will not stop unless external forces |
| -1 | Any | turns off limits for epochs, still limited by max_steps |
| Any | -1 | the behavior we know today |

@EricWiener
Contributor Author

Current behavior is:

```python
# trainer.py
fit_loop = FitLoop(
    min_epochs=(1 if (min_epochs is None and min_steps is None) else min_epochs),
    max_epochs=(1000 if (max_epochs is None and max_steps is None) else max_epochs),
)

# fit_loop.py
stop_steps = self.max_steps is not None and self.global_step >= self.max_steps
stop_epochs = self.max_epochs is not None and self.current_epoch >= self.max_epochs
```

So if we were to allow (max_epochs = Any, max_steps = -1), we would immediately stop training. Same would be true if we allow (max_epochs = -1, max_steps = Any). The logic would need to be changed in order to allow both max_epochs and max_steps to be -1 (and not immediately stop training). I was trying to avoid adding additional complexity. What do you think?
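To spell out the immediate stop with concrete values (plain Python, not actual Lightning code):

```python
# with the current check, max_steps = -1 ends training on the very first iteration
max_steps = -1
global_step = 0
stop_steps = max_steps is not None and global_step >= max_steps
print(stop_steps)  # True, because 0 >= -1
```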

@ananthsub
Contributor

@EricWiener @awaelchli are there other validations we then need to add for min_steps and min_epochs?
For example, I think we should fail early if someone specifies (min_epochs=number greater than 0, max_epochs=-1)

@EricWiener
Contributor Author

> @EricWiener @awaelchli are there other validations we then need to add for min_steps and min_epochs?
> For example, I think we should fail early if someone specifies (min_epochs=number greater than 0, max_epochs=-1)

What would be the issue with allowing min_epochs = number greater than 0 and having max_epochs = -1? Setting max_epochs = -1 would disable automatic stopping based on max_epochs/max_steps, but it wouldn't disable stopping based on EarlyStopping or max_time. It could still make sense to set a desired minimum number of epochs to be met.

@ananthsub
Contributor

ananthsub commented Aug 20, 2021

> What would be the issue with allowing min_epochs = number greater than 0 and having max_epochs = -1? Setting max_epochs = -1 would disable automatic stopping based on max_epochs/max_steps, but it wouldn't disable stopping based on EarlyStopping or max_time. It could still make sense to set a desired minimum number of epochs to be met.

Yes you're absolutely right, my mistake.

As there are multiple stopping conditions one could set with max_epochs, max_steps, or max_time, which one takes precedence? What if a user specifies max_epochs=-1 and max_steps = number > 0, or vice versa? Or max_epochs=-1 and max_time=timedelta(...)?

I would prefer to raise a misconfiguration exception in case we have conflicting bounds instead of trying to determine some precedence across these conditions.

FWIW, another opportunity for infinite training is online training, where new data is continuously streamed in and we don't want to stop even if the dataloader is temporarily exhausted.

@awaelchli
Contributor

> The logic would need to be changed in order to allow both max_epochs and max_steps to be -1

@EricWiener yes certainly. we would specifically check for the value -1.

@EricWiener
Contributor Author

EricWiener commented Aug 20, 2021

> The logic would need to be changed in order to allow both max_epochs and max_steps to be -1

> @EricWiener yes certainly. we would specifically check for the value -1.

For "-1, Any, turns off limits for epochs, still limited by max_steps", this would be the same as currently leaving max_epochs = None and max_epochs = Any. Are we sure we want to add in the additional complexity of needing to allow max_steps = -1 to achieve the same functionality that already exists?

Got it about specifically checking for -1. In my PR, I was just checking if max_epochs < 0, but I will switch this to specifically checking for -1.

@EricWiener
Contributor Author

> What would be the issue with allowing min_epochs = number greater than 0 and having max_epochs = -1? Setting max_epochs = -1 would disable automatic stopping based on max_epochs/max_steps, but it wouldn't disable stopping based on EarlyStopping or max_time. It could still make sense to set a desired minimum number of epochs to be met.

> Yes you're absolutely right, my mistake.
>
> As there are multiple stopping conditions one could set with max_epochs, max_steps, or max_time, which one takes precedence? What if a user specifies max_epochs=-1 and max_steps = number > 0, or vice versa? Or max_epochs=-1 and max_time=timedelta(...)?
>
> I would prefer to raise a misconfiguration exception in case we have conflicting bounds instead of trying to determine some precedence across these conditions.
>
> FWIW, another opportunity for infinite training is online training, where new data is continuously streamed in and we don't want to stop even if the dataloader is temporarily exhausted.

I'm gonna copy my response from my PR over here (and update it to reflect the further discussion):

What are new valid configurations?

Original: The only new valid configuration is max_epochs declared as a negative integer + max_steps = None.
Updated: @awaelchli also wants to allow max_steps to be -1. Also, instead of allowing any negative value, we will only allow -1.

What are new invalid configurations?

  • max_epochs as a negative integer while max_steps is positive. This should be invalid because specifying max_steps means automatic stopping won't be disabled. It would result in max_epochs being honored and no training occurring.
  • max_steps or max_time as a negative value. This should be invalid because only max_epochs should be allowed to disable automatic stopping, to reduce confusion and the additional logic needed to maintain multiple flags. Note: if we decide to allow max_steps = -1, this will change.
  • Setting any of max_epochs, max_steps, or max_time to zero. This will result in no training occurring and doesn't make sense.

Are there configurations which aren't exceptional but which we should raise some sort of warning for?

No. We should not allow any of the above configurations.

What priority do we give among steps/epochs/time?

We should prioritize whichever occurs first with a user-defined value. This means if the user defines max_steps, we should only stop at this value even if the default max_epochs value occurs first. Even if automatic stopping is disabled by setting max_epochs < 0 and max_steps = None, we should still honor max_time.

The main intention of this PR was to disable lightning automatically stopping training after a certain number of epochs/steps were reached even if the user didn't want this behavior. You might still want a model to train for a certain amount of time, but not be limited by the number of epochs/steps. For instance, if you wanted a model to train at most for two weeks, but you weren't sure how many epochs/steps it would be able to reach, you could disable automatic stopping for epochs/steps, but still specify a time constraint.

Currently max_time is not really interacting much with max_epochs or max_steps. Even if max_time is specified, max_epochs will still default to 1000. Whichever of max_time or max_epochs/max_steps comes first will decide when the model stops.

You could also still achieve no stopping at all if you set max_epochs < 0, max_steps = None, and max_time = None.
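Concretely, the two configurations described above would look something like this (a sketch under this proposal, not the final API):

```python
from datetime import timedelta
from pytorch_lightning import Trainer

# disable automatic stopping completely
trainer = Trainer(max_epochs=-1, max_steps=None, max_time=None)

# no epoch/step limit, but stop after at most two weeks of wall-clock time
trainer = Trainer(max_epochs=-1, max_steps=None, max_time=timedelta(weeks=2))
```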

This seems to achieve the best result of:

  • You can disable stopping completely
  • You can train until a certain time limit (regardless of max_epochs/max_steps)

Do we need to adjust defaults anywhere that depend on max_steps or max_epochs?

All the defaults should remain. max_time and max_steps both default to None. max_epochs defaults to 1000 and is the only variable of the three not set to None.

@awaelchli
Contributor

I disagree with the options you declare as invalid.
It makes more sense to me to let one disable each stopping criterion individually. As I outlined here, I believe we probably want to have individual control with -1 for each.

| max_epochs | max_steps | effect |
|---|---|---|
| -1 | -1 | infinite training! <--- this is what @EricWiener wants |
| -1 | Any | turns off limits for epochs, still limited by max_steps |
| Any | -1 | turns off limits for steps, still limited by epochs |

> We should prioritize whichever occurs first with a user-defined value. This means if the user defines max_steps, we should only stop at this value even if the default max_epochs value occurs first

No, that is not correct. max_epochs=X, max_steps=Y should stop at whichever of these two limits gets satisfied first! We call it max_* for a reason.

Apart from that, I agree with the rest of your comments, but I think these functionalities will be satisfied given my comments above.

@EricWiener
Contributor Author

> I disagree with the options you declare as invalid.
> It makes more sense to me to let one disable each stopping criterion individually. As I outlined here, I believe we probably want to have individual control with -1 for each.
>
> | max_epochs | max_steps | effect |
> |---|---|---|
> | -1 | -1 | infinite training! <--- this is what @EricWiener wants |
> | -1 | Any | turns off limits for epochs, still limited by max_steps |
> | Any | -1 | turns off limits for steps, still limited by epochs |

Thank you for listing that out! Your idea does make more sense. Just to make sure we're all on the same page, I'm gonna flesh this out a little more to include all possible values:

  1. -1, -1, infinite training!
  2. -1, Any, turns off limits for epochs, still limited by max_steps
    a. if max_steps == None, this would also be infinite training since there isn't a default value for max_steps.
    b. if max_steps == 0, we immediately stop training / possibly throw an error?
    c. if max_steps > 0, we will train up until max_steps.
  3. Any, -1, turns off limits for steps, still limited by epochs
    a. if max_epochs == None, we will use the default value of 1000 epochs.
    b. if max_epochs == 0, we will immediately stop training / possibly throw an error
    c. if max_epochs > 0 we will train up until max_epochs

The difference between 2a and 3a was my main concern, but I think the behavior of -1 disabling a certain variable (what @awaelchli suggests) makes more sense/is worth it.

> We should prioritize whichever occurs first with a user-defined value. This means if the user defines max_steps, we should only stop at this value even if the default max_epochs value occurs first

> No, that is not correct. max_epochs=X, max_steps=Y should stop at whichever of these two limits gets satisfied first! We call it max_* for a reason.

I think my intention was not understood. I was saying to keep the current implementation:

```python
max_epochs=(1000 if (max_epochs is None and max_steps is None) else max_epochs)
```

Here, precedence is given to whatever comes first with a user-defined value. If the user defines max_steps, we don't set a default value for max_epochs, so we will stop at max_steps and not the default value set for max_epochs. This matches the current implementation where max_epochs isn't set if max_steps is not None.

Points to address

I think the main questions that need to be answered at this point are:

  • Do we want to define a default value for max_steps to keep the behavior of max_epochs and max_steps identical (currently 2a and 3a behave differently)? Or are we okay with the difference in behavior? I don't see a good way to choose a value for max_steps, so it might be best to just live with the difference (or count how many steps were taken in the first epoch and multiply that by the default value for max_epochs).
  • Do we want to throw an error if max_steps = 0, max_epochs = 0, or max_time = 0? Or do we not care, since FitLoop.done will handle this and just immediately return from run()? (relevant code)

@ananthsub
Contributor

@EricWiener - thank you for listing this out!

> 3a. if max_epochs == None, we will use the default value of 1000 epochs.

I think (max_steps=-1, max_epochs=None) should also be infinite training and should not default to 1000 epochs. Users who are doing purely step-based training should not need to worry about setting max_epochs=-1 as well. I'm also not sure this case would even fall into the max_epochs default: since max_steps is not None, the 1000-epoch default wouldn't be applied and max_epochs=None would be retained.

n00b question: if someone is used to specifying purely time-based training (i.e. no steps/epochs), they cannot toggle off the same flag to specify infinite training; they'd have to switch to specifying one of max_steps=-1 or max_epochs=-1. Are we okay with this asymmetry?

> Do we want to throw an error if max_steps = 0, max_epochs = 0, or max_time = 0? Or do we not care, since FitLoop.done will handle this and just immediately return from run()? (relevant code)

If we can return early or raise an error, I'd prefer doing so. There is a lot of setup that potentially happens before entering the fit loop, and it's a bad UX to waste time for nothing to happen.

@tchaton
Contributor

tchaton commented Aug 27, 2021

Hey @EricWiener,

I am assigning you to this issue as it seems we have a proposal :)

Best,
T.C

@tchaton
Contributor

tchaton commented Aug 27, 2021

> 1. -1, -1, infinite training!
> 2. -1, Any, turns off limits for epochs, still limited by max_steps
>    a. if max_steps == None, this would also be infinite training since there isn't a default value for max_steps.
>    b. if max_steps == 0, we immediately stop training / possibly throw an error?
>    c. if max_steps > 0, we will train up until max_steps.
> 3. Any, -1, turns off limits for steps, still limited by epochs
>    a. if max_epochs == None, we will use the default value of 1000 epochs.
>    b. if max_epochs == 0, we will immediately stop training / possibly throw an error
>    c. if max_epochs > 0 we will train up until max_epochs

IMO, there is no need to throw an error. If a user provides 0 for one or all, then nothing should happen, as you would expect from such values.

@EricWiener
Contributor Author

Alright so I'm going to go with:

max_epochs, max_steps, effect

  1. -1, -1, infinite training!
  2. -1, Any, turns off limits for epochs, still limited by max_steps
    a. if max_steps == None, infinite training
    b. if max_steps == 0, we immediately stop training (no error)
    c. if max_steps > 0, we will train up until max_steps.
  3. Any, -1, turns off limits for steps, still limited by epochs
    a. if max_epochs == None, infinite training
    b. if max_epochs == 0, we will immediately stop training (no error)
    c. if max_epochs > 0 we will train up until max_epochs

And thanks to @ananthsub's PR (#9072), we also have better behavior re: max_time.
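Put as code, a minimal sketch of the stopping check under these semantics (not the actual FitLoop implementation; assumes the None-handling defaults have already been resolved where applicable):

```python
def should_stop(current_epoch, global_step, max_epochs, max_steps):
    # -1 disables a limit; None also means "no limit" at this level
    stop_epochs = max_epochs not in (None, -1) and current_epoch >= max_epochs
    stop_steps = max_steps not in (None, -1) and global_step >= max_steps
    return stop_epochs or stop_steps

assert not should_stop(10**6, 10**9, -1, -1)  # 1. infinite training
assert should_stop(0, 100, -1, 100)           # 2c. limited by max_steps
assert should_stop(5, 0, 5, -1)               # 3c. limited by max_epochs
```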

@tchaton
Contributor

tchaton commented Aug 27, 2021

Great @EricWiener, the proposal sounds great!

@carmocca
Contributor

carmocca commented Aug 29, 2021

If we consider the product list(itertools.product(["-1", "0", ">1", "None"], repeat=2)), we get this table:

| # | max_epochs | max_steps | Effect | Same as |
|---|---|---|---|---|
| 1 | -1 | -1 | Infinite training | |
| 2 | -1 | 0 | Do nothing | |
| 3 | -1 | >1 | No epoch limits | |
| 4 | -1 | None | Infinite training | 1 |
| 5 | 0 | -1 | Do nothing | |
| 6 | 0 | 0 | Do nothing | |
| 7 | 0 | >1 | Do nothing* | |
| 8 | 0 | None | Do nothing | 5 or 6 |
| 9 | >1 | -1 | No step limits | |
| 10 | >1 | 0 | Do nothing | |
| 11 | >1 | >1 | Stop when one hits | |
| 12 | >1 | None | No step limits | 9 |
| 13 | None | -1 | Run for 1k epochs | |
| 14 | None | 0 | Do nothing | 2 |
| 15 | None | >1 | No epoch limits | 3 |
| 16 | None | None | Infinite training | 4 |

(*): Perhaps this case should be made equal to (3) in the future.

Note that every case with None as a value duplicates the functionality of another case, with the exception of (13), which uses None to set the 1k-epoch default.

This tells me that we should only allow None for the max_epochs parameter: in case (13) it keeps the 1k-epoch default, and otherwise we convert it to -1. This way we have:

| # | max_epochs | max_steps | Effect |
|---|---|---|---|
| 1 | -1 | -1 | Infinite training |
| 2 | -1 | 0 | Do nothing |
| 3 | -1 | >1 | No epoch limits |
| 4 | 0 | -1 | Do nothing |
| 5 | 0 | 0 | Do nothing |
| 6 | 0 | >1 | Do nothing* |
| 7 | >1 | -1 | No step limits |
| 8 | >1 | 0 | Do nothing |
| 9 | >1 | >1 | Stop when one hits |
| 10 | None | -1 | Run for 1k epochs |

Pseudocode:

```python
# default
max_epochs: Optional[int] = None
max_steps: int = -1

# conversion
if max_epochs is None:
    max_epochs = 1000 if max_steps == -1 else -1
```
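Wrapped into a function and checked against the table above (the helper name `resolve_max_epochs` is illustrative only):

```python
def resolve_max_epochs(max_epochs, max_steps):
    # None is only meaningful for max_epochs: it keeps the 1k safety default
    # when no step limit is set (max_steps == -1), and otherwise means -1
    if max_epochs is None:
        return 1000 if max_steps == -1 else -1
    return max_epochs

assert resolve_max_epochs(None, -1) == 1000  # case 10: run for 1k epochs
assert resolve_max_epochs(None, 500) == -1   # step-based training: no epoch limit
assert resolve_max_epochs(-1, -1) == -1      # case 1: infinite training
```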

@EricWiener
Contributor Author

@carmocca I really like this idea. However, this would be a breaking change: if anyone has explicitly set max_steps = None, their code won't work anymore. I'm all in favor of making breaking changes to improve the code, but I'm not sure how everyone else will feel.

Could I please get thoughts from @tchaton @awaelchli @ananthsub @justusschock before changing the PR to this implementation? I want to make sure we're all on the same page that we change the default to max_steps = -1 and don't allow max_steps to be None.

Could I please get either a response or a 👍/👎 from everyone tagged.

@carmocca
Contributor

> this would be a breaking change: if anyone has explicitly set max_steps = None, their code won't work anymore.

True, but does anybody set max_steps=None? Usually it's overridden only to set an int value.
I think it'd only be a problem for existing configuration files that have max_steps: None. But we can add a deprecation path for those.

@justusschock
Member

I think we definitely have to deprecate max_steps = None since this is part of the most public API we offer, but other than that I really like it :)

@awaelchli
Contributor

awaelchli commented Sep 1, 2021

I suggest treating this as a separate issue and keeping the PR focused on one thing. Disallowing None for steps can be a separate PR, imo.

@ananthsub
Contributor

ananthsub commented Sep 1, 2021

+1 @awaelchli - we can enable infinite training in #8877 and separately tighten the types offered for the stopping conditions.

@EricWiener
Contributor Author

Sounds great! Thanks for all the feedback

@carmocca
Contributor

carmocca commented Sep 6, 2021

Re-opening to track the type change

Are you interested in doing it @EricWiener?

carmocca reopened this on Sep 6, 2021
carmocca added the good first issue label on Sep 6, 2021
@EricWiener
Contributor Author

Sure @carmocca, but I might need a bit of time before I get to it.

@EricWiener
Contributor Author

@carmocca just made #9460. Not sure what needs to be done to deprecate setting max_steps = None.

@ananthsub
Contributor

ananthsub commented Sep 14, 2021

@awaelchli @carmocca I just realized data published to the trainer through LightningModule.log is going to accumulate without a reset. This means infinite training will run into OOMs if used this way.
For an online training scenario where data is continuously streamed in, does this mean users should not rely on self.log?

@carmocca
Contributor

> For an online training scenario where data is continuously streamed in, does this mean users should not rely on self.log?

At least, I'd recommend always setting self.log(on_epoch=False), since there won't be any "on epoch end" reductions happening.

But yes, if you still want to compute at a certain batch index, there's probably no reason to self.log. We need to brainstorm about logging during infinite training.
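For example, a hedged sketch of step-only logging inside a LightningModule (`compute_loss` is a hypothetical helper):

```python
import pytorch_lightning as pl

class OnlineModel(pl.LightningModule):
    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # hypothetical helper
        # log per step only: on_epoch=True would accumulate values for an
        # "on epoch end" reduction that never happens in infinite training
        self.log("train_loss", loss, on_step=True, on_epoch=False)
        return loss
```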
