🚀 Feature
Currently, a lot of training functionality is tied to the concept of "Epoch".
IMHO, training based on epochs is bad practice, and I always train using the raw number of iterations.
PL makes this approach very difficult.
Motivation
Currently, it is very difficult to use PL with iteration-based training, e.g. the Trainer currently resumes from a checkpoint at the start of the next epoch, rather than at the iteration at which the checkpoint was made.
In the real world, the composition of the training set changes pretty frequently, and training based on epochs can give misleading results. E.g., if I create a new dataset by duplicating every item in my original one and train for the same number of epochs, I will probably get a better result, which might make me think the new data improved things, when in reality I just trained for longer.
Training on very large datasets can take days or weeks. When resources are constrained I need to schedule time for each experiment, and having to manually calculate the number of epochs that fit in a time period is very annoying.
Pitch
Add proper support for iteration-based training. The easiest approach would be to keep track of the total number of iterations across epochs and allow the various parts/callbacks to use it.
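To make the pitch concrete, here is a minimal sketch of a callback that keeps its own iteration counter across epochs and fires every N steps. The class name `EveryNIterations`, the `every_n_steps` argument, and the exact hook signature are assumptions (hook signatures differ between PL versions); it is only meant to illustrate the kind of hook access this feature would need.

```python
import pytorch_lightning as pl


class EveryNIterations(pl.Callback):
    """Hypothetical callback that acts on a global iteration count."""

    def __init__(self, every_n_steps: int = 1000):
        self.every_n_steps = every_n_steps
        self.total_steps = 0  # persists across epoch boundaries

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # Count every training batch, regardless of which epoch it belongs to.
        self.total_steps += 1
        if self.total_steps % self.every_n_steps == 0:
            # Anything iteration-based could hook in here: logging,
            # checkpointing, or triggering a validation run.
            pl_module.log("total_steps", float(self.total_steps))
```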
Alternatives
You could also switch completely to iteration-based training and calculate epochs from the length of the dataset, but that would probably require quite a large rewrite for roughly the same result.
Additional context
Current problems:
- Not possible to run validation at an interval longer than one epoch
- If the dataset is very large and training runs for only a few epochs, resuming can significantly worsen the results by starting from the next epoch
- When `max_steps` is passed to `pl.Trainer`, it would be nice if it was used in `tqdm`.
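For reference, a purely iteration-based run might look roughly like the sketch below. `max_steps` and `val_check_interval` are existing `pl.Trainer` arguments, but whether they cover these cases (validation interval longer than one epoch, step-exact resume) is exactly what this issue is about; `resume_from_checkpoint` refers to the constructor argument available in older PL versions, and all values are illustrative.

```python
import pytorch_lightning as pl

# Sketch of the desired, epoch-free workflow (values are illustrative):
trainer = pl.Trainer(
    max_steps=200_000,          # total budget in optimizer steps, not epochs
    val_check_interval=5_000,   # ideally: validate every 5k steps, even if
                                # that is longer than one epoch
    resume_from_checkpoint="last.ckpt",  # ideally: resume at the exact step,
                                         # not at the start of the next epoch
)
# trainer.fit(model)  # model defined elsewhere
```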