
Properly support iteration-based training #7629

@Rizhiy


🚀 Feature

Currently, a lot of training functionality is tied to the concept of an "epoch".
IMHO, epoch-based training is bad practice, and I always train for a fixed number of iterations.
PL makes this approach very difficult.

Motivation

Currently, it is very difficult to use PL with iteration-based training. For example, Trainer resumes from a checkpoint at the start of the next epoch, rather than at the iteration at which the checkpoint was made.

In the real world, the composition of the training set changes frequently, and training based on epochs can give misleading results. For example, if I create a new dataset by duplicating every item in my original one and train for the same number of epochs, I will probably get a better result. That might make me think the new data improved the results, when in reality I just trained for longer.
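To make the duplication example concrete, here is a small arithmetic sketch. The dataset sizes, batch size, and the helper name `optimizer_steps` are made up for illustration and are not part of PL:

```python
import math


def optimizer_steps(num_samples: int, batch_size: int, epochs: int) -> int:
    """Optimizer steps taken when training for a fixed number of epochs.

    Assumes one optimizer step per batch and that a partial last batch is kept.
    """
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * epochs


# Hypothetical numbers: same epochs and batch size, dataset duplicated once.
original = optimizer_steps(num_samples=10_000, batch_size=100, epochs=10)
duplicated = optimizer_steps(num_samples=20_000, batch_size=100, epochs=10)

print(original, duplicated)
```

With the epoch count held fixed, the duplicated dataset gets twice as many optimizer steps even though it contains no new information.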

Training on very large datasets can take days or weeks. When resources are constrained, I need to schedule time for each experiment, and having to manually calculate the number of epochs that fit in a time period is very annoying.
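A rough sketch of the calculation I would rather do once, in steps. The helper name and the throughput figure are assumptions for illustration; the resulting number is what I would pass as `max_steps`:

```python
import math


def steps_for_budget(budget_hours: float, seconds_per_step: float) -> int:
    """Largest number of training steps that fits in a wall-clock budget.

    Assumes a roughly constant time per step and ignores validation time.
    """
    if seconds_per_step <= 0:
        raise ValueError("seconds_per_step must be positive")
    return math.floor(budget_hours * 3600 / seconds_per_step)


# Hypothetical slot: 24 hours of compute at about 0.5 s per step.
max_steps = steps_for_budget(budget_hours=24, seconds_per_step=0.5)
print(max_steps)
```

This number stays valid when the dataset grows or shrinks, whereas the equivalent epoch count has to be recomputed every time.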

Pitch

Add proper support for iteration-based training. The easiest way would be to keep track of the total number of iterations across epochs and allow the various parts/callbacks to use it.
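A minimal sketch of what I mean, in plain Python rather than PL internals. The function and parameter names (`train_by_iterations`, `val_every_n_steps`, and so on) are hypothetical and only illustrate the idea of a single step counter that survives epoch boundaries:

```python
from typing import Callable, Iterable


def train_by_iterations(
    make_epoch: Callable[[], Iterable],
    max_steps: int,
    val_every_n_steps: int,
    train_step: Callable[[object], None],
    validate: Callable[[int], None],
) -> int:
    """Train until max_steps, restarting the data iterable as often as needed."""
    global_step = 0  # counts steps across all epochs, never reset
    while global_step < max_steps:
        saw_batch = False
        for batch in make_epoch():
            saw_batch = True
            train_step(batch)
            global_step += 1
            # Validation is keyed to the global step, not to the epoch.
            if global_step % val_every_n_steps == 0:
                validate(global_step)
            if global_step >= max_steps:
                break
        if not saw_batch:
            break  # empty dataset: avoid looping forever
    return global_step


validated_at = []
final_step = train_by_iterations(
    make_epoch=lambda: range(4),     # 4 batches per epoch
    max_steps=10,
    val_every_n_steps=6,             # longer than one epoch
    train_step=lambda batch: None,   # stand-in for the real update
    validate=validated_at.append,
)
print(final_step, validated_at)
```

Here validation runs every 6 steps even though an epoch is only 4 batches long, and training stops in the middle of the third epoch.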

Alternatives

Alternatively, PL could switch completely to iteration-based training and derive epochs from the length of the dataset, but that would probably require quite a large rewrite and give about the same result.

Additional context

Current problems:

  • It is not possible to run validation at an interval longer than one epoch
  • If the dataset is very large and training lasts only a few epochs, resuming can significantly worsen the results by starting from the next epoch
  • When max_steps is passed to pl.Trainer, it would be nice if it were used in the tqdm progress bar.
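For the resume problem, a sketch of how a saved global step could be mapped back to a position inside an epoch. This is not how PL does it; the helper names are hypothetical, and it assumes the batch order within an epoch is reproducible (e.g. a fixed shuffle seed):

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple


def resume_position(global_step: int, batches_per_epoch: int) -> Tuple[int, int]:
    """Return (completed_epochs, batches_already_done_in_current_epoch)."""
    if batches_per_epoch <= 0:
        raise ValueError("batches_per_epoch must be positive")
    return divmod(global_step, batches_per_epoch)


def remaining_batches(epoch_batches: Iterable, batches_done: int) -> Iterator:
    """Skip the batches that were already consumed before the checkpoint."""
    return islice(epoch_batches, batches_done, None)


# Hypothetical checkpoint taken at step 2500 with 1000 batches per epoch.
epoch, batches_done = resume_position(global_step=2500, batches_per_epoch=1000)
rest = list(remaining_batches(range(1000), batches_done))
print(epoch, batches_done, len(rest))
```

Resuming this way continues from batch 500 of the third epoch instead of jumping to the start of the next one, so no data is skipped.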

Labels

feature: Is an improvement or enhancement
