Is your feature request related to a problem? Please describe.
When resuming a run mid-epoch, we currently go through the whole dataset again, which makes the epoch longer. This is quite expensive in rollout training. This showed up in a PR when training with max_steps.
Describe the solution you'd like
Ideally, we want to restore the state of our dataloader and datamodule when loading from a checkpoint. Torchdata offers a stateful dataloader (StatefulDataLoader). However, it does not (yet?) support distributed training.
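A minimal sketch of how StatefulDataLoader's save/restore could work in a single-process setting (the dataset and checkpoint dict here are placeholders, not the anemoi data pipeline):

```python
import torch
from torch.utils.data import TensorDataset
from torchdata.stateful_dataloader import StatefulDataLoader

dataset = TensorDataset(torch.arange(100).float())
loader = StatefulDataLoader(dataset, batch_size=4, num_workers=2)

# Consume part of the epoch, then capture the loader's position.
it = iter(loader)
for _ in range(5):
    next(it)
checkpoint = {"dataloader_state": loader.state_dict()}

# On resume, a fresh loader with the same configuration picks up at batch 6
# instead of replaying the whole epoch.
resumed = StatefulDataLoader(dataset, batch_size=4, num_workers=2)
resumed.load_state_dict(checkpoint["dataloader_state"])
```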
There are four components whose state we need to consider (a rough sketch of how they could fit together follows the list):
Dataloader
Datamodule
Dataset
Sampler
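One rough way these could be tied together, assuming pytorch_lightning's LightningDataModule state hooks (state_dict/load_state_dict are written to and restored from the checkpoint) plus torchdata's StatefulDataLoader; the class and attribute names below are illustrative, not the actual anemoi interface:

```python
import pytorch_lightning as pl
from torchdata.stateful_dataloader import StatefulDataLoader


class ResumableDataModule(pl.LightningDataModule):
    def __init__(self, dataset, batch_size: int = 4, num_workers: int = 2):
        super().__init__()
        self.dataset = dataset
        self.batch_size = batch_size
        self.num_workers = num_workers
        self._loader = None
        self._pending_state = None  # loader state waiting to be applied on resume

    def train_dataloader(self):
        self._loader = StatefulDataLoader(
            self.dataset, batch_size=self.batch_size, num_workers=self.num_workers
        )
        if self._pending_state is not None:
            self._loader.load_state_dict(self._pending_state)
            self._pending_state = None
        return self._loader

    # Called by Lightning when writing a checkpoint.
    def state_dict(self):
        return {"loader": self._loader.state_dict() if self._loader else None}

    # Called by Lightning when restoring from a checkpoint.
    def load_state_dict(self, state_dict):
        self._pending_state = state_dict.get("loader")
```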
Things to brainstorm and discuss:
We need to understand the limitations around workers/ranks; this might require designing the interface behind a flag so that people can switch it off if it proves too limiting.
An intermediate solution could be limited to requiring the same number of workers/ranks (sketched below).
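A sketch of that intermediate idea: a sampler that records how many samples the current rank has consumed and refuses to restore when the number of ranks differs. This is illustrative only; worker-side state (num_workers > 0) is not handled here.

```python
from torch.utils.data import DistributedSampler


class ResumableDistributedSampler(DistributedSampler):
    """DistributedSampler that can report and restore its mid-epoch position.

    Restoring is only allowed when the number of ranks matches the run that
    produced the state (the limitation discussed above).
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._consumed = 0  # samples already yielded on this rank this epoch

    def __iter__(self):
        indices = list(super().__iter__())
        for idx in indices[self._consumed:]:
            self._consumed += 1
            yield idx
        self._consumed = 0  # epoch finished: next epoch starts from zero

    def state_dict(self):
        return {
            "epoch": self.epoch,
            "consumed": self._consumed,
            "num_replicas": self.num_replicas,
        }

    def load_state_dict(self, state):
        if state["num_replicas"] != self.num_replicas:
            raise ValueError("Mid-epoch resume requires the same number of ranks")
        self.set_epoch(state["epoch"])
        self._consumed = state["consumed"]
```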
First steps:
Describe alternatives you've considered
No response
Additional context
No response
Organisation
ECMWF