resume state of data loader and data module from mid-epoch checkpoint #4

Open
theissenhelen opened this issue Dec 13, 2024 · 0 comments
Assignees
Labels
enhancement New feature or request training

Comments

@theissenhelen
Collaborator

Is your feature request related to a problem? Please describe.

When resuming a run mid-epoch, we currently iterate over the whole dataset again, which makes the resumed epoch longer than it should be. This is quite expensive in rollout training. This was shown in PR when training with max_steps.

Describe the solution you'd like

Ideally, we want to restore the state of our dataloader and datamodule when loading from a checkpoint. Torchdata offers a stateful dataloader (StatefulDataLoader). However, it does not (yet?) support distributed training.
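
As a rough illustration of what restoring the dataloader state could mean at the sampler level, here is a minimal pure-Python sketch. `ResumableSampler` and its fields are hypothetical names, not part of torchdata or anemoi; the sketch only borrows the `state_dict()` / `load_state_dict()` naming convention used across PyTorch, and it assumes a single process with no worker prefetching.

```python
import random


class ResumableSampler:
    """Hypothetical sampler that can resume mid-epoch (not a torchdata API)."""

    def __init__(self, num_samples: int, seed: int = 0) -> None:
        self.num_samples = num_samples
        self.seed = seed
        self.epoch = 0
        self.position = 0  # number of indices already yielded in this epoch

    def _permutation(self) -> list[int]:
        # Deriving the RNG from (seed, epoch) makes the order reproducible
        # after a restart, so only the position has to be stored.
        rng = random.Random(self.seed * 1_000_003 + self.epoch)
        indices = list(range(self.num_samples))
        rng.shuffle(indices)
        return indices

    def __iter__(self):
        indices = self._permutation()
        while self.position < self.num_samples:
            index = indices[self.position]
            self.position += 1
            yield index
        # Epoch finished: move on to the next one.
        self.epoch += 1
        self.position = 0

    def state_dict(self) -> dict:
        return {"seed": self.seed, "epoch": self.epoch, "position": self.position}

    def load_state_dict(self, state: dict) -> None:
        self.seed = state["seed"]
        self.epoch = state["epoch"]
        self.position = state["position"]
```

With prefetching workers the sampler runs ahead of the training loop, so a real implementation would probably need to record the number of batches actually consumed rather than the number of indices handed out.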

There are four stages we need to consider:

  1. Dataloader
  2. Datamodule
  3. Dataset
  4. Sampler
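
One possible way to connect these stages is for each one to expose its own state and for the datamodule to nest those states under separate keys in the checkpoint. The sketch below is only an assumption about how that could look; `Stateful`, `PositionTracker` and `DataModuleState` are hypothetical stand-ins, and no PyTorch or Lightning classes are used.

```python
from typing import Any, Protocol


class Stateful(Protocol):
    """Anything that can save and restore its own state."""

    def state_dict(self) -> dict[str, Any]: ...

    def load_state_dict(self, state: dict[str, Any]) -> None: ...


class PositionTracker:
    """Stand-in for a sampler or dataset that tracks a position (hypothetical)."""

    def __init__(self) -> None:
        self.position = 0

    def state_dict(self) -> dict[str, Any]:
        return {"position": self.position}

    def load_state_dict(self, state: dict[str, Any]) -> None:
        self.position = state["position"]


class DataModuleState:
    """Hypothetical top-level owner that delegates to its components."""

    def __init__(self, **components: Stateful) -> None:
        self.components = components

    def state_dict(self) -> dict[str, Any]:
        # One nested entry per component, keyed by its name.
        return {name: c.state_dict() for name, c in self.components.items()}

    def load_state_dict(self, state: dict[str, Any]) -> None:
        for name, component in self.components.items():
            component.load_state_dict(state[name])
```

Keeping one key per component means a stage without state can return an empty dict without affecting the others.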

Things to brainstorm and discuss:

  • We need to understand the limitations for workers/ranks. This might require designing the interface as a flag, so that people can switch it off if it is too limiting.
  • An intermediate solution could be limited to using the same number of workers/ranks.
  • Is torchdata maintained!?
  • Keep an eye on this issue
  • How does PyTorch use a stateful dataset in its dataloader?
  • How does PyTorch Lightning connect the stateful dataloader with the stateful dataset? (Is there functionality for this?)
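
The first two points could be sketched as a guard that only allows a mid-epoch resume when the feature is enabled and the checkpoint was written with the same number of ranks and workers. `Topology`, `resume_position`, the `enabled` flag and the checkpoint keys are hypothetical names for illustration, not an existing anemoi or Lightning interface.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Topology:
    """Hypothetical record of the layout a checkpoint was written with."""

    world_size: int
    num_workers: int


def resume_position(checkpoint: dict, current: Topology, enabled: bool = True) -> int:
    """Return the sample position to resume from, or 0 to restart the epoch."""
    if not enabled:
        return 0  # feature switched off by the user
    if checkpoint.get("topology") != asdict(current):
        return 0  # layout changed, so the saved position cannot be trusted
    return checkpoint.get("position", 0)
```

Falling back to position 0 reproduces the current behaviour, so a changed layout would cost a longer epoch rather than skipped or duplicated samples.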

First steps:

Describe alternatives you've considered

No response

Additional context

No response

Organisation

ECMWF

@theissenhelen theissenhelen added the enhancement New feature or request label Dec 13, 2024
@Rilwan-Adewoyin Rilwan-Adewoyin self-assigned this Dec 18, 2024
@JesperDramsch JesperDramsch transferred this issue from ecmwf/anemoi-training Dec 19, 2024

4 participants