Is your feature request related to a problem? Please describe.
When resuming a run mid-epoch, we currently go through the whole dataset again, which makes the epoch longer. This is quite expensive in rollout training. This showed up in a PR when training with max_steps.
Describe the solution you'd like
Ideally, we want to restore the state of our dataloader and datamodule when loading from a checkpoint. Torchdata offers a stateful dataloader (StatefulDataLoader). However, it does not (yet?) support distributed training.
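A minimal sketch of how StatefulDataLoader's save/restore could work in a single-process setting (the dataset and checkpoint dict here are placeholders, not the anemoi data pipeline):

```python
import torch
from torch.utils.data import TensorDataset
from torchdata.stateful_dataloader import StatefulDataLoader

dataset = TensorDataset(torch.arange(100).float())
loader = StatefulDataLoader(dataset, batch_size=4, num_workers=2)

# Consume part of the epoch, then capture the loader's position.
it = iter(loader)
for _ in range(5):
    next(it)
checkpoint = {"dataloader_state": loader.state_dict()}

# On resume, a fresh loader with the same configuration picks up at batch 6
# instead of replaying the whole epoch.
resumed = StatefulDataLoader(dataset, batch_size=4, num_workers=2)
resumed.load_state_dict(checkpoint["dataloader_state"])
```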
There are four components whose state we need to consider (a rough sketch of how they could fit together follows the list):
Dataloader
Datamodule
Dataset
Sampler
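One rough way these could be tied together, assuming pytorch_lightning's LightningDataModule state hooks (state_dict/load_state_dict are written to and restored from the checkpoint) plus torchdata's StatefulDataLoader; the class and attribute names below are illustrative, not the actual anemoi interface:

```python
import pytorch_lightning as pl
from torchdata.stateful_dataloader import StatefulDataLoader


class ResumableDataModule(pl.LightningDataModule):
    def __init__(self, dataset, batch_size: int = 4, num_workers: int = 2):
        super().__init__()
        self.dataset = dataset
        self.batch_size = batch_size
        self.num_workers = num_workers
        self._loader = None
        self._pending_state = None  # loader state waiting to be applied on resume

    def train_dataloader(self):
        self._loader = StatefulDataLoader(
            self.dataset, batch_size=self.batch_size, num_workers=self.num_workers
        )
        if self._pending_state is not None:
            self._loader.load_state_dict(self._pending_state)
            self._pending_state = None
        return self._loader

    # Called by Lightning when writing a checkpoint.
    def state_dict(self):
        return {"loader": self._loader.state_dict() if self._loader else None}

    # Called by Lightning when restoring from a checkpoint.
    def load_state_dict(self, state_dict):
        self._pending_state = state_dict.get("loader")
```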
Things to brainstorm and discuss:
We need to understand the limitations around workers/ranks; this might require designing the interface behind a flag so that people can switch it off if it proves too limiting.
An intermediate solution could be limited to requiring the same number of workers/ranks (sketched below).
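A sketch of that intermediate idea: a sampler that records how many samples the current rank has consumed and refuses to restore when the number of ranks differs. This is illustrative only; worker-side state (num_workers > 0) is not handled here.

```python
from torch.utils.data import DistributedSampler


class ResumableDistributedSampler(DistributedSampler):
    """DistributedSampler that can report and restore its mid-epoch position.

    Restoring is only allowed when the number of ranks matches the run that
    produced the state (the limitation discussed above).
    """

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._consumed = 0  # samples already yielded on this rank this epoch

    def __iter__(self):
        indices = list(super().__iter__())
        for idx in indices[self._consumed:]:
            self._consumed += 1
            yield idx
        self._consumed = 0  # epoch finished: next epoch starts from zero

    def state_dict(self):
        return {
            "epoch": self.epoch,
            "consumed": self._consumed,
            "num_replicas": self.num_replicas,
        }

    def load_state_dict(self, state):
        if state["num_replicas"] != self.num_replicas:
            raise ValueError("Mid-epoch resume requires the same number of ranks")
        self.set_epoch(state["epoch"])
        self._consumed = state["consumed"]
```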
First steps:
Describe alternatives you've considered
No response
Additional context
No response
Organisation
ECMWF