Bug description
The `pretrain` command currently saves and loads (on resume) the `train_dataloader` as part of the model state. The issue is that `CycleIterator` is initialized *after* reloading the `train_dataloader` (litgpt/pretrain.py, line 282 at `1d37f9a`):
and it takes the current state as the "starting point" for all future epochs. What this means in practice: say an experiment crashed 90% of the way through the dataset. On resume, the first epoch will run as expected, but subsequent epochs will only go through the last 10% of the data, repeatedly. The other, smaller inconvenience is that `CycleIterator` is how the current code tracks epochs, and since this state is not saved, if the crash happens anywhere other than epoch 1, the epoch count will be incorrect in resumed runs.
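To make the failure concrete, here is a minimal, self-contained sketch. This is not LitGPT's actual code: `CycleIterator` below is a simplified mimic, and `StatefulLoader` is a toy stand-in for a dataloader whose restored checkpoint state sets its start offset.

```python
class CycleIterator:
    """Simplified mimic of a cycling iterator that counts epochs on wrap-around."""

    def __init__(self, iterable):
        self.iterable = iterable  # whatever state this has *now* becomes the "start"
        self.epoch = 0
        self._iterator = None

    def __iter__(self):
        return self

    def __next__(self):
        if self._iterator is None:
            self._iterator = iter(self.iterable)
        try:
            return next(self._iterator)
        except StopIteration:
            # Wrap around: ask the wrapped iterable for a *new* iterator
            self._iterator = iter(self.iterable)
            self.epoch += 1
            return next(self._iterator)


class StatefulLoader:
    """Toy stand-in for a dataloader whose loaded state sets a start offset.

    Every fresh iteration begins at that offset, like a loader whose state
    was restored from a checkpoint before CycleIterator wrapped it.
    """

    def __init__(self, data, start=0):
        self.data = data
        self.start = start  # pretend this came from the checkpoint

    def __iter__(self):
        return iter(self.data[self.start:])


loader = StatefulLoader(list(range(10)), start=9)  # "crashed" 90% through
cycle = CycleIterator(loader)
seen = [next(cycle) for _ in range(4)]
print(seen)         # [9, 9, 9, 9] -- every "epoch" replays only the tail
print(cycle.epoch)  # 3
```

Every wrap-around re-iterates the loader from its restored offset, so only the tail of the dataset is ever seen again, and the epoch counter races ahead.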
As far as I can see, there are four potential solutions, but I'm not 100% sure which are possible:
1. Do the same thing we do in `finetune_full`: don't save the `train_dataloader` state and instead skip all the data the training saw in the previous run (easily done by looking at `initial_iter` and skipping batches until it is reached). This is quite inefficient, however, especially at pretraining data scales.
2. Save the `CycleIterator` state instead of the `train_dataloader`'s. I am not sure what Lightning Fabric's save/load expects, so I am not sure this is even possible.
3. Supply `CycleIterator` with both the saved `train_dataloader` state and a "fresh" `train_dataloader`, so it can continue from the first but always use the second for resetting. This feels a bit hacky and clunky, however.
4. Somehow "reset" the `train_dataloader` to the beginning inside `CycleIterator` when resetting the iterator. I am not sure how this can be done (without changing the semantics of shuffling, etc.).
I'm happy to implement any of these if they seem like the right solution!
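For reference, the first option (skip already-seen data on resume) could be sketched roughly as follows. The function name and plumbing here are illustrative, not the actual `finetune_full` code:

```python
def resume_skip(dataloader, initial_iter):
    """Fast-forward a fresh dataloader past already-consumed batches.

    Illustrative sketch of the skip-ahead idea: it wastes O(initial_iter)
    iterations, which is the inefficiency noted above at pretraining scale.
    """
    iterator = iter(dataloader)
    for _ in range(initial_iter):
        next(iterator)  # discard a batch the previous run already trained on
    return iterator


# Toy usage: a "dataset" of 10 batches, 7 already seen before the crash
remaining = list(resume_skip(range(10), 7))
print(remaining)  # [7, 8, 9]
```

Because the dataloader starts fresh, subsequent epochs and the epoch count behave correctly; the cost is re-reading (and discarding) everything up to `initial_iter` once per resume.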
What operating system are you using?
Linux
LitGPT Version
Version: 0.4.10