Bug report in resume training using Trainer and FSDP #26159
Hey! Could you share a reproducer here? (even if it is the same)
Sorry, I cannot share the log because I am using my company's machines. I am on transformers 4.33.0 with the latest accelerate, running distributed training on 4 nodes with 8 GPUs each, using FSDP with full sharding and auto wrap. The checkpoint is saved by the Trainer with the saving strategy set to 'steps'. When I resume training through the Trainer (passing the resume argument to train()), I hit the error, so I think this is easy to reproduce. I worked this out from the modification mentioned above: the TypeError on None appears because FSDP's parameter-loading function checks whether the sharding strategy of the saved weights matches the model it is loading into, and in the Trainer this all happens before the model is prepared with accelerate.
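To make that concrete, here is a minimal illustrative sketch (our own, not the Trainer's or accelerate's actual code): before `accelerator.prepare()`, the model contains no FSDP-wrapped submodules at all, so there is no sharding strategy or state-dict config for the loading code to compare against.

```python
# Minimal illustrative sketch, not library code: an unwrapped module has no
# FSDP submodules, hence no sharding strategy / state_dict_type to read.
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = nn.Linear(8, 8)          # stands in for the raw model before prepare()
print(FSDP.fsdp_modules(model))  # -> []  nothing to compare the checkpoint against
```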
also ccing @muellerzr for trainer 😉
Hi, I'm also running into the exact same error. Would be great if there is a permanent fix. I temporarily implemented this fix by @winglian (https://github.com/OpenAccess-AI-Collective/axolotl/pull/400/files#diff-0b142e48f0c0b4bdf2677ce86ee6352c3a5e5a3a9ddf22020a2920f496f74d2eR29). It gets past the error, but then hangs on actually resuming the run. I'm also wondering whether it's possible to resume training on a different machine setup. For example, if I saved the FSDP checkpoint while training distributed on 2 nodes, can I resume that checkpoint on 3 nodes? Thanks.
Yes, at first it checks the FSDP strategy your program is about to use (i.e., fsdp_plugin.state_dict_type). The type then becomes None because the program goes on to check the model's FSDP strategy, and the accelerator hasn't prepared the model with FSDP yet. I think saving with 2 nodes and resuming with 3 would be an issue, since the rng_state partitions would not match.
Got it. Is it possible to reset the rng_states to resume on the 3 nodes?
I have never tried it, but I think there should be a way.
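One possible (untested) way around the rng_state partition mismatch when the node count changes is to give up on restoring the saved per-rank RNG states and reseed each rank deterministically instead; whether the Trainer quietly skips missing rng_state files in the checkpoint folder is an assumption you would want to verify first. A rough sketch:

```python
# Rough workaround sketch, not a transformers API: reseed every rank
# deterministically instead of restoring the saved per-rank RNG states.
# This trades exact reproducibility of the old run for the ability to resume.
import random

import numpy as np
import torch
import torch.distributed as dist


def reseed_for_new_topology(base_seed: int = 42) -> None:
    """Derive a deterministic, rank-dependent seed for the new world size."""
    rank = dist.get_rank() if dist.is_initialized() else 0
    seed = base_seed + rank
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
```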
cc @pacman100 for fsdp
Might be relevant to check #26180 as well
Fixed in PR #26180 |
@pacman100 I'm also running into a similar error with the latest main branch:
Originally posted by @howard-yen in #25100 (comment)
I met the exact same error when resuming from a checkpoint to continue training with FSDP. It happens inside the PyTorch FSDP checkpoint-loading function, which has to check that the model the weights are loaded into uses the same FSDP strategy. But the checkpoint-loading lines in src/transformers/trainer.py (in the _inner_training_loop() function, around lines 1963-1964) run before the accelerator prepares the model. After I moved those lines to after accelerator.prepare(model), resuming worked fine for me.
Hope this can be fixed properly, because I don't know whether naively moving these lines below accelerator.prepare would cause any trouble for sagemaker_mp.
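For reference, a hedged sketch of the reordering described above (paraphrased, not the exact transformers source; `Trainer._load_from_checkpoint`, `self.accelerator.prepare`, and `self.is_fsdp_enabled` do exist, but the surrounding logic is simplified into a stand-in method):

```python
# Hypothetical, simplified sketch; not a copy of the real transformers source.
# The original ordering (roughly trainer.py lines 1963-1964) loaded the
# checkpoint into the still-unwrapped model *before* accelerator.prepare(),
# which is what makes the FSDP sharding-strategy check fail on None.

def resume_fsdp_training_sketch(self, model, resume_from_checkpoint=None):
    # Prepare first, so the model is FSDP-wrapped before any checkpoint loading.
    model = self.accelerator.prepare(model)

    # Only now load the sharded checkpoint: the FSDP strategy/state_dict_type
    # check sees the wrapped model and can succeed.
    if resume_from_checkpoint is not None and self.is_fsdp_enabled:
        self._load_from_checkpoint(resume_from_checkpoint, model)

    return model
```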