Bug report in resume training using Trainer and FSDP #26159

Closed
Yuanhy1997 opened this issue Sep 14, 2023 · 10 comments

Comments

@Yuanhy1997

Yuanhy1997 commented Sep 14, 2023

@pacman100 I'm also running into a similar error with the latest main branch:

File "/home/hyen/.conda/envs/cross/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 608, in set_state_dict_type
    state_dict_config_type = _state_dict_type_to_config[state_dict_type]
KeyError: None

Originally posted by @howard-yen in #25100 (comment)

I met the exact same error when resuming from a checkpoint to continue training with FSDP. It happens inside PyTorch's FSDP checkpoint-loading code, which has to verify that the model receiving the weights uses the same FSDP strategy as the saved checkpoint. But the lines below in src/transformers/trainer.py (in the _inner_training_loop() function, lines 1963-1964)

if (is_sagemaker_mp_enabled() or self.is_fsdp_enabled) and resume_from_checkpoint is not None:
    self._load_from_checkpoint(resume_from_checkpoint, model)

run before the accelerator prepares the model. After I moved them to after accelerator.prepare(model), resuming worked fine for me.

I hope this can be fixed properly, because I don't know whether naively moving these lines below accelerator.prepare would cause any trouble for sagemaker_mp.
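
For clarity, here is a rough sketch of the reordering I tried inside _inner_training_loop (simplified and not the actual transformers source; the surrounding Trainer logic is omitted, and whether this is safe for sagemaker_mp is exactly the open question):

# Sketch only: simplified reordering inside Trainer._inner_training_loop.
# Original order -- the checkpoint is loaded while `model` is still a plain
# nn.Module, so the FSDP state-dict check finds no sharding info and fails:
#
#   if (is_sagemaker_mp_enabled() or self.is_fsdp_enabled) and resume_from_checkpoint is not None:
#       self._load_from_checkpoint(resume_from_checkpoint, model)
#   model = self.accelerator.prepare(model)

# Reordered -- wrap the model with FSDP first, then load the checkpoint:
model = self.accelerator.prepare(model)
if (is_sagemaker_mp_enabled() or self.is_fsdp_enabled) and resume_from_checkpoint is not None:
    self._load_from_checkpoint(resume_from_checkpoint, model)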

@ArthurZucker
Collaborator

Hey! Could you share a reproducer here? (even if it is the same)

@Yuanhy1997
Author

Sorry, I cannot share the logs because I'm running on company machines, but I'm using transformers 4.33.0 and the latest accelerate. The training is distributed across 4 nodes with 8 GPUs each, using FSDP with full sharding and auto wrap. The checkpoint was saved by the Trainer with the saving strategy set to 'steps'.

When I then use the Trainer to resume training, i.e. pass resume_from_checkpoint=True to train(), I hit the error. I think this can be reproduced easily; I figured it out through the modification mentioned above.

The reason for the KeyError on None is that the FSDP parameter-loading function checks whether the sharding strategy of the saved weights matches that of the model it is loading into. In the Trainer, this all happens before the model is prepared with accelerate.
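
A rough reproducer sketch along those lines (the model, dataset, and hyperparameters here are placeholders picked for illustration, not my actual configuration; launch it with torchrun or accelerate launch across the nodes):

# Reproducer sketch with placeholder model/data. The first launch trains from
# scratch and saves checkpoints; the second launch resumes from the latest one,
# which is where the KeyError: None shows up with FSDP.
import os

import torch
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
from transformers.trainer_utils import get_last_checkpoint

class DummyDataset(Dataset):
    # Tiny synthetic dataset, just enough to drive a few training steps.
    def __len__(self):
        return 256

    def __getitem__(self, idx):
        ids = torch.randint(0, 50257, (128,))
        return {"input_ids": ids, "labels": ids.clone()}

args = TrainingArguments(
    output_dir="./fsdp-resume-test",
    per_device_train_batch_size=1,
    max_steps=50,
    save_strategy="steps",            # checkpoints saved by steps, as above
    save_steps=10,
    fsdp="full_shard auto_wrap",      # full shard + auto wrap, as above
    fsdp_transformer_layer_cls_to_wrap="GPT2Block",  # placeholder layer class
)

trainer = Trainer(
    model=AutoModelForCausalLM.from_pretrained("gpt2"),
    args=args,
    train_dataset=DummyDataset(),
)

resume = os.path.isdir(args.output_dir) and get_last_checkpoint(args.output_dir) is not None
trainer.train(resume_from_checkpoint=True if resume else None)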

@ArthurZucker
Collaborator

also ccing @muellerzr for trainer 😉

@jmzeng

jmzeng commented Sep 15, 2023

Hi, I'm also running into the exact same error; it would be great if there were a permanent fix. I did check the fsdp_plugin.state_dict_type that is passed in through https://github.com/huggingface/accelerate/blob/main/src/accelerate/utils/fsdp_utils.py and then into https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel.set_state_dict_type. It is passed in correctly at first, but after a few iterations it becomes None, which may be causing the issue.

As a stopgap I applied the fix by @winglian (https://github.com/OpenAccess-AI-Collective/axolotl/pull/400/files#diff-0b142e48f0c0b4bdf2677ce86ee6352c3a5e5a3a9ddf22020a2920f496f74d2eR29). It gets past the error, but then hangs when actually resuming the run.
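
In case it helps, the mechanism that kind of workaround relies on looks roughly like this (my own sketch against the public PyTorch FSDP API, not the axolotl patch itself; it only makes sense once the model has actually been wrapped by FSDP):

# Sketch: explicitly (re)set the state-dict type on the FSDP-wrapped model
# before loading, so set_state_dict_type never receives None.
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import FullStateDictConfig, StateDictType

def ensure_full_state_dict_type(wrapped_model):
    # Only meaningful after accelerator.prepare() has wrapped the model with FSDP.
    FSDP.set_state_dict_type(
        wrapped_model,
        StateDictType.FULL_STATE_DICT,
        state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=True),
    )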

Moreover, I'm also wondering whether it's possible to resume training on a different setup. For example, if the previous FSDP checkpoint was saved from a run distributed over 2 nodes, can I resume it on 3 nodes?

Thanks.

@Yuanhy1997
Author

Yes, at first it checks the FSDP strategy your program is about to use (i.e. fsdp_plugin.state_dict_type). The type then becomes None because the program goes on to check the model's FSDP strategy, and at that point the accelerator has not yet prepared the model with FSDP.

I think it would be an issue if you save with 2 nodes and resume with 3, since the saved rng_state partitions would no longer match the new process layout.

@jmzeng

jmzeng commented Sep 15, 2023

Got it. Is it possible to reset the rng_states to resume on the 3 nodes?

@Yuanhy1997
Author

I have never tried it, but I think there would be a way...

@muellerzr
Contributor

cc @pacman100 for fsdp

@ArthurZucker
Collaborator

Might be relevant to check #26180 as well

@pacman100
Contributor

Fixed in PR #26180
