Bug report in resume training using Trainer and FSDP #26159

Closed
Yuanhy1997 opened this issue Sep 14, 2023 · 10 comments

Comments

@Yuanhy1997

Yuanhy1997 commented Sep 14, 2023

@pacman100 I'm also running into a similar error with the latest main branch:

File "/home/hyen/.conda/envs/cross/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 608, in set_state_dict_type
    state_dict_config_type = _state_dict_type_to_config[state_dict_type]
KeyError: None

Originally posted by @howard-yen in #25100 (comment)

I met the exact same error when resuming from a checkpoint to continue training with FSDP. It happens inside PyTorch's FSDP checkpoint-loading code, which has to verify that the model receiving the weights uses the same FSDP strategy as the saved checkpoint. But the lines below in src/transformers/trainer.py (in the _inner_training_loop() function, lines 1963-1964)

if (is_sagemaker_mp_enabled() or self.is_fsdp_enabled) and resume_from_checkpoint is not None:
    self._load_from_checkpoint(resume_from_checkpoint, model)

run before the accelerator prepares the model. After I moved them to after accelerator.prepare(model), resuming worked fine for me.

I hope this can be fixed properly, because I don't know whether naively moving these lines below accelerator.prepare would cause any trouble for sagemaker_mp.
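
For clarity, here is a rough sketch of the reordering I tried inside _inner_training_loop (simplified and not the actual transformers source; the surrounding Trainer logic is omitted, and whether this is safe for sagemaker_mp is exactly the open question):

# Sketch only: simplified reordering inside Trainer._inner_training_loop.
# Original order -- the checkpoint is loaded while `model` is still a plain
# nn.Module, so the FSDP state-dict check finds no sharding info and fails:
#
#   if (is_sagemaker_mp_enabled() or self.is_fsdp_enabled) and resume_from_checkpoint is not None:
#       self._load_from_checkpoint(resume_from_checkpoint, model)
#   model = self.accelerator.prepare(model)

# Reordered -- wrap the model with FSDP first, then load the checkpoint:
model = self.accelerator.prepare(model)
if (is_sagemaker_mp_enabled() or self.is_fsdp_enabled) and resume_from_checkpoint is not None:
    self._load_from_checkpoint(resume_from_checkpoint, model)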

@ArthurZucker
Collaborator

Hey! Could you share a reproducer here? (even if it is the same)

@Yuanhy1997
Author

Sorry, I cannot share the logs because I'm running on company machines, but I'm using transformers 4.33.0 and the latest accelerate. The training is distributed across 4 nodes with 8 GPUs each, using FSDP with full sharding and auto wrap. The checkpoint was saved by the Trainer with the saving strategy set to 'steps'.

When I then use the Trainer to resume training, i.e. pass resume_from_checkpoint=True to train(), I hit the error. I think this can be reproduced easily; I figured it out through the modification mentioned above.

The reason for the KeyError on None is that the FSDP parameter-loading function checks whether the sharding strategy of the saved weights matches that of the model it is loading into. In the Trainer, this all happens before the model is prepared with accelerate.
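
A rough reproducer sketch along those lines (the model, dataset, and hyperparameters here are placeholders picked for illustration, not my actual configuration; launch it with torchrun or accelerate launch across the nodes):

# Reproducer sketch with placeholder model/data. The first launch trains from
# scratch and saves checkpoints; the second launch resumes from the latest one,
# which is where the KeyError: None shows up with FSDP.
import os

import torch
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments
from transformers.trainer_utils import get_last_checkpoint

class DummyDataset(Dataset):
    # Tiny synthetic dataset, just enough to drive a few training steps.
    def __len__(self):
        return 256

    def __getitem__(self, idx):
        ids = torch.randint(0, 50257, (128,))
        return {"input_ids": ids, "labels": ids.clone()}

args = TrainingArguments(
    output_dir="./fsdp-resume-test",
    per_device_train_batch_size=1,
    max_steps=50,
    save_strategy="steps",            # checkpoints saved by steps, as above
    save_steps=10,
    fsdp="full_shard auto_wrap",      # full shard + auto wrap, as above
    fsdp_transformer_layer_cls_to_wrap="GPT2Block",  # placeholder layer class
)

trainer = Trainer(
    model=AutoModelForCausalLM.from_pretrained("gpt2"),
    args=args,
    train_dataset=DummyDataset(),
)

resume = os.path.isdir(args.output_dir) and get_last_checkpoint(args.output_dir) is not None
trainer.train(resume_from_checkpoint=True if resume else None)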

@ArthurZucker
Collaborator

also ccing @muellerzr for trainer 😉

@jmzeng

jmzeng commented Sep 15, 2023

Hi, I'm also running into the exact same error; it would be great if there were a permanent fix. I did check the fsdp_plugin.state_dict_type that is passed in through https://github.com/huggingface/accelerate/blob/main/src/accelerate/utils/fsdp_utils.py and then into https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel.set_state_dict_type. It is passed in correctly at first, but after a few iterations it becomes None, which may be causing the issue.

As a stopgap I applied the fix by @winglian (https://github.com/OpenAccess-AI-Collective/axolotl/pull/400/files#diff-0b142e48f0c0b4bdf2677ce86ee6352c3a5e5a3a9ddf22020a2920f496f74d2eR29). It gets past the error, but then hangs when actually resuming the run.
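
In case it helps, the mechanism that kind of workaround relies on looks roughly like this (my own sketch against the public PyTorch FSDP API, not the axolotl patch itself; it only makes sense once the model has actually been wrapped by FSDP):

# Sketch: explicitly (re)set the state-dict type on the FSDP-wrapped model
# before loading, so set_state_dict_type never receives None.
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import FullStateDictConfig, StateDictType

def ensure_full_state_dict_type(wrapped_model):
    # Only meaningful after accelerator.prepare() has wrapped the model with FSDP.
    FSDP.set_state_dict_type(
        wrapped_model,
        StateDictType.FULL_STATE_DICT,
        state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=True),
    )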

Moreover, I'm also wondering whether it's possible to resume training on a different setup. For example, if the previous FSDP checkpoint was saved from a run distributed over 2 nodes, can I resume it on 3 nodes?

Thanks.

@Yuanhy1997
Author

Yes, at first it checks the FSDP strategy your program is about to use (i.e. fsdp_plugin.state_dict_type). The type then becomes None because the program goes on to check the model's FSDP strategy, and at that point the accelerator has not yet prepared the model with FSDP.

I think it would be an issue if you save with 2 nodes and resume with 3, since the saved rng_state partitions would no longer match the new process layout.

@jmzeng

jmzeng commented Sep 15, 2023

Got it. Is it possible to reset the rng_states to resume on the 3 nodes?

@Yuanhy1997
Author

I have never tried it, but I think there would be a way...

@muellerzr
Contributor

cc @pacman100 for fsdp

@ArthurZucker
Collaborator

Might be relevant to check #26180 as well

@pacman100
Contributor

Fixed in PR #26180
