RuntimeError: Unexpected key(s) in state_dict when loading OFAModel #34

Open
aaaaaannie opened this issue Sep 12, 2024 · 5 comments

@aaaaaannie

aaaaaannie commented Sep 12, 2024

Hi, I am encountering an issue when trying to load a pre-trained OFAModel. The error message I receive is as follows:
[Image: screenshot of the error message]

Environment Details:
pip version: 21.2.4
Fairseq version: Installed from the OFA repository

Steps to Reproduce:
Installed Fairseq from the OFA repository.
Configured the environment and downloaded the necessary pre-trained datasets.
Attempted to load the OFAModel using the provided scripts.

@taokz
Owner

taokz commented Sep 13, 2024

Could you let me know which scripts and checkpoints you are using?

@aaaaaannie
Author

> Could you let me know which scripts and checkpoints you are using?

For the scripts, I used 'pretrain_tiny.sh', located in scripts/pretrain/. I made only two modifications to this script:

  1. Changed CUDA_VISIBLE_DEVICES to 0,1,2,3
  2. Set GPUS_PER_NODE to 4

For the checkpoint, I used 'biomedgpt_tiny.pt', which I downloaded from the Dropbox link provided in checkpoints.md (https://www.dropbox.com/sh/cu2r5zkj2r0e6zu/AADZ-KHn-emsICawm9CM4MqVa?dl=0).

These were the key components I utilized for my setup. Let me know if you need any clarification or have additional questions about the configuration.

@taokz
Owner

taokz commented Sep 19, 2024

Could you try installing Fairseq from this repository instead of OFA and re-run the code? Additionally, could you please share the entire error log?

@aaaaaannie
Author

Hi, I've installed Fairseq from your repository, but I still get the same error, shown below:

2024-09-23 02:08:07 - train.py[line:154] - INFO: training on 4 devices (GPUs/TPUs)
2024-09-23 02:08:07 - train.py[line:160] - INFO: max tokens per device = None and max sentences per device = 16
2024-09-23 02:08:07 - trainer.py[line:458] - INFO: Preparing to load checkpoint ../../scripts/biomedgpt_tiny.pt
Traceback (most recent call last):
  File "/mypath/trainer.py", line 519, in load_checkpoint
    state["model"], strict=True, model_cfg=self.cfg.model
  File "/mypath/fairseq/fairseq/distributed/module_proxy_wrapper.py", line 52, in load_state_dict
    return self.module.module.load_state_dict(*args, **kwargs)
  File "/mypath/fairseq/fairseq/models/fairseq_model.py", line 125, in load_state_dict
    return super().load_state_dict(new_state_dict, strict)
  File "/root/miniconda3/envs/biomedgpt/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1668, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for OFAModel:
    Unexpected key(s) in state_dict: "encoder.layers.0.attn_ln.weight", "encoder.layers.0.attn_ln.bias", "encoder.layers.0.ffn_layernorm.weight", "encoder.layers.0.ffn_layernorm.bias", "encoder.layers.0.self_attn.c_attn", "encoder.layers.1.attn_ln.weight", "encoder.layers.1.attn_ln.bias", "encoder.layers.1.ffn_layernorm.weight", "encoder.layers.1.ffn_layernorm.bias", "encoder.layers.1.self_attn.c_attn", "encoder.layers.2.attn_ln.weight", "encoder.layers.2.attn_ln.bias", "encoder.layers.2.ffn_layernorm.weight", "encoder.layers.2.ffn_layernorm.bias", "encoder.layers.2.self_attn.c_attn", "encoder.layers.3.attn_ln.weight", "encoder.layers.3.attn_ln.bias", "encoder.layers.3.ffn_layernorm.weight", "encoder.layers.3.ffn_layernorm.bias", "encoder.layers.3.self_attn.c_attn", "decoder.layers.0.self_attn_ln.weight", "decoder.layers.0.self_attn_ln.bias", "decoder.layers.0.cross_attn_ln.weight", "decoder.layers.0.cross_attn_ln.bias", "decoder.layers.0.ffn_layernorm.weight", "decoder.layers.0.ffn_layernorm.bias", "decoder.layers.0.self_attn.c_attn", "decoder.layers.0.encoder_attn.c_attn", "decoder.layers.1.self_attn_ln.weight", "decoder.layers.1.self_attn_ln.bias", "decoder.layers.1.cross_attn_ln.weight", "decoder.layers.1.cross_attn_ln.bias", "decoder.layers.1.ffn_layernorm.weight", "decoder.layers.1.ffn_layernorm.bias", "decoder.layers.1.self_attn.c_attn", "decoder.layers.1.encoder_attn.c_attn", "decoder.layers.2.self_attn_ln.weight", "decoder.layers.2.self_attn_ln.bias", "decoder.layers.2.cross_attn_ln.weight", "decoder.layers.2.cross_attn_ln.bias", "decoder.layers.2.ffn_layernorm.weight", "decoder.layers.2.ffn_layernorm.bias", "decoder.layers.2.self_attn.c_attn", "decoder.layers.2.encoder_attn.c_attn", "decoder.layers.3.self_attn_ln.weight", "decoder.layers.3.self_attn_ln.bias", "decoder.layers.3.cross_attn_ln.weight", "decoder.layers.3.cross_attn_ln.bias", "decoder.layers.3.ffn_layernorm.weight", "decoder.layers.3.ffn_layernorm.bias", "decoder.layers.3.self_attn.c_attn", "decoder.layers.3.encoder_attn.c_attn".

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "../../train.py", line 537, in <module>
    cli_main()
  File "../../train.py", line 530, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/mypath/fairseq/fairseq/distributed/utils.py", line 374, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/mypath/fairseq/fairseq/distributed/utils.py", line 348, in distributed_main
    main(cfg, **kwargs)
  File "../../train.py", line 170, in main
    disable_iterator_cache=True,
  File "/mypath/utils/checkpoint_utils.py", line 254, in load_checkpoint
    reset_meters=reset_meters,
  File "/mypath/trainer.py", line 533, in load_checkpoint
    "please ensure that the architectures match.".format(filename)
Exception: Cannot load model parameters from checkpoint ../../scripts/biomedgpt_tiny.pt; please ensure that the architectures match.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 239) of binary: /root/miniconda3/envs/biomedgpt/bin/python3
Traceback (most recent call last):
  File "/root/miniconda3/envs/biomedgpt/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/miniconda3/envs/biomedgpt/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/biomedgpt/lib/python3.7/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/root/miniconda3/envs/biomedgpt/lib/python3.7/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/root/miniconda3/envs/biomedgpt/lib/python3.7/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/root/miniconda3/envs/biomedgpt/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
    )(*cmd_args)
  File "/root/miniconda3/envs/biomedgpt/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/biomedgpt/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

../../train.py FAILED

Failures:
[1]:
time : 2024-09-23_02:08:13
host : 7ctiahmo5rmk1-0
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 240)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-09-23_02:08:13
host : 7ctiahmo5rmk1-0
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 241)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-09-23_02:08:13
host : 7ctiahmo5rmk1-0
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 242)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-09-23_02:08:13
host : 7ctiahmo5rmk1-0
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 239)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
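
One way to read the long key list in the log above is to collapse the layer indices so that each distinct parameter name shows up once with a count. The following is a minimal sketch using only the standard library; `summarize_keys` is a hypothetical helper, and the short `unexpected` list is just a sample copied from the error message. With PyTorch available, the full list could presumably be obtained as `set(state["model"]) - set(model.state_dict())` after `torch.load`, but that part is an assumption and is not exercised here.

```python
import re
from collections import Counter


def summarize_keys(keys):
    """Replace per-layer indices with 'N' and count how often each name occurs."""
    layer_index = re.compile(r"\.layers\.\d+\.")
    return Counter(layer_index.sub(".layers.N.", key) for key in keys)


# Sample of the unexpected keys reported in the error message (not the full list).
unexpected = [
    "encoder.layers.0.attn_ln.weight",
    "encoder.layers.1.attn_ln.weight",
    "encoder.layers.0.self_attn.c_attn",
    "decoder.layers.0.cross_attn_ln.bias",
    "decoder.layers.3.cross_attn_ln.bias",
]

summary = summarize_keys(unexpected)
for name, count in sorted(summary.items()):
    print(f"{name}: {count}")
```

When the same few parameter names repeat across every encoder and decoder layer, as they do here, the mismatch more likely comes from a model option that differs between the checkpoint and the training script than from a damaged file.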

@taokz
Owner

taokz commented Oct 24, 2024

@aaaaaannie Apologies, I missed your response earlier. It seems that arch=ofa_tiny might not be set up correctly. Which checkpoint did you download? We currently have three models available publicly, and if you downloaded the base model, you’ll need to set arch=ofa_base instead.
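
To check the architecture question directly, one could compare the architecture name recorded in the checkpoint with the one the script requests. The sketch below assumes the checkpoint, once loaded, is a dict that keeps its model settings under `state["cfg"]["model"]` or, in older files, under `state["args"]`; that layout is an assumption about Fairseq-style checkpoints and should be confirmed against the actual file. `stored_arch`, `check_arch`, and `fake_state` are hypothetical names, and `fake_state` stands in for the result of `torch.load(path, map_location="cpu")`.

```python
from argparse import Namespace


def stored_arch(state):
    """Return the architecture name recorded in a checkpoint dict, or None."""
    cfg = state.get("cfg") or {}
    model_cfg = cfg.get("model") if hasattr(cfg, "get") else None
    for candidate in (model_cfg, state.get("args")):
        if candidate is None:
            continue
        # Settings may be dict-like or attribute-style (e.g. a Namespace).
        if isinstance(candidate, dict):
            arch = candidate.get("arch")
        else:
            arch = getattr(candidate, "arch", None)
        if arch:
            return arch
    return None


def check_arch(state, requested):
    """Describe whether the checkpoint's architecture matches the requested one."""
    found = stored_arch(state)
    if found is None:
        return "checkpoint does not record an architecture"
    if found != requested:
        return f"mismatch: checkpoint has {found!r}, script requests {requested!r}"
    return f"ok: both use {found!r}"


# Made-up stand-in for a loaded checkpoint; the real contents may differ.
fake_state = {"cfg": {"model": Namespace(arch="ofa_base")}, "model": {}}
print(check_arch(fake_state, "ofa_tiny"))
```

If the two names agree and loading still fails, the remaining model options stored alongside `arch` would be the next thing to compare with the script's arguments.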
