I cloned the repo and ran the provided training command from here on 1 node with 2 GPUs, and it failed with the stack trace below. I have made no changes to the repo. The machine has 8x A100 GPUs. The same command works fine on a single GPU.
Expected Behavior
The training run should start and proceed on 2 GPUs, as it does on a single GPU.
Current Behavior
Crashes when loading the language model. Full logs here: logs
Excerpt:
Traceback (most recent call last):
  File "/home/fsuser/open_flamingo/open_flamingo/train/train.py", line 484, in <module>
    main()
  File "/home/fsuser/open_flamingo/open_flamingo/train/train.py", line 260, in main
    model, image_processor, tokenizer = create_model_and_transforms(
  File "/home/fsuser/open_flamingo/open_flamingo/src/factory.py", line 57, in create_model_and_transforms
    lang_encoder = AutoModelForCausalLM.from_pretrained(
  File "/home/fsuser/miniconda3/envs/dothings/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 511, in from_pretrained
    return model_class.from_pretrained(
  File "/home/fsuser/miniconda3/envs/dothings/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3084, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/fsuser/miniconda3/envs/dothings/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3525, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for MosaicGPT:
    While copying the parameter named "transformer.wte.weight", whose dimensions in the model are torch.Size([50432, 2048]) and whose dimensions in the checkpoint are torch.Size([50432, 2048]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
    While copying the parameter named "transformer.blocks.0.ln_1.weight", whose dimensions in the model are torch.Size([2048]) and whose dimensions in the checkpoint are torch.Size([2048]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
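For anyone triaging the excerpt above, here is a minimal sketch (plain Python, standard library only) that parses the two error lines and checks whether the model and checkpoint shapes agree for each parameter. In the excerpt they do, which suggests the failure comes from the meta tensors holding no data rather than from a shape mismatch. The helper names `parse_dims` and `summarize` are hypothetical, and the regex assumes the message format exactly as quoted; it is not part of transformers or open_flamingo.

```python
import re

# Matches one "While copying the parameter named ..." line as quoted in the
# excerpt. The format is assumed from the log above, not from a documented API.
LINE_RE = re.compile(
    r'While copying the parameter named "(?P<name>[^"]+)", '
    r'whose dimensions in the model are torch\.Size\(\[(?P<model>[\d, ]*)\]\) '
    r'and whose dimensions in the checkpoint are torch\.Size\(\[(?P<ckpt>[\d, ]*)\]\), '
    r"an exception occurred : \('(?P<reason>[^']*)',\)"
)


def parse_dims(text):
    """Turn a string such as '50432, 2048' into the tuple (50432, 2048)."""
    return tuple(int(part) for part in text.split(",") if part.strip())


def summarize(log_text):
    """Return one dict per failed parameter found in the log text."""
    rows = []
    for match in LINE_RE.finditer(log_text):
        model_shape = parse_dims(match["model"])
        ckpt_shape = parse_dims(match["ckpt"])
        rows.append(
            {
                "name": match["name"],
                "model_shape": model_shape,
                "ckpt_shape": ckpt_shape,
                "shapes_match": model_shape == ckpt_shape,
                "reason": match["reason"],
            }
        )
    return rows


# The two error lines from the excerpt, copied verbatim.
EXCERPT = """
While copying the parameter named "transformer.wte.weight", whose dimensions in the model are torch.Size([50432, 2048]) and whose dimensions in the checkpoint are torch.Size([50432, 2048]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
While copying the parameter named "transformer.blocks.0.ln_1.weight", whose dimensions in the model are torch.Size([2048]) and whose dimensions in the checkpoint are torch.Size([2048]), an exception occurred : ('Cannot copy out of meta tensor; no data!',).
"""

for row in summarize(EXCERPT):
    status = "shapes match" if row["shapes_match"] else "SHAPE MISMATCH"
    print(f'{row["name"]}: {status}; reason: {row["reason"]}')
```

The same `summarize` call can be pointed at the full log text to see whether every failing parameter reports the same reason.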
Steps to Reproduce
Environment