
error in batch_convert_ckpt #32

Open
hanjr92 opened this issue Jun 28, 2024 · 7 comments

hanjr92 commented Jun 28, 2024

When I run bash neo/scripts/batch_convert_ckpt.sh, I get:

received transformer layer 17
received final norm
received output layer
Saving model to disk ...
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/megatron/tools/checkpoint/saver_llama2_hf_bf.py", line 108, in save_checkpoint
    model = AutoModelForCausalLM.from_pretrained(None, config=llama_conf, state_dict=state_dict, torch_dtype=torch_dtype)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 484, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2881, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3278, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
	size mismatch for model.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
	size mismatch for model.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
	size mismatch for model.layers.1.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
	size mismatch for model.layers.1.self_attn.v_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
	size mismatch for model.layers.2.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
	size mismatch for model.layers.2.self_attn.v_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
	size mismatch for model.layers.3.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
	size mismatch for model.layers.3.self_attn.v_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
	size mismatch for model.layers.4.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
	size mismatch for model.layers.4.self_attn.v_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
	size mismatch for model.layers.5.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
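For context: a k_proj/v_proj weight of shape [256, 2048] in the checkpoint against an expected [2048, 2048] is the signature of grouped-query attention (GQA): the checkpoint stores fewer KV heads than attention heads, while the LlamaConfig the saver builds apparently assumes num_key_value_heads == num_attention_heads. A minimal sketch of the shape relationship, assuming 16 attention heads (a hypothetical value; only the 2048 hidden size and the [256, 2048] shape come from the traceback):

```python
from transformers import LlamaConfig, LlamaForCausalLM

# From the traceback: hidden_size 2048, checkpoint k_proj/v_proj weight [256, 2048],
# while the saver's config expects [2048, 2048] (full multi-head attention).
# With GQA, k_proj weight rows = num_key_value_heads * head_dim, where
# head_dim = hidden_size // num_attention_heads.
cfg = LlamaConfig(
    hidden_size=2048,
    num_attention_heads=16,  # assumption for illustration, not from the traceback
    num_key_value_heads=2,   # 256 // (2048 // 16) == 2
    num_hidden_layers=1,     # keep the sketch small
)
model = LlamaForCausalLM(cfg)
print(model.model.layers[0].self_attn.k_proj.weight.shape)
# torch.Size([256, 2048]); with num_key_value_heads == num_attention_heads
# it would be torch.Size([2048, 2048]), matching the error above
```

If that is what is going on, the fix would be for the saver to set num_key_value_heads from the Megatron checkpoint's GQA settings instead of leaving it at the default.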
                                                                                      

hanjr92 commented Jun 28, 2024

I wish you could provide the 2B HF-version checkpoints.

@Kevinstone-199898

Excuse me, where did you get the checkpoint? From Hugging Face?


hanjr92 commented Jul 1, 2024

> Excuse me, where did you get the checkpoint? From Hugging Face?

Yes, I got the 2B checkpoints from Hugging Face. The checkpoints look like the Megatron version, so I encountered this error when I used the tools to convert them.

@Kevinstone-199898

You just directly ran bash neo/scripts/batch_convert_ckpt.sh without any modification and then encountered this error? It seems that the loader runs correctly and the saver part is wrong.


hanjr92 commented Jul 1, 2024

> You just directly ran bash neo/scripts/batch_convert_ckpt.sh without any modification and then encountered this error? It seems that the loader runs correctly and the saver part is wrong.

Yes, I didn't modify any files. Maybe neo/scripts/batch_convert_ckpt.sh only works on the 7B model?

@Kevinstone-199898

No, I tried to convert the 7B checkpoints, and the error occurred in the loader part.


hanjr92 commented Jul 1, 2024

> No, I tried to convert the 7B checkpoints, and the error occurred in the loader part.

OK, it looks like there are some bugs in it.
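If anyone wants to narrow down which side is at fault, a hypothetical debugging snippet (inserted into saver_llama2_hf_bf.py just before the AutoModelForCausalLM.from_pretrained(...) call that raises) would print the received shapes next to the config the saver built:

```python
# Hypothetical debugging lines for saver_llama2_hf_bf.py; state_dict and
# llama_conf are the variables already passed to from_pretrained in the traceback.
for name, tensor in state_dict.items():
    if "k_proj" in name or "v_proj" in name:
        print(name, tuple(tensor.shape))
print("config:", llama_conf.hidden_size,
      getattr(llama_conf, "num_attention_heads", None),
      getattr(llama_conf, "num_key_value_heads", None))
```

If the state_dict shapes are [256, 2048] but the config reports num_key_value_heads equal to num_attention_heads, the bug is in how the saver constructs llama_conf, not in the loader.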
