
error in batch_convert_ckpt #32

Open
hanjr92 opened this issue Jun 28, 2024 · 7 comments

hanjr92 commented Jun 28, 2024

When I run bash neo/scripts/batch_convert_ckpt.sh, I get:

received transformer layer 17
received final norm
received output layer
Saving model to disk ...
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/workspace/megatron/tools/checkpoint/saver_llama2_hf_bf.py", line 108, in save_checkpoint
    model = AutoModelForCausalLM.from_pretrained(None, config=llama_conf, state_dict=state_dict, torch_dtype=torch_dtype)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 484, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2881, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3278, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
	size mismatch for model.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
	size mismatch for model.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
	size mismatch for model.layers.1.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
	size mismatch for model.layers.1.self_attn.v_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
	size mismatch for model.layers.2.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
	size mismatch for model.layers.2.self_attn.v_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
	size mismatch for model.layers.3.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
	size mismatch for model.layers.3.self_attn.v_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
	size mismatch for model.layers.4.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
	size mismatch for model.layers.4.self_attn.v_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
	size mismatch for model.layers.5.self_attn.k_proj.weight: copying a param with shape torch.Size([256, 2048]) from checkpoint, the shape in current model is torch.Size([2048, 2048]).
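For context: a k_proj/v_proj weight of shape [256, 2048] in the checkpoint against an expected [2048, 2048] is the signature of grouped-query attention (GQA): the checkpoint stores fewer KV heads than attention heads, while the LlamaConfig the saver builds apparently assumes num_key_value_heads == num_attention_heads. A minimal sketch of the shape relationship, assuming 16 attention heads (a hypothetical value; only the 2048 hidden size and the [256, 2048] shape come from the traceback):

```python
from transformers import LlamaConfig, LlamaForCausalLM

# From the traceback: hidden_size 2048, checkpoint k_proj/v_proj weight [256, 2048],
# while the saver's config expects [2048, 2048] (full multi-head attention).
# With GQA, k_proj weight rows = num_key_value_heads * head_dim, where
# head_dim = hidden_size // num_attention_heads.
cfg = LlamaConfig(
    hidden_size=2048,
    num_attention_heads=16,  # assumption for illustration, not from the traceback
    num_key_value_heads=2,   # 256 // (2048 // 16) == 2
    num_hidden_layers=1,     # keep the sketch small
)
model = LlamaForCausalLM(cfg)
print(model.model.layers[0].self_attn.k_proj.weight.shape)
# torch.Size([256, 2048]); with num_key_value_heads == num_attention_heads
# it would be torch.Size([2048, 2048]), matching the error above
```

If that is what is going on, the fix would be for the saver to set num_key_value_heads from the Megatron checkpoint's GQA settings instead of leaving it at the default.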
                                                                                      

hanjr92 commented Jun 28, 2024

I wish you could provide the 2B HF-version checkpoints.

@Kevinstone-199898

Excuse me, where did you get the checkpoint? From Hugging Face?


hanjr92 commented Jul 1, 2024

> Excuse me, where did you get the checkpoint? From Hugging Face?

Yes, I got the 2B checkpoints from Hugging Face. The checkpoints look like the Megatron version, so I encountered this error when I used the tools to convert them.

@Kevinstone-199898

You just directly ran bash neo/scripts/batch_convert_ckpt.sh without any modification and then encountered this error? It seems that the loader runs correctly and the saver part is wrong.


hanjr92 commented Jul 1, 2024

> You just directly ran bash neo/scripts/batch_convert_ckpt.sh without any modification and then encountered this error? It seems that the loader runs correctly and the saver part is wrong.

Yes, I didn't modify any files. Maybe neo/scripts/batch_convert_ckpt.sh only works on the 7B model?

@Kevinstone-199898

No, I tried to convert the 7B checkpoints, and the error occurred in the loader part.


hanjr92 commented Jul 1, 2024

> No, I tried to convert the 7B checkpoints, and the error occurred in the loader part.

OK, it looks like there are some bugs in it.
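If anyone wants to narrow down which side is at fault, a hypothetical debugging snippet (inserted into saver_llama2_hf_bf.py just before the AutoModelForCausalLM.from_pretrained(...) call that raises) would print the received shapes next to the config the saver built:

```python
# Hypothetical debugging lines for saver_llama2_hf_bf.py; state_dict and
# llama_conf are the variables already passed to from_pretrained in the traceback.
for name, tensor in state_dict.items():
    if "k_proj" in name or "v_proj" in name:
        print(name, tuple(tensor.shape))
print("config:", llama_conf.hidden_size,
      getattr(llama_conf, "num_attention_heads", None),
      getattr(llama_conf, "num_key_value_heads", None))
```

If the state_dict shapes are [256, 2048] but the config reports num_key_value_heads equal to num_attention_heads, the bug is in how the saver constructs llama_conf, not in the loader.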
