Can NOT resume pretraining from last check-point !!! #559

Closed
thusinh1969 opened this issue Jun 10, 2023 · 2 comments
@thusinh1969

Describe the issue in detail

I trained for 5 steps only, saving a checkpoint at step 2 for testing purposes.
I then removed --overwrite_output_dir and started training again. It loaded almost everything and seemed fine, right before it raised an error about missing state_dict keys.
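For context, a hypothetical sketch of the flags involved, expressed as HF TrainingArguments (the actual run goes through the training script's own argument parser and a DeepSpeed launcher, and all other arguments are omitted here):

    from transformers import TrainingArguments

    # First short test run: 5 optimization steps, checkpoint every 2 steps,
    # starting fresh in the output directory.
    first_run = TrainingArguments(
        output_dir="VN-LLaMA-Lora-V1.0-PreTrain_02",
        max_steps=5,
        save_steps=2,
        overwrite_output_dir=True,
    )

    # Resume attempt: identical, but without overwrite_output_dir, so the
    # script detects checkpoint-2 in output_dir and tries to resume from it.
    resume_run = TrainingArguments(
        output_dir="VN-LLaMA-Lora-V1.0-PreTrain_02",
        max_steps=5,
        save_steps=2,
    )

Log of the failed resume attempt: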

Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00039505958557128906 seconds
[INFO|deepspeed.py:390] 2023-06-10 18:22:44,370 >> Attempting to resume from /content/drive/MyDrive/IMPORTANTS/LLM/Pre-train/VN-LLaMA-Lora-V1.0-PreTrain_02/checkpoint-2
[2023-06-10 18:22:44,707] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /content/drive/MyDrive/IMPORTANTS/LLM/Pre-train/VN-LLaMA-Lora-V1.0-PreTrain_02/checkpoint-2/global_step2/mp_rank_00_model_states.pt...
[2023-06-10 18:24:21,873] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /content/drive/MyDrive/IMPORTANTS/LLM/Pre-train/VN-LLaMA-Lora-V1.0-PreTrain_02/checkpoint-2/global_step2/mp_rank_00_model_states.pt.
[2023-06-10 18:24:22,705] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /content/drive/MyDrive/IMPORTANTS/LLM/Pre-train/VN-LLaMA-Lora-V1.0-PreTrain_02/checkpoint-2/global_step2/mp_rank_00_model_states.pt...
[2023-06-10 18:24:35,849] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /content/drive/MyDrive/IMPORTANTS/LLM/Pre-train/VN-LLaMA-Lora-V1.0-PreTrain_02/checkpoint-2/global_step2/mp_rank_00_model_states.pt.
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /content/drive/MyDrive/IMPORTANTS/LLM/Chinese-LLaMA-Alpaca/scripts/training/ │
│ run_clm_pt_with_peft.py:635 in <module> │
│ │
│ 632 │
│ 633 │
│ 634 if __name__ == "__main__": │
│ ❱ 635 │ main() │
│ 636 │
│ │
│ /content/drive/MyDrive/IMPORTANTS/LLM/Chinese-LLaMA-Alpaca/scripts/training/ │
│ run_clm_pt_with_peft.py:587 in main │
│ │
│ 584 │ │ │ checkpoint = training_args.resume_from_checkpoint │
│ 585 │ │ elif last_checkpoint is not None: │
│ 586 │ │ │ checkpoint = last_checkpoint │
│ ❱ 587 │ │ train_result = trainer.train(resume_from_checkpoint=checkpoint │
│ 588 │ │ trainer.save_model() │
│ 589 │ │ │
│ 590 │ │ metrics = train_result.metrics │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1662 in │
│ train │
│ │
│ 1659 │ │ inner_training_loop = find_executable_batch_size( │
│ 1660 │ │ │ self._inner_training_loop, self._train_batch_size, args.a │
│ 1661 │ │ ) │
│ ❱ 1662 │ │ return inner_training_loop( │
│ 1663 │ │ │ args=args, │
│ 1664 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1665 │ │ │ trial=trial, │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1731 in │
│ _inner_training_loop │
│ │
│ 1728 │ │ │ or self.fsdp is not None │
│ 1729 │ │ ) │
│ 1730 │ │ if args.deepspeed: │
│ ❱ 1731 │ │ │ deepspeed_engine, optimizer, lr_scheduler = deepspeed_ini │
│ 1732 │ │ │ │ self, num_training_steps=max_steps, resume_from_check │
│ 1733 │ │ │ ) │
│ 1734 │ │ │ self.model = deepspeed_engine.module │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:392 in │
│ deepspeed_init │
│ │
│ 389 │ │ if len(deepspeed_checkpoint_dirs) > 0: │
│ 390 │ │ │ logger.info(f"Attempting to resume from {resume_from_check │
│ 391 │ │ │ # this magically updates self.optimizer and self.lr_schedu │
│ ❱ 392 │ │ │ load_path, _ = deepspeed_engine.load_checkpoint( │
│ 393 │ │ │ │ resume_from_checkpoint, load_optimizer_states=True, lo │
│ 394 │ │ │ ) │
│ 395 │ │ │ if load_path is None: │
│ │
│ /usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py:2605 in │
│ load_checkpoint │
│ │
│ 2602 │ │ │ # Prepare for checkpoint load by ensuring all parameters │
│ 2603 │ │ │ self.optimizer.checkpoint_event_prologue() │
│ 2604 │ │ │
│ ❱ 2605 │ │ load_path, client_states = self.load_checkpoint(load_dir, │
│ 2606 │ │ │ │ │ │ │ │ │ │ │ │ │ │ tag, │
│ 2607 │ │ │ │ │ │ │ │ │ │ │ │ │ │ load_module │
│ 2608 │ │ │ │ │ │ │ │ │ │ │ │ │ │ load_optimiz │
│ │
│ /usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py:2664 in │
│ load_checkpoint │
│ │
│ 2661 │ │ │ │ │ │ │ │ │ │ │ │ num_experts=self.num │
│ 2662 │ │ │ │ │ │ │ │ │ │ │ │ checkpoint_engine=sel │
│ 2663 │ │ if not self.load_universal_checkpoint(): │
│ ❱ 2664 │ │ │ self.load_module_state_dict(checkpoint=checkpoint, │
│ 2665 │ │ │ │ │ │ │ │ │ │ strict=load_module_strict, │
│ 2666 │ │ │ │ │ │ │ │ │ │ custom_load_fn=custom_load_fn │
│ 2667 │
│ │
│ /usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py:2468 in │
│ load_module_state_dict │
│ │
│ 2465 │ │ if custom_load_fn: │
│ 2466 │ │ │ custom_load_fn(src=module_state_dict, dst=self.module) │
│ 2467 │ │ else: │
│ ❱ 2468 │ │ │ self.module.load_state_dict( │
│ 2469 │ │ │ │ module_state_dict, # TODO │
│ 2470 │ │ │ │ strict=strict) │
│ 2471 │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1671 in │
│ load_state_dict │
│ │
│ 1668 │ │ │ │ │ │ ', '.join('"{}"'.format(k) for k in missing_k │
│ 1669 │ │ │
│ 1670 │ │ if len(error_msgs) > 0: │
│ ❱ 1671 │ │ │ raise RuntimeError('Error(s) in loading state_dict for {} │
│ 1672 │ │ │ │ │ │ │ self.__class__.__name__, "\n\t".join(e │
│ 1673 │ │ return _IncompatibleKeys(missing_keys, unexpected_keys) │
│ 1674 │
╰──────────────────────────────────────────────────────────────────────────────╯

RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
Missing key(s) in state_dict:
"base_model.model.model.layers.0.self_attn.q_proj.weight",
"base_model.model.model.layers.0.self_attn.k_proj.weight",
"base_model.model.model.layers.0.self_attn.v_proj.weight",
"base_model.model.model.layers.0.self_attn.o_proj.weight",
"base_model.model.model.layers.0.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.0.mlp.gate_proj.weight",
"base_model.model.model.layers.0.mlp.down_proj.weight",
"base_model.model.model.layers.0.mlp.up_proj.weight",
"base_model.model.model.layers.0.input_layernorm.weight",
"base_model.model.model.layers.0.post_attention_layernorm.weight",
"base_model.model.model.layers.1.self_attn.q_proj.weight",
"base_model.model.model.layers.1.self_attn.k_proj.weight",
"base_model.model.model.layers.1.self_attn.v_proj.weight",
"base_model.model.model.layers.1.self_attn.o_proj.weight",
"base_model.model.model.layers.1.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.1.mlp.gate_proj.weight",
"base_model.model.model.layers.1.mlp.down_proj.weight",
"base_model.model.model.layers.1.mlp.up_proj.weight",
"base_model.model.model.layers.1.input_layernorm.weight",
"base_model.model.model.layers.1.post_attention_layernorm.weight",
"base_model.model.model.layers.2.self_attn.q_proj.weight",
"base_model.model.model.layers.2.self_attn.k_proj.weight",
"base_model.model.model.layers.2.self_attn.v_proj.weight",
"base_model.model.model.layers.2.self_attn.o_proj.weight",
"base_model.model.model.layers.2.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.2.mlp.gate_proj.weight",
"base_model.model.model.layers.2.mlp.down_proj.weight",
"base_model.model.model.layers.2.mlp.up_proj.weight",
"base_model.model.model.layers.2.input_layernorm.weight",
"base_model.model.model.layers.2.post_attention_layernorm.weight",
"base_model.model.model.layers.3.self_attn.q_proj.weight",
"base_model.model.model.layers.3.self_attn.k_proj.weight",
"base_model.model.model.layers.3.self_attn.v_proj.weight",
"base_model.model.model.layers.3.self_attn.o_proj.weight",
"base_model.model.model.layers.3.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.3.mlp.gate_proj.weight",
"base_model.model.model.layers.3.mlp.down_proj.weight",
"base_model.model.model.layers.3.mlp.up_proj.weight",
"base_model.model.model.layers.3.input_layernorm.weight",
"base_model.model.model.layers.3.post_attention_layernorm.weight",
"base_model.model.model.layers.4.self_attn.q_proj.weight",
"base_model.model.model.layers.4.self_attn.k_proj.weight",
"base_model.model.model.layers.4.self_attn.v_proj.weight",
"base_model.model.model.layers.4.self_attn.o_proj.weight",
"base_model.model.model.layers.4.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.4.mlp.gate_proj.weight",
"base_model.model.model.layers.4.mlp.down_proj.weight",
"base_model.model.model.layers.4.mlp.up_proj.weight",
"base_model.model.model.layers.4.input_layernorm.weight",
"base_model.model.model.layers.4.post_attention_layernorm.weight",
"base_model.model.model.layers.5.self_attn.q_proj.weight",
"base_model.model.model.layers.5.self_attn.k_proj.weight",
"base_model.model.model.layers.5.self_attn.v_proj.weight",
"base_model.model.model.layers.5.self_attn.o_proj.weight",
"base_model.model.model.layers.5.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.5.mlp.gate_proj.weight",
"base_model.model.model.layers.5.mlp.down_proj.weight",
"base_model.model.model.layers.5.mlp.up_proj.weight",
"base_model.model.model.layers.5.input_layernorm.weight",
"base_model.model.model.layers.5.post_attention_layernorm.weight",
"base_model.model.model.layers.6.self_attn.q_proj.weight",
"base_model.model.model.layers.6.self_attn.k_proj.weight",
"base_model.model.model.layers.6.self_attn.v_proj.weight",
"base_model.model.model.layers.6.self_attn.o_proj.weight",
"base_model.model.model.layers.6.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.6.mlp.gate_proj.weight",
"base_model.model.model.layers.6.mlp.down_proj.weight",
"base_model.model.model.layers.6.mlp.up_proj.weight",
"base_model.model.model.layers.6.input_layernorm.weight",
"base_model.model.model.layers.6.post_attention_layernorm.weight",
"base_model.model.model.layers.7.self_attn.q_proj.weight",
"base_model.model.model.layers.7.self_attn.k_proj.weight",
"base_model.model.model.layers.7.self_attn.v_proj.weight",
"base_model.model.model.layers.7.self_attn.o_proj.weight",
"base_model.model.model.layers.7.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.7.mlp.gate_proj.weight",
"base_model.model.model.layers.7.mlp.down_proj.weight",
"base_model.model.model.layers.7.mlp.up_proj.weight",
"base_model.model.model.layers.7.input_layernorm.weight",
"base_model.model.model.layers.7.post_attention_layernorm.weight",
"base_model.model.model.layers.8.self_attn.q_proj.weight",
"base_model.model.model.layers.8.self_attn.k_proj.weight",
"base_model.model.model.layers.8.self_attn.v_proj.weight",
"base_model.model.model.layers.8.self_attn.o_proj.weight",
"base_model.model.model.layers.8.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.8.mlp.gate_proj.weight",
"base_model.model.model.layers.8.mlp.down_proj.weight",
"base_model.model.model.layers.8.mlp.up_proj.weight",
"base_model.model.model.layers.8.input_layernorm.weight",
"base_model.model.model.layers.8.post_attention_layernorm.weight",
"base_model.model.model.layers.9.self_attn.q_proj.weight",
"base_model.model.model.layers.9.self_attn.k_proj.weight",
"base_model.model.model.layers.9.self_attn.v_proj.weight",
"base_model.model.model.layers.9.self_attn.o_proj.weight",
"base_model.model.model.layers.9.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.9.mlp.gate_proj.weight",
"base_model.model.model.layers.9.mlp.down_proj.weight",
"base_model.model.model.layers.9.mlp.up_proj.weight",
"base_model.model.model.layers.9.input_layernorm.weight",
"base_model.model.model.layers.9.post_attention_layernorm.weight",
"base_model.model.model.layers.10.self_attn.q_proj.weight",
"base_model.model.model.layers.10.self_attn.k_proj.weight",
"base_model.model.model.layers.10.self_attn.v_proj.weight",
"base_model.model.model.layers.10.self_attn.o_proj.weight",
"base_model.model.model.layers.10.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.10.mlp.gate_proj.weight",
"base_model.model.model.layers.10.mlp.down_proj.weight",
"base_model.model.model.layers.10.mlp.up_proj.weight",
"base_model.model.model.layers.10.input_layernorm.weight",
"base_model.model.model.layers.10.post_attention_layernorm.weight",
"base_model.model.model.layers.11.self_attn.q_proj.weight",
"base_model.model.model.layers.11.self_attn.k_proj.weight",
"base_model.model.model.layers.11.self_attn.v_proj.weight",
"base_model.model.model.layers.11.self_attn.o_proj.weight",
"base_model.model.model.layers.11.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.11.mlp.gate_proj.weight",
"base_model.model.model.layers.11.mlp.down_proj.weight",
"base_model.model.model.layers.11.mlp.up_proj.weight",
"base_model.model.model.layers.11.input_layernorm.weight",
"base_model.model.model.layers.11.post_attention_layernorm.weight",
"base_model.model.model.layers.12.self_attn.q_proj.weight",
"base_model.model.model.layers.12.self_attn.k_proj.weight",
"base_model.model.model.layers.12.self_attn.v_proj.weight",
"base_model.model.model.layers.12.self_attn.o_proj.weight",
"base_model.model.model.layers.12.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.12.mlp.gate_proj.weight",
"base_model.model.model.layers.12.mlp.down_proj.weight",
"base_model.model.model.layers.12.mlp.up_proj.weight",
"base_model.model.model.layers.12.input_layernorm.weight",
"base_model.model.model.layers.12.post_attention_layernorm.weight",
"base_model.model.model.layers.13.self_attn.q_proj.weight",
"base_model.model.model.layers.13.self_attn.k_proj.weight",
"base_model.model.model.layers.13.self_attn.v_proj.weight",
"base_model.model.model.layers.13.self_attn.o_proj.weight",
"base_model.model.model.layers.13.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.13.mlp.gate_proj.weight",
"base_model.model.model.layers.13.mlp.down_proj.weight",
"base_model.model.model.layers.13.mlp.up_proj.weight",
"base_model.model.model.layers.13.input_layernorm.weight",
"base_model.model.model.layers.13.post_attention_layernorm.weight",
"base_model.model.model.layers.14.self_attn.q_proj.weight",
"base_model.model.model.layers.14.self_attn.k_proj.weight",
"base_model.model.model.layers.14.self_attn.v_proj.weight",
"base_model.model.model.layers.14.self_attn.o_proj.weight",
"base_model.model.model.layers.14.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.14.mlp.gate_proj.weight",
"base_model.model.model.layers.14.mlp.down_proj.weight",
"base_model.model.model.layers.14.mlp.up_proj.weight",
"base_model.model.model.layers.14.input_layernorm.weight",
"base_model.model.model.layers.14.post_attention_layernorm.weight",
"base_model.model.model.layers.15.self_attn.q_proj.weight",
"base_model.model.model.layers.15.self_attn.k_proj.weight",
"base_model.model.model.layers.15.self_attn.v_proj.weight",
"base_model.model.model.layers.15.self_attn.o_proj.weight",
"base_model.model.model.layers.15.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.15.mlp.gate_proj.weight",
"base_model.model.model.layers.15.mlp.down_proj.weight",
"base_model.model.model.layers.15.mlp.up_proj.weight",
"base_model.model.model.layers.15.input_layernorm.weight",
"base_model.model.model.layers.15.post_attention_layernorm.weight",
"base_model.model.model.layers.16.self_attn.q_proj.weight",
"base_model.model.model.layers.16.self_attn.k_proj.weight",
"base_model.model.model.layers.16.self_attn.v_proj.weight",
"base_model.model.model.layers.16.self_attn.o_proj.weight",
"base_model.model.model.layers.16.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.16.mlp.gate_proj.weight",
"base_model.model.model.layers.16.mlp.down_proj.weight",
"base_model.model.model.layers.16.mlp.up_proj.weight",
"base_model.model.model.layers.16.input_layernorm.weight",
"base_model.model.model.layers.16.post_attention_layernorm.weight",
"base_model.model.model.layers.17.self_attn.q_proj.weight",
"base_model.model.model.layers.17.self_attn.k_proj.weight",
"base_model.model.model.layers.17.self_attn.v_proj.weight",
"base_model.model.model.layers.17.self_attn.o_proj.weight",
"base_model.model.model.layers.17.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.17.mlp.gate_proj.weight",
"base_model.model.model.layers.17.mlp.down_proj.weight",
"base_model.model.model.layers.17.mlp.up_proj.weight",
"base_model.model.model.layers.17.input_layernorm.weight",
"base_model.model.model.layers.17.post_attention_layernorm.weight",
"base_model.model.model.layers.18.self_attn.q_proj.weight",
"base_model.model.model.layers.18.self_attn.k_proj.weight",
"base_model.model.model.layers.18.self_attn.v_proj.weight",
"base_model.model.model.layers.18.self_attn.o_proj.weight",
"base_model.model.model.layers.18.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.18.mlp.gate_proj.weight",
"base_model.model.model.layers.18.mlp.down_proj.weight",
"base_model.model.model.layers.18.mlp.up_proj.weight",
"base_model.model.model.layers.18.input_layernorm.weight",
"base_model.model.model.layers.18.post_attention_layernorm.weight",
"base_model.model.model.layers.19.self_attn.q_proj.weight",
"base_model.model.model.layers.19.self_attn.k_proj.weight",
"base_model.model.model.layers.19.self_attn.v_proj.weight",
"base_model.model.model.layers.19.self_attn.o_proj.weight",
"base_model.model.model.layers.19.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.19.mlp.gate_proj.weight",
"base_model.model.model.layers.19.mlp.down_proj.weight",
"base_model.model.model.layers.19.mlp.up_proj.weight",
"base_model.model.model.layers.19.input_layernorm.weight",
"base_model.model.model.layers.19.post_attention_layernorm.weight",
"base_model.model.model.layers.20.self_attn.q_proj.weight",
"base_model.model.model.layers.20.self_attn.k_proj.weight",
"base_model.model.model.layers.20.self_attn.v_proj.weight",
"base_model.model.model.layers.20.self_attn.o_proj.weight",
"base_model.model.model.layers.20.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.20.mlp.gate_proj.weight",
"base_model.model.model.layers.20.mlp.down_proj.weight",
"base_model.model.model.layers.20.mlp.up_proj.weight",
"base_model.model.model.layers.20.input_layernorm.weight",
"base_model.model.model.layers.20.post_attention_layernorm.weight",
"base_model.model.model.layers.21.self_attn.q_proj.weight",
"base_model.model.model.layers.21.self_attn.k_proj.weight",
"base_model.model.model.layers.21.self_attn.v_proj.weight",
"base_model.model.model.layers.21.self_attn.o_proj.weight",
"base_model.model.model.layers.21.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.21.mlp.gate_proj.weight",
"base_model.model.model.layers.21.mlp.down_proj.weight",
"base_model.model.model.layers.21.mlp.up_proj.weight",
"base_model.model.model.layers.21.input_layernorm.weight",
"base_model.model.model.layers.21.post_attention_layernorm.weight",
"base_model.model.model.layers.22.self_attn.q_proj.weight",
"base_model.model.model.layers.22.self_attn.k_proj.weight",
"base_model.model.model.layers.22.self_attn.v_proj.weight",
"base_model.model.model.layers.22.self_attn.o_proj.weight",
"base_model.model.model.layers.22.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.22.mlp.gate_proj.weight",
"base_model.model.model.layers.22.mlp.down_proj.weight",
"base_model.model.model.layers.22.mlp.up_proj.weight",
"base_model.model.model.layers.22.input_layernorm.weight",
"base_model.model.model.layers.22.post_attention_layernorm.weight",
"base_model.model.model.layers.23.self_attn.q_proj.weight",
"base_model.model.model.layers.23.self_attn.k_proj.weight",
"base_model.model.model.layers.23.self_attn.v_proj.weight",
"base_model.model.model.layers.23.self_attn.o_proj.weight",
"base_model.model.model.layers.23.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.23.mlp.gate_proj.weight",
"base_model.model.model.layers.23.mlp.down_proj.weight",
"base_model.model.model.layers.23.mlp.up_proj.weight",
"base_model.model.model.layers.23.input_layernorm.weight",
"base_model.model.model.layers.23.post_attention_layernorm.weight",
"base_model.model.model.layers.24.self_attn.q_proj.weight",
"base_model.model.model.layers.24.self_attn.k_proj.weight",
"base_model.model.model.layers.24.self_attn.v_proj.weight",
"base_model.model.model.layers.24.self_attn.o_proj.weight",
"base_model.model.model.layers.24.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.24.mlp.gate_proj.weight",
"base_model.model.model.layers.24.mlp.down_proj.weight",
"base_model.model.model.layers.24.mlp.up_proj.weight",
"base_model.model.model.layers.24.input_layernorm.weight",
"base_model.model.model.layers.24.post_attention_layernorm.weight",
"base_model.model.model.layers.25.self_attn.q_proj.weight",
"base_model.model.model.layers.25.self_attn.k_proj.weight",
"base_model.model.model.layers.25.self_attn.v_proj.weight",
"base_model.model.model.layers.25.self_attn.o_proj.weight",
"base_model.model.model.layers.25.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.25.mlp.gate_proj.weight",
"base_model.model.model.layers.25.mlp.down_proj.weight",
"base_model.model.model.layers.25.mlp.up_proj.weight",
"base_model.model.model.layers.25.input_layernorm.weight",
"base_model.model.model.layers.25.post_attention_layernorm.weight",
"base_model.model.model.layers.26.self_attn.q_proj.weight",
"base_model.model.model.layers.26.self_attn.k_proj.weight",
"base_model.model.model.layers.26.self_attn.v_proj.weight",
"base_model.model.model.layers.26.self_attn.o_proj.weight",
"base_model.model.model.layers.26.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.26.mlp.gate_proj.weight",
"base_model.model.model.layers.26.mlp.down_proj.weight",
"base_model.model.model.layers.26.mlp.up_proj.weight",
"base_model.model.model.layers.26.input_layernorm.weight",
"base_model.model.model.layers.26.post_attention_layernorm.weight",
"base_model.model.model.layers.27.self_attn.q_proj.weight",
"base_model.model.model.layers.27.self_attn.k_proj.weight",
"base_model.model.model.layers.27.self_attn.v_proj.weight",
"base_model.model.model.layers.27.self_attn.o_proj.weight",
"base_model.model.model.layers.27.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.27.mlp.gate_proj.weight",
"base_model.model.model.layers.27.mlp.down_proj.weight",
"base_model.model.model.layers.27.mlp.up_proj.weight",
"base_model.model.model.layers.27.input_layernorm.weight",
"base_model.model.model.layers.27.post_attention_layernorm.weight",
"base_model.model.model.layers.28.self_attn.q_proj.weight",
"base_model.model.model.layers.28.self_attn.k_proj.weight",
"base_model.model.model.layers.28.self_attn.v_proj.weight",
"base_model.model.model.layers.28.self_attn.o_proj.weight",
"base_model.model.model.layers.28.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.28.mlp.gate_proj.weight",
"base_model.model.model.layers.28.mlp.down_proj.weight",
"base_model.model.model.layers.28.mlp.up_proj.weight",
"base_model.model.model.layers.28.input_layernorm.weight",
"base_model.model.model.layers.28.post_attention_layernorm.weight",
"base_model.model.model.layers.29.self_attn.q_proj.weight",
"base_model.model.model.layers.29.self_attn.k_proj.weight",
"base_model.model.model.layers.29.self_attn.v_proj.weight",
"base_model.model.model.layers.29.self_attn.o_proj.weight",
"base_model.model.model.layers.29.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.29.mlp.gate_proj.weight",
"base_model.model.model.layers.29.mlp.down_proj.weight",
"base_model.model.model.layers.29.mlp.up_proj.weight",
"base_model.model.model.layers.29.input_layernorm.weight",
"base_model.model.model.layers.29.post_attention_layernorm.weight",
"base_model.model.model.layers.30.self_attn.q_proj.weight",
"base_model.model.model.layers.30.self_attn.k_proj.weight",
"base_model.model.model.layers.30.self_attn.v_proj.weight",
"base_model.model.model.layers.30.self_attn.o_proj.weight",
"base_model.model.model.layers.30.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.30.mlp.gate_proj.weight",
"base_model.model.model.layers.30.mlp.down_proj.weight",
"base_model.model.model.layers.30.mlp.up_proj.weight",
"base_model.model.model.layers.30.input_layernorm.weight",
"base_model.model.model.layers.30.post_attention_layernorm.weight",
"base_model.model.model.layers.31.self_attn.q_proj.weight",
"base_model.model.model.layers.31.self_attn.k_proj.weight",
"base_model.model.model.layers.31.self_attn.v_proj.weight",
"base_model.model.model.layers.31.self_attn.o_proj.weight",
"base_model.model.model.layers.31.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.31.mlp.gate_proj.weight",
"base_model.model.model.layers.31.mlp.down_proj.weight",
"base_model.model.model.layers.31.mlp.up_proj.weight",
"base_model.model.model.layers.31.input_layernorm.weight",
"base_model.model.model.layers.31.post_attention_layernorm.weight",
"base_model.model.model.norm.weight".

I have tried this every time with the same result. Same dataset, same script; I was even able to merge into the 7B base model after training, so the script itself works, but checkpoint saving or resuming must be going wrong somewhere.

Any idea?
Steve

@airaria
Contributor

airaria commented Jun 11, 2023

I think you have already checked #464.
Here is the detailed explanation:
Since only the LoRA part (plus embed_tokens and lm_head) of the full model is saved to the checkpoint, we must allow missing keys when resuming training and loading that checkpoint.
When using Transformers with DeepSpeed, Transformers does not expose any parameter to allow missing keys (see the deepspeed_init frame from transformers/deepspeed.py in the traceback above).
Therefore we have to modify the source code.
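As a minimal illustration (a generic PyTorch sketch, not this project's code): loading a state_dict that covers only part of a module fails with exactly this kind of "Missing key(s)" RuntimeError unless strict=False is passed, which is the switch that both fixes below flip in one place or another.

    import torch
    import torch.nn as nn

    # Toy "full" model and a partial state_dict covering only its first layer,
    # mimicking a checkpoint that stores only the trainable (LoRA) weights.
    full = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
    partial = {"0.weight": torch.zeros(4, 4), "0.bias": torch.zeros(4)}

    try:
        full.load_state_dict(partial)              # strict=True by default
    except RuntimeError as err:
        print(err)                                 # Missing key(s) in state_dict: "1.weight", "1.bias"

    # With strict=False the missing keys are tolerated and merely reported.
    missing, unexpected = full.load_state_dict(partial, strict=False)
    print(missing)                                 # ['1.weight', '1.bias']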

There are two ways to fix the problem:

  1. Modify DeepSpeed:
    Continue training from the last saved checkpoint #464 (comment) (a sketch of this kind of change is shown below, after option 2).
  2. Modify Transformers:
    Change the original resume-from-checkpoint-with-DeepSpeed code (the deepspeed_init call in transformers/deepspeed.py shown in the traceback) to
          load_path, _ = deepspeed_engine.load_checkpoint(
              resume_from_checkpoint, load_optimizer_states=True, load_lr_scheduler_states=True,
              load_module_strict=False
          )
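For option 1, a sketch of what the DeepSpeed-side change could look like, based on the load_module_state_dict frame at deepspeed/runtime/engine.py:2468 in the traceback above (the actual patch discussed in #464 may differ):

    # deepspeed/runtime/engine.py, load_module_state_dict (around line 2468)
    if custom_load_fn:
        custom_load_fn(src=module_state_dict, dst=self.module)
    else:
        self.module.load_state_dict(
            module_state_dict,
            strict=False)  # was strict=strict; tolerate base-model keys missing from the LoRA-only checkpoint

Option 2 may be the less invasive of the two, since load_module_strict is an existing argument of deepspeed_engine.load_checkpoint rather than a change inside DeepSpeed itself.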

If you find a more convenient method, you are welcome to share it with us.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.
