Can NOT resume pretraining from last check-point !!! #559

Closed
thusinh1969 opened this issue Jun 10, 2023 · 2 comments
@thusinh1969

Describe the issue in detail

I trained for 5 steps only, saving a checkpoint at step 2 for testing purposes.
I then removed --overwrite_output_dir and started training again. It loaded almost everything and seemed fine, right before it raised an error about missing state_dict keys.
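For context, a hypothetical sketch of the flags involved, expressed as HF TrainingArguments (the actual run goes through the training script's own argument parser and a DeepSpeed launcher, and all other arguments are omitted here):

    from transformers import TrainingArguments

    # First short test run: 5 optimization steps, checkpoint every 2 steps,
    # starting fresh in the output directory.
    first_run = TrainingArguments(
        output_dir="VN-LLaMA-Lora-V1.0-PreTrain_02",
        max_steps=5,
        save_steps=2,
        overwrite_output_dir=True,
    )

    # Resume attempt: identical, but without overwrite_output_dir, so the
    # script detects checkpoint-2 in output_dir and tries to resume from it.
    resume_run = TrainingArguments(
        output_dir="VN-LLaMA-Lora-V1.0-PreTrain_02",
        max_steps=5,
        save_steps=2,
    )

Log of the failed resume attempt: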

Using /root/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00039505958557128906 seconds
[INFO|deepspeed.py:390] 2023-06-10 18:22:44,370 >> Attempting to resume from /content/drive/MyDrive/IMPORTANTS/LLM/Pre-train/VN-LLaMA-Lora-V1.0-PreTrain_02/checkpoint-2
[2023-06-10 18:22:44,707] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /content/drive/MyDrive/IMPORTANTS/LLM/Pre-train/VN-LLaMA-Lora-V1.0-PreTrain_02/checkpoint-2/global_step2/mp_rank_00_model_states.pt...
[2023-06-10 18:24:21,873] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /content/drive/MyDrive/IMPORTANTS/LLM/Pre-train/VN-LLaMA-Lora-V1.0-PreTrain_02/checkpoint-2/global_step2/mp_rank_00_model_states.pt.
[2023-06-10 18:24:22,705] [INFO] [torch_checkpoint_engine.py:27:load] [Torch] Loading checkpoint from /content/drive/MyDrive/IMPORTANTS/LLM/Pre-train/VN-LLaMA-Lora-V1.0-PreTrain_02/checkpoint-2/global_step2/mp_rank_00_model_states.pt...
[2023-06-10 18:24:35,849] [INFO] [torch_checkpoint_engine.py:29:load] [Torch] Loaded checkpoint from /content/drive/MyDrive/IMPORTANTS/LLM/Pre-train/VN-LLaMA-Lora-V1.0-PreTrain_02/checkpoint-2/global_step2/mp_rank_00_model_states.pt.
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /content/drive/MyDrive/IMPORTANTS/LLM/Chinese-LLaMA-Alpaca/scripts/training/ │
│ run_clm_pt_with_peft.py:635 in <module> │
│ │
│ 632 │
│ 633 │
│ 634 if __name__ == "__main__": │
│ ❱ 635 │ main() │
│ 636 │
│ │
│ /content/drive/MyDrive/IMPORTANTS/LLM/Chinese-LLaMA-Alpaca/scripts/training/ │
│ run_clm_pt_with_peft.py:587 in main │
│ │
│ 584 │ │ │ checkpoint = training_args.resume_from_checkpoint │
│ 585 │ │ elif last_checkpoint is not None: │
│ 586 │ │ │ checkpoint = last_checkpoint │
│ ❱ 587 │ │ train_result = trainer.train(resume_from_checkpoint=checkpoint │
│ 588 │ │ trainer.save_model() │
│ 589 │ │ │
│ 590 │ │ metrics = train_result.metrics │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1662 in │
│ train │
│ │
│ 1659 │ │ inner_training_loop = find_executable_batch_size( │
│ 1660 │ │ │ self._inner_training_loop, self._train_batch_size, args.a │
│ 1661 │ │ ) │
│ ❱ 1662 │ │ return inner_training_loop( │
│ 1663 │ │ │ args=args, │
│ 1664 │ │ │ resume_from_checkpoint=resume_from_checkpoint, │
│ 1665 │ │ │ trial=trial, │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1731 in │
│ _inner_training_loop │
│ │
│ 1728 │ │ │ or self.fsdp is not None │
│ 1729 │ │ ) │
│ 1730 │ │ if args.deepspeed: │
│ ❱ 1731 │ │ │ deepspeed_engine, optimizer, lr_scheduler = deepspeed_ini │
│ 1732 │ │ │ │ self, num_training_steps=max_steps, resume_from_check │
│ 1733 │ │ │ ) │
│ 1734 │ │ │ self.model = deepspeed_engine.module │
│ │
│ /usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:392 in │
│ deepspeed_init │
│ │
│ 389 │ │ if len(deepspeed_checkpoint_dirs) > 0: │
│ 390 │ │ │ logger.info(f"Attempting to resume from {resume_from_check │
│ 391 │ │ │ # this magically updates self.optimizer and self.lr_schedu │
│ ❱ 392 │ │ │ load_path, _ = deepspeed_engine.load_checkpoint( │
│ 393 │ │ │ │ resume_from_checkpoint, load_optimizer_states=True, lo │
│ 394 │ │ │ ) │
│ 395 │ │ │ if load_path is None: │
│ │
│ /usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py:2605 in │
│ load_checkpoint │
│ │
│ 2602 │ │ │ # Prepare for checkpoint load by ensuring all parameters │
│ 2603 │ │ │ self.optimizer.checkpoint_event_prologue() │
│ 2604 │ │ │
│ ❱ 2605 │ │ load_path, client_states = self.load_checkpoint(load_dir, │
│ 2606 │ │ │ │ │ │ │ │ │ │ │ │ │ │ tag, │
│ 2607 │ │ │ │ │ │ │ │ │ │ │ │ │ │ load_module │
│ 2608 │ │ │ │ │ │ │ │ │ │ │ │ │ │ load_optimiz │
│ │
│ /usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py:2664 in │
│ load_checkpoint │
│ │
│ 2661 │ │ │ │ │ │ │ │ │ │ │ │ num_experts=self.num │
│ 2662 │ │ │ │ │ │ │ │ │ │ │ │ checkpoint_engine=sel │
│ 2663 │ │ if not self.load_universal_checkpoint(): │
│ ❱ 2664 │ │ │ self.load_module_state_dict(checkpoint=checkpoint, │
│ 2665 │ │ │ │ │ │ │ │ │ │ strict=load_module_strict, │
│ 2666 │ │ │ │ │ │ │ │ │ │ custom_load_fn=custom_load_fn │
│ 2667 │
│ │
│ /usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py:2468 in │
│ load_module_state_dict │
│ │
│ 2465 │ │ if custom_load_fn: │
│ 2466 │ │ │ custom_load_fn(src=module_state_dict, dst=self.module) │
│ 2467 │ │ else: │
│ ❱ 2468 │ │ │ self.module.load_state_dict( │
│ 2469 │ │ │ │ module_state_dict, # TODO │
│ 2470 │ │ │ │ strict=strict) │
│ 2471 │
│ │
│ /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1671 in │
│ load_state_dict │
│ │
│ 1668 │ │ │ │ │ │ ', '.join('"{}"'.format(k) for k in missing_k │
│ 1669 │ │ │
│ 1670 │ │ if len(error_msgs) > 0: │
│ ❱ 1671 │ │ │ raise RuntimeError('Error(s) in loading state_dict for {} │
│ 1672 │ │ │ │ │ │ │ self.__class__.__name__, "\n\t".join(e │
│ 1673 │ │ return _IncompatibleKeys(missing_keys, unexpected_keys) │
│ 1674 │
╰──────────────────────────────────────────────────────────────────────────────╯

RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
Missing key(s) in state_dict:
"base_model.model.model.layers.0.self_attn.q_proj.weight",
"base_model.model.model.layers.0.self_attn.k_proj.weight",
"base_model.model.model.layers.0.self_attn.v_proj.weight",
"base_model.model.model.layers.0.self_attn.o_proj.weight",
"base_model.model.model.layers.0.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.0.mlp.gate_proj.weight",
"base_model.model.model.layers.0.mlp.down_proj.weight",
"base_model.model.model.layers.0.mlp.up_proj.weight",
"base_model.model.model.layers.0.input_layernorm.weight",
"base_model.model.model.layers.0.post_attention_layernorm.weight",
"base_model.model.model.layers.1.self_attn.q_proj.weight",
"base_model.model.model.layers.1.self_attn.k_proj.weight",
"base_model.model.model.layers.1.self_attn.v_proj.weight",
"base_model.model.model.layers.1.self_attn.o_proj.weight",
"base_model.model.model.layers.1.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.1.mlp.gate_proj.weight",
"base_model.model.model.layers.1.mlp.down_proj.weight",
"base_model.model.model.layers.1.mlp.up_proj.weight",
"base_model.model.model.layers.1.input_layernorm.weight",
"base_model.model.model.layers.1.post_attention_layernorm.weight",
"base_model.model.model.layers.2.self_attn.q_proj.weight",
"base_model.model.model.layers.2.self_attn.k_proj.weight",
"base_model.model.model.layers.2.self_attn.v_proj.weight",
"base_model.model.model.layers.2.self_attn.o_proj.weight",
"base_model.model.model.layers.2.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.2.mlp.gate_proj.weight",
"base_model.model.model.layers.2.mlp.down_proj.weight",
"base_model.model.model.layers.2.mlp.up_proj.weight",
"base_model.model.model.layers.2.input_layernorm.weight",
"base_model.model.model.layers.2.post_attention_layernorm.weight",
"base_model.model.model.layers.3.self_attn.q_proj.weight",
"base_model.model.model.layers.3.self_attn.k_proj.weight",
"base_model.model.model.layers.3.self_attn.v_proj.weight",
"base_model.model.model.layers.3.self_attn.o_proj.weight",
"base_model.model.model.layers.3.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.3.mlp.gate_proj.weight",
"base_model.model.model.layers.3.mlp.down_proj.weight",
"base_model.model.model.layers.3.mlp.up_proj.weight",
"base_model.model.model.layers.3.input_layernorm.weight",
"base_model.model.model.layers.3.post_attention_layernorm.weight",
"base_model.model.model.layers.4.self_attn.q_proj.weight",
"base_model.model.model.layers.4.self_attn.k_proj.weight",
"base_model.model.model.layers.4.self_attn.v_proj.weight",
"base_model.model.model.layers.4.self_attn.o_proj.weight",
"base_model.model.model.layers.4.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.4.mlp.gate_proj.weight",
"base_model.model.model.layers.4.mlp.down_proj.weight",
"base_model.model.model.layers.4.mlp.up_proj.weight",
"base_model.model.model.layers.4.input_layernorm.weight",
"base_model.model.model.layers.4.post_attention_layernorm.weight",
"base_model.model.model.layers.5.self_attn.q_proj.weight",
"base_model.model.model.layers.5.self_attn.k_proj.weight",
"base_model.model.model.layers.5.self_attn.v_proj.weight",
"base_model.model.model.layers.5.self_attn.o_proj.weight",
"base_model.model.model.layers.5.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.5.mlp.gate_proj.weight",
"base_model.model.model.layers.5.mlp.down_proj.weight",
"base_model.model.model.layers.5.mlp.up_proj.weight",
"base_model.model.model.layers.5.input_layernorm.weight",
"base_model.model.model.layers.5.post_attention_layernorm.weight",
"base_model.model.model.layers.6.self_attn.q_proj.weight",
"base_model.model.model.layers.6.self_attn.k_proj.weight",
"base_model.model.model.layers.6.self_attn.v_proj.weight",
"base_model.model.model.layers.6.self_attn.o_proj.weight",
"base_model.model.model.layers.6.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.6.mlp.gate_proj.weight",
"base_model.model.model.layers.6.mlp.down_proj.weight",
"base_model.model.model.layers.6.mlp.up_proj.weight",
"base_model.model.model.layers.6.input_layernorm.weight",
"base_model.model.model.layers.6.post_attention_layernorm.weight",
"base_model.model.model.layers.7.self_attn.q_proj.weight",
"base_model.model.model.layers.7.self_attn.k_proj.weight",
"base_model.model.model.layers.7.self_attn.v_proj.weight",
"base_model.model.model.layers.7.self_attn.o_proj.weight",
"base_model.model.model.layers.7.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.7.mlp.gate_proj.weight",
"base_model.model.model.layers.7.mlp.down_proj.weight",
"base_model.model.model.layers.7.mlp.up_proj.weight",
"base_model.model.model.layers.7.input_layernorm.weight",
"base_model.model.model.layers.7.post_attention_layernorm.weight",
"base_model.model.model.layers.8.self_attn.q_proj.weight",
"base_model.model.model.layers.8.self_attn.k_proj.weight",
"base_model.model.model.layers.8.self_attn.v_proj.weight",
"base_model.model.model.layers.8.self_attn.o_proj.weight",
"base_model.model.model.layers.8.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.8.mlp.gate_proj.weight",
"base_model.model.model.layers.8.mlp.down_proj.weight",
"base_model.model.model.layers.8.mlp.up_proj.weight",
"base_model.model.model.layers.8.input_layernorm.weight",
"base_model.model.model.layers.8.post_attention_layernorm.weight",
"base_model.model.model.layers.9.self_attn.q_proj.weight",
"base_model.model.model.layers.9.self_attn.k_proj.weight",
"base_model.model.model.layers.9.self_attn.v_proj.weight",
"base_model.model.model.layers.9.self_attn.o_proj.weight",
"base_model.model.model.layers.9.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.9.mlp.gate_proj.weight",
"base_model.model.model.layers.9.mlp.down_proj.weight",
"base_model.model.model.layers.9.mlp.up_proj.weight",
"base_model.model.model.layers.9.input_layernorm.weight",
"base_model.model.model.layers.9.post_attention_layernorm.weight",
"base_model.model.model.layers.10.self_attn.q_proj.weight",
"base_model.model.model.layers.10.self_attn.k_proj.weight",
"base_model.model.model.layers.10.self_attn.v_proj.weight",
"base_model.model.model.layers.10.self_attn.o_proj.weight",
"base_model.model.model.layers.10.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.10.mlp.gate_proj.weight",
"base_model.model.model.layers.10.mlp.down_proj.weight",
"base_model.model.model.layers.10.mlp.up_proj.weight",
"base_model.model.model.layers.10.input_layernorm.weight",
"base_model.model.model.layers.10.post_attention_layernorm.weight",
"base_model.model.model.layers.11.self_attn.q_proj.weight",
"base_model.model.model.layers.11.self_attn.k_proj.weight",
"base_model.model.model.layers.11.self_attn.v_proj.weight",
"base_model.model.model.layers.11.self_attn.o_proj.weight",
"base_model.model.model.layers.11.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.11.mlp.gate_proj.weight",
"base_model.model.model.layers.11.mlp.down_proj.weight",
"base_model.model.model.layers.11.mlp.up_proj.weight",
"base_model.model.model.layers.11.input_layernorm.weight",
"base_model.model.model.layers.11.post_attention_layernorm.weight",
"base_model.model.model.layers.12.self_attn.q_proj.weight",
"base_model.model.model.layers.12.self_attn.k_proj.weight",
"base_model.model.model.layers.12.self_attn.v_proj.weight",
"base_model.model.model.layers.12.self_attn.o_proj.weight",
"base_model.model.model.layers.12.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.12.mlp.gate_proj.weight",
"base_model.model.model.layers.12.mlp.down_proj.weight",
"base_model.model.model.layers.12.mlp.up_proj.weight",
"base_model.model.model.layers.12.input_layernorm.weight",
"base_model.model.model.layers.12.post_attention_layernorm.weight",
"base_model.model.model.layers.13.self_attn.q_proj.weight",
"base_model.model.model.layers.13.self_attn.k_proj.weight",
"base_model.model.model.layers.13.self_attn.v_proj.weight",
"base_model.model.model.layers.13.self_attn.o_proj.weight",
"base_model.model.model.layers.13.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.13.mlp.gate_proj.weight",
"base_model.model.model.layers.13.mlp.down_proj.weight",
"base_model.model.model.layers.13.mlp.up_proj.weight",
"base_model.model.model.layers.13.input_layernorm.weight",
"base_model.model.model.layers.13.post_attention_layernorm.weight",
"base_model.model.model.layers.14.self_attn.q_proj.weight",
"base_model.model.model.layers.14.self_attn.k_proj.weight",
"base_model.model.model.layers.14.self_attn.v_proj.weight",
"base_model.model.model.layers.14.self_attn.o_proj.weight",
"base_model.model.model.layers.14.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.14.mlp.gate_proj.weight",
"base_model.model.model.layers.14.mlp.down_proj.weight",
"base_model.model.model.layers.14.mlp.up_proj.weight",
"base_model.model.model.layers.14.input_layernorm.weight",
"base_model.model.model.layers.14.post_attention_layernorm.weight",
"base_model.model.model.layers.15.self_attn.q_proj.weight",
"base_model.model.model.layers.15.self_attn.k_proj.weight",
"base_model.model.model.layers.15.self_attn.v_proj.weight",
"base_model.model.model.layers.15.self_attn.o_proj.weight",
"base_model.model.model.layers.15.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.15.mlp.gate_proj.weight",
"base_model.model.model.layers.15.mlp.down_proj.weight",
"base_model.model.model.layers.15.mlp.up_proj.weight",
"base_model.model.model.layers.15.input_layernorm.weight",
"base_model.model.model.layers.15.post_attention_layernorm.weight",
"base_model.model.model.layers.16.self_attn.q_proj.weight",
"base_model.model.model.layers.16.self_attn.k_proj.weight",
"base_model.model.model.layers.16.self_attn.v_proj.weight",
"base_model.model.model.layers.16.self_attn.o_proj.weight",
"base_model.model.model.layers.16.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.16.mlp.gate_proj.weight",
"base_model.model.model.layers.16.mlp.down_proj.weight",
"base_model.model.model.layers.16.mlp.up_proj.weight",
"base_model.model.model.layers.16.input_layernorm.weight",
"base_model.model.model.layers.16.post_attention_layernorm.weight",
"base_model.model.model.layers.17.self_attn.q_proj.weight",
"base_model.model.model.layers.17.self_attn.k_proj.weight",
"base_model.model.model.layers.17.self_attn.v_proj.weight",
"base_model.model.model.layers.17.self_attn.o_proj.weight",
"base_model.model.model.layers.17.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.17.mlp.gate_proj.weight",
"base_model.model.model.layers.17.mlp.down_proj.weight",
"base_model.model.model.layers.17.mlp.up_proj.weight",
"base_model.model.model.layers.17.input_layernorm.weight",
"base_model.model.model.layers.17.post_attention_layernorm.weight",
"base_model.model.model.layers.18.self_attn.q_proj.weight",
"base_model.model.model.layers.18.self_attn.k_proj.weight",
"base_model.model.model.layers.18.self_attn.v_proj.weight",
"base_model.model.model.layers.18.self_attn.o_proj.weight",
"base_model.model.model.layers.18.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.18.mlp.gate_proj.weight",
"base_model.model.model.layers.18.mlp.down_proj.weight",
"base_model.model.model.layers.18.mlp.up_proj.weight",
"base_model.model.model.layers.18.input_layernorm.weight",
"base_model.model.model.layers.18.post_attention_layernorm.weight",
"base_model.model.model.layers.19.self_attn.q_proj.weight",
"base_model.model.model.layers.19.self_attn.k_proj.weight",
"base_model.model.model.layers.19.self_attn.v_proj.weight",
"base_model.model.model.layers.19.self_attn.o_proj.weight",
"base_model.model.model.layers.19.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.19.mlp.gate_proj.weight",
"base_model.model.model.layers.19.mlp.down_proj.weight",
"base_model.model.model.layers.19.mlp.up_proj.weight",
"base_model.model.model.layers.19.input_layernorm.weight",
"base_model.model.model.layers.19.post_attention_layernorm.weight",
"base_model.model.model.layers.20.self_attn.q_proj.weight",
"base_model.model.model.layers.20.self_attn.k_proj.weight",
"base_model.model.model.layers.20.self_attn.v_proj.weight",
"base_model.model.model.layers.20.self_attn.o_proj.weight",
"base_model.model.model.layers.20.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.20.mlp.gate_proj.weight",
"base_model.model.model.layers.20.mlp.down_proj.weight",
"base_model.model.model.layers.20.mlp.up_proj.weight",
"base_model.model.model.layers.20.input_layernorm.weight",
"base_model.model.model.layers.20.post_attention_layernorm.weight",
"base_model.model.model.layers.21.self_attn.q_proj.weight",
"base_model.model.model.layers.21.self_attn.k_proj.weight",
"base_model.model.model.layers.21.self_attn.v_proj.weight",
"base_model.model.model.layers.21.self_attn.o_proj.weight",
"base_model.model.model.layers.21.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.21.mlp.gate_proj.weight",
"base_model.model.model.layers.21.mlp.down_proj.weight",
"base_model.model.model.layers.21.mlp.up_proj.weight",
"base_model.model.model.layers.21.input_layernorm.weight",
"base_model.model.model.layers.21.post_attention_layernorm.weight",
"base_model.model.model.layers.22.self_attn.q_proj.weight",
"base_model.model.model.layers.22.self_attn.k_proj.weight",
"base_model.model.model.layers.22.self_attn.v_proj.weight",
"base_model.model.model.layers.22.self_attn.o_proj.weight",
"base_model.model.model.layers.22.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.22.mlp.gate_proj.weight",
"base_model.model.model.layers.22.mlp.down_proj.weight",
"base_model.model.model.layers.22.mlp.up_proj.weight",
"base_model.model.model.layers.22.input_layernorm.weight",
"base_model.model.model.layers.22.post_attention_layernorm.weight",
"base_model.model.model.layers.23.self_attn.q_proj.weight",
"base_model.model.model.layers.23.self_attn.k_proj.weight",
"base_model.model.model.layers.23.self_attn.v_proj.weight",
"base_model.model.model.layers.23.self_attn.o_proj.weight",
"base_model.model.model.layers.23.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.23.mlp.gate_proj.weight",
"base_model.model.model.layers.23.mlp.down_proj.weight",
"base_model.model.model.layers.23.mlp.up_proj.weight",
"base_model.model.model.layers.23.input_layernorm.weight",
"base_model.model.model.layers.23.post_attention_layernorm.weight",
"base_model.model.model.layers.24.self_attn.q_proj.weight",
"base_model.model.model.layers.24.self_attn.k_proj.weight",
"base_model.model.model.layers.24.self_attn.v_proj.weight",
"base_model.model.model.layers.24.self_attn.o_proj.weight",
"base_model.model.model.layers.24.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.24.mlp.gate_proj.weight",
"base_model.model.model.layers.24.mlp.down_proj.weight",
"base_model.model.model.layers.24.mlp.up_proj.weight",
"base_model.model.model.layers.24.input_layernorm.weight",
"base_model.model.model.layers.24.post_attention_layernorm.weight",
"base_model.model.model.layers.25.self_attn.q_proj.weight",
"base_model.model.model.layers.25.self_attn.k_proj.weight",
"base_model.model.model.layers.25.self_attn.v_proj.weight",
"base_model.model.model.layers.25.self_attn.o_proj.weight",
"base_model.model.model.layers.25.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.25.mlp.gate_proj.weight",
"base_model.model.model.layers.25.mlp.down_proj.weight",
"base_model.model.model.layers.25.mlp.up_proj.weight",
"base_model.model.model.layers.25.input_layernorm.weight",
"base_model.model.model.layers.25.post_attention_layernorm.weight",
"base_model.model.model.layers.26.self_attn.q_proj.weight",
"base_model.model.model.layers.26.self_attn.k_proj.weight",
"base_model.model.model.layers.26.self_attn.v_proj.weight",
"base_model.model.model.layers.26.self_attn.o_proj.weight",
"base_model.model.model.layers.26.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.26.mlp.gate_proj.weight",
"base_model.model.model.layers.26.mlp.down_proj.weight",
"base_model.model.model.layers.26.mlp.up_proj.weight",
"base_model.model.model.layers.26.input_layernorm.weight",
"base_model.model.model.layers.26.post_attention_layernorm.weight",
"base_model.model.model.layers.27.self_attn.q_proj.weight",
"base_model.model.model.layers.27.self_attn.k_proj.weight",
"base_model.model.model.layers.27.self_attn.v_proj.weight",
"base_model.model.model.layers.27.self_attn.o_proj.weight",
"base_model.model.model.layers.27.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.27.mlp.gate_proj.weight",
"base_model.model.model.layers.27.mlp.down_proj.weight",
"base_model.model.model.layers.27.mlp.up_proj.weight",
"base_model.model.model.layers.27.input_layernorm.weight",
"base_model.model.model.layers.27.post_attention_layernorm.weight",
"base_model.model.model.layers.28.self_attn.q_proj.weight",
"base_model.model.model.layers.28.self_attn.k_proj.weight",
"base_model.model.model.layers.28.self_attn.v_proj.weight",
"base_model.model.model.layers.28.self_attn.o_proj.weight",
"base_model.model.model.layers.28.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.28.mlp.gate_proj.weight",
"base_model.model.model.layers.28.mlp.down_proj.weight",
"base_model.model.model.layers.28.mlp.up_proj.weight",
"base_model.model.model.layers.28.input_layernorm.weight",
"base_model.model.model.layers.28.post_attention_layernorm.weight",
"base_model.model.model.layers.29.self_attn.q_proj.weight",
"base_model.model.model.layers.29.self_attn.k_proj.weight",
"base_model.model.model.layers.29.self_attn.v_proj.weight",
"base_model.model.model.layers.29.self_attn.o_proj.weight",
"base_model.model.model.layers.29.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.29.mlp.gate_proj.weight",
"base_model.model.model.layers.29.mlp.down_proj.weight",
"base_model.model.model.layers.29.mlp.up_proj.weight",
"base_model.model.model.layers.29.input_layernorm.weight",
"base_model.model.model.layers.29.post_attention_layernorm.weight",
"base_model.model.model.layers.30.self_attn.q_proj.weight",
"base_model.model.model.layers.30.self_attn.k_proj.weight",
"base_model.model.model.layers.30.self_attn.v_proj.weight",
"base_model.model.model.layers.30.self_attn.o_proj.weight",
"base_model.model.model.layers.30.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.30.mlp.gate_proj.weight",
"base_model.model.model.layers.30.mlp.down_proj.weight",
"base_model.model.model.layers.30.mlp.up_proj.weight",
"base_model.model.model.layers.30.input_layernorm.weight",
"base_model.model.model.layers.30.post_attention_layernorm.weight",
"base_model.model.model.layers.31.self_attn.q_proj.weight",
"base_model.model.model.layers.31.self_attn.k_proj.weight",
"base_model.model.model.layers.31.self_attn.v_proj.weight",
"base_model.model.model.layers.31.self_attn.o_proj.weight",
"base_model.model.model.layers.31.self_attn.rotary_emb.inv_freq",
"base_model.model.model.layers.31.mlp.gate_proj.weight",
"base_model.model.model.layers.31.mlp.down_proj.weight",
"base_model.model.model.layers.31.mlp.up_proj.weight",
"base_model.model.model.layers.31.input_layernorm.weight",
"base_model.model.model.layers.31.post_attention_layernorm.weight",
"base_model.model.model.norm.weight".

I have tried this every time with the same result. Same dataset, same script; I was even able to merge into the 7B base model after training, so the script itself works, but checkpoint saving or resuming must be going wrong somewhere.

Any idea?
Steve

@airaria
Contributor

airaria commented Jun 11, 2023

I think you have already checked #464.
Here is the detailed explanation:
Since only the LoRA part (plus embed_tokens and lm_head) of the full model is saved to the checkpoint, we must allow missing keys when resuming training and loading that checkpoint.
When using Transformers with DeepSpeed, Transformers does not expose any parameter to allow missing keys (see the deepspeed_init frame from transformers/deepspeed.py in the traceback above).
Therefore we have to modify the source code.
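As a minimal illustration (a generic PyTorch sketch, not this project's code): loading a state_dict that covers only part of a module fails with exactly this kind of "Missing key(s)" RuntimeError unless strict=False is passed, which is the switch that both fixes below flip in one place or another.

    import torch
    import torch.nn as nn

    # Toy "full" model and a partial state_dict covering only its first layer,
    # mimicking a checkpoint that stores only the trainable (LoRA) weights.
    full = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
    partial = {"0.weight": torch.zeros(4, 4), "0.bias": torch.zeros(4)}

    try:
        full.load_state_dict(partial)              # strict=True by default
    except RuntimeError as err:
        print(err)                                 # Missing key(s) in state_dict: "1.weight", "1.bias"

    # With strict=False the missing keys are tolerated and merely reported.
    missing, unexpected = full.load_state_dict(partial, strict=False)
    print(missing)                                 # ['1.weight', '1.bias']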

There are two ways to fix the problem:

  1. Modify DeepSpeed:
    Continue training from the last saved checkpoint #464 (comment) (a sketch of this kind of change is shown below, after option 2).
  2. Modify Transformers:
    Change the original resume-from-checkpoint-with-DeepSpeed code (the deepspeed_init call in transformers/deepspeed.py shown in the traceback) to
          load_path, _ = deepspeed_engine.load_checkpoint(
              resume_from_checkpoint, load_optimizer_states=True, load_lr_scheduler_states=True,
              load_module_strict=False
          )
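For option 1, a sketch of what the DeepSpeed-side change could look like, based on the load_module_state_dict frame at deepspeed/runtime/engine.py:2468 in the traceback above (the actual patch discussed in #464 may differ):

    # deepspeed/runtime/engine.py, load_module_state_dict (around line 2468)
    if custom_load_fn:
        custom_load_fn(src=module_state_dict, dst=self.module)
    else:
        self.module.load_state_dict(
            module_state_dict,
            strict=False)  # was strict=strict; tolerate base-model keys missing from the LoRA-only checkpoint

Option 2 may be the less invasive of the two, since load_module_strict is an existing argument of deepspeed_engine.load_checkpoint rather than a change inside DeepSpeed itself.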

If you find a more convenient method, you are welcome to share it with us.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.
