add ckpt load save ci#1104

Merged
zhuzilin merged 2 commits into THUDM:main from lilei199908:add_ckpt_ci on Dec 13, 2025
@lilei199908
Collaborator

No description provided.

Copilot AI review requested due to automatic review settings December 12, 2025 13:40
Copilot AI left a comment

Pull request overview

This PR adds a new CI test to verify checkpoint save and load functionality for the Qwen3-4B model. The test exercises the ability to save checkpoints during training and subsequently load them for resumption.

Key changes:

  • New test file test_qwen3_4B_ckpt.py that tests checkpoint save/load by running training twice with different modes
  • Updated GitHub Actions workflow configuration to include the new test in the short e2e test suite
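The two-phase structure described above (one run that saves a checkpoint, then a second run that loads it) can be sketched as follows. This is a minimal illustration only: `fake_train`, `run_ckpt_roundtrip`, and the `latest.json` file are hypothetical stand-ins and are not taken from `test_qwen3_4B_ckpt.py`, which launches real training on GPUs.

```python
import json
import tempfile
from pathlib import Path


def fake_train(ckpt_dir: Path, mode: str, num_steps: int = 3) -> int:
    """Hypothetical stand-in for a real training launch.

    In "save" mode it starts from step 0; in "load" mode it resumes from
    the step recorded in the checkpoint. Returns the final step reached.
    """
    ckpt_file = ckpt_dir / "latest.json"
    if mode == "load":
        # Resuming fails loudly if the save phase left nothing behind.
        start_step = json.loads(ckpt_file.read_text())["step"]
    elif mode == "save":
        start_step = 0
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    final_step = start_step + num_steps
    ckpt_file.write_text(json.dumps({"step": final_step}))
    return final_step


def run_ckpt_roundtrip(ckpt_dir: Path) -> tuple[int, int]:
    """Run the save phase, then the load phase, against one directory."""
    saved_step = fake_train(ckpt_dir, mode="save")
    resumed_step = fake_train(ckpt_dir, mode="load")
    return saved_step, resumed_step


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(run_ckpt_roundtrip(Path(tmp)))
```

The point of the shape is that the load phase only succeeds if the save phase actually wrote a usable checkpoint, so a regression in either direction surfaces as a test failure.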

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/test_qwen3_4B_ckpt.py | New test file that prepares Qwen3-4B model and runs training with checkpoint save and load modes to verify checkpoint functionality |
| .github/workflows/pr-test.yml.j2 | Template file updated to include the new checkpoint test in the e2e-test-short job configuration |
| .github/workflows/pr-test.yml | Generated workflow file updated with the new test added to the test matrix |

Comment on lines +118 to +122
U.execute_train(
    train_args=train_args,
    num_gpus_per_node=NUM_GPUS,
    megatron_model_type=MODEL_TYPE,
)
Copilot AI Dec 12, 2025

The execute function contains two identical calls to U.execute_train with the same train_args. This appears to be a copy-paste error. The second call should likely be removed since the checkpoint save/load testing is already handled by calling execute twice with different modes ("save" and "load") from the main block.

Suggested change (remove these lines)
U.execute_train(
    train_args=train_args,
    num_gpus_per_node=NUM_GPUS,
    megatron_model_type=MODEL_TYPE,
)
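
For illustration, the shape the review comment argues for (one launch per call to `execute`, with the save and load phases driven by the caller) might look like the sketch below. The `launch` parameter, the placeholder argument string, and `count_launches` are assumptions made for this example; the real test calls `U.execute_train` with its own `train_args`.

```python
from typing import Callable


def execute(mode: str, launch: Callable[[str], None]) -> None:
    """Build the arguments for one phase and launch training exactly once."""
    # Placeholder string; the real test assembles actual training arguments.
    train_args = f"placeholder-args-for-{mode}"
    launch(train_args)  # a single call; a duplicated call would train twice


def count_launches(modes: list[str]) -> list[str]:
    """Run each phase in order and record every launch that happened."""
    calls: list[str] = []
    for mode in modes:
        execute(mode, calls.append)
    return calls


if __name__ == "__main__":
    print(count_launches(["save", "load"]))
```

With a single call inside `execute`, two phases produce exactly two launches, which keeps the CI job's runtime proportional to the number of phases.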

@lilei199908 lilei199908 requested a review from zhuzilin December 12, 2025 17:39
@zhuzilin zhuzilin merged commit c525704 into THUDM:main Dec 13, 2025
10 checks passed
Fengzdadi pushed a commit to Fengzdadi/slime that referenced this pull request Dec 19, 2025