Pull request overview
This PR adds a new CI test to verify checkpoint save and load functionality for the Qwen3-4B model. The test exercises the ability to save checkpoints during training and subsequently load them for resumption.
Key changes:
- New test file `test_qwen3_4B_ckpt.py` that tests checkpoint save/load by running training twice with different modes
- Updated GitHub Actions workflow configuration to include the new test in the short e2e test suite
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/test_qwen3_4B_ckpt.py | New test file that prepares Qwen3-4B model and runs training with checkpoint save and load modes to verify checkpoint functionality |
| .github/workflows/pr-test.yml.j2 | Template file updated to include the new checkpoint test in the e2e-test-short job configuration |
| .github/workflows/pr-test.yml | Generated workflow file updated with the new test added to the test matrix |
```python
    U.execute_train(
        train_args=train_args,
        num_gpus_per_node=NUM_GPUS,
        megatron_model_type=MODEL_TYPE,
    )
```
The `execute` function contains two identical calls to `U.execute_train` with the same `train_args`. This appears to be a copy-paste error. The second call should likely be removed, since checkpoint save/load testing is already handled by calling `execute` twice with different modes (`"save"` and `"load"`) from the main block.
Suggested change:

```diff
-    U.execute_train(
-        train_args=train_args,
-        num_gpus_per_node=NUM_GPUS,
-        megatron_model_type=MODEL_TYPE,
-    )
```
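To illustrate the intended shape after the fix, here is a minimal sketch in which `execute` launches training exactly once per mode and the main flow calls it twice. Everything beyond the names quoted in the comment is an assumption: `FakeTrainer` is a hypothetical stand-in for the `U` helper module, `build_train_args`, the checkpoint path, the flag strings, and the values of `NUM_GPUS` and `MODEL_TYPE` are placeholders, not the PR's actual code.

```python
from dataclasses import dataclass, field

NUM_GPUS = 8             # assumed value; the real constant lives in the test file
MODEL_TYPE = "qwen3-4B"  # assumed value; the real constant lives in the test file


@dataclass
class FakeTrainer:
    """Hypothetical stand-in for the `U` helper: records calls instead of training."""

    calls: list = field(default_factory=list)

    def execute_train(self, train_args, num_gpus_per_node, megatron_model_type):
        self.calls.append((train_args, num_gpus_per_node, megatron_model_type))


def build_train_args(mode, ckpt_dir="/tmp/ckpt"):
    """Return illustrative arguments for a mode; the flag strings are placeholders."""
    if mode == "save":
        return f"--save {ckpt_dir}"
    if mode == "load":
        return f"--load {ckpt_dir}"
    raise ValueError(f"unknown mode: {mode!r}")


def execute(U, mode):
    """Run one training job for the given mode (a single call, no duplicate)."""
    train_args = build_train_args(mode)
    U.execute_train(
        train_args=train_args,
        num_gpus_per_node=NUM_GPUS,
        megatron_model_type=MODEL_TYPE,
    )


def run_checkpoint_round_trip(U):
    """Save a checkpoint in the first run, then resume from it in the second."""
    for mode in ("save", "load"):
        execute(U, mode)
```

With this structure, the save/load coverage comes entirely from the two `execute` calls in the main flow, so a second `U.execute_train` inside `execute` would only repeat the same run.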