Merged

Commits (64)
fb763af
feat: add EN usage docs for fsdp
lin0303-siyuan Dec 11, 2025
6a026a0
fix_load_ckpt
lilei199908 Dec 12, 2025
3645cfa
Merge pull request #1095 from lilei199908/fix_load_ckpt
lilei199908 Dec 12, 2025
ba83b3a
fix actor init bugs (#1098)
lilei199908 Dec 12, 2025
8ef0187
Fix gqa model tflops compute (#1099)
zhuzilin Dec 12, 2025
96c2ed7
Fix bug for convert_hf_to_torch_dist.py (#1100)
zhuzilin Dec 12, 2025
8ca9cd5
Add FSDP conversion repro and verification scripts
cklxx Dec 12, 2025
c8135b8
Remove FSDP repro helper scripts
cklxx Dec 12, 2025
3024ea0
Add dedicated FSDP conversion script
cklxx Dec 12, 2025
cd7a7d1
Scope pickle unpickling in FSDP converter
cklxx Dec 12, 2025
06bc49b
Remove FSDP rationale note from Chinese quick start
cklxx Dec 12, 2025
ca0b64f
Merge pull request #3 from cklxx/codex/fix-issue-1094-in-slime-reposi…
cklxx Dec 12, 2025
9af059a
Refactor model loading and import statements
cklxx Dec 12, 2025
7989163
0.2.1
lilei199908 Dec 12, 2025
44340e3
fix actor init bugs
lilei199908 Dec 12, 2025
6e1ea5e
fix actor init bugs
lilei199908 Dec 12, 2025
8f1a398
fix actor init bugs
lilei199908 Dec 12, 2025
54cc85b
fix actor init bugs
lilei199908 Dec 12, 2025
a6b687a
bump to 0.2.1
lilei199908 Dec 12, 2025
0934a0e
Merge pull request #1096 from lilei199908/0.2.1
lilei199908 Dec 12, 2025
79c51ad
fix: address MR reviews
lin0303-siyuan Dec 12, 2025
c525704
add ckpt load save ci (#1104)
lilei199908 Dec 13, 2025
ab3eb3d
Add --rollout-all-samples-process-path for RLVE (#1107)
zhuzilin Dec 13, 2025
91bcf34
feat: support Qwen3 Moe BackEnd Kernel (#1071)
attack204 Dec 13, 2025
aed8ab4
Include model state tensors when reading FSDP checkpoints
cklxx Dec 15, 2025
20af4f1
Merge branch 'codex/fix-issue-1094-in-slime-repository-r1xqo4' into f…
cklxx Dec 15, 2025
b81f035
Merge pull request #4 from cklxx/fix-issue-1094-in-slime-repository
cklxx Dec 15, 2025
ac5328b
fix max response/context/prompt len (#1110)
lilei199908 Dec 15, 2025
9a206e9
fix max len
lilei199908 Dec 15, 2025
3537ec5
Merge pull request #1112 from lilei199908/update
lilei199908 Dec 15, 2025
bf59b94
Add prefix override and test FSDP conversion
cklxx Dec 15, 2025
431c591
Merge pull request #5 from cklxx/fix-model-state-key-errors
cklxx Dec 15, 2025
9df2964
[docker] remove amem and support deepep + r3 (#1115)
zhuzilin Dec 15, 2025
3486fb9
[Fix] Fix early return in init rollout engine (#1118)
yitianlian Dec 15, 2025
8f579d6
[Fix] Add sglang patch for weight version update (#1119)
yitianlian Dec 15, 2025
016c85d
fix: improve tokenization (#1113)
nanjiangwill Dec 16, 2025
e0e840d
[Feature] Add CI test for weight version update (#1120)
yitianlian Dec 16, 2025
8a825f7
[docker] optimize r3 with base64 encode (#1124)
zhuzilin Dec 16, 2025
1069e46
Delete tests/test_convert_fsdp_to_hf.py
cklxx Dec 16, 2025
2924d85
[docker] fix r3 gather buffer (#1129)
zhuzilin Dec 16, 2025
cecba06
[docker] support mtp for r3 (#1131)
zhuzilin Dec 16, 2025
ef9a2ec
[Fix] Fix some bugs in retool example (#1130)
yitianlian Dec 16, 2025
b1daf2b
Add finalize_model_grads_with_empty_cache (#1133)
zhuzilin Dec 16, 2025
24b70fe
fix: address MR reviews
lin0303-siyuan Dec 17, 2025
602c077
feat: add zh docs
lin0303-siyuan Dec 17, 2025
12b35ee
fix: address markdown formats
lin0303-siyuan Dec 18, 2025
4eaee1e
Merge pull request #1092 from lin0303-siyuan/feat/fsdp-doc
PopSoda2002 Dec 18, 2025
461fc8a
Reserve more ports for new sglang dp attn impl (#1142)
zhuzilin Dec 18, 2025
d2e5531
Simplify FSDP prefix handling
cklxx Dec 18, 2025
598d7e0
Merge pull request #6 from cklxx/remove-prefix-flag-from-cli
cklxx Dec 18, 2025
583690d
Blog: fix the path of the Blog's architecture image (#1125)
ShanningZhuang Dec 18, 2025
98d7f46
Format convert_fsdp_to_hf with black
cklxx Dec 18, 2025
75db65d
Merge pull request #7 from cklxx/fix-code-formatting-issues-with-black
cklxx Dec 18, 2025
0fe58b4
Support async save and add extra save at the end of the training (#1143)
zhuzilin Dec 18, 2025
bc0e70a
fix: fix GemmeRMSNorm.forward() bug (#1121)
nanjiangwill Dec 18, 2025
b23fcd1
[WIP][FSDP] Support FSDP for Qwen3Next (#1116)
rucnyz Dec 18, 2025
0850fbe
Megatron VLM Support (1/N) (#1123)
Zhuohao-Li Dec 19, 2025
34bb0cd
Update deprecated huggingface-cli and fix broken links (#1147)
Lyken17 Dec 19, 2025
54762b7
Merge pull request #1101 from cklxx/codex/fix-issue-1094-in-slime-rep…
PopSoda2002 Dec 19, 2025
b4399e8
minor fix for megatron compatibility (#1149)
zhuzilin Dec 19, 2025
9ac37f9
Remove config_mapping to use megatron-bridge (#1166)
zhuzilin Dec 21, 2025
f3a96f8
Avoids repeated work. (#1163)
qqwqqw689 Dec 21, 2025
87cbed6
Make tools/convert_torch_dist_to_hf.py not rely on megatron (#1167)
zhuzilin Dec 21, 2025
ece2624
support converting dpsk mtp layer (#1169)
zhuzilin Dec 21, 2025
44 changes: 44 additions & 0 deletions .github/workflows/pr-test.yml
@@ -156,3 +156,47 @@ jobs:
- name: Execute
shell: bash
run: python tests/ci/gpu_lock_exec.py --count ${{ matrix.info.num_gpus }} -- python tests/${{ matrix.info.test_file }}

e2e-test-ckpt:
if: (github.event_name == 'workflow_dispatch') || (github.event.pull_request && contains(github.event.pull_request.labels.*.name, 'run-ci-ckpt'))
runs-on: self-hosted
container:
image: slimerl/slime:latest
options: >
--gpus all
--ipc=host
--shm-size=16g
--ulimit memlock=-1
--ulimit stack=67108864
--memory=0
--memory-swap=0
-e http_proxy=$http_proxy
-e https_proxy=$https_proxy
-e HTTP_PROXY=$HTTP_PROXY
-e HTTPS_PROXY=$HTTPS_PROXY
-v /mnt/nvme0n1/slime_ci:/data/slime_ci
-v /mnt/nvme0n1/slime_ci/models:/root/models
-v /mnt/nvme0n1/slime_ci/datasets:/root/datasets
strategy:
fail-fast: false
matrix:
info: [{"num_gpus": 8, "test_file": "test_qwen3_4B_ckpt.py"}, {"num_gpus": 8, "test_file": "test_qwen3_4B_ckpt.py --async-save"}]
defaults:
run:
working-directory: ${{ github.workspace }}
env:
GITHUB_COMMIT_NAME: ${{ github.sha }}_${{ github.event.pull_request.number || 'non-pr' }}
WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
SLIME_TEST_ENABLE_INFINITE_RUN: ${{ (github.event_name == 'workflow_dispatch' && github.event.inputs.infinite_run) || 'false' }}

steps:
- name: Checkout repository
uses: actions/checkout@v4

- name: Install
shell: bash
run: cd $GITHUB_WORKSPACE && pip install -e . --break-system-packages

- name: Execute
shell: bash
run: python tests/ci/gpu_lock_exec.py --count ${{ matrix.info.num_gpus }} -- python tests/${{ matrix.info.test_file }}
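For reference, each matrix entry expands into the Execute step's command line above, with everything after `--` forwarded by `gpu_lock_exec.py` to the test itself — which is how the second entry passes `--async-save` through. A minimal local repro sketch, assuming a host with 8 free GPUs and the repo installed as in the Install step:

python tests/ci/gpu_lock_exec.py --count 8 -- python tests/test_qwen3_4B_ckpt.py
python tests/ci/gpu_lock_exec.py --count 8 -- python tests/test_qwen3_4B_ckpt.py --async-save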
7 changes: 7 additions & 0 deletions .github/workflows/pr-test.yml.j2
@@ -24,6 +24,13 @@
{'test_file': 'test_qwen3_0.6B_parallel_check.py', 'num_gpus': 8},
],
},
'e2e-test-ckpt': {
'label': 'run-ci-ckpt',
'tests': [
{'test_file': 'test_qwen3_4B_ckpt.py', 'num_gpus': 8},
{'test_file': 'test_qwen3_4B_ckpt.py --async-save', 'num_gpus': 8},
],
},
} %>
name: PR Test

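The fragment above is the Jinja-style template that generates the `e2e-test-ckpt` job and its `run-ci-ckpt` label gate in `pr-test.yml`. A render sketch, assuming plain Jinja2 with `<%`/`%>` statement delimiters as the `} %>` terminator suggests (the renderer the repo actually uses is not shown in this diff):

python - <<'PY'
import jinja2

# Assumption: '<%' / '%>' statement delimiters, inferred from the template fragment.
env = jinja2.Environment(
    loader=jinja2.FileSystemLoader(".github/workflows"),
    block_start_string="<%",
    block_end_string="%>",
)
print(env.get_template("pr-test.yml.j2").render())
PY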
17 changes: 10 additions & 7 deletions build_conda.sh
@@ -21,13 +21,13 @@ micromamba install -n slime cuda cuda-nvtx cuda-nvtx-dev nccl -c nvidia/label/cu
micromamba install -n slime -c conda-forge cudnn -y

# prevent installing cuda 13.0 for sglang
-pip install cuda-python==12.9.1
-pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu129
+pip install cuda-python==13.1.0
+pip install torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 --index-url https://download.pytorch.org/whl/cu129

# install sglang
git clone https://github.com/sgl-project/sglang.git
cd sglang
-git checkout 303cc957e62384044dfa8e52d7d8af8abe12f0ac
+git checkout 5e2cda6158e670e64b926a9985d65826c537ac82
# Install the python packages
pip install -e "python[all]"

@@ -39,7 +39,7 @@ pip install cmake ninja
MAX_JOBS=64 pip -v install flash-attn==2.7.4.post1 --no-build-isolation

pip install git+https://github.com/ISEEKYAN/mbridge.git@89eb10887887bc74853f89a4de258c0702932a1c --no-deps
-pip install --no-build-isolation "transformer_engine[pytorch]==2.8.0"
+pip install --no-build-isolation "transformer_engine[pytorch]==2.10.0"
pip install flash-linear-attention==0.4.0
NVCC_APPEND_FLAGS="--threads 4" \
pip -v install --disable-pip-version-check --no-cache-dir \
@@ -50,7 +50,7 @@ git clone https://github.com/NVIDIA/Megatron-LM.git --recursive && \
cd Megatron-LM && git checkout ${MEGATRON_COMMIT} && \
pip install -e .

-pip install git+https://github.com/fzyzcjy/torch_memory_saver.git@9b8b788fdeb9c2ee528183214cef65a99b71e7d5 --no-cache-dir --force-reinstall
+pip install git+https://github.com/fzyzcjy/torch_memory_saver.git@dc6876905830430b5054325fa4211ff302169c6b --no-cache-dir --force-reinstall
pip install git+https://github.com/fzyzcjy/Megatron-Bridge.git@dev_rl --no-build-isolation
pip install nvidia-modelopt[torch]>=0.37.0 --no-build-isolation

@@ -60,6 +60,9 @@ git clone https://github.com/NVIDIA/Megatron-LM.git --recursive && \
cd Megatron-LM/ && git checkout core_v0.14.0 && \
pip install -e .

+# https://github.com/pytorch/pytorch/issues/168167
+pip install nvidia-cudnn-cu12==9.16.0.29
+
# install slime and apply patches

# if slime does not exist locally, clone it
@@ -76,6 +79,6 @@ fi

# apply patch
cd $BASE_DIR/sglang
-git apply $SLIME_DIR/docker/patch/v0.5.5.post1/sglang.patch
+git apply $SLIME_DIR/docker/patch/v0.5.6/sglang.patch
cd $BASE_DIR/Megatron-LM
-git apply $SLIME_DIR/docker/patch/v0.5.5.post1/megatron.patch
+git apply $SLIME_DIR/docker/patch/v0.5.6/megatron.patch
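After the script completes, a quick sanity check that the bumped pins took effect — a sketch; exact CUDA build strings vary by machine:

python -c "import torch; print(torch.__version__, torch.version.cuda)"   # expect 2.9.1 on a cu12.9 build
pip show cuda-python transformer_engine | grep -E '^(Name|Version)'      # expect 13.1.0 and 2.10.0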
16 changes: 1 addition & 15 deletions docker/Dockerfile
@@ -71,25 +71,11 @@ RUN if [ "$ENABLE_CUDA_13" = "1" ]; then \
python3 -m pip install https://github.com/sgl-project/whl/releases/download/v${SGL_KERNEL_VERSION}/sgl_kernel-${SGL_KERNEL_VERSION}+cu130-cp310-abi3-manylinux2014_$(uname -m).whl --force-reinstall --no-deps; \
fi

-# AMEM
-# we need to create a fake libcuda.so.1 to make the linker happy when building AMEM
-ENV CUDA_DIR=/usr/local/cuda
-ENV CUDA_STUBS=${CUDA_DIR}/lib64/stubs
-RUN ln -s ${CUDA_STUBS}/libcuda.so ${CUDA_STUBS}/libcuda.so.1 && \
-    echo "${CUDA_STUBS}" > /etc/ld.so.conf.d/z-cuda-stubs.conf && \
-    ldconfig
-RUN git clone https://github.com/inclusionAI/asystem-amem.git && \
-    cd asystem-amem && git checkout 6483bb17c9a98b51c3a94b7048467d5b50fbad4b && \
-    git submodule init && git submodule update && \
-    MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi/ ./build.sh && \
-    mv /usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib/libnccl.so.2 /usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib/libnccl.so.2.bak && \
-    cp -r third_party/nccl/build/lib/* /usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib/
-
# https://github.com/pytorch/pytorch/issues/168167
RUN pip install nvidia-cudnn-cu12==9.16.0.29

RUN rm /root/.tmux.conf
-RUN rm -rf /root/.cache/pip /root/asystem-amem /root/flash-attention
+RUN rm -rf /root/.cache/pip /root/flash-attention

# ====================================== Patches ============================================

4 changes: 2 additions & 2 deletions docker/README.md
@@ -5,10 +5,10 @@ We will publish 2 kinds of docker images:
2. latest version, which aligns to `lmsysorg/sglang:latest`.

current stable version is:
-- sglang v0.5.5.post1 (303cc957e62384044dfa8e52d7d8af8abe12f0ac), megatron v0.14.0 (23e00ed0963c35382dfe8a5a94fb3cda4d21e133)
+- sglang nightly-dev-20251208-5e2cda61 (5e2cda6158e670e64b926a9985d65826c537ac82), megatron v0.14.0 (23e00ed0963c35382dfe8a5a94fb3cda4d21e133)

history versions:
- sglang v0.5.0rc0-cu126 (8ecf6b9d2480c3f600826c7d8fef6a16ed603c3f), megatron 48406695c4efcf1026a7ed70bb390793918dd97b
+- sglang v0.5.5.post1 (303cc957e62384044dfa8e52d7d8af8abe12f0ac), megatron v0.14.0 (23e00ed0963c35382dfe8a5a94fb3cda4d21e133)

The command to build:

4 changes: 2 additions & 2 deletions docker/patch/latest/megatron.patch
@@ -219,14 +219,14 @@ index 6aec66e6d..6ca48b55f 100644
mtp_loss = loss_mask * mtp_loss
if self.training:
diff --git a/megatron/core/optimizer/distrib_optimizer.py b/megatron/core/optimizer/distrib_optimizer.py
-index a36b67364..8739270f2 100644
+index a36b67364..ed8883e32 100644
--- a/megatron/core/optimizer/distrib_optimizer.py
+++ b/megatron/core/optimizer/distrib_optimizer.py
@@ -657,6 +657,8 @@ class DistributedOptimizer(MixedPrecisionOptimizer):
# TE FusedAdam will not accumulate step for empty param groups, so we need to
# align the step across param groups.
param_group["step"] = int(step)
-+ if param_group["step"] is None:
++ if "step" in param_group and param_group["step"] is None:
+ del param_group["step"]

# Grad scaler state.
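The one-line change above widens the guard this patch adds to `distrib_optimizer.py`: as the patched comment notes, TE FusedAdam does not accumulate step for empty param groups, so `"step"` can be missing from `param_group` entirely, and the old check raised `KeyError` before it could ever see `None`. A standalone repro sketch, with names taken from the patch:

python - <<'PY'
param_group = {}  # an empty param group that never received a "step" entry

# old guard: subscripting a missing key raises KeyError
try:
    if param_group["step"] is None:
        del param_group["step"]
except KeyError:
    print("old guard: KeyError when 'step' is absent")

# new guard: the membership test short-circuits, so a missing key is fine
if "step" in param_group and param_group["step"] is None:
    del param_group["step"]
print("new guard: no error")
PY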