Merged

Commits (64)
fb763af
feat: add EN usage docs for fsdp
lin0303-siyuan Dec 11, 2025
6a026a0
fix_load_ckpt
lilei199908 Dec 12, 2025
3645cfa
Merge pull request #1095 from lilei199908/fix_load_ckpt
lilei199908 Dec 12, 2025
ba83b3a
fix actor init bugs (#1098)
lilei199908 Dec 12, 2025
8ef0187
Fix gqa model tflops compute (#1099)
zhuzilin Dec 12, 2025
96c2ed7
Fix bug for convert_hf_to_torch_dist.py (#1100)
zhuzilin Dec 12, 2025
8ca9cd5
Add FSDP conversion repro and verification scripts
cklxx Dec 12, 2025
c8135b8
Remove FSDP repro helper scripts
cklxx Dec 12, 2025
3024ea0
Add dedicated FSDP conversion script
cklxx Dec 12, 2025
cd7a7d1
Scope pickle unpickling in FSDP converter
cklxx Dec 12, 2025
06bc49b
Remove FSDP rationale note from Chinese quick start
cklxx Dec 12, 2025
ca0b64f
Merge pull request #3 from cklxx/codex/fix-issue-1094-in-slime-reposi…
cklxx Dec 12, 2025
9af059a
Refactor model loading and import statements
cklxx Dec 12, 2025
7989163
0.2.1
lilei199908 Dec 12, 2025
44340e3
fix actor init bugs
lilei199908 Dec 12, 2025
6e1ea5e
fix actor init bugs
lilei199908 Dec 12, 2025
8f1a398
fix actor init bugs
lilei199908 Dec 12, 2025
54cc85b
fix actor init bugs
lilei199908 Dec 12, 2025
a6b687a
bump to 0.2.1
lilei199908 Dec 12, 2025
0934a0e
Merge pull request #1096 from lilei199908/0.2.1
lilei199908 Dec 12, 2025
79c51ad
fix: address MR reviews
lin0303-siyuan Dec 12, 2025
c525704
add ckpt load save ci (#1104)
lilei199908 Dec 13, 2025
ab3eb3d
Add --rollout-all-samples-process-path for RLVE (#1107)
zhuzilin Dec 13, 2025
91bcf34
feat: support Qwen3 Moe BackEnd Kernel (#1071)
attack204 Dec 13, 2025
aed8ab4
Include model state tensors when reading FSDP checkpoints
cklxx Dec 15, 2025
20af4f1
Merge branch 'codex/fix-issue-1094-in-slime-repository-r1xqo4' into f…
cklxx Dec 15, 2025
b81f035
Merge pull request #4 from cklxx/fix-issue-1094-in-slime-repository
cklxx Dec 15, 2025
ac5328b
fix max response/context/prompt len (#1110)
lilei199908 Dec 15, 2025
9a206e9
fix max len
lilei199908 Dec 15, 2025
3537ec5
Merge pull request #1112 from lilei199908/update
lilei199908 Dec 15, 2025
bf59b94
Add prefix override and test FSDP conversion
cklxx Dec 15, 2025
431c591
Merge pull request #5 from cklxx/fix-model-state-key-errors
cklxx Dec 15, 2025
9df2964
[docker] remove amem and support deepep + r3 (#1115)
zhuzilin Dec 15, 2025
3486fb9
[Fix] Fix early return in init rollout engine (#1118)
yitianlian Dec 15, 2025
8f579d6
[Fix] Add sglang patch for weight version update (#1119)
yitianlian Dec 15, 2025
016c85d
fix: improve tokenization (#1113)
nanjiangwill Dec 16, 2025
e0e840d
[Feature] Add CI test for weight version update (#1120)
yitianlian Dec 16, 2025
8a825f7
[docker] optimize r3 with base64 encode (#1124)
zhuzilin Dec 16, 2025
1069e46
Delete tests/test_convert_fsdp_to_hf.py
cklxx Dec 16, 2025
2924d85
[docker] fix r3 gather buffer (#1129)
zhuzilin Dec 16, 2025
cecba06
[docker] support mtp for r3 (#1131)
zhuzilin Dec 16, 2025
ef9a2ec
[Fix] Fix some bugs in retool example (#1130)
yitianlian Dec 16, 2025
b1daf2b
Add finalize_model_grads_with_empty_cache (#1133)
zhuzilin Dec 16, 2025
24b70fe
fix: address MR reviews
lin0303-siyuan Dec 17, 2025
602c077
feat: add zh docs
lin0303-siyuan Dec 17, 2025
12b35ee
fix: address markdown formats
lin0303-siyuan Dec 18, 2025
4eaee1e
Merge pull request #1092 from lin0303-siyuan/feat/fsdp-doc
PopSoda2002 Dec 18, 2025
461fc8a
Reserve more ports for new sglang dp attn impl (#1142)
zhuzilin Dec 18, 2025
d2e5531
Simplify FSDP prefix handling
cklxx Dec 18, 2025
598d7e0
Merge pull request #6 from cklxx/remove-prefix-flag-from-cli
cklxx Dec 18, 2025
583690d
Blog: fix the path of the Blog's architecture image (#1125)
ShanningZhuang Dec 18, 2025
98d7f46
Format convert_fsdp_to_hf with black
cklxx Dec 18, 2025
75db65d
Merge pull request #7 from cklxx/fix-code-formatting-issues-with-black
cklxx Dec 18, 2025
0fe58b4
Support async save and add extra save at the end of the training (#1143)
zhuzilin Dec 18, 2025
bc0e70a
fix: fix GemmeRMSNorm.forward() bug (#1121)
nanjiangwill Dec 18, 2025
b23fcd1
[WIP][FSDP] Support FSDP for Qwen3Next (#1116)
rucnyz Dec 18, 2025
0850fbe
Megatron VLM Support (1/N) (#1123)
Zhuohao-Li Dec 19, 2025
34bb0cd
Update deprecated huggingface-cli and fix broken links (#1147)
Lyken17 Dec 19, 2025
54762b7
Merge pull request #1101 from cklxx/codex/fix-issue-1094-in-slime-rep…
PopSoda2002 Dec 19, 2025
b4399e8
minor fix for megatron compatibility (#1149)
zhuzilin Dec 19, 2025
9ac37f9
Remove config_mapping to use megatron-bridge (#1166)
zhuzilin Dec 21, 2025
f3a96f8
Avoids repeated work. (#1163)
qqwqqw689 Dec 21, 2025
87cbed6
Make tools/convert_torch_dist_to_hf.py not rely on megatron (#1167)
zhuzilin Dec 21, 2025
ece2624
support converting dpsk mtp layer (#1169)
zhuzilin Dec 21, 2025
44 changes: 44 additions & 0 deletions .github/workflows/pr-test.yml
@@ -156,3 +156,47 @@ jobs:
- name: Execute
shell: bash
run: python tests/ci/gpu_lock_exec.py --count ${{ matrix.info.num_gpus }} -- python tests/${{ matrix.info.test_file }}

e2e-test-ckpt:
if: (github.event_name == 'workflow_dispatch') || (github.event.pull_request && contains(github.event.pull_request.labels.*.name, 'run-ci-ckpt'))
runs-on: self-hosted
container:
image: slimerl/slime:latest
options: >
--gpus all
--ipc=host
--shm-size=16g
--ulimit memlock=-1
--ulimit stack=67108864
--memory=0
--memory-swap=0
-e http_proxy=$http_proxy
-e https_proxy=$https_proxy
-e HTTP_PROXY=$HTTP_PROXY
-e HTTPS_PROXY=$HTTPS_PROXY
-v /mnt/nvme0n1/slime_ci:/data/slime_ci
-v /mnt/nvme0n1/slime_ci/models:/root/models
-v /mnt/nvme0n1/slime_ci/datasets:/root/datasets
strategy:
fail-fast: false
matrix:
info: [{"num_gpus": 8, "test_file": "test_qwen3_4B_ckpt.py"}, {"num_gpus": 8, "test_file": "test_qwen3_4B_ckpt.py --async-save"}]
defaults:
run:
working-directory: ${{ github.workspace }}
env:
GITHUB_COMMIT_NAME: ${{ github.sha }}_${{ github.event.pull_request.number || 'non-pr' }}
WANDB_API_KEY: ${{ secrets.WANDB_API_KEY }}
SLIME_TEST_ENABLE_INFINITE_RUN: ${{ (github.event_name == 'workflow_dispatch' && github.event.inputs.infinite_run) || 'false' }}

steps:
- name: Checkout repository
uses: actions/checkout@v4

- name: Install
shell: bash
run: cd $GITHUB_WORKSPACE && pip install -e . --break-system-packages

- name: Execute
shell: bash
run: python tests/ci/gpu_lock_exec.py --count ${{ matrix.info.num_gpus }} -- python tests/${{ matrix.info.test_file }}
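For reference, each matrix entry expands into the Execute step's command line above, with everything after `--` forwarded by `gpu_lock_exec.py` to the test itself — which is how the second entry passes `--async-save` through. A minimal local repro sketch, assuming a host with 8 free GPUs and the repo installed as in the Install step:

python tests/ci/gpu_lock_exec.py --count 8 -- python tests/test_qwen3_4B_ckpt.py
python tests/ci/gpu_lock_exec.py --count 8 -- python tests/test_qwen3_4B_ckpt.py --async-save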
7 changes: 7 additions & 0 deletions .github/workflows/pr-test.yml.j2
@@ -24,6 +24,13 @@
{'test_file': 'test_qwen3_0.6B_parallel_check.py', 'num_gpus': 8},
],
},
'e2e-test-ckpt': {
'label': 'run-ci-ckpt',
'tests': [
{'test_file': 'test_qwen3_4B_ckpt.py', 'num_gpus': 8},
{'test_file': 'test_qwen3_4B_ckpt.py --async-save', 'num_gpus': 8},
],
},
} %>
name: PR Test

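The fragment above is the Jinja-style template that generates the `e2e-test-ckpt` job and its `run-ci-ckpt` label gate in `pr-test.yml`. A render sketch, assuming plain Jinja2 with `<%`/`%>` statement delimiters as the `} %>` terminator suggests (the renderer the repo actually uses is not shown in this diff):

python - <<'PY'
import jinja2

# Assumption: '<%' / '%>' statement delimiters, inferred from the template fragment.
env = jinja2.Environment(
    loader=jinja2.FileSystemLoader(".github/workflows"),
    block_start_string="<%",
    block_end_string="%>",
)
print(env.get_template("pr-test.yml.j2").render())
PY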
17 changes: 10 additions & 7 deletions build_conda.sh
@@ -21,13 +21,13 @@ micromamba install -n slime cuda cuda-nvtx cuda-nvtx-dev nccl -c nvidia/label/cu
micromamba install -n slime -c conda-forge cudnn -y

# prevent installing cuda 13.0 for sglang
-pip install cuda-python==12.9.1
-pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/cu129
+pip install cuda-python==13.1.0
+pip install torch==2.9.1 torchvision==0.24.1 torchaudio==2.9.1 --index-url https://download.pytorch.org/whl/cu129

# install sglang
git clone https://github.com/sgl-project/sglang.git
cd sglang
-git checkout 303cc957e62384044dfa8e52d7d8af8abe12f0ac
+git checkout 5e2cda6158e670e64b926a9985d65826c537ac82
# Install the python packages
pip install -e "python[all]"

@@ -39,7 +39,7 @@ pip install cmake ninja
MAX_JOBS=64 pip -v install flash-attn==2.7.4.post1 --no-build-isolation

pip install git+https://github.com/ISEEKYAN/mbridge.git@89eb10887887bc74853f89a4de258c0702932a1c --no-deps
-pip install --no-build-isolation "transformer_engine[pytorch]==2.8.0"
+pip install --no-build-isolation "transformer_engine[pytorch]==2.10.0"
pip install flash-linear-attention==0.4.0
NVCC_APPEND_FLAGS="--threads 4" \
pip -v install --disable-pip-version-check --no-cache-dir \
@@ -50,7 +50,7 @@ git clone https://github.com/NVIDIA/Megatron-LM.git --recursive && \
cd Megatron-LM && git checkout ${MEGATRON_COMMIT} && \
pip install -e .

-pip install git+https://github.com/fzyzcjy/torch_memory_saver.git@9b8b788fdeb9c2ee528183214cef65a99b71e7d5 --no-cache-dir --force-reinstall
+pip install git+https://github.com/fzyzcjy/torch_memory_saver.git@dc6876905830430b5054325fa4211ff302169c6b --no-cache-dir --force-reinstall
pip install git+https://github.com/fzyzcjy/Megatron-Bridge.git@dev_rl --no-build-isolation
pip install nvidia-modelopt[torch]>=0.37.0 --no-build-isolation

@@ -60,6 +60,9 @@ git clone https://github.com/NVIDIA/Megatron-LM.git --recursive && \
cd Megatron-LM/ && git checkout core_v0.14.0 && \
pip install -e .

+# https://github.com/pytorch/pytorch/issues/168167
+pip install nvidia-cudnn-cu12==9.16.0.29
+
# install slime and apply patches

# if slime does not exist locally, clone it
@@ -76,6 +79,6 @@ fi

# apply patch
cd $BASE_DIR/sglang
-git apply $SLIME_DIR/docker/patch/v0.5.5.post1/sglang.patch
+git apply $SLIME_DIR/docker/patch/v0.5.6/sglang.patch
cd $BASE_DIR/Megatron-LM
-git apply $SLIME_DIR/docker/patch/v0.5.5.post1/megatron.patch
+git apply $SLIME_DIR/docker/patch/v0.5.6/megatron.patch
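After the script completes, a quick sanity check that the bumped pins took effect — a sketch; exact CUDA build strings vary by machine:

python -c "import torch; print(torch.__version__, torch.version.cuda)"   # expect 2.9.1 on a cu12.9 build
pip show cuda-python transformer_engine | grep -E '^(Name|Version)'      # expect 13.1.0 and 2.10.0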
16 changes: 1 addition & 15 deletions docker/Dockerfile
@@ -71,25 +71,11 @@ RUN if [ "$ENABLE_CUDA_13" = "1" ]; then \
python3 -m pip install https://github.com/sgl-project/whl/releases/download/v${SGL_KERNEL_VERSION}/sgl_kernel-${SGL_KERNEL_VERSION}+cu130-cp310-abi3-manylinux2014_$(uname -m).whl --force-reinstall --no-deps; \
fi

-# AMEM
-# we need to create a fake libcuda.so.1 to make the linker happy when building AMEM
-ENV CUDA_DIR=/usr/local/cuda
-ENV CUDA_STUBS=${CUDA_DIR}/lib64/stubs
-RUN ln -s ${CUDA_STUBS}/libcuda.so ${CUDA_STUBS}/libcuda.so.1 && \
-    echo "${CUDA_STUBS}" > /etc/ld.so.conf.d/z-cuda-stubs.conf && \
-    ldconfig
-RUN git clone https://github.com/inclusionAI/asystem-amem.git && \
-    cd asystem-amem && git checkout 6483bb17c9a98b51c3a94b7048467d5b50fbad4b && \
-    git submodule init && git submodule update && \
-    MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi/ ./build.sh && \
-    mv /usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib/libnccl.so.2 /usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib/libnccl.so.2.bak && \
-    cp -r third_party/nccl/build/lib/* /usr/local/lib/python3.12/dist-packages/nvidia/nccl/lib/
-
# https://github.com/pytorch/pytorch/issues/168167
RUN pip install nvidia-cudnn-cu12==9.16.0.29

RUN rm /root/.tmux.conf
-RUN rm -rf /root/.cache/pip /root/asystem-amem /root/flash-attention
+RUN rm -rf /root/.cache/pip /root/flash-attention

# ====================================== Patches ============================================

4 changes: 2 additions & 2 deletions docker/README.md
@@ -5,10 +5,10 @@ We will publish 2 kinds of docker images:
2. latest version, which aligns to `lmsysorg/sglang:latest`.

current stable version is:
-- sglang v0.5.5.post1 (303cc957e62384044dfa8e52d7d8af8abe12f0ac), megatron v0.14.0 (23e00ed0963c35382dfe8a5a94fb3cda4d21e133)
+- sglang nightly-dev-20251208-5e2cda61 (5e2cda6158e670e64b926a9985d65826c537ac82), megatron v0.14.0 (23e00ed0963c35382dfe8a5a94fb3cda4d21e133)

history versions:
- sglang v0.5.0rc0-cu126 (8ecf6b9d2480c3f600826c7d8fef6a16ed603c3f), megatron 48406695c4efcf1026a7ed70bb390793918dd97b
+- sglang v0.5.5.post1 (303cc957e62384044dfa8e52d7d8af8abe12f0ac), megatron v0.14.0 (23e00ed0963c35382dfe8a5a94fb3cda4d21e133)

The command to build:

4 changes: 2 additions & 2 deletions docker/patch/latest/megatron.patch
@@ -219,14 +219,14 @@ index 6aec66e6d..6ca48b55f 100644
mtp_loss = loss_mask * mtp_loss
if self.training:
diff --git a/megatron/core/optimizer/distrib_optimizer.py b/megatron/core/optimizer/distrib_optimizer.py
-index a36b67364..8739270f2 100644
+index a36b67364..ed8883e32 100644
--- a/megatron/core/optimizer/distrib_optimizer.py
+++ b/megatron/core/optimizer/distrib_optimizer.py
@@ -657,6 +657,8 @@ class DistributedOptimizer(MixedPrecisionOptimizer):
# TE FusedAdam will not accumulate step for empty param groups, so we need to
# align the step across param groups.
param_group["step"] = int(step)
-+ if param_group["step"] is None:
++ if "step" in param_group and param_group["step"] is None:
+ del param_group["step"]

# Grad scaler state.
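The one-line change above widens the guard this patch adds to `distrib_optimizer.py`: as the patched comment notes, TE FusedAdam does not accumulate step for empty param groups, so `"step"` can be missing from `param_group` entirely, and the old check raised `KeyError` before it could ever see `None`. A standalone repro sketch, with names taken from the patch:

python - <<'PY'
param_group = {}  # an empty param group that never received a "step" entry

# old guard: subscripting a missing key raises KeyError
try:
    if param_group["step"] is None:
        del param_group["step"]
except KeyError:
    print("old guard: KeyError when 'step' is absent")

# new guard: the membership test short-circuits, so a missing key is fine
if "step" in param_group and param_group["step"] is None:
    del param_group["step"]
print("new guard: no error")
PY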