[release] bump to v0.2.1 #1096
Conversation
Pull request overview
This PR updates the project to version 0.2.1, which involves updating patches for sglang and megatron dependencies, upgrading PyTorch and related libraries, and documenting the new stable version.
Key Changes:
- Updates PyTorch from 2.8.0 to 2.9.1 and related packages (cuda-python, transformer_engine, etc.)
- Updates sglang to nightly-dev-20251208 build (5e2cda6158e670e64b926a9985d65826c537ac82)
- Adds new patch files for sglang v0.5.6 and megatron with extensive changes including MTP support with packed sequences, security fixes, and API compatibility updates
- Updates documentation to reflect new stable version
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| build_conda.sh | Updates dependency versions (PyTorch 2.9.1, cuda-python 13.1.0, transformer_engine 2.10.0), changes sglang checkout to newer nightly, updates torch_memory_saver commit, adds cudnn fix, and updates patch paths to v0.5.6 |
| docker/README.md | Updates current stable version to nightly-dev-20251208 and moves previous v0.5.5.post1 to history |
| docker/patch/v0.5.6/sglang.patch | Large patch file (2053 lines) adding routed experts capture, MTP improvements, RMSNorm changes, VLM optimizations, and various bug fixes |
| docker/patch/v0.5.6/megatron.patch | Large patch file (869 lines) adding MTP packed sequence support, gated attention, checkpoint compatibility fixes, and API updates for PyTorch 2.9.1 |
```diff
@@ -21,13 +21,13 @@ micromamba install -n slime cuda cuda-nvtx cuda-nvtx-dev nccl -c nvidia/label/cu
 micromamba install -n slime -c conda-forge cudnn -y

 # prevent installing cuda 13.0 for sglang
```
Copilot AI · Dec 12, 2025
The comment on line 23 says "prevent installing cuda 13.0 for sglang", but the package actually being pinned is cuda-python==13.1.0, so the comment does not describe what the command installs. Either update the comment to reflect the pinned version, or, if the intent really is to block cuda 13.0, correct the package version instead.
Suggested change:
```diff
-# prevent installing cuda 13.0 for sglang
+# install cuda-python==13.1.0 to ensure compatibility with sglang and avoid cuda 13.0
```
```diff
         return msc.torch.load(load_path, map_location='cpu')
     else:
-        return torch.load(load_path, map_location='cpu')
+        return torch.load(load_path, map_location='cpu', weights_only=False)
```
Copilot AI · Dec 12, 2025
Setting weights_only=False in torch.load is a security concern. This allows arbitrary Python code execution during deserialization and is explicitly discouraged by PyTorch. Unless there's a specific reason to load pickled Python objects, consider using weights_only=True or adding a comment explaining why this is necessary and safe in this context.
Suggested change:
```diff
-        return torch.load(load_path, map_location='cpu', weights_only=False)
+        return torch.load(load_path, map_location='cpu', weights_only=True)
```
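For context, a short self-contained sketch of the three loading modes; the checkpoint contents are made up for illustration, and argparse.Namespace stands in for the kind of pickled object (common in Megatron-style checkpoints) that pushes code toward weights_only=False:

```python
import argparse
import torch

# Create a small checkpoint so the example is self-contained.
load_path = "checkpoint.pt"
torch.save({"weight": torch.zeros(2, 2), "args": argparse.Namespace(lr=1e-4)}, load_path)

# weights_only=True (the default since PyTorch 2.6) refuses to unpickle
# arbitrary objects, so loading this file fails on the Namespace entry.
try:
    torch.load(load_path, map_location="cpu", weights_only=True)
except Exception as exc:
    print("weights_only=True rejected the pickled object:", type(exc).__name__)

# Middle ground: allowlist the specific class and keep weights_only=True.
torch.serialization.add_safe_globals([argparse.Namespace])
state = torch.load(load_path, map_location="cpu", weights_only=True)

# Last resort, for trusted files only: weights_only=False runs the full
# pickle machinery and can execute arbitrary code during deserialization.
state = torch.load(load_path, map_location="cpu", weights_only=False)
```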
```diff
-                raise KeyError(
-                    f"{sh_ten.key} from model not in state dict:"
-                    f" {sorted(metadata.state_dict_metadata.keys())}"
-                )
+                # raise KeyError(
+                #     f"{sh_ten.key} from model not in state dict:"
+                #     f" {sorted(metadata.state_dict_metadata.keys())}"
+                # )
+                print(f"{sh_ten.key} from model not in state dict, will skip")
+                continue
```
Copilot AI · Dec 12, 2025
The original error handling has been commented out and replaced with a print statement and continue, which silently skips missing keys in the state dict instead of raising an error. This could mask real problems during model loading. Consider emitting a warning instead of a bare print, or providing a configuration option to control this behavior rather than always skipping silently.
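A minimal sketch of that suggestion, assuming a hypothetical strict flag (not an existing Megatron option) so the skip is opt-in and surfaced as a warning rather than a bare print:

```python
import warnings

def check_key_loadable(key, available_keys, strict=False):
    """Return True if `key` can be loaded from the checkpoint metadata.

    With strict=True this reproduces the original KeyError; otherwise it
    emits a visible warning and lets the caller skip the key.
    """
    if key in available_keys:
        return True
    if strict:
        raise KeyError(f"{key} from model not in state dict: {sorted(available_keys)}")
    warnings.warn(f"{key} from model not in state dict, will skip", stacklevel=2)
    return False

# Usage sketch: the keys below are placeholders, not real Megatron entries.
for key in ["decoder.weight", "mtp.head.weight"]:
    if not check_key_loadable(key, {"decoder.weight"}):
        continue  # skip missing entries instead of aborting the load
```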
```diff
+                torch.distributed.isend, tensor_send_prev, prev_pipeline_rank,
             )
             ops.append(send_prev_op)
         if tensor_recv_prev is not None:
             recv_prev_op = torch.distributed.P2POp(
-                torch.distributed.irecv, tensor_recv_prev, prev_pipeline_rank, group
+                torch.distributed.irecv, tensor_recv_prev, prev_pipeline_rank,
             )
             ops.append(recv_prev_op)
         if tensor_send_next is not None:
             send_next_op = torch.distributed.P2POp(
-                torch.distributed.isend, tensor_send_next, next_pipeline_rank, group
+                torch.distributed.isend, tensor_send_next, next_pipeline_rank,
             )
             ops.append(send_next_op)
         if tensor_recv_next is not None:
             recv_next_op = torch.distributed.P2POp(
-                torch.distributed.irecv, tensor_recv_next, next_pipeline_rank, group
+                torch.distributed.irecv, tensor_recv_next, next_pipeline_rank,
```
Copilot AI · Dec 12, 2025
The 'group' parameter is being removed from all torch.distributed.P2POp calls. This change might break distributed pipeline parallelism if the group parameter is required by newer versions of PyTorch. Ensure that the default group behavior matches the intended communication group, or verify that this API change is compatible with the target PyTorch version (2.9.1).
Suggested change (restore the `group` argument):
```diff
-                torch.distributed.isend, tensor_send_prev, prev_pipeline_rank,
+                torch.distributed.isend, tensor_send_prev, prev_pipeline_rank, group
             )
             ops.append(send_prev_op)
         if tensor_recv_prev is not None:
             recv_prev_op = torch.distributed.P2POp(
-                torch.distributed.irecv, tensor_recv_prev, prev_pipeline_rank,
+                torch.distributed.irecv, tensor_recv_prev, prev_pipeline_rank, group
             )
             ops.append(recv_prev_op)
         if tensor_send_next is not None:
             send_next_op = torch.distributed.P2POp(
-                torch.distributed.isend, tensor_send_next, next_pipeline_rank,
+                torch.distributed.isend, tensor_send_next, next_pipeline_rank, group
             )
             ops.append(send_next_op)
         if tensor_recv_next is not None:
             recv_next_op = torch.distributed.P2POp(
-                torch.distributed.irecv, tensor_recv_next, next_pipeline_rank,
+                torch.distributed.irecv, tensor_recv_next, next_pipeline_rank, group
```
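For illustration, a small sketch of a batched point-to-point exchange (the function and variable names are illustrative, not Megatron's); the point is that both peers must agree on the process group, and omitting it means the ops run on the default (world) group:

```python
import torch
import torch.distributed as dist

def exchange_with_neighbors(send_to_next, recv_from_prev, next_rank, prev_rank, group=None):
    """Post a batched send/recv pair, optionally on an explicit process group.

    With group=None the ops use the default (world) process group; passing the
    pipeline-parallel group keeps the traffic on that communicator, which is
    what the original Megatron call sites did before this patch removed it.
    """
    ops = [
        dist.P2POp(dist.isend, send_to_next, next_rank, group=group),
        dist.P2POp(dist.irecv, recv_from_prev, prev_rank, group=group),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
```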
```diff
+        decoder_input = decoder_input.detach()

-        hidden_states = make_viewless_tensor(inp=hidden_states, requires_grad=True, keep_graph=True)
+        hidden_states = make_viewless_tensor(inp=hidden_states, requires_grad=True, keep_graph=False)
```
Copilot AI · Dec 12, 2025
The change from keep_graph=True to keep_graph=False in make_viewless_tensor could break gradient computation if the computational graph needs to be retained for backward passes. This is a potentially breaking change that should be carefully reviewed to ensure it doesn't affect gradient backpropagation in multi-token prediction scenarios.
Suggested change:
```diff
-        hidden_states = make_viewless_tensor(inp=hidden_states, requires_grad=True, keep_graph=False)
+        hidden_states = make_viewless_tensor(inp=hidden_states, requires_grad=True, keep_graph=True)
```
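To make the gradient concern concrete, a generic autograd illustration (deliberately not the make_viewless_tensor implementation) of how a rewrap that stays in the graph differs from one that only copies data:

```python
import torch

x = torch.randn(4, requires_grad=True)
y = x * 2  # stand-in for upstream computation

# Graph-preserving rewrap: clone() is differentiable, so backward still reaches x.
kept = y.clone()
kept.sum().backward()
print(x.grad is not None)  # True

x.grad = None

# Graph-breaking rewrap: a fresh leaf tensor that merely copies the data;
# backward stops at `broken` and x.grad stays None.
broken = torch.empty_like(y, requires_grad=True)
broken.data = y.data
broken.sum().backward()
print(x.grad)  # None
```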
```diff
-        assert (
-            packed_seq_params is None
-        ), f"multi token prediction + sequence packing is not yet supported."
```
Copilot AI · Dec 12, 2025
The assertion that prevented MTP from being used with sequence packing has been removed. While this enables the feature, removing the assertion alone does not show that the underlying implementation has been fully tested with packed sequences. Consider adding a warning or documentation noting the testing status of this feature combination.
Suggested change:
```diff
-        assert (
-            packed_seq_params is None
-        ), f"multi token prediction + sequence packing is not yet supported."
+        if packed_seq_params is not None:
+            import warnings
+            warnings.warn(
+                "multi token prediction + sequence packing is enabled, but this combination has not been fully tested. Use with caution.",
+                UserWarning
+            )
```
```diff
+    group.add_argument('--post-self-attn-layernorm', action='store_true',
+                       help='If set, use post self attention layernorm.')
+    group.add_argument('--post-mlp-layernorm', action='store_true',
+                       help='If set, use post MLP layernorm.')
+    group.add_argument('--use-gated-attention', action='store_true',
```
Copilot AI · Dec 12, 2025
Setting trust_remote_code=True by default when constructing the Hugging Face tokenizer means AutoTokenizer.from_pretrained will execute arbitrary Python code from any remote model repository passed in pretrained_model_name_or_path. If an attacker can point this code at a malicious or compromised Hugging Face repo (or if an upstream repo is compromised), they can achieve remote code execution in the training/inference environment. To mitigate this, keep trust_remote_code disabled by default (or restrict it to a small allowlist of trusted repos) and require explicit, security-reviewed opt-in before enabling remote code execution for tokenizers.
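A minimal sketch of the opt-in pattern described above; the allowlist contents and function name are hypothetical, not part of this PR:

```python
from transformers import AutoTokenizer

# Hypothetical allowlist of repos whose custom tokenizer code has been reviewed.
TRUSTED_REMOTE_CODE_REPOS = {"example-org/model-with-custom-tokenizer"}

def load_tokenizer(pretrained_model_name_or_path: str):
    """Load a tokenizer, enabling remote code execution only for trusted repos."""
    trust = pretrained_model_name_or_path in TRUSTED_REMOTE_CODE_REPOS
    return AutoTokenizer.from_pretrained(
        pretrained_model_name_or_path, trust_remote_code=trust
    )
```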
```diff
-pip install git+https://github.com/fzyzcjy/torch_memory_saver.git@9b8b788fdeb9c2ee528183214cef65a99b71e7d5 --no-cache-dir --force-reinstall
+pip install git+https://github.com/fzyzcjy/torch_memory_saver.git@dc6876905830430b5054325fa4211ff302169c6b --no-cache-dir --force-reinstall
 pip install git+https://github.com/fzyzcjy/Megatron-Bridge.git@dev_rl --no-build-isolation
```
Copilot AI · Dec 12, 2025
The build script installs Megatron-Bridge directly from a mutable Git branch (git+https://github.com/fzyzcjy/Megatron-Bridge.git@dev_rl) with --no-build-isolation, so every build executes whatever code the remote branch currently contains. If that GitHub repository or branch is ever compromised, an attacker can inject arbitrary code into the build environment with access to credentials and artifacts. To reduce this supply-chain risk, pin this dependency to an immutable identifier (e.g., a specific commit SHA or a signed release artifact) and avoid installing from mutable branches in automated builds.
@lilei199908 Hi, a tiny nit: it would be great to use

Sorry about that. Good to know! Thanks.

It's OK and no worries!