[release] bump to v0.2.1 #1096
Conversation
Pull request overview
This PR updates the project to version 0.2.1, which involves updating patches for sglang and megatron dependencies, upgrading PyTorch and related libraries, and documenting the new stable version.
Key Changes:
- Updates PyTorch from 2.8.0 to 2.9.1 and related packages (cuda-python, transformer_engine, etc.)
- Updates sglang to nightly-dev-20251208 build (5e2cda6158e670e64b926a9985d65826c537ac82)
- Adds new patch files for sglang v0.5.6 and megatron with extensive changes including MTP support with packed sequences, security fixes, and API compatibility updates
- Updates documentation to reflect new stable version
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| build_conda.sh | Updates dependency versions (PyTorch 2.9.1, cuda-python 13.1.0, transformer_engine 2.10.0), changes sglang checkout to newer nightly, updates torch_memory_saver commit, adds cudnn fix, and updates patch paths to v0.5.6 |
| docker/README.md | Updates current stable version to nightly-dev-20251208 and moves previous v0.5.5.post1 to history |
| docker/patch/v0.5.6/sglang.patch | Large patch file (2053 lines) adding routed experts capture, MTP improvements, RMSNorm changes, VLM optimizations, and various bug fixes |
| docker/patch/v0.5.6/megatron.patch | Large patch file (869 lines) adding MTP packed sequence support, gated attention, checkpoint compatibility fixes, and API updates for PyTorch 2.9.1 |
```diff
@@ -21,13 +21,13 @@ micromamba install -n slime cuda cuda-nvtx cuda-nvtx-dev nccl -c nvidia/label/cu
 micromamba install -n slime -c conda-forge cudnn -y

 # prevent installing cuda 13.0 for sglang
```
Copilot AI · Dec 12, 2025
The comment on line 23 says "prevent installing cuda 13.0 for sglang", but the package actually being pinned is cuda-python==13.1.0, so the comment does not describe what the command installs. Either update the comment to reflect the pinned version, or, if the intent really is to block cuda 13.0, correct the package version instead.
Suggested change:
```diff
-# prevent installing cuda 13.0 for sglang
+# install cuda-python==13.1.0 to ensure compatibility with sglang and avoid cuda 13.0
```
```diff
         return msc.torch.load(load_path, map_location='cpu')
     else:
-        return torch.load(load_path, map_location='cpu')
+        return torch.load(load_path, map_location='cpu', weights_only=False)
```
Copilot AI · Dec 12, 2025
Setting weights_only=False in torch.load is a security concern. This allows arbitrary Python code execution during deserialization and is explicitly discouraged by PyTorch. Unless there's a specific reason to load pickled Python objects, consider using weights_only=True or adding a comment explaining why this is necessary and safe in this context.
Suggested change:
```diff
-        return torch.load(load_path, map_location='cpu', weights_only=False)
+        return torch.load(load_path, map_location='cpu', weights_only=True)
```
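For context, a short self-contained sketch of the three loading modes; the checkpoint contents are made up for illustration, and argparse.Namespace stands in for the kind of pickled object (common in Megatron-style checkpoints) that pushes code toward weights_only=False:

```python
import argparse
import torch

# Create a small checkpoint so the example is self-contained.
load_path = "checkpoint.pt"
torch.save({"weight": torch.zeros(2, 2), "args": argparse.Namespace(lr=1e-4)}, load_path)

# weights_only=True (the default since PyTorch 2.6) refuses to unpickle
# arbitrary objects, so loading this file fails on the Namespace entry.
try:
    torch.load(load_path, map_location="cpu", weights_only=True)
except Exception as exc:
    print("weights_only=True rejected the pickled object:", type(exc).__name__)

# Middle ground: allowlist the specific class and keep weights_only=True.
torch.serialization.add_safe_globals([argparse.Namespace])
state = torch.load(load_path, map_location="cpu", weights_only=True)

# Last resort, for trusted files only: weights_only=False runs the full
# pickle machinery and can execute arbitrary code during deserialization.
state = torch.load(load_path, map_location="cpu", weights_only=False)
```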
```diff
-                raise KeyError(
-                    f"{sh_ten.key} from model not in state dict:"
-                    f" {sorted(metadata.state_dict_metadata.keys())}"
-                )
+                # raise KeyError(
+                #     f"{sh_ten.key} from model not in state dict:"
+                #     f" {sorted(metadata.state_dict_metadata.keys())}"
+                # )
+                print(f"{sh_ten.key} from model not in state dict, will skip")
+                continue
```
Copilot AI · Dec 12, 2025
The original error handling has been commented out and replaced with a print statement and continue, which silently skips missing keys in the state dict instead of raising an error. This could mask real problems during model loading. Consider emitting a warning instead of a bare print, or providing a configuration option to control this behavior rather than always skipping silently.
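A minimal sketch of that suggestion, assuming a hypothetical strict flag (not an existing Megatron option) so the skip is opt-in and surfaced as a warning rather than a bare print:

```python
import warnings

def check_key_loadable(key, available_keys, strict=False):
    """Return True if `key` can be loaded from the checkpoint metadata.

    With strict=True this reproduces the original KeyError; otherwise it
    emits a visible warning and lets the caller skip the key.
    """
    if key in available_keys:
        return True
    if strict:
        raise KeyError(f"{key} from model not in state dict: {sorted(available_keys)}")
    warnings.warn(f"{key} from model not in state dict, will skip", stacklevel=2)
    return False

# Usage sketch: the keys below are placeholders, not real Megatron entries.
for key in ["decoder.weight", "mtp.head.weight"]:
    if not check_key_loadable(key, {"decoder.weight"}):
        continue  # skip missing entries instead of aborting the load
```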
```diff
+                torch.distributed.isend, tensor_send_prev, prev_pipeline_rank,
             )
             ops.append(send_prev_op)
         if tensor_recv_prev is not None:
             recv_prev_op = torch.distributed.P2POp(
-                torch.distributed.irecv, tensor_recv_prev, prev_pipeline_rank, group
+                torch.distributed.irecv, tensor_recv_prev, prev_pipeline_rank,
             )
             ops.append(recv_prev_op)
         if tensor_send_next is not None:
             send_next_op = torch.distributed.P2POp(
-                torch.distributed.isend, tensor_send_next, next_pipeline_rank, group
+                torch.distributed.isend, tensor_send_next, next_pipeline_rank,
             )
             ops.append(send_next_op)
         if tensor_recv_next is not None:
             recv_next_op = torch.distributed.P2POp(
-                torch.distributed.irecv, tensor_recv_next, next_pipeline_rank, group
+                torch.distributed.irecv, tensor_recv_next, next_pipeline_rank,
```
Copilot AI · Dec 12, 2025
The 'group' parameter is being removed from all torch.distributed.P2POp calls. This change might break distributed pipeline parallelism if the group parameter is required by newer versions of PyTorch. Ensure that the default group behavior matches the intended communication group, or verify that this API change is compatible with the target PyTorch version (2.9.1).
Suggested change (restore the `group` argument):
```diff
-                torch.distributed.isend, tensor_send_prev, prev_pipeline_rank,
+                torch.distributed.isend, tensor_send_prev, prev_pipeline_rank, group
             )
             ops.append(send_prev_op)
         if tensor_recv_prev is not None:
             recv_prev_op = torch.distributed.P2POp(
-                torch.distributed.irecv, tensor_recv_prev, prev_pipeline_rank,
+                torch.distributed.irecv, tensor_recv_prev, prev_pipeline_rank, group
             )
             ops.append(recv_prev_op)
         if tensor_send_next is not None:
             send_next_op = torch.distributed.P2POp(
-                torch.distributed.isend, tensor_send_next, next_pipeline_rank,
+                torch.distributed.isend, tensor_send_next, next_pipeline_rank, group
             )
             ops.append(send_next_op)
         if tensor_recv_next is not None:
             recv_next_op = torch.distributed.P2POp(
-                torch.distributed.irecv, tensor_recv_next, next_pipeline_rank,
+                torch.distributed.irecv, tensor_recv_next, next_pipeline_rank, group
```
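For illustration, a small sketch of a batched point-to-point exchange (the function and variable names are illustrative, not Megatron's); the point is that both peers must agree on the process group, and omitting it means the ops run on the default (world) group:

```python
import torch
import torch.distributed as dist

def exchange_with_neighbors(send_to_next, recv_from_prev, next_rank, prev_rank, group=None):
    """Post a batched send/recv pair, optionally on an explicit process group.

    With group=None the ops use the default (world) process group; passing the
    pipeline-parallel group keeps the traffic on that communicator, which is
    what the original Megatron call sites did before this patch removed it.
    """
    ops = [
        dist.P2POp(dist.isend, send_to_next, next_rank, group=group),
        dist.P2POp(dist.irecv, recv_from_prev, prev_rank, group=group),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
```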
```diff
+        decoder_input = decoder_input.detach()

-        hidden_states = make_viewless_tensor(inp=hidden_states, requires_grad=True, keep_graph=True)
+        hidden_states = make_viewless_tensor(inp=hidden_states, requires_grad=True, keep_graph=False)
```
Copilot AI · Dec 12, 2025
The change from keep_graph=True to keep_graph=False in make_viewless_tensor could break gradient computation if the computational graph needs to be retained for backward passes. This is a potentially breaking change that should be carefully reviewed to ensure it doesn't affect gradient backpropagation in multi-token prediction scenarios.
Suggested change:
```diff
-        hidden_states = make_viewless_tensor(inp=hidden_states, requires_grad=True, keep_graph=False)
+        hidden_states = make_viewless_tensor(inp=hidden_states, requires_grad=True, keep_graph=True)
```
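To make the gradient concern concrete, a generic autograd illustration (deliberately not the make_viewless_tensor implementation) of how a rewrap that stays in the graph differs from one that only copies data:

```python
import torch

x = torch.randn(4, requires_grad=True)
y = x * 2  # stand-in for upstream computation

# Graph-preserving rewrap: clone() is differentiable, so backward still reaches x.
kept = y.clone()
kept.sum().backward()
print(x.grad is not None)  # True

x.grad = None

# Graph-breaking rewrap: a fresh leaf tensor that merely copies the data;
# backward stops at `broken` and x.grad stays None.
broken = torch.empty_like(y, requires_grad=True)
broken.data = y.data
broken.sum().backward()
print(x.grad)  # None
```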
```diff
-        assert (
-            packed_seq_params is None
-        ), f"multi token prediction + sequence packing is not yet supported."
```
Copilot AI · Dec 12, 2025
The assertion that prevented MTP from being used with sequence packing has been removed. While this enables the feature, removing the assertion alone does not show that the underlying implementation has been fully tested with packed sequences. Consider adding a warning or documentation noting the testing status of this feature combination.
Suggested change:
```diff
-        assert (
-            packed_seq_params is None
-        ), f"multi token prediction + sequence packing is not yet supported."
+        if packed_seq_params is not None:
+            import warnings
+            warnings.warn(
+                "multi token prediction + sequence packing is enabled, but this combination has not been fully tested. Use with caution.",
+                UserWarning
+            )
```
```diff
+    group.add_argument('--post-self-attn-layernorm', action='store_true',
+                       help='If set, use post self attention layernorm.')
+    group.add_argument('--post-mlp-layernorm', action='store_true',
+                       help='If set, use post MLP layernorm.')
+    group.add_argument('--use-gated-attention', action='store_true',
```
Copilot AI · Dec 12, 2025
Setting trust_remote_code=True by default when constructing the Hugging Face tokenizer means AutoTokenizer.from_pretrained will execute arbitrary Python code from any remote model repository passed in pretrained_model_name_or_path. If an attacker can point this code at a malicious or compromised Hugging Face repo (or if an upstream repo is compromised), they can achieve remote code execution in the training/inference environment. To mitigate this, keep trust_remote_code disabled by default (or restrict it to a small allowlist of trusted repos) and require explicit, security-reviewed opt-in before enabling remote code execution for tokenizers.
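A minimal sketch of the opt-in pattern described above; the allowlist contents and function name are hypothetical, not part of this PR:

```python
from transformers import AutoTokenizer

# Hypothetical allowlist of repos whose custom tokenizer code has been reviewed.
TRUSTED_REMOTE_CODE_REPOS = {"example-org/model-with-custom-tokenizer"}

def load_tokenizer(pretrained_model_name_or_path: str):
    """Load a tokenizer, enabling remote code execution only for trusted repos."""
    trust = pretrained_model_name_or_path in TRUSTED_REMOTE_CODE_REPOS
    return AutoTokenizer.from_pretrained(
        pretrained_model_name_or_path, trust_remote_code=trust
    )
```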
```diff
-pip install git+https://github.com/fzyzcjy/torch_memory_saver.git@9b8b788fdeb9c2ee528183214cef65a99b71e7d5 --no-cache-dir --force-reinstall
+pip install git+https://github.com/fzyzcjy/torch_memory_saver.git@dc6876905830430b5054325fa4211ff302169c6b --no-cache-dir --force-reinstall
 pip install git+https://github.com/fzyzcjy/Megatron-Bridge.git@dev_rl --no-build-isolation
```
Copilot AI · Dec 12, 2025
The build script installs Megatron-Bridge directly from a mutable Git branch (git+https://github.com/fzyzcjy/Megatron-Bridge.git@dev_rl) with --no-build-isolation, so every build executes whatever code the remote branch currently contains. If that GitHub repository or branch is ever compromised, an attacker can inject arbitrary code into the build environment with access to credentials and artifacts. To reduce this supply-chain risk, pin this dependency to an immutable identifier (e.g., a specific commit SHA or a signed release artifact) and avoid installing from mutable branches in automated builds.
@lilei199908 Hi, a tiny nit: it would be great to use

Sorry about that. Good to know! Thanks.

It's OK and no worries!