fix(ci): default to 8B model and fix task count syntax error #73
Merged
Conversation
- Change default model from Qwen3-VL-30B to Qwen3-8B
- Fix Python syntax error in task count command (bash escaping issue)
dzorlu pushed a commit that referenced this pull request on Feb 4, 2026
# What does this PR do?

Upgrades to torch 2.7. This PR also makes the torch versions used explicit for the different inference backends (vllm uses torch 2.7.0 and sglang uses 2.7.1). Deepspeed performs JIT compilation and is therefore not pinned to a specific torch version. This PR also upgrades CUDA to 12.8.

TODO:
- [x] Test sglang after upgrade
- [x] Publish new docker image to dockerhub

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
dzorlu pushed a commit that referenced this pull request on Feb 4, 2026
… L4/L40S after #73 upgrade to cuda 12.8 (#108)

# Overview

After #73, the main code path no longer runs on GPUs without P2P support (potentially due to the cuda 12.8 upgrade?); an error like the following is thrown:

```bash
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3353, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
ncclUnhandledCudaError: Call to CUDA function failed.
Last error: Cuda failure 217 'peer access is not supported between these two devices'
```

This PR adds a check to the ray initialization for whether peer access is supported (using torch/cuda) between all GPUs on a node, and sets the relevant NCCL env vars to allow the code to run on these machine types:

```python
if not peer_access_supported():
    logger.info("Peer access is not supported, disabling P2P and SHM")
    env_vars["NCCL_P2P_DISABLE"] = "1"
    env_vars["NCCL_SHM_DISABLE"] = "1"
```

Example running on L40S:

<img width="1854" height="227" alt="image" src="https://github.com/user-attachments/assets/1cca46b5-6e16-4ae7-9a33-df52d138bdeb" />
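The check-and-disable logic above can be sketched as a small self-contained helper. Note this is an illustrative reconstruction, not SkyRL's actual code: the helper names `peer_access_supported` and `configure_nccl_env` and the lazy `torch` import are assumptions; only `torch.cuda.can_device_access_peer` and the two NCCL variables come from the PR description and standard APIs.

```python
import itertools
import logging

logger = logging.getLogger(__name__)


def peer_access_supported() -> bool:
    """Sketch of the check the PR describes: True only if every ordered
    pair of visible GPUs supports CUDA peer access. (The real SkyRL
    helper may differ.)"""
    import torch  # lazy import so the env-var logic below runs without CUDA

    n = torch.cuda.device_count()
    return n < 2 or all(
        torch.cuda.can_device_access_peer(a, b)
        for a, b in itertools.permutations(range(n), 2)
    )


def configure_nccl_env(env_vars: dict, peer_access: bool) -> dict:
    """Mirror the PR's fallback: without peer access, make NCCL avoid the
    P2P and shared-memory transports so L4/L40S nodes can still run."""
    if not peer_access:
        logger.info("Peer access is not supported, disabling P2P and SHM")
        env_vars["NCCL_P2P_DISABLE"] = "1"
        env_vars["NCCL_SHM_DISABLE"] = "1"
    return env_vars
```

With both transports disabled, NCCL falls back to slower paths (e.g. going through host memory), trading some bandwidth for the ability to run at all on GPUs without P2P support.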
dzorlu added a commit that referenced this pull request on Feb 4, 2026
- Change default model from Qwen3-VL-30B to Qwen3-8B
- Fix Python syntax error in task count command (bash escaping issue)

Co-authored-by: Deniz <deniz@Mac.localdomain>
bulb-fleet pushed a commit to bulb-fleet/SkyRL that referenced this pull request on Feb 4, 2026
bulb-fleet pushed a commit to bulb-fleet/SkyRL that referenced this pull request on Feb 4, 2026
# Summary

Changed the default model from `Qwen/Qwen3-VL-30B-A3B-Instruct` to `Qwen/Qwen3-8B`.

# Problem

The task count command was failing with a Python syntax error: the `\"` escaping inside the single-quoted Python string was being misinterpreted by the shell.

# Solution

Rewrote the command to avoid the escaping issue:
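The PR's actual task count command is not shown here, so as a hypothetical illustration of this class of bug: inside a single-quoted shell string, backslashes are literal, so `\"` reaches Python unchanged and is a syntax error; plain double quotes inside the single-quoted string need no escaping at all. The toy `print(len(...))` command below is an invented stand-in, not the project's command.

```python
import subprocess

# Broken: bash passes print(len(\"a b c\".split())) to Python verbatim
# (single quotes suppress backslash processing), which is a SyntaxError.
broken = subprocess.run(
    ["bash", "-c", 'python3 -c \'print(len(\\"a b c\\".split()))\''],
    capture_output=True, text=True,
)

# Fixed: use plain double quotes for the Python string; the shell's single
# quotes already protect them, so no escaping is needed.
fixed = subprocess.run(
    ["bash", "-c", 'python3 -c \'print(len("a b c".split()))\''],
    capture_output=True, text=True,
)

print(broken.returncode)       # non-zero: Python rejected the escaped quotes
print(fixed.stdout.strip())    # → 3
```

Other common ways out of the same trap are swapping the outer quoting (double-quoted shell string, single-quoted Python string) or passing the snippet via a heredoc so no inline quoting is needed.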
🤖 Generated with Claude Code