
fix(ci): default to 8B model and fix task count syntax error #73

Merged
dzorlu merged 1 commit into main from fix/ci-defaults-and-task-count on Jan 28, 2026

Conversation

dzorlu (Collaborator) commented Jan 28, 2026

Summary

  • Change default model from Qwen/Qwen3-VL-30B-A3B-Instruct to Qwen/Qwen3-8B
  • Fix Python syntax error in task count command caused by bash escaping issue

Problem

The task count command was failing with:

SyntaxError: unexpected character after line continuation character

The \" escaping inside the single-quoted Python one-liner was passed through to Python literally (bash does not process backslash escapes inside single quotes), so the interpreter hit a stray backslash and failed to parse the command.
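For illustration, a hypothetical reconstruction of the failing pattern (the original command is not quoted in this PR): with the one-liner in single quotes, the \" sequences reach Python verbatim, which triggers the SyntaxError above.

```bash
# Hypothetical reconstruction of the broken form: inside single quotes, bash leaves \" untouched,
# so Python receives open(\"./data/tasks.json\") and fails to parse it.
TASK_COUNT=$(python -c 'import json; d=json.load(open(\"./data/tasks.json\")); print(...)')
```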

Solution

Rewrote the command so the Python one-liner is wrapped in outer double quotes with plain single quotes inside, removing the need for any escaping:

```bash
TASK_COUNT=$(python -c "import json; d=json.load(open('./data/tasks.json')); print(...)")
echo "Task count: $TASK_COUNT"
```

🤖 Generated with Claude Code

- Change default model from Qwen3-VL-30B to Qwen3-8B
- Fix Python syntax error in task count command (bash escaping issue)
@dzorlu dzorlu merged commit 305e3f3 into main Jan 28, 2026
1 check passed
dzorlu pushed a commit that referenced this pull request Feb 4, 2026
# What does this PR do?

Upgrades to torch 2.7. This PR also makes the torch versions used explicit for the different inference backends (vllm uses torch 2.7.0 and sglang uses 2.7.1). DeepSpeed performs JIT compilation at runtime and is therefore not pinned to a specific torch version.

This PR also upgrades CUDA to 12.8. 
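For illustration only (the repository's actual dependency setup is not shown in this PR), the backend-specific torch pins described above could be expressed with pip constraints files:

```bash
# Hypothetical illustration of the per-backend torch pins described above;
# file names and layout are assumptions, not taken from this repository.
echo "torch==2.7.0" > constraints-vllm.txt
echo "torch==2.7.1" > constraints-sglang.txt
pip install -c constraints-vllm.txt vllm      # vllm backend resolved against torch 2.7.0
pip install -c constraints-sglang.txt sglang  # sglang backend resolved against torch 2.7.1
```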

TODO: 
- [x] Test sglang after upgrade 
- [x] Publish new docker image to dockerhub

---------

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
dzorlu pushed a commit that referenced this pull request Feb 4, 2026
… L4/L40S after #73 upgrade to cuda 12.8 (#108)

# Overview
After #73, the main code path no longer runs on GPUs without P2P support (possibly due to the CUDA 12.8 upgrade?); an error like the following is thrown:

```bash
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3353, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 217 'peer access is not supported between these two devices'
```

This PR adds a check to the Ray initialization for whether peer access is supported between all GPUs on a node (using torch/cuda), and sets the relevant NCCL env vars so the code can run on these machine types.

```python
if not peer_access_supported():
    logger.info("Peer access is not supported, disabling P2P and SHM")
    env_vars["NCCL_P2P_DISABLE"] = "1"
    env_vars["NCCL_SHM_DISABLE"] = "1"
```
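The helper itself is not shown in the PR; a minimal sketch of such a check, assuming a `peer_access_supported` function built on PyTorch's `torch.cuda.can_device_access_peer`, could look like:

```python
import itertools

import torch


def peer_access_supported() -> bool:
    """Return True only if every GPU pair on this node supports P2P access.

    Minimal sketch; the actual helper used in the PR may differ.
    """
    device_count = torch.cuda.device_count()
    if device_count < 2:
        return True  # a single GPU has nothing to peer with
    return all(
        torch.cuda.can_device_access_peer(a, b)
        for a, b in itertools.permutations(range(device_count), 2)
    )
```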

Example running on L40S:
![image](https://github.com/user-attachments/assets/1cca46b5-6e16-4ae7-9a33-df52d138bdeb)
dzorlu added a commit that referenced this pull request Feb 4, 2026
- Change default model from Qwen3-VL-30B to Qwen3-8B
- Fix Python syntax error in task count command (bash escaping issue)

Co-authored-by: Deniz <deniz@Mac.localdomain>
bulb-fleet pushed a commit to bulb-fleet/SkyRL that referenced this pull request Feb 4, 2026
bulb-fleet pushed a commit to bulb-fleet/SkyRL that referenced this pull request Feb 4, 2026
… L4/L40S after fleet-ai#73 upgrade to cuda 12.8 (fleet-ai#108)
