
fix(ci): default to 8B model and fix task count syntax error #73

Merged
dzorlu merged 1 commit into main from fix/ci-defaults-and-task-count on Jan 28, 2026

Conversation

dzorlu (Collaborator) commented Jan 28, 2026

Summary

  • Change default model from Qwen/Qwen3-VL-30B-A3B-Instruct to Qwen/Qwen3-8B
  • Fix Python syntax error in task count command caused by bash escaping issue

Problem

The task count command was failing with:

SyntaxError: unexpected character after line continuation character

The \" escaping inside the single-quoted Python one-liner was passed through to Python literally (bash does not process backslash escapes inside single quotes), so the interpreter hit a stray backslash and failed to parse the command.
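For illustration, a hypothetical reconstruction of the failing pattern (the original command is not quoted in this PR): with the one-liner in single quotes, the \" sequences reach Python verbatim, which triggers the SyntaxError above.

```bash
# Hypothetical reconstruction of the broken form: inside single quotes, bash leaves \" untouched,
# so Python receives open(\"./data/tasks.json\") and fails to parse it.
TASK_COUNT=$(python -c 'import json; d=json.load(open(\"./data/tasks.json\")); print(...)')
```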

Solution

Rewrote the command so the Python one-liner is wrapped in outer double quotes with plain single quotes inside, removing the need for any escaping:

```bash
TASK_COUNT=$(python -c "import json; d=json.load(open('./data/tasks.json')); print(...)")
echo "Task count: $TASK_COUNT"
```

🤖 Generated with Claude Code

- Change default model from Qwen3-VL-30B to Qwen3-8B
- Fix Python syntax error in task count command (bash escaping issue)
@dzorlu dzorlu merged commit 305e3f3 into main Jan 28, 2026
1 check passed
dzorlu pushed a commit that referenced this pull request Feb 4, 2026
# What does this PR do?

Upgrades to torch 2.7. This PR also makes the torch versions used explicit for the different inference backends (vllm uses torch 2.7.0 and sglang uses 2.7.1). DeepSpeed performs JIT compilation at runtime and is therefore not pinned to a specific torch version.

This PR also upgrades CUDA to 12.8. 
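For illustration only (the repository's actual dependency setup is not shown in this PR), the backend-specific torch pins described above could be expressed with pip constraints files:

```bash
# Hypothetical illustration of the per-backend torch pins described above;
# file names and layout are assumptions, not taken from this repository.
echo "torch==2.7.0" > constraints-vllm.txt
echo "torch==2.7.1" > constraints-sglang.txt
pip install -c constraints-vllm.txt vllm      # vllm backend resolved against torch 2.7.0
pip install -c constraints-sglang.txt sglang  # sglang backend resolved against torch 2.7.1
```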

TODO: 
- [x] Test sglang after upgrade 
- [x] Publish new docker image to dockerhub

---------

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
dzorlu pushed a commit that referenced this pull request Feb 4, 2026
… L4/L40S after #73 upgrade to cuda 12.8 (#108)

# Overview
After #73, the main code path no longer runs on GPUs without P2P support (possibly due to the CUDA 12.8 upgrade?); an error like the following is thrown:

```bash
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3353, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.26.2
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 217 'peer access is not supported between these two devices'
```

This PR adds a check to the Ray initialization for whether peer access is supported between all GPUs on a node (using torch/cuda), and sets the relevant NCCL env vars so the code can run on these machine types.

```python
if not peer_access_supported():
    logger.info("Peer access is not supported, disabling P2P and SHM")
    env_vars["NCCL_P2P_DISABLE"] = "1"
    env_vars["NCCL_SHM_DISABLE"] = "1"
```
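The helper itself is not shown in the PR; a minimal sketch of such a check, assuming a `peer_access_supported` function built on PyTorch's `torch.cuda.can_device_access_peer`, could look like:

```python
import itertools

import torch


def peer_access_supported() -> bool:
    """Return True only if every GPU pair on this node supports P2P access.

    Minimal sketch; the actual helper used in the PR may differ.
    """
    device_count = torch.cuda.device_count()
    if device_count < 2:
        return True  # a single GPU has nothing to peer with
    return all(
        torch.cuda.can_device_access_peer(a, b)
        for a, b in itertools.permutations(range(device_count), 2)
    )
```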

Example running on L40S:
![image](https://github.com/user-attachments/assets/1cca46b5-6e16-4ae7-9a33-df52d138bdeb)
dzorlu added a commit that referenced this pull request Feb 4, 2026
- Change default model from Qwen3-VL-30B to Qwen3-8B
- Fix Python syntax error in task count command (bash escaping issue)

Co-authored-by: Deniz <deniz@Mac.localdomain>
bulb-fleet pushed a commit to bulb-fleet/SkyRL that referenced this pull request Feb 4, 2026
bulb-fleet pushed a commit to bulb-fleet/SkyRL that referenced this pull request Feb 4, 2026
… L4/L40S after fleet-ai#73 upgrade to cuda 12.8 (fleet-ai#108)
