
vLLM rolled back to 0.8.5.post1 (temporarily) #125

Merged
rafapi merged 1 commit into main from revert-vllm-upgrade on Feb 4, 2026

Conversation


@ehsk ehsk commented Feb 3, 2026

vLLM's stateless process group runs into connection issues between the trainer and the actor vLLM instances when they run on different nodes. The problem is the master address, which used to be resolved by PyTorch's rendezvous.
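For context, a minimal sketch of the mechanism this refers to: with PyTorch's default env:// rendezvous, every rank resolves the same MASTER_ADDR/MASTER_PORT before the process group forms, so the store on rank 0 knows where to listen and the other ranks know where to connect. A stateless process group bypasses that step and must supply the address itself. The helper below is purely illustrative (not vLLM or PyTorch code) and only mimics the env-var resolution:

```python
import os


def resolve_master(env=None):
    """Illustrative sketch of env:// rendezvous address resolution.

    Every rank must agree on MASTER_ADDR/MASTER_PORT so that rank 0
    can host the store and the remaining ranks can connect to it.
    A stateless process group skips rendezvous, so these values have
    to be passed along explicitly or the cross-node connection fails.
    """
    env = env if env is not None else os.environ
    addr = env.get("MASTER_ADDR")
    port = env.get("MASTER_PORT")
    if addr is None or port is None:
        raise ValueError(
            "env:// rendezvous requires MASTER_ADDR and MASTER_PORT; "
            "a stateless process group must supply them itself"
        )
    return addr, int(port)
```

With both variables set, every rank resolves the same endpoint; with either missing, group formation cannot proceed, which matches the cross-node failure mode described above.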

This PR reverts changes done in #122 temporarily.

Reward looks similar to the old version:
(blue = old code on 1 node, orange = this PR on 2 nodes, red = this PR on 1 node, job stopped early)

[reward curve plot]

@ehsk ehsk self-assigned this Feb 3, 2026
@ehsk ehsk requested a review from rafapi February 3, 2026 21:01

@rafapi rafapi left a comment


LGTM!

@rafapi rafapi merged commit 08b62f0 into main Feb 4, 2026