vLLM rolled back to 0.8.5.post1 (temporarily) by ehsk · Pull Request #125 · ServiceNow/PipelineRL

ehsk · 2026-02-03T21:01:21Z

vLLM's stateless process group fails to connect the trainer to the actor vLLM instances when they run on different nodes. The problem lies in the master address, which was previously handled by PyTorch's rendezvous.
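For anyone debugging this kind of failure, here is a minimal sketch of a pre-flight check, not code from this PR. It assumes the rendezvous endpoint is supplied through the `MASTER_ADDR` and `MASTER_PORT` environment variables, which is what PyTorch's `env://` rendezvous reads. The helper names (`master_endpoint`, `is_loopback`, `can_reach`) are hypothetical, and the check uses only the standard library rather than any vLLM API.

```python
import os
import socket


def master_endpoint(env=None):
    """Return (host, port) from MASTER_ADDR / MASTER_PORT.

    Hypothetical helper: assumes the endpoint is passed via these
    environment variables, as PyTorch's env:// rendezvous expects.
    """
    env = os.environ if env is None else env
    return env["MASTER_ADDR"], int(env["MASTER_PORT"])


def is_loopback(host):
    """True if the host only makes sense on a single node."""
    return host in ("localhost", "::1") or host.startswith("127.")


def can_reach(host, port, timeout=2.0):
    """Try a plain TCP connection to the master endpoint."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running `can_reach(*master_endpoint())` on an actor node before the process group is created shows whether the trainer's address is reachable from that node at all. A loopback master address in a multi-node job is one plausible misconfiguration that `is_loopback` would flag, though this PR does not state that it was the cause here.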

This PR temporarily reverts the changes made in #122.

The reward curve looks similar to the old version:
(blue = old code on 1 node; orange = this PR on 2 nodes; red = this PR on 1 node, job stopped early)

rafapi

LGTM!

vllm rolled back to 0.8.5.post1 due to stateless process group issues in multi-node settings

e66341b

ehsk self-assigned this Feb 3, 2026

ehsk requested a review from rafapi February 3, 2026 21:01

rafapi approved these changes Feb 4, 2026

rafapi merged commit 08b62f0 into main Feb 4, 2026

rafapi merged 1 commit into main from revert-vllm-upgrade
