I am running a Verl project on a Slurm cluster and using the Ray distributed debugger. The extension shows "No paused tasks," and the log displays the following:
(main_task pid=185099) Ray debugger is listening on 10.140.1.54:19645
(main_task pid=185099) Waiting for debugger to attach (see https://docs.ray.io/en/latest/ray-observability/ray-distributed-debugger.html)...
The process gets stuck here and does not proceed further.
I am certain that other issues do not cause the blockage, as the training process can complete normally when I do not add any breakpoints.
Linn3a added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage (eg: priority, bug/not-bug, and owning component)) labels on Mar 7, 2025
Hi @Linn3a and @Di-viner, thanks for flagging. Are you able to reproduce this with a plain Python script, or does it only happen in Verl? Is the ip:port (10.140.1.54:19645) accessible from the machine that initiated the debugger?
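One way to answer the reachability question is a plain TCP connection test, run on the machine where VS Code is attached. The sketch below is only an illustration: `parse_debugger_endpoint` and `can_connect` are hypothetical helper names, not part of Ray, and the regular expression assumes the log line keeps the format quoted in this issue.

```python
import re
import socket

# Assumed log format, taken from the line quoted in this issue:
#   "Ray debugger is listening on 10.140.1.54:19645"
LISTEN_RE = re.compile(
    r"Ray debugger is listening on (\d{1,3}(?:\.\d{1,3}){3}):(\d+)"
)


def parse_debugger_endpoint(log_line):
    """Return (host, port) from a 'listening on' log line, or None if absent."""
    match = LISTEN_RE.search(log_line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections (ECONNREFUSED), timeouts,
        # and unreachable hosts.
        return False
```

Passing the log line from the stuck process to `parse_debugger_endpoint` and the result to `can_connect` gives a quick yes/no. A `False` result from the VS Code machine would point to a network or firewall restriction between that machine and the compute node rather than to the training code.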
Are you able to reproduce this with plain python script? Or does it only happen in Verl?
It probably isn't specific to Verl. I followed the guidance for using the debugger (i.e., setting up a new conda environment and running the provided job.py). However, the issue persists: the process gets stuck, and when I try to attach the VSCode debugger to a paused task, I always receive the error "connect ECONNREFUSED $ip:port". I don't think it's related to the codebase (e.g., verl or others). I'm unsure whether it might be connected to #45541 and #48728.
What happened + What you expected to happen
Versions / Dependencies
Reproduction script
I added a breakpoint in the file verl/trainer/main_ppo.py at line 128, like this
Issue Severity
High: It blocks me from completing my task.