
[Ray debugger] Unable to use debugger on slurm cluster #51157

Open

Linn3a opened this issue Mar 7, 2025 · 5 comments
Labels: bug (Something that is supposed to be working; but isn't), triage (Needs triage (eg: priority, bug/not-bug, and owning component))

Comments

Linn3a commented Mar 7, 2025

What happened + What you expected to happen

I am running a verl project on a Slurm cluster and using the Ray Distributed Debugger. The extension shows "No paused tasks," and the log displays the following:

(main_task pid=185099) Ray debugger is listening on 10.140.1.54:19645
(main_task pid=185099) Waiting for debugger to attach (see https://docs.ray.io/en/latest/ray-observability/ray-distributed-debugger.html)...

The process gets stuck here and does not proceed further.
I am confident the hang is not caused by something else, since training completes normally when I do not add any breakpoints.

Versions / Dependencies

  • Ray: 2.43.0
  • Python: 3.9.21
  • OS: CentOS Linux release 7.6.1810 (Core)
  • debugpy: 1.8.0

Reproduction script

PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
 algorithm.adv_estimator=reinforce_plus_plus \
 data.train_files=$HOME/data/gsm8k/train.parquet \
 data.val_files=$HOME/data/gsm8k/test.parquet \
 data.train_batch_size=256 \
 data.max_prompt_length=512 \
 data.max_response_length=256 \
 actor_rollout_ref.model.path={model_path} \
 actor_rollout_ref.actor.optim.lr=1e-6 \
 actor_rollout_ref.actor.ppo_mini_batch_size=64 \
 actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
 actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
 actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
 actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
 actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
 critic.optim.lr=1e-5 \
 critic.model.path={model_path} \
 critic.ppo_micro_batch_size_per_gpu=4 \
 algorithm.kl_ctrl.kl_coef=0.001 \
 trainer.logger=['wandb'] \
 +trainer.val_before_train=False \
 trainer.default_hdfs_dir=null \
 trainer.n_gpus_per_node=1 \
 trainer.nnodes=1 \
 trainer.save_freq=10 \
 trainer.test_freq=10 \
 trainer.default_local_dir=checkpoints/test \
 trainer.total_epochs=15 2>&1 | tee verl_demo.log 

I added a breakpoint in verl/trainer/main_ppo.py at line 128, like this:

    val_reward_fn=val_reward_fn)
    trainer.init_workers()
    breakpoint()
    trainer.fit()

Issue Severity

High: It blocks me from completing my task.

Linn3a added the bug and triage labels Mar 7, 2025
Di-viner commented
It's weird, since I could use the Ray Distributed Debugger normally (with verl, too) a few days ago, but now I am hitting this issue.

Di-viner commented
Hi, @Linn3a. Maybe you could try ray==2.38. It works for me😀

brycehuang30 (Contributor) commented
Hi @Linn3a and @Di-viner, thanks for flagging. Are you able to reproduce this with a plain Python script, or does it only happen in verl? Is the ip:port (10.140.1.54:19645) accessible from the machine that initiated the debugger?
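
For reference, one quick way to check that last point from the machine running the VS Code extension might look like the following (a hypothetical sketch, not from the original thread; the address is the one from the log above):

    # Hypothetical connectivity check: run this on the machine where the VS Code
    # extension runs, to see whether the debugger address from the log is reachable.
    import socket

    try:
        with socket.create_connection(("10.140.1.54", 19645), timeout=5):
            print("debugger port is reachable")
    except OSError as exc:
        print(f"cannot reach debugger port: {exc}")

On a Slurm cluster, a refused or timed-out connection here would point to network isolation between the compute node and the machine running the extension rather than to the debugger itself.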

Di-viner commented
Thanks, @brycehuang30.

Are you able to reproduce this with a plain Python script, or does it only happen in verl?

Probably not only in verl. I followed the guidance to use the debugger (i.e., setting up a new conda environment and running the provided job.py), but the issue persists: the process gets stuck, and when I try to attach the VS Code debugger to a paused task, I always receive the error connect ECONNREFUSED $ip:port. I don't think it's related to the codebase (e.g., verl or others). I'm unsure whether it might be connected to #45541 or #48728.
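
For anyone trying to reproduce this outside verl, a minimal script in the spirit of the documented job.py might look like the following (a hypothetical sketch, not the file from the docs; depending on the Ray version, the RAY_DEBUG environment variable may also need to be set for the debugpy-based debugger):

    # Hypothetical minimal reproduction: a single Ray task that hits breakpoint(),
    # which should show up as a paused task in the Ray Distributed Debugger extension.
    import ray

    ray.init()

    @ray.remote
    def square(x):
        breakpoint()  # the worker should pause here and wait for the debugger to attach
        return x * x

    print(ray.get(square.remote(4)))

If this also hangs at "Waiting for debugger to attach" on the cluster, that would suggest the problem is in the debugger/networking setup rather than in verl.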

liuyixin-louis commented
Same problem here.
