
[Ray debugger] Unable to use debugger on slurm cluster #51157

Open

Linn3a opened this issue Mar 7, 2025 · 5 comments
Labels: bug (Something that is supposed to be working; but isn't), triage (Needs triage (eg: priority, bug/not-bug, and owning component))

Comments

Linn3a commented Mar 7, 2025

What happened + What you expected to happen

I am running a verl project on a Slurm cluster and using the Ray Distributed Debugger. The extension shows "No paused tasks," and the log displays the following:

(main_task pid=185099) Ray debugger is listening on 10.140.1.54:19645
(main_task pid=185099) Waiting for debugger to attach (see https://docs.ray.io/en/latest/ray-observability/ray-distributed-debugger.html)...

The process gets stuck here and does not proceed further.
I am confident the hang is not caused by something else, since training completes normally when I do not add any breakpoints.

Versions / Dependencies

  • Ray: 2.43.0
  • Python: 3.9.21
  • OS: CentOS Linux release 7.6.1810 (Core)
  • debugpy: 1.8.0

Reproduction script

PYTHONUNBUFFERED=1 python3 -m verl.trainer.main_ppo \
 algorithm.adv_estimator=reinforce_plus_plus \
 data.train_files=$HOME/data/gsm8k/train.parquet \
 data.val_files=$HOME/data/gsm8k/test.parquet \
 data.train_batch_size=256 \
 data.max_prompt_length=512 \
 data.max_response_length=256 \
 actor_rollout_ref.model.path={model_path} \
 actor_rollout_ref.actor.optim.lr=1e-6 \
 actor_rollout_ref.actor.ppo_mini_batch_size=64 \
 actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \
 actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
 actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
 actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
 actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \
 critic.optim.lr=1e-5 \
 critic.model.path={model_path} \
 critic.ppo_micro_batch_size_per_gpu=4 \
 algorithm.kl_ctrl.kl_coef=0.001 \
 trainer.logger=['wandb'] \
 +trainer.val_before_train=False \
 trainer.default_hdfs_dir=null \
 trainer.n_gpus_per_node=1 \
 trainer.nnodes=1 \
 trainer.save_freq=10 \
 trainer.test_freq=10 \
 trainer.default_local_dir=checkpoints/test \
 trainer.total_epochs=15 2>&1 | tee verl_demo.log 

I added a breakpoint in verl/trainer/main_ppo.py at line 128, like this:

    val_reward_fn=val_reward_fn)
    trainer.init_workers()
    breakpoint()
    trainer.fit()

Issue Severity

High: It blocks me from completing my task.

Linn3a added the bug and triage labels Mar 7, 2025
Di-viner commented
It's weird, since I could use the Ray Distributed Debugger normally (with verl, too) a few days ago, but now I am hitting this issue.

Di-viner commented
Hi, @Linn3a. Maybe you could try ray==2.38. It works for me😀

brycehuang30 (Contributor) commented
Hi @Linn3a and @Di-viner, thanks for flagging. Are you able to reproduce this with a plain Python script, or does it only happen in verl? Is the ip:port (10.140.1.54:19645) accessible from the machine that initiated the debugger?
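
For reference, one quick way to check that last point from the machine running the VS Code extension might look like the following (a hypothetical sketch, not from the original thread; the address is the one from the log above):

    # Hypothetical connectivity check: run this on the machine where the VS Code
    # extension runs, to see whether the debugger address from the log is reachable.
    import socket

    try:
        with socket.create_connection(("10.140.1.54", 19645), timeout=5):
            print("debugger port is reachable")
    except OSError as exc:
        print(f"cannot reach debugger port: {exc}")

On a Slurm cluster, a refused or timed-out connection here would point to network isolation between the compute node and the machine running the extension rather than to the debugger itself.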

Di-viner commented
Thanks, @brycehuang30.

Are you able to reproduce this with a plain Python script, or does it only happen in verl?

Probably not only in verl. I followed the guidance to use the debugger (i.e., setting up a new conda environment and running the provided job.py), but the issue persists: the process gets stuck, and when I try to attach the VS Code debugger to a paused task, I always receive the error connect ECONNREFUSED $ip:port. I don't think it's related to the codebase (e.g., verl or others). I'm unsure whether it might be connected to #45541 or #48728.
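
For anyone trying to reproduce this outside verl, a minimal script in the spirit of the documented job.py might look like the following (a hypothetical sketch, not the file from the docs; depending on the Ray version, the RAY_DEBUG environment variable may also need to be set for the debugpy-based debugger):

    # Hypothetical minimal reproduction: a single Ray task that hits breakpoint(),
    # which should show up as a paused task in the Ray Distributed Debugger extension.
    import ray

    ray.init()

    @ray.remote
    def square(x):
        breakpoint()  # the worker should pause here and wait for the debugger to attach
        return x * x

    print(ray.get(square.remote(4)))

If this also hangs at "Waiting for debugger to attach" on the cluster, that would suggest the problem is in the debugger/networking setup rather than in verl.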

liuyixin-louis commented
Same problem here.
