Running Ray with PyTorch Lightning in a SLURM job causes failure with error "ValueError: signal only works in main thread" #3651
Comments
Hi! Thanks for your contribution, great first issue!
hey @rashindrie! Would you mind upgrading to 1.0.2 to see if the issue persists?
Sure, will try that.
Closing this for now, feel free to reopen!
@edenlightning I have the same behaviour. The 'hack' fixes it. I am running a Python script on a SLURM cluster. (Environment details were posted in a collapsed block.)
Hi, I'm having this issue as well. I don't like that you have to basically circumvent the normal functionality of the code in order to get it to work...
This is still an issue in 1.2.1.
Update: calling …
Can confirm setting …
Does it work correctly when you're running a distributed job across multiple nodes?
I've never been able to get Ray properly working on multiple nodes on my SLURM cluster, nothing to do with Lightning. The init script they provide fails 9/10 times when trying to start workers, unfortunately; I'm not sure if it's to do with Ray or the cluster itself.
I mean maybe it's a problem with your code 🤣 The hack works for me as well running on a single node, but not on multiple nodes. Also, I'll say again that I don't think the officially supported solution to this problem should be to change the job name to circumvent PTL's SLURM detection.
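For reference, a minimal sketch of the job-name workaround being discussed, assuming it is the same trick suggested in the linked Ray issue: Lightning's SLURM detection treats a job named "bash" as an interactive session and skips the SLURM-specific signal-handler setup that otherwise fails off the main thread. The environment-variable override below is illustrative, not an officially supported API.

```python
import os

import pytorch_lightning as pl

# Illustrative workaround, not an official API: Lightning's SLURM detection
# treats a job named "bash" as an interactive session and skips registering
# SLURM signal handlers (the signal.signal call that fails off the main thread).
# This has to run in each worker process before the Trainer is constructed.
os.environ["SLURM_JOB_NAME"] = "bash"

trainer = pl.Trainer(max_epochs=1)  # placeholder Trainer arguments
```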
Not to get off topic but it's not actually my code, it's just the sbatch script they provide. A raylet exits unexpectedly, but that's before anything Lightning-related is invoked so probably unrelated. I agree it would make sense to have a way to interface with the connectors a little more directly.
Actually I think I've run into this before when testing scripts on my local machine. It's a SIGABRT that crashes silently, right? I'm not sure what causes it, but restarting my computer fixed it.
I ran into this problem on dask + SLURM. The hack described above works if it is run on each worker process. I also needed to set num_workers to 0 for the data loaders. I hope this helps the next person.
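A rough sketch of that setup, assuming dask-jobqueue's SLURMCluster and the job-name workaround above; the LightningModule, dataset, and cluster resources are hypothetical placeholders rather than the commenter's actual code:

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster


def train_on_worker():
    # The workaround must run inside each worker process, before the Trainer
    # is created, since that is where Lightning would otherwise try to register
    # SLURM signal handlers off the main thread.
    import os
    os.environ["SLURM_JOB_NAME"] = "bash"

    import pytorch_lightning as pl
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # num_workers=0 keeps data loading in-process, which the commenter also
    # found necessary; random tensors stand in for the real dataset.
    dataset = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
    loader = DataLoader(dataset, batch_size=16, num_workers=0)
    trainer = pl.Trainer(max_epochs=1)
    trainer.fit(MyLightningModule(), loader)  # MyLightningModule is a placeholder


cluster = SLURMCluster(cores=4, memory="16GB", walltime="01:00:00")  # hypothetical resources
cluster.scale(jobs=1)
client = Client(cluster)
client.submit(train_on_worker).result()
```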
This still happens on master on SLURM with Ray, and probably with any process-spawning library.
🐛 Bug
I followed the instructions at https://docs.ray.io/en/master/tune/tutorials/tune-pytorch-lightning.html to integrate Ray with PyTorch Lightning. However, when I submitted a SLURM job to run the tuning, I got the following error:
ValueError: signal only works in main thread
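For context, the ValueError itself comes from a general Python restriction: signal handlers can only be installed from the main thread, and Lightning's SLURM support installs handlers when it detects a SLURM job, while Ray Tune runs the training function off the main thread. A standalone reproduction of the error, independent of Ray or Lightning:

```python
import signal
import threading


def register_handler():
    # signal.signal() may only be called from the main thread of the main
    # interpreter; calling it from any other thread raises the same
    # "ValueError: signal only works in main thread" seen above.
    signal.signal(signal.SIGTERM, lambda signum, frame: None)


t = threading.Thread(target=register_handler)
t.start()
t.join()  # the worker thread's traceback shows the ValueError
```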
I submitted the same issue to the Ray project at ray/issues/10995, where a hack was suggested to fix it.
Could we look for a way to disable the SLURM detection in PyTorch Lightning itself, so that external parties do not have to hack their way around it?
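As a note on what such an opt-out might look like: later Lightning releases (after the versions discussed in this thread) expose cluster-environment plugins on the Trainer, which is roughly the shape of override being asked for. A hedged sketch, assuming a version where `pytorch_lightning.plugins.environments.LightningEnvironment` is available:

```python
import pytorch_lightning as pl
from pytorch_lightning.plugins.environments import LightningEnvironment

# Force the default (non-SLURM) cluster environment so Lightning does not
# attempt SLURM-specific setup, such as registering signal handlers, inside
# a Ray worker. Availability and behaviour depend on the Lightning version.
trainer = pl.Trainer(
    max_epochs=1,
    plugins=[LightningEnvironment()],
)
```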
To Reproduce
Steps to reproduce the behavior:
Code sample
slurm script
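The original sbatch script was collapsed in the issue and is not reproduced here; the following is a representative sketch of the kind of submission script the setup implies, with job name, resources, and environment setup as hypothetical placeholders:

```bash
#!/bin/bash
#SBATCH --job-name=ray-tune-ptl   # renaming this to "bash" is the workaround discussed above
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

# Hypothetical environment activation
source activate my_env

python tune.py
```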
tune.py
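Likewise, tune.py was not captured above. A condensed sketch in the spirit of the linked tutorial, using `tune.report()` directly rather than the tutorial's `TuneReportCallback` for brevity; the model, data, and search space are illustrative placeholders, not the reporter's actual code:

```python
import pytorch_lightning as pl
import torch
from ray import tune
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


class LitModel(pl.LightningModule):
    def __init__(self, lr):
        super().__init__()
        self.lr = lr
        self.layer = nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)


def train_fn(config):
    # Random tensors stand in for the real dataset.
    dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
    loader = DataLoader(dataset, batch_size=32, num_workers=0)
    trainer = pl.Trainer(max_epochs=2)
    trainer.fit(LitModel(config["lr"]), loader)
    # Report the final training loss back to Tune.
    tune.report(loss=trainer.callback_metrics["train_loss"].item())


if __name__ == "__main__":
    analysis = tune.run(
        train_fn,
        config={"lr": tune.loguniform(1e-4, 1e-1)},
        metric="loss",
        mode="min",
        num_samples=4,
        resources_per_trial={"cpu": 2},
    )
    print(analysis.best_config)
```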
Expected behavior
The Ray Tune program should run properly in a SLURM environment.
Environment
Additional context