Segmentation fault with ray 0.6.0 and PyTorch 0.4.1 and 1.0.0 #3520
Comments
Thanks @TinkeringCode! Does it segfault if you run the exact same script (still calling Does the import order between The main script is not segfaulting (only the worker is), right? |
I've tested this on Python 3.7.0 with torch 0.4.1 and ray 0.6.0 now:
It's only the |
Thanks for looking into this! For the first configuration (which segfaults without requiring any remote functions), can you start python in
cc @pcmoritz |
No problem - here you go!
|
Thanks for sharing that! I wonder if it's related to some sort of C++ library issue. It's possible that the code in https://github.com/apache/arrow/blob/b3bc3384f3068edebe69f1084518ccfb85a368f8/python/pyarrow/compat.py#L247-L266 has become out of date, which is called from https://github.com/apache/arrow/blob/b3bc3384f3068edebe69f1084518ccfb85a368f8/python/pyarrow/__init__.py#L47-L51. See possibly related previous issues in https://issues.apache.org/jira/browse/ARROW-2657 and https://issues.apache.org/jira/browse/ARROW-2920 |
Thanks for the report! I haven't been able to reproduce it (this is with Python 3.6.6 from the latest anaconda, the ubuntu 18.04 AMI and the following versions:
from pip/anaconda respectively). Are you using the deep learning AMI by any chance? But I'm also not surprised that there are configurations where this is an issue. I'll file another issue with arrow and see if we can get rid of the workarounds and fix the problem at its root. |
Thank you for trying to reproduce! I haven't used the DL AMI. Surprised to see this working for you, I have just repeated the process to test precisely the configuration you report as working. Keep in mind that the error might still be on my side, even though I got it working with ray 0.4.0. The steps I executed to get the error are:
Please tell me whether there is a difference in what we have done to set this up. |
Thanks, that does the trick for me! Not sure what the problem was previously. I used an existing instance so there might have been some state on it. |
I'm running into a similar issue (Ubuntu 16.04) as well, and can reproduce the segfault. With ray==0.6 and torch==1.0.0, segfaults occur much earlier in the script, depending on the import ordering:
Importing torch first and then ray does not cause a segfault immediately, but running the script from Tinkering will eventually throw errors. This import ordering error isn't observed with torch==0.4.1 and ray==0.6.0 |
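For anyone comparing import orders, a minimal sketch of how each ordering could be checked in a fresh interpreter, so that a crash in one ordering does not take down the test script itself. The helper names `try_import_order` and `describe` are hypothetical and not part of ray or torch; the sketch only assumes that both packages are installed in the current environment and that it runs on a POSIX system, where a negative return code means the child process was killed by a signal (SIGSEGV is signal 11 on Linux).

```python
import subprocess
import sys


def try_import_order(modules):
    """Import `modules` in the given order in a fresh interpreter.

    Hypothetical helper: returns (returncode, stderr) of the child process.
    """
    code = "\n".join("import {}".format(name) for name in modules)
    # -X faulthandler asks CPython to dump a Python traceback to stderr
    # when the process receives a fatal signal such as SIGSEGV.
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode, proc.stderr


def describe(returncode):
    """Turn a child return code into a short human-readable status."""
    # On POSIX, subprocess reports death by signal N as return code -N.
    if returncode < 0:
        return "killed by signal {}".format(-returncode)
    if returncode == 0:
        return "ok"
    return "exited with status {}".format(returncode)


if __name__ == "__main__":
    # Assumption: ray and torch are both importable in this environment.
    for order in (["ray", "torch"], ["torch", "ray"]):
        returncode, stderr = try_import_order(order)
        print(" -> ".join(order), ":", describe(returncode))
        if returncode != 0:
            print(stderr)
```

Running each ordering in its own subprocess keeps the results independent, and the faulthandler output can be pasted alongside a gdb backtrace when reporting.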
Thanks @mveres01! Could you include a backtrace from @pcmoritz has been looking into it, and it looks like it potentially comes down to a bug in the C++ standard library implementation of |
Thanks for the insight to arrow! Here's the gdb:
where test2.py:
For comparison, importing torch before ray yields:
|
Thanks for the patience, I created a version of Ray that should work now (I could reproduce your crash and it is not crashing any more with the fix): At a high level, the fix is to remove std::future from arrow and replace it with boost::future. This is a temporary workaround until we have figured out https://groups.google.com/a/tensorflow.org/d/topic/developers/TMqRaT-H2bI/discussion, which is the root cause of all of this. Feel free to try it out if you want to and report back; I'll create a PR to be included in 0.6.1. |
@pcmoritz , is this a release blocker for 0.6.1? |
Yes! |
That was a good read, thanks for looking into it @pcmoritz. Just wanted to confirm the wheels you posted fix the issue on my end! |
Fixed by #3574. |
System information
A minimal example to reproduce:
Problem Description
I did a clean install of PyTorch with conda on different operating systems and don't get segfaults when not using ray. This may be a misunderstanding on my part, as this seems like an issue that others would have run into minutes after it was introduced... Isn't this how ray is supposed to be used when running locally?
Console Output
Edit 1:
The same script runs successfully with the following dependencies:
Upgrading to ray 0.5.3 throws another error that looks very similar (worker died) although it doesn't state that a segfault occurred. Upgrading to ray 0.6.0 causes the above-demonstrated segfault.