
Segmentation fault with ray 0.6.0 and PyTorch 0.4.1 and 1.0.0 #3520

Closed
EricSteinberger opened this issue Dec 12, 2018 · 17 comments

@EricSteinberger commented Dec 12, 2018

System information

  • OS Platform and Distribution: Ubuntu 16.04, Ubuntu 18.04, Amazon EC2 optimized Linux
  • Ray installed from (source or binary): pip
  • Ray version: 0.6.0
  • Python version: 3.7 & 3.6

A minimal example to reproduce:

import ray
import torch


class NeuralNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.l = torch.nn.Linear(2, 2)

    def forward(self, x):
        return self.l(x)


@ray.remote
class TestActor:
    def __init__(self):
        self.net = NeuralNet()
        self.crit = torch.nn.MSELoss()

    def train(self):
        p = self.net(torch.tensor([[1.0, 2.0]]))
        loss = self.crit(p, torch.tensor([[3.0, 4.0]]))
        self.net.zero_grad()
        loss.backward()
        return loss.item()


if __name__ == '__main__':
    ray.init()
    ac = TestActor.remote()
    print(ray.get(ac.train.remote()))

Problem Description

I clean-installed PyTorch with conda on several OSes and don't get segfaults when not using Ray. This may be a misunderstanding on my part, since this seems like an issue others would have hit minutes after it was introduced... Isn't this how Ray is supposed to be used when running locally?
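
Roughly, by "not using Ray" I mean something like the following sketch, which drives the same forward/backward pass directly in the main process (torch.nn.Linear stands in for the NeuralNet above); this variant runs fine for me:

import torch

net = torch.nn.Linear(2, 2)                 # stands in for NeuralNet above
crit = torch.nn.MSELoss()
p = net(torch.tensor([[1.0, 2.0]]))
loss = crit(p, torch.tensor([[3.0, 4.0]]))
net.zero_grad()
loss.backward()                             # no segfault without Ray
print(loss.item())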

Console Output

Process STDOUT and STDERR is being redirected to /tmp/ray/session_2018-12-12_10-51-32_500/logs.
Waiting for redis server at 127.0.0.1:31236 to respond...
Waiting for redis server at 127.0.0.1:56879 to respond...
Starting the Plasma object store with 13.470806835000001 GB memory using /dev/shm.
Failed to start the UI, you may need to run 'pip install jupyter'.
Fatal Python error: Segmentation fault

Stack (most recent call first):
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90 in backward
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/torch/tensor.py", line 93 in backward
  File "/home/eric/PycharmProjects/rayTorchTest/test.py", line 25 in train
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/ray/function_manager.py", line 481 in actor_method_executor
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/ray/worker.py", line 856 in _process_task
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/ray/worker.py", line 967 in _wait_for_and_process_task
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/ray/worker.py", line 1010 in main_loop
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/ray/workers/default_worker.py", line 99 in <module>
A worker died or was killed while executing task 00000000d024403f9ddae404df35ac4a32625560.
Traceback (most recent call last):
  File "/home/eric/PycharmProjects/rayTorchTest/test.py", line 32, in <module>
    print(ray.get(ac.train.remote()))
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/ray/worker.py", line 2366, in get
    raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(01000000d024403f9ddae404df35ac4a32625560). It was created by remote function <unknown> which failed with:

Remote function <unknown> failed with:

Invalid return value: likely worker died or was killed while executing the task; check previous logs or dmesg for errors.

Process finished with exit code 1

Edit 1:

The same script runs successfully with the following dependencies:

python==3.6.0
torch==0.4.1
ray==0.4.0
redis==2.10.6

Upgrading to ray 0.5.3 throws another error that looks very similar (worker died), although it doesn't state that a segfault occurred. Upgrading to ray 0.6.0 causes the segfault shown above.

@robertnishihara (Collaborator)

Thanks @TinkeringCode!

Does it segfault if you run the exact same script (still calling import ray and ray.init()), but with the @ray.remote decorator removed from TestActor and the .remote parts removed? (See the sketch after these questions.)

Does the import order between torch and ray matter?

The main script is not segfaulting (only the worker is), right?
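
Concretely, the variant I have in mind is a sketch like this (ray still imported and initialized, but the actor machinery removed):

import ray
import torch


class NeuralNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.l = torch.nn.Linear(2, 2)

    def forward(self, x):
        return self.l(x)


class TestActor:  # no @ray.remote
    def __init__(self):
        self.net = NeuralNet()
        self.crit = torch.nn.MSELoss()

    def train(self):
        p = self.net(torch.tensor([[1.0, 2.0]]))
        loss = self.crit(p, torch.tensor([[3.0, 4.0]]))
        self.net.zero_grad()
        loss.backward()
        return loss.item()


if __name__ == '__main__':
    ray.init()
    ac = TestActor()       # plain instantiation, no .remote()
    print(ac.train())      # direct call, no ray.get(...)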

@EricSteinberger (Author)

I've tested this on Python 3.7.0 with torch 0.4.1 and ray 0.6.0 now:

  • Importing torch first, with the remote parts removed (but ray still imported and initialized), it still segfaults.
  • Importing ray first with the remote parts removed, it runs fine.
  • However, with the remote parts in, the import order doesn't matter (both orders segfault on the same line).

It's only the .backward() call in the worker's .train() that segfaults.

@robertnishihara (Collaborator) commented Dec 13, 2018

Thanks for looking into this!

For the first configuration (which segfaults without requiring any remote functions), can you start Python in gdb and see what the backtrace is? E.g., something like

gdb python
> run script.py  # Assuming you call your example script.py
...
segfault
> bt

cc @pcmoritz

@EricSteinberger (Author)

No problem - here you go!

(gdb) run test.py 
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /home/eric/miniconda3/envs/RayIssue/bin/python test.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff3577700 (LWP 7248)]
[New Thread 0x7ffff2d76700 (LWP 7249)]
[New Thread 0x7fffee575700 (LWP 7250)]
[New Thread 0x7fffebd74700 (LWP 7251)]
[New Thread 0x7fffeb573700 (LWP 7252)]
[New Thread 0x7fffe8d72700 (LWP 7253)]
[New Thread 0x7fffe6571700 (LWP 7254)]
WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
[Thread 0x7fffe6571700 (LWP 7254) exited]
[Thread 0x7fffe8d72700 (LWP 7253) exited]
[Thread 0x7fffeb573700 (LWP 7252) exited]
[Thread 0x7fffebd74700 (LWP 7251) exited]
[Thread 0x7fffee575700 (LWP 7250) exited]
[Thread 0x7ffff2d76700 (LWP 7249) exited]
[Thread 0x7ffff3577700 (LWP 7248) exited]
Process STDOUT and STDERR is being redirected to /tmp/ray/session_2018-12-13_11-58-32_7245/logs.
Waiting for redis server at 127.0.0.1:39011 to respond...
Waiting for redis server at 127.0.0.1:39376 to respond...
Starting the Plasma object store with 13.470806835000001 GB memory using /dev/shm.
Failed to start the UI, you may need to run 'pip install jupyter'.
[New Thread 0x7fffe6571700 (LWP 7340)]
[New Thread 0x7fffe8d72700 (LWP 7341)]
[New Thread 0x7fffeb573700 (LWP 7342)]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.

(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff7bc5827 in __pthread_once_slow (once_control=0x7fffddd59988 <engine+8>, init_routine=0x7fffded2ba1c <std::__once_proxy()>) at pthread_once.c:116
#2  0x00007fffda3052b4 in __gthread_once (__func=<optimized out>, __once=0x7fffddd59988 <engine+8>) at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/x86_64-redhat-linux/bits/gthr-default.h:699
#3  std::call_once<void (torch::autograd::Engine::*)(), torch::autograd::Engine*> (__f=<optimized out>, __once=...) at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/mutex:746
#4  torch::autograd::Engine::execute (this=this@entry=0x7fffddd59980 <engine>, input_roots=..., inputs=..., keep_graph=<optimized out>, create_graph=<optimized out>, outputs=...) at torch/csrc/autograd/engine.cpp:502
#5  0x00007fffda33305c in torch::autograd::python::PythonEngine::execute (this=this@entry=0x7fffddd59980 <engine>, roots=..., inputs=..., keep_graph=<optimized out>, create_graph=<optimized out>, outputs=...) at torch/csrc/autograd/python_engine.cpp:61
#6  0x00007fffda333ce5 in THPEngine_run_backward (self=<optimized out>, args=<optimized out>, kwargs=<optimized out>) at torch/csrc/autograd/python_engine.cpp:169
#7  0x00005555556cc065 in _PyMethodDef_RawFastCallKeywords ()
#8  0x00005555556cc0e1 in _PyCFunction_FastCallKeywords ()
#9  0x0000555555728782 in _PyEval_EvalFrameDefault ()
#10 0x0000555555669059 in _PyEval_EvalCodeWithName ()
#11 0x00005555556cb2a5 in _PyFunction_FastCallKeywords ()
#12 0x0000555555727e6e in _PyEval_EvalFrameDefault ()
---Type <return> to continue, or q <return> to quit---
#13 0x0000555555669059 in _PyEval_EvalCodeWithName ()
#14 0x00005555556cb2a5 in _PyFunction_FastCallKeywords ()
#15 0x00005555557238b0 in _PyEval_EvalFrameDefault ()
#16 0x00005555556cb07b in _PyFunction_FastCallKeywords ()
#17 0x00005555557238b0 in _PyEval_EvalFrameDefault ()
#18 0x0000555555669059 in _PyEval_EvalCodeWithName ()
#19 0x0000555555669f24 in PyEval_EvalCodeEx ()
#20 0x0000555555669f4c in PyEval_EvalCode ()
#21 0x0000555555782a14 in run_mod ()
#22 0x000055555578bf11 in PyRun_FileExFlags ()
#23 0x000055555578c104 in PyRun_SimpleFileExFlags ()
#24 0x000055555578dbbd in pymain_main.constprop ()
#25 0x000055555578de30 in _Py_UnixMain ()
---Type <return> to continue, or q <return> to quit---
#26 0x00007ffff77e6b97 in __libc_start_main (main=0x555555649d20 <main>, argc=2, argv=0x7fffffffd918, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd908) at ../csu/libc-start.c:310
#27 0x0000555555733052 in _start () at ../sysdeps/x86_64/elf/start.S:103

@pcmoritz (Contributor)

Thanks for the report!

I haven't been able to reproduce it (this is with Python 3.6.6 from the latest Anaconda, the Ubuntu 18.04 AMI, and the following versions:

In [7]: ray.__version__                                                                                                                                                              
Out[7]: '0.6.0'

In [8]: torch.__version__                                                                                                                                                            
Out[8]: '0.4.1'

from pip and Anaconda respectively). Are you using the Deep Learning AMI by any chance?

But I'm also not surprised that there are configurations where this is an issue. I'll file another issue with Arrow and see if we can get rid of the workarounds and fix the problem at its root.

@EricSteinberger (Author) commented Dec 14, 2018

Thank you for trying to reproduce!

I haven't used the DL AMI. Surprised to see this working for you, I just repeated the process, testing precisely the configuration you report as working. Keep in mind the error might still be on my side, although I did get it working with ray 0.4.0.

The steps I executed to get the error are:

  1. Start up a t2.micro instance with Ubuntu 18.04
  2. wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  3. bash Miniconda3-latest-Linux-x86_64.sh -b -p /home/ubuntu/miniconda
  4. export PATH=/home/ubuntu/miniconda/bin:$PATH
  5. conda create -n RayIssue python=3.6.6
  6. source activate RayIssue
  7. conda install pytorch=0.4.1 -c pytorch
  8. pip install ray==0.6.0
  9. Run the script -> Segfault

Please tell me whether there is a difference in what we have done to set this up.
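
In case it helps pin down a difference, here is a small sketch (not part of the steps above) that prints the exact interpreter and package versions in the environment:

import sys

import ray
import torch

print("python:", sys.version.split()[0])
print("ray:", ray.__version__)
print("torch:", torch.__version__)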

@pcmoritz (Contributor)

Thanks, with those steps I can reproduce it!

Not sure what the problem was previously. I used an existing instance so there might have been some state on it.

@mveres01

I'm running into a similar issue (Ubuntu 16.04) as well, and can reproduce the segfault.

With ray==0.6.0 and torch==1.0.0, the segfault occurs much earlier, at import time, depending on the import order:

conda create -n RayIssue python=3.6.6
source activate RayIssue
conda install pytorch=1.0.0 -c pytorch
pip install ray==0.6.0
(RayIssue) :~$ python
Python 3.6.6 |Anaconda, Inc.| (default, Oct  9 2018, 12:34:16)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ray
>>> import torch
Segmentation fault (core dumped)

Importing torch first and then ray does not cause a segfault immediately, but running the original reproduction script above eventually throws errors. This import-order segfault isn't observed with torch==0.4.1 and ray==0.6.0.
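
For what it's worth, here is a minimal sketch (not something I ran above) that enables the standard-library faulthandler before the imports, so at least the Python-level stack gets printed when the interpreter segfaults; gdb is still needed for the native frames:

import faulthandler
faulthandler.enable()   # dump the Python stack on SIGSEGV and friends

import ray              # noqa: E402
import torch            # noqa: E402

print(ray.__version__, torch.__version__)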

@robertnishihara (Collaborator)

Thanks @mveres01! Could you include a backtrace from gdb when the segfault happens in the import torch line?

@pcmoritz has been looking into it, and it looks like it potentially comes down to a bug in the C++ standard library implementation of std::future. This is coming from the Arrow codebase, and there is some discussion about the best way to fix it in apache/arrow#3177.

@mveres01

Thanks for the pointer to Arrow! Here's the gdb output:

(RayIssue):~$ gdb python
...
(gdb) run ~/test2
Starting program: /export/mlrg/mveres/anaconda3/envs/RayIssue/bin/python ~/test2
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb)

where test2.py contains:

import ray
import torch

For comparison, importing torch before ray yields:

(gdb) run test1
Starting program: /export/mlrg/mveres/anaconda3/envs/RayIssue/bin/python test1
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Inferior 1 (process 118344) exited normally]

@pcmoritz self-assigned this Dec 18, 2018
@pcmoritz (Contributor) commented Dec 18, 2018

Thanks for your patience. I created a version of Ray that should work now (I could reproduce your crash, and it no longer crashes with the fix):
https://drive.google.com/open?id=1LLjYaqysbYg1Gz3RO91o3dqX77VF8u3g (this is built off pcmoritz@f8b75ef; the branch is a little messy right now and I'll clean it up).

At a high level, the fix is to remove std::future from Arrow and replace it with boost::future. This is a temporary workaround until we have figured out https://groups.google.com/a/tensorflow.org/d/topic/developers/TMqRaT-H2bI/discussion, which is the root cause of all of this.

Feel free to try it out if you want to and report back; I'll create a PR to be included in 0.6.1.

@atumanov (Contributor)

@pcmoritz , is this a release blocker for 0.6.1?

@pcmoritz (Contributor)

Yes!

@mveres01

That was a good read, thanks for looking into it @pcmoritz. Just wanted to confirm that the wheels you posted fix the issue on my end!

@robertnishihara (Collaborator)

Fixed by #3574.
