
Segmentation fault with ray 0.6.0 and PyTorch 0.4.1 and 1.0.0 #3520

Closed
EricSteinberger opened this issue Dec 12, 2018 · 17 comments

@EricSteinberger commented Dec 12, 2018

System information

  • OS Platform and Distribution: Ubuntu 16.04, Ubuntu 18.04, Amazon EC2 optimized Linux
  • Ray installed from (source or binary): pip
  • Ray version: 0.6.0
  • Python version: 3.7 & 3.6

A minimal example to reproduce:

import ray
import torch


class NeuralNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.l = torch.nn.Linear(2, 2)

    def forward(self, x):
        return self.l(x)


@ray.remote
class TestActor:
    def __init__(self):
        self.net = NeuralNet()
        self.crit = torch.nn.MSELoss()

    def train(self):
        p = self.net(torch.tensor([[1.0, 2.0]]))
        loss = self.crit(p, torch.tensor([[3.0, 4.0]]))
        self.net.zero_grad()
        loss.backward()
        return loss.item()


if __name__ == '__main__':
    ray.init()
    ac = TestActor.remote()
    print(ray.get(ac.train.remote()))

Problem Description

I clean-installed PyTorch with conda on several OSes and don't get segfaults when not using Ray. This may be a misunderstanding on my part, since this seems like an issue others would have hit minutes after it was introduced... Isn't this how Ray is supposed to be used when running locally?
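
Roughly, by "not using Ray" I mean something like the following sketch, which drives the same forward/backward pass directly in the main process (torch.nn.Linear stands in for the NeuralNet above); this variant runs fine for me:

import torch

net = torch.nn.Linear(2, 2)                 # stands in for NeuralNet above
crit = torch.nn.MSELoss()
p = net(torch.tensor([[1.0, 2.0]]))
loss = crit(p, torch.tensor([[3.0, 4.0]]))
net.zero_grad()
loss.backward()                             # no segfault without Ray
print(loss.item())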

Console Output

Process STDOUT and STDERR is being redirected to /tmp/ray/session_2018-12-12_10-51-32_500/logs.
Waiting for redis server at 127.0.0.1:31236 to respond...
Waiting for redis server at 127.0.0.1:56879 to respond...
Starting the Plasma object store with 13.470806835000001 GB memory using /dev/shm.
Failed to start the UI, you may need to run 'pip install jupyter'.
Fatal Python error: Segmentation fault

Stack (most recent call first):
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90 in backward
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/torch/tensor.py", line 93 in backward
  File "/home/eric/PycharmProjects/rayTorchTest/test.py", line 25 in train
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/ray/function_manager.py", line 481 in actor_method_executor
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/ray/worker.py", line 856 in _process_task
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/ray/worker.py", line 967 in _wait_for_and_process_task
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/ray/worker.py", line 1010 in main_loop
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/ray/workers/default_worker.py", line 99 in <module>
A worker died or was killed while executing task 00000000d024403f9ddae404df35ac4a32625560.
Traceback (most recent call last):
  File "/home/eric/PycharmProjects/rayTorchTest/test.py", line 32, in <module>
    print(ray.get(ac.train.remote()))
  File "/home/eric/miniconda3/envs/TorchRay/lib/python3.7/site-packages/ray/worker.py", line 2366, in get
    raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(01000000d024403f9ddae404df35ac4a32625560). It was created by remote function <unknown> which failed with:

Remote function <unknown> failed with:

Invalid return value: likely worker died or was killed while executing the task; check previous logs or dmesg for errors.

Process finished with exit code 1

Edit 1:

The same script runs successfully with the following dependencies:

python==3.6.0
torch==0.4.1
ray==0.4.0
redis==2.10.6

Upgrading to ray 0.5.3 throws another error that looks very similar (worker died), although it doesn't state that a segfault occurred. Upgrading to ray 0.6.0 causes the segfault shown above.

@robertnishihara (Collaborator)

Thanks @TinkeringCode!

Does it segfault if you run the exact same script (still calling import ray and ray.init()), but with the @ray.remote decorator removed from TestActor and the .remote parts removed? (See the sketch after these questions.)

Does the import order between torch and ray matter?

The main script is not segfaulting (only the worker is), right?
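
Concretely, the variant I have in mind is a sketch like this (ray still imported and initialized, but the actor machinery removed):

import ray
import torch


class NeuralNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.l = torch.nn.Linear(2, 2)

    def forward(self, x):
        return self.l(x)


class TestActor:  # no @ray.remote
    def __init__(self):
        self.net = NeuralNet()
        self.crit = torch.nn.MSELoss()

    def train(self):
        p = self.net(torch.tensor([[1.0, 2.0]]))
        loss = self.crit(p, torch.tensor([[3.0, 4.0]]))
        self.net.zero_grad()
        loss.backward()
        return loss.item()


if __name__ == '__main__':
    ray.init()
    ac = TestActor()       # plain instantiation, no .remote()
    print(ac.train())      # direct call, no ray.get(...)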

@EricSteinberger (Author)

I've tested this on Python 3.7.0 with torch 0.4.1 and ray 0.6.0 now:

  • Importing torch first, with the remote parts removed (but ray still imported and initialized), it still segfaults.
  • Importing ray first with the remote parts removed, it runs fine.
  • However, with the remote parts in, the import order doesn't matter (both orders segfault on the same line).

It's only the .backward() call in the worker's .train() that segfaults.

@robertnishihara (Collaborator) commented Dec 13, 2018

Thanks for looking into this!

For the first configuration (which segfaults without requiring any remote functions), can you start Python in gdb and see what the backtrace is? E.g., something like

gdb python
> run script.py  # Assuming you call your example script.py
...
segfault
> bt

cc @pcmoritz

@EricSteinberger (Author)

No problem - here you go!

(gdb) run test.py 
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /home/eric/miniconda3/envs/RayIssue/bin/python test.py
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff3577700 (LWP 7248)]
[New Thread 0x7ffff2d76700 (LWP 7249)]
[New Thread 0x7fffee575700 (LWP 7250)]
[New Thread 0x7fffebd74700 (LWP 7251)]
[New Thread 0x7fffeb573700 (LWP 7252)]
[New Thread 0x7fffe8d72700 (LWP 7253)]
[New Thread 0x7fffe6571700 (LWP 7254)]
WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
[Thread 0x7fffe6571700 (LWP 7254) exited]
[Thread 0x7fffe8d72700 (LWP 7253) exited]
[Thread 0x7fffeb573700 (LWP 7252) exited]
[Thread 0x7fffebd74700 (LWP 7251) exited]
[Thread 0x7fffee575700 (LWP 7250) exited]
[Thread 0x7ffff2d76700 (LWP 7249) exited]
[Thread 0x7ffff3577700 (LWP 7248) exited]
Process STDOUT and STDERR is being redirected to /tmp/ray/session_2018-12-13_11-58-32_7245/logs.
Waiting for redis server at 127.0.0.1:39011 to respond...
Waiting for redis server at 127.0.0.1:39376 to respond...
Starting the Plasma object store with 13.470806835000001 GB memory using /dev/shm.
Failed to start the UI, you may need to run 'pip install jupyter'.
[New Thread 0x7fffe6571700 (LWP 7340)]
[New Thread 0x7fffe8d72700 (LWP 7341)]
[New Thread 0x7fffeb573700 (LWP 7342)]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.

(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00007ffff7bc5827 in __pthread_once_slow (once_control=0x7fffddd59988 <engine+8>, init_routine=0x7fffded2ba1c <std::__once_proxy()>) at pthread_once.c:116
#2  0x00007fffda3052b4 in __gthread_once (__func=<optimized out>, __once=0x7fffddd59988 <engine+8>) at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/x86_64-redhat-linux/bits/gthr-default.h:699
#3  std::call_once<void (torch::autograd::Engine::*)(), torch::autograd::Engine*> (__f=<optimized out>, __once=...) at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/mutex:746
#4  torch::autograd::Engine::execute (this=this@entry=0x7fffddd59980 <engine>, input_roots=..., inputs=..., keep_graph=<optimized out>, create_graph=<optimized out>, outputs=...) at torch/csrc/autograd/engine.cpp:502
#5  0x00007fffda33305c in torch::autograd::python::PythonEngine::execute (this=this@entry=0x7fffddd59980 <engine>, roots=..., inputs=..., keep_graph=<optimized out>, create_graph=<optimized out>, outputs=...) at torch/csrc/autograd/python_engine.cpp:61
#6  0x00007fffda333ce5 in THPEngine_run_backward (self=<optimized out>, args=<optimized out>, kwargs=<optimized out>) at torch/csrc/autograd/python_engine.cpp:169
#7  0x00005555556cc065 in _PyMethodDef_RawFastCallKeywords ()
#8  0x00005555556cc0e1 in _PyCFunction_FastCallKeywords ()
#9  0x0000555555728782 in _PyEval_EvalFrameDefault ()
#10 0x0000555555669059 in _PyEval_EvalCodeWithName ()
#11 0x00005555556cb2a5 in _PyFunction_FastCallKeywords ()
#12 0x0000555555727e6e in _PyEval_EvalFrameDefault ()
---Type <return> to continue, or q <return> to quit---
#13 0x0000555555669059 in _PyEval_EvalCodeWithName ()
#14 0x00005555556cb2a5 in _PyFunction_FastCallKeywords ()
#15 0x00005555557238b0 in _PyEval_EvalFrameDefault ()
#16 0x00005555556cb07b in _PyFunction_FastCallKeywords ()
#17 0x00005555557238b0 in _PyEval_EvalFrameDefault ()
#18 0x0000555555669059 in _PyEval_EvalCodeWithName ()
#19 0x0000555555669f24 in PyEval_EvalCodeEx ()
#20 0x0000555555669f4c in PyEval_EvalCode ()
#21 0x0000555555782a14 in run_mod ()
#22 0x000055555578bf11 in PyRun_FileExFlags ()
#23 0x000055555578c104 in PyRun_SimpleFileExFlags ()
#24 0x000055555578dbbd in pymain_main.constprop ()
#25 0x000055555578de30 in _Py_UnixMain ()
---Type <return> to continue, or q <return> to quit---
#26 0x00007ffff77e6b97 in __libc_start_main (main=0x555555649d20 <main>, argc=2, argv=0x7fffffffd918, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd908) at ../csu/libc-start.c:310
#27 0x0000555555733052 in _start () at ../sysdeps/x86_64/elf/start.S:103

@pcmoritz (Contributor)

Thanks for the report!

I haven't been able to reproduce it (this is with Python 3.6.6 from the latest Anaconda, the Ubuntu 18.04 AMI, and the following versions:

In [7]: ray.__version__                                                                                                                                                              
Out[7]: '0.6.0'

In [8]: torch.__version__                                                                                                                                                            
Out[8]: '0.4.1'

from pip and Anaconda respectively). Are you using the Deep Learning AMI by any chance?

But I'm also not surprised that there are configurations where this is an issue. I'll file another issue with Arrow and see if we can get rid of the workarounds and fix the problem at its root.

@EricSteinberger (Author) commented Dec 14, 2018

Thank you for trying to reproduce!

I haven't used the DL AMI. Surprised to see this working for you, I just repeated the process, testing precisely the configuration you report as working. Keep in mind the error might still be on my side, although I did get it working with ray 0.4.0.

The steps I executed to get the error are:

  1. Start up a t2.micro instance with Ubuntu 18.04
  2. wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  3. bash Miniconda3-latest-Linux-x86_64.sh -b -p /home/ubuntu/miniconda
  4. export PATH=/home/ubuntu/miniconda/bin:$PATH
  5. conda create -n RayIssue python=3.6.6
  6. source activate RayIssue
  7. conda install pytorch=0.4.1 -c pytorch
  8. pip install ray==0.6.0
  9. Run the script -> Segfault

Please tell me whether there is a difference in what we have done to set this up.
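
In case it helps pin down a difference, here is a small sketch (not part of the steps above) that prints the exact interpreter and package versions in the environment:

import sys

import ray
import torch

print("python:", sys.version.split()[0])
print("ray:", ray.__version__)
print("torch:", torch.__version__)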

@pcmoritz (Contributor)

Thanks, with those steps I can reproduce it!

Not sure what the problem was previously. I used an existing instance so there might have been some state on it.

@mveres01

I'm running into a similar issue (Ubuntu 16.04) as well, and can reproduce the segfault.

With ray==0.6.0 and torch==1.0.0, the segfault occurs much earlier, at import time, depending on the import order:

conda create -n RayIssue python=3.6.6
source activate RayIssue
conda install pytorch=1.0.0 -c pytorch
pip install ray==0.6.0
(RayIssue) :~$ python
Python 3.6.6 |Anaconda, Inc.| (default, Oct  9 2018, 12:34:16)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ray
>>> import torch
Segmentation fault (core dumped)

Importing torch first and then ray does not cause a segfault immediately, but running the original reproduction script above eventually throws errors. This import-order segfault isn't observed with torch==0.4.1 and ray==0.6.0.
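
For what it's worth, here is a minimal sketch (not something I ran above) that enables the standard-library faulthandler before the imports, so at least the Python-level stack gets printed when the interpreter segfaults; gdb is still needed for the native frames:

import faulthandler
faulthandler.enable()   # dump the Python stack on SIGSEGV and friends

import ray              # noqa: E402
import torch            # noqa: E402

print(ray.__version__, torch.__version__)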

@robertnishihara (Collaborator)

Thanks @mveres01! Could you include a backtrace from gdb when the segfault happens in the import torch line?

@pcmoritz has been looking into it, and it looks like it potentially comes down to a bug in the C++ standard library implementation of std::future. This is coming from the Arrow codebase, and there is some discussion about the best way to fix it in apache/arrow#3177.

@mveres01

Thanks for the pointer to Arrow! Here's the gdb output:

(RayIssue):~$ gdb python
...
(gdb) run ~/test2
Starting program: /export/mlrg/mveres/anaconda3/envs/RayIssue/bin/python ~/test2
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb)

where test2.py contains:

import ray
import torch

For comparison, importing torch before ray yields:

(gdb) run test1
Starting program: /export/mlrg/mveres/anaconda3/envs/RayIssue/bin/python test1
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Inferior 1 (process 118344) exited normally]

@pcmoritz self-assigned this Dec 18, 2018
@pcmoritz (Contributor) commented Dec 18, 2018

Thanks for your patience. I created a version of Ray that should work now (I could reproduce your crash, and it no longer crashes with the fix):
https://drive.google.com/open?id=1LLjYaqysbYg1Gz3RO91o3dqX77VF8u3g (this is built off pcmoritz@f8b75ef; the branch is a little messy right now and I'll clean it up).

At a high level, the fix is to remove std::future from Arrow and replace it with boost::future. This is a temporary workaround until we have figured out https://groups.google.com/a/tensorflow.org/d/topic/developers/TMqRaT-H2bI/discussion, which is the root cause of all of this.

Feel free to try it out if you want to and report back; I'll create a PR to be included in 0.6.1.

@atumanov (Contributor)

@pcmoritz , is this a release blocker for 0.6.1?

@pcmoritz (Contributor)

Yes!

@mveres01

That was a good read, thanks for looking into it @pcmoritz. Just wanted to confirm that the wheels you posted fix the issue on my end!

@robertnishihara (Collaborator)

Fixed by #3574.
