Floating Point Exception #235
None of the frames in the call stack are in Ray. It looks like an Open MPI problem. From the stack, it is not clear what the problem is. |
Here is my openmpi info:
|
Agree, Open MPI problem. |
Was this caused by hwloc (it appears in the call stack)? |
I could not dig into any specifics. |
I tested an installation of Ray on Debian in VirtualBox and was not able to reproduce this issue. The issue above occurred on the Google Compute Engine machine, so I am not sure whether it is related to that, but the output for the
which is identical to one above. |
Hi,
It is possible to pinpoint the place where the problem occurs.
[snap-2:24679] Failing at address: 0x7f6a7b582604
Each Ray process has its own virtual memory address space. The division by 0 (SIGFPE) is presumably in one of the libraries (one of the .so files of Open-MPI).
On Linux, you can inspect the memory mapping of any running process by looking at the file /proc/PID/maps, where PID is the process identifier. Below is the map of a bash (shell) process:
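As a rough illustration of that lookup (this is not the listing from the original comment), the Python sketch below parses text in the /proc/PID/maps format and reports which mapping contains a given address. The sample text, its addresses, and the library name `libexample.so.1` are made up for the example; the helper names `parse_maps` and `find_mapping` are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Mapping(NamedTuple):
    start: int   # first address of the region (inclusive)
    end: int     # end address of the region (exclusive)
    perms: str   # permission string such as "r-xp"
    path: str    # backing file, or "" for anonymous memory


# Made-up sample in /proc/PID/maps format; addresses and paths are
# illustrative only and do not come from the thread.
SAMPLE_MAPS = """\
00400000-004ef000 r-xp 00000000 08:01 1234 /bin/bash
7f6a7b500000-7f6a7b5a0000 r-xp 00000000 08:01 5678 /usr/lib/libexample.so.1
7f6a7b5a0000-7f6a7b5a2000 rw-p 00000000 00:00 0
"""


def parse_maps(text: str) -> List[Mapping]:
    """Parse the contents of a maps file into a list of Mapping records."""
    mappings = []
    for line in text.splitlines():
        # Fields: address range, perms, offset, device, inode, optional path.
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start_s, end_s = fields[0].split("-")
        path = fields[5].strip() if len(fields) > 5 else ""
        mappings.append(Mapping(int(start_s, 16), int(end_s, 16), fields[1], path))
    return mappings


def find_mapping(mappings: List[Mapping], address: int) -> Optional[Mapping]:
    """Return the mapping whose range contains the address, or None."""
    for m in mappings:
        if m.start <= address < m.end:
            return m
    return None


if __name__ == "__main__":
    failing_address = 0x7F6A7B582604  # the "Failing at address" value quoted above
    hit = find_mapping(parse_maps(SAMPLE_MAPS), failing_address)
    if hit is None:
        print("address is not inside any listed mapping")
    else:
        # The offset into the file is what a symbol lookup tool would need.
        print(hit.path or "[anonymous]", hex(failing_address - hit.start))
```

On a live system, the contents of /proc/PID/maps for the process of interest would be passed to `parse_maps` in place of `SAMPLE_MAPS`. Address ranges differ between runs and processes, so the map has to come from the same process that reported the failing address.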
|
Hi Sébastien, I tried to repeat this error on Google cloud compute again.
I see that address 0x7fd0b297826e is matching
However, this time we see a different .so file, libhwloc.so.5, as compared to
Also, I am not sure how to check the PID if the process terminates very quickly.
Thanks for all your help. |
Here is the strace output
|
What is your goal? Do you want to patch hwloc? |
Hi @sebhtml
I just wanted to run Ray on Google Compute Debian image.
It sounds like an interesting thing to do, but I do not have extensive experience with C. Thanks for the response. |
SIGFPE with Open-MPI 1.8.3, --bind-to option, and hwloc
I know that Open-MPI 1.8.x (you use Open-MPI 1.8.3) uses process binding. This is new: the default was "--bind-to none" in previous series (1.7 and below), whereas in the 1.8 series the default is "--bind-to core". I believe that hwloc is used when the option "--bind-to" is given a value other than none.
SIGSEGV with Open-MPI 1.8.3
Also, Open-MPI 1.8.x uses the "vader" (as in Darth Vader) BTL (byte transfer layer) for sending messages between local processes. Open-MPI 1.8.3 contains a bug that leads to segmentation faults (SIGSEGV signal). I ran into this problem myself. See open-mpi/ompi#235
The workaround for that one is to add the option "--mca btl ^vader" to avoid Darth Vader altogether. |
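The two options mentioned in the comment above can be combined on one mpiexec command line. The Python sketch below only assembles such a command; the helper name `build_mpiexec_command`, the process count, and the bare `Ray` invocation are placeholders rather than anything taken from the thread. Whether "--bind-to none" actually avoids the SIGFPE is an assumption that the thread does not confirm.

```python
import shlex
from typing import List, Sequence


def build_mpiexec_command(
    num_procs: int,
    program_args: Sequence[str],
    bind_to: str = "none",
    disable_vader: bool = True,
) -> List[str]:
    """Assemble an mpiexec argument list using the options named in the thread."""
    # "--bind-to none" restores the pre-1.8 default described above.
    cmd = ["mpiexec", "-n", str(num_procs), "--bind-to", bind_to]
    if disable_vader:
        # "--mca btl ^vader" excludes the vader BTL (the SIGSEGV workaround).
        cmd += ["--mca", "btl", "^vader"]
    return cmd + list(program_args)


if __name__ == "__main__":
    # Placeholder program arguments: append your own Ray options here.
    ray_args = ["Ray"]
    cmd = build_mpiexec_command(4, ray_args)
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(cmd))
    # To actually launch it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list (rather than one string) keeps the `^` in `^vader` from being interpreted by a shell if the command is later run through `subprocess.run`.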
Hi @sebhtml,
Later on I installed Open MPI from source using these instructions: and now, if I run Ray using the "newly installed" mpiexec, I get this error, which is the same one I got in the first post.
Thanks for teaching me new things. |
[2020-05-05 19:43:47] Error: Floating-point exception [CALL STACK]
How do I solve this problem? |
@Nebisjames Run your software inside a debugger; a good debugger should break when SIGFPE occurs. |
Hello,
I am having trouble running Ray. I am not sure if this is a simple error on my side or something related to Open MPI.