Failure to stop at MPIR_Breakpoint on Power systems with OpenMPI v3.0.x #5501

Closed
kent-cheung-arm opened this issue Jul 31, 2018 · 18 comments

@kent-cheung-arm

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v3.0.0 and v3.0.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

from source tarball with PGI 18.1 or IBM 16.1.1 Beta 2 compilers

Please describe the system on which you are running

  • Operating system/version: RHEL7
  • Computer hardware: ppc64le
  • Network type: self

Details of the problem

GDB is unable to stop at MPIR_Breakpoint when debugging the mpirun process using the MPIR interface on the above system with the above compilers.

Here are the steps to reproduce with the simple hello_c program from #5349:

$ mpirun --version
mpirun (Open MPI) 3.0.2

Report bugs to http://www.open-mpi.org/community/help/
$ gdb --quiet --args mpirun -np 2 ./hello_c
Reading symbols from /software/mpi/openmpi-3.0.2_pgi-18.1/bin/orterun...done.
(gdb) start
Temporary breakpoint 1 at 0x100010dc: file main.c, line 13.
Starting program: /software/mpi/openmpi-3.0.2_pgi-18.1/bin/mpirun -np 2 ./hello_c
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/power8/libthread_db.so.1".

Temporary breakpoint 1, main () at main.c:13
13      main.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install glibc-2.17-106.el7_2.8.ppc64le zlib-1.2.7-15.el7.ppc64le
(gdb) set MPIR_being_debugged=1
(gdb) break MPIR_Breakpoint
Breakpoint 2 at 0x3fffb7f3fec0: file orted/orted_submit.c, line 182.
(gdb) continue
Continuing.
[New Thread 0x3fffb738f1d0 (LWP 47821)]
[New Thread 0x3fffb6b8f1d0 (LWP 47822)]
[New Thread 0x3fffb602f1d0 (LWP 47824)]
[New Thread 0x3fffb582f1d0 (LWP 47825)]
Detaching after fork from child process 47826.
Detaching after fork from child process 47827.
My rank is 0.
My pid is 47826.
My rank is 1.
My pid is 47827.
[Thread 0x3fffb602f1d0 (LWP 47824) exited]
[Thread 0x3fffb582f1d0 (LWP 47825) exited]
[Thread 0x3fffb6b8f1d0 (LWP 47822) exited]
[Thread 0x3fffb738f1d0 (LWP 47821) exited]
[Inferior 1 (process 47806) exited normally]
(gdb)
@jjhursey
Member

jjhursey commented May 2, 2019

The fix for this might be related to #6613 (comment). That PMIx fix hasn't been ported to the PMIx 2.x branch yet, but I can see if it's possible.

In the meantime, can you try the latest Open MPI v4.x release with PMIx v3.1.3rc1?

@James-A-Clark
Contributor

I just retested with a build of the Open MPI v4.0.x branch (386ed07) with PMIx v3.1.3rc3 and the same issue is present. The compiler was IBM 16.1.0.

@jjhursey
Member

jjhursey commented Jun 3, 2019

Per an off-list note, it looks like this is not reproducible with gcc, but it is with PGI and IBM XL.

@gpaulsen
Member

@jjhursey is this fixed with the latest PMIx update?

@jjhursey
Member

@gpaulsen No. This still needs to be investigated.

@gpaulsen
Member

@awlauria will investigate this week.

@awlauria
Contributor

In short: The MPIR_Breakpoint() is getting optimized out with xlc and pgi.

Unfortunately, there is no one "magic bullet" to make sure it works on all compilers. I explored adding support similar to what was done here for clang:
#4624

but that approach may not be feasible. It's simple enough to add support for gcc's __optimize__("O0") attribute (I think we just get lucky that GCC doesn't compile the function out; a future version very well could), but others like XLC and PGI use a #pragma directive. That is possible to add, but it is very difficult to tell at configure time whether the pragma is actually available, since I couldn't get either xlc or pgi to show a warning when the pragma was invalid.

Checking the compiler using predefined macros is also a no-go. XLC defines a variety of them, including __clang__. And most compilers define some sort of __GNU__.

The simplest solution that will hopefully satisfy most compilers is something akin to this:

+static volatile void * volatile noop_mpir_breakpoint_ptr = NULL;
 
 /*
  * Breakpoint function for parallel debuggers
  */
 void* MPIR_Breakpoint(void)
 {
-    return NULL;
+    return noop_mpir_breakpoint_ptr;
 }

But it comes with a small annoyance: at least under XLC, when gdb breaks on MPIR_Breakpoint() it shows the calling function as the stop location, as if the compiler had inlined MPIR_Breakpoint. I haven't yet tried this on PGI, but I think it should be OK. Going down the rabbit hole of adding a directive to prevent inlining is just as bad as trying to stop the compiler from optimizing the function out altogether.

If the above is a satisfactory solution I can go ahead and PR it.

@rhc54
Contributor

rhc54 commented Jul 18, 2019

IIRC, you aren't supposed to be able to use MPIR-based debuggers on optimized code - you are supposed to compile your code without optimization. Otherwise, even if MPIR might work, gdb won't find the necessary symbols and/or be able to properly align with the source.

So are you saying that even without optimization turned on, the MPIR_Breakpoint is being optimized out?

@awlauria
Contributor

Do you mean the user application or OMPI? In this case, it's the way OMPI was compiled which was causing the issue.

@rhc54
Contributor

rhc54 commented Jul 18, 2019

I was actually talking about OMPI - was it compiled optimized? Or with optimization turned off?

If it was optimized, then you technically can't use it for debugging, AFAIK, as the symbols required for attachment will be missing. We've seen people try all kinds of tricks over the years to work around that problem, but none of them have been successful for the general case (they sometimes work in special cases).

@jjhursey
Member

(for the archives) I don't think the below is correct:

> If it was optimized, then you technically can't use it for debugging, AFAIK, as the symbols required for attachment will be missing.

We can ship (and have for a long time shipped) an optimized MPI library that works with MPIR. The key is that the file(s) containing the MPIR symbols are compiled with debug symbols and without optimization. That's why we have a funny-looking Makefile near the orted_submit.c file. That allows the debugger to attach and work with the app. Users won't be able to debug the MPI library very well since it doesn't have all of the symbols, but they can debug their application.

I think the problem here is that the compiler is overly optimizing this function out even with -g -O0, or somehow optimizations are still being enabled for the orted_submit.c file. I think that's being sorted out on PR #6828

@rhc54
Contributor

rhc54 commented Jul 23, 2019

We are agreeing, Josh - IF it was optimized, THEN the symbols are gone. What you are saying is that we try NOT to optimize the files with the symbols 😄

@jjhursey
Member

Correct. I'm just pointing out that the MPI library as a whole can still be optimized as long as that file is not.

@bosilca
Member

bosilca commented Jul 23, 2019

We had a similar issue in the OMPI layer regarding the debugger message queues. The solution is similar to what has been discussed here: force the compilation of certain files with additional debugging flags. Take a look at ompi/debugger/Makefile.am to see how we forced automake to do so, and at config/orte_setup_debugger_flags.m4 to see how the debugging flags are defined.

@jsquyres
Member

> We had a similar issue in the OMPI layer regarding the debugger message queues. The solution is similar to what has been discussed here: force the compilation of certain files with additional debugging flags. Take a look at ompi/debugger/Makefile.am to see how we forced automake to do so, and at config/orte_setup_debugger_flags.m4 to see how the debugging flags are defined.

This is why I want to see how -O3 is getting into @awlauria's build, because the Makefile.am in the orte_submit dir is nominally doing the Right Things.

@jjhursey
Member

jjhursey commented Aug 8, 2019

@jsquyres closed this as completed Aug 8, 2019
@Josh-Cottingham-Arm

I recently tested v3.1.x (which now has the PRs merged in) with the IBM 16.1.0 and PGI 18.7 compilers and can confirm that this issue is fixed.

@jsquyres
Copy link
Member

jsquyres commented Aug 9, 2019

Thanks for the report, @Josh-Cottingham-Arm
