
Cisco MTT IBM OSC point to point failures #5242

Closed
PeterGottesman opened this issue Jun 6, 2018 · 2 comments

PeterGottesman commented Jun 6, 2018

IBM's OSC tests in MTT are failing when run with multiple threads, as of #5200.
Example here: https://mtt.open-mpi.org/index.php?do_redir=2631

--------------------------------------------------------------------------
The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this release.
Workarounds are to run on a single node, or to use a system with an RDMA
capable network such as Infiniband.
--------------------------------------------------------------------------
[mpi006:14269] *** An error occurred in MPI_Win_create_dynamic
[mpi006:14269] *** reported by process [3416588289,1]
[mpi006:14269] *** on communicator MPI COMMUNICATOR 3 SPLIT FROM 0
[mpi006:14269] *** MPI_ERR_WIN: invalid window
[mpi006:14269] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[mpi006:14269] ***    and potentially your MPI job)
[mpi006:14256] 1 more process has sent help message help-osc-pt2pt.txt / mpi-thread-multiple-not-supported
[mpi006:14256] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[mpi006:14256] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
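
A minimal sketch of the failing pattern (hypothetical; not taken from the IBM test suite): request MPI_THREAD_MULTIPLE, then create a dynamic window, running across two nodes without an RDMA-capable network so that osc/pt2pt is selected.

#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;
    /* Requesting MPI_THREAD_MULTIPLE directly; running the test with
       OMPI_MPI_THREAD_LEVEL=3 in the environment has the same effect. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    MPI_Win win;
    /* osc/pt2pt refuses THREAD_MULTIPLE, so under the default
       MPI_ERRORS_ARE_FATAL handler this call aborts the job, producing
       output like the above. */
    MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}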

@hjelmn What can we do to clean up these tests?


jsquyres commented Jun 7, 2018

@hjelmn I believe Cisco's MTT run executes a whole section of tests with the environment variable OMPI_MPI_THREAD_LEVEL set to 3, which is the equivalent of calling MPI_Init_thread() with MPI_THREAD_MULTIPLE. Hence, all the tests that create an OSC window are now failing (due to our decision on last week's Webex to make osc/pt2pt disable itself in the presence of THREAD_MULTIPLE).

I'm not quite sure what the Right thing to do is, because we really only want to exit(77) the test when a) you're using THREAD_MULTIPLE and b) you're using an osc component that can't handle THREAD_MULTIPLE.
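
One possible shape for such a guard (a sketch only, assuming exit(77) is the skip code the test harness honors, and that a failed window creation is an adequate probe for condition b):

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    /* Let a failed window creation return an error code instead of
       aborting the whole job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    MPI_Win win;
    int rc = MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    if (MPI_THREAD_MULTIPLE == provided && MPI_SUCCESS != rc) {
        /* a) THREAD_MULTIPLE is in effect, and b) the selected osc
           component can't handle it: skip instead of failing. */
        MPI_Finalize();
        exit(77);
    }
    if (MPI_SUCCESS == rc) {
        MPI_Win_free(&win);
    }

    /* ... actual test body would run here ... */
    MPI_Finalize();
    return 0;
}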

Got any suggestions?


jsquyres commented Jun 8, 2018

@hjelmn pointed out to me in IM that btl/vader (and eventually btl/tcp) will soon support the RDMA methods, and this problem will resolve itself.

That being said, the btl/vader and btl/tcp updates almost certainly won't be back-ported to the v2.x, v3.0.x, and v3.1.x branches. So we'll need to figure out something to do there. Perhaps limit the OMPI_MPI_THREAD_LEVEL tests...?
