Start tracing breakage #12464
Conversation
Start with PMIx v4.2.9 and PRRTE v3.0.3

Signed-off-by: Ralph Castain <rhc@pmix.org>
I could use a little help here with interpretation - the mpi4py tests are failing here:

testRepr (test_toplevel.TestRC.testRepr) ... ok
[fv-az714-760:02109] [[8205,1],1] selected pml ob1, but peer [[8205,1],0] on fv-az714-760 selected pml
[fv-az714-760:02109] OPAL ERROR: Unreachable in file communicator/comm.c at line 2385
[fv-az714-760:02109] 0: Error in ompi_get_rprocs
setUpClass (test_ulfm.TestULFMInter) ... ERROR
======================================================================
ERROR: setUpClass (test_ulfm.TestULFMInter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/ompi/ompi/test/test_ulfm.py", line 196, in setUpClass
    INTERCOMM = MPI.Intracomm.Create_intercomm(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/mpi4py/MPI/Comm.pyx", line 2335, in mpi4py.MPI.Intracomm.Create_intercomm
mpi4py.MPI.Exception: MPI_ERR_INTERN: internal error
----------------------------------------------------------------------
Ran 1666 tests in 16.787s
FAILED (errors=1, skipped=82)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
  Proc: [[8205,1],1]
  Errorcode: 1
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

The test section being executed is the "run_spawn" option - so what does ULFM have to do with it? Is the test executing a failure recovery? I had previously received a reproducer (#12307 (comment)), but we pass that just fine. So I'm not sure what this section is doing, and what I'm supposed to be looking at.
Signed-off-by: Ralph Castain <rhc@pmix.org>
Hello! The Git Commit Checker CI bot found a few problems with this PR:

5cee996: Expand the shmem2 region size

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!
To potentially save you some time, @rhc54: 1000x may be too big and cause other resource-exhaustion issues on the system side. 10x would be a good starting point.

Could be - but I wound up with the identical error anyway. I'm going to have to approach this differently, as having to commit changes to the release branch just to run a test over here isn't sustainable. Still hoping someone can provide some insight into the meaning of the error. 🤷‍♂️
Make it easier to test

Signed-off-by: Ralph Castain <rhc@pmix.org>

My mistake. I thought I caught this change before a test run.
Replace with the actual code to make it easier to debug CI tests

Signed-off-by: Ralph Castain <rhc@pmix.org>
Signed-off-by: Ralph Castain <rhc@pmix.org>
The test_ULFM test exercises ULFM capabilities but without introducing failures, so it is mostly testing basic functionality. In this particular case the error happened very early in the test, basically while creating the inter-communicator for the test. This code takes a communicator, splits it in two, and then creates an inter-comm out of these two parts. According to the output, the intercomm creation fails because the …
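The split-then-connect pattern described above is mostly bookkeeping: each rank picks a color for the split, and Create_intercomm then joins the two halves using a local leader plus the remote leader's rank in the parent communicator. Below is a stand-alone sketch of that bookkeeping with no MPI runtime involved; the even/odd color choice and the leader ranks are assumptions for illustration, not necessarily what test_ulfm.py actually uses.

```python
# Hypothetical sketch of the rank bookkeeping behind Comm.Split +
# Intracomm.Create_intercomm. The even/odd split and leader choices
# are assumptions, not the actual test_ulfm.py logic.

def split_plan(world_size: int):
    """For each rank in the parent communicator, compute the color it
    would pass to Comm.Split and the (local_leader, remote_leader)
    arguments it would pass to Create_intercomm."""
    evens = [r for r in range(world_size) if r % 2 == 0]
    odds = [r for r in range(world_size) if r % 2 == 1]
    plan = {}
    for rank in range(world_size):
        color = rank % 2
        group = evens if color == 0 else odds
        other = odds if color == 0 else evens
        plan[rank] = {
            "color": color,
            "local_rank": group.index(rank),  # rank inside the split comm
            "local_leader": 0,                # rank 0 of each half leads
            "remote_leader": other[0],        # other leader's rank in the parent comm
        }
    return plan

plan = split_plan(4)
# Ranks 0 and 2 form the color-0 half; their remote leader is rank 1.
```

The point of the sketch is that every rank must agree on both leaders; if the two halves disagree (or one peer selected a different pml, as in the log above), the handshake inside Create_intercomm fails before any ULFM-specific code runs.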
Thanks @bosilca ! Much appreciate the explanation, it helps a lot. |
Test matrix:
bot:notacherrypick