Start tracing breakage #12464

Closed
wants to merge 18 commits into from


rhc54
Contributor

rhc54 commented Apr 11, 2024

Test matrix:

PMIx     PRRTE    Result
-----    -----    ------
4.2.8    3.0.3    Passed
4.2.9    3.0.3    Passed
5.0.0    3.0.3    Passed
5.0.1    3.0.3    Passed
5.0.2    3.0.3    Failed   <----
5.0.1    3.0.4    Passed
5.0.1    3.0.5    Failed   <----
5.0.1    3.0H     Failed   <----
5.0H     3.0.3    Failed   <----
5.0H*    3.0.3    Passed
5.0H*    3.0.4    Passed
5.0H*    3.0.5    Failed   <----

* shmem2 disabled
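
As a reading aid, here is a minimal sketch of how the matrix above narrows down where the breakage starts on each axis. The rows are copied from the table; the helper names (`version_key`, `first_failure`) are hypothetical, and treating the `H` suffix as "newer than any numbered release on that branch" is an assumption. The shmem2-disabled (`5.0H*`) rows are left out so that only like-for-like builds are compared.

```python
# Rows copied from the test matrix: (PMIx, PRRTE, result).
MATRIX = [
    ("4.2.8", "3.0.3", "Passed"),
    ("4.2.9", "3.0.3", "Passed"),
    ("5.0.0", "3.0.3", "Passed"),
    ("5.0.1", "3.0.3", "Passed"),
    ("5.0.2", "3.0.3", "Failed"),
    ("5.0.1", "3.0.4", "Passed"),
    ("5.0.1", "3.0.5", "Failed"),
    ("5.0.1", "3.0H", "Failed"),
    ("5.0H", "3.0.3", "Failed"),
]


def version_key(version):
    """Sort key for versions such as '5.0.2' or '3.0H'.

    Assumption: an 'H' suffix sorts after every numbered release
    with the same prefix.
    """
    is_head = version.endswith("H")
    parts = [int(p) for p in version.rstrip("H").split(".")]
    if is_head:
        parts.append(float("inf"))
    return tuple(parts)


def first_failure(rows, vary, fixed_value):
    """Walk one axis in version order with the other axis held fixed.

    Returns (last_passing_version, first_failing_version).
    """
    idx, other = (0, 1) if vary == "pmix" else (1, 0)
    series = sorted(
        (row for row in rows if row[other] == fixed_value),
        key=lambda row: version_key(row[idx]),
    )
    last_pass = None
    for row in series:
        if row[2] == "Failed":
            return last_pass, row[idx]
        last_pass = row[idx]
    return last_pass, None


print(first_failure(MATRIX, "pmix", "3.0.3"))
print(first_failure(MATRIX, "prrte", "5.0.1"))
```

With PRRTE held at 3.0.3 the first failing PMIx is 5.0.2, and with PMIx held at 5.0.1 the first failing PRRTE is 3.0.5, which matches the arrows in the table.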

bot:notacherrypick

github-actions bot added this to the v5.0.4 milestone Apr 11, 2024
janjust added the mpi4py-all (Run the optional mpi4py CI tests) label Apr 11, 2024
Start with PMIx v4.2.9 and PRRTE v3.0.3

Signed-off-by: Ralph Castain <rhc@pmix.org>
rhc54 added 11 commits April 11, 2024 07:41
Signed-off-by: Ralph Castain <rhc@pmix.org>
rhc54 mentioned this pull request Apr 11, 2024
Signed-off-by: Ralph Castain <rhc@pmix.org>
@rhc54
Contributor Author

rhc54 commented Apr 12, 2024

I could use a little help here with interpretation - the mpi4py tests are failing here:

testRepr (test_toplevel.TestRC.testRepr) ... ok
[fv-az714-760:02109] [[8205,1],1] selected pml ob1, but peer [[8205,1],0] on fv-az714-760 selected pml <unprintable bytes>
[fv-az714-760:02109] OPAL ERROR: Unreachable in file communicator/comm.c at line 2385
[fv-az714-760:02109] 0: Error in ompi_get_rprocs
setUpClass (test_ulfm.TestULFMInter) ... ERROR

======================================================================
ERROR: setUpClass (test_ulfm.TestULFMInter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/ompi/ompi/test/test_ulfm.py", line 196, in setUpClass
    INTERCOMM = MPI.Intracomm.Create_intercomm(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "src/mpi4py/MPI/Comm.pyx", line 2335, in mpi4py.MPI.Intracomm.Create_intercomm
mpi4py.MPI.Exception: MPI_ERR_INTERN: internal error

----------------------------------------------------------------------
Ran 1666 tests in 16.787s

FAILED (errors=1, skipped=82)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
  Proc: [[8205,1],1]
  Errorcode: 1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

The test section being executed is the "run_spawn" option - so what does ULFM have to do with it? Is the test executing a failure recovery?

I had previously received a reproducer (#12307 (comment)), but we pass that just fine. So I'm not sure what this section is doing, and what I'm supposed to be looking at.

Signed-off-by: Ralph Castain <rhc@pmix.org>

Hello! The Git Commit Checker CI bot found a few problems with this PR:

5cee996: Expand the shmem2 region size

  • check_cherry_pick: does not include a cherry pick message (did you need to bot:notacherrypick?)

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@samuelkgutierrez
Member

To potentially save you some time, @rhc54: 1000x may be too big and cause other resource exhaustion issues on the system side. 10x would be a good starting point.

@rhc54
Contributor Author

rhc54 commented Apr 12, 2024

> To potentially save you some time, @rhc54: 1000x may be too big and cause other resource exhaustion issues on the system side. 10x would be a good starting point.

Could be - but I wound up with the identical error anyway. I'm going to have to approach this differently as having to commit changes to the release branch just to run a test over here isn't sustainable. Still hoping someone can provide some insight into the meaning of the error. 🤷‍♂️

Make it easier to test

Signed-off-by: Ralph Castain <rhc@pmix.org>
@samuelkgutierrez
Member

> Could be - but I wound up with the identical error anyway. I'm going to have to approach this differently as having to commit changes to the release branch just to run a test over here isn't sustainable. Still hoping someone can provide some insight into the meaning of the error. 🤷‍♂️

My mistake. I thought I caught this change before a test run.

rhc54 added 3 commits April 11, 2024 20:39
Replace with the actual code to make it easier
to debug CI tests

Signed-off-by: Ralph Castain <rhc@pmix.org>
Signed-off-by: Ralph Castain <rhc@pmix.org>
Signed-off-by: Ralph Castain <rhc@pmix.org>
@bosilca
Member

bosilca commented Apr 12, 2024

The test_ULFM tests exercise ULFM capabilities but without introducing failures, so they mostly test basic functionality. In this particular case the error happened very early in the test, while creating the inter-communicator for the test. This code takes a communicator, splits it in two, and then creates an inter-communicator out of these two parts.

According to the output, the intercomm creation fails because retrieving the remote process list (the "Error in ompi_get_rprocs" line) fails with an internal error. There are tens of potential reasons for this to fail with an internal error, and the output contains nothing that helps identify the cause. Perhaps the simplest way to understand it is to set a breakpoint in ompi_comm_get_rprocs and step through it until it becomes clear why it bails out.
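
For readers unfamiliar with the pattern, here is a minimal mpi4py sketch of the split-then-intercomm sequence described above. It is not the code from test_ulfm.py: the halving scheme, the leader choice, and the helper names (`intercomm_plan`, `build_intercomm`) are assumptions made for illustration. Only `Comm.Split` and `Intracomm.Create_intercomm` are actual mpi4py calls.

```python
def intercomm_plan(rank, size):
    """Return (color, local_leader, remote_leader) for a two-way split.

    Assumed scheme: ranks below size // 2 form group 0, the rest form
    group 1. local_leader is a rank inside the new intra-communicator;
    remote_leader is a rank in the parent (bridge) communicator.
    """
    if size < 2:
        raise ValueError("an inter-communicator needs at least two ranks")
    half = size // 2
    if rank < half:
        return 0, 0, half  # remote leader: first rank of the upper half
    return 1, 0, 0         # remote leader: rank 0 of the parent


def build_intercomm():
    """Split COMM_WORLD in two halves and join them with an inter-communicator."""
    # Imported lazily so the planning helper above can be used without MPI.
    from mpi4py import MPI

    base = MPI.COMM_WORLD
    rank, size = base.Get_rank(), base.Get_size()
    color, local_leader, remote_leader = intercomm_plan(rank, size)

    # key=rank keeps the parent ordering inside each half.
    intra = base.Split(color, key=rank)
    # This is the call that raised MPI_ERR_INTERN in the traceback above.
    inter = intra.Create_intercomm(local_leader, base, remote_leader)
    return intra, inter
```

To try it, call `build_intercomm()` from a script launched with at least two ranks (for example `mpirun -n 2 python script.py`, where `script.py` is a placeholder name); a failure at the `Create_intercomm` step would correspond to the error reported in this PR.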

@rhc54
Contributor Author

rhc54 commented Apr 12, 2024

Thanks @bosilca! Much appreciated; the explanation helps a lot.

rhc54 closed this May 5, 2024
rhc54 deleted the topic/test branch May 5, 2024 21:35
Labels
mpi4py-all Run the optional mpi4py CI tests Target: v5.0.x
4 participants