Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove ORTE project #7202

Merged
merged 1 commit into from
Feb 8, 2020
Merged

Remove ORTE project #7202

merged 1 commit into from
Feb 8, 2020

Conversation

rhc54
Copy link
Contributor

@rhc54 rhc54 commented Nov 27, 2019

Will be replaced by PRRTE. Ensure that OMPI and OPAL layers build
without reference to ORTE. Setup opal/pmix framework to be static.
Remove support for all PMI-1 and PMI-2 libraries. Add support for
"external" pmix component as well as internal v4 one.

remove orte: misc fixes

  • UCX fixes
  • VPATH issue
  • oshmem fixes
  • remove useless definition
  • Add PRRTE submodule
  • Get autogen.pl to traverse PRRTE submodule
  • Remove stale orcm reference
  • Configure embedded PRRTE
  • Correctly pass the prefix to PRRTE
  • Correctly set the OMPI_WANT_PRRTE am_conditional
  • Move prrte configuration to the end of OMPI's configure.ac
  • Make mpirun a symlink to prun, when available
  • Fix makedist with --no-orte/--no-prrte option
  • Add a --no-prrte option which is the same as the legacy
    --no-orte option.
  • Remove embedded PMIx tarball. Replace it with new submodule
    pointing to OpenPMIx master repo's master branch
  • Some cleanup in PRRTE integration and add config summary entry
  • Correctly set the hostname
  • Fix locality
  • Fix singleton operations
  • Fix support for "tune" and "am" options

Signed-off-by: Ralph Castain rhc@pmix.org
Signed-off-by: Gilles Gouaillardet gilles@rist.or.jp
Signed-off-by: Joshua Hursey jhursey@us.ibm.com

@rhc54
Copy link
Contributor Author

rhc54 commented Nov 27, 2019

@jsquyres @bwbarrett @jjhursey @naughtont3 @ggouaillardet

This is the first step in the move to replace ORTE with PRRTE. It removes the ORTE layer completely, ensures all interaction with the RTE is via PMIx, and makes the opal/pmix framework build as a static framework. See what you think

Next step will be to introduce PRRTE as a submodule and reintegrate it into this branch.

@ggouaillardet
Copy link
Contributor

@rhc54 I pushed a new commit that fixes misc stuff.
Feel free to squash it into your commit.

@ggouaillardet
Copy link
Contributor

it looks like some extra work for coll/hcoll is gonna be needed

@open-mpi open-mpi deleted a comment from ibm-ompi Nov 29, 2019
@open-mpi open-mpi deleted a comment from ibm-ompi Nov 29, 2019
@open-mpi open-mpi deleted a comment from ibm-ompi Nov 29, 2019
@open-mpi open-mpi deleted a comment from ibm-ompi Nov 29, 2019
@open-mpi open-mpi deleted a comment from ibm-ompi Nov 29, 2019
@open-mpi open-mpi deleted a comment from ibm-ompi Nov 29, 2019
@rhc54
Copy link
Contributor Author

rhc54 commented Nov 29, 2019

@ggouaillardet Can you see if this now works when running a simple "hello" program? I believe you need to configure both PRRTE and OMPI as "static" to avoid library confusion, but it should then work.

@ggouaillardet
Copy link
Contributor

@rhc54 I pushed some more misc orte->prrte related fixes.

All my jobs failed in MPI_Init() (double free error in PMIx), and refreshing PMIx fixed that.

@open-mpi open-mpi deleted a comment from ibm-ompi Dec 3, 2019
@open-mpi open-mpi deleted a comment from ibm-ompi Dec 3, 2019
@open-mpi open-mpi deleted a comment from ibm-ompi Dec 3, 2019
@open-mpi open-mpi deleted a comment from ibm-ompi Dec 3, 2019
@naughtont3
Copy link
Contributor

naughtont3 commented Dec 10, 2019

@rhc54 I started to do a quick test of this PR. It looks like this is using updated version of PMIx_Register_event_handler that is non-void retval. Is there a stable release of OpenPMIx that has that signature? I'll test with master, but wanted to know for future tests.

runtime/ompi_mpi_init.c:526:8: error: void value not ignored as it ought to be
     rc = PMIx_Register_event_handler(NULL, 0, info, 2, ompi_errhandler_callback, NULL, NULL);
        ^

@rhc54
Copy link
Contributor Author

rhc54 commented Dec 10, 2019

I believe that change was ported back to the v3.1 branch, but v3.1.5 has not yet been released. It shouldn't be flagged as an error, however, but rather just as a warning - I gather you have "treat warnings as errors" set and that is the issue.

@open-mpi open-mpi deleted a comment from ibm-ompi Dec 12, 2019
@open-mpi open-mpi deleted a comment from ibm-ompi Dec 12, 2019
@open-mpi open-mpi deleted a comment from ibm-ompi Dec 13, 2019
@open-mpi open-mpi deleted a comment from ibm-ompi Dec 13, 2019
@open-mpi open-mpi deleted a comment from ibm-ompi Dec 13, 2019
@open-mpi open-mpi deleted a comment from ibm-ompi Dec 13, 2019
@rhc54
Copy link
Contributor Author

rhc54 commented Dec 13, 2019

@jsquyres @ggouaillardet I got autopen.pl to traverse prrte, but I am not sure how to get OMPI to execute configure in that subdirectory. We also need to figure out how to direct PRRTE to use the correct libevent, hwloc, and pmix. Can you guys take a look?

@rhc54
Copy link
Contributor Author

rhc54 commented Dec 18, 2019

bot:retest

@rhc54
Copy link
Contributor Author

rhc54 commented Dec 18, 2019

@bwbarrett @wckzhang One of the Build Checker tests failed to pull the git submodule - can you see why?

@rhc54
Copy link
Contributor Author

rhc54 commented Dec 18, 2019

bot:mellanox:retest

@rhc54
Copy link
Contributor Author

rhc54 commented Feb 3, 2020

@artemry-mlnx Same here - can we build with --enable-debug?

@rhc54
Copy link
Contributor Author

rhc54 commented Feb 4, 2020

@jjhursey @artemry-mlnx I tracked it down - fix coming shortly

@open-mpi open-mpi deleted a comment from ibm-ompi Feb 5, 2020
@open-mpi open-mpi deleted a comment from ibm-ompi Feb 5, 2020
@open-mpi open-mpi deleted a comment from ibm-ompi Feb 5, 2020
@open-mpi open-mpi deleted a comment from ibm-ompi Feb 5, 2020
@rhc54
Copy link
Contributor Author

rhc54 commented Feb 5, 2020

Our CI doesn't check it, but I have ensured that singleton operations continue to work. However, singleton Comm_spawn does not currently work unless there is a PMIx-enabled RTE (e.g., PRRTE) operating on the node.

@jsquyres
Copy link
Member

jsquyres commented Feb 5, 2020

Comm_spawn does not currently work unless there is a PMIx-enabled RTE (e.g., PRRTE) operating on the node.

If we could get an eventual feature request to fail gracefully -- i.e., with a helpful / show_help-style error message telling the user that COMM_SPAWN failed because they need to run a PMIx-enabled RTE yadda yadda yadda -- that would be awesome.

👍

@rhc54
Copy link
Contributor Author

rhc54 commented Feb 5, 2020

I hope to eventually get it to work again - we need to detect that we are not yet connected and spawn PRRTE. Tricky part will be getting the client to "reconnect" to the newly spawned PRRTE. We have thought of this already, but haven't implemented it yet (though the hook is present).

@rhc54
Copy link
Contributor Author

rhc54 commented Feb 5, 2020

@artemry-mlnx Could you please tell me what is in your "test-amca.conf" file? We seem to be failing your test of the "--tune" option, but I can't see what is in the file and don't know how it is formatted.

@jjhursey
Copy link
Member

jjhursey commented Feb 6, 2020

bot:ibm:retest

@artemry-nv
Copy link

@rhc54
test_amca.conf content may vary but here're possible options:

  • mca_base_env_list=XXX_A=7;XXX_B=8
  • mca_base_env_list=XXX_A=1;XXX_B=2;XXX_C;XXX_D;XXX_E

@rhc54
Copy link
Contributor Author

rhc54 commented Feb 7, 2020

@artemry-mlnx I'm running out of patience playing "whackamole" with the Mellanox CI. I need to know what is in this file:

 echo 'mca_base_env_list=XXX_A=1;XXX_B=2;XXX_C;XXX_D;XXX_E'
++ /__w/1/ompi/ompi_install1/bin/mpirun --mca pml ob1 --mca btl self,vader --np 2 --tune /__w/1/ompi/test_amca.conf /__w/1/jenkins_scripts/jenkins/ompi/env_mpi
++ grep '^XXX_'

If you look at your log, it is at the end where it fails. I have tried multiple times with different file contents, and everything works. I don't know what is in that one. Can you please help?

@rhc54
Copy link
Contributor Author

rhc54 commented Feb 7, 2020

@artemry-mlnx Perhaps it would help if you shared your test script - you have things in it that are no longer being supported. For example:

+ echo 'check if mca_base_envar_file_prefix parameter (a.k.a -tune cmd line option) is supported in /__w/1/ompi/ompi_install1'
check if mca_base_envar_file_prefix parameter (a.k.a -tune cmd line option) is supported in /__w/1/ompi/ompi_install1
++ /__w/1/ompi/ompi_install1/bin/ompi_info --param mca base --level 9
++ grep mca_base_envar_file_prefix
++ wc -l
+ val=1
+ '[' 1 -gt 0 ']'

ompi_info might well show that param, but it isn't present in PRRTE. The problem is that someone implemented it in the MCA base, which was a major mistake. So OPAL thinks it is still valid, but the RTE won't recognize it.

If you provide the script, I'll be happy to update it and return it to you. Meantime, I really do need to know the contents of these "tune" files you are testing.

Alternatively, if you folks don't have time right now, I'm happy to go ahead and commit and we can address these rather unusual options later.

@artemry-nv
Copy link

Will be replaced by PRRTE. Ensure that OMPI and OPAL layers build
without reference to ORTE. Setup opal/pmix framework to be static.
Remove support for all PMI-1 and PMI-2 libraries. Add support for
"external" pmix component as well as internal v4 one.

remove orte: misc fixes

 - UCX fixes
 - VPATH issue
 - oshmem fixes
 - remove useless definition
 - Add PRRTE submodule
 - Get autogen.pl to traverse PRRTE submodule
 - Remove stale orcm reference
 - Configure embedded PRRTE
 - Correctly pass the prefix to PRRTE
 - Correctly set the OMPI_WANT_PRRTE am_conditional
 - Move prrte configuration to the end of OMPI's configure.ac
 - Make mpirun a symlink to prun, when available
 - Fix makedist with --no-orte/--no-prrte option
 - Add a `--no-prrte` option which is the same as the legacy
   `--no-orte` option.
 - Remove embedded PMIx tarball. Replace it with new submodule
   pointing to OpenPMIx master repo's master branch
 - Some cleanup in PRRTE integration and add config summary entry
 - Correctly set the hostname
 - Fix locality
 - Fix singleton operations
 - Fix support for "tune" and "am" options

Signed-off-by: Ralph Castain <rhc@pmix.org>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
@rhc54
Copy link
Contributor Author

rhc54 commented Feb 8, 2020

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rhc54 rhc54 merged commit 3224329 into open-mpi:master Feb 8, 2020
@rhc54 rhc54 deleted the topic/pmixstat branch February 8, 2020 20:25
@amaslenn
Copy link

@rhc54 will you merge it into v4.0.x?

@rhc54
Copy link
Contributor Author

rhc54 commented Feb 10, 2020

No - strictly for OMPI 5 and above.

artemry-nv pushed a commit to artemry-nv/jenkins_scripts that referenced this pull request Feb 24, 2020
The test has been updated for open-mpi/ompi#7202 in scope of mellanox-hpc#92. PR 7202 is not ported to Open MPI 4.0.x.

Signed-off-by: Artem Ryabov <artemry@mellanox.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.