openmpi v5.0.1 #132

regro-cf-autotick-bot

It is very likely that the current package version for this feedstock is out of date.

Checklist before merging this PR:

  • Dependencies have been updated if changed: see upstream
  • Tests have passed
  • Updated license if changed and license_file is packaged

Information about this PR:

  1. Feel free to push to the bot's branch to update this PR if needed.
  2. The bot will almost always only open one PR per version.
  3. The bot will stop issuing PRs if more than 3 version bump PRs generated by the bot are open. If you don't want to package a particular version please close the PR.
  4. If you want these PRs to be merged automatically, make an issue with @conda-forge-admin, please add bot automerge in the title and merge the resulting PR. This command will add our bot automerge feature to your feedstock.
  5. If this PR was opened in error or needs to be updated please add the bot-rerun label to this PR. The bot will close this PR and schedule another one. If you do not have permissions to add this label, you can use the phrase @conda-forge-admin, please rerun bot in a PR comment to have the conda-forge-admin add it for you.

Closes: #131

Pending Dependency Version Updates

Here is a list of all the pending dependency version updates for this repo. Please double check all dependencies before merging.

Name    | Upstream Version | Current Version
openmpi | 5.0.1            | (Anaconda-Server badge)

Dependency Analysis

Please note that this analysis is highly experimental. The aim here is to make maintenance easier by inspecting the package's dependencies. Importantly, this analysis does not support optional dependencies; please double-check those before making changes. If you never want hinting of this kind, add bot: inspection: false to your conda-forge.yml. If you encounter issues with this feature, please ping the bot team conda-forge/bot.

Analysis by source code inspection shows a discrepancy between it and the package's stated requirements in the meta.yaml.

Packages found by source code inspection but not in the meta.yaml:

  • cython

This PR was created by the regro-cf-autotick-bot. The regro-cf-autotick-bot is a service to automatically track the dependency graph, migrate packages, and propose package version updates for conda-forge. Feel free to drop us a line if there are any issues! This PR was generated by https://github.com/regro/cf-scripts/actions/runs/7294847426, please use this URL for debugging.

@conda-forge-webservices

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@leofang leofang closed this Dec 23, 2023
@leofang leofang reopened this Dec 23, 2023
@dalcinl dalcinl marked this pull request as draft December 23, 2023 02:47
@dalcinl

dalcinl commented Dec 23, 2023

@leofang The libnl issue is gone. But now I'm seeing the missing libcuda warnings:

[fv-az1542-577:03743] mca_base_component_repository_open: unable to open mca_accelerator_cuda: libcuda.so.1: cannot open shared object file: No such file or directory (ignored)
[fv-az1542-577:03743] mca_base_component_repository_open: unable to open mca_rcache_gpusm: libcuda.so.1: cannot open shared object file: No such file or directory (ignored)
[fv-az1542-577:03743] mca_base_component_repository_open: unable to open mca_rcache_rgpusm: libcuda.so.1: cannot open shared object file: No such file or directory (ignored)

@jsquyres We need your assistance again. I hoped the warnings above would have been silenced by setting

opal_warn_on_missing_libcuda = 0
opal_cuda_support = 0

in $PREFIX/etc/openmpi-mca-params.conf.
Has the configuration changed? Or is it somehow being ignored?

@dalcinl

dalcinl commented Dec 23, 2023

Additionally, the previous v5.0.0 builds are unusable in CircleCI:

+ mpiexec -n 1 python -m coverage run -m mpi4py.bench --help
--------------------------------------------------------------------------
It looks like prte_init failed for some reason. There are many reasons
that can cause PRRTE to fail during prte_init, some of which are due to
configuration or environment problems.  This failure appears to be an
internal failure - here's some additional information (which may only
be relevant to a PRRTE developer):

  prte_plm_base_select failed
  --> Returned value  (-46) instead of PRTE_SUCCESS
--------------------------------------------------------------------------

I'm clueless about how to move forward.

@jsquyres

@dalcinl Are these issues about Open MPI v5.0.0 or v5.0.1? Also, note that a bunch of people have disappeared for the holidays; we might not get some replies here until people return to the office in January.

First issue: warnings about libcuda

You'll need to set mca_base_component_show_load_errors to 0 to suppress this warning. I'm not sure what warnings you were suppressing with opal_warn_on_missing_libcuda=0.

@janjust @wenduwan Is this how you expected the CUDA-library-is-not-available-everywhere functionality to work? FWIW, I see the MCA var opal_warn_on_missing_libcuda registered, but I do not actually see it used anywhere in the code base.

Second issue: plm / CircleCI

"prte_plm_base_select failed" means that it couldn't find a launcher component to actually invoke your command. Can you add --mca plm_base_verbose 100 to the command line? This will show a bunch of output about the plm selection logic when you run.
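
For reference, a minimal sketch of that diagnostic run, reusing the failing command from the CI log earlier in this thread. It assumes mpiexec is on PATH; the MPIEXEC_ARGS variable name is only illustrative and not part of any recipe.

```shell
# Extra launcher arguments: the verbosity flag suggested above plus the
# original "-n 1" from the failing CI command.
MPIEXEC_ARGS="--mca plm_base_verbose 100 -n 1"

# Only attempt the run where Open MPI is actually installed.
if command -v mpiexec >/dev/null 2>&1; then
  # $MPIEXEC_ARGS is intentionally unquoted so it splits into separate words.
  mpiexec $MPIEXEC_ARGS python -m coverage run -m mpi4py.bench --help \
    || echo "mpiexec exited with status $?"
else
  echo "mpiexec not found; skipping diagnostic run"
fi
```

The verbose output is what matters here, so the sketch reports a non-zero exit status instead of aborting.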

@wenduwan

The mca_base_component_repository_open warning is expected if Open MPI 5.0.0/1 is built --with-cuda but cuda is not available at runtime. This can be suppressed with mca_base_component_show_load_errors=0

I'm actually not sure what opal_warn_on_missing_libcuda does...
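
A minimal sketch of the suggested workaround, assuming the build may write to $PREFIX/etc/openmpi-mca-params.conf (the file mentioned earlier in this thread). The OMPI_MCA_ environment-variable prefix is Open MPI's usual way to set an MCA parameter for a single environment. The temporary-directory fallback for PREFIX is an assumption added only so the snippet runs outside a conda build.

```shell
# PREFIX is normally provided by conda-build; fall back to a scratch
# directory (assumption, for illustration only) when it is unset.
PREFIX="${PREFIX:-$(mktemp -d)}"
conf="$PREFIX/etc/openmpi-mca-params.conf"
mkdir -p "$(dirname "$conf")"

# Option 1: persist the setting in the install-wide MCA parameter file,
# without adding it twice if it is already present.
if ! grep -q '^mca_base_component_show_load_errors' "$conf" 2>/dev/null; then
  printf 'mca_base_component_show_load_errors = 0\n' >> "$conf"
fi

# Option 2: set it for the current shell and its child processes only.
export OMPI_MCA_mca_base_component_show_load_errors=0
```

Note that this silences load errors for every component, not just the CUDA ones, so it may also hide unrelated problems.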

@dalcinl

dalcinl commented Dec 24, 2023

> Are these issues about Open MPI v5.0.0 or v5.0.1?

They come from v5.0.0. We are trying to get a clean working build of v5.0.0 before moving to v5.0.1

> I see the MCA var opal_warn_on_missing_libcuda registered, but I do not actually see it used anywhere in the code base.

That explains it all. This setting should do what it is supposed to do, or be removed for good, or at least emit a deprecation warning or something.

> This can be suppressed with mca_base_component_show_load_errors=0

OK. This is really too generic, but we have to work with what we were given.

> I'm actually not sure what opal_warn_on_missing_libcuda does...

Maybe in v4.1.x it was doing something, but then it got refactored out?

@dalcinl dalcinl mentioned this pull request Dec 24, 2023
@wenduwan

The opal warning is new in 5.0.0 wrt how Open MPI uses CUDA:

  • Before 5, ompi dlopens cuda directly.
  • After 5, CUDA calls are moved into individual opal components (ideally only the accelerator component is supposed to make CUDA calls), and we want those components to be built as DSOs. Therefore ompi only needs to dlopen the DSO components, without making direct CUDA calls itself.
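
To see which component DSOs carry the libcuda dependency described above, something like the following sketch could be used. It assumes the components are installed as mca_*.so files under $PREFIX/lib/openmpi and that ldd is available; list_cuda_components is a hypothetical helper name, not an Open MPI tool.

```shell
# Hypothetical helper: print the component DSOs in a directory that are
# linked against libcuda (whether or not libcuda is resolvable at runtime).
list_cuda_components() {
  dir="$1"
  for so in "$dir"/mca_*.so; do
    # Skip the unexpanded glob when the directory has no components.
    [ -e "$so" ] || continue
    if ldd "$so" 2>/dev/null | grep -q 'libcuda\.so'; then
      basename "$so"
    fi
  done
}

# Assumed component directory; adjust to the actual install layout.
list_cuda_components "${PREFIX:-/usr}/lib/openmpi"
```

On a CUDA-enabled build this would be expected to name the components from the warnings earlier in the thread, but the output depends on how the package was configured.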

> That explains it all. This setting should do what it is supposed to do, or be removed for good, or at least emit a deprecation warning or something.

I agree. We should do something about this.

@dalcinl

dalcinl commented Dec 24, 2023

@conda-forge-admin rerun bot

@conda-forge-webservices conda-forge-webservices bot added the bot-rerun label Dec 24, 2023
@regro-cf-autotick-bot

Due to the bot-rerun label I'm closing this PR. I will make another one as appropriate. This was generated by https://github.com/regro/cf-scripts/actions/runs/7317361884
