openmpi v5.0.0 #128
Conversation
… conda-forge-pinning 2023.10.26.11.34.06
Hi! This is the friendly automated conda-forge-linting service. I just wanted to let you know that I linted all conda-recipes in your PR and found it was in an excellent condition.
This week is full of MPI 😂 @dalcinl Since you've been testing upstream Open MPI, is there anything we should know before merging? cc @jakirkham @pentschev for visibility
Ahhahahaha... did you truly expect it was going to be that easy?
@leofang I can take care of this update. However, I believe we should have a branch v4.1.x to track updates to the v4.1 series. I'm not really sure how that process works, but if you know, could you please take care of it or ask someone for pointers?
Yes, it's easy: just create a new branch (say, v4.1.x). What would happen is that the CI would run on that branch too, but at package upload time the upload would be blocked, because nothing changed in the recipe and the package hashes collide with existing ones, so we're safe.
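The branch setup being discussed can be sketched with plain git. A minimal local demonstration follows; the repo path is a throwaway placeholder, and only the branch name v4.1.x comes from the discussion above (on the real feedstock you would branch from main and push to origin):

```shell
# Minimal sketch: create a long-lived maintenance branch.
# Demonstrated on a throwaway local repo, not the real feedstock.
git init -q /tmp/demo-feedstock
cd /tmp/demo-feedstock
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"
git branch v4.1.x          # branch that would track the v4.1 series
git branch --list 'v4.1*'  # lists the new branch
```

CI providers generally build all branches of a feedstock, which is why the upload-time hash collision (rather than branch configuration) is what protects against duplicate packages.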
@leofang This feedstock has become a dumpster. The build scripts have hundreds of hacks accumulated over the years. Maybe this is a good opportunity for a cleanup, but it could be a long rodeo.
I think a big part of it was to enable cross-compiling for osx-arm (#69 (comment)), but TBH I feel they should be removed if they can be autogen'd. IIRC @erykoff suggested these changes back then.
I just did a local build from the release tarball. I'm getting the following:
@jsquyres This looks awfully wrong. Why would the ofi and ucx libraries get injected into libmpi.so?
Perhaps users are now expected to pass some configure option to avoid this?
Force-pushed from 57b86e6 to 9b4ea55
@dalcinl This is probably expected. We made a change from v4.x to v5.x in that all plugins are now -- by default -- slurped into their respective libraries (vs. being built as DSOs). This has turned out to be fairly important for scalability (vs. hammering on a shared/network filesystem by opening many DSO files at startup). For example, both of these were built with Libfabric support:

```
# For Open MPI v4.1.6
root@acf210f5a701:/outside/openmpi-5.0.0# ldd /tmp/bogus-4.1/lib/libmpi.so
        linux-vdso.so.1 (0x0000ffffae54a000)
        libopen-rte.so.40 => /tmp/bogus-4.1/lib/libopen-rte.so.40 (0x0000ffffae2f0000)
        libopen-pal.so.40 => /tmp/bogus-4.1/lib/libopen-pal.so.40 (0x0000ffffae1d0000)
        libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffffae130000)
        libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffffadf80000)
        /lib/ld-linux-aarch64.so.1 (0x0000ffffae511000)

# For Open MPI v5.0.0
root@acf210f5a701:/outside/openmpi-5.0.0# ldd /tmp/bogus-5.0/lib/libmpi.so
        linux-vdso.so.1 (0x0000ffff9edea000)
        libopen-pal.so.80 => /tmp/bogus-5.0/lib/libopen-pal.so.80 (0x0000ffff9e940000)
        libfabric.so.1 => /tmp/bogus/lib/libfabric.so.1 (0x0000ffff9e820000)
        libpmix.so.2 => /tmp/bogus-5.0/lib/libpmix.so.2 (0x0000ffff9e610000)
        libevent_core-2.1.so.7 => /tmp/bogus-5.0/lib/libevent_core-2.1.so.7 (0x0000ffff9e5c0000)
        libevent_pthreads-2.1.so.7 => /tmp/bogus-5.0/lib/libevent_pthreads-2.1.so.7 (0x0000ffff9e5a0000)
        libhwloc.so.15 => /tmp/bogus-5.0/lib/libhwloc.so.15 (0x0000ffff9e530000)
        libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffff9e490000)
        libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffff9e2e0000)
        /lib/ld-linux-aarch64.so.1 (0x0000ffff9edb1000)
        libatomic.so.1 => /lib/aarch64-linux-gnu/libatomic.so.1 (0x0000ffff9e2c0000)
```

That being said, I could have sworn that we documented this change somewhere, specifically because it could cause some surprise to some people (e.g., packagers). All I can find easily is https://docs.open-mpi.org/en/v5.0.x/installing-open-mpi/packagers.html#components-plugins-dso-or-no, where we mention that building components as DSOs is not the default -- but it doesn't specifically mention that this is a change compared to v4.x. ...ah, if you go to https://docs.open-mpi.org/en/v5.0.x/installing-open-mpi/configure-cli-options/installation.html#installation-options and scroll all the way down, it is mentioned there.

Regardless, all of this means that if you'd like to restore the v4.x behavior of building DSOs, and therefore not have many external dependencies in libmpi.so, you can build the components as DSOs again:

```
root@acf210f5a701:/outside/openmpi-5.0.0# ldd /tmp/bogus-5.0/lib/libmpi.so
        linux-vdso.so.1 (0x0000ffff9bcb8000)
        libopen-pal.so.80 => /tmp/bogus-5.0/lib/libopen-pal.so.80 (0x0000ffff9ba00000)
        libpmix.so.2 => /tmp/bogus-5.0/lib/libpmix.so.2 (0x0000ffff9b870000)
        libevent_core-2.1.so.7 => /tmp/bogus-5.0/lib/libevent_core-2.1.so.7 (0x0000ffff9b820000)
        libevent_pthreads-2.1.so.7 => /tmp/bogus-5.0/lib/libevent_pthreads-2.1.so.7 (0x0000ffff9b800000)
        libhwloc.so.15 => /tmp/bogus-5.0/lib/libhwloc.so.15 (0x0000ffff9b790000)
        libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000ffff9b6f0000)
        libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffff9b540000)
        /lib/ld-linux-aarch64.so.1 (0x0000ffff9bc7f000)
```

Note that libfabric.so.1 no longer shows up. Does that make sense?
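For reference, a hedged sketch of what restoring the DSO behavior could look like at configure time. The prefix paths reuse the hypothetical /tmp/bogus locations from the ldd output above, and `--enable-mca-dso` is the option described in the Open MPI packagers documentation linked in this thread; check `./configure --help` on your source tree before relying on it:

```shell
# Sketch (not from the thread): build Open MPI components as DSOs again,
# so that libmpi.so does not link libfabric/UCX directly.
./configure --prefix=/tmp/bogus-5.0 \
            --with-libfabric=/tmp/bogus \
            --enable-mca-dso        # build components as loadable plugins
make -j install

# Afterwards, libfabric should no longer be a direct dependency:
ldd /tmp/bogus-5.0/lib/libmpi.so | grep libfabric || echo "not linked directly"
```

The trade-off is the one described above: DSO builds keep the core libraries lean, while the v5.x default of slurping plugins in avoids opening many plugin files at startup on shared filesystems.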
OK. Honestly, I did not read the documentation for v5.0.x in full, so I was not aware of these changes. HOWEVER, I still think things are a bit off. Look here:
Are the explicit dependencies (DT_NEEDED) on UCX, libevent, PSM2, etc. really needed? Or would it be enough for libopen-pal to pull them in? Anyway, maybe my complaint is ultimately inconsequential.
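One way to see the distinction being asked about (a generic ELF sketch, not something from the thread): `ldd` resolves dependencies recursively, while the dynamic section of the file lists only the DT_NEEDED entries the linker recorded directly in it. `/bin/sh` is used below purely as a stand-in; in practice you would point this at libmpi.so:

```shell
# Show only direct DT_NEEDED entries, as opposed to ldd, which also
# resolves transitive dependencies of the dependencies.
# /bin/sh is a stand-in here; use your installed libmpi.so in practice.
readelf -d /bin/sh | grep NEEDED
```

On a glibc-based Linux system the output includes at least a `(NEEDED) Shared library: [libc.so.6]` line; a library that appears in `ldd` output but not here is only a transitive dependency.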
@rhc54 Here is what I got from a local run. As the build is happening within a container, I guess this is just a missing ssh executable?
@rhc54 After playing a bit on my machine, it looks like the failure comes from prte finding no ssh/rsh launcher in the environment.
Not really - PRRTE is designed to be a DVM. Pretty poor DVM if there is no way to launch across nodes 😄 You also have to admit that it's a pretty sad cluster that doesn't even include ssh. My concern would be that allowing the DVM to have no way of launching across nodes is inviting trouble for the "real world" situation when there is a true configuration issue. Seen it before: someone complains about having spent many hours trying to figure out why the cluster won't start, only to discover that they cannot launch across nodes (for whatever reason). Perhaps some user-required directive could be added for that case. Meantime, just add the ssh package to the build environment?
Definitely. I just asked to be sure that's the intended behavior. However, maybe the failure could be delayed to the point where the ssh/rsh command is actually needed? Anyway, this is a minor nit.
Even better, I found a very classy workaround: that way, the failure only occurs if/when the launcher is actually needed.
@dalcinl Well, shoot. Let me know what else you need to do to see if our upstream fix was not correct (or not complete).
Force-pushed from d443a3e to eb23fc5
@jsquyres I left a comment in the PR. My guess is that all the MCA plugins that use CUDA may need a similar fix, not just the accelerator plugin. But you guys know your things better. For the time being, I can get this conda-forge update going via a global workaround in the recipe.
@dalcinl I think we'll have this fixed for v5.0.1; sorry for the hassle. I think your workaround for the recipe here is fine.
@leofang @jakirkham I'm done with this. Is there anything left?
@conda-forge-admin rerender |
Hi! This is the friendly automated conda-forge-webservice. I tried to rerender for you, but it looks like there was nothing to do. This message was generated by GitHub actions workflow run https://github.com/conda-forge/openmpi-feedstock/actions/runs/7129333418.
@leofang The ppc64le builds have been timing out consistently. How should we move forward?
If this keeps happening, we'll have no choice but to disable ppc...
LGTM. Tested locally:

- Download the linux-64 artifact and unpack the zip
- Create & activate a new env, with Open MPI 5 installed:

```
conda create -n my_env -c path/to/build_artifacts -c conda-forge python=3.10 openmpi=5 cupy gcc
conda activate my_env
```

- Clone the mpi4py source and do `pip install -v .`
- Test CUDA awareness through CuPy:

```
mpirun -n 2 --mca pml ^ucx --mca osc ^ucx --mca opal_cuda_support 1 python -m mpi4py.bench pingpong -a cupy   # works
mpirun -n 2 --mca pml ucx --mca osc ucx python -m mpi4py.bench pingpong -a cupy                               # works
mpirun -n 2 --mca opal_cuda_support 1 python -m mpi4py.bench pingpong -a cupy                                 # works
```
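A side note on the `--mca` flags used in those tests (standard Open MPI behavior, not something stated in this thread): each MCA parameter can equivalently be supplied as an `OMPI_MCA_<param>` environment variable, which is the usual mechanism for applying package-level defaults rather than per-invocation flags. A sketch of the equivalence, reusing the pingpong command from above:

```shell
# These two invocations set the same MCA parameter; the first does it
# per-run on the command line, the second via the environment.
mpirun -n 2 --mca opal_cuda_support 1 python -m mpi4py.bench pingpong -a cupy

export OMPI_MCA_opal_cuda_support=1
mpirun -n 2 python -m mpi4py.bench pingpong -a cupy
```

This is why a recipe-level "global" workaround is possible at all: an activation script can export the variable without patching every launch command.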
@dalcinl Looks like ppc is behaving today, merge?
Just glancing at openmpi-feedstock/conda-forge.yml (lines 1 to 3 and lines 10 to 12 at 445ff31). Probably the long-term sustainable answer is cross-compiling. Maybe we can look into that in the new year?
… conda-forge-pinning 2023.12.21.09.51.46
Force-pushed from ef5c6ed to 5b6c074
@leofang I'll remove artifact upload, and then merge. If ppc64 fails, I'll keep restarting the build until I get lucky.
@jakirkham Definitely. It would be a pity not to test the aarch64/ppc64 builds, but perhaps we can set up some post-mortem testing after the package is uploaded, just to make sure the thing actually works.
Thanks for the heads-up, Jeff! Our bot also informed us about the update (#131), but we need to get this PR in first, due to the infra update to support v5.x.
It is very likely that the current package version for this feedstock is out of date.

Checklist before merging this PR:
- license_file is packaged

Information about this PR:
- If you want these PRs to be merged automatically, make an issue with `@conda-forge-admin, please add bot automerge` in the title and merge the resulting PR. This command will add our bot automerge feature to your feedstock.
- If this PR needs to be updated, please add the `bot-rerun` label to this PR. The bot will close this PR and schedule another one. If you do not have permissions to add this label, you can use the phrase `@conda-forge-admin, please rerun bot` in a PR comment to have the `conda-forge-admin` add it for you.

Pending Dependency Version Updates

Here is a list of all the pending dependency version updates for this repo. Please double check all dependencies before merging.

Dependency Analysis

Please note that this analysis is highly experimental. The aim here is to make maintenance easier by inspecting the package's dependencies. Importantly, this analysis does not support optional dependencies; please double check those before making changes. If you do not want hinting of this kind, please add `bot: inspection: false` to your `conda-forge.yml`. If you encounter issues with this feature, please ping the bot team `conda-forge/bot`.

Analysis by source code inspection shows a discrepancy between it and the package's stated requirements in the meta.yaml. Packages found by source code inspection but not in the meta.yaml:

This PR was created by the regro-cf-autotick-bot. The regro-cf-autotick-bot is a service to automatically track the dependency graph, migrate packages, and propose package version updates for conda-forge. Feel free to drop us a line if there are any issues! This PR was generated by https://github.com/regro/cf-scripts/actions/runs/6655727034; please use this URL for debugging.