Singleton init mode, MPI_Comm_spawn, and OPAL_PREFIX #12349
Comments
I'll take a look at adjusting the path search. Having enough problems with #12307 at the moment.
Sorry! 😢
@dalcinl Just to be clear, is this a problem on the main branch as well?
@wenduwan I discovered this issue while testing with the 5.0.2 tarball. I cannot assert anything about the main branch, but a quick glance at the code seems to indicate that the root issue is also present there.
Adding the main label so we don't ignore it for e.g. 5.1.x
@jsquyres Would Open MPI accept platform-specific ways to derive
All that is needed is to translate
If this is about singleton comm-spawn, then you need to put the translation into the bottom of the
@rhc54 Aren't you oversimplifying things a bit? I know almost nothing about the ompi codebase, but it looks to me that you have to, after the environment var translation you described, somehow replicate the logic of
I wasn't oversimplifying things - just pointing out that this isn't a difficult problem. Sure, you need to update the dpm code to duplicate what is in mpirun, but that's a copy/paste operation. The code in mpirun is actually doing the "right thing" - it's just the dpm that is out of date.
Try #12363 - took longer to compile than to do 🤷♂️
This issue is losing steam.
@wenduwan I believe this issue is already fixed in main, at least in my local testing. I'm still getting some warnings, though, but these are unrelated:
And no, the mpi4py CI tests do not specifically test for this issue. After all the work @jsquyres put into it, adding a new workflow should be quite easy if you really want to have a quick test for relocation via
The warnings are real. @edgargabriel and @lrbison both observed them in singleton mode. I think adding a minimal relocation test is reasonable. IIUC the test will look like the following:
Please correct me if I'm wrong.
Not sure I understand this report. Are you saying running
No, it's the other way around. With mpirun there is no warning, only with singleton.
Thanks - and the warning is still there on head of current main?
I think so, I think I saw it a few days ago. I can double check in a few days and look into it. It's coming from the btl.smcuda component, I think.
You see this smcuda warning any time the MPI process doesn't have a prrte pmix server to talk to.
Yeah, this has nothing to do with moving the installation. I had posted a patch to silence this warning, but it ran into a bunch of nit-picking and so I deleted it. It's a trivial fix to silence it, if someone else wants to try to get it committed.
@dalcinl Thank you very much!
I'm working on producing a binary Open MPI Python wheel to allow for pip install openmpi. I had to add a few hacks here and there to overcome a few issues strictly related to how wheel files are stored, how they are installed in Python virtual environments, the lack of post-install hooks, and the expectation that things work without activating the environment. All this of course requires relocating the Open MPI installation. And at that point I found a minor issue that I don't know how to overcome.
First, a clarification: I'm using internal PMIX and PRRTE. Supposedly, the OPAL_PREFIX env var is all that is needed for things to work out of the box when relocating an Open MPI install. However, I think I came across a corner case. If using singleton init mode, then I believe OPAL_PREFIX is simply ignored, and if tools are not located via $PATH, then things do not work.
Looking at the spawn code, I see a function start_dvm in ompi/dpm/dpm.c. This function start_dvm has the following code:
However, I believe opal_find_absolute_path() does not care at all about the OPAL_PREFIX env var; it only uses PATH, eventually. The comment "find the prte binary using the install_dirs support" is simply not true.
@hppritcha Your input is much appreciated here. I do have a reproducer, but so far it is based on Python and intermediate binary assets I'm generating locally. I hope the description above is enough for you to realize the issue.