Processes returns wrong results on OpenMPI 4.0.4 unless $OMPI_MCA_pml is set to "^ucx" #8321
Comments
It would be useful to know which version of UCX is being used. |
ucx.x86_64 1.8.1-3.fc33 |
Can you provide more detail about exactly what is going wrong, and/or a small example showing the problem? |
I'm the worst -- you provided a gist link with a sample and I missed it. We'll investigate... |
@ksiazekm I'm having trouble reproducing this. First, how many MPI processes are you using when you observe the problem, and what tabsize are you supplying? |
Here you have 30 executions of the code from the previous gist: Test Results |
@hppritcha FWIW, I was able to reproduce the issue on a Fedora 33 virtual machine, using Open MPI and UCX provided by the distro. I was only able to find discarding the |
So far I can't reproduce on power9 + rhel8.2. I tried v4.0.x latest and also the v4.0.4 release tarball. |
hi |
That is a pretty standard Virtualbox VM running on a x86_64 (intel) processor. |
What is the number of CPUs? How much memory? Thank you |
1 GB
|
use
you might have to run it several times to evidence the issue. |
I'm pre-emptively adding a v4.1 label on this issue too -- I'm assuming that the issue exists over there as well. |
I set the blocker label until we have more information. |
@jsquyres @gpaulsen I was only able to reproduce this on a Fedora 33 (virtual) machine, and the workaround is to discard the I did not investigate more (lack of time) and could not reach a conclusion w.r.t. the root cause of this issue (Open MPI? UCX? Fedora? a combination of these?) @opoplawski meanwhile, you might want to skip the
line into |
@jladd-mlnx @hoopoepg Someone from the UCX community needs to look into this. |
we can't reproduce issue on our environment :( |
I'm happy to push a ucx 1.9 update to Fedora 33 if that seems appropriate. Is updating UCX versions okay? There isn't a soname bump. |
@opoplawski thanks Orion. Transferring to Nvidia to let them decide how to deal with this from the Open MPI side. Recommend a configure check to not build UCX pml and osc if the UCX being installed is older than 1.9. @hoopoepg |
@hppritcha Better to use the general "UCX" team name to alert all the relevant people on the UCX side, not just @hoopoepg. FYI: @open-mpi/ucx |
@hppritcha @jsquyres how about having pml/ucx disqualify itself at runtime if the UCX library version is older than v1.9, with a warning and an MCA variable to override this? |
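A rough sketch of the decision logic proposed above, written in Python only to illustrate the idea; the real check would live in C inside pml/ucx. The v1.9 threshold comes from this thread. The override variable name `SKETCH_ALLOW_OLD_UCX` and both function names are hypothetical, not existing Open MPI parameters or APIs.

```python
import os

# Minimum UCX version assumed safe, taken from the v1.9 suggestion in this thread.
MIN_SAFE_UCX = (1, 9)

# Hypothetical override variable; NOT a real Open MPI MCA parameter.
OVERRIDE_ENV = "SKETCH_ALLOW_OLD_UCX"


def parse_version(text):
    """Turn '1.8.1', 'v1.9.0' or '1.8.1-3.fc33' into a tuple of ints."""
    core = text.lstrip("vV").split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def ucx_pml_allowed(runtime_version, env=None):
    """Return (allowed, message) for the UCX version found at run time."""
    env = os.environ if env is None else env
    version = parse_version(runtime_version)
    if version[:2] >= MIN_SAFE_UCX:
        return True, ""
    if env.get(OVERRIDE_ENV) == "1":
        # The user explicitly accepted the risk: warn, but do not disqualify.
        return True, f"warning: UCX {runtime_version} is older than 1.9; override set"
    return False, f"warning: UCX {runtime_version} is older than 1.9; pml/ucx disabled"
```

With the Fedora 33 package version reported earlier, `ucx_pml_allowed("1.8.1-3.fc33", env={})` returns `False` plus a warning, so another PML would be selected unless the override is set.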
If you really want to stop supporting older versions of UCX, you can certainly do that. IMHO: It would be best to have both |
I've submitted ucx 1.9 to Fedora 33 - https://bodhi.fedoraproject.org/updates/FEDORA-2021-613166cadb Please test and provide feedback. |
@ksiazekm Just out of curiosity, are you seeing the same behavior with Open MPI v4.1.0 as well? |
@open-mpi/ucx Can you give a definitive ruling on what you plan to do? I.e., are UCX versions prior to v1.9.0 problematic? |
There is one more issue, #8442, which may be related to this one, and we reproduced it in our environment. We are working on identifying the cause. |
@ksiazekm could you test latest OMPI (4.0.5)+UCX (1.9) available in Fedora 33? thank you |
I can confirm that it works properly with:
|
@open-mpi/ucx just to note there seem to be some significant differences between this issue and #8442 namely this one doesn't seem to be reproducible when using RC UCX_TLS, whereas #8442 seems to be avoidable using sm, tcp. Still they may be related although the conditions under which they appear wrt UCX_TLS are different. |
@ksiazekm could you provide the output of the command `ucx_info -d` from the system where the data corruption happened? Thank you |
Hi @jsquyres, is it still important enough to add a warning given that we are pushing an updated package for Fedora? |
Many people use Open MPI outside of Fedora. If you want to restrict Open MPI to only use UCX >= v1.10, you need to update the configury. Silent data corruption is BAD. |
@jsquyres IMO, a runtime version check would be stricter (in case the user compiles OpenMPI with one UCX version, but runs with another version), WDYT? |
That would be fine as well. But we definitely also like configure-time failures (with helpful explanation messages). E.g., if someone tries to compile with an old/unsupported UCX, it should fail right away during configure (vs. succeeding to configure, build, and install, and then only fail at run time). Make sense? |
@jsquyres yes, that makes sense. just to make sure, it would not be a problem for OpenMPI that we essentially drop support for older UCX versions? |
adding @shamisp as well |
@hoopoepg here you have:
|
@ksiazekm thank you for info. |
@open-mpi/ucx What version of UCX is known to be good -- is it v1.9.0, or the upcoming v1.10.0? (I ask because I see that 1.9.0 is still the current release on https://github.com/openucx/ucx). Given that this is silent data corruption, and per the discussion above, it sounds like we need both a configure-time check and a run-time check to ensure that the UCX PML is running with >= UCX v1.GOOD_VERSION. Is there any progress on this? We need this ASAP to get new Open MPI releases out the door. |
@jsquyres, FYI we documented this critical issue on the front page: https://github.com/openucx/ucx#known-critical-issues |
OpenMPI 4.0.4
Installed from official Fedora repository with use of dnf.
OS: Fedora 33, the latest
gcc 10.2.1 20201125 (Red Hat 10.2.1-9)
AMD FX8350
Some processes randomly return wrong results on the given configuration. When I
tested the same source code on Debian GNU/Linux 7 (wheezy) with OpenMPI 1.4.5,
it returned correct results each time. The test was executed 30 times in a loop
on each configuration. OpenMPI 4.0.4 started to work properly after setting
$OMPI_MCA_pml to "^ucx".
The source code: gist link
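For anyone trying to reproduce this, the 30-run loop and the workaround described above could be scripted roughly as follows. This is a minimal sketch, not the reporter's actual test script: `run_repeatedly` is a hypothetical helper, and it only compares standard output between runs. When no expected output is given it uses the first run as the reference, which is only meaningful if that first run happened to be correct.

```python
import os
import subprocess


def run_repeatedly(cmd, runs=30, expected=None, disable_ucx_pml=False):
    """Run cmd `runs` times; return the indices of runs that failed or
    whose stdout differs from `expected`.

    If expected is None, the first run's stdout is used as the reference.
    """
    env = dict(os.environ)
    if disable_ucx_pml:
        # Workaround from the report: exclude the UCX PML from selection.
        env["OMPI_MCA_pml"] = "^ucx"
    mismatches = []
    for i in range(runs):
        result = subprocess.run(cmd, env=env, capture_output=True, text=True)
        if expected is None:
            expected = result.stdout
        if result.returncode != 0 or result.stdout != expected:
            mismatches.append(i)
    return mismatches
```

A call might look like `run_repeatedly(["mpirun", "-np", "4", "./a.out", "1000"], disable_ucx_pml=True)`, where the binary name, process count and tabsize are placeholders. The same workaround can be passed on the command line as `mpirun --mca pml ^ucx ...`.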