Vader in a Docker Container #4948
Comments
What are those error messages from -- are they from your program? I.e., what exactly are those error messages indicating? FWIW: I do not believe we have any tests -- in configure or otherwise -- to check for non-functional CMA. If we find CMA support, we assume it's working. Vader should work just fine if CMA support is not present -- it will fall back to regular copy-in/copy-out shared memory.
Are the processes within different docker containers? If so, then it's likely CMA is failing because the containers may be in different namespaces. The workaround is to disable CMA on the mpirun command line: --mca btl_vader_single_copy_mechanism=none
Thank you for the details!
No, they originate from Open MPI. I guess from here: https://github.com/open-mpi/ompi/blob/v3.0.0/opal/mca/btl/vader/btl_vader_get.c#L74-L78
No, it's the same container on an Nvidia DGX-1 (low detail datasheet, detailed guide), which has two Xeon packages, in case that is relevant. We will try to debug it hands-on again with Nvidia engineers next week. I was just wondering if the error (see the code lines above) already tells you something that could give me pointers on how to debug CMA (or whether you have runtime CMA tests in place). I see that you define
@ax3l the issue here is that the CMA read fails inside the container. The root cause could be that docker prevents this, and some sysadmin config might be required. From the man page, you might want to manually check that those conditions are met as well. I would suggest you first try to run your app on the host, and then in the container. Disabling CMA might help you; note that falling back to btl/sm might also help you here.
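A quick way to check the ptrace preconditions mentioned above is sketched below. These commands are my own illustration, not from the thread, and assume a Linux host with the Yama LSM and the capsh utility from libcap available.

```shell
# 0 means classic ptrace rules; 1 or higher progressively restricts
# attaching, which can break process_vm_readv/process_vm_writev
# between MPI ranks on the same node.
cat /proc/sys/kernel/yama/ptrace_scope

# Inside the container, check whether the SYS_PTRACE capability is granted:
capsh --print | grep -i ptrace
```

If Yama is restrictive or the capability is missing, CMA calls between ranks will fail with EPERM even though configure detected CMA support at build time.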
@ggouaillardet btl/sm is not needed. The way we are recommending to deal with docker, if the ptrace permissions can't be fixed, is to set OMPI_MCA_btl_vader_single_copy_mechanism=none. That will disable CMA.
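For reference, this workaround can be applied in two equivalent ways (a sketch; the rank count and ./my_app are placeholders, not from the thread):

```shell
# Disable vader's CMA single-copy path; vader falls back to
# copy-in/copy-out shared memory, which works without ptrace permission.

# Option 1: environment variable, inherited by mpirun and the ranks:
export OMPI_MCA_btl_vader_single_copy_mechanism=none
mpirun -n 2 ./my_app

# Option 2: pass the MCA parameter on the mpirun command line instead:
mpirun --mca btl_vader_single_copy_mechanism=none -n 2 ./my_app
```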
I recently upgraded the Open MPI library in our project's Travis CI setup from 2.1.1 to 3.0.2 (and tried 3.1.0), and we were observing the errors above as soon as any communication was performed. Setting OMPI_MCA_btl_vader_single_copy_mechanism=none before launching the jobs, as @hjelmn suggested, seems to have fixed the problem in 3.0.2 for us.
Note that if you want better performance, you want CMA to work. It will only work if all the local MPI processes are in the same namespace. An alternative (I haven't tested this) would be to use xpmem: http://gitlab.com/hjelmn/xpmem (there is a version on github but it will no longer be maintained).
Note that one can allow ptrace permissions by
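The exact command was lost from the comment above. One common way to grant ptrace permission to a container (an assumption on my part, not necessarily what the commenter meant; the image and application names are placeholders) is Docker's --cap-add flag:

```shell
# Grant the SYS_PTRACE capability so process_vm_readv/process_vm_writev
# are permitted between processes inside the same container, which lets
# vader keep using its faster CMA single-copy path.
docker run --cap-add=SYS_PTRACE my-mpi-image mpirun -n 2 ./my_app
```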
Got exactly the same issue with OpenMPI 3.1.3 in Docker. This fixes the problem!
Same problem on OpenMPI 4.0.0 in Docker; disabling vader single copy mechanism as suggested above fixes it. |
not directly relevant for our current 16.04 image but still good to have. See open-mpi/ompi#4948
FWIW, the root cause of this is likely that the container denies the ptrace permission that CMA's process_vm_readv/process_vm_writev calls require.
… Docker container. We add a line in the Dockerfile to set an Open MPI environment variable after everything has been installed. For further details, see [the issue in OpenMPI Github page](open-mpi/ompi#4948)
Saw the problem again today with:
The work-around, as before, is still: export OMPI_MCA_btl_vader_single_copy_mechanism=none
Update Docker container and run instructions to avoid MPI / Docker security conflicts This approach is expedient for this example, but probably not the best approach for production deployments. See separate discussion on this issue open-mpi/ompi#4948.
A few notes on this ticket:
I think this ticket can be closed given the workaround for the v4.x series. The change in PR #10694 should make it so that the workaround is no longer required, as noted above.
…e call OMPI_MCA_btl_vader_single_copy_mechanism is meant to suppress an error message from an incompatibility between btl/vader and docker, see open-mpi/ompi#4948. PARSEC_MCA_runtime_bind_threads is meant to disable thread binding in PaRSEC, potentially speeding up test runs. Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
I'm running into this issue on https://modal.com, and it is causing
What version of OMPI are you using? It's hard to help if you don't provide at least some basic info.
The command I'm running is:
where the hostfile is:
@hppritcha @ggouaillardet Do you folks have any thoughts here? I don't know anything about v4.1, I'm afraid.
This looks like a GPU-within-docker issue. @vedantroy, please open a new issue and give a full description of your problem.
vader does not use
I second @ggouaillardet on opening a separate issue if @bosilca's suggestion doesn't work.
This leads to errors such as [runner-...] Read -1, expected <some number>, errno = 1 in docker, so we disable it. Some more discussion can be found here: open-mpi/ompi#4948
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
OpenMPI 3.0.0 (and 2.1.2 for comparison)
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From source (via spack) inside a docker container.
Please describe the system on which you are running
Details of the problem
Starting MPI with more than one rank will result in errors of the form
as soon as communication is performed (send, receive, reduce, etc.). Simple programs that only contain MPI startup (Init, Get_Rank, Finalize) and shutdown run without issues.
The only way to work around this issue for me was to downgrade to OpenMPI 2.X, which still supports "sm" as a BTL, and to deactivate vader, e.g. with
export OMPI_MCA_btl="^vader"
Is it possible the detection/test of a working CMA is not fully functional? This issue is likely caused by either non-existent or not fully forwarded CMA kernel support inside the docker container. Do you have any recommendations on how to use vader as the in-node BTL in such an environment?