Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not use CMA in user namespaces #6844

Merged
merged 1 commit into from
Sep 21, 2019

Conversation

adrianreber
Copy link
Member

Trying out to run processes via mpirun in Podman containers has shown that the CMA btl_vader_single_copy_mechanism does not work when user namespaces are involved.

Creating containers with Podman requires at least user namespaces to be able to do unprivileged mounts in a container

Even if running the container with user namespace user ID mappings which result in the same user ID on the inside and outside of all involved containers, the check in the kernel to allow ptrace (and thus process_vm_{read,write}v()), fails if the same IDs are not in the same user namespace.

One workaround is to specify '--mca btl_vader_single_copy_mechanism none' and this commit adds code to automatically skip CMA if user namespaces are detected.

Preferred implementation would have been to detect if the other local processes are running in different user namespaces, but it was not clear how get the PIDs of the other involved processes in
mca_btl_vader_check_single_copy(). This is even more complicated if some processes would be running in the same user namespace, but not all of them. If one different user namespace is detected, CMA should be disabled for all involved processes. So if one local process detects that CMA is not working it would need to communicate this information to all local processes.

This implementation now checks during the first access of mca_btl_vader_{put,get}_cma() if the destination process is running in another user namespace and switches to MCA_BTL_VADER_EMUL if this is true.

So if the first access to process_vm_{read,write}v()) fails all further accesses are automatically no longer trying to use CMA.

@rhc54
Copy link
Contributor

rhc54 commented Jul 28, 2019

I'm not familiar with Podman. Are you saying that at time of launch, you have no idea that the containers are going to wind up in different user namespaces? That the only way to know is to search the /proc file system to discover what the other containers did? Are the containers setting the namespace upon exec - or is the kernel setting the namespace upon fork?

Assuming that situation is true (and I sincerely hope it isn't): Given that you are using mpirun to start the containers, and that mpirun knows the pid of each proc once it does the fork, and assuming it is the kernel setting the namespace, I'm wondering if it wouldn't be more efficient to have mpirun:

  • fork all the local procs, but don't exec them yet
  • check the namespace of each pid it just generated
  • add a PMIx attribute to the local job info indicating if different namespaces are in operation
  • exec the procs

If the container is setting the namespace at exec, then the only other option would be to modex_send the namespace of the process during vader component open, and have each process modex_recv the namespace of all local procs during vader_add_procs. This would remove the performance penalty on first access this PR introduces and ensure that all local procs disable CMA if needed.

We have found in other scenarios that having every local process access /proc really impacts scalability of launch, so avoiding that method is highly preferable.

@adrianreber
Copy link
Member Author

I'm not familiar with Podman.

Podman uses, just like Docker, runc to run the containers. runc is an implementation of the OCI spec. runc, as far as I understand it, does first a clone() and then an unshare() for the namespace setup.

Are you saying that at time of launch, you have no idea that the containers are going to wind up in different user namespaces?

Well, if I know that I am running Podman, I probably could assume that each process will run in its own namespace. There is, however, an option to run different containers in the same user namespace. So just by saying it is a Podman container it will be in different user namespaces is not correct. There are also other container runtimes which are using user namespaces (for namespace based container user namespaces are basically required to be able to mount something in the container), so you would need to detect those also.

That the only way to know is to search the /proc file system to discover what the other containers did?

As far as I understand it, yes, looking at /proc is the only way. I will ask someone more familiar to also comment here.

Are the containers setting the namespace upon exec - or is the kernel setting the namespace upon fork?

The namespaces are set up by runc, which is started by Podman (Podman actually starts conmon and conmon starts runc). There is a long comment how this is done at https://github.com/opencontainers/runc/blob/master/libcontainer/nsenter/nsexec.c#L645

Assuming that situation is true (and I sincerely hope it isn't): Given that you are using mpirun to start the containers, and that mpirun knows the pid of each proc once it does the fork, and assuming it is the kernel setting the namespace, I'm wondering if it wouldn't be more efficient to have mpirun:

* fork all the local procs, but don't exec them yet

* check the namespace of each pid it just generated

* add a PMIx attribute to the local job info indicating if different namespaces are in operation

* exec the procs

I am pretty sure this will not work.

If the container is setting the namespace at exec, then the only other option would be to modex_send the namespace of the process during vader component open, and have each process modex_recv the namespace of all local procs during vader_add_procs. This would remove the performance penalty on first access this PR introduces and ensure that all local procs disable CMA if needed.

I think I tried to do something like this, but I did not manage to do the modex_send and modex_recv correctly. I can try to come a with a patch to do this, but last time I got really confused about how modex works. Maybe with this description and the clarification from the mailing list I can try it once more.

We have found in other scenarios that having every local process access /proc really impacts scalability of launch, so avoiding that method is highly preferable.

Understood.

@bwbarrett
Copy link
Member

I agree with @rhc54; publishing the namespace id as part of modex send/recv seems like a much better idea.

If CMA is explicitly requested, the result of not being in the same namespace (and therefore not using CMA) should be a hard error, not a warning (ie, not the cma-different-user-namespace warning). I could be reading the code wrong, but it appears that the patch would not fail in this situation.

@adrianreber
Copy link
Member Author

If CMA is explicitly requested, the result of not being in the same namespace (and therefore not using CMA) should be a hard error, not a warning (ie, not the cma-different-user-namespace warning). I could be reading the code wrong, but it appears that the patch would not fail in this situation.

I implemented the same behaviour as requesting cma in a non-namespaced environment. vader will fall back to another single-copy mechanism. No hard error.

Just to make sure, before looking into @rhc54 recommendation, how it should fail. Same behaviour as the current cma code, or hard error if running in user namespaces?

@jsquyres
Copy link
Member

@adrianreber We have a general rule of thumb in Open MPI: if a human asks for something and we can't deliver it, that's a hard error / exit. If Vader isn't obeying that in some cases, that's actually a bug that should be fixed.

@adrianreber
Copy link
Member Author

@jsquyres Perfect, thanks.

@adrianreber
Copy link
Member Author

I force pushed a new version based on recv() and send().

The only thing missing is the hard error if the user explicitly selected cma. I am not sure where to track the user selected single_copy_mechanism. There is mca_btl_vader_component.single_copy_mechanism which can change its value.

Should I just introduce a new field in struct mca_btl_vader_component_t to track the user selected single_copy_mechanism? The problem I see is, how can I differentiate if the user has not selected anything and vader is just using the first single_copy_mechanism (which can be CMA), or if the user has actually selected CMA. Is it possible to see if the default value is used or some specific option has been selected.

hjelmn
hjelmn previously requested changes Jul 30, 2019
opal/mca/btl/vader/btl_vader.h Show resolved Hide resolved
Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to look at mca_base_var_get_value() to get the source of the MCA variable value -- e.g., whether it was just the default value, or whether a user actually set that value.

opal/mca/btl/vader/btl_vader_component.c Outdated Show resolved Hide resolved
@adrianreber
Copy link
Member Author

You should be able to look at mca_base_var_get_value() to get the source of the MCA variable value -- e.g., whether it was just the default value, or whether a user actually set that value.

Thanks. I included some code based on this to OPAL_ERROR if CMA is not selected as part of MCA_BASE_VAR_SOURCE_DEFAULT.

I hope I addressed all open points.

@jsquyres
Copy link
Member

False cray failure: bot:ompi:retest

@ggouaillardet
Copy link
Contributor

@adrianreber you can simply get the namespace id via a stat() as in the inline patch below

can you also please explain why you had to add struct vader_modex_other_t with an int seg_ds_size and move seg_ds into that struct ?

diff --git a/opal/mca/btl/vader/btl_vader_component.c b/opal/mca/btl/vader/btl_vader_component.c
index dc299b1..922658c 100644
--- a/opal/mca/btl/vader/btl_vader_component.c
+++ b/opal/mca/btl/vader/btl_vader_component.c
@@ -41,6 +41,9 @@
 #include "btl_vader_fbox.h"
 #include "btl_vader_xpmem.h"
 
+#ifdef HAVE_SYS_STAT_H
+#include <sys/stat.h>
+#endif
 #include <sys/mman.h>
 #include <fcntl.h>
 
@@ -351,47 +354,25 @@ static int mca_btl_vader_component_close(void)
 }
 
 /*
- * mca_btl_vader_parse_proc_ns_user() tries to parse the user namespace ID
+ * mca_btl_vader_parse_proc_ns_user() tries to get the user namespace ID
  * of the current process.
  * Returns the ID of the user namespace. In the case of an error '0' is returned.
  */
 uint64_t mca_btl_vader_parse_proc_ns_user(void)
 {
-    char *link = malloc(PATH_MAX);
-    pid_t pid = getpid();
-    uint64_t user_ns_id;
-    char *tmp;
-    int i;
-
-    i = readlink("/proc/self/ns/user", link, PATH_MAX);
-    if (-1 == i) {
-        return 0;
-    }
-
-    /*
-     * Result in link should look like 'user:[<inode-number>]', so at least
-     * 8 characters long: 'user:[?]'.
-     */
-    if (8 > i) {
-        free(link);
-        opal_output(0, "Error reading user namespace ID of process %d\n", pid);
-        return 0;
-    }
+    struct stat buf;
 
-    /* remove trailing ']' */
-    link[i - 1] = '\0';
-    tmp = strchr(link, '[');
-    if (NULL == tmp) {
-        free(link);
-        opal_output(0, "Error reading user namespace ID of process %d\n", pid);
+    if (0 > stat("/proc/self/ns/user", &buf)) {
+        /*
+         * Something went wrong, probably an old kernel that does not support namespaces
+         * simply assume all processes are in the same user namespace and return 0
+         */
         return 0;
     }
 
-    user_ns_id = strtoul(tmp + 1, NULL, 10);
-    free(link);
-
-    return user_ns_id;
+    return (uint64_t)buf.st_ino;
 }
+
 static int mca_btl_base_vader_modex_send (void)
 {
     union vader_modex_t modex;

@adrianreber
Copy link
Member Author

@ggouaillardet

you can simply get the namespace id via a stat() as in the inline patch below

Are you sure? Using the command-line tool stat on /proc/self/ns/user this is not true. I assumed that the inode given by stat is the same as with stat(2).

@ggouaillardet
Copy link
Contributor

@adrianreber That is why assumptions should be avoided :-)

this is a counter-intuitive, but stat(1) (e.g. CLI) uses lstat(2) and not stat(2)

from the CLI, you have to

stat -L /proc/self/ns/user

that uses stat(2), and use glibc's stat() in your C code.

@adrianreber
Copy link
Member Author

@ggouaillardet

can you also please explain why you had to add struct vader_modex_other_t with an int seg_ds_size and move seg_ds into that struct ?

I cannot really explain it. vader_modex_t was a union:

union vader_modex_t {
#if OPAL_BTL_VADER_HAVE_XPMEM
    struct vader_modex_xpmem_t {
        xpmem_segid_t seg_id;
        void *segment_base;
    } xpmem;
#endif 
    opal_shmem_ds_t seg_ds;
}

So I thought to extend the non xpmem part I have to create a struct just like vader_modex_xpmem_t. Do you have any other recommendations how to do it better?

@adrianreber
Copy link
Member Author

@ggouaillardet Thanks for the simplification. Changed it and force pushed. As this removed all occurrences of opal_output() this should also solve the questions from @jsquyres.

@jjhursey
Copy link
Member

bot:ibm:retest

@open-mpi open-mpi deleted a comment from ibm-ompi Jul 31, 2019
@open-mpi open-mpi deleted a comment from ibm-ompi Jul 31, 2019
@hjelmn
Copy link
Member

hjelmn commented Jul 31, 2019

It is a union because either Vader is using xpmem or or isn't. Do you know if xpmem works between namespaces? If not then additional changes are needed.

@adrianreber
Copy link
Member Author

Do you know if xpmem works between namespaces?

Unfortunately I know nothing about xpmem. The vader code is probably the first time I read about it.

@hppritcha
Copy link
Member

bot:ompi:retest

@adrianreber
Copy link
Member Author

Are there any more changes required from my side? Not sure if all discussions are resolved or not.

@adrianreber
Copy link
Member Author

Any more comments, reviews, suggestions? Any chance to get this merged?

@jsquyres
Copy link
Member

jsquyres commented Sep 5, 2019

Do you know if XPMEM is broken due to this PR? Or was it already broken on master?

@rhc54
Copy link
Contributor

rhc54 commented Sep 5, 2019

Per @hppritcha note: using master or using this PR

@hppritcha
Copy link
Member

appears to have been broken already on master.

@jsquyres
Copy link
Member

jsquyres commented Sep 5, 2019

Per @hppritcha note: using master or using this PR

Heh. Missed that. Ok.

How do we move forward with this PR then? Should we just go ahead and merge it? Or do we need to wait for an XPMEM fix (independent of this PR) first?

@hppritcha
Copy link
Member

I'd go ahead and merge it. But wait, have you actually seen what the output looks like if this feature kicks in?
I've asked charliecloud people to post what they see. I'd suggest we wait till that is posted, you may not like it @jsquyres

@jsquyres
Copy link
Member

jsquyres commented Sep 5, 2019

@hppritcha No problem with waiting for more data -- I just wanted to make sure this PR didn't get dropped / forgotten again...

@adrianreber
Copy link
Member Author

Ah, good to know, I though my change broke xpmem. So xpmem is also not working on master. I was already looking at code.

Definitely interested to hear if this also works for charliecloud.

@heasterday
Copy link
Contributor

Hello all I am one of the Charliecloud folks @hppritcha mentioned. The fallback seems to work as intended so all looks good on that end. However, when "OMPI_MCA_btl_vader_single_copy_mechanism=cma" is specified the application will print the fatal error and continue to run.

ERROR: Linux kernel CMA support was requested via the
btl_vader_single_copy_mechanism MCA variable, but CMA support is
not available due to different user namespaces.

Your MPI job will abort now. Please select another value for
btl_vader_single_copy_mechanism.

  Local host: ko034
#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 2018 Update 1, MPI-1 part    
#------------------------------------------------------------
# Date                  : Thu Sep  5 15:04:04 2019
# Machine               : x86_64
# System                : Linux
<application output clipped>

The resulting performance is what I would get if I specified OMPI_MCA_btl=tcp,self. I'm not sure how often anyone would do this but printing a fatal error and then continuing to run seems unintended.

@heasterday
Copy link
Contributor

Thank you to @adrianreber for working on this, this will be nice functionality to have :)

@adrianreber
Copy link
Member Author

I have also seen that the ring example I have been using continues to run if a fatal error happens. I never really looked but I thought that it just ignores the error it gets from Open MPI. In ring.c there are MPI_Send and MPI_Recv without ever checking the return code. (Actually not sure if that is necessary/correct, but that is what I would expect).

@rhc54
Copy link
Contributor

rhc54 commented Sep 5, 2019

OMPI policy is to indeed abort if the user explicitly requests something we cannot do - so in the case cited by @heasterday, we should definitely abort.

@adrianreber
Copy link
Member Author

OMPI policy is to indeed abort if the user explicitly requests something we cannot do - so in the case cited by @heasterday, we should definitely abort.

Is returning OPAL_ERROR not enough to abort? OPAL_ERROR is what I saw at all the places I looked how to handle an error. What is the right way to abort?

@rhc54
Copy link
Contributor

rhc54 commented Sep 5, 2019

Well, looks pretty simple - from a different btl:

void mca_btl_ofi_exit(void)
{
    BTL_ERROR(("BTL OFI will now abort."));
    exit(1);
}

@adrianreber
Copy link
Member Author

Okay, will add exit() for the abort cases.

@rhc54
Copy link
Contributor

rhc54 commented Sep 5, 2019

Thx!

@jsquyres
Copy link
Member

jsquyres commented Sep 5, 2019

A better example would be to look at the usNIC BTL utility function to exit upon error -- it will call the PML callback, if it exists / has been set:

void opal_btl_usnic_exit(opal_btl_usnic_module_t *module)
{
if (NULL == module) {
/* Find the first module with an error callback */
for (int i = 0; i < mca_btl_usnic_component.num_modules; ++i) {
if (NULL != mca_btl_usnic_component.usnic_active_modules &&
NULL != mca_btl_usnic_component.usnic_active_modules[i] &&
NULL != mca_btl_usnic_component.usnic_active_modules[i]->pml_error_callback) {
module = mca_btl_usnic_component.usnic_active_modules[i];
break;
}
}
/* If we didn't find a PML error callback, just exit. */
if (NULL == module) {
fprintf(stderr, "*** The Open MPI usnic BTL is aborting the MPI job (via exit(3)).\n");
fflush(stderr);
exit(1);
}
}
/* After discussion with George, we decided that it was safe to
cast away the const from opal_proc_local_get() -- the error
function needs to be smart enough to not take certain actions
if the passed proc is yourself (e.g., don't call del_procs() on
yourself). */
if (NULL != module->pml_error_callback) {
module->pml_error_callback(&module->super,
MCA_BTL_ERROR_FLAGS_FATAL,
(opal_proc_t*) opal_proc_local_get(),
"The usnic BTL is aborting the MPI job (via PML error callback).");
}
/* If the PML error callback returns (or if there wasn't one),
just exit. Shrug. */
exit(1);
}

@adrianreber
Copy link
Member Author

adrianreber commented Sep 5, 2019

@jsquyres something like that?

diff --git a/opal/mca/btl/vader/btl_vader_module.c b/opal/mca/btl/vader/btl_vader_module.c
index 55b4726340..17381ba17d 100644
--- a/opal/mca/btl/vader/btl_vader_module.c
+++ b/opal/mca/btl/vader/btl_vader_module.c
@@ -80,6 +80,30 @@ mca_btl_vader_t mca_btl_vader = {
     }
 };
 
+// Exit function copied from btl_usnic_util.c
+// The following comment tells Coverity that this function does not return.
+// See https://scan.coverity.com/tune.
+
+/* coverity[+kill] */
+static void vader_btl_exit(mca_btl_vader_t *btl)
+{
+    if (NULL == btl) {
+        fprintf(stderr, "*** The Open MPI vader BTL is aborting the MPI job (via exit(3)).\n");
+        fflush(stderr);
+        exit(1);
+    }
+
+    if (NULL != btl->error_cb) {
+        btl->error_cb(&btl->super, MCA_BTL_ERROR_FLAGS_FATAL,
+                      (opal_proc_t*) opal_proc_local_get(),
+                      "The vader BTL is aborting the MPI job (via PML error callback).");
+    }
+
+    /* If the PML error callback returns (or if there wasn't one),
+       just exit.  Shrug. */
+    exit(1);
+}
+
 static int vader_btl_first_time_init(mca_btl_vader_t *vader_btl, int n)
 {
     mca_btl_vader_component_t *component = &mca_btl_vader_component;
@@ -236,7 +260,7 @@ static int init_vader_endpoint (struct mca_btl_base_endpoint_t *ep, struct opal_
                         /* If CMA has been explicitly selected we want to error out */
                         opal_show_help("help-btl-vader.txt", "cma-different-user-namespace-error",
                                        true, opal_process_info.nodename);
-                        return OPAL_ERROR;
+                        vader_btl_exit(&mca_btl_vader);
                     }
                     /*
                      * If CMA has been selected because it is the default or

@jsquyres
Copy link
Member

jsquyres commented Sep 5, 2019

@adrianreber Minor suggestion (and I should probably do something like this to the usnic BTL, too...):

static void vader_btl_exit(mca_btl_vader_t *btl)
{
    if (NULL != btl && NULL != btl->error_cb) {
        btl->error_cb(&btl->super, MCA_BTL_ERROR_FLAGS_FATAL,
                      (opal_proc_t*) opal_proc_local_get(),
                      "The vader BTL is aborting the MPI job (via PML error callback).");
    }

    /* If the PML error callback returns (or if there wasn't one),
       just exit.  Shrug. */
    fprintf(stderr, "*** The Open MPI vader BTL is aborting the MPI job (via exit(3)).\n");
    fflush(stderr);
    exit(1);
}

Also, github pro tip: if you put the word "patch" after the 3 tick marks for the verbatim section, github syntax highlights it as a diff. Similarly, putting "c" after the 3 tick marks syntax highlights it as C code. ...etc.

Trying out to run processes via mpirun in Podman containers has shown
that the CMA btl_vader_single_copy_mechanism does not work when user
namespaces are involved.

Creating containers with Podman requires at least user namespaces to be
able to do unprivileged mounts in a container

Even if running the container with user namespace user ID mappings which
result in the same user ID on the inside and outside of all involved
containers, the check in the kernel to allow ptrace (and thus
process_vm_{read,write}v()), fails if the same IDs are not in the same
user namespace.

One workaround is to specify '--mca btl_vader_single_copy_mechanism none'
and this commit adds code to automatically skip CMA if user namespaces
are detected and fall back to MCA_BTL_VADER_EMUL.

Signed-off-by: Adrian Reber <areber@redhat.com>
@adrianreber
Copy link
Member Author

@jsquyres Thanks for the help.

I added an exit function and I am calling it now in case cma has been explicitly requested by the user.

@hppritcha
Copy link
Member

@jsquyres could you double check and see if these are the changes you requested?

@adrianreber
Copy link
Member Author

Is this ready or does it need additional changes?

Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the poke / sorry for the delay.

@jsquyres
Copy link
Member

@hjelmn Are you cool with everything in this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants