Do not use CMA in user namespaces #6844
I'm not familiar with Podman. Are you saying that at time of launch, you have no idea that the containers are going to wind up in different user namespaces? That the only way to know is to search the Assuming that situation is true (and I sincerely hope it isn't): Given that you are using mpirun to start the containers, and that mpirun knows the pid of each proc once it does the
If the container is setting the namespace at We have found in other scenarios that having every local process access |
Podman, just like Docker, uses runc to run the containers. runc is an implementation of the OCI spec. runc, as far as I understand it, first does a clone() and then an unshare() for the namespace setup.
Well, if I know that I am running Podman, I could probably assume that each process will run in its own namespace. There is, however, an option to run different containers in the same user namespace. So it is not correct to assume that processes are in different user namespaces just because they run in Podman containers. There are also other container runtimes which use user namespaces (for namespace-based containers, user namespaces are basically required to be able to mount something in the container), so you would need to detect those as well.
As far as I understand it, yes, looking at
The namespaces are set up by runc, which is started by Podman (Podman actually starts conmon and conmon starts runc). There is a long comment on how this is done at https://github.com/opencontainers/runc/blob/master/libcontainer/nsenter/nsexec.c#L645
I am pretty sure this will not work.
I think I tried to do something like this, but I did not manage to do the
Understood. |
I agree with @rhc54; publishing the namespace id as part of modex send/recv seems like a much better idea. If CMA is explicitly requested, the result of not being in the same namespace (and therefore not using CMA) should be a hard error, not a warning (ie, not the |
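The idea of publishing the user namespace ID with the modex data and comparing it per peer could look roughly like the sketch below. This is not the Open MPI code: `peer_info_t`, `same_user_ns()` and `cma_usable_for_all()` are hypothetical names, and treating an ID of 0 as "unknown, assume the same namespace" is an assumption made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-peer record, filled from whatever the peer published. */
typedef struct {
    int      rank;
    uint64_t user_ns_id;   /* 0 means "could not be determined" */
} peer_info_t;

/* Two processes can use CMA only if they share a user namespace.
 * Assumption for this sketch: an unknown ID (0) is treated as "same". */
static bool same_user_ns(uint64_t local_id, uint64_t peer_id)
{
    if (0 == local_id || 0 == peer_id) {
        return true;
    }
    return local_id == peer_id;
}

/* CMA stays enabled only if every local peer is in our user namespace. */
static bool cma_usable_for_all(uint64_t local_id,
                               const peer_info_t *peers, size_t n)
{
    assert(NULL != peers || 0 == n);
    for (size_t i = 0; i < n; ++i) {
        if (!same_user_ns(local_id, peers[i].user_ns_id)) {
            return false;
        }
    }
    return true;
}
```

Because every process receives the IDs of all its local peers, each one can reach the same decision without an extra round of communication.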
I implemented the same behaviour as requesting cma in a non-namespaced environment. vader will fall back to another single-copy mechanism. No hard error. Just to make sure, before looking into @rhc54 recommendation, how it should fail. Same behaviour as the current cma code, or hard error if running in user namespaces? |
@adrianreber We have a general rule of thumb in Open MPI: if a human asks for something and we can't deliver it, that's a hard error / exit. If Vader isn't obeying that in some cases, that's actually a bug that should be fixed. |
@jsquyres Perfect, thanks. |
Force-pushed from 8c9ad46 to 2857b0b.
I force pushed a new version based on recv() and send(). The only thing missing is the hard error if the user explicitly selected Should I just introduce a new field in |
You should be able to look at mca_base_var_get_value() to get the source of the MCA variable value -- e.g., whether it was just the default value, or whether a user actually set that value.
Force-pushed from 2857b0b to 7686c6a.
Thanks. I included some code based on this to OPAL_ERROR if CMA is not selected as part of MCA_BASE_VAR_SOURCE_DEFAULT. I hope I addressed all open points. |
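The rule discussed above (fall back silently when CMA was only the default, fail hard when a user asked for it) reduces to a small decision table. The sketch below uses hypothetical stand-in types; `var_source_t`, `cma_action_t` and `decide_cma()` are not Open MPI names, and the real code would obtain the source of the value from the MCA variable system rather than from a function argument.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for "where did the MCA variable value come from". */
typedef enum { SRC_DEFAULT, SRC_USER } var_source_t;

/* What the BTL should do about CMA. */
typedef enum { ACT_USE_CMA, ACT_FALLBACK, ACT_ABORT } cma_action_t;

static cma_action_t decide_cma(var_source_t source, bool same_user_namespace)
{
    assert(SRC_DEFAULT == source || SRC_USER == source);

    if (same_user_namespace) {
        return ACT_USE_CMA;       /* nothing prevents CMA */
    }
    /* CMA cannot work across user namespaces. */
    if (SRC_DEFAULT == source) {
        return ACT_FALLBACK;      /* nobody asked for CMA: switch quietly */
    }
    return ACT_ABORT;             /* a human asked for CMA: hard error */
}
```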
False cray failure: bot:ompi:retest |
@adrianreber you can simply get the namespace id via a can you also please explain why you had to add
diff --git a/opal/mca/btl/vader/btl_vader_component.c b/opal/mca/btl/vader/btl_vader_component.c
index dc299b1..922658c 100644
--- a/opal/mca/btl/vader/btl_vader_component.c
+++ b/opal/mca/btl/vader/btl_vader_component.c
@@ -41,6 +41,9 @@
#include "btl_vader_fbox.h"
#include "btl_vader_xpmem.h"
+#ifdef HAVE_SYS_STAT_H
+#include <sys/stat.h>
+#endif
#include <sys/mman.h>
#include <fcntl.h>
@@ -351,47 +354,25 @@ static int mca_btl_vader_component_close(void)
}
/*
- * mca_btl_vader_parse_proc_ns_user() tries to parse the user namespace ID
+ * mca_btl_vader_parse_proc_ns_user() tries to get the user namespace ID
* of the current process.
* Returns the ID of the user namespace. In the case of an error '0' is returned.
*/
uint64_t mca_btl_vader_parse_proc_ns_user(void)
{
- char *link = malloc(PATH_MAX);
- pid_t pid = getpid();
- uint64_t user_ns_id;
- char *tmp;
- int i;
-
- i = readlink("/proc/self/ns/user", link, PATH_MAX);
- if (-1 == i) {
- return 0;
- }
-
- /*
- * Result in link should look like 'user:[<inode-number>]', so at least
- * 8 characters long: 'user:[?]'.
- */
- if (8 > i) {
- free(link);
- opal_output(0, "Error reading user namespace ID of process %d\n", pid);
- return 0;
- }
+ struct stat buf;
- /* remove trailing ']' */
- link[i - 1] = '\0';
- tmp = strchr(link, '[');
- if (NULL == tmp) {
- free(link);
- opal_output(0, "Error reading user namespace ID of process %d\n", pid);
+ if (0 > stat("/proc/self/ns/user", &buf)) {
+ /*
+ * Something went wrong, probably an old kernel that does not support namespaces
+ * simply assume all processes are in the same user namespace and return 0
+ */
return 0;
}
- user_ns_id = strtoul(tmp + 1, NULL, 10);
- free(link);
-
- return user_ns_id;
+ return (uint64_t)buf.st_ino;
}
+
static int mca_btl_base_vader_modex_send (void)
{
union vader_modex_t modex; |
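For readers who want to try the simplified approach outside of Open MPI, here is a minimal standalone sketch of the same idea. The function names `user_ns_id_from()` and `current_user_ns_id()` are made up for this example; the path parameter exists only so the error path can be exercised.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/stat.h>

/* Return the inode number behind `path`, or 0 on any error.
 * stat() follows the symlink, so for /proc/self/ns/user the inode is the
 * number that readlink would show inside "user:[...]". */
static uint64_t user_ns_id_from(const char *path)
{
    struct stat buf;

    assert(NULL != path);
    if (0 != stat(path, &buf)) {
        /* e.g. a kernel without user namespace support: report "unknown" */
        return 0;
    }
    return (uint64_t) buf.st_ino;
}

/* User namespace ID of the calling process (0 if it cannot be read). */
static uint64_t current_user_ns_id(void)
{
    return user_ns_id_from("/proc/self/ns/user");
}
```

Two processes that print the same value from `current_user_ns_id()` are in the same user namespace; a value of 0 only says that the ID could not be determined.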
Are you sure? Using the command-line tool |
@adrianreber That is why assumptions should be avoided :-) this is a counter-intuitive, but from the CLI, you have to
that uses |
I cannot really explain it.
So I thought to extend the non xpmem part I have to create a struct just like |
Force-pushed from 7686c6a to 6b45ec2.
@ggouaillardet Thanks for the simplification. Changed it and force pushed. As this removed all occurrences of |
Force-pushed from 6b45ec2 to 33f345f.
bot:ibm:retest |
It is a union because either Vader is using xpmem or it isn't. Do you know if xpmem works between namespaces? If not then additional changes are needed. |
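To illustrate the "either xpmem or not" layout, here is a small sketch of a modex payload as a union with one struct per case. All type and field names (`demo_modex`, `modex_xpmem`, `modex_other`, `fill_other()`) are hypothetical and chosen for this example; they are not the fields of the actual `vader_modex_t`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical payload when xpmem is in use. */
struct modex_xpmem {
    int64_t seg_id;
    void   *segment_base;
};

/* Hypothetical payload for every other mechanism, carrying the
 * user namespace ID next to the shared segment path. */
struct modex_other {
    uint64_t user_ns_id;
    char     seg_path[64];
};

/* Only one of the two is ever sent, so a union is enough. */
union demo_modex {
    struct modex_xpmem xpmem;
    struct modex_other other;
};

/* Fill the non-xpmem variant and return the number of bytes to publish. */
static size_t fill_other(union demo_modex *m, uint64_t ns_id, const char *path)
{
    assert(NULL != m && NULL != path);
    memset(m, 0, sizeof(*m));
    m->other.user_ns_id = ns_id;
    snprintf(m->other.seg_path, sizeof(m->other.seg_path), "%s", path);
    return sizeof(struct modex_other);
}
```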
Unfortunately I know nothing about xpmem. The vader code is probably the first time I read about it. |
bot:ompi:retest |
Are there any more changes required from my side? Not sure if all discussions are resolved or not. |
Any more comments, reviews, suggestions? Any chance to get this merged? |
Do you know if XPMEM is broken due to this PR? Or was it already broken on master? |
Per @hppritcha note: using master or using this PR |
appears to have been broken already on master. |
Heh. Missed that. Ok. How do we move forward with this PR then? Should we just go ahead and merge it? Or do we need to wait for an XPMEM fix (independent of this PR) first? |
I'd go ahead and merge it. But wait, have you actually seen what the output looks like if this feature kicks in? |
@hppritcha No problem with waiting for more data -- I just wanted to make sure this PR didn't get dropped / forgotten again... |
Ah, good to know, I thought my change broke xpmem. So xpmem is also not working on master. I was already looking at the code. Definitely interested to hear if this also works for charliecloud. |
Hello all, I am one of the Charliecloud folks @hppritcha mentioned. The fallback seems to work as intended, so all looks good on that end. However, when "OMPI_MCA_btl_vader_single_copy_mechanism=cma" is specified, the application will print the fatal error and continue to run.
The resulting performance is what I would get if I specified |
Thank you to @adrianreber for working on this, this will be nice functionality to have :) |
I have also seen that the |
OMPI policy is to indeed abort if the user explicitly requests something we cannot do - so in the case cited by @heasterday, we should definitely abort. |
Is returning OPAL_ERROR not enough to abort? OPAL_ERROR is what I saw at all the places I looked how to handle an error. What is the right way to abort? |
Well, looks pretty simple - from a different btl:
void mca_btl_ofi_exit(void)
{
    BTL_ERROR(("BTL OFI will now abort."));
    exit(1);
} |
Okay, will add |
Thx! |
A better example would be to look at the usNIC BTL utility function to exit upon error -- it will call the PML callback, if it exists / has been set: ompi/opal/mca/btl/usnic/btl_usnic_util.c Lines 27 to 62 in 50db085
|
@jsquyres something like that?
diff --git a/opal/mca/btl/vader/btl_vader_module.c b/opal/mca/btl/vader/btl_vader_module.c
index 55b4726340..17381ba17d 100644
--- a/opal/mca/btl/vader/btl_vader_module.c
+++ b/opal/mca/btl/vader/btl_vader_module.c
@@ -80,6 +80,30 @@ mca_btl_vader_t mca_btl_vader = {
}
};
+// Exit function copied from btl_usnic_util.c
+// The following comment tells Coverity that this function does not return.
+// See https://scan.coverity.com/tune.
+
+/* coverity[+kill] */
+static void vader_btl_exit(mca_btl_vader_t *btl)
+{
+ if (NULL == btl) {
+ fprintf(stderr, "*** The Open MPI vader BTL is aborting the MPI job (via exit(3)).\n");
+ fflush(stderr);
+ exit(1);
+ }
+
+ if (NULL != btl->error_cb) {
+ btl->error_cb(&btl->super, MCA_BTL_ERROR_FLAGS_FATAL,
+ (opal_proc_t*) opal_proc_local_get(),
+ "The vader BTL is aborting the MPI job (via PML error callback).");
+ }
+
+ /* If the PML error callback returns (or if there wasn't one),
+ just exit. Shrug. */
+ exit(1);
+}
+
static int vader_btl_first_time_init(mca_btl_vader_t *vader_btl, int n)
{
mca_btl_vader_component_t *component = &mca_btl_vader_component;
@@ -236,7 +260,7 @@ static int init_vader_endpoint (struct mca_btl_base_endpoint_t *ep, struct opal_
/* If CMA has been explicitly selected we want to error out */
opal_show_help("help-btl-vader.txt", "cma-different-user-namespace-error",
true, opal_process_info.nodename);
- return OPAL_ERROR;
+ vader_btl_exit(&mca_btl_vader);
}
/*
* If CMA has been selected because it is the default or |
@adrianreber Minor suggestion (and I should probably do something like this to the usnic BTL, too...):
static void vader_btl_exit(mca_btl_vader_t *btl)
{
    if (NULL != btl && NULL != btl->error_cb) {
        btl->error_cb(&btl->super, MCA_BTL_ERROR_FLAGS_FATAL,
                      (opal_proc_t*) opal_proc_local_get(),
                      "The vader BTL is aborting the MPI job (via PML error callback).");
    }
    /* If the PML error callback returns (or if there wasn't one),
       just exit. Shrug. */
    fprintf(stderr, "*** The Open MPI vader BTL is aborting the MPI job (via exit(3)).\n");
    fflush(stderr);
    exit(1);
}
Also, github pro tip: if you put the word "patch" after the 3 tick marks for the verbatim section, github syntax highlights it as a diff. Similarly, putting "c" after the 3 tick marks syntax highlights it as C code. ...etc. |
Trying to run processes via mpirun in Podman containers has shown that the CMA btl_vader_single_copy_mechanism does not work when user namespaces are involved.
Creating containers with Podman requires at least user namespaces to be able to do unprivileged mounts in a container.
Even if the container is run with user namespace user ID mappings which result in the same user ID on the inside and outside of all involved containers, the check in the kernel to allow ptrace (and thus process_vm_{read,write}v()) fails if the same IDs are not in the same user namespace.
One workaround is to specify '--mca btl_vader_single_copy_mechanism none'. This commit adds code to automatically skip CMA if user namespaces are detected and fall back to MCA_BTL_VADER_EMUL.
Signed-off-by: Adrian Reber <areber@redhat.com>
Force-pushed from 33f345f to fc68d8a.
@jsquyres Thanks for the help. I added an exit function and I am calling it now in case |
@jsquyres could you double check and see if these are the changes you requested? |
Is this ready or does it need additional changes? |
Thanks for the poke / sorry for the delay.
@hjelmn Are you cool with everything in this PR? |
@adrianreber fixed @hjelmn's issue forever ago; I'm going to assume @hjelmn is good with this.
Trying to run processes via mpirun in Podman containers has shown that the CMA btl_vader_single_copy_mechanism does not work when user namespaces are involved.
Creating containers with Podman requires at least user namespaces to be able to do unprivileged mounts in a container.
Even if the container is run with user namespace user ID mappings which result in the same user ID on the inside and outside of all involved containers, the check in the kernel to allow ptrace (and thus process_vm_{read,write}v()) fails if the same IDs are not in the same user namespace.
One workaround is to specify '--mca btl_vader_single_copy_mechanism none' and this commit adds code to automatically skip CMA if user namespaces are detected.
The preferred implementation would have been to detect whether the other local processes are running in different user namespaces, but it was not clear how to get the PIDs of the other involved processes in mca_btl_vader_check_single_copy(). This is even more complicated if some, but not all, processes are running in the same user namespace. If one different user namespace is detected, CMA should be disabled for all involved processes. So if one local process detects that CMA is not working, it would need to communicate this information to all local processes.
This implementation now checks during the first access of mca_btl_vader_{put,get}_cma() if the destination process is running in another user namespace and switches to MCA_BTL_VADER_EMUL if this is true.
So if the first access to process_vm_{read,write}v() fails, all further accesses automatically stop trying to use CMA.
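The "try CMA once, then stop trying" behaviour described here can be sketched as follows. This is an illustration under several assumptions, not the vader code: `copy_state`, `get_from_peer()` and `demo_emulated_get()` are hypothetical names, the emulated path is replaced by a plain memcpy() that is only valid because the demo reads from its own address space, and the raw `syscall()` form of process_vm_readv() is used so the example builds on Linux without extra feature macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Per-process switch: cleared after the first failed CMA attempt. */
struct copy_state {
    bool cma_enabled;
};

/* Stand-in for the emulated (non single-copy) path. A real fallback would
 * go through the normal send/receive machinery instead of memcpy(). */
static ssize_t demo_emulated_get(void *local, const void *remote, size_t len)
{
    memcpy(local, remote, len);
    return (ssize_t) len;
}

/* Read `len` bytes at `remote` in process `pid` into `local`. */
static ssize_t get_from_peer(struct copy_state *st, pid_t pid,
                             void *local, void *remote, size_t len)
{
    assert(NULL != st && NULL != local && NULL != remote);

    if (st->cma_enabled) {
        struct iovec l = { .iov_base = local,  .iov_len = len };
        struct iovec r = { .iov_base = remote, .iov_len = len };
        /* Same operation as process_vm_readv(pid, &l, 1, &r, 1, 0). */
        long n = syscall(SYS_process_vm_readv, (long) pid,
                         &l, 1UL, &r, 1UL, 0UL);
        if (n == (long) len) {
            return (ssize_t) n;
        }
        /* CMA failed (for example, a different user namespace):
         * never try it again. */
        st->cma_enabled = false;
    }
    return demo_emulated_get(local, remote, len);
}
```

The drawback noted in the discussion still applies to this pattern: each process only learns about the problem on its own first access, which is why exchanging the namespace ID up front was suggested instead.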