[RFC] Implement a fallback when clone3 returns ENOSYS #2030

Closed
yihuaf opened this issue Jun 10, 2023 · 9 comments

@yihuaf
Collaborator

yihuaf commented Jun 10, 2023

During recent research on seccomp while trying to resolve #2022, I learned that clone3 returning ENOSYS or EPERM is actually a security-context issue, not simply a matter of whether the kernel meets a minimum version. Specifically, this reminded me of #1861. In that issue, clone3 returned ENOSYS, but the kernel version was 5.15, which definitely supports clone3. I suspect this is because rules on the host system block clone3 but return ENOSYS instead. Normally, blocked calls return EPERM, but returning EPERM for clone3 would break glibc and therefore almost all applications. So it is very likely that clone3 is blocked with ENOSYS, and the application should just fall back to fork or clone.

However, we chose not to implement a fallback. I believe we were naive at the time, thinking that as long as we mandate a minimum kernel version, this would not be an issue. Now that there is real evidence that real systems behave this way, we may want to reconsider the assumptions we made back then. Falling back to clone is not hard from an implementation perspective. We already have the change, twice, LOL.
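To make the proposal concrete, here is a minimal sketch of the fallback decision, not youki's actual implementation. `clone_with_fallback` and its two closure parameters are hypothetical names used only for illustration; in real code the closures would wrap the clone3 and clone calls. The `ENOSYS` value of 38 is an assumption that holds on x86_64 and aarch64 Linux; real code would take it from the `libc` crate.

```rust
use std::io;

/// ENOSYS on x86_64 and aarch64 Linux (assumption; other architectures
/// may use a different value, so real code should use `libc::ENOSYS`).
const ENOSYS: i32 = 38;

/// Hypothetical helper: run `clone3_attempt` first and, only if it fails
/// with ENOSYS, run `clone_attempt` instead. Both return the child PID.
fn clone_with_fallback<P, F>(clone3_attempt: P, clone_attempt: F) -> io::Result<i32>
where
    P: FnOnce() -> io::Result<i32>,
    F: FnOnce() -> io::Result<i32>,
{
    match clone3_attempt() {
        // clone3 is missing or blocked with ENOSYS: use the legacy path.
        Err(err) if err.raw_os_error() == Some(ENOSYS) => clone_attempt(),
        // Success, or any other error such as EPERM, is returned unchanged.
        other => other,
    }
}
```

Only ENOSYS triggers the fallback in this sketch. Any other error is passed through to the caller, so a genuine failure is not hidden behind a silent retry.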

On the other hand, clone3 is the only way to use new features such as the new time namespace. I am not sure about the implications of this just yet.

@containers/youki-maintainers comments?

A few notes for reference. PR #1610 is where we decided to use clone3. We also have to use clone rather than fork because fork doesn't let us create a sibling process.

@yihuaf yihuaf self-assigned this Jun 10, 2023
@utam0k
Member

utam0k commented Jun 10, 2023

Thanks for the great survey, I honestly hadn't considered seccomp. Sorry.

Can you briefly tell me what the security issues are?

> I learned that clone3 returning ENOSYS or EPERM is actually an issue in the security context,

@yihuaf
Collaborator Author

yihuaf commented Jun 11, 2023

> Thanks for the great survey, I honestly hadn't considered seccomp. Sorry.

No worries. It has been interesting reading all the email threads about clone3 and seccomp in different projects.

The main issue is that some choose to block clone3 with an ENOSYS and force a fallback, because seccomp can't effectively filter a subset of clone3 calls. Specifically, seccomp used to check the clone_flag, which is an argument to the clone function. In the new clone3, the clone_flag is located inside the struct being passed in. seccomp can't check that flag with its rules because it can only match on the syscall number and the argument values themselves, and cannot follow a pointer into the struct. See the following section from https://lwn.net/Articles/792628/:

This interface seems to be satisfactory to everybody involved, though Jann Horn did point out one significant problem: the seccomp mechanism is unable to examine system-call arguments that are passed in separate structures, so it will be unable to make decisions based on the flags given to clone3(). That, he said, means that code meant to be sandboxed with seccomp may not use clone3() at all.
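The difference can be sketched in a few lines. The struct below mirrors the field layout of the kernel's `struct clone_args`; the two helper functions are hypothetical and only model which values end up in syscall argument registers, which is all a seccomp filter gets to inspect. The legacy argument order shown is the x86_64 one and is an assumption for other architectures.

```rust
use std::mem::size_of;

/// Mirrors the layout of the kernel's `struct clone_args`
/// (every field is a 64-bit value).
#[repr(C)]
#[derive(Default)]
#[allow(dead_code)]
struct CloneArgs {
    flags: u64,
    pidfd: u64,
    child_tid: u64,
    parent_tid: u64,
    exit_signal: u64,
    stack: u64,
    stack_size: u64,
    tls: u64,
    set_tid: u64,
    set_tid_size: u64,
    cgroup: u64,
}

/// Hypothetical model of what a filter sees for legacy clone on x86_64:
/// the flags are themselves a register value and can be matched.
fn clone_register_args(flags: u64, stack: u64) -> [u64; 2] {
    [flags, stack]
}

/// Hypothetical model of what a filter sees for clone3(args, size):
/// an address and a size. The flags live behind the address.
fn clone3_register_args(args: &CloneArgs) -> [u64; 2] {
    [args as *const CloneArgs as usize as u64, size_of::<CloneArgs>() as u64]
}
```

With clone3 the filter only has a pointer and a length to work with, so the remaining choices are to allow every clone3 call or to reject them all.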

An example of this is how Firefox and Chrome choose to block clone3 (https://bugzilla.mozilla.org/show_bug.cgi?id=1715254):

Linux sandbox: return ENOSYS for clone3

Because clone3 uses a pointer argument rather than a flags argument, we
cannot examine the contents with seccomp, which is essential to
preventing sandboxed processes from starting other processes. So, we
won't be able to support clone3 in Chromium.

There are more links to dig into from the Mozilla bug I referenced above.

As a result, platforms may choose to block clone3 with ENOSYS for security reasons. I am not sure whether the gitpod issue is related to this, but I suspect it may be. I would imagine we will see more instances of this situation. In other words, clone3 may return ENOSYS even on newer kernel versions, and we should consider implementing a fallback for situations like this.

With that being said, youki can afford to wait a bit and see what happens. The clone3 call is so new that people are still figuring out how to work with it. While I want this documented in this issue as a future reference for us, we can wait to see if more platforms choose to block clone3. clone will at some point definitely be deprecated in favor of clone3, especially considering that the clone_flag in clone has run out of bits. New features such as the time namespace, and future new features, can only be accessed via clone3. We can put in the fallback now or kick the can down the road.
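The "ran out of bits" point has a direct consequence for any fallback: some flag combinations cannot be expressed with legacy clone at all. The sketch below uses flag values as defined in the Linux uapi `sched.h` header. The helper `expressible_with_legacy_clone` is a hypothetical name, and the check is a simplification that assumes the legacy flags argument is a 32-bit mask whose low byte is reserved for the exit signal.

```rust
/// Low byte of the legacy clone flags word, reserved for the exit signal.
const CSIGNAL: u64 = 0x0000_00ff;
/// A flag that legacy clone can carry.
const CLONE_NEWPID: u64 = 0x2000_0000;
/// Overlaps with CSIGNAL, so legacy clone cannot carry it.
const CLONE_NEWTIME: u64 = 0x0000_0080;
/// These two need more than 32 bits.
const CLONE_CLEAR_SIGHAND: u64 = 0x1_0000_0000;
const CLONE_INTO_CGROUP: u64 = 0x2_0000_0000;

/// Hypothetical check: can this flag set be passed to legacy clone?
fn expressible_with_legacy_clone(flags: u64) -> bool {
    // Must not collide with the exit-signal byte and must fit in 32 bits.
    flags & CSIGNAL == 0 && flags <= u64::from(u32::MAX)
}
```

A fallback path could run a check like this before retrying with clone, and report an error instead of silently dropping a requested feature such as the time namespace.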

@yihuaf yihuaf added the RFC label Jun 12, 2023
@utam0k
Member

utam0k commented Jun 14, 2023

Thanks for the detailed survey, excellent. Would the solution be to fall back to clone(2)?

@yihuaf
Collaborator Author

yihuaf commented Jun 14, 2023

Fall back to clone, yes.

@utam0k
Member

utam0k commented Jun 15, 2023

I agree with @yihuaf. This fallback is needed. However, the minimum supported Linux kernel version will not be lowered.

@yihuaf
Collaborator Author

yihuaf commented Jun 23, 2023

Wait a second, I think we made a mistake here. We are not supposed to bypass libc to call the raw syscall, for clone3 or clone... We need to go through libc for this. There is internal bookkeeping that we bypass if we call the raw syscalls. In fact, glibc has not implemented a clone3 wrapper yet. We may need to revert to not using clone3 at all.

Ref: rust-lang/rust#89522 (comment)

@utam0k
Member

utam0k commented Jun 25, 2023

What problems do we have if we bypass it?

@yihuaf
Collaborator Author

yihuaf commented Jul 3, 2023

> What problems do we have if we bypass it?

From my research on clone and clone3, my understanding is that we should not bypass libc to create a process (through the kernel clone or clone3 syscall). Libc internally does bookkeeping for both processes and threads. I am not familiar with the internal libc implementation details, but there are a number of places (glibc email threads and GitHub issues on other projects) where libc authors (both musl and glibc) stated that people should not bypass libc. Specifically, if people call the kernel clone/clone3 syscall directly to create a process, they may not call libc functions again. It may work, but it may also fail in subtle ways that they will not support. In other words, in their design, this use case falls into undefined behavior.

With that being said, there is currently no clone3 libc wrapper. We also have no idea yet what the libc interface for clone3 will look like. At the same time, clone3 has worked so far with all of our tests. This is why I am eager to get the fallback working correctly, so we have a way out in case we discover a problem. To me, clone or fork will always produce the correct result.

In conclusion, we should keep the existing clone3 logic and implement the fallback. If clone3 becomes an issue, we can easily switch to the fallback. In the future, when libc provides a clone3 wrapper, we should transition to using it.

@yihuaf
Collaborator Author

yihuaf commented Aug 3, 2023

This is done via #2121

@yihuaf yihuaf closed this as completed Aug 3, 2023