[WIP] Don't bump RLIMIT_NOFILE in exec sessions with '--ulimit host' #24243
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: debarshiray. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from d53ce41 to aba3b60 (Compare)
Force-pushed from aba3b60 to 302d638 (Compare)
I don't quite understand what is going on here. In my local builds on a Fedora 40 host, this doesn't change the outcome of …
I don't know what to do here, because …
You have to add the release note to the PR description, see the template text under "Does this PR introduce a user-facing change?"
This will get a conflict after #24228 merges,
but fundamentally I do not see this as a correct change. We really want to keep the high ulimit for (most?) of our child processes.
For example, the darwin code was added in #20643 because we needed a higher limit for qemu. While we no longer use qemu there, I would think the same applies to vfkit, and we still use qemu on Linux, so this would break it there as well.
And as you pointed out, it was originally added for the port forwarding logic, which opens a lot of fds. This problem is not solved by the Go runtime code: as root, all the fds are passed into conmon, so it also needs a high limit. And as rootless we use pasta or slirp4netns, which open the ports, so they all need higher limits.
And there might be other child processes as well that I don't know of right now.
As such, I do not think your current change is correct:
sudo bin/podman run -p 1000-2000 quay.io/libpod/testimage:20240123 ip a
Error: OCI runtime error: crun: openat2 `proc/bus`: Too many open files
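A minimal standalone sketch of the runtime behaviour at play here (assuming the Go 1.19+ changes cited in the description, and not actual Podman code): the Go parent sees the bumped soft limit, while a child it spawns gets the original value restored, which is why a helper like conmon would lose the high limit if Podman itself stopped bumping.

package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	// The Go runtime already raised the soft limit to the hard limit
	// at startup, so this prints the bumped value.
	fmt.Printf("parent: soft=%d hard=%d\n", rl.Cur, rl.Max)

	// A child spawned from Go gets the original, pre-bump soft limit
	// restored, so it does not inherit the raised value.
	out, err := exec.Command("sh", "-c", "ulimit -S -n").Output()
	if err != nil {
		panic(err)
	}
	fmt.Printf("child:  soft=%s", out)
}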
I think to actually solve your issue it would make more sense to store the host limits in a global var and then, for --ulimit host, set them in the container spec accordingly from that.
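A rough sketch of that suggestion; the package, variable, and function names are hypothetical, not existing Podman code, and the caveat raised in the following comments (the Go runtime has already bumped the soft limit before any Go code runs) applies.

package ulimits

import (
	"syscall"

	spec "github.com/opencontainers/runtime-spec/specs-go"
)

// hostNofile caches RLIMIT_NOFILE as seen at startup (hypothetical global).
var hostNofile syscall.Rlimit

// CaptureHostLimits would run once, as early as possible in main().
func CaptureHostLimits() error {
	return syscall.Getrlimit(syscall.RLIMIT_NOFILE, &hostNofile)
}

// NofileForUlimitHost builds the OCI rlimit entry for the container spec
// when '--ulimit host' is given.
func NofileForUlimitHost() spec.POSIXRlimit {
	return spec.POSIXRlimit{
		Type: "RLIMIT_NOFILE",
		Hard: hostNofile.Max,
		// Caveat: by the time this runs, the Go runtime has already
		// raised the soft limit, so Cur may not be the true host value.
		Soft: hostNofile.Cur,
	}
}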
Thanks for the review!
Yes, I saw it fly by and it reminded me of this problem.
Right. I didn't know if the higher limit is necessary for the child processes or just the podman process itself. I didn't test the port forwarding with my local builds because …
As far as I can make out, a Go program can never get the original soft limit for RLIMIT_NOFILE. Instead, it seems what a Go program can do is to not restore the original limit when forking a child process. One way of doing this is to set the limits in Podman, like we do today.

I am wary of embedding the limits in the immutable persistent definition of the container, because if for some reason the limits change in the host OS, then the container will fail to start. The limits can change without any intervention from the system administrator. Just updating the OS and rebooting is enough. I remember hitting this a few times in the past. So, it might be better to do all this at run-time.

One thing to note is that the limits for the container's entry point don't matter for Toolbx users. They always get a 'podman exec' session.

We could make this conditional on exec sessions in containers with --ulimit host.
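A minimal sketch of that "set the limits in Podman" workaround, assuming current Go behaviour where calling syscall.Setrlimit for RLIMIT_NOFILE makes the runtime stop restoring the saved original limit for child processes:

package main

import "syscall"

func main() {
	// By the time any Go code runs, the runtime has already raised the
	// soft limit for RLIMIT_NOFILE to the hard limit. Re-applying the
	// current values via Setrlimit makes the Go runtime forget the saved
	// original limits, so children keep the raised soft limit instead of
	// having the original restored.
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
}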
That really sucks; we already have some things like that and I hate it with a passion. It will also not work for things like the podman system service. I have not looked deeply into what the Go runtime does with the limits and whether there is a way to get the real original limits. If the Go runtime does not expose them, there is already a workaround via the pkg/rootless C code that runs before the Go runtime kicks in, so it would in theory be possible to get and store the limits there and then add our own Go function to access the real values.
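A sketch of that idea; it assumes, as described above, that constructor-attributed C functions run before the Go runtime adjusts the limits, and the package and function names are hypothetical:

package rlimit

/*
#include <sys/resource.h>

static struct rlimit saved_nofile;

// Runs before the Go runtime's startup code bumps RLIMIT_NOFILE.
__attribute__((constructor)) static void save_nofile(void) {
	getrlimit(RLIMIT_NOFILE, &saved_nofile);
}

static unsigned long long saved_nofile_cur(void) { return saved_nofile.rlim_cur; }
static unsigned long long saved_nofile_max(void) { return saved_nofile.rlim_max; }
*/
import "C"

// OriginalNofile returns the RLIMIT_NOFILE values the process was actually
// started with, as captured by the C constructor above.
func OriginalNofile() (soft, hard uint64) {
	return uint64(C.saved_nofile_cur()), uint64(C.saved_nofile_max())
}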
Hey, sorry, I'm not too familiar with all the context and history within the Podman project regarding this (I am a bit short on time to investigate), but I was involved in pushing the fix for both Docker (v25) and Containerd (v2.0) to revert their usage of LimitNOFILE=infinity. Until Containerd has the v2.0 release and Docker does a follow-up release to upgrade to that, it still hasn't quite resolved the issue either. For these two projects it was their systemd units that configured this.

Within the container environment itself, I've only recently used Podman rootless for verifying a docs contribution to a project, where I had to troubleshoot something and was curious if FD limits might be at fault; the container turned out to have the bumped limits.

Any software running within the container itself that needs a higher soft limit should explicitly request that at runtime, which Go software will do on behalf of the user these days (as already mentioned above).

I am aware of Amazon eagerly adopting the change for AWS and reverting it shortly after due to some breakage it introduced for them, similar to Envoy, which has also implicitly relied on the container environment to provide excessive FD limits (a requirement that isn't documented for their software). These are cases where the software has bugs that they should fix internally, not relying on container runtimes implementing "fixes" that negatively impact other software.

FWIW, if you'd like any additional reference links for what I've mentioned above, let me know and I'll dig those up for you 👍
I think this is fixed with: #24228
No, that still bumps the soft limit to the hard limit, so --ulimit host will not have the proper soft limit set in the container.
Yes, true. Since we set the values we expect in the OCI configuration, we could probably just drop …
As written above, we cannot drop it, because then our child processes do not have the high limit (#24243 (review)), causing other problems.
I was waiting for #24228 to get merged. I will get back to this pull request now.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Starting from commit 9126b45 ("Up default Podman rlimits to avoid max open files"), Podman started bumping its soft limit for the maximum number of open file descriptors (RLIMIT_NOFILE or ulimit -n) to permit exposing a large number of ports to a container. This was later fine-tuned in commit a2c1a2d ("podman: bump RLIMIT_NOFILE also without CAP_SYS_RESOURCE").

Unfortunately, this also increases the limits for 'podman exec' sessions running in containers created with:

$ podman create --network host --ulimit host ...

This is what Toolbx uses to provide a containerized interactive command line environment for software development and troubleshooting the host operating system.

It confuses developers and system administrators debugging a process that's leaking file descriptors and crashing on the host OS. The crashes either don't reproduce inside the container or they take a lot longer to reproduce, both of which are frustrating.

Therefore, it will be good to retain the limits, at least for this specific scenario.

It turns out that since this code was written, the Go runtime has had two interesting changes.

Starting from Go 1.19 [1], the Go runtime bumps the soft limit for RLIMIT_NOFILE for all Go programs [2]. This means that there's no longer any need for Podman to bump its own limits, because it switched from requiring Go 1.18 to 1.20 in commit 4dd58f2 ("Move golang requirement from 1.18 to 1.20"). It's probably good to still log the detected limits, in case Go's behaviour changes.

Not everybody was happy with this [3], because the higher limits got propagated to child processes spawned by Go programs. Among other things, this can break old programs using select(2) [4]. So, Go's behaviour was fine-tuned to restore the original soft limit for RLIMIT_NOFILE when forking a child process [5].

With these two changes in Go, which Podman already uses, if the bumping of RLIMIT_NOFILE is left to the Go runtime, then the limits are no longer increased for 'podman exec' sessions. Otherwise, if Podman continues to bump the soft limit for RLIMIT_NOFILE on its own, then it prevents the Go runtime from restoring the original limits when forking, and leads to the higher limits in 'podman exec' sessions.

The existing 'podman run --ulimit host ... ulimit -Hn' test in test/e2e/run_test.go was extended to also check the soft limit. The similar test for 'podman exec' was moved from test/e2e/toolbox_test.go to test/e2e/exec_test.go for consistency and because there's nothing Toolbx specific about it. The test was similarly extended, and updated to be more idiomatic.

Due to the behaviour of the Go runtime noted above, and since the tests are written in Go, the current or soft limit for RLIMIT_NOFILE returned by syscall.Getrlimit() is the same as the hard limit.
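To illustrate that point, a minimal standalone sketch (not the actual e2e test code), assuming a Go 1.19+ toolchain:

package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	// The Go runtime raised the soft limit to the hard limit at startup,
	// so Cur reports the bumped value rather than the one the process
	// was started with; here Cur equals Max.
	fmt.Printf("soft=%d hard=%d\n", rl.Cur, rl.Max)
}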
The Alpine Linux image doesn't have a standalone binary for ulimit, and it's picky about the order in which the options are listed. The -H or -S must come first, followed by a space, and then the -n.

[1] https://go.dev/doc/go1.19#runtime
[2] Go commit 8427429c592588af ("os: raise open file rlimit at startup")
golang/go@8427429c592588af
golang/go#46279
[3] containerd/containerd#8249
[4] http://0pointer.net/blog/file-descriptor-limits.html
[5] Go commit f5eef58e4381259c ("syscall: restore original NOFILE ...")
golang/go@f5eef58e4381259c
golang/go#46279

Fixes: #17681

Does this PR introduce a user-facing change?