Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFE] Optimise /proc/self/mountinfo notification dispatch #15464

Closed
msekletar opened this issue Apr 17, 2020 · 14 comments
Closed

[RFE] Optimise /proc/self/mountinfo notification dispatch #15464

msekletar opened this issue Apr 17, 2020 · 14 comments
Labels
optimization pid1 RFE 🎁 Request for Enhancement, i.e. a feature request

Comments

@msekletar
Copy link
Contributor

Is your feature request related to a problem? Please describe.

Whenever we get notification on /proc/self/mountinfo we dispatch routine that updates state of our mount units based on the current contents of the file. On machines that host a lot of containers it is very common to have a lot of mount-points and high "mount traffic" (a lot of mount()/umount() syscalls). As a result systemd is woken up and running on CPU very often.

Describe the solution you'd like

Currently we get notification that something has changed and we need to reparse the file to figure out what just happened. New interface is still being discussed on LKML [1]. In the interim we should try to deal with the current interface the best we can.

Describe alternatives you've considered

Previously, @neilbrown tried to address this problem in #10268 and #11178, however, patches got reverted in #11212. @neilbrown have you thought about proposing them one more time?

[1] https://lkml.org/lkml/2020/3/18/569

@msekletar msekletar added the RFE 🎁 Request for Enhancement, i.e. a feature request label Apr 17, 2020
@neilbrown
Copy link
Contributor

No, I haven't seriously thought about propose the solution again. I provided a working solution, the systemd maintainers chose to revert it without even telling me, I see it as their problem now, not mine.
I understand the that the fix was reverted because some regression test failed. I suspect the next step is working out why that test fails. My guess is that the test make some invalid assumption, but I haven't looked at it.

@poettering
Copy link
Member

@neilbrown sorry for not pinging you when it was reverted, that happened by accident. Sorry!

@neilbrown
Copy link
Contributor

Thanks @poettering . The commit message at #11212 suggests that the patches could be reapplied after the release that was, at that time, approaching. If you do that and give me some pointers to the reported CI failures, ideally enough that I can reproduce the failures myself, I'll undertake to work out what is causing the failures.

@msekletar
Copy link
Contributor Author

@neilbrown Rebasing the patches on top of the master and opening new PR would be enough to trigger CI. Let me know if you plan to do that or should I.

@neilbrown
Copy link
Contributor

Where would I go to see the results of the CI ?? How easy is it to run the tests locally? Can you point me to any doco on this?

@msekletar
Copy link
Contributor Author

Hmm...well, just revert previous revert and open new PR. CI will run automatically.

@msekletar
Copy link
Contributor Author

But generally speaking, meson test will run unit tests. Integration tests from test/ directory can be run locally using make. make setup will prepare the test image, make run will execute the test and make clean nukes the image. Also Ubuntu CI autopkgtests can be run locally according to wiki but I've never tried that. Should be pretty straightforward though. I guess that just installing deb pkgs on a test system and rerunning the failing test is all.

https://www.freedesktop.org/wiki/Software/systemd/autopkgtest/

@poettering
Copy link
Member

BTW, I think we should do this differently now. Specifically I think sd-event should have a per-event source "throttle" rate limit logic that we can enable individually. We'd enable it for the /proc/self/mountinfo event source and nothing else. the rate limit would be triggered by x events in a time window of y, and would then result in the event source being removed from the polling for a time z. This would make things in mount.c very simple, in particular as 3508048 has been merged now. And we could use this elsewhere wherever we need something similar.

@axot
Copy link

axot commented Jul 29, 2020

Hello, is there an exists work around for this? We got 40 load one in a 1 core instance.

image

@Geass-LL
Copy link
Contributor

Hi, as #17703 has been merged, maybe this issue should be closed?

@poettering
Copy link
Member

i guess. let's close hence.

@hexiaowen
Copy link
Contributor

hexiaowen commented Mar 7, 2022

@poettering
I think the real reason is that there are millions of /run/mount/utab entries. I can easily reproduce the problem manually。

util-linux/util-linux#1617

How to deal with this situation and who is responsible for cleaning up
SRC=192.168.x.x:/ TARGET=/mnt/paas/kubernetes/kubelet/pods/1312de32-0c40-4490-a3fb-d0895ed82207/volumes/kubernetes.io~dsf/pv-sfs-train/mount ROOT=/ ATTRS=vers=3,noresvport,nolock,addr=192.168.x.x

--- update
The problem has been confirmed,
It is because the mount operation is performed on the host, the umount opteration is in the container, and /run/mount/utab is not mounted in the container, the utab is not refreshed when umount.

@karelzak
Copy link
Contributor

--- update The problem has been confirmed, It is because the mount operation is performed on the host, the umount opteration is in the container, and /run/mount/utab is not mounted in the container, the utab is not refreshed when umount.

This is an expected utab limitation.

The question is how to improve it in some elegant way. Maybe detect on umount that FS is unshared and try to enter (by setns()) parental namespace to update utab, but it sounds a little bit complicated.

@poettering
Copy link
Member

This is an expected utab limitation.

The question is how to improve it in some elegant way. Maybe detect on umount that FS is unshared and try to enter (by setns()) parental namespace to update utab, but it sounds a little bit complicated.

maybe revalidate all entries whenever you make changes to the file

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimization pid1 RFE 🎁 Request for Enhancement, i.e. a feature request
7 participants