deflate_on_oom doesn't seem to work as expected/documented #4324
Comments
Hi Volker, thanks for reporting this. We will take a look and reproduce it, but in the meantime I'd like to point out that this configuration:
is not supported. Would you be able to try and reproduce with a supported set of host/guest kernels? What we test with is
Also, to answer your question:
This should work, and we have tests that indicate it does work, i.e. the balloon gets deflated; however, we do not track the
Sorry for the late answer @bchalios, @pb8o. I finally managed to run my experiments on a "supported" platform, but unfortunately the results are exactly the same.
Host: 6.1.72-96.166.amzn2023.x86_64
So to summarize the problem: when I start Firecracker with a large ballooning device and `deflate_on_oom=true`, and then start a process in the guest that needs more memory than is currently free, the guest effectively grinds to a halt.
The guest itself is not really dead-locked, just extremely slow. I can ssh into it and see the following:
The Java process I've started is hanging. Querying the balloon metrics from the guest shows that the balloon slightly deflates itself, but this happens extremely slowly. E.g. initially we have something like this:
And after about 45 minutes we get to:
If I wait about 60 minutes, the Java process eventually gets enough memory and starts. So ballooning is indeed "kind of" working, but not really practically usable. I would expect that the ballooning device deflates much more promptly in this case.
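For reference, the balloon's target vs. actual size can be observed over time via Firecracker's balloon statistics endpoint. A minimal sketch, assuming the API socket is at /tmp/firecracker.socket (a placeholder) and the balloon was configured with a non-zero stats_polling_interval_s:

```sh
# Poll the balloon statistics from the host once a minute.
# Assumes the Firecracker API socket is /tmp/firecracker.socket (adjust to your setup)
# and that statistics were enabled via stats_polling_interval_s when the balloon was set up.
while true; do
  date
  curl --silent --unix-socket /tmp/firecracker.socket \
       -X GET 'http://localhost/balloon/statistics'
  echo
  sleep 60
done
```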
I did one more run to confirm the behavior and collect more numbers:
As you can see, it takes more than an hour until the Java process finally manages to start. The other interesting observation is that after […]. PS: these results were collected on a […].
Hi @simonis, I've been taking a look at this. The deflate-on-oom is indeed a slow process; it is managed by the balloon driver in the guest kernel, not by Firecracker itself. The driver seems to try to release as little memory as possible on deflate. The balloon will indeed not re-inflate after being deflated on OOM if it has already reached its target size; this appears to be by design in the driver. However, if the balloon has not yet reached its target size, it will continue trying to inflate even after being deflated. The high CPU usage while trying to reach its target size again also appears to be by design: the driver aggressively tries to allocate memory to reach its target. Overall, it looks like the driver intends the balloon to be inflated for shorter periods of time, to free up memory before an operation such as a VM migration. Hope this helps
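One rough way to see how quickly pages actually become usable again inside the guest while the driver deflates is to sample /proc/meminfo periodically; a small sketch (standard tools only, nothing Firecracker-specific):

```sh
# Inside the guest: sample memory counters once a minute to see how quickly
# the balloon driver returns pages while deflating on OOM.
while true; do
  date
  grep -E 'MemTotal|MemFree|MemAvailable' /proc/meminfo
  sleep 60
done
```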
After reading the Ballooning documentation, my understanding of the `deflate_on_oom` parameter is that, if it is set to `true`, the ballooning device will be deflated automatically if a process in the guest requires memory pages which cannot otherwise be provided.

However, if I run Firecracker with e.g. 2 vCPUs, 1 GiB of memory and a balloon device of 900 MiB:
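(The original configuration block is not preserved here; the following is a minimal sketch of an equivalent setup through the Firecracker API, assuming an API socket at /tmp/firecracker.socket — all paths are placeholders.)

```sh
# Sketch: 2 vCPUs, 1024 MiB of guest memory, and a 900 MiB balloon with
# deflate_on_oom enabled, configured before the guest is started.
curl --unix-socket /tmp/firecracker.socket -i \
     -X PUT 'http://localhost/machine-config' \
     -H 'Content-Type: application/json' \
     -d '{"vcpu_count": 2, "mem_size_mib": 1024}'

curl --unix-socket /tmp/firecracker.socket -i \
     -X PUT 'http://localhost/balloon' \
     -H 'Content-Type: application/json' \
     -d '{"amount_mib": 900, "deflate_on_oom": true, "stats_polling_interval_s": 1}'
```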
...and then try to start a Java process in the guest with `-Xms800m -Xmx800m` (i.e. with a heap size of 800 MiB), the Java process in the guest will hang, Firecracker will use ~200% CPU time, but the actual size occupied by the ballooning device in the guest will not change and remains at 900 MiB.

Once I reset the target size of the ballooning device to 100 MiB, the Java process becomes unblocked and starts.
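For completeness, a sketch of how the balloon target can be lowered at runtime (same assumed socket path as above):

```sh
# Sketch: shrink the balloon target to 100 MiB at runtime; this hands memory
# back to the guest, after which the blocked Java process can proceed.
curl --unix-socket /tmp/firecracker.socket -i \
     -X PATCH 'http://localhost/balloon' \
     -H 'Content-Type: application/json' \
     -d '{"amount_mib": 100}'
```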
However, from the documentation of the `deflate_on_oom` option I would have expected that the guest kernel would deflate the ballooning device automatically if `deflate_on_oom=true`?

If I run the same experiment with `deflate_on_oom=false`, I instantly get an out-of-memory error when trying to start the Java process, which is what I would have expected.
Also, if I increase (i.e. inflate) the balloon to 900 MiB again after I've started the Java process, I start getting warnings from the ballooning driver (as documented):
...but the CPU usage again goes up to almost 200%. Is this expected? I mean, the warnings are OK, but I wouldn't expect Firecracker to burn all its CPU shares while trying to inflate the balloon.
So to summarize: is the described behavior with `deflate_on_oom=true` a bug in the implementation, or have I misunderstood the behavior of the ballooning device in the event of low memory in the guest?

PS: I've used the following kernel and FC versions for the experiments:
Guest kernel: 5.19.8
Host kernel : 6.5.7 (Ubuntu 20.04)
Firecracker : 1.5.1 and 1.6.0-dev (from today, 036d9906)