Fluent Bit keeps Windows k8s pods from shutting down #2027
Comments
Seeing the same issue on the 1.4 dev branch.
Yes, I see the issue with the following fluent-bit version.
Do we know if this is a kubelet issue or a Fluent Bit issue at this point? @vishiy / @bragi92, are any of you working with the MSFT AKS Windows team to identify whether this could be AKS/Windows pod (kubelet) specific? @fujimotos, any chance you were able to reproduce with the provided configuration?
Kubelet logs or k8s events from the pods stuck in Terminating would be good, to start getting a hint as to where to look next.
@mikkelhegn - Below is the kubelet error from the node. Nothing specific to this issue in the Kubernetes events.

    E0325 00:32:14.667270 4400 remote_runtime.go:261] RemoveContainer "cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d": Error response from daemon: unable to remove filesystem for cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d: CreateFile C:\ProgramData\docker\containers\cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d\cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d-json.log: Access is denied.
If you look closely: from what I recall, the file is gone. It seems like a lock is kept on the file by the Fluent Bit process, but from a Docker point of view everything was removed.
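A minimal way to check that hypothesis outside of Kubernetes is sketched below. This is not Fluent Bit code and the path is just an illustrative test file: the program holds a read handle that does not share deletion, then tries to delete the file roughly the way the container runtime would.

```c
/* Sketch to reproduce the suspected locking behaviour outside Kubernetes.
 * Not Fluent Bit code; the path is an arbitrary test file. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    const char *path = "C:\\temp\\sample-json.log";

    /* Hold the file the way a reader that omits FILE_SHARE_DELETE would:
     * reads and writes are shared, deletion is not. */
    HANDLE h = CreateFileA(path, GENERIC_READ,
                           FILE_SHARE_READ | FILE_SHARE_WRITE,
                           NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        fprintf(stderr, "open failed: %lu\n", GetLastError());
        return 1;
    }

    /* Roughly what the runtime does when it removes the container log:
     * while the handle above is open, the delete is refused. */
    if (!DeleteFileA(path))
        printf("delete refused while the handle is held, error %lu\n",
               GetLastError());

    CloseHandle(h);

    /* Once the handle is gone, the same delete succeeds. */
    if (DeleteFileA(path))
        printf("delete succeeded after closing the handle\n");

    return 0;
}
```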
@djsly - the file exists, but access is denied to do anything on it (including read).
OK, thanks. I will try to see if this is always the case that I hit. From the looks of it, it seems that Fluent Bit is keeping a lock, since as soon as we delete the Fluent Bit pod, the files disappear and the kubelet continues with the deletion process (the pod disappears). I'm guessing we have an issue with the way Fluent Bit opens the file?
@djsly Yeah. Looking at the file-open code, I see dwFlagsAndAttributes being 0, which is weird. I would expect FILE_ATTRIBUTE_NORMAL.
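For reference, a sketch of the open mode a Windows log tailer generally needs so that the runtime can still delete the file it is reading; this is not Fluent Bit's actual implementation and the function name is made up. The share mode is the part that decides whether the json.log can be removed, and FILE_ATTRIBUTE_NORMAL would replace the 0 noted above.

```c
/* Sketch only: what a Windows tail-style reader typically passes to
 * CreateFileW so other processes can still rotate or delete the log
 * while it is being read. Not Fluent Bit's actual code. */
#include <windows.h>

HANDLE open_log_for_tailing(const wchar_t *path)
{
    return CreateFileW(
        path,
        GENERIC_READ,
        /* The share mode is what matters for this issue: without
         * FILE_SHARE_DELETE, nobody else can remove the file. */
        FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
        NULL,                    /* default security attributes */
        OPEN_EXISTING,
        FILE_ATTRIBUTE_NORMAL,   /* instead of passing 0 here */
        NULL);
}
```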
From the error, I think the issue might be in Docker... This is the kubelet part: it just relays the RemoveContainer call to the lower-level runtime engine (in this case Docker).
Docker has this issue.
The client code returns the daemon error. We see that the remaining error seems to be coming from Go.
I cannot find too many.
I will try to find time to test with
@djsly @titilambert Hmm. I spent several hours this evening trying to reproduce this issue.

In particular, I asked Kubernetes to create two pods on the same Windows node:

    $ kubectl get pod -o wide
    NAME                          READY   STATUS    RESTARTS   AGE     IP            NODE             NOMINATED NODE   READINESS GATES
    fluent-bit-84ff694f4f-c7h6x   1/1     Running   0          4m36s   10.240.0.17   akswin32000000   <none>           <none>
    dotnet-app-746c4444fd-4mcrr   1/1     Running   0          3m50s   10.240.0.10   akswin32000000   <none>           <none>

And when I deleted the .NET app, it just worked:

    $ kubectl delete pod dotnet-app-746c4444fd-4mcrr
    pod "dotnet-app-746c4444fd-4mcrr" deleted

    $ kubectl get pod
    NAME                          READY   STATUS    RESTARTS   AGE
    fluent-bit-84ff694f4f-c7h6x   1/1     Running   0          11m
    dotnet-app-746c4444fd-8qg8l   1/1     Running   0          52s

I used Fluent Bit v1.5.0 (master head) and k8s 1.16.7 for the testing. I think something is missing from my setup; I will recheck tomorrow.
On my second try, I successfully reproduced the issue using v1.4.3.

So the root issue was the handle management: Fluent Bit was keeping open handles to the container log files, which prevented Windows from removing them. A proper fix would be teaching Fluent Bit not to keep such files locked ... and then I realized that I already implemented that feature in #2133.

    # Testing with Fluent Bit v1.4.4
    $ kubectl delete pod dotnet-app-64db59b858-86qz9
    pod "dotnet-app-64db59b858-86qz9" deleted
    $

So in short, please use Fluent Bit v1.4.4 or later. @djsly @titilambert @vishiy I will close this ticket now. Please feel free to reopen it if you still see the problem.
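For completeness, here is one way a Windows tail reader can notice that a file it still holds open has been scheduled for deletion and should be released; this is a sketch under assumptions, not necessarily how #2133 implements the fix. It relies on the file having been opened with sharing that allows deletion, in which case the runtime's delete leaves the file delete-pending until every handle is closed.

```c
/* Sketch: detect that the tailed file has a pending delete so the reader
 * can close its handle and let the removal complete.
 * Not necessarily how the actual fix in #2133 works. */
#include <windows.h>
#include <stdbool.h>

static bool tailed_file_delete_pending(HANDLE h)
{
    FILE_STANDARD_INFO info;

    if (!GetFileInformationByHandleEx(h, FileStandardInfo,
                                      &info, sizeof(info)))
        return false;            /* query failed; treat the file as live */

    /* TRUE once another process (e.g. the container runtime) has asked
     * for the file to be deleted; closing our handle lets that finish. */
    return info.DeletePending;
}
```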
Hi, I am still seeing the error mentioned in #issuecomment-603576125 with Fluent Bit 1.7.7. Will the fix in fluentd#3340 also apply to the next fluent-bit release?
@lizhuqi Fluent Bit already had that issue fixed in v1.5. If you can reproduce the same issue in Fluent Bit v1.7.7, then it must be another bug. Post a new bug report with your configuration and log files.
Correct, we are still seeing this with a recent Fluent Bit version.
It's less often, but there must be a new edge case that triggers it.
A restart of Fluent Bit fixes the pods stuck in Terminating.
I opened #3892 to report the edge case which can trigger the error mentioned in #issuecomment-603576125.
Bug Report
Describe the bug
When a Windows pod gets deleted, the pod stays in the Terminating state forever until we delete the Fluent Bit pod running on the same node.
To Reproduce
Expected behavior
No locking
Additional context
This issue was originally reported by @djsly and @titilambert in #1159.