
Fluent Bit keeps Windows k8s pods from shutting down #2027

Closed
fujimotos opened this issue Mar 18, 2020 · 17 comments

@fujimotos
Member

Bug Report

Describe the bug

When a Windows pod gets deleted, the pod stays in the Terminating state forever until we delete the Fluent Bit pod running on the same node.

To Reproduce

[SERVICE] 
        Flush        5s
        Daemon       Off
        Log_Level    info
        Parsers_File parsers.conf
        Plugins_File plugins.conf
        HTTP_Server  On
        HTTP_Listen  0.0.0.0
        HTTP_Port    2020
        
[INPUT] 
        Name             tail
        Path             C:\k\kubelet.err.log
        Parser           raw
        Path_Key         kubelet_err_log
        DB               C:\tail2.db
        Tag              host.* 
        Refresh_Interval 60
        Rotate_Wait      5
        Skip_Long_Lines  On
        DB.Sync          Normal

[FILTER]
        Name modify
        Match host.*
        Add _HOSTNAME ${NODE_NAME}

[INPUT]
        Name             tail
        Path             C:\var\log\containers\*
        Parser           docker
        DB               C:\tail.db
        Tag              containers.*
        Refresh_Interval 60
        Rotate_Wait      5
        Skip_Long_Lines  On 
        DB.Sync          Normal

[FILTER]
        Name             kubernetes
        Match            containers.*
        tls.verify       On
        Kube_URL         https://kubernetes.default.svc.cluster.local:443
        Kube_CA_File     /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File  /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix  kube.var.log.containers.
        Merge_Log        On
        Merge_Log_Key    log_processed

[OUTPUT]
        Name forward
        Match *
        Host receiver.svc.cluster.local
        Port 24321
        tls off
        tls.verify on

Expected behavior

No locking; the deleted pod should terminate normally.

Additional context

This issue was originally reported by @djsly and @titilambert in #1159.

@vishiy

vishiy commented Mar 18, 2020

I see the same issue on the 1.4 dev branch.

@bragi92

bragi92 commented Mar 18, 2020

Yes, I see the issue with the following fluent-bit version.

#1888 (comment)

@djsly

djsly commented Mar 24, 2020

Do we know if this is a kubelet issue or a Fluent Bit issue at this point?

@vishiy / @bragi92 are any of you working with the MSFT AKS Windows team to identify whether this could be specific to AKS/Windows pods (kubelet)?

@fujimotos any chance you were able to reproduce with the provided configuration?

@vishiy

vishiy commented Mar 24, 2020

@mikkelhegn

@mikkelhegn

Kubelet logs or K8 events from the pods stuck in terminating would be good to start getting a hint as to where to look next.

@vishiy

vishiy commented Mar 25, 2020

Kubelet logs or K8 events from the pods stuck in terminating would be good to start getting a hint as to where to look next.

@mikkelhegn - Below is the kubelet error from the node. Nothing specific to this issue shows up in the Kubernetes events.

E0325 00:32:14.667270 4400 remote_runtime.go:261] RemoveContainer "cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d": Error response from daemon: unable to remove filesystem for cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d: CreateFile C:\ProgramData\docker\containers\cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d\cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d-json.log: Access is denied.

@djsly

djsly commented Mar 25, 2020 via email

If you look closely. From what I recall, the file is gone. Seems like a lock is kept on the file by the fluentbit process. But from a docker point of view, everything was removed.

@vishiy

vishiy commented Mar 25, 2020

@djsly - the file exists, but access is denied to do anything on it (including read).
[screenshot]

@djsly

djsly commented Mar 25, 2020

OK, thanks. I will try to check whether it is always this case that I hit. From the looks of it, it seems that Fluent Bit is keeping a lock, since as soon as we delete the Fluent Bit pod, the files disappear and the kubelet continues with the deletion process (the pod disappears).

I'm guessing we have an issue with the way Fluent Bit opens the file?

@vishiy

vishiy commented Mar 25, 2020

@djsly Yeah. I see the below in the open code. The dwFlagsAndAttributes being 0 looks weird; I would expect FILE_ATTRIBUTE_NORMAL.

h = CreateFileA(path,
                GENERIC_READ,
                FILE_SHARE_READ|FILE_SHARE_WRITE|FILE_SHARE_DELETE,
                NULL,           /* lpSecurityAttributes  */
                OPEN_EXISTING,  /* dwCreationDisposition */
                0,              /* dwFlagsAndAttributes  */
                NULL);          /* hTemplateFile         */
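
For context, here is a minimal, self-contained sketch (made-up demo path; not Fluent Bit's actual code) of how that share mode interacts with deletion on NTFS: DeleteFileA succeeds because the reader granted FILE_SHARE_DELETE, but the file only becomes delete-pending and is not removed until the last open handle is closed, and reopening it in the meantime fails with access denied, which matches the docker error above.

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Made-up path for this demo; create the file beforehand. */
    const char *path = "C:\\tmp\\demo-json.log";

    /* Open the file like the tail reader above: shared read/write/delete. */
    HANDLE h = CreateFileA(path, GENERIC_READ,
                           FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                           NULL, OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        printf("open failed: %lu\n", GetLastError());
        return 1;
    }

    /* Deletion is allowed because of FILE_SHARE_DELETE, but NTFS only
     * marks the file delete-pending while our handle stays open. */
    if (!DeleteFileA(path))
        printf("DeleteFileA failed: %lu\n", GetLastError());

    /* Anyone opening the delete-pending file now gets ERROR_ACCESS_DENIED,
     * the same symptom docker/kubelet hit above. */
    HANDLE h2 = CreateFileA(path, GENERIC_READ,
                            FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                            NULL, OPEN_EXISTING, 0, NULL);
    if (h2 == INVALID_HANDLE_VALUE)
        printf("reopen denied: %lu\n", GetLastError());
    else
        CloseHandle(h2);

    CloseHandle(h);  /* only now does the file actually disappear */
    return 0;
}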

@djsly

djsly commented Mar 25, 2020

From the error, I think the issue might be in Docker...

This is the kubelet part; it just relays the RemoveContainer call to the lower-level runtime engine (in this case Docker):

E0325 00:32:14.667270 4400 remote_runtime.go:261] RemoveContainer "cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d":

From here: https://github.com/kubernetes/kubernetes/blob/393bb44ea95e4690775bbf7ea73fc695f6ce5e6c/pkg/kubelet/dockershim/docker_container.go#L306-L309

Docker has this issue:

Error response from daemon: unable to remove filesystem for cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d: 

The client code returns the daemon error ("Error response from daemon") based on the following ("unable to remove filesystem for"):
https://github.com/moby/moby/blob/8921b59d5df10b33f5fd037b8af99aed08e6736a/daemon/delete.go#L130-L133

We see that EnsureRemoveAll is called.

The remaining error seems to be coming from Go:

CreateFile C:\ProgramData\docker\containers\cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d\cdce17ef7168b13b58d9409524324d0067f46a554979cfea0db7f6a2fcc0627d-json.log: Access is denied.

I cannot find many CreateFile calls into the system from the Go os library, except around symlink handling. I'm not sure if this is a hint back to the fact that the container logs are indeed symlinks, given our [INPUT] path:

C:\var\log\containers\*

I will try to find time to test with

C:\var\log\pods\*
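
As an aside on the symlink point above: CreateFileA follows a symlink by default, so the handle (and any lock it holds) lands on the real file the link points to. Below is a minimal sketch, with a made-up path and not taken from Fluent Bit, that prints which file a handle is really holding open, via GetFinalPathNameByHandleA.

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Made-up symlinked log path, for illustration only. */
    const char *path = "C:\\var\\log\\containers\\demo.log";
    char target[MAX_PATH];

    /* Without FILE_FLAG_OPEN_REPARSE_POINT the link is resolved, so the
     * handle refers to the real file (e.g. under C:\var\log\pods\...). */
    HANDLE h = CreateFileA(path, GENERIC_READ,
                           FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                           NULL, OPEN_EXISTING, 0, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        printf("open failed: %lu\n", GetLastError());
        return 1;
    }

    /* Resolve the handle back to the path it is actually holding open. */
    if (GetFinalPathNameByHandleA(h, target, sizeof(target),
                                  FILE_NAME_NORMALIZED) > 0) {
        printf("handle points at: %s\n", target);
    }

    CloseHandle(h);
    return 0;
}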

@fujimotos
Member Author

@djsly @titilambert Hmm. I spent several hours this evening trying to
reproduce this issue, but I couldn't.

In particular, I asked Kubernetes to create two pods on the same
Windows node:

$ kubectl get pod -o wide
NAME                          READY   STATUS    RESTARTS   AGE     IP            NODE             NOMINATED NODE   READINESS GATES
fluent-bit-84ff694f4f-c7h6x   1/1     Running   0          4m36s   10.240.0.17   akswin32000000   <none>           <none>
dotnet-app-746c4444fd-4mcrr   1/1     Running   0          3m50s   10.240.0.10   akswin32000000   <none>           <none>

And when I deleted the .NET app, it just worked:

$ kubectl delete pod dotnet-app-746c4444fd-4mcrr
pod "dotnet-app-746c4444fd-4mcrr" deleted
$ kubectl get pod
NAME                                         READY   STATUS    RESTARTS   AGE
fluent-bit-84ff694f4f-c7h6x   1/1     Running   0          11m
dotnet-app-746c4444fd-8qg8l   1/1     Running   0          52s

I used Fluent Bit v1.5.0 (master head) and k8s 1.16.7 for the testing.

I think something is missing from my setup; I will recheck tomorrow.
If you have additional info about your environment (e.g. the k8s YAML),
I'd appreciate it if you could share it.

@fujimotos
Member Author

On my second try, I successfully reproduced the issue using v1.4.3.

If you look closely. From what I recall, the file is gone. Seems like a lock is kept on the file by the fluentbit process. But from a docker point of view, everything was removed.

So the root issue was handle management. Fluent Bit was keeping open
handles to files in /var/log/containers, and that kept NTFS from deleting them.
Since a pod waits for that deletion before it can finish shutting down, this caused a deadlock.

A proper fix would be teaching Fluent Bit about the DeletePending flag
so that it can swiftly close handles on to-be-deleted files...

... and then I realized that I had already implemented that feature in #2133.
And indeed, I could confirm that the deadlock issue is gone as of Fluent
Bit v1.4.4.

# Testing with Fluent Bit v1.4.4
$ kubectl delete pod dotnet-app-64db59b858-86qz9
pod "dotnet-app-64db59b858-86qz9" deleted
$

So in short, please use Fluent Bit v1.4.4 or later, and kubectl delete
should just work.
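
For illustration, here is a minimal sketch of the kind of DeletePending check described above. The helper name is made up and this is not the actual code from #2133; it only shows that GetFileInformationByHandleEx with FileStandardInfo exposes the flag, so a tail reader can close its handle and let NTFS finish the deletion.

#include <windows.h>
#include <stdbool.h>

/* Hypothetical helper (not the #2133 implementation): returns true when the
 * file behind `h` has been marked for deletion, so the caller can close the
 * handle and unblock the docker/kubelet cleanup. */
static bool file_is_delete_pending(HANDLE h)
{
    FILE_STANDARD_INFO info;

    if (!GetFileInformationByHandleEx(h, FileStandardInfo,
                                      &info, sizeof(info))) {
        return false;  /* treat a failed query as "not pending" */
    }
    return info.DeletePending != FALSE;
}

/* Usage sketch inside a tail loop:
 *
 *     if (file_is_delete_pending(h)) {
 *         CloseHandle(h);   // let NTFS actually remove the file
 *     }
 */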

@djsly @titilambert @vishiy I'm closing this ticket now. Please feel free
to reopen it if you experience this issue on newer versions.

@lizhuqi

lizhuqi commented Jul 30, 2021

Hi, I am still seeing the error mentioned in #issuecomment-603576125 with Fluent Bit 1.7.7.

Will the fix in fluentd#3340 also apply to the next fluent-bit release?

@fujimotos
Member Author

Hi, I am still seeing the error mentioned in #issuecomment-603576125 for fluentbit 1.7.7

@lizhuqi Fluent Bit already had that issue fixed in v1.5.

If you can reproduce the same issue in Fluent Bit v1.7.7, then it must be another bug.
Post a new bug report with your configuration and log files.

@djsly

djsly commented Jul 30, 2021 via email

@lizhuqi

lizhuqi commented Jul 30, 2021

I opened #3892 to report the edge case which can trigger the error mentioned in #issuecomment-603576125
