
Pods stuck "Terminating" on Windows nodes #106

Closed
lizhuqi opened this issue Mar 30, 2021 · 45 comments · Fixed by Azure/aks-engine#4556
Labels
bug (Something isn't working) · Windows on AKS (Deploying Windows Container using AKS) · Windows on Kubernetes (Windows Containers using Kubernetes)

Comments

@lizhuqi

lizhuqi commented Mar 30, 2021

We're seeing pods that fail to terminate, and we have to manually run "kubectl delete --force" to force-terminate them.

See the error message:
RemoveContainer "[ID]" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "[ID]": Error response from daemon: unable to remove filesystem for [ID]: CreateFile C:\ProgramData\docker\containers[ID][ID]-json.log.4: Access is denied.

A similar issue was reported in AKS #1765 and fixed by an upstream PR available in 1.17.12+. However, the issue still reproduces in 1.17.15.
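
For reference, the force-delete we have to run typically looks like this (pod name and namespace are placeholders):

kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force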

@ghost ghost added the triage (New and needs attention) label Mar 30, 2021
@lizhuqi
Author

lizhuqi commented Mar 30, 2021

If you manually remove the json.log file, the container can be removed and the pod can then be terminated.
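
For illustration, the manual cleanup on the node looks roughly like this (the container ID is a placeholder; the explicit docker rm may not be needed if kubelet retries the removal on its own):

Remove-Item -Force "C:\ProgramData\docker\containers\<container-id>\<container-id>-json.log"
docker rm <container-id>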

@vrapolinario vrapolinario added the Windows on AKS (Deploying Windows Container using AKS) and bug (Something isn't working) labels and removed the triage (New and needs attention) label Mar 31, 2021
@vrapolinario
Contributor

@lizhuqi can you please confirm whether the issue you're seeing is on AKS or outside of AKS? CC: @keikhara for visibility.

@lizhuqi
Author

lizhuqi commented Apr 1, 2021

It is seen in GKE.

@vrapolinario vrapolinario added the Windows on Kubernetes (Windows Containers using Kubernetes) label Apr 2, 2021
@vrapolinario
Contributor

Got it. I've added the Windows on Kubernetes label. @immuzz or @keikhara, can you please confirm whether this issue can be reproduced on AKS after the fix described above? If not, it might be something specific to the GKE platform.

@lizhuqi
Author

lizhuqi commented Apr 2, 2021

We took another look. It seems the failure was caused by our logging agent handling the symlinked log file, and symlink commands result in an access-denied error in Windows containers. So I think the issue I reported here is a duplicate of #97.

@lizhuqi
Author

lizhuqi commented Apr 15, 2021

More logs for this issue.

2021-04-15T23:06:15Z Handler for DELETE /v1.40/containers/[CONTAINER-ID] returned error: unable to remove filesystem for [CONTAINER-ID]: CreateFile C:\ProgramData\docker\containers\[CONTAINER-ID]\[CONTAINER-ID]-json.log: Access is denied.

2021-04-15T23:06:15.472679Z SyncLoop (PLEG): "[POD-NAME]_NAME", event: &pleg.PodLifecycleEvent{ID:"[ID]", Type:"ContainerDied", Data:"[DATA]"}

2021-04-15T23:06:15.472679Z [fake cpumanager] RemoveContainer (container id: [CONTAINER-ID])
2021-04-15T23:06:15.499033Z RemoveContainer "[CONTAINER-ID]" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "[CONTAINER-ID]": Error response from daemon: unable to remove filesystem for [CONTAINER-ID]: CreateFile C:\ProgramData\docker\containers\[CONTAINER-ID]\[CONTAINER-ID]-json.log: Access is denied.

2021-04-15T23:06:16.024224Z detected rotation of /var/log/containers/[POD-NAME]_[NAMESPACE-NAME]_[CONTAINER-NAME]-[CONTAINER-ID].log; waiting 5 seconds

@lizhuqi
Author

lizhuqi commented Apr 20, 2021

Any update on this issue?

@lizhuqi
Author

lizhuqi commented Apr 22, 2021

Below is the docker version used:
Container Runtime Version: docker://19.3.13

@jsturtevant

What OS version is your host? Can you run cmd /c ver on the host?

@lizhuqi
Author

lizhuqi commented Apr 27, 2021

It reproduced on 10.0.17763.1577, but the issue is gone after upgrading to 10.0.17763.1757.

Yeah, we can close the bug now.

@lizhuqi lizhuqi closed this as completed Apr 27, 2021
@lizhuqi lizhuqi reopened this May 7, 2021
@lizhuqi
Author

lizhuqi commented May 7, 2021

Sad to reopen this issue, but we're seeing it happen again in our customer's cluster. This time the cluster has both the kubelet fix and the OS fix.

Kernel version: 10.0.17763.1757
OS image: Windows Server 2019 Datacenter
Container runtime version: docker://19.3.14
kubelet version: v1.17.17-gke.3700
kube-proxy version: v1.17.17-gke.3700

@jsturtevant

Does this reproduce in later versions of Kubernetes?

@lizhuqi
Author

lizhuqi commented May 12, 2021

I saw a related issue in moby: moby/moby#39274

The error also seems to cause our logging agent to fail to read the container logs.

@lizhuqi
Author

lizhuqi commented May 12, 2021

Yes, this also reproduces in 1.19.

See the versions for the latest repro in our internal test:
VERSION OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
v1.19.7-gke.2400 Windows Server 2019 Standard 10.0.17763.1282 docker://19.3.2

@jeremyje

jeremyje commented May 12, 2021

kubectl --kubeconfig user-kubeconfig.yaml get nodes -o wide
NAME            STATUS   ROLES    AGE   VERSION            INTERNAL-IP    EXTERNAL-IP    OS-IMAGE                       KERNEL-VERSION     CONTAINER-RUNTIME
ddff-1500fd41   Ready    <none>   43h   v1.19.7-gke.2400   21.0.253.65    21.0.253.65    Windows Server 2019 Standard   10.0.17763.1282    docker://19.3.2
ddff-1500fdc6   Ready    <none>   43h   v1.19.7-gke.2400   21.0.253.198   21.0.253.198   Windows Server 2019 Standard   10.0.17763.1282    docker://19.3.2
ddff-1500fe8a   Ready    <none>   43h   v1.19.7-gke.2400   21.0.254.138   21.0.254.138   Ubuntu 18.04.5 LTS             5.4.0-1014-gkeop   docker://19.3.2
ddff-1500feb3   Ready    <none>   43h   v1.19.7-gke.2400   21.0.254.179   21.0.254.179   Ubuntu 18.04.5 LTS             5.4.0-1014-gkeop   docker://19.3.2
ddff-1500ff09   Ready    <none>   43h   v1.19.7-gke.2400   21.0.255.9     21.0.255.9     Ubuntu 18.04.5 LTS             5.4.0-1014-gkeop   docker://19.3.2
apiVersion: apps/v1
kind: Deployment
metadata:
  name: windows-log-test
  labels:
    app: windows-log-test
spec:
  selector:
    matchLabels:
      app: windows-log-test
  replicas: 2
  template:
    metadata:
      labels:
        app: windows-log-test
    spec:
      containers:
      - name: winloggertest
        image: gcr.io/gke-release/pause-win:1.6.1
        command:
        - powershell.exe
        - -command
        - Write-Host('log some message...')
      nodeSelector:
        kubernetes.io/os: windows
      tolerations:
      - key: "node.kubernetes.io/os"
        operator: "Equal"
        value: "windows"
        effect: "NoSchedule"

In the kubelet logs you'll see:

E0512 14:19:32.145807    4684 docker_sandbox.go:730] ResolvConfPath is empty.
E0512 14:19:50.340739    4684 docker_stats_windows.go:42] Error opening container (stats will be missing) 'd8c1850ca559b976681d11fda2518d837389abff8bfe7a910bede5740c26a9b8': hcsshim::System::Properties d8c1850ca559b976681d11fda2518d837389abff8bfe7a910bede5740c26a9b8: The requested virtual machine or container operation is not valid in the current state.
E0512 14:19:51.321091    4684 pod_workers.go:191] Error syncing pod 91f0a6e0-34e2-4bcc-a0e1-4e2eb1ace3d7 ("windows-log-test-6c9f887d7d-lpk8x_default(91f0a6e0-34e2-4bcc-a0e1-4e2eb1ace3d7)"), skipping: failed to "StartContainer" for "winloggertest" with CrashLoopBackOff: "back-off 10s restarting failed container=winloggertest pod=windows-log-test-6c9f887d7d-lpk8x_default(91f0a6e0-34e2-4bcc-a0e1-4e2eb1ace3d7)"
E0512 14:19:52.687370    4684 pod_workers.go:191] Error syncing pod 91f0a6e0-34e2-4bcc-a0e1-4e2eb1ace3d7 ("windows-log-test-6c9f887d7d-lpk8x_default(91f0a6e0-34e2-4bcc-a0e1-4e2eb1ace3d7)"), skipping: failed to "StartContainer" for "winloggertest" with CrashLoopBackOff: "back-off 10s restarting failed container=winloggertest pod=windows-log-test-6c9f887d7d-lpk8x_default(91f0a6e0-34e2-4bcc-a0e1-4e2eb1ace3d7)"
E0512 14:20:21.101782    4684 pod_workers.go:191] Error syncing pod 91f0a6e0-34e2-4bcc-a0e1-4e2eb1ace3d7 ("windows-log-test-6c9f887d7d-lpk8x_default(91f0a6e0-34e2-4bcc-a0e1-4e2eb1ace3d7)"), skipping: failed to "StartContainer" for "winloggertest" with CrashLoopBackOff: "back-off 20s restarting failed container=winloggertest pod=windows-log-test-6c9f887d7d-lpk8x_default(91f0a6e0-34e2-4bcc-a0e1-4e2eb1ace3d7)"
E0512 14:20:21.116783    4684 remote_runtime.go:291] RemoveContainer "d8c1850ca559b976681d11fda2518d837389abff8bfe7a910bede5740c26a9b8" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "d8c1850ca559b976681d11fda2518d837389abff8bfe7a910bede5740c26a9b8": Error response from daemon: unable to remove filesystem for d8c1850ca559b976681d11fda2518d837389abff8bfe7a910bede5740c26a9b8: CreateFile C:\ProgramData\docker\containers\d8c1850ca559b976681d11fda2518d837389abff8bfe7a910bede5740c26a9b8\d8c1850ca559b976681d11fda2518d837389abff8bfe7a910bede5740c26a9b8-json.log: Access is denied.

After a few moments you'll see this:

kubectl get pods -o wide
NAME                               READY   STATUS             RESTARTS   AGE     IP          NODE              NOMINATED NODE   READINESS GATES
windows-log-test-55f64676b-p7svf   0/1     CrashLoopBackOff   3          2m19s   10.96.4.4   gke-2c9026-3hxz   <none>           <none>
windows-log-test-55f64676b-qdw5x   0/1     Completed          4          2m19s   10.96.2.4   gke-2c9026-zskg   <none>           <none>
kubectl get nodes -o wide
NAME                               STATUS   ROLES    AGE     VERSION            INTERNAL-IP    EXTERNAL-IP       OS-IMAGE                             KERNEL-VERSION    CONTAINER-RUNTIME
gke-2c9026-0k31                    Ready    <none>   9m2s    v1.18.17-gke.100   10.138.0.223   104.196.253.239   Windows Server 2019 Datacenter       10.0.17763.1817   docker://19.3.14
gke-2c9026-3hxz                    Ready    <none>   8m54s   v1.18.17-gke.100   10.138.0.216   35.197.104.170    Windows Server 2019 Datacenter       10.0.17763.1817   docker://19.3.14
gke-2c9026-zskg                    Ready    <none>   9m1s    v1.18.17-gke.100   10.138.0.55    34.82.111.231     Windows Server 2019 Datacenter       10.0.17763.1817   docker://19.3.14
gke-r-default-pool-fffa6112-47jp   Ready    <none>   11m     v1.18.17-gke.100   10.138.0.204   34.83.230.3       Container-Optimized OS from Google   5.4.89+           docker://19.3.14
gke-r-default-pool-fffa6112-g4ss   Ready    <none>   11m     v1.18.17-gke.100   10.138.0.18    35.199.153.15     Container-Optimized OS from Google   5.4.89+           docker://19.3.14
gke-r-default-pool-fffa6112-v728   Ready    <none>   11m     v1.18.17-gke.100   10.138.0.49    34.82.255.36      Container-Optimized OS from Google   5.4.89+           docker://19.3.1

@jeremyje

This reproduces in 1.20.5 with Docker 19.3.14. @jsturtevant

@jsturtevant

I've been able to reproduce this with @jeremyje's deployment spec in Azure with 1.19 and Docker 19.3.14:

kubelet.err.log:469:E0512 23:05:55.112008    4324 remote_runtime.go:291] RemoveContainer "b4105aac93cc83f5bbaff5450e4d6c5dec0c4ccdfa5f8c27a19198b58e9afd90" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "b4105aac93cc83f5bbaff5450e4d6c5dec0c4ccdfa5f8c27a19198b58e9afd90": Error response from daemon: unable to remove filesystem for b4105aac93cc83f5bbaff5450e4d6c5dec0c4ccdfa5f8c27a19198b58e9afd90: CreateFile C:\ProgramData\docker\containers\b4105aac93cc83f5bbaff5450e4d6c5dec0c4ccdfa5f8c27a19198b58e9afd90\b4105aac93cc83f5bbaff5450e4d6c5dec0c4ccdfa5f8c27a19198b58e9afd90-json.log: Access is denied.

@chenlonglab198

I can reproduce this with AKS 1.20.2 / Node Image Version "AKSWindows-2019-17763.1817.210310".
Is there any update on this issue?

Thanks in Advance!

@immuzz

immuzz commented May 25, 2021

@kevpar do you have any insights here?

@kevpar
Member

kevpar commented Jun 1, 2021

It sounds like this issue is about getting "Access is denied." when opening a path involving a symlink to a directory. Is that correct? Can you share the symlink path and its target?

The only thing that comes to mind is that if we are trying to open a symlink pointing to a directory, and the symlink was created without the SYMBOLIC_LINK_FLAG_DIRECTORY flag, then it will fail with "Access is denied.". If that is the case, the issue can be resolved by passing this flag during symlink creation. If using Go's os.Symlink, the target path needs to exist (as a directory) before the symlink is created.
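
A minimal Go sketch of the behavior described above, with hypothetical paths: on Windows, os.Symlink only sets SYMBOLIC_LINK_FLAG_DIRECTORY when the target already exists as a directory, so the target should be created before the link.

package main

import (
    "log"
    "os"
)

func main() {
    // Hypothetical paths, for illustration only.
    target := `C:\ProgramData\docker\containers\example`
    link := `C:\var\log\containers\example`

    // Create the target directory first. If the target does not exist yet,
    // os.Symlink produces a file symlink (no SYMBOLIC_LINK_FLAG_DIRECTORY),
    // and opening it as a directory later fails with "Access is denied."
    if err := os.MkdirAll(target, 0o755); err != nil {
        log.Fatal(err)
    }
    if err := os.Symlink(target, link); err != nil {
        log.Fatal(err)
    }
}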

@lizhuqi
Author

lizhuqi commented Jun 3, 2021

The file is C:\ProgramData\docker\containers\[ID]\[ID]-json.log

So it seems to be a log file created by the Docker daemon.

@immuzz

immuzz commented Jun 9, 2021

@lizhuqi Did you try the above workaround @kevpar recommended? Also have you tried this with containerd?

@lizhuqi
Author

lizhuqi commented Jun 10, 2021

@kevpar can you provide more details about the workaround, like the steps to test it?

@kevpar
Member

kevpar commented Jun 11, 2021

I don't know if there is a viable workaround here. I was asking about symlinks because of the earlier comment that this issue seemed like a duplicate of #97. Reading through this again, though, it seems like the issue is internal to Docker: It is failing to delete the container log file at C:\ProgramData\docker\containers\<id>\<id>-json.log.

The symlink is /var/log/containers/[POD-NAME]_[NAMESPACE-NAME]_[CONTAINER-NAME]-[CONTAINER-ID].log and its target is C:\ProgramData\docker\containers\[ID]\[ID]-json.log

Assuming my understanding above is correct, how is this symlink in /var/log related to the Docker error?

It's not clear what could cause Docker to get an access denied error when deleting the container log file. If it's due to something else holding the file open without sharing, I would expect the error to be ERROR_SHARING_VIOLATION rather than ERROR_ACCESS_DENIED.
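
As a rough illustration of that distinction (a sketch assuming golang.org/x/sys/windows is available; the path is hypothetical): a reader holding the file open without FILE_SHARE_DELETE normally makes deletion fail with a sharing violation rather than access denied.

package main

import (
    "fmt"
    "os"

    "golang.org/x/sys/windows"
)

func main() {
    // Hypothetical log file, for illustration only.
    path := `C:\temp\example-json.log`
    if err := os.WriteFile(path, []byte("log line\n"), 0o644); err != nil {
        panic(err)
    }

    // Simulate a log agent tailing the file: open it for reading
    // without FILE_SHARE_DELETE.
    name, _ := windows.UTF16PtrFromString(path)
    h, err := windows.CreateFile(name, windows.GENERIC_READ,
        windows.FILE_SHARE_READ, nil, windows.OPEN_EXISTING,
        windows.FILE_ATTRIBUTE_NORMAL, 0)
    if err != nil {
        panic(err)
    }
    defer windows.CloseHandle(h)

    // While that handle is held, deleting the file is expected to fail
    // with ERROR_SHARING_VIOLATION, not ERROR_ACCESS_DENIED.
    if err := os.Remove(path); err != nil {
        fmt.Println("remove failed:", err)
    }
}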

@lizhuqi
Author

lizhuqi commented Jun 11, 2021

Maybe the previous fix doesn't fully make the issue described in this comment go away?

@Random-Liu

Is this related? moby/moby@2502db6

Will that fix the issue? That fix is only in Docker 20.10

@lizhuqi
Author

lizhuqi commented Jun 16, 2021

I will check whether this issue reproduces with Docker 20.10.

@lizhuqi
Author

lizhuqi commented Jun 16, 2021

Confirmed the issue still reproduces for v1.20.7 with Container Runtime Version: docker://20.10.4

@lizhuqi
Author

lizhuqi commented Jun 16, 2021

@kevpar and @immuzz - can you take another look? I am happy to try out if there is any workaround.

@Random-Liu

Random-Liu commented Jun 16, 2021

Confirmed the issue still reproduces for v1.20.7 with Container Runtime Version: docker://20.10.4

If the issue still reproduces, then as far as I can tell Docker just calls os.RemoveAll and gets this error, even after the fix moby/moby@2502db6.

This looks like either a Go implementation issue or an underlying Windows issue (bad error reporting?).

In our case, the container directory is actually already gone (at least not visible) on the node when we get this error.

@lizhuqi
Author

lizhuqi commented Jun 24, 2021

@kevpar and @immuzz - can you take another look?

@lizhuqi
Author

lizhuqi commented Jul 1, 2021

Confirmed the issue reproduces for v1.20.6-gke.1000 but not for 1.21.1-gke.1800 or 1.21.1-gke.2200. The main difference between those versions is the Docker runtime: v1.20.6-gke.1000 has docker://19.3.14, while 1.21.1-gke.1800 and 1.21.1-gke.2200 have docker://20.10.4. The issue also doesn't reproduce with containerd. Lantao's guess was right; my previous comment about reproducing on v1.20.7 was wrong.

@lizhuqi
Author

lizhuqi commented Jul 1, 2021

Closing this as the issue is fixed in moby/moby@2502db6.

Anyone who experiences this should upgrade to docker://20.10.4 or switch to containerd.

@lizhuqi
Author

lizhuqi commented Jul 15, 2021

The issue came back. I think last time I tried to reproduce it I didn't start and restart pods frequently enough.

I will try switching to 20.10.6.

@lizhuqi lizhuqi reopened this Jul 15, 2021
@lizhuqi
Author

lizhuqi commented Jul 19, 2021

We have opened an MS support case for this issue and plan to have it transferred to Mirantis support so the Mirantis team can take a look.

@lizhuqi
Author

lizhuqi commented Jul 28, 2021

If we disable the logging agent (fluentd or fluentbit) on the node, the "Access is denied" error goes away. So the logging agent might be the one holding the container log file handle.
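
One way to confirm which process holds the log file open is the Sysinternals Handle tool, assuming it is present on the node (the container ID is a placeholder):

handle.exe <container-id>-json.log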

@lizhuqi
Author

lizhuqi commented Jul 30, 2021

Both fluentd and fluentbit have fixes for this issue.

The fluentbit fix is in version 1.5+. Some edge cases may still trigger the access-is-denied error, but it happens much less often with the fix, and a restart of fluentbit unsticks the pods stuck in Terminating.
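
For the restart mentioned above, something like this should work, assuming fluentbit runs as a DaemonSet (namespace and name are placeholders):

kubectl -n <logging-namespace> rollout restart daemonset <fluentbit-daemonset>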

The fluentd fix has just been merged and will be in its next release.

@lizhuqi
Author

lizhuqi commented Aug 4, 2021

I tested fluentd v1.13.3, which is the latest fluentd release and has the fix we need. Things are much improved: before the fix, "RemoveContainer" got stuck because of the "access is denied" error; after the fix, "RemoveContainer" eventually succeeds after several retries and the pod can proceed to terminate.

@lizhuqi
Author

lizhuqi commented Aug 8, 2021

Closing this issue as the fix is now in the latest releases of fluentd and fluentbit.

@Norfolc
Member

Norfolc commented Dec 6, 2021

We have the same issue with fluentbit 1.8.10 and EKS:

  Kernel Version:             10.0.17763.2300
  OS Image:                   Windows Server 2019 Datacenter
  Operating System:           windows
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.7
  Kubelet Version:            v1.20.10-eks-3bcdcd
  Kube-Proxy Version:         v1.20.10-eks-3bcdcd

@lizhuqi
Author

lizhuqi commented Dec 6, 2021

@Norfolc - do you have a repro? I have an edge case repro for the issue logged at fluent/fluent-bit#3892

@Norfolc
Member

Norfolc commented Dec 7, 2021

@lizhuqi I have a repro:

  • Run a fluentbit/fluent bit logger daemon set
  • Run a deployment with 1 pod that produces a lot of logs:
    kubectl apply -f https://gist.githubusercontent.com/Norfolc/9f81e2c0a525a107b9d207b81d03dee9/raw/ae0868b3b62537c5deb86784955498b092b007c1/windows-server-core.yaml
  • After some time scale down the deployment to zero:
    kubectl scale deploy windows-server-core --replicas 0
  • The pod will be stuck in the Terminating state with the error:
    unable to remove filesystem for [ID]: CreateFile C:\ProgramData\docker\containers\[ID]\[ID]-json.log: Access is denied.
  • Killing fluentbit process mitigates the issue.
