
Pods stuck "Terminating" on Windows nodes #106

Closed
lizhuqi opened this issue Mar 30, 2021 · 45 comments · Fixed by Azure/aks-engine#4556
Labels
bug (Something isn't working) · Windows on AKS (Deploying Windows Container using AKS) · Windows on Kubernetes (Windows Containers using Kubernetes)

Comments

@lizhuqi

lizhuqi commented Mar 30, 2021

We're seeing pods that fail to terminate, and we have to manually run "kubectl delete --force" to force-terminate them.

See the error message:
RemoveContainer "[ID]" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "[ID]": Error response from daemon: unable to remove filesystem for [ID]: CreateFile C:\ProgramData\docker\containers[ID][ID]-json.log.4: Access is denied.

A similar issue was reported in AKS #1765 and fixed by an upstream PR available in 1.17.12+. However, the issue still reproduces in 1.17.15.
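
For reference, the force-delete we have to run typically looks like this (pod name and namespace are placeholders):

kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force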

@ghost ghost added the triage (New and needs attention) label Mar 30, 2021
@lizhuqi
Author

lizhuqi commented Mar 30, 2021

If you manually remove the json.log file, the container can be removed and the pod can then be terminated.
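
For illustration, the manual cleanup on the node looks roughly like this (the container ID is a placeholder; the explicit docker rm may not be needed if kubelet retries the removal on its own):

Remove-Item -Force "C:\ProgramData\docker\containers\<container-id>\<container-id>-json.log"
docker rm <container-id>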

@vrapolinario vrapolinario added the Windows on AKS (Deploying Windows Container using AKS) and bug (Something isn't working) labels and removed the triage (New and needs attention) label Mar 31, 2021
@vrapolinario
Contributor

@lizhuqi can you please confirm whether the issue you're seeing is on AKS or outside of AKS? CC: @keikhara for visibility.

@lizhuqi
Author

lizhuqi commented Apr 1, 2021

It is seen in GKE.

@vrapolinario vrapolinario added the Windows on Kubernetes (Windows Containers using Kubernetes) label Apr 2, 2021
@vrapolinario
Contributor

Got it. I've added the Windows on Kubernetes label. @immuzz or @keikhara, can you please confirm whether this issue can be reproduced on AKS after the fix described above? If not, it might be something specific to the GKE platform.

@lizhuqi
Author

lizhuqi commented Apr 2, 2021

We took another look. It seems the failure was caused by our logging agent handling the symlinked log file, and symlink commands result in an access-denied error in Windows containers. So I think the issue I reported here is a duplicate of #97.

@lizhuqi
Author

lizhuqi commented Apr 15, 2021

More logs for this issue.

2021-04-15T23:06:15Z Handler for DELETE /v1.40/containers/[CONTAINER-ID] returned error: unable to remove filesystem for [CONTAINER-ID]: CreateFile C:\ProgramData\docker\containers\[CONTAINER-ID]\[CONTAINER-ID]-json.log: Access is denied.

2021-04-15T23:06:15.472679Z SyncLoop (PLEG): "[POD-NAME]_NAME", event: &pleg.PodLifecycleEvent{ID:"[ID]", Type:"ContainerDied", Data:"[DATA]"}

2021-04-15T23:06:15.472679Z [fake cpumanager] RemoveContainer (container id: [CONTAINER-ID])
2021-04-15T23:06:15.499033Z RemoveContainer "[CONTAINER-ID]" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "[CONTAINER-ID]": Error response from daemon: unable to remove filesystem for [CONTAINER-ID]: CreateFile C:\ProgramData\docker\containers\[CONTAINER-ID]\[CONTAINER-ID]-json.log: Access is denied.

2021-04-15T23:06:16.024224Z detected rotation of /var/log/containers/[POD-NAME]_[NAMESPACE-NAME]_[CONTAINER-NAME]-[CONTAINER-ID].log; waiting 5 seconds

@lizhuqi
Author

lizhuqi commented Apr 20, 2021

Any update on this issue?

@lizhuqi
Author

lizhuqi commented Apr 22, 2021

Below is the docker version used:
Container Runtime Version: docker://19.3.13

@jsturtevant

What OS version is your host? Can you run cmd /c ver on the host?

@lizhuqi
Author

lizhuqi commented Apr 27, 2021

It reproduced on 10.0.17763.1577, but the issue is gone after upgrading to 10.0.17763.1757.

Yeah, we can close the bug now.

@lizhuqi lizhuqi closed this as completed Apr 27, 2021
@lizhuqi lizhuqi reopened this May 7, 2021
@lizhuqi
Author

lizhuqi commented May 7, 2021

Sad to reopen this issue, but we're seeing it happen again in our customer's cluster. This time the cluster has both the kubelet fix and the OS fix.

Kernel version: 10.0.17763.1757
OS image: Windows Server 2019 Datacenter
Container runtime version: docker://19.3.14
kubelet version: v1.17.17-gke.3700
kube-proxy version: v1.17.17-gke.3700

@jsturtevant

Does this reproduce in later versions of Kubernetes?

@lizhuqi
Author

lizhuqi commented May 12, 2021

I saw a related issue in moby: moby/moby#39274

The error also seems to cause our logging agent to fail to read the container logs.

@lizhuqi
Author

lizhuqi commented May 12, 2021

Yes, this also reproduces in 1.19.

See the versions for the latest repro in our internal test:
VERSION OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
v1.19.7-gke.2400 Windows Server 2019 Standard 10.0.17763.1282 docker://19.3.2

@jeremyje

jeremyje commented May 12, 2021

kubectl --kubeconfig user-kubeconfig.yaml get nodes -o wide
NAME            STATUS   ROLES    AGE   VERSION            INTERNAL-IP    EXTERNAL-IP    OS-IMAGE                       KERNEL-VERSION     CONTAINER-RUNTIME
ddff-1500fd41   Ready    <none>   43h   v1.19.7-gke.2400   21.0.253.65    21.0.253.65    Windows Server 2019 Standard   10.0.17763.1282    docker://19.3.2
ddff-1500fdc6   Ready    <none>   43h   v1.19.7-gke.2400   21.0.253.198   21.0.253.198   Windows Server 2019 Standard   10.0.17763.1282    docker://19.3.2
ddff-1500fe8a   Ready    <none>   43h   v1.19.7-gke.2400   21.0.254.138   21.0.254.138   Ubuntu 18.04.5 LTS             5.4.0-1014-gkeop   docker://19.3.2
ddff-1500feb3   Ready    <none>   43h   v1.19.7-gke.2400   21.0.254.179   21.0.254.179   Ubuntu 18.04.5 LTS             5.4.0-1014-gkeop   docker://19.3.2
ddff-1500ff09   Ready    <none>   43h   v1.19.7-gke.2400   21.0.255.9     21.0.255.9     Ubuntu 18.04.5 LTS             5.4.0-1014-gkeop   docker://19.3.2
apiVersion: apps/v1
kind: Deployment
metadata:
  name: windows-log-test
  labels:
    app: windows-log-test
spec:
  selector:
    matchLabels:
      app: windows-log-test
  replicas: 2
  template:
    metadata:
      labels:
        app: windows-log-test
    spec:
      containers:
      - name: winloggertest
        image: gcr.io/gke-release/pause-win:1.6.1
        command:
        - powershell.exe
        - -command
        - Write-Host('log some message...')
      nodeSelector:
        kubernetes.io/os: windows
      tolerations:
      - key: "node.kubernetes.io/os"
        operator: "Equal"
        value: "windows"
        effect: "NoSchedule"

In the kubelet logs you'll see:

E0512 14:19:32.145807    4684 docker_sandbox.go:730] ResolvConfPath is empty.
E0512 14:19:50.340739    4684 docker_stats_windows.go:42] Error opening container (stats will be missing) 'd8c1850ca559b976681d11fda2518d837389abff8bfe7a910bede5740c26a9b8': hcsshim::System::Properties d8c1850ca559b976681d11fda2518d837389abff8bfe7a910bede5740c26a9b8: The requested virtual machine or container operation is not valid in the current state.
E0512 14:19:51.321091    4684 pod_workers.go:191] Error syncing pod 91f0a6e0-34e2-4bcc-a0e1-4e2eb1ace3d7 ("windows-log-test-6c9f887d7d-lpk8x_default(91f0a6e0-34e2-4bcc-a0e1-4e2eb1ace3d7)"), skipping: failed to "StartContainer" for "winloggertest" with CrashLoopBackOff: "back-off 10s restarting failed container=winloggertest pod=windows-log-test-6c9f887d7d-lpk8x_default(91f0a6e0-34e2-4bcc-a0e1-4e2eb1ace3d7)"
E0512 14:19:52.687370    4684 pod_workers.go:191] Error syncing pod 91f0a6e0-34e2-4bcc-a0e1-4e2eb1ace3d7 ("windows-log-test-6c9f887d7d-lpk8x_default(91f0a6e0-34e2-4bcc-a0e1-4e2eb1ace3d7)"), skipping: failed to "StartContainer" for "winloggertest" with CrashLoopBackOff: "back-off 10s restarting failed container=winloggertest pod=windows-log-test-6c9f887d7d-lpk8x_default(91f0a6e0-34e2-4bcc-a0e1-4e2eb1ace3d7)"
E0512 14:20:21.101782    4684 pod_workers.go:191] Error syncing pod 91f0a6e0-34e2-4bcc-a0e1-4e2eb1ace3d7 ("windows-log-test-6c9f887d7d-lpk8x_default(91f0a6e0-34e2-4bcc-a0e1-4e2eb1ace3d7)"), skipping: failed to "StartContainer" for "winloggertest" with CrashLoopBackOff: "back-off 20s restarting failed container=winloggertest pod=windows-log-test-6c9f887d7d-lpk8x_default(91f0a6e0-34e2-4bcc-a0e1-4e2eb1ace3d7)"
E0512 14:20:21.116783    4684 remote_runtime.go:291] RemoveContainer "d8c1850ca559b976681d11fda2518d837389abff8bfe7a910bede5740c26a9b8" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "d8c1850ca559b976681d11fda2518d837389abff8bfe7a910bede5740c26a9b8": Error response from daemon: unable to remove filesystem for d8c1850ca559b976681d11fda2518d837389abff8bfe7a910bede5740c26a9b8: CreateFile C:\ProgramData\docker\containers\d8c1850ca559b976681d11fda2518d837389abff8bfe7a910bede5740c26a9b8\d8c1850ca559b976681d11fda2518d837389abff8bfe7a910bede5740c26a9b8-json.log: Access is denied.

After a few moments you'll see this:

kubectl get pods -o wide
NAME                               READY   STATUS             RESTARTS   AGE     IP          NODE              NOMINATED NODE   READINESS GATES
windows-log-test-55f64676b-p7svf   0/1     CrashLoopBackOff   3          2m19s   10.96.4.4   gke-2c9026-3hxz   <none>           <none>
windows-log-test-55f64676b-qdw5x   0/1     Completed          4          2m19s   10.96.2.4   gke-2c9026-zskg   <none>           <none>
kubectl get nodes -o wide
NAME                               STATUS   ROLES    AGE     VERSION            INTERNAL-IP    EXTERNAL-IP       OS-IMAGE                             KERNEL-VERSION    CONTAINER-RUNTIME
gke-2c9026-0k31                    Ready    <none>   9m2s    v1.18.17-gke.100   10.138.0.223   104.196.253.239   Windows Server 2019 Datacenter       10.0.17763.1817   docker://19.3.14
gke-2c9026-3hxz                    Ready    <none>   8m54s   v1.18.17-gke.100   10.138.0.216   35.197.104.170    Windows Server 2019 Datacenter       10.0.17763.1817   docker://19.3.14
gke-2c9026-zskg                    Ready    <none>   9m1s    v1.18.17-gke.100   10.138.0.55    34.82.111.231     Windows Server 2019 Datacenter       10.0.17763.1817   docker://19.3.14
gke-r-default-pool-fffa6112-47jp   Ready    <none>   11m     v1.18.17-gke.100   10.138.0.204   34.83.230.3       Container-Optimized OS from Google   5.4.89+           docker://19.3.14
gke-r-default-pool-fffa6112-g4ss   Ready    <none>   11m     v1.18.17-gke.100   10.138.0.18    35.199.153.15     Container-Optimized OS from Google   5.4.89+           docker://19.3.14
gke-r-default-pool-fffa6112-v728   Ready    <none>   11m     v1.18.17-gke.100   10.138.0.49    34.82.255.36      Container-Optimized OS from Google   5.4.89+           docker://19.3.1

@jeremyje

This reproduces in 1.20.5 with Docker 19.3.14. @jsturtevant

@jsturtevant

I've been able to reproduce this with @jeremyje's deployment spec in Azure with 1.19 and Docker 19.3.14:

kubelet.err.log:469:E0512 23:05:55.112008    4324 remote_runtime.go:291] RemoveContainer "b4105aac93cc83f5bbaff5450e4d6c5dec0c4ccdfa5f8c27a19198b58e9afd90" from runtime service failed: rpc error: code = Unknown desc = failed to remove container "b4105aac93cc83f5bbaff5450e4d6c5dec0c4ccdfa5f8c27a19198b58e9afd90": Error response from daemon: unable to remove filesystem for b4105aac93cc83f5bbaff5450e4d6c5dec0c4ccdfa5f8c27a19198b58e9afd90: CreateFile C:\ProgramData\docker\containers\b4105aac93cc83f5bbaff5450e4d6c5dec0c4ccdfa5f8c27a19198b58e9afd90\b4105aac93cc83f5bbaff5450e4d6c5dec0c4ccdfa5f8c27a19198b58e9afd90-json.log: Access is denied.

@chenlonglab198

I can reproduce this with AKS 1.20.2 / Node Image Version "AKSWindows-2019-17763.1817.210310".
Is there any update on this issue?

Thanks in Advance!

@immuzz

immuzz commented May 25, 2021

@kevpar do you have any insights here?

@kevpar
Member

kevpar commented Jun 1, 2021

It sounds like this issue is about getting "Access is denied." when opening a path involving a symlink to a directory. Is that correct? Can you share the symlink path and its target?

The only thing that comes to mind is that if we are trying to open a symlink pointing to a directory, and the symlink was created without the SYMBOLIC_LINK_FLAG_DIRECTORY flag, then it will fail with "Access is denied.". If that is the case, the issue can be resolved by passing this flag during symlink creation. If using Go's os.Symlink, the target path needs to exist (as a directory) before the symlink is created.
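
A minimal Go sketch of the behavior described above, with hypothetical paths: on Windows, os.Symlink only sets SYMBOLIC_LINK_FLAG_DIRECTORY when the target already exists as a directory, so the target should be created before the link.

package main

import (
    "log"
    "os"
)

func main() {
    // Hypothetical paths, for illustration only.
    target := `C:\ProgramData\docker\containers\example`
    link := `C:\var\log\containers\example`

    // Create the target directory first. If the target does not exist yet,
    // os.Symlink produces a file symlink (no SYMBOLIC_LINK_FLAG_DIRECTORY),
    // and opening it as a directory later fails with "Access is denied."
    if err := os.MkdirAll(target, 0o755); err != nil {
        log.Fatal(err)
    }
    if err := os.Symlink(target, link); err != nil {
        log.Fatal(err)
    }
}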

@lizhuqi
Author

lizhuqi commented Jun 3, 2021

The file is C:\ProgramData\docker\containers\[ID]\[ID]-json.log

So it seems to be a log file created by the Docker daemon.

@immuzz

immuzz commented Jun 9, 2021

@lizhuqi Did you try the above workaround @kevpar recommended? Also have you tried this with containerd?

@lizhuqi
Author

lizhuqi commented Jun 10, 2021

@kevpar can you provide more details about the workaround, like the steps to test it?

@kevpar
Member

kevpar commented Jun 11, 2021

I don't know if there is a viable workaround here. I was asking about symlinks because of the earlier comment that this issue seemed like a duplicate of #97. Reading through this again, though, it seems like the issue is internal to Docker: It is failing to delete the container log file at C:\ProgramData\docker\containers\<id>\<id>-json.log.

The symlink is /var/log/containers/[POD-NAME]_[NAMESPACE-NAME]_[CONTAINER-NAME]-[CONTAINER-ID].log and its target is C:\ProgramData\docker\containers\[ID]\[ID]-json.log

Assuming my understanding above is correct, how is this symlink in /var/log related to the Docker error?

It's not clear what could cause Docker to get an access denied error when deleting the container log file. If it's due to something else holding the file open without sharing, I would expect the error to be ERROR_SHARING_VIOLATION rather than ERROR_ACCESS_DENIED.
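
As a rough illustration of that distinction (a sketch assuming golang.org/x/sys/windows is available; the path is hypothetical): a reader holding the file open without FILE_SHARE_DELETE normally makes deletion fail with a sharing violation rather than access denied.

package main

import (
    "fmt"
    "os"

    "golang.org/x/sys/windows"
)

func main() {
    // Hypothetical log file, for illustration only.
    path := `C:\temp\example-json.log`
    if err := os.WriteFile(path, []byte("log line\n"), 0o644); err != nil {
        panic(err)
    }

    // Simulate a log agent tailing the file: open it for reading
    // without FILE_SHARE_DELETE.
    name, _ := windows.UTF16PtrFromString(path)
    h, err := windows.CreateFile(name, windows.GENERIC_READ,
        windows.FILE_SHARE_READ, nil, windows.OPEN_EXISTING,
        windows.FILE_ATTRIBUTE_NORMAL, 0)
    if err != nil {
        panic(err)
    }
    defer windows.CloseHandle(h)

    // While that handle is held, deleting the file is expected to fail
    // with ERROR_SHARING_VIOLATION, not ERROR_ACCESS_DENIED.
    if err := os.Remove(path); err != nil {
        fmt.Println("remove failed:", err)
    }
}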

@lizhuqi
Author

lizhuqi commented Jun 11, 2021

Maybe the previous fix doesn't fully make the issue described in this comment go away?

@Random-Liu

Is this related? moby/moby@2502db6

Will that fix the issue? That fix is only in Docker 20.10

@lizhuqi
Author

lizhuqi commented Jun 16, 2021

I will check whether this issue reproduces with Docker 20.10.

@lizhuqi
Author

lizhuqi commented Jun 16, 2021

Confirmed the issue still reproduces for v1.20.7 with Container Runtime Version: docker://20.10.4

@lizhuqi
Author

lizhuqi commented Jun 16, 2021

@kevpar and @immuzz - can you take another look? I am happy to try out if there is any workaround.

@Random-Liu

Random-Liu commented Jun 16, 2021

Confirmed the issue still reproduces for v1.20.7 with Container Runtime Version: docker://20.10.4

If the issue still reproduces, then as far as I can tell Docker just calls os.RemoveAll and gets this error, even after the fix moby/moby@2502db6.

This looks like either a Go implementation issue or an underlying Windows issue (bad error reporting?).

In our case, the container directory is actually already gone (at least not visible) on the node when we get this error.

@lizhuqi
Author

lizhuqi commented Jun 24, 2021

@kevpar and @immuzz - can you take another look?

@lizhuqi
Author

lizhuqi commented Jul 1, 2021

Confirmed the issue reproduces for v1.20.6-gke.1000 but not for 1.21.1-gke.1800 or 1.21.1-gke.2200. The main difference between those versions is the Docker runtime: v1.20.6-gke.1000 has docker://19.3.14, while 1.21.1-gke.1800 and 1.21.1-gke.2200 have docker://20.10.4. The issue also doesn't reproduce with containerd. Lantao's guess was right; my previous comment about reproducing on v1.20.7 was wrong.

@lizhuqi
Author

lizhuqi commented Jul 1, 2021

Closing this as the issue is fixed in moby/moby@2502db6.

Anyone who experiences this should upgrade to docker://20.10.4 or switch to containerd.

@lizhuqi
Author

lizhuqi commented Jul 15, 2021

The issue came back. I think last time I tried to reproduce it I didn't start and restart pods frequently enough.

I will try switching to 20.10.6.

@lizhuqi lizhuqi reopened this Jul 15, 2021
@lizhuqi
Author

lizhuqi commented Jul 19, 2021

We have opened an MS support case for this issue and plan to have it transferred to Mirantis support so the Mirantis team can take a look.

@lizhuqi
Author

lizhuqi commented Jul 28, 2021

If we disable the logging agent (fluentd or fluentbit) on the node, the "Access is denied" error goes away. So the logging agent might be the one holding the container log file handle.
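
One way to confirm which process holds the log file open is the Sysinternals Handle tool, assuming it is present on the node (the container ID is a placeholder):

handle.exe <container-id>-json.log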

@lizhuqi
Author

lizhuqi commented Jul 30, 2021

Both fluentd and fluentbit have fixes for this issue.

The fluentbit fix is in version 1.5+. Some edge cases may still trigger the access-is-denied error, but it happens much less often with the fix, and a restart of fluentbit unsticks the pods stuck in Terminating.
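
For the restart mentioned above, something like this should work, assuming fluentbit runs as a DaemonSet (namespace and name are placeholders):

kubectl -n <logging-namespace> rollout restart daemonset <fluentbit-daemonset>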

The fluentd fix has just been merged and will be in its next release.

@lizhuqi
Author

lizhuqi commented Aug 4, 2021

I tested fluentd v1.13.3, which is the latest fluentd release and has the fix we need. Things are much improved: before the fix, "RemoveContainer" got stuck because of the "access is denied" error; after the fix, "RemoveContainer" eventually succeeds after several retries and the pod can proceed to terminate.

@lizhuqi
Author

lizhuqi commented Aug 8, 2021

Closing this issue as the fix is now in the latest releases of fluentd and fluentbit.

@Norfolc
Member

Norfolc commented Dec 6, 2021

We have the same issue with fluentbit 1.8.10 and EKS:

  Kernel Version:             10.0.17763.2300
  OS Image:                   Windows Server 2019 Datacenter
  Operating System:           windows
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.7
  Kubelet Version:            v1.20.10-eks-3bcdcd
  Kube-Proxy Version:         v1.20.10-eks-3bcdcd

@lizhuqi
Author

lizhuqi commented Dec 6, 2021

@Norfolc - do you have a repro? I have an edge case repro for the issue logged at fluent/fluent-bit#3892

@Norfolc
Member

Norfolc commented Dec 7, 2021

@lizhuqi I have a repro:

  • Run a fluentbit/fluent bit logger daemon set
  • Run a deployment with 1 pod that produces a lot of logs:
    kubectl apply -f https://gist.githubusercontent.com/Norfolc/9f81e2c0a525a107b9d207b81d03dee9/raw/ae0868b3b62537c5deb86784955498b092b007c1/windows-server-core.yaml
  • After some time scale down the deployment to zero:
    kubectl scale deploy windows-server-core --replicas 0
  • The pod will be stuck in the Terminating state with the error:
    unable to remove filesystem for [ID]: CreateFile C:\ProgramData\docker\containers\[ID]\[ID]-json.log: Access is denied.
  • Killing fluentbit process mitigates the issue.
