k0s reset hangs #4211
Comments
I tried again with debug enabled:
So it is doing something, just very slowly. Are these delays necessary i.e. could the removal be forced somehow? (e.g. |
From the logs, it looks like your worker ran a lot of pods, each of which took a minute to shut down. This could be an indication of a graceful shutdown timeout. K0s needs to stop all running containers before it can clean up the data directory, because running containers usually still hold active mount points that prevent certain paths from being deleted. There are some ways this could be improved, such as parallelizing pod removal. Forcibly terminating pods could also speed things up, although this is quite destructive, especially if the node is still part of an active cluster, because the pods don't get time to complete their shutdown tasks. |
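To see which mount points are still active below the data directory (the reason the paths cannot be deleted), something along these lines may help. It is a minimal sketch, not k0s code: `mountsUnder` is a hypothetical helper, and `/var/lib/k0s` is assumed to be the data directory; adjust it if yours differs.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// mountsUnder is a hypothetical helper (not part of k0s). It scans text in
// /proc/<pid>/mountinfo format and returns every mount point that is equal
// to dir or located below it.
func mountsUnder(mountinfo, dir string) []string {
	dir = strings.TrimRight(dir, "/")
	var found []string
	sc := bufio.NewScanner(strings.NewReader(mountinfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 5 {
			continue // too short to be a mountinfo line
		}
		mp := fields[4] // the fifth field is the mount point
		if mp == dir || strings.HasPrefix(mp, dir+"/") {
			found = append(found, mp)
		}
	}
	return found
}

func main() {
	data, err := os.ReadFile("/proc/self/mountinfo")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read mountinfo:", err)
		return
	}
	// Assumption: /var/lib/k0s is the k0s data directory on this host.
	for _, mp := range mountsUnder(string(data), "/var/lib/k0s") {
		fmt.Println(mp)
	}
}
```

If this prints anything while the reset appears stuck, those are the mounts that still have to go away before the directory can be removed.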
Thanks for the explanation. I'm coming to k0s from k3s & rke2, which uninstall in less than 30 seconds (on a system with the same previously running workloads), so I was curious to understand the differences. |
I reckon k3s will forcibly kill all the container processes. This is of course faster and arguably a reasonable choice for non-production clusters. However, for a cluster running more valuable workloads, proper pod termination will be a safer bet. Nevertheless, k0s could definitely do things concurrently during reset in order to speed up the process. |
Mine is also hanging, and I have restarted the node and it's still hanging. I don't care about data loss, so what's the easiest way to force reset without reformatting the drive?
|
In the same boat as those above. Would definitely appreciate an easy way to force remove all of k0s when in this state so I can apply it again. At the moment I can neither apply nor reset as both fail. To get k0s running again it looks like I will have to format my drives. Edit: |
Same exact issue as chrischen |
Hmm, I wonder what makes stopping & killing pods so slow in some cases. Just tested with 106 pods:
Looking at the code, the stop signalling is probably not optimal and, as @twz123 said, we could easily parallelise cleaning of the pods. One thing I'm thinking is that if there are pods with long-running shutdown sequences (handling of SIGTERM) and/or stop hooks, that might affect this heavily. Currently I believe the code waits 60 secs for all containers in a pod to stop before |
Looking at some optimizations in the code raises a couple of questions. Should the grace period actually be user-settable? With some sensible 30s default maybe? How would the user apply force? using |
Not sure anymore at all. 😄 I've been testing this with pods that deliberately refuse to shut down properly and I still don't see any major slowness in resetting a node with 100 pods in it. @ianb-mp Do you have any idea if those pods you've had running have some long shutdown hooks/handling in place? |
@chrischen @teldredge Are you still experiencing this? If so, would you mind sharing the console output of One reason for |
@jnummelin unfortunately my test environment has changed significantly and I can't (easily) test the exact same scenario as I had when I first created the ticket. That said, I have done many resets since then and not experienced any significant slowness, so perhaps it's fixed now. |
Seems to be fine now. |
The issue is marked as stale since no activity has been recorded in 30 days |
Before creating an issue, make sure you've checked the following:
Platform
Version
v1.29.2+k0s.0
Sysinfo
`k0s sysinfo`
What happened?
Running `k0s reset` hangs (I've left it for over 15 min).
I tried rebooting the host, however had the same issue when running `k0s stop; k0s reset` after boot.
k0s happily starts & stops, it just won't uninstall.
Steps to reproduce
Not sure how to reproduce this reliably.
Expected behavior
Reset should not hang. At least it should timeout with an error.
Actual behavior
Hangs
Screenshots and logs
No response
Additional context
No response