Pod Evictions (memory, node shutdowns) - requires manual hub restart! #223
I can confirm that we experience this frequently and my solution is always to restart the hub. We are on GKE.
@minrk @betatim @choldgraf this looks suspicious to me.

1. Only the DELETED state

We don't look for an EVICTED state, for example. I can look into exactly what states a pod can be in and which states overlap, because they will overlap; for example, I recall that a pod is considered Pending even while its status shows ContainerCreating. Perhaps we can find one check that covers both DELETED and EVICTED at the same time?

(kubespawner/kubespawner/reflector.py, lines 217 to 219 in 472a662)
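To illustrate the concern, here is a minimal hypothetical sketch (not kubespawner's actual reflector code) of a watch loop that treats an eviction, which arrives as a MODIFIED event on a pod with phase Failed and reason Evicted, the same way it treats a DELETED event:

```python
from kubernetes import client, config, watch

def watch_pods(namespace, pods):
    """Hypothetical reflector-style loop; `pods` is a dict acting as the local cache."""
    config.load_incluster_config()  # use load_kube_config() outside the cluster
    v1 = client.CoreV1Api()
    for event in watch.Watch().stream(v1.list_namespaced_pod, namespace=namespace):
        pod = event["object"]
        if event["type"] == "DELETED":
            # The only case the reflector currently treats as "gone".
            pods.pop(pod.metadata.name, None)
        elif pod.status.phase == "Failed" and pod.status.reason == "Evicted":
            # An eviction arrives as a MODIFIED event, not a DELETED one,
            # so the pod would otherwise linger in the cache.
            pods.pop(pod.metadata.name, None)
        else:
            pods[pod.metadata.name] = pod
```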
2. Abort in finally

The finally clause contains a stop and a break, even though the except clauses have continue etc... Hmmm... UPDATE: oops, okay, hmmm, perhaps not an issue; I jumped the gun.

(kubespawner/kubespawner/reflector.py, lines 244 to 248 in 472a662)
For reference, this is the control flow if you have a try / except / else / finally block:

- Try -> Except -> Finally
- Try -> Else -> Finally
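A small self-contained example of that control flow, for reference (the print calls just mark which branch runs):

```python
def run(raise_error):
    try:
        if raise_error:
            raise ValueError("boom")
    except ValueError:
        print("except")   # Try -> Except -> Finally
    else:
        print("else")     # Try -> Else -> Finally
    finally:
        print("finally")  # runs on both paths

run(True)   # prints: except, finally
run(False)  # prints: else, finally
```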
@minrk helped me narrow down where this needs to be fixed: spawner.py@poll, or the functions called from it, including spawner.py@stop.
Hi, @consideRatio
Hi @kakal! I'm happy to see some activity on this issue, as I think it is very important. I have not done any additional work to resolve it since, but I hope to get back to it.
From check_spawner, kubespawner's poll is invoked, and after it returns, check_spawner will decide to kill the evicted pod.

(kubespawner/kubespawner/spawner.py, lines 1458 to 1494 in 0137336)
What needs to happen is that kubespawner or the hub realizes that pods have become evicted.

Questions raised
The pod reflector - code base investigation

The pod reflector will watch and update itself, and it will remove things from its list if it has seen a DELETED event for them.
A Python 3 memory bomb for debugging

No... This failed. The process was instead killed. Hmmm... I ran into the "out of memory killer" or "OOM killer".

```python
#!/usr/bin/env python
import sys
import time

if len(sys.argv) != 2:
    print("usage: fillmem <number-of-megabytes>")
    sys.exit()

count = int(sys.argv[1])
# a tuple of 128 * 1024 pointer-sized slots is roughly one megabyte
megabyte = (0,) * (1024 * 128)
data = megabyte * count

while True:
    time.sleep(1)
```
For reference, I'm quite sure I'm seeing this too. I don't have anything extra to help with debugging, other than that the database showed ...
I think what happens is that we consider the pod terminated, but it remains in a Failed state. See kubernetes/kubernetes#54525 (comment).

In the poll function there is this section. Note the check for ctr_status: I think that is missing for pods that have entered a Failed or Succeeded state, where all containers have terminated according to the specifications of these pod phases.

(kubespawner/kubespawner/spawner.py, lines 1476 to 1481 in 0137336)
If a pod is evicted due to memory pressure etc., the pod will enter a Failed state, but the key difference from being culled into a Succeeded (?) state is that the pod in the Failed state will remain and not be cleaned up quickly. It will have a pod phase reason of "Evicted", I think. So if we want to solve this surgically, I think we should look for that specific scenario and decide what to do based on it. If we delete the pod we may lose some debugging information, but if we don't, we will fail to spawn new pods, as it blocks the pod's name, I think. Perhaps we should look for this case, log the relevant information, and then delete it?
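As a rough sketch of what that surgical handling could look like (a hypothetical helper, not the actual kubespawner code), assuming we already hold the pod object from the reflector and a CoreV1Api client:

```python
from kubernetes.client import CoreV1Api
from kubernetes.client.rest import ApiException

def handle_if_evicted(v1: CoreV1Api, pod):
    """Log and delete a pod that was evicted (phase Failed, reason Evicted).

    Returns True if the pod was evicted and a delete was issued.
    """
    if pod.status.phase != "Failed" or pod.status.reason != "Evicted":
        return False
    # Keep the debugging information before the pod object disappears.
    print("pod %s evicted: %s" % (pod.metadata.name, pod.status.message))
    try:
        v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)
    except ApiException as e:
        if e.status != 404:  # already gone is fine
            raise
    return True
```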
Below are three pods that will remain after stopping, which I can imagine could cause trouble by not being deleted and by remaining listed in the pod reflector's list.

Observations
- A pod quits and completes (pod phase: Succeeded)
- A pod quits with an error (pod phase: Failed)
- A pod quits by being evicted (pod phase: Failed, reason: Evicted)
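For reference, these observations can be reproduced with kubectl get pods, or with a small script along these lines (the namespace is an assumption):

```python
from kubernetes import client, config

def print_pod_phases(namespace="jhub"):
    """Print name, phase and reason (e.g. Failed / Evicted) for every pod."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    for pod in v1.list_namespaced_pod(namespace).items:
        print(pod.metadata.name, pod.status.phase, pod.status.reason)

print_pod_phases()
```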
Hmmm... But in the initial description of the issue, things remained in...
Which is what happens while the spawner is waiting to see the pod disappear from its resource list...

(kubespawner/kubespawner/spawner.py, lines 1798 to 1807 in 0137336)
But that only happens when...

(kubespawner/kubespawner/reflector.py, lines 229 to 235 in 0137336)
But hmm... Wasn't a delete call made just before this wait started?

(kubespawner/kubespawner/spawner.py, lines 1780 to 1797 in 0137336)
Yes, but it returned 404 as indicated by this log trace:
Why could we end up with a 404 on a pod delete request while the pod remains in the list? Has it been deleted already, but without ever triggering a DELETED V1WatchEvent?
Ah, hmmm, I realize that we can watch for k8s events about pods being Evicted and take action based on that... https://www.bluematador.com/blog/kubernetes-events-explained
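A minimal sketch, assuming the kubernetes Python client, of watching the event stream for Evicted events (the namespace and callback are placeholders):

```python
from kubernetes import client, config, watch

def watch_evictions(namespace, on_evicted):
    """Call on_evicted(pod_name, message) whenever an Evicted event is observed."""
    config.load_incluster_config()
    v1 = client.CoreV1Api()
    stream = watch.Watch().stream(
        v1.list_namespaced_event,
        namespace=namespace,
        field_selector="reason=Evicted",
    )
    for event in stream:
        ev = event["object"]  # a core v1 Event object
        on_evicted(ev.involved_object.name, ev.message)

# e.g. watch_evictions("jhub", lambda name, msg: print(name, msg))
```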
We've been experiencing this issue when a user pod exceeds its resource limit (e.g. consumes >8 GB of RAM). In this scenario, k8s was restarting the user's pod with an ...

We have implemented a couple of updates that don't resolve the underlying issue, but do improve the user experience.
The above items certainly don't resolve the primary issue, but they have helped us reduce the impact on users. As for the root cause, would it be possible to modify the hub to track a unique label for the pod rather than its IP address? A unique label has the upside that it doesn't change, even if the IP address changes because the pod is rescheduled by k8s.
@tritab A similar issue was brought up in #386, which requires a persistent pod identifier for SSL certificates. The solution in that PR was to create a service. Presumably that should also solve your problem?
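For illustration, a per-user Service with a stable name could be created roughly like this (a hypothetical sketch; the label key, port, and naming are assumptions, not kubespawner's actual implementation):

```python
from kubernetes import client

def create_user_service(v1: client.CoreV1Api, namespace, username, port=8888):
    """Create a stably named Service that follows the user's pod by label, not by IP."""
    svc = client.V1Service(
        metadata=client.V1ObjectMeta(name="jupyter-%s" % username),
        spec=client.V1ServiceSpec(
            # Selecting by label means the route keeps working even if the pod
            # is rescheduled and gets a new IP address.
            selector={"hub.jupyter.org/username": username},
            ports=[client.V1ServicePort(port=port, target_port=port)],
        ),
    )
    return v1.create_namespaced_service(namespace=namespace, body=svc)
```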
This issue has been mentioned on Jupyter Community Forum. There might be relevant details there: https://discourse.jupyter.org/t/weve-been-seeing-503s-occasionally-for-user-servers/3824/2
Yes, in particular, the contents of the attached log.
I think the current status is that the hub won't require a manual restart, but that it will require some time before its ...
This is the situation, and the issue is related to avoiding manual restarts specifically. Let's mark this as closed.
About
We have an issue when pods are removed unexpectedly that seems to require a hub restart. This could happen if a node fails, is preempted, or an admin deletes a user pod, I think. This is really bad, I'd say, as it requires the specific user that has this problem to contact an administrator, and then the administrator has to restart the hub to solve it.
I deem this vital to fix, as preemptible nodes (Google) and spot nodes (Amazon) are an awesome way to reduce cost, but using them risks causing this kind of huge trouble right now!
I'm still lacking some insight into what's going on in the proxy and the user management within JupyterHub though, so I'm hoping someone can pick this issue up from my writeup. @minrk or @yuvipanda, perhaps you know someone to /cc, or could give me a pointer on where to look to solve this?
Experience summary

- hub/admin
- stop pod

Experience log

1. A preemptible node was reclaimed and my user pod was lost
   - erik-2esundell
   - erik.sundell
2. Later - I visit the hub
3. Directly after - I visit hub/admin and press the stop server button in order to press start server again. If I press start server, this would show...
4. I revisit my singleuser server
What I see

What the hub log says after a while
5. I look again
This is now what I see

I think it is because of excessive refreshing of the pending-stop page, once for every log message entry in the point written above.
6. I refresh
This is what I see

This is what the proxy (chp) logs say - a route was added, but it is unreachable
7. I restart the hub
8. I log in again and everything works