fix status check showing unhealthy pods from prev iteration #6370
Conversation
Force-pushed from 446f33b to 5e88d45
Codecov Report
@@ Coverage Diff @@
## main #6370 +/- ##
==========================================
- Coverage 70.38% 70.37% -0.02%
==========================================
Files 499 499
Lines 22722 22731 +9
==========================================
+ Hits 15994 15997 +3
- Misses 5685 5691 +6
Partials 1043 1043
Continue to review full report at Codecov.
// List only the deployments created by this run, via the run-id label selector.
deps, err := client.AppsV1().Deployments(ns).List(ctx, metav1.ListOptions{
	LabelSelector: l.RunIDSelector(),
})
if err != nil {
	return nil, fmt.Errorf("could not fetch deployments: %w", err)
}

// Collect the names of previous-iteration pods to exclude from status reporting.
var ignore []string
for k := range prevPods {
nit: Are `prevPods` only the pods that did not come up successfully in the previous iteration? Would `unhealthyPods` possibly be more descriptive/accurate here? I would think only unhealthy pods are added to the ignore list.
From this I'm a bit confused: is the idea to ignore all pods from the previous iteration, or only the unhealthy pods from the previous iteration (as per the title)? Are they the same thing? (This wouldn't be obvious to me, since some pods from the previous iteration might have been healthy. Is the idea to not report from the status check until all are healthy?)
Previous pods would just be any pods from the previous iteration. This change is mostly so we ignore them in the event API. CLI users shouldn't see any difference, as we don't normally print updates for pods anyway; however, we do send out events for them. So this will help IDEs not report pod failures when we tear down the old pods.
Nope, they could be unhealthy or healthy pods. The use case we have in mind is this: as we tear down pods from the previous iteration due to `kubectl apply`, events (healthy and unhealthy) related to those pods are shown, since they have the same run-id.
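For illustration only, a minimal sketch of the ignore-list idea; the names `prevPods` and `shouldEmitPodEvent` are hypothetical and not Skaffold's actual event API:

```go
package main

import "fmt"

// prevPods holds the names of pods seen in the previous iteration (healthy or
// unhealthy). Events for these pods are suppressed while the old pods are
// torn down, so IDEs don't report spurious failures. The names are made up.
var prevPods = map[string]bool{
	"frontend-6d9f7c-old": true,
}

// shouldEmitPodEvent is a hypothetical filter applied before a pod status
// event is sent out over the event API.
func shouldEmitPodEvent(podName string) bool {
	return !prevPods[podName]
}

func main() {
	for _, p := range []string{"frontend-6d9f7c-old", "frontend-84bc5d-new"} {
		fmt.Printf("emit event for %s: %v\n", p, shouldEmitPodEvent(p))
	}
}
```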
LGTM
LGTM, and it seems good when testing locally; however, maybe let's hold off on merging until IDEs have tested with it. Will update with their response.
@etanshaul said he can do some testing from this branch, will wait and see if he runs into anything
Looks like this fixed it! Before (second iteration): notice the 2 frontend pods, presumably one of them from the previous iteration. Thanks @tejal29 @MarlonGamez
The double Update: this seems consistent on this branch. Clicking on the "other" frontend container under App Logs I see:
In addition to the above, I also noticed another oddity: the set of deploy status sub-tasks we get from this branch is inconsistent across iterations (even when changing the exact same service). It looks like the original issue of displaying statuses from previous iterations is gone, but would you expect something like this to happen?
For iteration 6 did you only change
It shouldn't be a huge code change to fetch the pods. Do you think it would make things easier? Thanks
In both cases, I only changed the
This might explain it. I can make the same change (in the same file) over and over again and each iteration I may see a different set of status nodes. Is your suggestion to always fetch status for pods regardless of whether or not the status check finishes in 1 sec? (e.g. poll immediately or something)
Yes, we will fetch pods for a deployment along with their statuses even if it stabilized the first time we check deployment status. Right now, what we do is
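A rough sketch, using client-go, of what always fetching pod statuses for a deployment could look like; the namespace and label selector below are placeholders, not the code in this PR:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// List the deployment's pods via its label selector even when the
	// deployment already reports as stabilized, so every iteration surfaces
	// the same set of pod status sub-tasks. "app=frontend" is a placeholder.
	pods, err := client.CoreV1().Pods("default").List(context.Background(), metav1.ListOptions{
		LabelSelector: "app=frontend",
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		fmt.Printf("pod %s phase=%s\n", p.Name, p.Status.Phase)
	}
}
```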
Got it! It seems to me to make sense to also fetch pod status like you do for the deployment. Is there a particular reason you chose not to originally?
No reason. If the deployment is stabilized, there is no need to diagnose the pods. The design decision was to help users surface issues in deployments.
Let me give this a spin on this branch. We can test the change today or Monday
That's true. For the sake of our UI though I think it's less confusing if the behavior appears consistent.
Great
Force-pushed from 19a99a0 to ce0e102
@tejal29 let me know if/when you'd like me to give this another test on my end
Force-pushed from 17c8946 to 737bc3a
@etanshaul Even after #6399, I see the issue you mentioned about 2 frontend nodes in the stream log. The reason is that the deleted container is still tracked.
Dumb question - can't we just exclude deleted containers from being emitted from the eventing?
Yeah, we don't know which containers are going to get deleted beforehand. Keeping this in sync after the deploy happens is a large-impact change and might need some re-work and re-thinking. The right thing to do, I feel, is to stop and start the logger instead of mute/unmute.
fixes https://github.com/GoogleCloudPlatform/cloud-code-intellij-internal/issues/4323
In this PR

A `Deployment` manages `ReplicaSets`. When a deployment is updated, a new `ReplicaSet` object is created. A `ReplicaSet`'s purpose is to maintain a stable set of replica Pods running at any given time. Pods are controlled/owned by a `ReplicaSet`: when a deployment is updated, a new ReplicaSet is created which controls the pods.
This is evident by running `kubectl describe`:
- Make a change to frontend code.
- In `Pod.Metadata.OwnerReference`, pods from the previous iteration have a different ReplicaSet id.
This approach is correct because, if a deployment is not updated in a subsequent iteration, the k8s scheduler does not spin up new pods for it; the previous iteration's pods keep running.
Previously we were ignoring the status check for previously seen pods. As a result, no events were propagated for these pods. See testing notes -> #6370 (comment)
With the updated logic, we rely on
If a deployment is updated and pods from the previous iteration still exist, the diagnose step will filter out the pods from the previous iteration, as they don't belong to the updated ReplicaSet.
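A hedged sketch of this filtering idea via `Pod.Metadata.OwnerReference`; the helper name and the assumption that the current ReplicaSet's UID has been resolved elsewhere are mine, not the PR's exact code:

```go
package diag

import corev1 "k8s.io/api/core/v1"

// podsOwnedByCurrentRS keeps only the pods whose owning ReplicaSet UID matches
// the deployment's current ReplicaSet. Pods left over from a previous
// iteration point at an older ReplicaSet and are filtered out.
func podsOwnedByCurrentRS(pods []corev1.Pod, currentRSUID string) []corev1.Pod {
	var owned []corev1.Pod
	for _, p := range pods {
		for _, ref := range p.OwnerReferences {
			if ref.Kind == "ReplicaSet" && string(ref.UID) == currentRSUID {
				owned = append(owned, p)
				break
			}
		}
	}
	return owned
}
```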
Prev approach
In this PR
- [ ] Add `ignore` field to pod Validator to ignore pods matching a filter in a given namespace
- [ ] Add `prevPods` field to `kubernetes.StatusMonitor` which will collect all pods from the previous iteration
- [ ] Update `prevPods` in the `status.Monitor.Reset` method