
fix status check showing unhealthy pods from prev iteration #6370

Merged
merged 3 commits into main from fix_prev_it_status_check on Aug 10, 2021

Conversation

tejal29
Contributor

@tejal29 tejal29 commented Aug 5, 2021

fixes https://github.com/GoogleCloudPlatform/cloud-code-intellij-internal/issues/4323

In this PR

  1. Use owner references to query the pods for a deployment.
    A Deployment manages ReplicaSets. When a deployment is updated, a new ReplicaSet object is created.
    A ReplicaSet's purpose is to maintain a stable set of replica Pods running at any given time.
    Pods are controlled/owned by the ReplicaSet.

When a deployment is updated, a new ReplicaSet is created, which controls the new pods.
This is evident from running kubectl describe:

➜  go-guestbook git:(master) ✗ kubectl describe deployment/go-guestbook-frontend
Name:                   go-guestbook-frontend
...
OldReplicaSets:  <none>
NewReplicaSet:   go-guestbook-frontend-869d5b5fb4 (1/1 replicas created)

Make a change to the frontend code:

➜  go-guestbook git:(master) ✗ kubectl describe deployment/go-guestbook-frontend
...
OldReplicaSets:  <none>
NewReplicaSet:   go-guestbook-frontend-5c8664b8cb (1/1 replicas created)
This ReplicaSet is used to manage the pods, and it appears in Pod.Metadata.OwnerReferences (shown as Controlled By below):

➜  go-guestbook git:(master) ✗ kubectl describe po/go-guestbook-frontend-869d5b5fb4-nsljc
Name:         go-guestbook-frontend-869d5b5fb4-nsljc
Labels:       app=guestbook
              app.kubernetes.io/managed-by=skaffold
              pod-template-hash=869d5b5fb4
              skaffold.dev/run-id=afc3bb01-54bf-4331-b854-3f9591a57ae0
              tier=frontend
Controlled By:  ReplicaSet/go-guestbook-frontend-869d5b5fb4

Pods from the previous iteration have a different ReplicaSet ID.

This approach is also correct when a deployment is not updated in a subsequent iteration: in that case the Kubernetes scheduler does not spin up new pods for it, and the pods from the previous iteration keep running.

Previously we were ignoring the status check for previously seen pods. As a result, no events were propagated for these pods. See testing notes -> #6370 (comment)

With the updated logic, we:

  1. Fetch the current ReplicaSet for the deployment.
  2. Fetch the pods belonging to that ReplicaSet.

If a deployment is updated and pods from the previous iteration still exist, the diagnose step filters out the pods from the previous iteration, since they don't belong to the updated ReplicaSet (see the sketch below).
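
A minimal sketch, assuming a typed client-go clientset, of how pods for the current iteration can be selected via owner references. The helper names `currentReplicaSetFor` and `podsOwnedBy` are illustrative only, not the actual skaffold functions:

```go
package status

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// currentReplicaSetFor returns the ReplicaSet controlled by dep, approximating
// "current" by picking the newest owned ReplicaSet.
func currentReplicaSetFor(ctx context.Context, client kubernetes.Interface, dep *appsv1.Deployment) (*appsv1.ReplicaSet, error) {
	rsList, err := client.AppsV1().ReplicaSets(dep.Namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, fmt.Errorf("listing replicasets: %w", err)
	}
	var newest *appsv1.ReplicaSet
	for i := range rsList.Items {
		rs := &rsList.Items[i]
		if !metav1.IsControlledBy(rs, dep) {
			continue
		}
		if newest == nil || rs.CreationTimestamp.After(newest.CreationTimestamp.Time) {
			newest = rs
		}
	}
	return newest, nil
}

// podsOwnedBy keeps only the pods whose OwnerReferences point at rs, which
// drops pods left over from a previous iteration (they are owned by an older
// ReplicaSet).
func podsOwnedBy(ctx context.Context, client kubernetes.Interface, rs *appsv1.ReplicaSet) ([]corev1.Pod, error) {
	podList, err := client.CoreV1().Pods(rs.Namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, fmt.Errorf("listing pods: %w", err)
	}
	var owned []corev1.Pod
	for _, p := range podList.Items {
		if metav1.IsControlledBy(&p, rs) {
			owned = append(owned, p)
		}
	}
	return owned, nil
}
```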

Prev approach in this PR (superseded; sketched below):

- [ ] Add `ignore` field to the pod Validator to ignore pods matching a filter in a given namespace
- [ ] Add `prevPods` field to `kubernetes.StatusMonitor` which will collect all pods from the previous iteration
- [ ] Update the `prevPods` in the `status.Monitor.Reset` method
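
For context, a rough sketch of what that superseded prevPods/ignore approach could look like. The type layout and method shapes here are guesses for illustration, not the actual `kubernetes.StatusMonitor` code:

```go
package status

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// StatusMonitor here only illustrates the superseded approach described above.
type StatusMonitor struct {
	// prevPods collects "namespace/name" keys of pods seen in the previous iteration.
	prevPods map[string]struct{}
}

// Reset records the pods that existed at the end of the last iteration so that
// later status checks can put them on the validator's ignore list.
func (m *StatusMonitor) Reset(pods []corev1.Pod) {
	m.prevPods = make(map[string]struct{}, len(pods))
	for _, p := range pods {
		m.prevPods[p.Namespace+"/"+p.Name] = struct{}{}
	}
}

// ignoreList returns the keys a pod Validator could use as its ignore filter
// for a given namespace.
func (m *StatusMonitor) ignoreList(ns string) []string {
	var ignore []string
	for k := range m.prevPods {
		if strings.HasPrefix(k, ns+"/") {
			ignore = append(ignore, k)
		}
	}
	return ignore
}
```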

@tejal29 tejal29 requested a review from a team as a code owner August 5, 2021 19:41
@tejal29 tejal29 requested a review from aaron-prindle August 5, 2021 19:41
@google-cla google-cla bot added the cla: yes label Aug 5, 2021
@tejal29 tejal29 requested a review from MarlonGamez August 5, 2021 19:41
@tejal29 tejal29 force-pushed the fix_prev_it_status_check branch 2 times, most recently from 446f33b to 5e88d45 on August 5, 2021 19:43
@codecov

codecov bot commented Aug 5, 2021

Codecov Report

Merging #6370 (737bc3a) into main (005dc31) will decrease coverage by 0.01%.
The diff coverage is 50.00%.


@@            Coverage Diff             @@
##             main    #6370      +/-   ##
==========================================
- Coverage   70.38%   70.37%   -0.02%     
==========================================
  Files         499      499              
  Lines       22722    22731       +9     
==========================================
+ Hits        15994    15997       +3     
- Misses       5685     5691       +6     
  Partials     1043     1043              
Impacted Files Coverage Δ
pkg/diag/validator/pod.go 1.36% <25.00%> (-0.04%) ⬇️
pkg/skaffold/kubernetes/status/status_check.go 53.40% <100.00%> (+0.74%) ⬆️
pkg/skaffold/util/tar.go 63.21% <0.00%> (-2.30%) ⬇️
pkg/skaffold/docker/parse.go 88.23% <0.00%> (+0.84%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

deps, err := client.AppsV1().Deployments(ns).List(ctx, metav1.ListOptions{
	LabelSelector: l.RunIDSelector(),
})
if err != nil {
	return nil, fmt.Errorf("could not fetch deployments: %w", err)
}

var ignore []string
for k := range prevPods {

nit: Are prevPods only the pods that did not come up successfully in the previous iteration? Would unhealthyPods possibly be more descriptive/accurate here? I would think only unhealthy pods are added to the ignore list.

Contributor

@aaron-prindle aaron-prindle Aug 5, 2021


From this I'm a bit confused: is the idea to ignore all pods from the prev iteration, or only the unhealthy pods from the prev iteration (as per the title)? Are they the same thing? (This wouldn't be obvious to me, as I would think some pods from the prev iteration might have been healthy. Is the idea to not report from the status check until all are healthy?)

Contributor


Previous pods would just be any pod from the previous iteration. This change is mostly so we ignore them in the event API. CLI users shouldn't see any difference, as we don't print updates from pods normally anyway; however, we do send out events for them. So this will help IDEs not report pod failures when we tear down the old pods.

Contributor Author

@tejal29 tejal29 Aug 5, 2021


Nope. They could be unhealthy or healthy pods.
The use case we have in mind is this:

Here, as we are tearing down pods from the previous iteration due to kubectl apply, events (healthy and unhealthy) related to those pods (since they have the same run-id) are shown.

Contributor

@aaron-prindle aaron-prindle left a comment


LGTM

Contributor

@MarlonGamez MarlonGamez left a comment


LGTM and seems good when testing locally; however, maybe let's hold off on merging until the IDEs have tested with it. Will update with their response.

@MarlonGamez
Contributor

@etanshaul said he can do some testing from this branch, will wait and see if he runs into anything

@etanshaul
Contributor

etanshaul commented Aug 6, 2021

Looks like this fixed it!

Before - second iteration (notice the 2 frontend pods, presumably one of them from the previous iteration)
Screen Shot 2021-08-06 at 11 12 27 AM

After - second iteration
Screen Shot 2021-08-06 at 11 15 47 AM

Thanks @tejal29 @MarlonGamez

@etanshaul
Contributor

etanshaul commented Aug 6, 2021

The double java-guestbook-frontend pods under Stream application logs are interesting though...

Update - this seems consistent on this branch. Clicking on the "other" frontend container under App Logs I see:

[frontend] rpc error: code = Unknown desc = Error: No such container: 3d1d72ab54a8230dedfd51c645e6d4cb8446756882e5e0631ba8fe04fa07c8dc

@etanshaul
Contributor

In addition to the above, I also noticed another oddity:

The set of deploy status sub-tasks we get from this branch is inconsistent across iterations (even when changing the exact same service):

Iteration 5
Screen Shot 2021-08-06 at 11 27 42 AM

Iteration 6
Screen Shot 2021-08-06 at 11 28 17 AM

It looks like the original issue of displaying statuses from previous iterations is gone, but would you expect something like this to happen?

@tejal29
Contributor Author

tejal29 commented Aug 6, 2021

It looks like the original issue of displaying statuses from previous iterations is gone, but would you expect something like this to happen?

For iteration 6, did you only change the backend?
Skaffold uses kubectl apply, and hence we don't have control over which deployments get updated.
Another factor for this discrepancy is:

  1. We poll deployments and pods every 1s.
    If kubectl apply <all resources> updated deployment frontend before the first 1s poll, we don't fetch the status of the pods for this deployment.

It shouldn't be a huge code change to fetch the pods. Do you think it would make things easier?

Thanks
Tejal

@etanshaul
Contributor

For iteration 6 did you only change backend?

In both cases, I only changed the frontend.

Another factor for this discrepancy is ..
It shouldn't be a huge code change to fetch the pods. Do you think it would make things easier?

This might explain it. I can make the same change (in the same file) over and over again and each iteration I may see a different set of status nodes.

Is your suggestion to always fetch status for pods regardless of whether or not the status check finishes in 1 sec? (e.g. poll immediately or something)

@tejal29
Contributor Author

tejal29 commented Aug 6, 2021

Is your suggestion to always fetch status for pods regardless of whether or not the status check finishes in 1 sec? (e.g. poll immediately or something)

Yes, we will fetch the pods for a deployment along with their statuses even if it stabilized the first time we check the deployment status.

Right now, what we do is:

while deployment stabilize deadline {
if `kubectl rollout status dep/frontend` is success:
   send ResourceStatusCheckEvent successful for dep/frontend
   exit 
else:
   1) Fetch pods for dep/frontend
   2) Check statuses for pods
   3) for pod  in pods:
         send ResourceStatusCheckEvent successful or pending for pod/frontend-xxxxx
sleep 1s
}
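
For illustration only, the change being discussed might look roughly like this in Go; the helper functions below are stand-ins to keep the sketch self-contained and are not skaffold APIs:

```go
package status

import (
	"context"
	"fmt"
	"time"
)

// Stand-in helpers so the sketch compiles; not real skaffold functions.
var (
	fetchPodsFor              = func(ctx context.Context, dep string) []string { return nil }
	rolloutStatusSucceeded    = func(ctx context.Context, dep string) bool { return true }
	sendPodStatusEvent        = func(pod string) { fmt.Println("pod event:", pod) }
	sendDeploymentStatusEvent = func(dep, status string) { fmt.Println("deployment event:", dep, status) }
)

// pollDeploymentStatus mirrors the pseudocode above, with the proposed tweak of
// always fetching pod statuses, even on the iteration where the rollout is
// already successful, so each iteration reports the same set of sub-tasks.
func pollDeploymentStatus(ctx context.Context, dep string, deadline time.Time) {
	for time.Now().Before(deadline) {
		for _, pod := range fetchPodsFor(ctx, dep) {
			sendPodStatusEvent(pod) // successful or pending
		}
		if rolloutStatusSucceeded(ctx, dep) {
			sendDeploymentStatusEvent(dep, "successful")
			return
		}
		time.Sleep(1 * time.Second)
	}
}
```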

@etanshaul
Contributor

Got it! It makes sense to me to also fetch pod status like you do for the deployment. Is there a particular reason you chose not to originally?

@tejal29
Contributor Author

tejal29 commented Aug 6, 2021

Got it! It makes sense to me to also fetch pod status like you do for the deployment. Is there a particular reason you chose not to originally?

No reason. If the deployment is stabilized, there's no need to diagnose the pods. The design decision was aimed at helping users surface issues in deployments.

@tejal29
Contributor Author

tejal29 commented Aug 6, 2021

Let me give this a spin on this branch. We can test the change today or Monday.

@etanshaul
Contributor

If the deployment is stabilized, there's no need to diagnose the pods.

That's true. For the sake of our UI though I think it's less confusing if the behavior appears consistent.

Let me give this a spin on this branch. We can test the change today or Monday

Great

@pull-request-size pull-request-size bot added size/M and removed size/L labels Aug 6, 2021
@tejal29 tejal29 force-pushed the fix_prev_it_status_check branch from 19a99a0 to ce0e102 on August 6, 2021 22:30
@etanshaul
Contributor

etanshaul commented Aug 9, 2021

@tejal29 let me know if/when you'd like me to give this another test on my end

@tejal29
Contributor Author

tejal29 commented Aug 10, 2021

@etanshaul Even after #6399, I see the issue you mentioned about the 2 frontend nodes in the stream log.

The reason being, this deleted container is still tracked.

image

@etanshaul
Contributor

etanshaul commented Aug 10, 2021

The reason being, this deleted container is still tracked.

Dumb question - can't we just exclude deleted containers from being emitted from the eventing?

@tejal29
Contributor Author

tejal29 commented Aug 10, 2021

The reason being, this deleted container is still tracked.

Dumb question - can't we just exclude deleted containers from being emitted from the eventing?

Yeah, we don't know beforehand which containers are going to get deleted. Keeping this in sync after the deploy happens is a large-impact change and might need some re-work and re-thinking. The right thing to do, I feel, is to stop and start the logger instead of mute/unmute.
