
Take into account number of unavailable replicas to decide if deployment is healthy or not #270

Merged
alexmt merged 3 commits into argoproj:master from fix-app-health-check on Jun 7, 2018

Conversation

@alexmt alexmt (Collaborator) commented Jun 7, 2018

No description provided.

@alexmt alexmt requested a review from jessesuen June 7, 2018 00:39
} else if deploy.Status.UnavailableReplicas > 0 {
health.Status = appv1.HealthStatusDegraded
health.StatusDetails = fmt.Sprintf("Deployment has %v unavailable replicas", deploy.Status.UnavailableReplicas)
} else {
Member

Why aren't we checking UpdatedReplicas at all? Isn't this missing the original bug?

Collaborator Author

The original logic did not work because the controller was checking only deployment conditions: the deployment was considered healthy as soon as it had the required number of replicas, even if some replicas were still running the old pod template.

I think the bug is fixed now, because UnavailableReplicas is 0 only if the deployment has the required number of replicas and all replicas are up to date.

Member

UnavailableReplicas is 0 only if deployment has required number of replicas and all replicas are up to date

I don't believe that is true. Here is the logic in deployment/sync.go:

// calculateStatus calculates the latest status for the provided deployment by looking into the provided replica sets.
func calculateStatus(allRSs []*extensions.ReplicaSet, newRS *extensions.ReplicaSet, deployment *extensions.Deployment) extensions.DeploymentStatus {
	availableReplicas := deploymentutil.GetAvailableReplicaCountForReplicaSets(allRSs)
	totalReplicas := deploymentutil.GetReplicaCountForReplicaSets(allRSs)
	unavailableReplicas := totalReplicas - availableReplicas
	// If unavailableReplicas is negative, then that means the Deployment has more available replicas running than
	// desired, e.g. whenever it scales down. In such a case we should simply default unavailableReplicas to zero.
	if unavailableReplicas < 0 {
		unavailableReplicas = 0
	}

	status := extensions.DeploymentStatus{
		// TODO: Ensure that if we start retrying status updates, we won't pick up a new Generation value.
		ObservedGeneration:  deployment.Generation,
		Replicas:            deploymentutil.GetActualReplicaCountForReplicaSets(allRSs),
		UpdatedReplicas:     deploymentutil.GetActualReplicaCountForReplicaSets([]*extensions.ReplicaSet{newRS}),
		ReadyReplicas:       deploymentutil.GetReadyReplicaCountForReplicaSets(allRSs),
		AvailableReplicas:   availableReplicas,
		UnavailableReplicas: unavailableReplicas,
		CollisionCount:      deployment.Status.CollisionCount,
	}

unavailableReplicas is max(totalReplicas - availableReplicas, 0)

But UpdatedReplicas is calculated:
deploymentutil.GetActualReplicaCountForReplicaSets([]*extensions.ReplicaSet{newRS})

The total replica count includes replicas from all replica sets (both old and new), per the statement: totalReplicas := deploymentutil.GetReplicaCountForReplicaSets(allRSs)

We really only care about UpdatedReplicas.
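
A hypothetical worked example (illustrative numbers, not from this PR) makes the gap concrete: a 3-replica Deployment mid-rollout, where the new ReplicaSet has one available pod and the old ReplicaSet still has two.

package main

import "fmt"

// Illustrative numbers only: all pods are available, so UnavailableReplicas
// is 0 even though only one replica runs the new pod template.
func main() {
	oldAvailable := int32(2) // pods still running the old template
	newAvailable := int32(1) // pods running the new template

	totalReplicas := oldAvailable + newAvailable
	availableReplicas := oldAvailable + newAvailable
	unavailableReplicas := totalReplicas - availableReplicas
	if unavailableReplicas < 0 {
		unavailableReplicas = 0
	}
	updatedReplicas := newAvailable

	fmt.Println("UnavailableReplicas:", unavailableReplicas) // 0 -> looks "healthy"
	fmt.Println("UpdatedReplicas:", updatedReplicas)         // 1 of 3 -> rollout not done
}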

@jessesuen jessesuen (Member), Jun 7, 2018

I think the logic should be something like:

desiredReplicas := *dep.Spec.Replicas
if dep.Status.UpdatedReplicas >= desiredReplicas && dep.Status.AvailableReplicas >= desiredReplicas && dep.Status.ReadyReplicas >= desiredReplicas {
    return true
} else {
    return false
}

It seems ready and available also mean different things.
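
For context (general Kubernetes behavior, not stated in the thread): ReadyReplicas counts pods whose readiness probes currently pass, while AvailableReplicas counts only pods that have stayed Ready for at least spec.minReadySeconds, so AvailableReplicas is always <= ReadyReplicas. A minimal sketch of that distinction; the isAvailable helper is a hypothetical simplification of the pod-availability check in the Deployment controller:

package main

import (
	"fmt"
	"time"
)

// isAvailable: a pod counts as Available only after it has been Ready for at
// least minReadySeconds (simplified; ignores the zero-timestamp edge case).
func isAvailable(ready bool, readySince time.Time, minReadySeconds int32, now time.Time) bool {
	return ready && now.Sub(readySince) >= time.Duration(minReadySeconds)*time.Second
}

func main() {
	now := time.Now()
	readySince := now.Add(-5 * time.Second)             // pod has been Ready for 5s
	fmt.Println(isAvailable(true, readySince, 0, now))  // true: Ready implies Available when minReadySeconds is 0
	fmt.Println(isAvailable(true, readySince, 30, now)) // false: Ready, but not yet Available
}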

Member

Actually, I'm not even sure my above logic is correct, because AvailableReplicas and ReadyReplicas include the replicas from both the old and new replica sets.

Member

Actually, we simply need to duplicate or get the logic from:
kubectl rollout status deployment/dep-name

Member

Here is the logic for rollout status in kubectl/rollout_status.go:

// Status returns a message describing deployment status, and a bool value indicating if the status is considered done.
func (s *DeploymentStatusViewer) Status(namespace, name string, revision int64) (string, bool, error) {
	deployment, err := s.c.Deployments(namespace).Get(name, metav1.GetOptions{})
	if err != nil {
		return "", false, err
	}
	if revision > 0 {
		deploymentRev, err := util.Revision(deployment)
		if err != nil {
			return "", false, fmt.Errorf("cannot get the revision of deployment %q: %v", deployment.Name, err)
		}
		if revision != deploymentRev {
			return "", false, fmt.Errorf("desired revision (%d) is different from the running revision (%d)", revision, deploymentRev)
		}
	}
	if deployment.Generation <= deployment.Status.ObservedGeneration {
		cond := util.GetDeploymentCondition(deployment.Status, extensionsv1beta1.DeploymentProgressing)
		if cond != nil && cond.Reason == util.TimedOutReason {
			return "", false, fmt.Errorf("deployment %q exceeded its progress deadline", name)
		}
		if deployment.Spec.Replicas != nil && deployment.Status.UpdatedReplicas < *deployment.Spec.Replicas {
			return fmt.Sprintf("Waiting for rollout to finish: %d out of %d new replicas have been updated...\n", deployment.Status.UpdatedReplicas, *deployment.Spec.Replicas), false, nil
		}
		if deployment.Status.Replicas > deployment.Status.UpdatedReplicas {
			return fmt.Sprintf("Waiting for rollout to finish: %d old replicas are pending termination...\n", deployment.Status.Replicas-deployment.Status.UpdatedReplicas), false, nil
		}
		if deployment.Status.AvailableReplicas < deployment.Status.UpdatedReplicas {
			return fmt.Sprintf("Waiting for rollout to finish: %d of %d updated replicas are available...\n", deployment.Status.AvailableReplicas, deployment.Status.UpdatedReplicas), false, nil
		}
		return fmt.Sprintf("deployment %q successfully rolled out\n", name), true, nil
	}
	return fmt.Sprintf("Waiting for deployment spec update to be observed...\n"), false, nil
}
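
A minimal sketch of how those rollout-status checks could map onto the health statuses used in the snippet under review. The merged commit says it reuses this logic, but the function below, its signature, and the plain string statuses are illustrative assumptions, not the actual Argo CD implementation:

package main

import "fmt"

// deploymentHealth applies the same ordering of checks as kubectl's rollout
// status, expressed as a health status plus a human-readable detail.
// progressDeadlineExceeded stands in for the DeploymentProgressing condition
// with reason TimedOutReason; the other parameters mirror DeploymentStatus fields.
func deploymentHealth(generation, observedGeneration int64,
	specReplicas, updatedReplicas, totalReplicas, availableReplicas int32,
	progressDeadlineExceeded bool) (status, detail string) {

	if generation > observedGeneration {
		return "Progressing", "waiting for deployment spec update to be observed"
	}
	if progressDeadlineExceeded {
		return "Degraded", "deployment exceeded its progress deadline"
	}
	if updatedReplicas < specReplicas {
		return "Progressing", fmt.Sprintf("%d of %d new replicas have been updated", updatedReplicas, specReplicas)
	}
	if totalReplicas > updatedReplicas {
		return "Progressing", fmt.Sprintf("%d old replicas are pending termination", totalReplicas-updatedReplicas)
	}
	if availableReplicas < updatedReplicas {
		return "Progressing", fmt.Sprintf("%d of %d updated replicas are available", availableReplicas, updatedReplicas)
	}
	return "Healthy", "deployment successfully rolled out"
}

func main() {
	// Example: 3 desired, 1 updated, 3 total, 3 available -> still Progressing,
	// even though UnavailableReplicas would be 0.
	fmt.Println(deploymentHealth(2, 2, 3, 1, 3, 3, false))
}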

Collaborator Author

Thanks a lot for the thoughtful code review! Applied your changes. I also manually verified that we are no longer getting a false-positive health check:

argocd app sync test-app-minikube && argocd app wait test-app-minikube
Application:        test-app-minikube
Operation:          Sync
Phase:              Succeeded
Message:            successfully synced

KIND        NAME          MESSAGE
Service     guestbook-ui  service "guestbook-ui" unchanged
Deployment  guestbook-ui  deployment "guestbook-ui" configured
INFO[0000] App "test-app-minikube" has sync status "OutOfSync" and health status "Healthy"
INFO[0001] App "test-app-minikube" has sync status "Synced" and health status "Progressing"
INFO[0002] App "test-app-minikube" has sync status "Synced" and health status "Progressing"

@alexmt alexmt force-pushed the fix-app-health-check branch from b84eeb2 to 93f62ad on June 7, 2018 15:27
@alexmt alexmt force-pushed the fix-app-health-check branch from 93f62ad to 7917320 on June 7, 2018 16:34
@alexmt alexmt requested a review from merenbach June 7, 2018 17:26
@alexmt alexmt merged commit aa42911 into argoproj:master Jun 7, 2018
@alexmt alexmt deleted the fix-app-health-check branch June 7, 2018 18:05
@alexmt alexmt (Collaborator, Author) commented Jun 7, 2018

Merged to unblock the release. @jessesuen please add your comments when you have time.

@merenbach merenbach (Contributor) left a comment

LGTM

@jessesuen jessesuen (Member)

LGTM

alexmt added a commit that referenced this pull request Jun 7, 2018
Take into account number of unavailable replicas to decide if deployment is healthy or not (#270)

* Take into account number of unavailable replicas to decide if deployment is healthy or not

* Run one controller for all e2e tests to reduce test duration

* Apply reviewer notes: use logic from kubectl/rollout_status.go to check deployment health