-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Take into account number of unavailable replicas to decided if deployment is healthy or not #270
Conversation
controller/health.go
Outdated
} else if deploy.Status.UnavailableReplicas > 0 { | ||
health.Status = appv1.HealthStatusDegraded | ||
health.StatusDetails = fmt.Sprintf("Deployment has %v unavailable replicas", deploy.Status.UnavailableReplicas) | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why aren't we checkin UpdatedReplicas at all? Isn't this missing the original bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Original logic did not work because controller was checking only deployment conditions. Deployment was considered healthy if deployment had required number of replicas even if some replicas has old pod template.
I think now bug is fixed because UnavailableReplicas is 0 only if deployment has required number of replicas and all replicas are up to date.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UnavailableReplicas is 0 only if deployment has required number of replicas and all replicas are up to date
I don't believe that is true. Here is the logic in deployment/sync.go:
// calculateStatus calculates the latest status for the provided deployment by looking into the provided replica sets.
func calculateStatus(allRSs []*extensions.ReplicaSet, newRS *extensions.ReplicaSet, deployment *extensions.Deployment) extensions.DeploymentStatus {
availableReplicas := deploymentutil.GetAvailableReplicaCountForReplicaSets(allRSs)
totalReplicas := deploymentutil.GetReplicaCountForReplicaSets(allRSs)
unavailableReplicas := totalReplicas - availableReplicas
// If unavailableReplicas is negative, then that means the Deployment has more available replicas running than
// desired, e.g. whenever it scales down. In such a case we should simply default unavailableReplicas to zero.
if unavailableReplicas < 0 {
unavailableReplicas = 0
}
status := extensions.DeploymentStatus{
// TODO: Ensure that if we start retrying status updates, we won't pick up a new Generation value.
ObservedGeneration: deployment.Generation,
Replicas: deploymentutil.GetActualReplicaCountForReplicaSets(allRSs),
UpdatedReplicas: deploymentutil.GetActualReplicaCountForReplicaSets([]*extensions.ReplicaSet{newRS}),
ReadyReplicas: deploymentutil.GetReadyReplicaCountForReplicaSets(allRSs),
AvailableReplicas: availableReplicas,
UnavailableReplicas: unavailableReplicas,
CollisionCount: deployment.Status.CollisionCount,
}
unavailableReplicas is max(totalReplicas - availableReplicas, 0)
But UpdatedReplicas is calculated:
deploymentutil.GetActualReplicaCountForReplicaSets([]*extensions.ReplicaSet{newRS})
the Total replicas includes replicas from all replica sets (both old an new), from the statement: totalReplicas := deploymentutil.GetReplicaCountForReplicaSets(allRSs)
We really only care about UpdatedReplicas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the logic should be something like:
desiredReplicas := dep.Spec.Replicas
if UpdatedReplicas >= desiredReplicas && availableReplicas >= desiredReplicas && ReadyReplicas > desiredReplicas {
return true
} else {
return false
}
It seems ready and available also mean different things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I'm not even sure my above logic is correct because availableReplicas and ReadyReplicas include the replicas from old and new.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we simply need to duplicate or get the logic from:
kubectl rollout status deployment/dep-name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the logic for rollout status
in kubectl/rollout_status.go
:
// Status returns a message describing deployment status, and a bool value indicating if the status is considered done.
func (s *DeploymentStatusViewer) Status(namespace, name string, revision int64) (string, bool, error) {
deployment, err := s.c.Deployments(namespace).Get(name, metav1.GetOptions{})
if err != nil {
return "", false, err
}
if revision > 0 {
deploymentRev, err := util.Revision(deployment)
if err != nil {
return "", false, fmt.Errorf("cannot get the revision of deployment %q: %v", deployment.Name, err)
}
if revision != deploymentRev {
return "", false, fmt.Errorf("desired revision (%d) is different from the running revision (%d)", revision, deploymentRev)
}
}
if deployment.Generation <= deployment.Status.ObservedGeneration {
cond := util.GetDeploymentCondition(deployment.Status, extensionsv1beta1.DeploymentProgressing)
if cond != nil && cond.Reason == util.TimedOutReason {
return "", false, fmt.Errorf("deployment %q exceeded its progress deadline", name)
}
if deployment.Spec.Replicas != nil && deployment.Status.UpdatedReplicas < *deployment.Spec.Replicas {
return fmt.Sprintf("Waiting for rollout to finish: %d out of %d new replicas have been updated...\n", deployment.Status.UpdatedReplicas, *deployment.Spec.Replicas), false, nil
}
if deployment.Status.Replicas > deployment.Status.UpdatedReplicas {
return fmt.Sprintf("Waiting for rollout to finish: %d old replicas are pending termination...\n", deployment.Status.Replicas-deployment.Status.UpdatedReplicas), false, nil
}
if deployment.Status.AvailableReplicas < deployment.Status.UpdatedReplicas {
return fmt.Sprintf("Waiting for rollout to finish: %d of %d updated replicas are available...\n", deployment.Status.AvailableReplicas, deployment.Status.UpdatedReplicas), false, nil
}
return fmt.Sprintf("deployment %q successfully rolled out\n", name), true, nil
}
return fmt.Sprintf("Waiting for deployment spec update to be observed...\n"), false, nil
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for thoughtful code review! Applied your changes. Also manually verified that we are not getting false positive check any more:
argocd app sync test-app-minikube && argocd app wait test-app-minikube
Application: test-app-minikube
Operation: Sync
Phase: Succeeded
Message: successfully synced
KIND NAME MESSAGE
Service guestbook-ui service "guestbook-ui" unchanged
Deployment guestbook-ui deployment "guestbook-ui" configured
INFO[0000] App "test-app-minikube" has sync status "OutOfSync" and health status "Healthy"
INFO[0001] App "test-app-minikube" has sync status "Synced" and health status "Progressing"
INFO[0002] App "test-app-minikube" has sync status "Synced" and health status "Progressing"
b84eeb2
to
93f62ad
Compare
93f62ad
to
7917320
Compare
Merged to unblock release . @jessesuen please give your comments when you have time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
LGTM |
…ment is healthy or not (#270) * Take into account number of unavailable replicas to decided if deployment is healthy or not * Run one controller for all e2e tests to reduce tests duration * Apply reviewer notes: use logic from kubectl/rollout_status.go to check deployment health
No description provided.