
Take into account number of unavailable replicas to decide if deployment is healthy or not #270

Merged
alexmt merged 3 commits into argoproj:master from fix-app-health-check on Jun 7, 2018

Conversation

@alexmt alexmt (Collaborator) commented Jun 7, 2018

No description provided.

@alexmt alexmt requested a review from jessesuen June 7, 2018 00:39
} else if deploy.Status.UnavailableReplicas > 0 {
health.Status = appv1.HealthStatusDegraded
health.StatusDetails = fmt.Sprintf("Deployment has %v unavailable replicas", deploy.Status.UnavailableReplicas)
} else {
Member

Why aren't we checking UpdatedReplicas at all? Isn't this missing the original bug?

Collaborator Author

The original logic did not work because the controller was checking only deployment conditions: the deployment was considered healthy as soon as it had the required number of replicas, even if some replicas were still running the old pod template.

I think the bug is fixed now, because UnavailableReplicas is 0 only if the deployment has the required number of replicas and all replicas are up to date.

Member

UnavailableReplicas is 0 only if deployment has required number of replicas and all replicas are up to date

I don't believe that is true. Here is the logic in deployment/sync.go:

// calculateStatus calculates the latest status for the provided deployment by looking into the provided replica sets.
func calculateStatus(allRSs []*extensions.ReplicaSet, newRS *extensions.ReplicaSet, deployment *extensions.Deployment) extensions.DeploymentStatus {
	availableReplicas := deploymentutil.GetAvailableReplicaCountForReplicaSets(allRSs)
	totalReplicas := deploymentutil.GetReplicaCountForReplicaSets(allRSs)
	unavailableReplicas := totalReplicas - availableReplicas
	// If unavailableReplicas is negative, then that means the Deployment has more available replicas running than
	// desired, e.g. whenever it scales down. In such a case we should simply default unavailableReplicas to zero.
	if unavailableReplicas < 0 {
		unavailableReplicas = 0
	}

	status := extensions.DeploymentStatus{
		// TODO: Ensure that if we start retrying status updates, we won't pick up a new Generation value.
		ObservedGeneration:  deployment.Generation,
		Replicas:            deploymentutil.GetActualReplicaCountForReplicaSets(allRSs),
		UpdatedReplicas:     deploymentutil.GetActualReplicaCountForReplicaSets([]*extensions.ReplicaSet{newRS}),
		ReadyReplicas:       deploymentutil.GetReadyReplicaCountForReplicaSets(allRSs),
		AvailableReplicas:   availableReplicas,
		UnavailableReplicas: unavailableReplicas,
		CollisionCount:      deployment.Status.CollisionCount,
	}

unavailableReplicas is max(totalReplicas - availableReplicas, 0)

But UpdatedReplicas is calculated:
deploymentutil.GetActualReplicaCountForReplicaSets([]*extensions.ReplicaSet{newRS})

The total replica count includes replicas from all replica sets (both old and new), per the statement: totalReplicas := deploymentutil.GetReplicaCountForReplicaSets(allRSs)

We really only care about UpdatedReplicas.
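
A hypothetical worked example (illustrative numbers, not from this PR) makes the gap concrete: a 3-replica Deployment mid-rollout, where the new ReplicaSet has one available pod and the old ReplicaSet still has two.

package main

import "fmt"

// Illustrative numbers only: all pods are available, so UnavailableReplicas
// is 0 even though only one replica runs the new pod template.
func main() {
	oldAvailable := int32(2) // pods still running the old template
	newAvailable := int32(1) // pods running the new template

	totalReplicas := oldAvailable + newAvailable
	availableReplicas := oldAvailable + newAvailable
	unavailableReplicas := totalReplicas - availableReplicas
	if unavailableReplicas < 0 {
		unavailableReplicas = 0
	}
	updatedReplicas := newAvailable

	fmt.Println("UnavailableReplicas:", unavailableReplicas) // 0 -> looks "healthy"
	fmt.Println("UpdatedReplicas:", updatedReplicas)         // 1 of 3 -> rollout not done
}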

@jessesuen jessesuen (Member), Jun 7, 2018

I think the logic should be something like:

desiredReplicas := *dep.Spec.Replicas
if dep.Status.UpdatedReplicas >= desiredReplicas && dep.Status.AvailableReplicas >= desiredReplicas && dep.Status.ReadyReplicas >= desiredReplicas {
    return true
} else {
    return false
}

It seems ready and available also mean different things.
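
For context (general Kubernetes behavior, not stated in the thread): ReadyReplicas counts pods whose readiness probes currently pass, while AvailableReplicas counts only pods that have stayed Ready for at least spec.minReadySeconds, so AvailableReplicas is always <= ReadyReplicas. A minimal sketch of that distinction; the isAvailable helper is a hypothetical simplification of the pod-availability check in the Deployment controller:

package main

import (
	"fmt"
	"time"
)

// isAvailable: a pod counts as Available only after it has been Ready for at
// least minReadySeconds (simplified; ignores the zero-timestamp edge case).
func isAvailable(ready bool, readySince time.Time, minReadySeconds int32, now time.Time) bool {
	return ready && now.Sub(readySince) >= time.Duration(minReadySeconds)*time.Second
}

func main() {
	now := time.Now()
	readySince := now.Add(-5 * time.Second)             // pod has been Ready for 5s
	fmt.Println(isAvailable(true, readySince, 0, now))  // true: Ready implies Available when minReadySeconds is 0
	fmt.Println(isAvailable(true, readySince, 30, now)) // false: Ready, but not yet Available
}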

Member

Actually, I'm not even sure my above logic is correct, because AvailableReplicas and ReadyReplicas include the replicas from both the old and new replica sets.

Member

Actually, we simply need to duplicate or get the logic from:
kubectl rollout status deployment/dep-name

Member

Here is the logic for rollout status in kubectl/rollout_status.go:

// Status returns a message describing deployment status, and a bool value indicating if the status is considered done.
func (s *DeploymentStatusViewer) Status(namespace, name string, revision int64) (string, bool, error) {
	deployment, err := s.c.Deployments(namespace).Get(name, metav1.GetOptions{})
	if err != nil {
		return "", false, err
	}
	if revision > 0 {
		deploymentRev, err := util.Revision(deployment)
		if err != nil {
			return "", false, fmt.Errorf("cannot get the revision of deployment %q: %v", deployment.Name, err)
		}
		if revision != deploymentRev {
			return "", false, fmt.Errorf("desired revision (%d) is different from the running revision (%d)", revision, deploymentRev)
		}
	}
	if deployment.Generation <= deployment.Status.ObservedGeneration {
		cond := util.GetDeploymentCondition(deployment.Status, extensionsv1beta1.DeploymentProgressing)
		if cond != nil && cond.Reason == util.TimedOutReason {
			return "", false, fmt.Errorf("deployment %q exceeded its progress deadline", name)
		}
		if deployment.Spec.Replicas != nil && deployment.Status.UpdatedReplicas < *deployment.Spec.Replicas {
			return fmt.Sprintf("Waiting for rollout to finish: %d out of %d new replicas have been updated...\n", deployment.Status.UpdatedReplicas, *deployment.Spec.Replicas), false, nil
		}
		if deployment.Status.Replicas > deployment.Status.UpdatedReplicas {
			return fmt.Sprintf("Waiting for rollout to finish: %d old replicas are pending termination...\n", deployment.Status.Replicas-deployment.Status.UpdatedReplicas), false, nil
		}
		if deployment.Status.AvailableReplicas < deployment.Status.UpdatedReplicas {
			return fmt.Sprintf("Waiting for rollout to finish: %d of %d updated replicas are available...\n", deployment.Status.AvailableReplicas, deployment.Status.UpdatedReplicas), false, nil
		}
		return fmt.Sprintf("deployment %q successfully rolled out\n", name), true, nil
	}
	return fmt.Sprintf("Waiting for deployment spec update to be observed...\n"), false, nil
}
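
A minimal sketch of how those rollout-status checks could map onto the health statuses used in the snippet under review. The merged commit says it reuses this logic, but the function below, its signature, and the plain string statuses are illustrative assumptions, not the actual Argo CD implementation:

package main

import "fmt"

// deploymentHealth applies the same ordering of checks as kubectl's rollout
// status, expressed as a health status plus a human-readable detail.
// progressDeadlineExceeded stands in for the DeploymentProgressing condition
// with reason TimedOutReason; the other parameters mirror DeploymentStatus fields.
func deploymentHealth(generation, observedGeneration int64,
	specReplicas, updatedReplicas, totalReplicas, availableReplicas int32,
	progressDeadlineExceeded bool) (status, detail string) {

	if generation > observedGeneration {
		return "Progressing", "waiting for deployment spec update to be observed"
	}
	if progressDeadlineExceeded {
		return "Degraded", "deployment exceeded its progress deadline"
	}
	if updatedReplicas < specReplicas {
		return "Progressing", fmt.Sprintf("%d of %d new replicas have been updated", updatedReplicas, specReplicas)
	}
	if totalReplicas > updatedReplicas {
		return "Progressing", fmt.Sprintf("%d old replicas are pending termination", totalReplicas-updatedReplicas)
	}
	if availableReplicas < updatedReplicas {
		return "Progressing", fmt.Sprintf("%d of %d updated replicas are available", availableReplicas, updatedReplicas)
	}
	return "Healthy", "deployment successfully rolled out"
}

func main() {
	// Example: 3 desired, 1 updated, 3 total, 3 available -> still Progressing,
	// even though UnavailableReplicas would be 0.
	fmt.Println(deploymentHealth(2, 2, 3, 1, 3, 3, false))
}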

Collaborator Author

Thanks a lot for the thoughtful code review! Applied your changes. I also manually verified that we are no longer getting a false-positive health check:

argocd app sync test-app-minikube && argocd app wait test-app-minikube
Application:        test-app-minikube
Operation:          Sync
Phase:              Succeeded
Message:            successfully synced

KIND        NAME          MESSAGE
Service     guestbook-ui  service "guestbook-ui" unchanged
Deployment  guestbook-ui  deployment "guestbook-ui" configured
INFO[0000] App "test-app-minikube" has sync status "OutOfSync" and health status "Healthy"
INFO[0001] App "test-app-minikube" has sync status "Synced" and health status "Progressing"
INFO[0002] App "test-app-minikube" has sync status "Synced" and health status "Progressing"

@alexmt alexmt force-pushed the fix-app-health-check branch from b84eeb2 to 93f62ad on June 7, 2018 15:27
@alexmt alexmt force-pushed the fix-app-health-check branch from 93f62ad to 7917320 on June 7, 2018 16:34
@alexmt alexmt requested a review from merenbach June 7, 2018 17:26
@alexmt alexmt merged commit aa42911 into argoproj:master Jun 7, 2018
@alexmt alexmt deleted the fix-app-health-check branch June 7, 2018 18:05
@alexmt alexmt (Collaborator, Author) commented Jun 7, 2018

Merged to unblock the release. @jessesuen please add your comments when you have time.

@merenbach merenbach (Contributor) left a comment

LGTM

@jessesuen jessesuen (Member)

LGTM

alexmt added a commit that referenced this pull request Jun 7, 2018
Take into account number of unavailable replicas to decide if deployment is healthy or not (#270)

* Take into account number of unavailable replicas to decide if deployment is healthy or not

* Run one controller for all e2e tests to reduce test duration

* Apply reviewer notes: use logic from kubectl/rollout_status.go to check deployment health