Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove health check filter from Marathon tasks. #2817

Merged

Conversation

timoreimann
Copy link
Contributor

@timoreimann timoreimann commented Feb 6, 2018

What does this PR do?

Remove health check filter from Marathon tasks.

Motivation

When available, Traefik uses the last Marathon health check result to decide whether a task should be filtered or not. This behavior is incorrect, however, as it ignores the failure threshold denoting the
number of failed health checks it would take for Marathon to actually consider a task as failing and trigger a restart. With Traefik not taking the threshold into account, a task is removed from the
configuration already while Marathon may still deem it healthy.

Delaying the filtering until the moment when the failure threshold is exceeded would not buy us much: At that point, Marathon will decide to restart the unhealthy task anyway and cause a task state change that will lead to a Traefik configuration update excluding the task in question.

Consequently, we can remove the filter logic.

More

  • Added/updated tests
  • Added/updated documentation

@timoreimann timoreimann requested a review from a team as a code owner February 6, 2018 23:41
@ldez ldez added the kind/enhancement a new or improved feature. label Feb 6, 2018
@ldez ldez added this to the 1.6 milestone Feb 6, 2018
@timoreimann
Copy link
Contributor Author

I'd appreciate any additional feedback from Marathon provider users.

(@aantono maybe?)

Copy link
Contributor

@m3co-code m3co-code left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was working together with @timoreimann on that issue, so LGTM 👍

@aantono
Copy link
Contributor

aantono commented Feb 7, 2018

I think this makes good sense. In general the delegated provider's source (i.e. Marathon, Consul, K8s, etc) should be responsible for managing health state of the entries and Traefik only responds to what they produce as "valid endpoints". If one wants to explicitly have Traefik do health checking, that can be done via backend healthcheck config. So in short - LGTM 👍

@timoreimann
Copy link
Contributor Author

@aantono thanks for your feedback. 👏

@juliens
Copy link
Member

juliens commented Feb 8, 2018

Do you think we can add an option to enable or disable this (with default to enable for breaking change) ?

@timoreimann
Copy link
Contributor Author

@juliens taking our latest discussion on the subject on Slack into account, do you think it's still necessary to have a switch?

My argument is that the existing behavior is inherently broken and bears a risk to availability for Marathon users, so any setting matching this behavior by default won't do any better. Users should either adjust their Marathon health checks or employ Traefik health checks.

Let me know your thoughts.

@juliens
Copy link
Member

juliens commented Feb 12, 2018

After discussion, oky not to add the switch.

Copy link
Member

@mmatur mmatur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@juliens juliens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

When available, Traefik uses the last Marathon health check result to
decide whether a task should be filtered or not. This behavior is
incorrect, however, as it ignores the failure threshold denoting the
number of failed health checks it would take for Marathon to actually
consider a task as failing and trigger a restart. With Traefik not
taking the threshold into account, a task is removed from the
configuration already while Marathon may still deem it healthy.

Delaying the filtering until the moment when the failure threshold is
exceeded does not buy us much: At that point, Marathon will decide to
restart the unhealthy task anyway and cause a task state change that
will lead to a Traefik configuration update excluding the task in
question.

Consequently, we can remove the filter logic.
@traefiker traefiker force-pushed the marathon/remove-health-check-task-filter branch from 119ac80 to 2e16430 Compare February 13, 2018 06:44
@traefiker traefiker merged commit 17e85e3 into traefik:master Feb 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants