Improve rancher provider handling of service and container health states #1343

kelchm · 2017-03-24T21:24:05Z

This pull request aims to fix some basic issues which impact the usability of the rancher provider. During my testing with the rancher provider I found that services which did not have a healthy state were removed from the traefik config. In practice we should not care about the health of a service, but only about the health of that service's individual containers. As a result I've made the following changes:

- By default, we no longer filter services by their HealthState. A modified version of this functionality may be enabled using the 'EnableServiceHealthFilter' configuration toggle. This is particularly useful for 'advanced' configurations such as blue/green deployments.
- A RefreshSeconds configuration parameter has been added (much like in the ECS provider) for configuring how often the Rancher API is queried.
- Both containers and services are filtered by a combination of their HealthState and State due to the behavior of the Rancher API.
  - healthy and updating-healthy are considered a healthy HealthState for containers and services.
  - active, updating-active and upgraded are considered a healthy State for services.
  - running and updating-running are considered a healthy State for containers.
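
The container-level rules listed above can be sketched in Go roughly as follows. This is a minimal illustration, not the provider's actual code: the `container` struct and this `containerFilter` signature are hypothetical, the health-state string `updating-healthy` is an assumption modeled on `updating-running` and should be verified against the Rancher API, and letting an empty field pass mirrors the `service.Health != ""` check visible in the diff further down.

```go
package main

import "fmt"

// container is a hypothetical, trimmed-down stand-in for the Rancher API
// container type; only the fields needed for filtering are modeled.
type container struct {
	Name        string
	HealthState string
	State       string
}

// Assumed sets of acceptable values, taken from the PR description.
var (
	healthyContainerHealth = map[string]bool{"healthy": true, "updating-healthy": true}
	healthyContainerState  = map[string]bool{"running": true, "updating-running": true}
)

// containerFilter reports whether a container should be kept as a backend.
// An empty field is treated as "no information" and is not filtered.
func containerFilter(c container) bool {
	if c.HealthState != "" && !healthyContainerHealth[c.HealthState] {
		return false
	}
	if c.State != "" && !healthyContainerState[c.State] {
		return false
	}
	return true
}

func main() {
	samples := []container{
		{Name: "web-1", HealthState: "healthy", State: "running"},
		// HealthState alone is not sufficient: this one is not running.
		{Name: "web-2", HealthState: "healthy", State: "stopped"},
		{Name: "web-3", HealthState: "initializing", State: "running"},
	}
	for _, c := range samples {
		fmt.Printf("%s keep=%v\n", c.Name, containerFilter(c))
	}
}
```

Only `web-1` is kept here; the other two fail on State and HealthState respectively.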

If anyone has any feedback or test results please feel free to share them with me here or in #rancher on the Traefik Slack.

Fixes #1253

kelchm · 2017-03-24T21:27:57Z

@SantoDE, @nuclon I'd be interested to see your feedback on this.

One area I have not tested yet is how these changes affect containers that lack a health check.

errm · 2017-03-24T23:24:14Z

LGTM, but is there a way to cover this case with a unit test?

kelchm · 2017-03-24T23:35:51Z

@errm, which case specifically? For containers without a health check I need to actually test the behavior of the Rancher API.

timoreimann · 2017-03-25T00:01:22Z

@kelchm with your PR, parseRancherData() now seems to skip containers from the result set that are missing certain health states. Ideally we can verify this through the calling Provide() method; since this will likely mean we'll have to jump through a number of hoops, however, I think it's also okay to test parseRancherData() directly.
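
A table-driven check along the lines suggested here could look like the sketch below. Everything in it is hypothetical: `backendIPs` is a simplified stand-in for the relevant part of parseRancherData(), the `container` struct is not the Rancher client type, and the filtering rule is reduced to "healthy and running" for brevity. In a real `_test.go` file the loop in `runCases` would sit inside a `TestXxx(t *testing.T)` function and report failures with `t.Errorf`.

```go
package main

import (
	"fmt"
	"reflect"
)

// container is a hypothetical, simplified stand-in for the Rancher API type.
type container struct {
	IP          string
	Service     string // value of the io.rancher.stack_service.name label
	HealthState string
	State       string
}

// backendIPs collects the IPs of a service's containers and skips any
// container that is not healthy and running (simplified rule).
func backendIPs(service string, containers []container) []string {
	var ips []string
	for _, c := range containers {
		if c.Service != service {
			continue
		}
		if c.HealthState != "healthy" || c.State != "running" {
			continue
		}
		ips = append(ips, c.IP)
	}
	return ips
}

// testCase is one row of the table: an input set and the expected result.
type testCase struct {
	desc       string
	containers []container
	expected   []string
}

// exampleCases returns illustrative rows; the values are made up.
func exampleCases() []testCase {
	return []testCase{
		{
			desc: "healthy running container is included",
			containers: []container{
				{IP: "10.0.0.1", Service: "stack/web", HealthState: "healthy", State: "running"},
			},
			expected: []string{"10.0.0.1"},
		},
		{
			desc: "stopped container is skipped even if reported healthy",
			containers: []container{
				{IP: "10.0.0.2", Service: "stack/web", HealthState: "healthy", State: "stopped"},
			},
			expected: nil,
		},
		{
			desc: "container of another service is ignored",
			containers: []container{
				{IP: "10.0.0.3", Service: "stack/db", HealthState: "healthy", State: "running"},
			},
			expected: nil,
		},
	}
}

// runCases evaluates every row and returns one message per failing row.
func runCases(cases []testCase) []string {
	var failures []string
	for _, tc := range cases {
		actual := backendIPs("stack/web", tc.containers)
		if !reflect.DeepEqual(actual, tc.expected) {
			failures = append(failures,
				fmt.Sprintf("%s: expected %v, got %v", tc.desc, tc.expected, actual))
		}
	}
	return failures
}

func main() {
	fmt.Println("failing cases:", len(runCases(exampleCases())))
}
```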

errm · 2017-03-25T01:50:18Z

Yeah I would be fine with testing parseRancherData() directly...

SantoDE · 2017-03-27T06:09:19Z

Hey @kelchm,

as mentioned already, I like where this PR is going :) I'm also fine with moving the checks from the service to the container.

However, as mentioned twice already, we need some unit test to cover it ;)

SantoDE

Add unit test :)

kelchm · 2017-03-30T15:40:04Z

I'm going to need to rework some of my original plan here -- it turns out that the Rancher API returns a healthState of "healthy" in situations we wouldn't expect. For example, if a stack is stopped any containers within that stack's services still return a healthState of "healthy". I've opened an issue to confirm that this is the expected behavior of the Rancher API: rancher/rancher/issues/8354

My thinking is that we will likely need to use a combination of factors to determine first if a service should be included in the config and then second if a given container should be included for that service.

jwentworth · 2017-04-11T08:02:40Z

I think this is close to fixing what I see as the root cause of the issue. I think we need to add a check on the container's State (not just HealthState). I would say we'd only want to include containers that have a HealthState of "healthy" and a State of "running". I don't think I'd include containers that are being upgraded or initializing into the LB until they reach a healthy state, and as you mentioned before a healthy HealthState alone isn't enough to determine if a container should be included in the load balancer.

From my investigation of the API and my use cases, I think those are all the criteria that are needed when using the API (the metadata service deals with things a little differently, but this isn't using that).

kelchm · 2017-04-12T20:22:13Z

@SantoDE, I think this is ready for your review. I think the only thing remaining is some minor updates to the documentation. I've tried to improve the test coverage; hopefully this is okay for now.

jwentworth · 2017-04-12T21:35:47Z

This looks good to me. I haven't been able to test it personally yet, but it looks like it would resolve the issues I'm seeing!

martinbaillie · 2017-04-18T03:59:58Z

I've had a quick test this afternoon and it LGTM. A much-needed enhancement, thanks @kelchm.

Now it could be my use case bias, but this seems like it should be default functionality - I would say EnableServiceHealthFilter should default to true. As it stands, I'm seeing my Traefik instance include stopped containers as active backends (so long as they have the label). Thoughts?

kelchm · 2017-04-18T20:25:59Z

@martinbaillie, containers which have a HealthState which is not 'healthy' or 'updating-healthy' and/or a State which is not 'running' or 'updating-running' should be filtered by containerFilter. I haven't seen the issue you are describing, but it's definitely possible I missed something.

Can you grab the API response for the container(s) you are seeing this behavior on? I'd be curious what the State and HealthState are.

martinbaillie · 2017-04-18T21:28:11Z

@kelchm sorry, I may have confused things. I meant it works perfectly with your PR enabled. I was questioning whether it should default to true rather than false, because without your PR enabled I saw stopped containers appear in the backend list, which is undesirable default behaviour.

kelchm · 2017-04-18T21:47:48Z

@martinbaillie, there are two types of filters: containerFilter, which filters at the container level, and serviceFilter, which filters at the service level.

I have serviceFilter disabled by default because it implies a more 'advanced' use case, such as replacing an existing stack with a new stack rather than upgrading the services within an existing stack. The containerFilter, which is always on, should prevent stopped (or otherwise unhealthy) containers from being brought in as backends even when EnableServiceHealthFilter is disabled.
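
The opt-in nature of the service-level filter can be illustrated with a short Go sketch. The `service` and `provider` types below are hypothetical reductions, not the provider's real structs, and the health-state string `updating-healthy` is an assumption; the State values are the ones named in the PR description.

```go
package main

import "fmt"

// service is a hypothetical, trimmed-down stand-in for the Rancher API type.
type service struct {
	Name   string
	Health string
	State  string
}

// provider models only the configuration toggle discussed in this thread.
type provider struct {
	EnableServiceHealthFilter bool
}

// serviceFilter reports whether a service stays in the generated config.
// Health and State are only consulted when the toggle is enabled.
func (p provider) serviceFilter(s service) bool {
	if !p.EnableServiceHealthFilter {
		return true
	}
	if s.Health != "" && s.Health != "healthy" && s.Health != "updating-healthy" {
		return false
	}
	switch s.State {
	case "", "active", "updating-active", "upgraded":
		return true
	default:
		return false
	}
}

func main() {
	degraded := service{Name: "web", Health: "unhealthy", State: "active"}

	off := provider{EnableServiceHealthFilter: false}
	on := provider{EnableServiceHealthFilter: true}

	fmt.Println("toggle off, keep service:", off.serviceFilter(degraded))
	fmt.Println("toggle on, keep service:", on.serviceFilter(degraded))
}
```

With the toggle off, the degraded service is kept and only its individual containers are subject to filtering; with the toggle on, the whole service is dropped.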

martinbaillie · 2017-04-19T02:26:42Z

Thanks @kelchm that explains it. I was struggling to come up with a use case for when this would be desirable behaviour.

SantoDE

Hey @kelchm

Thanks a lot for your work! I really like it :)

Besides the minor typo, it's a LGTM from me. Could you please rebase and squash your commits?

Thanks!

SantoDE · 2017-04-19T07:09:33Z

provider/rancher.go

-	if service.Health != "" && service.Health != "healthy" {
-		log.Debugf("Filtering unhealthy or starting service %s", service.Name)
-		return false
+	// Only filter services by Health (HealthState) and State is EnableServiceHealthFilter is true


typo I guess?

if EnableServiceHealthFilter is true... ?

kelchm

Good eye @SantoDE, fixed!

SantoDE · 2017-04-24T06:48:42Z

Hey @kelchm ,

a) tests are failing :'(
b) you have to rebase again ;-)

Thanks for your work! :)

ldez · 2017-04-24T22:39:21Z

@kelchm Could you squash your commits? Thanks.

ldez

Could you update traefik.sample.toml? Thanks.

emilevauge

Thanks @kelchm
Great!
Could you also complete traefik.sample.toml?

emilevauge · 2017-04-28T13:50:42Z

provider/rancher/rancher.go

-					if key == "io.rancher.stack_service.name" && value == rancherData.Name {
-						rancherData.Containers = append(rancherData.Containers, container.PrimaryIpAddress)
-					}
+				if containerFilter(container) && container.Labels["io.rancher.stack_service.name"] == rancherData.Name {


It would be even better to write if container.Labels["io.rancher.stack_service.name"] == rancherData.Name && containerFilter(container) {, but meh :)
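
One plausible reading of this suggestion is that Go's `&&` short-circuits, so putting the cheap label comparison first means the filter only runs for containers that belong to the service. The sketch below counts filter invocations to show the effect; the `container` type and the closure used as a filter are hypothetical stand-ins, and the reasoning is an interpretation rather than something stated in the review.

```go
package main

import "fmt"

// container is a hypothetical, simplified stand-in for the Rancher API type.
type container struct {
	Labels map[string]string
	State  string
}

// countMatches returns how many containers belong to serviceName and pass
// the filter, together with how often the filter was actually invoked.
func countMatches(serviceName string, containers []container) (matches, filterCalls int) {
	// filter stands in for containerFilter and records each invocation.
	filter := func(c container) bool {
		filterCalls++
		return c.State == "running"
	}
	for _, c := range containers {
		// Label comparison first: when it is false, && skips the filter call.
		if c.Labels["io.rancher.stack_service.name"] == serviceName && filter(c) {
			matches++
		}
	}
	return matches, filterCalls
}

func main() {
	containers := []container{
		{Labels: map[string]string{"io.rancher.stack_service.name": "stack/web"}, State: "running"},
		{Labels: map[string]string{"io.rancher.stack_service.name": "stack/db"}, State: "running"},
		{Labels: map[string]string{"io.rancher.stack_service.name": "stack/cache"}, State: "stopped"},
	}
	matches, calls := countMatches("stack/web", containers)
	fmt.Printf("matches=%d filterCalls=%d\n", matches, calls)
}
```

Besides saving work, this ordering would also keep any per-container debug logging inside the filter limited to containers of the service being processed.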

kelchm · 2017-04-29T14:57:27Z

@emilevauge, @ldez done 👍

ldez · 2017-04-29T15:22:16Z

traefik.sample.toml

+#
+# Optional
+#
+RefreshSeconds = 15


Could you comment out this line?
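
For context on what this setting controls, a refresh interval of this kind is typically consumed by a ticker-driven polling loop. The Go sketch below is an assumption about the general shape, not the provider's implementation: `pollEvery` and `runDemo` are hypothetical names, and the demo scales the sample value of 15 down to milliseconds so it finishes quickly.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pollEvery calls refresh once per interval until ctx is cancelled.
func pollEvery(ctx context.Context, interval time.Duration, refresh func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		// Check cancellation first so that a cancel issued from inside
		// refresh stops the loop before another tick is handled.
		if ctx.Err() != nil {
			return
		}
		select {
		case <-ticker.C:
			refresh()
		case <-ctx.Done():
			return
		}
	}
}

// runDemo polls with a short interval, cancels after three refreshes, and
// returns how many times the refresh callback ran.
func runDemo() int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	refreshSeconds := 15 // value from the sample config
	// Milliseconds instead of seconds, purely to keep the demo fast.
	interval := time.Duration(refreshSeconds) * time.Millisecond

	count := 0
	pollEvery(ctx, interval, func() {
		count++ // a real provider would query the Rancher API here
		if count == 3 {
			cancel()
		}
	})
	return count
}

func main() {
	fmt.Println("refreshes:", runDemo())
}
```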

ldez · 2017-04-29T15:22:19Z

traefik.sample.toml

+# Optional
+# Default: false
+#
+EnableServiceHealthFilter = false


Could you comment out this line?

emilevauge

Thanks @kelchm
LGTM

SantoDE self-requested a review March 24, 2017 21:34

SantoDE requested changes Mar 27, 2017

emilevauge added the contributor/waiting-for-corrections label Apr 6, 2017

SantoDE approved these changes Apr 19, 2017

SantoDE added the kind/enhancement, area/provider/rancher, status/1-needs-design-review, status/2-needs-review and status/3-docs-review labels Apr 19, 2017

kelchm force-pushed the improve-rancher-provider branch 4 times, most recently from bae596b to efe1f62 Compare April 22, 2017 19:19

kelchm force-pushed the improve-rancher-provider branch 3 times, most recently from 7df3398 to f99e06f Compare April 25, 2017 11:59

ldez added this to the 1.3 milestone Apr 25, 2017

ldez approved these changes Apr 26, 2017

ldez added status/3-needs-merge and removed status/1-needs-design-review status/2-needs-review status/3-docs-review labels Apr 26, 2017

ldez suggested changes Apr 28, 2017

ldez added status/3-docs-review and removed status/3-needs-merge labels Apr 28, 2017

emilevauge requested changes Apr 28, 2017

kelchm force-pushed the improve-rancher-provider branch 2 times, most recently from 53df90a to f8e3ca2 Compare April 29, 2017 14:55

ldez suggested changes Apr 29, 2017

Improve Rancher provider functionality:

44db6e9

- Improves default filtering behavior to filter by container health/healthState
- Optionally allows filtering by service health/healthState
- Allows configuration of refresh interval

kelchm force-pushed the improve-rancher-provider branch from ab805a8 to 44db6e9 Compare April 29, 2017 19:38

ldez removed the contributor/waiting-for-corrections label Apr 29, 2017

ldez approved these changes Apr 29, 2017

ldez added status/3-needs-merge and removed status/3-docs-review labels Apr 29, 2017

emilevauge approved these changes May 1, 2017

emilevauge merged commit 78f1b42 into traefik:master May 1, 2017

kelchm mentioned this pull request May 1, 2017

Downtime while performing upgrade rawmind0/rancher-traefik#35

Closed

ldez removed the status/3-needs-merge label May 19, 2017
