
Improve rancher provider handling of service and container health states #1343

Merged

merged 1 commit on May 1, 2017

Conversation

@kelchm (Contributor) commented on Mar 24, 2017

This pull request aims to fix some basic issues which impact the usability of the rancher provider. During my testing with the rancher provider I found that services which did not have a healthy state were removed from the traefik config. In practice we should not care about the health of a service, but only about the health of that service's individual containers. As a result I've made the following changes, sketched in code after the list:

  1. By default, we no longer filter services by their HealthState. A modified version of this behavior can be enabled with the 'EnableServiceHealthFilter' configuration toggle, which is particularly useful for 'advanced' configurations such as blue/green deployments.
  2. A 'RefreshSeconds' configuration parameter has been added (much like in the ECS provider) for configuring how often the Rancher API is queried.
  3. Both containers and services are filtered by a combination of their HealthState and State, due to the behavior of the Rancher API.
  4. 'healthy' and 'updating-healthy' are considered healthy HealthStates for containers and services.
  5. 'active', 'updating-active' and 'upgraded' are considered healthy States for services.
  6. 'running' and 'updating-running' are considered healthy States for containers.
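A minimal sketch of these rules in Go. The type, the function names, and the treatment of containers without a health check (an empty HealthState passes, mirroring the previous service check) are assumptions for illustration, not the PR's verbatim code:

```go
package rancher

// rancherInstance holds just the two fields the filters look at; the real
// provider works with the go-rancher client types instead.
type rancherInstance struct {
	HealthState string // e.g. "healthy", "updating-healthy"
	State       string // e.g. "running", "active"
}

// containerFilter keeps a container whose HealthState and State both look
// healthy (items 4 and 6 above). An empty HealthState is assumed to mean
// "no health check configured" and is let through.
func containerFilter(c rancherInstance) bool {
	healthOK := c.HealthState == "" || c.HealthState == "healthy" || c.HealthState == "updating-healthy"
	stateOK := c.State == "running" || c.State == "updating-running"
	return healthOK && stateOK
}

// serviceFilter applies the analogous check for services (items 4 and 5);
// it only runs when EnableServiceHealthFilter is enabled.
func serviceFilter(s rancherInstance) bool {
	healthOK := s.HealthState == "" || s.HealthState == "healthy" || s.HealthState == "updating-healthy"
	stateOK := s.State == "active" || s.State == "updating-active" || s.State == "upgraded"
	return healthOK && stateOK
}
```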

If anyone has any feedback or test results, please feel free to share them with me here or in #rancher on the Traefik Slack.

Fixes #1253

@kelchm (Contributor, Author) commented on Mar 24, 2017

@SantoDE, @nuclon I'd be interested to see your feedback on this.

One area I have not tested yet is how these changes affect containers that lack a health check.

@SantoDE self-requested a review on March 24, 2017 at 21:34
@errm (Contributor) commented on Mar 24, 2017

LGTM, but is there a way to cover this case with a unit test?

@kelchm (Contributor, Author) commented on Mar 24, 2017

@errm, which case specifically? For containers without a health check I need to actually test the behavior of the Rancher API.

@timoreimann (Contributor) commented:

@kelchm with your PR, parseRancherData() now seems to skip containers from the result set that are missing certain health states. Ideally we could verify this through the calling Provide() method; however, since that would likely mean jumping through a number of hoops, I think it's also okay to test parseRancherData() directly.

@errm (Contributor) commented on Mar 25, 2017

Yeah I would be fine with testing parseRancherData() directly...
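As a starting point, a table-driven test against the hypothetical containerFilter sketched earlier; a real test would feed fixture data through parseRancherData() itself:

```go
package rancher

import "testing"

// Exercises the hypothetical containerFilter from the earlier sketch.
func TestContainerFilter(t *testing.T) {
	cases := []struct {
		desc     string
		instance rancherInstance
		want     bool
	}{
		{"healthy and running", rancherInstance{HealthState: "healthy", State: "running"}, true},
		{"healthy but stopped", rancherInstance{HealthState: "healthy", State: "stopped"}, false},
		{"unhealthy but running", rancherInstance{HealthState: "unhealthy", State: "running"}, false},
		{"mid-upgrade", rancherInstance{HealthState: "updating-healthy", State: "updating-running"}, true},
	}
	for _, c := range cases {
		if got := containerFilter(c.instance); got != c.want {
			t.Errorf("%s: containerFilter() = %v, want %v", c.desc, got, c.want)
		}
	}
}
```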

@SantoDE (Collaborator) commented on Mar 27, 2017

Hey @kelchm,

as mentioned already, I like the direction this PR is going :) I'm also fine with migrating the checks from the service to the container level.

However, as mentioned twice already, we need some unit tests to cover it ;)

@SantoDE (Collaborator) left a review comment:

Add unit test :)

@kelchm (Contributor, Author) commented on Mar 30, 2017

I'm going to need to rework some of my original plan here -- it turns out that the Rancher API returns a healthState of "healthy" in situations where we wouldn't expect it. For example, if a stack is stopped, any containers within that stack's services still return a healthState of "healthy". I've opened an issue to confirm whether this is the expected behavior of the Rancher API: rancher/rancher/issues/8354

My thinking is that we will likely need to use a combination of factors to determine, first, whether a service should be included in the config and, second, whether a given container should be included for that service.

@jwentworth commented:
I think this is close to fixing what I see as the root cause of the issue. I think we need to add a check on the container's State (not just its HealthState): I would only include containers that have a HealthState of "healthy" and a State of "running". I wouldn't bring containers that are being upgraded or are initializing into the LB until they reach a healthy state, and as you mentioned before, a healthy HealthState alone isn't enough to determine whether a container should be included in the load balancer.

From my investigation of the API and my use cases, I think those are all the criteria that are needed when using the API (the metadata service handles things a little differently, but this provider isn't using that).

@kelchm (Contributor, Author) commented on Apr 12, 2017

@SantoDE, I think this is ready for your review. The only things remaining are some minor updates to the documentation. I've tried to improve the test coverage; hopefully this is okay for now.

@jwentworth commented:
This looks good to me. I haven't been able to test it personally yet, but it looks like it would resolve the issues I'm seeing!

@martinbaillie (Contributor) commented:
I've had a quick test this afternoon and it looks good to me. A much-needed enhancement, thanks @kelchm.

Now it could be my use-case bias, but this seems like it should be the default functionality: I would say EnableServiceHealthFilter should default to true. As it stands, I'm seeing my Traefik instance include stopped containers as active backends (as long as they have the label). Thoughts?

@kelchm (Contributor, Author) commented on Apr 18, 2017

@martinbaillie, containers which have a HealthState other than 'healthy' or 'updating-healthy', and/or a State other than 'running' or 'updating-running', should be filtered out by containerFilter. I haven't seen the issue you are describing, but it's definitely possible I missed something.

Can you grab the API response for the container(s) you are seeing this behavior on? I'd be curious what the State and HealthState are.
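For anyone wanting to grab that response, a rough Go sketch of querying the Rancher API for a container's State and HealthState; the endpoint path, the container name, and the environment variables are assumptions about a typical Rancher 1.x setup, not part of this PR:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// containerList models just the fields we care about in the API response.
type containerList struct {
	Data []struct {
		Name        string `json:"name"`
		State       string `json:"state"`
		HealthState string `json:"healthState"`
	} `json:"data"`
}

func main() {
	// RANCHER_URL, RANCHER_ACCESS_KEY and RANCHER_SECRET_KEY are assumed to
	// point at the Rancher API server, e.g. http://rancher.example.com/v1.
	req, err := http.NewRequest("GET", os.Getenv("RANCHER_URL")+"/containers?name=mycontainer", nil)
	if err != nil {
		panic(err)
	}
	req.SetBasicAuth(os.Getenv("RANCHER_ACCESS_KEY"), os.Getenv("RANCHER_SECRET_KEY"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var list containerList
	if err := json.NewDecoder(resp.Body).Decode(&list); err != nil {
		panic(err)
	}
	for _, c := range list.Data {
		fmt.Printf("%s: State=%s HealthState=%s\n", c.Name, c.State, c.HealthState)
	}
}
```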

@martinbaillie (Contributor) commented:
@kelchm sorry, I may have confused things. I meant that it works perfectly with your PR enabled. I was questioning whether it should default to true rather than false, because without it enabled I saw stopped containers appear in the backend list, which is undesirable default behaviour.

@kelchm (Contributor, Author) commented on Apr 18, 2017

@martinbaillie, there are two types of filters: containerFilter, which filters at the container level, and serviceFilter, which filters at the service level.

I have serviceFilter disabled by default because it implies a more 'advanced' use case, such as replacing an existing stack with a new stack rather than upgrading the services within an existing stack. containerFilter, which is always on, should prevent stopped (or otherwise unhealthy) containers from being brought in as backends even when EnableServiceHealthFilter is disabled. The sketch below shows how the two compose.
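Reusing the hypothetical helpers from the earlier sketch; the loop shape, types, and method name are assumptions for illustration, only the gating by EnableServiceHealthFilter comes from the discussion:

```go
package rancher

// provider holds the single toggle relevant here; the real Provider struct
// has many more fields.
type provider struct {
	EnableServiceHealthFilter bool
}

// rancherService pairs a service's own states with its containers' states.
type rancherService struct {
	rancherInstance
	Containers []rancherInstance
}

// healthyBackends shows how the two filters compose: the service-level
// filter is opt-in, the container-level filter always runs.
func (p provider) healthyBackends(services []rancherService) []rancherInstance {
	var backends []rancherInstance
	for _, s := range services {
		// Opt-in service filter, aimed at 'advanced' setups such as
		// blue/green stack replacement.
		if p.EnableServiceHealthFilter && !serviceFilter(s.rancherInstance) {
			continue
		}
		for _, c := range s.Containers {
			// Always-on container filter keeps stopped or otherwise
			// unhealthy containers out of the backends.
			if !containerFilter(c) {
				continue
			}
			backends = append(backends, c)
		}
	}
	return backends
}
```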

@martinbaillie (Contributor) commented:

Thanks @kelchm, that explains it. I was struggling to come up with a use case for when this would be desirable behaviour.

@SantoDE (Collaborator) left a review comment:

Hey @kelchm,

Thanks a lot for your work! I really like it :)

Besides the minor typo, it's an LGTM from me. Could you please rebase and squash your commits?

Thanks!

```go
if service.Health != "" && service.Health != "healthy" {
	log.Debugf("Filtering unhealthy or starting service %s", service.Name)
	return false
}
// Only filter services by Health (HealthState) and State is EnableServiceHealthFilter is true
```
@SantoDE (Collaborator) commented on the code above:

typo I guess?

if EnableServiceHealthFilter is true... ?

@kelchm (Contributor, Author) replied:

Good eye @SantoDE, fixed!

@SantoDE (Collaborator) commented on Apr 24, 2017

Hey @kelchm,

a) tests are failing :'(
b) you have to rebase again ;-)

Thanks for your work! :)

@ldez (Contributor) commented on Apr 24, 2017

@kelchm Could you squash your commits? Thanks.

@kelchm force-pushed the improve-rancher-provider branch 3 times, most recently from 7df3398 to f99e06f, on April 25, 2017
@ldez added this to the 1.3 milestone on Apr 25, 2017
@ldez (Contributor) left a review comment:

Could you update the traefik.sample.toml? Thanks.

@emilevauge (Member) left a review comment:

Thanks @kelchm, great! Could you also complete traefik.sample.toml?

```go
if key == "io.rancher.stack_service.name" && value == rancherData.Name {
	rancherData.Containers = append(rancherData.Containers, container.PrimaryIpAddress)
}
if containerFilter(container) && container.Labels["io.rancher.stack_service.name"] == rancherData.Name {
```
@emilevauge (Member) commented on the code above:

It would be even better to write `if container.Labels["io.rancher.stack_service.name"] == rancherData.Name && containerFilter(container) {`, but meh :)

@kelchm force-pushed the improve-rancher-provider branch 2 times, most recently from 53df90a to f8e3ca2, on April 29, 2017
@kelchm (Contributor, Author) commented on Apr 29, 2017

@emilevauge, @ldez done 👍

```toml
#
# Optional
#
RefreshSeconds = 15
```
A reviewer (Contributor) commented on the line above:

Could you comment this line.

@kelchm (Contributor, Author) replied:

Fixed.

```toml
# Optional
# Default: false
#
EnableServiceHealthFilter = false
```
A reviewer (Contributor) commented on the line above:

Could you comment this line.

@kelchm (Contributor, Author) replied:

Fixed.
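Put together, the corresponding traefik.sample.toml fragment might read as follows; the [rancher] table name and the comment wording are assumptions, only the two keys and their sample values come from the snippets above:

```toml
[rancher]

# Polling interval (in seconds) for the Rancher API
#
# Optional
#
RefreshSeconds = 15

# Also filter services by their Health (HealthState) and State, in addition
# to the always-on container filtering
#
# Optional
# Default: false
#
EnableServiceHealthFilter = false
```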

- Improves default filtering behavior to filter by container health/healthState
- Optionally allows filtering by service health/healthState
- Allows configuration of refresh interval
@emilevauge (Member) left a review comment:

Thanks @kelchm
LGTM


Successfully merging this pull request may close these issues:

Rancher provider does not correctly handle unhealthy containers

8 participants