
Consul 1.5.3 changes check status behavior when doing a consul reload? #7318

Closed
lvets opened this issue Feb 18, 2020 · 9 comments

Comments

lvets commented Feb 18, 2020

Overview of the Issue

Between Consul 1.5.2 and Consul 1.5.3, the default behavior of node checks during a consul reload changed:
With Consul 1.5.2, checks for a specific node had "passing" as their status and stayed that way through a consul reload.
With Consul 1.5.3 and later, "passing" checks go to "critical" on a consul reload and only return to "passing" once the checks run again.
I would have expected check statuses not to change during a consul reload.

Additionally, because we're using Fabio, this also means that Fabio temporarily removes routes based on these checks during a consul reload, effectively causing an outage.

Reproduction Steps

Steps to reproduce this issue (an illustrative config sketch follows the list):

  1. Use Consul 1.5.2

  2. Run consul reload

  3. Check Fabio logs and/or curl -s localhost:8500/v1/health/node/node

  4. The check status doesn't change

  5. Use Consul 1.5.3

  6. Run consul reload

  7. Check Fabio logs and/or curl -s localhost:8500/v1/health/node/node

  8. Fabio routes are removed and check statuses change to "critical" until the checks run again.
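
For reference, a minimal sketch of the kind of setup used above (illustrative only: the service name, port, config path, and health endpoint are placeholders, and the node name is assumed to equal the hostname):

```sh
# Register one service with one HTTP check, then reload the agent.
cat > /etc/consul.d/web.json <<'EOF'
{
  "service": {
    "name": "web",
    "id": "web-1",
    "port": 8080,
    "check": {
      "id": "web-http",
      "name": "web HTTP health",
      "http": "http://localhost:8080/health",
      "interval": "10s"
    }
  }
}
EOF

consul reload

# On 1.5.2 the status stays "passing"; on 1.5.3+ it flips to "critical"
# until the next check run.
curl -s localhost:8500/v1/health/node/$(hostname) | grep -o '"Status":"[a-z]*"'
```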

Consul info for both Client and Server

Consul server: 1.6.0
Consul agent: 1.5.2 and 1.5.3.

Operating system and Environment details

OS: SLES 12 and Amazon Linux 2.

Log Fragments

I'm not 100% sure which logs to include: the Consul logs are the same between versions, and in Fabio I can see routes being removed and re-added with Consul 1.5.3, but nothing with 1.5.2 (i.e. the routes stay).

@pierresouchay
Contributor

I have the feeling it is linked to #6144...
@lvets Do you have a reproduction test case?
Do your checks have IDs?

@lvets
Author

lvets commented Feb 19, 2020

@pierresouchay, I'm currently testing in our development landscape; I'm not sure how I can easily translate that into a simple test case, but I'll try :)
We only have 2 checks; they each have a name and an ID, both of which are unique.

@pierresouchay
Contributor

@lvets Any news?


lvets commented Feb 25, 2020

@pierresouchay See https://github.com/lvets/legendary-octo-potato for a quick-and-dirty test scenario with Fabio, Consul servers, and a bunch of Consul agents. It took a bit of time to translate our production infrastructure into docker-compose.

@pierresouchay
Contributor

Yes, I confirm the behavior changed from 1.5.2 to 1.5.3+ (it is still present in 1.7.1).

Each time a reload is performed, the state becomes critical and the Output becomes empty.

pierresouchay added a commit to pierresouchay/consul that referenced this issue Feb 25, 2020
…ervices

This fixes issue hashicorp#7318

Between versions 1.5.2 and 1.5.3, a regression was introduced regarding the health
of services. A patch (hashicorp#6144) had been issued for the health checks of nodes, but not
for the health checks of services.

What happened on a reload was:

1. save all health check statuses
2. clean up everything
3. add the new services with their health checks

In step 3, the state of the health checks was only taken into account locally, but since
everything had been cleaned up in step 2, that state was lost.

This PR introduces the snap parameter, so step 3 can use the information saved in step 1.
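
To make the commit's three steps concrete, here is a minimal, self-contained Go sketch of the snapshot-and-restore pattern it describes. Every name here (Agent, CheckStatus, snapshot, addCheck, snap) is an illustrative stand-in, not Consul's actual internals:

```go
package main

import "fmt"

// CheckStatus is the health state of a single check.
type CheckStatus struct {
	Status string // "passing", "warning", or "critical"
	Output string
}

// Agent holds the local check state that a reload rebuilds.
type Agent struct {
	checks map[string]*CheckStatus // keyed by check ID
}

// snapshot saves the current status of every check (step 1).
func (a *Agent) snapshot() map[string]*CheckStatus {
	snap := make(map[string]*CheckStatus, len(a.checks))
	for id, c := range a.checks {
		cp := *c // copy so the snapshot survives the cleanup
		snap[id] = &cp
	}
	return snap
}

// addCheck re-registers a check after the cleanup (step 3). Without the
// snap parameter it would default to "critical" with an empty Output,
// which is exactly the behavior reported in this issue.
func (a *Agent) addCheck(id string, snap map[string]*CheckStatus) {
	status := &CheckStatus{Status: "critical"} // default for a brand-new check
	if prev, ok := snap[id]; ok {
		status = prev // restore the pre-reload state instead
	}
	a.checks[id] = status
}

func main() {
	a := &Agent{checks: map[string]*CheckStatus{
		"web-http": {Status: "passing", Output: "HTTP 200 OK"},
	}}

	snap := a.snapshot()                     // step 1: save all statuses
	a.checks = make(map[string]*CheckStatus) // step 2: clean up everything
	a.addCheck("web-http", snap)             // step 3: re-add, restoring state

	fmt.Println(a.checks["web-http"].Status) // prints "passing", not "critical"
}
```

In these terms, the regression amounts to step 3 ignoring the snapshot for service checks, so every reload reset them to the default critical state until the next check run.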
@pierresouchay
Contributor

@lvets Thank you for the reproduction; here is the fix: #7345


lvets commented Feb 25, 2020

@pierresouchay Thank you for your help with this! Do you have an idea when the fix might make it into a release?


pierresouchay commented Feb 25, 2020

@lvets Whenever HashiCorp reviews it. The change in #7345 is not that complicated; if we are lucky, it might be included in 1.7.2. Maybe @rboyer, who did the HealthCheck patch #6144, can review it?

hanshasselberg pushed a commit that referenced this issue Mar 9, 2020
…7345)
@pierresouchay
Contributor

@lvets Fixed by #7345

freddygv pushed a commit that referenced this issue Mar 12, 2020
…7345)
pierresouchay added a commit to pierresouchay/consul that referenced this issue Apr 3, 2020
This ensures no regression of hashicorp#7318,
and ensures that hashicorp#7446 cannot happen anymore.
hanshasselberg pushed a commit that referenced this issue May 20, 2020
…ul reload (#7449)

This ensures no regression of #7318,
and ensures that #7446 cannot happen anymore.
rboyer pushed a commit that referenced this issue Jun 1, 2020
…ul reload (#7449)